Penalized Regression Methods for Association Studies
Posted on May 15, 2015
Computer Practical Exercise on Penalized Regression Methods for Association Studies
Overview
Purpose
In this exercise you will be carrying out case/control association analysis on a gene region assumed to have previously shown association with disease. The purpose is to determine which of the associated loci in the region are likely to be the variants causing disease (or be most associated with those variants causing the disease) using penalized regression methods to perform model selection.
Methodology
We will use penalized regression analysis as implemented in the R packages glmnet and grpreg, and the command-line software HyperLasso. The glmnet package can perform lasso penalized regression, ridge regression, and the elastic net. The lasso (L1) penalty encourages sparsity, while the ridge (L2) penalty encourages highly correlated variables to have similar coefficients. The elastic net is a linear combination of these two penalties.
The grpreg package allows the user to group variables that are to be encouraged in and out of the model together. However, we will not use this functionality in this tutorial. Instead, we will use the minimax concave penalty (MCP) provided by the package, which applies similar penalization to the lasso near zero but has flat tails, giving constant penalization beyond a user-defined threshold. The HyperLasso uses a Bayesian-inspired normal-exponential-gamma (NEG) penalty, which imposes heavy shrinkage on small coefficients and relatively less shrinkage on large coefficients.
Program documentation
#### glmnet
The R package glmnet has R documentation including a pdf manual, and can be downloaded at: http://cran.r-project.org/web/packages/glmnet/index.html
#### grpreg
The R package grpreg has R documentation including a pdf manual, and can be downloaded at: http://cran.r-project.org/web/packages/grpreg/index.html
#### R documentation
The R website is at http://www.r-project.org/
From within R, one can obtain help on any command xxxx by typing ‘help(xxxx)’
#### HyperLasso
Full documentation can be found at: http://www.ebi.ac.uk/projects/BARGEN/
#### PUMA
The PUMA software can be found at: http://mezeylab.cb.bscb.cornell.edu/Software.aspx
Data overview
We will be analyzing simulated data consisting of 228 SNP markers from the CTLA4 gene region. We have 1000 cases and 1000 controls. Since this is simulated data, we know where the true causal loci are located.
Appropriate data
Appropriate data for this exercise is genotype data from a region of interest, such as fine mapping, re-sequencing, or imputation data, typed in a number of unrelated individuals. These analyses can be done for either a dichotomous or a quantitative trait. Here we will consider case/control data.
Instructions
Data files
You should ensure you have the following files saved to an appropriate directory (folder) on your machine: Genotypes.txt, Phenotypes.txt, SingleMarkerPvalues.txt, SNP_Names.txt, pumadata.tfam and pumadata.tped.
Within Linux, open a terminal window, and move into the directory where the data files are e.g. by typing
cd xxxxx
(where xxxxx is replaced by the name of the appropriate folder).
Data format
We will need to reformat the data into the format each particular software requires. Currently, the genotype file contains one row per individual and one column per marker. The genotypes are coded as 0/1/2 to refer to the number of reference alleles. The phenotype file contains one row per individual, with controls coded as 0 and cases coded as 1.
Take a look at the data files Genotypes.txt and Phenotypes.txt (e.g. using the more command), and check that you understand how the data is coded.
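more Genotypes.txt
more Phenotypes.txt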
### Step-by-step instructions
#### 1. Analysis with glmnet
To perform the analysis, we will need to open an R terminal. Open up a new terminal window, move to the directory where your files are located, and start R (by typing R). To exit R at any time, type **quit()**.
Now (within R) read in the genotype data by typing:
geno<-read.table("Genotypes.txt")
This command reads the genotypes into a dataframe named “geno”. To see the size of the data frame, type:
dim(geno)
The data frame has 2000 rows, one for each individual, and 228 columns, one for each marker. To see the top five lines and first ten columns of this dataframe, type:
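geno[1:5,1:10]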
Next, we can read in the phenotype file into the dataframe “pheno”:
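pheno<-read.table("Phenotypes.txt")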
To look at the first few lines of the file, type:
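pheno[1:5,]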
This file should have 2000 rows. Before we use penalized regression methods to analyze our data, let's look at why these methods may be beneficial. We have previously run single marker analysis (the Armitage Trend test) on our data and saved the p-values in the file “SingleMarkerPvalues.txt”. We can open the file in R and plot the -log10 p-values. Since the data were simulated, we know at which marker the causal loci are located.
To read the single marker results into R, type:
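pvals<-read.table("SingleMarkerPvalues.txt")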
To plot these, type:
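Assuming the p-values sit in the first (and only) column of the file:

plot(-log10(pvals[,1]),xlab="Marker",ylab="-log10 p-value")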
This command will work provided you have an X-windows connection to the Linux server (or if you are running R on your own laptop). Otherwise, to see this plot you will have to save it as a file and transfer it back to your own laptop to view. To save the plot as a pdf file in R (rather than plotting it on the screen), type:
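pdf("SingleMarker.pdf")
plot(-log10(pvals[,1]),xlab="Marker",ylab="-log10 p-value")
dev.off()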
You will have to use a similar pair of commands (pdf("file.pdf") and dev.off()) before and after each plot command if you need to save the plot as a file in order to transfer it over to your laptop. To make things easier, we have included these commands before and after each plot command below. If you have an X-windows connection to the Linux server, you can simply omit them.
Is the number and location of the causal loci obvious? We can store the known location of each causal locus in the vector “locus” and draw vertical lines at each causal locus:
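For example (the positions in "locus" below are placeholders; use the true causal-locus positions supplied with the simulated data):

locus<-c(30,72,114,158,201)   # placeholder positions for the 5 causal loci
pdf("SingleMarkerCausal.pdf")
plot(-log10(pvals[,1]),xlab="Marker",ylab="-log10 p-value")
abline(v=locus,col="red")
dev.off()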
The entire region is highly significant, and it is difficult not only to distinguish which loci are the causal ones but even to see that there are 5 distinct causal loci. Hopefully, our penalized regression methods will select a group of markers that can explain most of the signal.
To run glmnet, we first need to open the R library:
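library(glmnet)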
First, we need to convert our genotype data from a dataframe into a matrix by typing:
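genoM<-as.matrix(geno)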
We can run the lasso, elastic net, and ridge regression in glmnet. Recall that the elastic net penalty is: lambda * [ (1-alpha)/2 * ||beta||_2^2 + alpha * ||beta||_1 ]. To run the lasso, we set alpha=1. To analyze a dichotomous outcome such as case/control status we use family="binomial". We will attempt to run the lasso at 100 different values of lambda (the penalty strength) by setting nlambda=100. The glmnet default is to standardize the genotypes. We can store our results in the variable "fit_lasso":
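fit_lasso<-glmnet(genoM,pheno[,1],family="binomial",alpha=1,nlambda=100)
names(fit_lasso)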
This lists all the different pieces of information that are stored in the dataframe “fit_lasso”. We are interested in the coefficients stored in “fit_lasso$beta”. To see these, and the values of lambda to which they correspond, type:
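fit_lasso$beta
fit_lasso$lambda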
Although we asked the program to try 100 different values of lambda, in fact only 99 values were evaluated. This is because glmnet sometimes terminates before nlambda values of lambda have been used, because of numerical instabilities near a saturated fit.
The information in “fit_lasso$beta” is hard to see as it corresponds to 228 coefficients (one for each marker) at each of 99 values of lambda. You can check this by typing:
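dim(fit_lasso$beta)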
To plot the coefficients for several (six) different values of lambda indexed by "select_l", adding lines at each causal variant, we type:
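For example (the six indices in select_l are illustrative; choose values spanning the lambda path):

select_l<-c(10,25,40,55,70,85)   # illustrative indices into fit_lasso$lambda
pdf("LassoCoefficients.pdf")
par(mfrow=c(3,2))
for(i in select_l){
plot(as.numeric(fit_lasso$beta[,i]),xlab="Marker",ylab="Coefficient",main=paste("lambda =",signif(fit_lasso$lambda[i],3)))
abline(v=locus,col="red")
}
dev.off()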
In the lasso, the coefficients of many variables are driven to zero. As you can see, as lambda becomes smaller (our penalty becomes weaker), not only do more variables enter the model, but the coefficients also become larger (further from zero). It looks as if, for these data, a value of lambda of around 0.02-0.05 does quite well at picking out a single marker to represent each true causal variant.
Next, we can run ridge regression by setting alpha=0 and plot our results:
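fit_ridge<-glmnet(genoM,pheno[,1],family="binomial",alpha=0,nlambda=100)

Then repeat the plotting commands above with fit_ridge in place of fit_lasso.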
Although ridge regression does not perform model selection, for large values of lambda, some of the coefficients have so much shrinkage that we cannot distinguish them from zero. (Note that the y axes in these plots have very different ranges). Notice that as lambda is relaxed, most variables have non-zero coefficients, although they still may be small! Additionally, ridge regression assigns similar coefficients to highly correlated variables rather than selecting one of the group as the lasso does.
We can combine sparsity of lasso regression and the grouping effect of ridge regression by using the elastic net. If we set the mixing parameter alpha=0.4 and plot:
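fit_enet<-glmnet(genoM,pheno[,1],family="binomial",alpha=0.4,nlambda=100)

Again, repeat the plotting commands above with fit_enet in place of fit_lasso.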
We can see that the elastic net allows for clustering, but the clusters are not as dense as the ridge. The second and third plots (lambda=0.1258 and 0.0496) give us the clearest idea of the number of independent signals and their approximate locations. Additionally, one can produce a coefficient profile plot of the coefficient paths for a fitted glmnet object using plot, e.g.
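pdf("CoefficientPath.pdf")
plot(fit_lasso)
dev.off()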
However, these plots are very messy when you have a large number of variables since the path is plotted for each variable.
#### 2. Analysis with grpreg
The package grpreg fits paths for the group lasso, group bridge, or group MCP at a grid of values of the penalty parameter lambda for linear or logistic regression models. Recall that MCP is similar to the lasso except that its flat tails apply less shrinkage to larger coefficients. First, we need to load the grpreg package into R:
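library(grpreg)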
We can run grpreg on the same genotype matrix and phenotype vector we used for glmnet:
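A sketch (penalty names vary between grpreg versions; in recent versions the group MCP penalty is "grMCP", which reduces to the ordinary MCP when each variable is its own group, as here):

fit_mcp<-grpreg(genoM,pheno[,1],group=1:ncol(genoM),penalty="grMCP",family="binomial")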
Note: The grpreg package has an option to select the lambda satisfying the BIC, AIC or GCV criterion, using e.g.
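select(fit_mcp,criterion="BIC")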
Using BIC, the “best” value of lambda seems to be 0.1862722. This fits with the fact that, visually, it seems as if the second and third plots (lambda=0.219 and lambda=0.174) do quite well at picking out a single marker to represent each true causal variant.
#### 3. Analysis with HyperLasso
The HyperLasso software requires a genotype file with one row per individual and one column per SNP, along with a header row of SNP names. While we are still in R, we can create a genotype file named "HlassoGenotypes.txt" in the appropriate format from our current genotype file stored in the variable "geno". We have made a file of SNP names called "SNP_Names.txt" to add as a header line to the "geno" variable:
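Assuming SNP_Names.txt contains the 228 SNP names in marker order, something like:

names(geno)<-scan("SNP_Names.txt",what="character")
write.table(geno,"HlassoGenotypes.txt",quote=FALSE,row.names=FALSE)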
If we now look at the first few lines and columns of our "geno" dataframe (e.g. by typing geno[1:10,1:10]), we can see that the column names are now our SNP names.
Make sure to leave your R terminal open so that we can use our previous results to compare the methods. HyperLasso is command-line software, so we will need to open a new terminal (i.e. open a new connection to the Linux server) and change into the appropriate directory.
Previously, we analyzed the data over a range of lambda values, but how do we know which lambda is appropriate for model selection? One way is to choose the value of lambda that gives an acceptable false positive rate for your data set. This can be done by permuting case/control status (thereby generating data under the null hypothesis) and determining the number of false positives at each value of lambda. We can repeat this process many times and use the mean, median, or maximum lambda to achieve the false positive rate we desire. We have already done this for you and selected lambda=90.0. In Linux, type:
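The flag names below are assumptions (check the HyperLasso documentation at the BARGEN site for the precise option names); the key points are that we pass the genotype and phenotype files, set lambda=90.0 and the shape parameter to 1.0, and name an output file:

./HLasso -genotypes HlassoGenotypes.txt -target Phenotypes.txt -lambda 90.0 -shape 1.0 -o HlassoOutput.txt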
The HyperLasso has two penalty parameters, the shape and the scale parameter. However, setting the scale parameter is equivalent to setting a value for lambda, as the two are related. We have set the shape parameter to 1.0 as recommended by Vignal et al. (2011). They use the scale parameter to control the false positive rate, whereas we have chosen to do the same using lambda.
The output file has a special format, so we need to use the R command dget to read in the file. Then we will need to do a little fiddling to extract the information we require.
First look at the file in Linux:
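more HlassoOutput.txt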
Back in R, read the file using dget and look at it:
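out<-dget("HlassoOutput.txt")
out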
The first column records the number of cycles required to find the mode, the log-posterior and the log-likelihood of the mode (the final entry of this column is always 0). Subsequent columns describe the covariates selected in the model:
The last column contains the information for the intercept. We want to exclude the first and the last column by creating a vector “exclude” that contains the number 1 and the number of the last column. To extract the information we need, type:
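exclude<-c(1,ncol(out))
hl<-out[,-exclude]
hl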
The HyperLasso appears to have similar sparsity to the lasso and the MCP. To see any subtle differences between the methods, we can plot them together, trying to use the "best" value of lambda for each method:
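A sketch, assuming the fits from the earlier sections are still in your workspace; the column indices used to pick the "best" lambda for each fit are illustrative, so substitute the values you judged best above:

best_lasso<-40   # illustrative lambda indices; replace with your chosen values
best_ridge<-60
best_enet<-50
best_mcp<-which.min(abs(fit_mcp$lambda-0.1862722))   # the BIC-selected lambda
pdf("Comparison.pdf")
par(mfrow=c(3,2))
plot(-log10(pvals[,1]),main="ATT",xlab="Marker",ylab="-log10 p-value")
abline(v=locus,col="red")
plot(as.numeric(fit_ridge$beta[,best_ridge]),main="Ridge",xlab="Marker",ylab="Coefficient")
abline(v=locus,col="red")
plot(as.numeric(fit_lasso$beta[,best_lasso]),main="Lasso",xlab="Marker",ylab="Coefficient")
abline(v=locus,col="red")
plot(as.numeric(fit_enet$beta[,best_enet]),main="Elastic net",xlab="Marker",ylab="Coefficient")
abline(v=locus,col="red")
plot(fit_mcp$beta[-1,best_mcp],main="MCP",xlab="Marker",ylab="Coefficient")
abline(v=locus,col="red")
dev.off()

The HyperLasso coefficients extracted into "hl" can be added as a sixth panel once you have matched the selected SNP names back to their marker positions.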
We can see quite strong similarity between the ATT and the ridge, and between the lasso, MCP, and HyperLasso. The elastic net falls in the middle of these two groups.
#### 4. Analysis with PUMA
WARNING: This last part of the exercise is very slow to run! I’ve included instructions in case you want to try it out at home, but I don’t recommend running it during the course. The most computationally intensive part is the calculation using the NEG penalty (which involves a 2-dimensional search over penalty parameter values). If you want to implement a faster analysis, just use the lasso penalty (i.e. omit the word “NEG” at the end of the puma command line below).
The PUMA software requires input files in PLINK’s transposed (.tfam and .tped) format. See
http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#tr
for details. We have prepared these files (pumadata.tfam and pumadata.tped) for you, as this file format is quite different from the format we have been using for the other programs.
To run (penalized) logistic regression in PUMA, using the lasso and NEG penalties, type the following at the Linux command line:
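The invocation below is a sketch: the flag names are assumptions, and "pumaresults" is chosen to match the output file names given below; check the PUMA documentation for the precise syntax.

./puma --tped pumadata.tped --tfam pumadata.tfam --regression LOGISTIC --name pumaresults --penalty LASSO NEG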
Once this has finished running, the lasso results should be in the file results_pumaresults_LASSO.R and the NEG results should be in the file results_pumaresults_NEG.R
To read in and visualise the lasso results (for example), use the following R script:
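A minimal sketch, assuming (as for the HyperLasso output) that the results file can be read with dget; the structure of the returned object may differ, so inspect it with str() first:

puma_lasso<-dget("results_pumaresults_LASSO.R")
str(puma_lasso)

You can then plot the fitted coefficients against marker position and add vertical lines at the causal loci, as in the earlier plots.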
You can use a similar sequence of commands to read in and visualise the results from PUMA using the NEG penalty.
Other Methods
A variety of other methods and software packages exist for performing penalized regression with various penalty functions. Model selection can be carried out via cross-validation, permutation, goodness of fit, or parsimony criteria. Other related approaches include stepwise regression, forward/backward selection, and subset selection.
## References
- Ayers KL, Cordell HJ (2010) SNP Selection in Genome-Wide and Candidate Gene Studies via Penalized Logistic Regression. Genetic Epidemiology 34:879-891.
- Breheny P, Huang J (2009) Penalized methods for bi-level variable selection. Statistics and Its Interface 2:369-380.
- Friedman J, Hastie T, Tibshirani R (2010) Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software 33(1):1-22.
- Hoffman GE, Logsdon BA, Mezey JG (2013) PUMA: A unified framework for penalized multiple regression analysis of GWAS data. PLoS Computational Biology 9(6):e1003101.
- Hoggart CJ, Whittaker JC, De Iorio M, Balding DJ (2008) Simultaneous analysis of all SNPs in genome-wide and re-sequencing studies. PLoS Genetics 4(7):e1000130.
- Vignal CM, Bansal AT, Balding DJ (2011) Using Penalised Logistic Regression to Fine Map HLA Variants for Rheumatoid Arthritis. Annals of Human Genetics 75:655-664.