GWAS prephasing and imputation
Posted on February 20, 2017
Below is a general workflow for GWAS prephasing and imputation using the 1000 Genomes Project (1000GP) Phase 3 reference panel. This guide walks through each step of the imputation process in detail.
A major use of phasing is haplotype estimation of GWAS samples in order to speed up imputation from a large reference panel of haplotypes such as 1000 Genomes. The current recommendation is that GWAS samples are first ‘pre-phased’ using the most accurate method available. The subsequent imputation step, which imputes alleles from one set of haplotypes into another, is then fast. As new haplotype reference sets become available, imputation can be re-run much more efficiently because the phasing does not need to be repeated. The approach we recommend is:
Step 1: Alignment of the SNPs
SNP positions in build 37
The most recent 1000 Genomes haplotypes are defined at SNPs that use build 37 coordinates. You must therefore make sure that your GWAS SNPs use the same build. If they do not, you can use the UCSC liftOver tool to convert them to build 37 coordinates.
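liftOver works on UCSC BED files rather than PLINK files, so the SNP positions have to be exported first. Here is a minimal sketch of that export, assuming a standard six-column PLINK .bim file with numeric autosome codes. The function name bim_to_bed and all file names are hypothetical, and the chain file shown in the comments assumes the data are currently in build 36 (hg18).

```shell
# Convert a PLINK .bim file (chr, SNP id, cM, bp, allele1, allele2) into a
# UCSC BED file (chrN, 0-based start, end, SNP id) that liftOver can read.
# Assumption: chromosomes are coded 1..22; X/Y/MT codes need extra handling.
bim_to_bed() {
  awk 'BEGIN { OFS = "\t" } { print "chr" $1, $4 - 1, $4, $2 }' "$1"
}

# Hypothetical usage (file names are placeholders):
#   bim_to_bed gwas.bim > gwas.b36.bed
#   liftOver gwas.b36.bed hg18ToHg19.over.chain.gz gwas.b37.bed gwas.unmapped.bed
```

SNPs that end up in the unmapped file have no build 37 position and should be removed from the GWAS dataset before going further.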
Aligning the strand of the GWAS SNPs to the reference panel of haplotypes is a crucial step of prephasing/imputation. Correcting the strand of A/T and G/C SNPs is the main concern: these SNPs carry the same pair of alleles on both strands, so a strand flip cannot be detected from the alleles alone. Genotype Harmonizer is an easy-to-use tool that helps you accomplish this job.
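To get a feel for how many SNPs are affected, the strand-ambiguous ones can be listed directly from the .bim file. This is only an illustrative check, not part of Genotype Harmonizer. The function name ambiguous_snps is made up for this example, and it assumes the alleles are in columns 5 and 6 of the .bim file.

```shell
# Print the IDs of strand-ambiguous SNPs (A/T or C/G allele pairs)
# found in a PLINK .bim file.
ambiguous_snps() {
  awk '{ pair = toupper($5 $6) }
       pair == "AT" || pair == "TA" || pair == "CG" || pair == "GC" { print $2 }' "$1"
}

# Hypothetical usage:
#   ambiguous_snps gwas.bim | wc -l
```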
Assume we have the study GWAS files and the 1000 Genomes reference files in PLINK format.
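A Genotype Harmonizer call for this setup could look roughly like the sketch below. The input prefixes gwas and 1000gp_ref and the location of the jar file are assumptions, so replace them with your own paths. The function only prints the command so that it can be checked before it is run.

```shell
# Print a Genotype Harmonizer command that aligns a study dataset to a
# reference dataset, both given as PLINK binary prefixes.
# Assumption: GenotypeHarmonizer.jar sits in the current directory.
harmonize_cmd() {
  local study=$1 ref=$2 out=$3
  echo "java -jar GenotypeHarmonizer.jar" \
       "--input ${study} --inputType PLINK_BED" \
       "--ref ${ref} --refType PLINK_BED" \
       "--output ${out} --outputType PLINK_BED"
}

# Hypothetical prefixes; the output prefix matches the alignment folder.
harmonize_cmd gwas 1000gp_ref alignment/all_chrs
```

Pipe the printed line to bash once the paths look right.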
Genotype Harmonizer will generate the harmonized files all_chrs.bed, all_chrs.bim, and all_chrs.fam in the alignment folder. It uses linkage disequilibrium (LD) patterns to determine the correct strand of G/C and A/T SNPs.
Step 2: Phasing the GWAS samples
Once your GWAS dataset is correctly aligned to the reference panel, we strongly recommend phasing each chromosome in a single run instead of splitting it into chunks. This makes the procedure much easier and increases downstream imputation quality.
You can write a small BASH script to run the phasing.
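A sketch of such a script is shown below. It assumes the harmonized data have already been split into one PLINK file set per chromosome named gwas.chrN.bed/.bim/.fam, and that the 1000GP Phase 3 genetic maps live in a directory called 1000GP_Phase3. Both the naming scheme and the thread count are assumptions to adapt. The script prints one SHAPEIT command per autosome instead of running it.

```shell
# Assumed locations and settings; adjust to your own setup.
REF_DIR=1000GP_Phase3   # directory holding the genetic map files
THREADS=8               # number of threads given to each SHAPEIT run

# Print the SHAPEIT command that phases one whole chromosome.
shapeit_cmd() {
  local chr=$1
  local prefix="gwas.chr${chr}"
  local map="${REF_DIR}/genetic_map_chr${chr}_combined_b37.txt"
  echo "shapeit --input-bed ${prefix}.bed ${prefix}.bim ${prefix}.fam" \
       "--input-map ${map}" \
       "--output-max ${prefix}.phased.haps ${prefix}.phased.sample" \
       "--thread ${THREADS}"
}

# One run per autosome, each covering the full chromosome (no chunking).
for chr in $(seq 1 22); do
  shapeit_cmd "$chr"
done
```

Redirect the output to a file to review it, then run it with bash or hand each line to your job scheduler.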
Notes on clusters
Suppose that you want to prephase your GWAS on a cluster where each node has X CPU cores. In this case, the approach we recommend is:
- To reserve a complete cluster node for each SHAPEIT job
- To run each SHAPEIT job with X threads to fully load the CPU-cores of a node
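When a job has a whole node to itself, the thread count can be read from the node rather than hard-coded. The snippet below is a small sketch of that idea. The helper names node_cores and thread_flag are made up for this example, and how a full node is reserved depends on your scheduler, so that part is left out.

```shell
# Number of CPU cores currently online on the node running this job.
node_cores() {
  getconf _NPROCESSORS_ONLN
}

# Build the SHAPEIT thread option so that it matches the core count.
thread_flag() {
  printf '%s\n' "--thread $(node_cores)"
}

# Print the option; append it to the SHAPEIT command line of the job.
thread_flag
```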
Step 3: Imputation of the GWAS samples
Once SHAPEIT has produced haplotype estimates, you can use IMPUTE2 to impute untyped genotypes using the latest release of the 1000 Genomes haplotypes.
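An IMPUTE2 call for one chunk of one chromosome could look like the following sketch. The file names follow the layout of the 1000GP Phase 3 reference download as I remember it and the gwas.chrN.phased.haps naming is hypothetical, so treat every path as an assumption. The value -Ne 20000 is the effective population size commonly suggested for IMPUTE2. The function prints the command rather than running it.

```shell
# Assumed location of the 1000GP Phase 3 reference files.
REF_DIR=1000GP_Phase3

# Print the IMPUTE2 command that imputes one interval of one chromosome
# from prephased GWAS haplotypes.
impute2_cmd() {
  local chr=$1 start=$2 end=$3
  echo "impute2 -use_prephased_g" \
       "-known_haps_g gwas.chr${chr}.phased.haps" \
       "-m ${REF_DIR}/genetic_map_chr${chr}_combined_b37.txt" \
       "-h ${REF_DIR}/1000GP_Phase3_chr${chr}.hap.gz" \
       "-l ${REF_DIR}/1000GP_Phase3_chr${chr}.legend.gz" \
       "-int ${start} ${end}" \
       "-Ne 20000" \
       "-o gwas.chr${chr}.${start}_${end}.impute2"
}

# Example chunk on chromosome 1.
impute2_cmd 1 5000001 10000001
```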
Several comments on the IMPUTE2 command line:
- Prephased GWAS haplotypes are specified using -known_haps_g
- The flag -use_prephased_g is used to set IMPUTE2 in the prephasing mode
- The option -int 5000001 10000001 is used to specify the region to be imputed.
Once all the chunks of a chromosome have been imputed, combine them into a single file.
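Chunking and combining can be sketched with two small helpers. This assumes non-overlapping 5 Mb chunks, a chromosome length that you supply yourself, and per-chunk output files named gwas.chrN.START_END.impute2. All of these are choices made for this example, not requirements of IMPUTE2.

```shell
# Print "start end" pairs for consecutive chunks covering 1..chr_len.
chunk_intervals() {
  local chr_len=$1 size=${2:-5000000}
  local start=1 end
  while [ "$start" -le "$chr_len" ]; do
    end=$((start + size - 1))
    if [ "$end" -gt "$chr_len" ]; then end=$chr_len; fi
    echo "$start $end"
    start=$((end + 1))
  done
}

# Concatenate the per-chunk IMPUTE2 outputs of one chromosome, in
# positional order, into a single file.
combine_chunks() {
  local chr=$1 chr_len=$2 out=$3
  : > "$out"
  chunk_intervals "$chr_len" | while read -r start end; do
    f="gwas.chr${chr}.${start}_${end}.impute2"
    # A chunk may have no output file (for example when the interval
    # contains no SNPs), so missing files are skipped.
    if [ -f "$f" ]; then cat "$f" >> "$out"; fi
  done
}

# Hypothetical usage, with a made-up chromosome length:
#   chunk_intervals 12000000
#   combine_chunks 1 12000000 gwas.chr1.impute2
```

Generating the intervals from the same function for both imputation and merging keeps the chunk order consistent, which matters because the combined file should stay sorted by position.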
X chromosome prephasing and imputation
Step 1: Prephasing using SHAPEIT
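For the X chromosome, SHAPEIT is run with its --chrX flag. The sketch below assumes the non-pseudoautosomal X chromosome genotypes are stored as gwas.chrX.bed/.bim/.fam, that the sex of each individual is recorded in the .fam file, and that the genetic map file name matches the 1000GP Phase 3 download. These names are assumptions to adapt. The function prints the command instead of running it.

```shell
# Assumed location of the genetic map and number of threads.
REF_DIR=1000GP_Phase3
THREADS=8

# Print the SHAPEIT command that prephases the X chromosome.
shapeit_chrx_cmd() {
  local prefix="gwas.chrX"
  echo "shapeit --input-bed ${prefix}.bed ${prefix}.bim ${prefix}.fam" \
       "--input-map ${REF_DIR}/genetic_map_chrX_nonPAR_combined_b37.txt" \
       "--chrX" \
       "--output-max ${prefix}.phased.haps ${prefix}.phased.sample" \
       "--thread ${THREADS}"
}

shapeit_chrx_cmd
```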
Step 2: Imputation using IMPUTE2
- You must use the -chrX flag for IMPUTE2 to proceed with X chromosome imputation
- You must give the SAMPLE file generated by SHAPEIT to IMPUTE2. This SAMPLE file has a sex column that gives the sex of the GWAS individuals.
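Putting the two points above together, an X chromosome IMPUTE2 call could be sketched as follows. The SAMPLE file is passed with -sample_known_haps_g. As before, the reference file names are my assumption about the 1000GP Phase 3 download, the gwas.chrX naming is hypothetical, and the function only prints the command.

```shell
# Assumed location of the 1000GP Phase 3 reference files.
REF_DIR=1000GP_Phase3

# Print the IMPUTE2 command that imputes one interval of the X chromosome
# from the haplotypes and SAMPLE file produced by SHAPEIT.
impute2_chrx_cmd() {
  local start=$1 end=$2
  echo "impute2 -chrX -use_prephased_g" \
       "-known_haps_g gwas.chrX.phased.haps" \
       "-sample_known_haps_g gwas.chrX.phased.sample" \
       "-m ${REF_DIR}/genetic_map_chrX_nonPAR_combined_b37.txt" \
       "-h ${REF_DIR}/1000GP_Phase3_chrX_NONPAR.hap.gz" \
       "-l ${REF_DIR}/1000GP_Phase3_chrX_NONPAR.legend.gz" \
       "-int ${start} ${end}" \
       "-Ne 20000" \
       "-o gwas.chrX.${start}_${end}.impute2"
}

# Example chunk.
impute2_chrx_cmd 5000001 10000001
```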