RNA-seq data analysis
Posted on September 13, 2016
Below shows a general workflow for carrying out a RNA-Seq experiment. In this guide, I will focus on the pre-processing of NGS raw reads, mapping, quantification and identification of differentially expressed genes and transcripts.
RNA-Seq Analysis Workflow
Annotation-based genome-guide assembly
Prepare data and software
Genome sequence and annotation (GRCh37)
We download the human genome FASTA sequences and annotation GTF file from the Ensembl FTP.
It is strongly recommended to include major chromosomes (e.g., for human chr1-22,chrX,chrY,chrM,) as well as un-placed and un-localized scaffolds. Typically, un-placed/un-localized scaffolds add just a few MegaBases to the genome length, however, a substantial number of reads may map to ribosomal RNA (rRNA) repeats on these scaffolds. These reads would be reported as unmapped if the scaffolds are not included in the genome, or, even worse, may be aligned to wrong loci on the chromosomes. Generally, patches and alternative haplotypes should not be included in the genome.
Create Mapping Indices
Before we can perform NGS read mapping, we will create the genome indices using the genome FASTA file as input. You can re-use these indices in all your future short read mapping. However, if you wish to map to a different genome build/assembly, you have to re-run this step using different genome sequences and save the indices in a different directory. Here, we will create indices for STAR and RSEM
STAR (Spliced Transcripts Alignment to a Reference)
rsem-prepare-reference [options] reference_fasta_file(s) reference_name
--gtf option specifies path to the gene annotations (in GTF format), and RSEM assumes the FASTA file contains sequence of a genome. If this option is off, RSEM will assume the FASTA file contains the reference transcripts. The name of each sequence in the Multi-FASTA files is its transcript_id.
Mapping with STAR (2-pass mode)
Issues in using STAR
If you encount error “terminate called after throwing an instance of ‘std::bad_alloc’“ you have adjust some parameters to downsize the memory you are using.
--genomeSAindexNbases 10 default: 14 int: length (bases) of the SA pre-indexing string. Typically between 10 and 15. Longer strings will use much more memory, but allow faster searches.
--genomeSAsparseD 2 default: 1 int>0: suffix array sparsity, i.e. distance between indices: use bigger numbers to decrease needed RAM at the cost of mapping speed reduction
Quantification with RSEM
In this tutorial, we use RSEM to quantify the expression of genes and transcript. In the previous step, we instruct STAR to output genomic alignments in transcriptomic coordinates (i.e. Aligned.toTranscriptome.out.bam). We input this file to RSEM to produce gene and transcript expression levels.
--bam Input file is in BAM format.
--no-bam-output Do not output any BAM file.
-p Number of threads to use.
--paired-end Input reads are paired-end reads.
--forward-prob Probability of generating a read from the forward strand of a transcript. 1: strand-specific protocol where all (upstream) reads are derived from the forward strand (Ligation method); 0: strand-specific protocol where all (upstream) read are derived from the reverse strand (dUTP method); 0.5: non-strand-specific protocol. For strand issues discussion
RSEM generates 2 result files:
Prepare input matrix
We use paste command to join the rsem.genes.results files side-by-side, then use cut to select the columns containing the expected_count information, and place them into a final output file. Repeat the same step for isoforms.+
This one-line command assumes the genes (and transcripts) in each files are in the same order. If they are not, you will have to sort the files before joining them together.
Why we use expected_count provided by RSEM?
The problem with using raw read counts is that the origin of some reads cannot always be uniquely determined. If two or more distinct transcripts in a particular sample share some common sequence (for example, if they are alternatively spliced mRNAs or mRNAs derived from paralogous genes), then sequence alignment may not be sufficient to discriminate the true origin of reads mapping to these transcripts. One approach to addressing this issue involves discarding these multiple-mapped reads entirely. Another involves partitioning and distributing portions of a multiple-mapped read’s expression value between all of the transcripts to which it maps. So-called “rescue” methods implement this second approach in a naive fashion. RSEM improves upon this approach, utilizing an Expectation-Maximization (EM) algorithm to estimate maximum likelihood expression levels. These “expected counts” can then be provided as a matrix (rows = mRNAs, columns = samples) to programs such as EBSeq, DESeq, or edgeR to identify differentially expressed genes. In terms of “expected counts” in paired-end data, RSEM treats each pair of reads as a single unit.
Quantification with HTSeq
--format=<format> Format of the input data. Possible values are sam (for text SAM files) and bam (for binary BAM files). Default is sam.
--order=<order> For paired-end data, the alignment have to be sorted either by read name or by alignment position. If your data is not sorted, use the samtools sort function of samtools to sort it. Use this option, with name or pos for
--stranded=<yes/no/reverse> whether the data is from a strand-specific assay (default: yes)
For stranded=no, a read is considered overlapping with a feature regardless of whether it is mapped to the same or the opposite strand as the feature. For stranded=yes and single-end reads, the read has to be mapped to the same strand as the feature. For paired-end reads, the first read has to be on the same strand and the second read on the opposite strand (Ligation method). For stranded=reverse, these rules are reversed (dUTP method). For strand issues discussion
--a=<minaqual> skip all reads with alignment quality lower than the given minimum value (default: 10 — Note: the default used to be 0 until version 0.5.4.)
--type=<feature type> feature type (3rd column in GFF file) to be used, all features of other type are ignored (default, suitable for RNA-Seq analysis using an Ensembl GTF file: exon)
--idattr=<id attribute> GFF attribute to be used as feature ID. Several GFF lines with the same feature ID will be considered as parts of the same feature. The feature ID is used to identity the counts in the output table. The default, suitable for RNA-Seq analysis using an Ensembl GTF file, is gene_id.
--mode=<mode> Mode to handle reads overlapping more than one feature. Possible values for