# RNA-seq data analysis

##### Posted on September 13, 2016

Below shows a general workflow for carrying out a RNA-Seq experiment. In this guide, I will focus on the pre-processing of NGS raw reads, mapping, quantification and identification of differentially expressed genes and transcripts.

## Annotation-based genome-guide assembly

### Prepare data and software

#### Genome sequence and annotation (GRCh37)

We download the human genome FASTA sequences and annotation GTF file from the Ensembl FTP.

It is strongly recommended to include major chromosomes (e.g., for human chr1-22,chrX,chrY,chrM,) as well as un-placed and un-localized scaffolds. Typically, un-placed/un-localized scaffolds add just a few MegaBases to the genome length, however, a substantial number of reads may map to ribosomal RNA (rRNA) repeats on these scaffolds. These reads would be reported as unmapped if the scaffolds are not included in the genome, or, even worse, may be aligned to wrong loci on the chromosomes. Generally, patches and alternative haplotypes should not be included in the genome.

### Create Mapping Indices

Before we can perform NGS read mapping, we will create the genome indices using the genome FASTA file as input. You can re-use these indices in all your future short read mapping. However, if you wish to map to a different genome build/assembly, you have to re-run this step using different genome sequences and save the indices in a different directory. Here, we will create indices for STAR and RSEM

#### Usage

rsem-prepare-reference [options] reference_fasta_file(s) reference_name

#### Options

--gtf option specifies path to the gene annotations (in GTF format), and RSEM assumes the FASTA file contains sequence of a genome. If this option is off, RSEM will assume the FASTA file contains the reference transcripts. The name of each sequence in the Multi-FASTA files is its transcript_id.

### Mapping with STAR (2-pass mode)

#### Issues in using STAR

If you encount error “terminate called after throwing an instance of ‘std::bad_alloc’“ you have adjust some parameters to downsize the memory you are using.

--genomeSAindexNbases 10 default: 14 int: length (bases) of the SA pre-indexing string. Typically between 10 and 15. Longer strings will use much more memory, but allow faster searches.

--genomeSAsparseD 2 default: 1 int>0: suffix array sparsity, i.e. distance between indices: use bigger numbers to decrease needed RAM at the cost of mapping speed reduction

### Quantification with RSEM

In this tutorial, we use RSEM to quantify the expression of genes and transcript. In the previous step, we instruct STAR to output genomic alignments in transcriptomic coordinates (i.e. Aligned.toTranscriptome.out.bam). We input this file to RSEM to produce gene and transcript expression levels.

#### Options

--bam Input file is in BAM format.

--no-bam-output Do not output any BAM file.

-p Number of threads to use.

--paired-end Input reads are paired-end reads.

--forward-prob Probability of generating a read from the forward strand of a transcript. 1: strand-specific protocol where all (upstream) reads are derived from the forward strand (Ligation method); 0: strand-specific protocol where all (upstream) read are derived from the reverse strand (dUTP method); 0.5: non-strand-specific protocol. For strand issues discussion

#### Output

RSEM generates 2 result files:

1. rsem.genes.results
2. rsem.isoforms.results.

#### Prepare input matrix

prepare input matrix (rows = mRNAs, columns = samples) to programs such as EBSeq, DESeq, or edgeR to identify differentially expressed genes

We use paste command to join the rsem.genes.results files side-by-side, then use cut to select the columns containing the expected_count information, and place them into a final output file. Repeat the same step for isoforms.+

This one-line command assumes the genes (and transcripts) in each files are in the same order. If they are not, you will have to sort the files before joining them together.

#### Why we use expected_count provided by RSEM?

The problem with using raw read counts is that the origin of some reads cannot always be uniquely determined. If two or more distinct transcripts in a particular sample share some common sequence (for example, if they are alternatively spliced mRNAs or mRNAs derived from paralogous genes), then sequence alignment may not be sufficient to discriminate the true origin of reads mapping to these transcripts. One approach to addressing this issue involves discarding these multiple-mapped reads entirely. Another involves partitioning and distributing portions of a multiple-mapped read’s expression value between all of the transcripts to which it maps. So-called “rescue” methods implement this second approach in a naive fashion. RSEM improves upon this approach, utilizing an Expectation-Maximization (EM) algorithm to estimate maximum likelihood expression levels. These “expected counts” can then be provided as a matrix (rows = mRNAs, columns = samples) to programs such as EBSeq, DESeq, or edgeR to identify differentially expressed genes. In terms of “expected counts” in paired-end data, RSEM treats each pair of reads as a single unit.

### Quantification with HTSeq

#### Options

--format=<format> Format of the input data. Possible values are sam (for text SAM files) and bam (for binary BAM files). Default is sam.

--order=<order> For paired-end data, the alignment have to be sorted either by read name or by alignment position. If your data is not sorted, use the samtools sort function of samtools to sort it. Use this option, with name or pos for to indicate how the input data has been sorted. The default is name. If name is indicated, htseq-count expects all the alignments for the reads of a given read pair to appear in adjacent records in the input data. For pos, this is not expected; rather, read alignments whose mate alignment have not yet been seen are kept in a buffer in memory until the mate is found. While, strictly speaking, the latter will also work with unsorted data, sorting ensures that most alignment mates appear close to each other in the data and hence the buffer is much less likely to overflow.

--stranded=<yes/no/reverse> whether the data is from a strand-specific assay (default: yes) For stranded=no, a read is considered overlapping with a feature regardless of whether it is mapped to the same or the opposite strand as the feature. For stranded=yes and single-end reads, the read has to be mapped to the same strand as the feature. For paired-end reads, the first read has to be on the same strand and the second read on the opposite strand (Ligation method). For stranded=reverse, these rules are reversed (dUTP method). For strand issues discussion

--a=<minaqual> skip all reads with alignment quality lower than the given minimum value (default: 10 — Note: the default used to be 0 until version 0.5.4.)

--type=<feature type> feature type (3rd column in GFF file) to be used, all features of other type are ignored (default, suitable for RNA-Seq analysis using an Ensembl GTF file: exon)

--idattr=<id attribute> GFF attribute to be used as feature ID. Several GFF lines with the same feature ID will be considered as parts of the same feature. The feature ID is used to identity the counts in the output table. The default, suitable for RNA-Seq analysis using an Ensembl GTF file, is gene_id.

--mode=<mode> Mode to handle reads overlapping more than one feature. Possible values for are union, intersection-strict and intersection-nonempty (default: union)

## De novo assembly

Published in categories tutorial  Tagged with RNA-seq  analysis  STAR  plot  R