Ting-You Wang

PacBio Iso-Seq data analysis

Posted on December 08, 2020

An Introduction to PacBio Iso-Seq data analysis.

First Post

PacBio Sequel Systems (Sequel I and Sequel II) have two modes, Continuous Long Reas (CLR) and Circular Consensus Sequencing (CCS). center

Software preparation

conda install -c bioconda pbccs # install CCS version 5.0 or above
conda install -c bioconda lima
conda install -c bioconda isoseq3 # install isoseq v3.4.0 or above
conda install -c bioconda pbcoretools # install dataset command
conda install -c bioconda bamtools
conda install -c bioconda minimap2
conda install -c bioconda samtools

Step1: Generate CCS (CCS reads)

If you don’t already have CCS reads, run ccs.

ccs [movie].subreads.bam [movie].ccs.bam 

Options

--all  # Emit all ZMWs
--min-rq # Minimum predicted accuracy in [0, 1]. using 0.99 by default.
--skip-polish  # Only output the initial draft template (faster, less accurate)

Note that ccs run polish by default unless you don’t want it (use --skip-polish ), so additional polishing step (isoseq3 polish) is no longer needed.

A typical command to run ccs will be like this, as followed.

ccs [movie].subreadset.xml [movie].consensusreadset.xml --log-level INFO --report-json [movie].report.json --hifi-summary-json [movie].hifi_summary.json --log-file [movie].ccs.log --report-file [movie].report.txt --metrics-json [movie].zmw_metrics.json.gz -j 64

OR

ccs [movie].subreadset.bam [movie].ccs.bam --log-level INFO --report-json [movie].report.json --hifi-summary-json [movie].hifi_summary.json --log-file [movie].ccs.log --report-file [movie].report.txt --metrics-json [movie].zmw_metrics.json.gz -j 64

One important changes for ccs (>=v5.0.0) is that it has the --all mode. In this mode, ccs outputs one representative sequence per productive ZMW, irrespective of quality and passes.

Note that ccs is now running on the Sequel IIe instrument, transferring HiFi reads directly off the instrument. The on-instrument ccs version and also SMRT Link ≥v10 run in the –all mode by default.

But don’t worry, if you want to only use HiFi reads, ccs automatically generates additional files for your convenience that only contain HiFi reads:

  • hifi_reads.fastq.gz
  • hifi_reads.fasta.gz
  • hifi_reads.bam

Step2: Barcode demultiplexer using LIMA (Full-length reads [FL reads])

We use the lima tool to remove the 5’ and 3’ cDNA primers

lima --isoseq --dump-clips [movie].ccs.bam primers.fasta [movie].fl.bam --peek-guess --log-file lima.log

Options

--isoseq  # Activate IsoSeq mode
--dump-clips # Dump clipped regions in a separate output file <prefix>.lima.clips
--peek-guess # Try to infer the used barcodes subset.

lima identifies and removes the 5’ and 3’ cDNA primers. If the sample is barcoded, include the barcode as part of the primer.

Example 1: Following is the primer.fasta for the Clontech SMARTer and NEB cDNA library prep, which are the officially recommended protocols:

>NEB_5p
GCAATGAAGTCGCAGGGTTGGG
>Clontech_5p
AAGCAGTGGTATCAACGCAGAGTACATGGGG
>NEB_Clontech_3p
GTACTCTGCGTTGATACCACTGCTT

If there are more than two sequences in your primer.fasta file or better said more than one pair of 5’ and 3’ primers in Example 1, please use lima with –peek-guess to remove spurious false positive signal. If you provided only one pair of 5’ and 3’ primers in the primer.fasta, –peak-guess become unnecessary.

Output files will be called according to their primer pair. Example for single sample libraries:

[movie].fl.NEB_5p--NEB_Clontech_3p.bam

If multiple 5’/3’ pairs of primers are given, lima will output one <prefix>.<5p>--<3p>.bam for each pair. If you want to analyze all the demultiplexed FL reads together to increase transcript recovery (Example: Same species, different tissues), you must make a combined data set:

    dataset create --type ConsensusReadSet combined_demux.consensusreadset.xml \
    prefix.5p--barcode1_3p.bam \
    prefix.5p--barcode2_3p.bam \
    prefix.5p--barcode3_3p.bam ...

Step3: Remove PolyA Tail and Artificial Concatemers (full-length, non-concatemer reads[FLNC reads])

We use isoseq3 refine to remove polyA tail and artificial concatemers:

isoseq3 refine --require-polya [movie].5p--3p.bam primers.fasta [movie].flnc.bam

Options

--min-polya-length  # Minimum poly(A) tail length. [20]
--require-polya     # Require FL reads to have a poly(A) tail and remove it.

Analyze multiple the demultiplexed FL reads together, showed above.

isoseq3 refine --require-polya combined_demux.consensusreadset.xml primers.fasta flnc.bam

Output

Output The following output files containing full-length non-concatemer reads:

 <movie>.flnc.bam
 <movie>.flnc.consensusreadset.xml

An intermediate flnc.bam file is produced which contains the FLNC reads. To convert to FASTA format, run:

bamtools convert -format fasta -in flnc.bam > flnc.fasta

Now, flnc.fasta is the full-length, non-concatemer FASTA file you can use to align back to the genome for analysis!

Step4: Cluster FLNC reads and generate polished transcripts (HQ Transcript Isoforms)

isoseq3 cluster [movie].flnc.bam [movie].polished.bam --verbose --use-qvs

Options

--use-qvs     # Use CCS QVs

Output

Note: Because the ccs was run with Polish, the isoseq3 cluster output is already polished! No additional polishing step is required. After completion, you will see the following files:

[movie].polished.hq.bam       
[movie].polished.hq.bam.pbi 
[movie].polished.lq.bam       
[movie].polished.lq.bam.pbi   
[movie].polished.hq.fasta.gz
[movie].polished.lq.fasta.gz
[movie].polished.cluster   
[moive].polished.transcriptset.xml

Step5: Align to Genome

We currently recommend using minimap2 to align to the reference genome.

minimap2 -t 8 -R "@RG\tID:Sample\tSM:hs\tLB:ga\tPL:PacBio" --MD -ax splice:hq -uf --secondary=no hg38.fasta polished.hq.fasta > aligned.sam
samtools sort -@ 8 -O BAM align.sam -o aligned.sort.bam
samtools index aligned.sort.bam

References


Published in categories tutorial  Tagged with PacBio  Long-reads  analysis