bio-long-read-sequencing-isoseq-analysis
$
npx mdskill add GPTomics/bioSkills/bio-long-read-sequencing-isoseq-analysisAnalyzes PacBio Iso-Seq data to discover and quantify full-length transcript isoforms
- Characterizes transcript diversity and identifies novel splice variants
- Uses IsoSeq3, pbmm2, and sqanti3_qc for processing and validation
- Processes data through CCS generation, clustering, and alignment workflows
- Delivers high-quality isoform consensus sequences and quantification reports
SKILL.md
.github/skills/bio-long-read-sequencing-isoseq-analysisView on GitHub ↗
---
name: bio-long-read-sequencing-isoseq-analysis
description: Analyze PacBio Iso-Seq data for full-length isoform discovery and quantification. Use when characterizing transcript diversity or identifying novel splice variants.
tool_type: cli
primary_tool: IsoSeq3
---
## Version Compatibility
Reference examples tested with: minimap2 2.26+, pandas 2.2+, pysam 0.22+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Iso-Seq Analysis
**"Analyze full-length isoforms from my Iso-Seq data"** → Process PacBio HiFi reads through CCS generation, primer removal, clustering, and isoform classification to discover novel transcript variants.
- CLI: `isoseq3 refine` → `isoseq3 cluster` → `pbmm2 align` → `sqanti3_qc.py`
## IsoSeq3 Pipeline Overview
```bash
# Full pipeline: subreads -> HQ transcripts
# 1. CCS: Generate circular consensus sequences
# 2. Lima: Remove primers and demultiplex
# 3. Refine: Remove polyA and concatemers
# 4. Cluster: Group into isoforms
# 5. Polish: Generate high-quality consensus (optional with HiFi)
```
## CCS Generation
```bash
# Generate CCS from subreads (skip if using HiFi reads)
ccs input.subreads.bam ccs.bam \
--min-rq 0.9 \
--min-passes 3 \
--num-threads 32
# For HiFi reads, CCS is already done
# Start directly from HiFi reads
```
## Primer Removal with Lima
```bash
# Iso-Seq specific primer removal
lima ccs.bam primers.fasta demux.bam \
--isoseq \
--peek-guess \
--num-threads 16
# Output: demux.primer_5p--primer_3p.bam
# Lima reports also contain demux statistics
# Check lima report
cat demux.lima.summary
```
## Primer File Format
```fasta
>primer_5p
AAGCAGTGGTATCAACGCAGAGTACATGGG
>primer_3p
AAGCAGTGGTATCAACGCAGAGTAC
```
## Refine Full-Length Reads
```bash
# Remove polyA tails and concatemers
isoseq3 refine demux.primer_5p--primer_3p.bam primers.fasta refined.bam \
--require-polya \
--min-polya-length 20
# Output: refined.bam (full-length non-chimeric reads)
# Also: refined.filter_summary.json
# Check refinement stats
cat refined.filter_summary.json | jq
```
## Cluster Into Isoforms
```bash
# Cluster similar transcripts
isoseq3 cluster refined.bam clustered.bam \
--verbose \
--use-qvs \
--num-threads 32
# Output files:
# - clustered.bam: Clustered transcripts
# - clustered.hq_transcripts.fasta: High-quality consensus
# - clustered.lq_transcripts.fasta: Low-quality consensus
# - clustered.cluster_report.csv: Cluster membership
```
## Align to Reference
```bash
# Map HQ transcripts to reference genome
minimap2 -ax splice:hq \
-uf \
--secondary=no \
reference.fa \
clustered.hq_transcripts.fasta \
| samtools sort -o aligned.bam
samtools index aligned.bam
# For downstream analysis
pbmm2 align reference.fa clustered.bam aligned.bam \
--preset ISOSEQ \
--sort
```
## Collapse Redundant Isoforms
```bash
# Collapse mapped transcripts
isoseq3 collapse aligned.bam collapsed.gff
# Output:
# - collapsed.gff: Collapsed transcript models
# - collapsed.abundance.txt: Read counts per isoform
# - collapsed.group.txt: Isoform groupings
# Convert to GTF
gffread collapsed.gff -T -o collapsed.gtf
```
## SQANTI3 Quality Control
```bash
# Classify isoforms against reference annotation
sqanti3_qc.py \
clustered.hq_transcripts.fasta \
reference.gtf \
reference.fa \
-o sqanti_output \
--aligner_choice minimap2 \
--cage_peak cage_peaks.bed \
--polyA_motif_list polyA_motifs.txt \
--cpus 16
# Key output files:
# - sqanti_output_classification.txt: Per-isoform metrics
# - sqanti_output_junctions.txt: Splice junction details
# - sqanti_output.params.txt: Run parameters
```
## SQANTI3 Categories
| Category | Code | Description |
|----------|------|-------------|
| Full Splice Match | FSM | All junctions match reference |
| Incomplete Splice Match | ISM | Subset of reference junctions |
| Novel In Catalog | NIC | Novel combination of known junctions |
| Novel Not in Catalog | NNC | Contains novel junction |
| Antisense | AS | Overlaps gene on opposite strand |
| Genic | G | Within gene but no junction match |
| Intergenic | IR | Between genes |
| Fusion | FU | Spans multiple genes |
## SQANTI3 Filtering
```bash
# Filter artifacts using SQANTI3 rules
sqanti3_filter.py \
sqanti_output_classification.txt \
--isoforms clustered.hq_transcripts.fasta \
--gtf collapsed.gtf \
--faa predicted_proteins.faa \
-o sqanti_filtered
# Custom filtering
python << 'EOF'
import pandas as pd
classification = pd.read_csv('sqanti_output_classification.txt', sep='\t')
# Keep FSM, ISM, NIC with evidence
keep = classification[
(classification['structural_category'].isin(['full-splice_match', 'incomplete-splice_match', 'novel_in_catalog'])) &
(classification['FL'] >= 2) &
(classification['bite'] == 'FALSE')
]
keep['isoform'].to_csv('filtered_isoforms.txt', index=False, header=False)
EOF
```
## Quantification with Pigeon
```bash
# PacBio's isoform quantification tool
pigeon classify \
aligned.bam \
reference.gtf \
reference.fa \
-o pigeon_output
# Produces count matrix and classification
pigeon report pigeon_output_classification.txt
```
## TAMA for Annotation Merge
```bash
# Merge Iso-Seq with reference annotation
# First, convert to TAMA format
tama_format_convert.py \
-i collapsed.gtf \
-f gtf \
-o isoseq.bed
# Create file list
echo -e "isoseq.bed\tcapped\t1\t1" > file_list.txt
echo -e "reference.bed\tcapped\t1\t2" >> file_list.txt
# Merge annotations
tama_merge.py \
-f file_list.txt \
-p merged \
-a 50 \
-z 50 \
-m 10
```
## Python Processing
**Goal:** Summarize Iso-Seq clustering results including isoform counts, read support, and transcript lengths.
**Approach:** Parse the cluster report CSV for per-isoform read counts and extract sequence lengths from the HQ FASTA.
```python
import pysam
import pandas as pd
def parse_cluster_report(report_path):
df = pd.read_csv(report_path)
isoform_counts = df.groupby('cluster_id').size()
return isoform_counts
def get_transcript_lengths(fasta_path):
lengths = {}
with pysam.FastxFile(fasta_path) as fh:
for entry in fh:
lengths[entry.name] = len(entry.sequence)
return lengths
def summarize_isoseq(cluster_report, hq_fasta):
counts = parse_cluster_report(cluster_report)
lengths = get_transcript_lengths(hq_fasta)
print(f'Total isoforms: {len(counts)}')
print(f'Median support: {counts.median():.0f} reads')
print(f'Mean length: {sum(lengths.values())/len(lengths):.0f} bp')
return counts, lengths
```
## Differential Isoform Usage
```r
library(IsoformSwitchAnalyzeR)
# Import SQANTI3 results
switchList <- importIsoformExpression(
isoformCountMatrix = 'counts.txt',
isoformRepExpression = 'tpm.txt',
designMatrix = design
)
# Add SQANTI3 annotations
switchList <- addORFfromFASTA(
switchList,
orfs = 'sqanti_corrected.fasta'
)
# Analyze switching
switchList <- isoformSwitchTestDEXSeq(switchList)
# Extract significant switches
sig_switches <- switchList$isoformSwitchAnalysis[
switchList$isoformSwitchAnalysis$padj < 0.05,
]
```
## Quality Metrics
| Metric | Good | Acceptable | Poor |
|--------|------|------------|------|
| CCS passes | >3 | 2-3 | <2 |
| Full-length % | >80% | 60-80% | <60% |
| Clustering rate | >90% | 80-90% | <80% |
| FSM % | >50% | 30-50% | <30% |
| Novel isoforms | 10-30% | 30-50% | >50% (suspect) |
## Troubleshooting
| Issue | Possible Cause | Solution |
|-------|---------------|----------|
| Low full-length % | Primer issues | Check primer sequences |
| High concatemer % | Library prep | Increase SMRTbell cleanup |
| Few FSM | Poor reference | Use comprehensive GTF |
| High NNC % | Novel biology or artifacts | Validate with orthogonal data |
| Low clustering | High diversity | Reduce clustering stringency |
## Docker/Singularity
```bash
# Using PacBio Docker images
docker run -v /data:/data \
quay.io/biocontainers/isoseq3:4.0.0--h9ee0642_0 \
isoseq3 cluster /data/refined.bam /data/clustered.bam
# SQANTI3 in Singularity
singularity exec sqanti3.sif \
sqanti3_qc.py input.fa ref.gtf ref.fa -o output
```
## Related Skills
- long-read-sequencing/basecalling - ONT/PacBio basics
- rna-quantification/alignment-free-quant - Expression analysis
- genome-intervals/gtf-gff-handling - GTF/GFF handling
- differential-expression/timeseries-de - Differential isoform usage
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.