---
name: bio-variant-calling-structural-variant-calling
description: Call structural variants (SVs) from sequencing data using Manta, Delly, GRIDSS, and LUMPY. Detects deletions, insertions, inversions, duplications, and translocations too large for standard SNV callers. Use when detecting structural variants from short-read or long-read data and building consensus callsets.
tool_type: cli
primary_tool: manta
---
## Version Compatibility
Reference examples tested with: Manta 1.6+, Delly 1.2+, GRIDSS 2.13+, bcftools 1.19+, samtools 1.19+, SURVIVOR 1.0.7+, Sniffles2 2.2+
Before using code patterns, verify installed versions match. If versions differ:
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
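The version check above can be scripted. The sketch below is illustrative only: `version_ge` and `check_min_version` are hypothetical helper names, and the comparison assumes GNU `sort -V` is available. How each tool prints its version differs, so the extraction step in the trailing comment is an assumption to confirm per tool.

```shell
# Succeeds when version $1 is at least version $2 (relies on GNU `sort -V`).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: report whether an installed version meets the tested minimum.
# Arguments: tool name, installed version, minimum version.
check_min_version() {
    tool=$1; installed=$2; minimum=$3
    if version_ge "$installed" "$minimum"; then
        printf '%s %s OK (>= %s)\n' "$tool" "$installed" "$minimum"
    else
        printf '%s %s is older than tested minimum %s\n' "$tool" "$installed" "$minimum"
        return 1
    fi
}

# Example wiring, assuming the first line of `--version` looks like "samtools 1.19":
#   check_min_version samtools "$(samtools --version | head -n 1 | awk '{print $2}')" 1.19
```

A plain string comparison would rank 1.9 above 1.19, which is why the sketch relies on version-aware sorting.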
# Structural Variant Calling
**"Call structural variants from my WGS data"** -> Detect large genomic rearrangements (deletions, insertions, inversions, duplications, translocations) using split-read, discordant-pair, and assembly-based evidence.
- CLI: `configManta.py` (Manta), `delly call`, `gridss` (GRIDSS), `lumpyexpress`/`smoove call`
## SV Detection Limitations by Platform
Not all SV types are equally detectable across sequencing platforms. This table reflects practical detection performance, not theoretical capability:
| SV Type | Short-read Detection | Long-read Detection | Key Limitation |
|---------|---------------------|---------------------|----------------|
| Deletion | Good (read-pair + split-read) | Excellent | Short reads miss deletions in repetitive regions |
| Duplication | Moderate (read-pair + depth) | Good | Tandem vs dispersed distinction unreliable with short reads |
| Inversion | Moderate (read-pair) | Good | Breakpoints in repeats cause false negatives |
| Insertion | Poor (limited by read length) | Excellent | Short reads cannot resolve insertions >read length |
| Translocation | Moderate (discordant pairs) | Good | High false positive rate near centromeres/telomeres |
| Complex/nested | Poor | Good (with assembly) | Multiple overlapping SVs confound short-read signals |
## Caller Comparison
| Feature | Manta | Delly | GRIDSS | Smoove/LUMPY |
|---------|-------|-------|--------|--------------|
| Method | Read-pair + split-read + local assembly | Read-pair + split-read | Positional de Bruijn graph assembly | Read-pair + split-read |
| Speed | Fastest | Moderate | Slowest (2-5x Manta) | Moderate |
| DEL detection | Good | Good | Best precision | Good |
| INS detection | Good | Limited (small INS only) | Good | Cannot detect |
| Somatic mode | Yes | Yes | Yes (GRIDSS2/GRIPSS) | Limited |
| RNA-seq | Yes | No | No | No |
| Single breakends | No | No | Yes | No |
| Complex SVs | Limited | No | Yes (via LINX) | No |
GRIDSS produces the highest precision for deletions and uniquely detects single breakend events (one side of a breakpoint where the partner cannot be mapped). Manta provides the best speed-to-accuracy ratio for most applications. Delly excels at joint calling across cohorts. LUMPY/Smoove lacks insertion detection entirely.
## Consensus Calling Strategy
Current best practice: run Delly + GRIDSS + Manta + SvABA and keep calls supported by at least 2 of the 4 callers. Each caller has distinct algorithmic biases, so the union of callsets is noisy while a strict intersection discards too many true calls. The 2-of-4 threshold is a compromise that retains most of the union's sensitivity while removing false positives specific to a single algorithm.
## Manta
```bash
configManta.py \
--bam sample.bam \
--referenceFasta reference.fa \
--runDir manta_run
manta_run/runWorkflow.py -j 8
# Output: manta_run/results/variants/
# - diploidSV.vcf.gz (germline SVs)
# - candidateSV.vcf.gz (all candidates before scoring)
# - candidateSmallIndels.vcf.gz (simple indels below the 50bp minimum scored SV size, for Strelka input)
```
## Manta Tumor-Normal Mode
```bash
configManta.py \
--tumorBam tumor.bam \
--normalBam normal.bam \
--referenceFasta reference.fa \
--runDir manta_somatic
manta_somatic/runWorkflow.py -j 8
# Output includes:
# - somaticSV.vcf.gz (somatic SVs, scored by tumor/normal evidence ratio)
# - diploidSV.vcf.gz (germline SVs)
```
## Manta Options
```bash
# WES mode (adjusts depth filters for uneven exome coverage)
configManta.py \
--bam sample.bam \
--referenceFasta reference.fa \
--exome \
--callRegions regions.bed.gz \
--runDir manta_exome
# RNA-seq mode (handles split alignments across splice junctions)
configManta.py \
--bam rnaseq.bam \
--referenceFasta reference.fa \
--rna \
--runDir manta_rna
```
## Delly
```bash
delly call -g reference.fa -o sv_calls.bcf sample.bam
bcftools view sv_calls.bcf > sv_calls.vcf
# Joint calling across cohort (recommended for population studies)
delly call -g reference.fa -o joint_svs.bcf sample1.bam sample2.bam sample3.bam
```
## Delly Somatic Mode
```bash
delly call -g reference.fa -o svs.bcf tumor.bam normal.bam
# First column must match the sample IDs in svs.bcf (taken from the BAM read-group SM tags)
printf 'tumor\ttumor\nnormal\tcontrol\n' > samples.tsv
delly filter -f somatic -o somatic_svs.bcf -s samples.tsv svs.bcf
```
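The literal names `tumor` and `normal` in the sample file only work if they match the sample IDs in the BCF, which come from the read-group `SM` tags of the BAMs. As a rough sketch, the rows can be built from the BAM headers instead. `sm_tags` and `samples_row` are hypothetical helper names, and the sketch assumes each BAM carries exactly one `SM` value.

```shell
# Print the unique SM (sample) tags found in @RG lines of a SAM header on stdin.
sm_tags() {
    awk -F'\t' '$1 == "@RG" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^SM:/) print substr($i, 4)
    }' | sort -u
}

# Emit one sample-file row: <sample id><TAB><tumor|control>.
# Reads a SAM header on stdin; fails unless it carries exactly one SM value.
samples_row() {
    role=$1
    ids=$(sm_tags)
    n=$(printf '%s' "$ids" | awk 'NF { c++ } END { print c + 0 }')
    if [ "$n" -ne 1 ]; then
        echo "expected exactly one SM tag, found $n" >&2
        return 1
    fi
    printf '%s\t%s\n' "$ids" "$role"
}

# Example wiring (requires samtools and the BAMs used above):
#   { samtools view -H tumor.bam  | samples_row tumor
#     samtools view -H normal.bam | samples_row control; } > samples.tsv
```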
## Delly SV Types
```bash
delly call -t DEL -g ref.fa -o deletions.bcf sample.bam
delly call -t DUP -g ref.fa -o duplications.bcf sample.bam
delly call -t INV -g ref.fa -o inversions.bcf sample.bam
delly call -t BND -g ref.fa -o translocations.bcf sample.bam
delly call -t INS -g ref.fa -o insertions.bcf sample.bam
```
## GRIDSS
GRIDSS uses positional de Bruijn graph assembly to reconstruct breakpoints, producing the highest precision among short-read callers. It detects single breakend events, where only one side of a rearrangement maps to the reference. These events are critical for viral integrations, centromeric breakpoints, and highly rearranged cancer genomes.
```bash
gridss \
--reference reference.fa \
--output gridss_svs.vcf \
--assembly gridss_assembly.bam \
--threads 8 \
sample.bam
```
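To see how many single breakends a callset contains, records can be classified by their ALT notation. The sketch below relies on the VCF breakend syntax: paired breakends contain `[` or `]`, and single breakends start or end with `.`. The function name `count_breakends` is hypothetical, and multi-allelic ALT fields are not handled.

```shell
# Classify VCF records on stdin by ALT notation:
#   paired breakend -> ALT contains '[' or ']'   (e.g. "N[chr2:500[")
#   single breakend -> ALT starts or ends with '.' (e.g. "A." or ".GGT")
#   other           -> anything else (e.g. symbolic alleles such as <DEL>)
count_breakends() {
    awk -F'\t' '
        /^#/ { next }
        {
            alt = $5
            if (index(alt, "[") || index(alt, "]"))
                paired++
            else if (alt != "." && (substr(alt, 1, 1) == "." || substr(alt, length(alt), 1) == "."))
                single++
            else
                other++
        }
        END { printf "paired=%d single=%d other=%d\n", paired, single, other }
    '
}

# Example: count_breakends < gridss_svs.vcf
```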
## GRIDSS Somatic Mode (GRIDSS2 + GRIPSS)
```bash
# GRIDSS2 with paired tumor-normal
gridss \
--reference reference.fa \
--output gridss_raw.vcf \
--assembly gridss_assembly.bam \
--labels normal,tumor \
--threads 8 \
normal.bam tumor.bam
# GRIPSS post-filtering (somatic/germline classification)
gripss \
-ref_genome reference.fa \
-ref_genome_version 38 \
-sample tumor \
-reference normal \
-vcf gridss_raw.vcf \
-output_dir gripss_output/
```
Complex rearrangement reconstruction is available via LINX, which interprets GRIDSS breakpoints into higher-order SV events (chromothripsis, breakage-fusion-bridge cycles).
## LUMPY
```bash
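# Discordant pairs: exclude proper pairs, unmapped reads/mates, secondary and duplicate alignments
samtools view -b -F 1294 sample.bam > discordant.bam
# Split reads: extracted with the helper script shipped in the LUMPY repository
samtools view -h sample.bam | \
/path/to/lumpy-sv/scripts/extractSplitReads_BwaMem -i stdin | \
samtools view -b - > splitters.bam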
lumpyexpress \
-B sample.bam \
-S splitters.bam \
-D discordant.bam \
-o lumpy_svs.vcf
```
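The `-F 1294` filter is easier to audit once the flag is broken into its SAM bits. The sketch below decodes any flag value. `decode_sam_flag` is a hypothetical helper name, and the bit labels are shorthand for the SAM specification's FLAG definitions.

```shell
# Print the names of the SAM FLAG bits set in the integer $1, comma-separated.
decode_sam_flag() {
    flag=$1
    out=""
    for pair in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
                16:reverse 32:mate_reverse 64:read1 128:read2 \
                256:secondary 512:qc_fail 1024:duplicate 2048:supplementary; do
        bit=${pair%%:*}     # numeric bit value
        name=${pair#*:}     # label for that bit
        if [ $(( flag & bit )) -ne 0 ]; then
            out="${out:+$out,}$name"
        fi
    done
    printf '%s\n' "$out"
}

decode_sam_flag 1294
# prints: proper_pair,unmapped,mate_unmapped,secondary,duplicate
```

Because `-F` drops reads with any of these bits set, the reads that remain are mapped, primary, non-duplicate alignments whose pairs are not flagged as proper: the discordant pairs LUMPY expects.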
## Smoove (LUMPY Wrapper)
```bash
smoove call \
--name sample \
--fasta reference.fa \
--outdir smoove_output \
-p 8 \
--genotype \
sample.bam
# Output: smoove_output/sample-smoove.genotyped.vcf.gz
```
## Merge Multiple Callers with SURVIVOR
**Goal:** Increase confidence in SV calls by requiring support from multiple callers with distinct algorithmic approaches.
**Approach:** Run 2-4 callers independently, then merge callsets with SURVIVOR requiring agreement on breakpoint proximity and SV type. Using max_dist=1000bp allows for the breakpoint imprecision inherent in short-read callers while min_callers=2 filters false positives unique to any single algorithm.
```bash
ls manta_svs.vcf delly_svs.vcf gridss_svs.vcf smoove_svs.vcf > vcf_list.txt
# max_dist=1000 min_callers=2 type_agree=1 strand_agree=1 estimate_dist=0 min_size=50
SURVIVOR merge vcf_list.txt 1000 2 1 1 0 50 merged_svs.vcf
```
The 1000bp max_dist accounts for breakpoint position uncertainty across callers (Manta and GRIDSS resolve breakpoints more precisely than Delly/LUMPY). Requiring type_agree=1 prevents merging a deletion call with a duplication call at the same locus.
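After merging, it helps to see which caller combinations support the retained calls. SURVIVOR records this in the `SUPP_VEC` INFO field, with one 0/1 digit per input VCF in the order the files appear in `vcf_list.txt`. `ls` sorts its output, so check the file rather than assuming the order typed on the command line. The sketch below tallies records per support pattern; `tally_supp_vec` is a hypothetical helper name.

```shell
# Tally merged VCF records on stdin by SUPP_VEC pattern.
# Output: <pattern><TAB><count>, most frequent pattern first.
tally_supp_vec() {
    awk -F'\t' '
        /^#/ { next }
        {
            n = split($8, kv, ";")
            for (i = 1; i <= n; i++)
                if (kv[i] ~ /^SUPP_VEC=/) count[substr(kv[i], 10)]++
        }
        END { for (v in count) printf "%s\t%d\n", v, count[v] }
    ' | sort -k2,2nr -k1,1
}

# Example: tally_supp_vec < merged_svs.vcf
```

A pattern that dominates the tally, such as the same two callers agreeing on nearly every record, suggests the consensus is driven by shared biases rather than by independent evidence.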
## Filter SV Calls
```bash
bcftools view -i 'QUAL >= 20' svs.vcf > svs.filtered.vcf
# Size filter; records with no SVLEN value (typically BND) do not match and are dropped
bcftools view -i 'ABS(SVLEN) >= 50' svs.vcf > svs.min50.vcf
# Filter by SV type
bcftools view -i 'SVTYPE="DEL"' svs.vcf > deletions.vcf
bcftools view -i 'SVTYPE="INS"' svs.vcf > insertions.vcf
bcftools view -i 'SVTYPE="INV"' svs.vcf > inversions.vcf
bcftools view -i 'SVTYPE="DUP"' svs.vcf > duplications.vcf
bcftools view -i 'SVTYPE="BND"' svs.vcf > translocations.vcf
bcftools view -f PASS svs.vcf > svs.pass.vcf
```
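Before and after filtering, a quick tally by `SVTYPE` shows whether a filter removed a whole class of variants. The sketch below works on plain VCF text from stdin and does not need bcftools. `count_svtypes` is a hypothetical helper name, and records without an `SVTYPE` tag are reported as `NA`.

```shell
# Count VCF records on stdin per INFO/SVTYPE value.
# Output: <type><TAB><count>, sorted by type name.
count_svtypes() {
    awk -F'\t' '
        /^#/ { next }
        {
            type = "NA"
            n = split($8, kv, ";")
            for (i = 1; i <= n; i++)
                if (kv[i] ~ /^SVTYPE=/) type = substr(kv[i], 8)
            count[type]++
        }
        END { for (t in count) printf "%s\t%d\n", t, count[t] }
    ' | sort
}

# Example: compare count_svtypes < svs.vcf with count_svtypes < svs.min50.vcf
```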
## Annotate SVs
```bash
AnnotSV \
-SVinputFile svs.vcf \
-genomeBuild GRCh38 \
-outputFile annotated_svs
# Output includes: gene overlap, DGV frequency, gnomAD-SV population AF, ClinVar pathogenicity
```
## SV Types
| Type | Code | Description | Typical Size Range |
|------|------|-------------|--------------------|
| Deletion | DEL | Sequence removed | 50bp - 100Mb |
| Insertion | INS | Novel sequence inserted | 50bp - 10kb (short-read); unlimited (long-read) |
| Inversion | INV | Sequence orientation reversed | 1kb - 10Mb |
| Duplication | DUP | Sequence copied (tandem or dispersed) | 1kb - 10Mb |
| Translocation | BND | Breakend connecting different chromosomes | N/A (inter-chromosomal) |
## Coverage Guidelines
| Coverage | Detection Ability | Practical Guidance |
|----------|-------------------|--------------------|
| 10x | Large SVs only (>1kb) | Limited breakpoint accuracy; high false negative rate for SVs <1kb; suitable only for large deletion screening |
| 30x | Most SVs detected | Standard for WGS; good sensitivity for DEL/DUP/INV >300bp; moderate INS detection |
| 50x+ | Small SVs, precise breakpoints | Better sensitivity near repetitive regions; resolves complex SVs; recommended for clinical applications |
Below 30x, split-read evidence becomes sparse and callers rely more heavily on read-pair signals, which have lower breakpoint resolution (~300-500bp uncertainty vs ~10bp for split-reads).
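To place a sample in the table above, mean depth can be approximated from `samtools idxstats` output as mapped reads × read length ÷ total reference length. This is a rough sketch: it assumes a uniform read length and ignores clipping, duplicates, and unplaced contigs. `estimate_depth` and `coverage_tier` are hypothetical helper names, and the tier boundaries are an interpretation of the table rather than fixed thresholds.

```shell
# Estimate mean depth from `samtools idxstats` output on stdin.
# $1 = read length in bp. Input columns: name, length, mapped reads, unmapped reads.
estimate_depth() {
    awk -F'\t' -v readlen="$1" '
        $1 != "*" { bases += $2; reads += $3 }
        END {
            if (bases == 0) { print "no reference sequences" ; exit 1 }
            printf "%.1f\n", reads * readlen / bases
        }
    '
}

# Map a depth value to a coarse tier (boundaries are an assumption).
coverage_tier() {
    awk -v d="$1" 'BEGIN {
        if (d < 30)      print "below 30x"
        else if (d < 50) print "30x to 50x"
        else             print "50x or more"
    }'
}

# Example wiring (requires samtools and an indexed BAM):
#   coverage_tier "$(samtools idxstats sample.bam | estimate_depth 150)"
```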
## Short-read vs Long-read Decision Framework
Short reads are sufficient for: deletions >300bp, balanced translocations, large tandem duplications, and population-scale screening where cost per sample matters.
Long reads are necessary for: insertions exceeding read length, complex/nested SVs, SVs in repetitive regions (segmental duplications, LINE/SINE elements), complete breakpoint resolution, and phased SV haplotyping.
Cost consideration: short reads for population-scale SV surveys (hundreds of samples), long reads for clinical-grade SV characterization where completeness matters more than throughput.
## Long-Read SV Callers
| Caller | Best For | Key Strengths |
|--------|----------|---------------|
| Sniffles2 | ONT/HiFi general | 11.8x faster than v1; multi-sample (population) calling by combining per-sample `.snf` files; mosaic SV detection; best overall accuracy |
| CuteSV2 | ONT data | Highest recall for ONT; signature-based clustering handles noisy reads |
| pbsv | PacBio HiFi | Official PacBio tool; best paired with PBMM2 aligner; tandem repeat aware |
| Severus | Somatic SVs | Phased breakpoint graph approach; resolves complex somatic rearrangements (Nature Biotechnology 2025) |
### Recommended Aligner-Caller Pairings
- Minimap2 + CuteSV2: ONT general purpose; fastest end-to-end
- Winnowmap + Sniffles2: high accuracy in repetitive regions (Winnowmap downweights repetitive k-mers)
- PBMM2 + pbsv: PacBio HiFi data; PBMM2 produces the CIGAR strings pbsv expects
See long-read-sequencing/structural-variants for long-read SV calling workflows with full pipeline examples.
## Related Skills
- long-read-sequencing/structural-variants - Long-read SV calling with Sniffles2, CuteSV, pbsv
- copy-number/cnvkit-analysis - Copy number variant detection (complements SV calling for dosage changes)
- variant-calling/filtering-best-practices - VCF filtering strategies applicable to SV callsets
- variant-calling/variant-annotation - Functional annotation of variants including SVs
- alignment-files/alignment-filtering - BAM preparation and quality filtering before SV calling