bio-workflows-somatic-variant-pipeline
$
npx mdskill add GPTomics/bioSkills/bio-workflows-somatic-variant-pipelineCalls somatic mutations from tumor-normal paired samples using Mutect2 or Strelka2
- Solves the task of identifying cancer-specific genetic mutations from paired sequencing data
- Uses GATK Mutect2, Strelka2, and tools like BQSR, Funcotator, and VEP for variant calling and annotation
- Applies contamination filtering, orientation bias correction, and quality thresholds to prioritize true variants
- Delivers annotated variant calls, TMB scores, and mutational signatures in standard formats like VCF
SKILL.md
.github/skills/bio-workflows-somatic-variant-pipelineView on GitHub ↗
---
name: bio-workflows-somatic-variant-pipeline
description: End-to-end somatic variant calling from tumor-normal paired samples using Mutect2 or Strelka2. Covers preprocessing, variant calling, filtering, and annotation for cancer genomics. Use when calling somatic mutations from tumor-normal pairs.
tool_type: cli
primary_tool: GATK Mutect2
---
## Version Compatibility
Reference examples tested with: CNVkit 0.9+, Ensembl VEP 111+, GATK 4.5+, SnpEff 5.2+, bcftools 1.19+, picard 3.1+
Before using code patterns, verify installed versions match. If versions differ:
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Somatic Variant Pipeline
**"Call somatic mutations from my tumor-normal pair"** → Orchestrate alignment, Mutect2 somatic calling, contamination filtering, variant annotation (Funcotator/VEP), TMB calculation, and mutational signature analysis.
Complete workflow for calling somatic mutations from tumor-normal paired samples.
## Pipeline Overview
```
Tumor BAM + Normal BAM
│
├── Preprocessing (if needed)
│ └── MarkDuplicates, BQSR
│
├── Variant Calling
│ ├── Mutect2 (GATK) - SNVs + indels
│ └── Strelka2 - SNVs + indels (faster)
│
├── Filtering
│ ├── FilterMutectCalls
│ ├── Contamination estimation
│ └── Orientation bias filtering
│
├── Annotation
│ ├── Funcotator / VEP
│ └── Cancer-specific databases
│
└── Output: Filtered somatic VCF
```
## Mutect2 Workflow (GATK)
### Step 1: Panel of Normals (Optional but Recommended)
```bash
# Create PON from multiple normal samples
for normal in normal1.bam normal2.bam normal3.bam; do
sample=$(basename $normal .bam)
gatk Mutect2 \
-R reference.fa \
-I $normal \
--max-mnp-distance 0 \
-O ${sample}.vcf.gz
done
# Combine into PON
gatk GenomicsDBImport \
-R reference.fa \
--genomicsdb-workspace-path pon_db \
-V normal1.vcf.gz \
-V normal2.vcf.gz \
-V normal3.vcf.gz \
-L intervals.bed
gatk CreateSomaticPanelOfNormals \
-R reference.fa \
-V gendb://pon_db \
-O pon.vcf.gz
```
### Step 2: Call Somatic Variants
```bash
gatk Mutect2 \
-R reference.fa \
-I tumor.bam \
-I normal.bam \
-normal normal_sample_name \
--germline-resource af-only-gnomad.vcf.gz \
--panel-of-normals pon.vcf.gz \
--f1r2-tar-gz f1r2.tar.gz \
-O unfiltered.vcf.gz
```
### Step 3: Learn Orientation Bias
```bash
gatk LearnReadOrientationModel \
-I f1r2.tar.gz \
-O read-orientation-model.tar.gz
```
### Step 4: Calculate Contamination
```bash
gatk GetPileupSummaries \
-I tumor.bam \
-V small_exac_common.vcf.gz \
-L small_exac_common.vcf.gz \
-O tumor_pileups.table
gatk GetPileupSummaries \
-I normal.bam \
-V small_exac_common.vcf.gz \
-L small_exac_common.vcf.gz \
-O normal_pileups.table
gatk CalculateContamination \
-I tumor_pileups.table \
-matched normal_pileups.table \
-O contamination.table \
--tumor-segmentation segments.table
```
### Step 5: Filter Variants
```bash
gatk FilterMutectCalls \
-R reference.fa \
-V unfiltered.vcf.gz \
--contamination-table contamination.table \
--tumor-segmentation segments.table \
--ob-priors read-orientation-model.tar.gz \
-O filtered.vcf.gz
# Extract PASS variants
bcftools view -f PASS filtered.vcf.gz -Oz -o somatic_final.vcf.gz
```
## Strelka2 Workflow (Faster Alternative)
```bash
# Configure
configureStrelkaSomaticWorkflow.py \
--normalBam normal.bam \
--tumorBam tumor.bam \
--referenceFasta reference.fa \
--runDir strelka_run
# Execute
strelka_run/runWorkflow.py -m local -j 16
# Output files
# strelka_run/results/variants/somatic.snvs.vcf.gz
# strelka_run/results/variants/somatic.indels.vcf.gz
# Merge SNVs and indels
bcftools concat \
strelka_run/results/variants/somatic.snvs.vcf.gz \
strelka_run/results/variants/somatic.indels.vcf.gz \
-a -Oz -o strelka_somatic.vcf.gz
```
## Annotation
### Funcotator (GATK)
```bash
gatk Funcotator \
-R reference.fa \
-V somatic_final.vcf.gz \
-O annotated.vcf.gz \
--output-file-format VCF \
--data-sources-path funcotator_dataSources.v1.7 \
--ref-version hg38
```
### VEP with Cancer Databases
```bash
vep -i somatic_final.vcf.gz -o annotated.vcf \
--vcf --cache --offline \
--assembly GRCh38 \
--everything \
--plugin CADD,cadd_scores.tsv.gz \
--custom cosmic.vcf.gz,COSMIC,vcf,exact,0,CNT \
--fork 4
```
## Complete Pipeline Script
```bash
#!/bin/bash
set -euo pipefail
TUMOR_BAM=$1
NORMAL_BAM=$2
NORMAL_NAME=$3
REFERENCE=$4
OUTPUT_PREFIX=$5
GNOMAD=$6
PON=$7
THREADS=16
echo "=== Step 1: Mutect2 calling ==="
gatk Mutect2 \
-R $REFERENCE \
-I $TUMOR_BAM \
-I $NORMAL_BAM \
-normal $NORMAL_NAME \
--germline-resource $GNOMAD \
--panel-of-normals $PON \
--f1r2-tar-gz ${OUTPUT_PREFIX}_f1r2.tar.gz \
--native-pair-hmm-threads $THREADS \
-O ${OUTPUT_PREFIX}_unfiltered.vcf.gz
echo "=== Step 2: Learn orientation bias ==="
gatk LearnReadOrientationModel \
-I ${OUTPUT_PREFIX}_f1r2.tar.gz \
-O ${OUTPUT_PREFIX}_orientation.tar.gz
echo "=== Step 3: Pileup summaries ==="
gatk GetPileupSummaries \
-I $TUMOR_BAM \
-V $GNOMAD \
-L $GNOMAD \
-O ${OUTPUT_PREFIX}_tumor_pileups.table
gatk GetPileupSummaries \
-I $NORMAL_BAM \
-V $GNOMAD \
-L $GNOMAD \
-O ${OUTPUT_PREFIX}_normal_pileups.table
echo "=== Step 4: Calculate contamination ==="
gatk CalculateContamination \
-I ${OUTPUT_PREFIX}_tumor_pileups.table \
-matched ${OUTPUT_PREFIX}_normal_pileups.table \
-O ${OUTPUT_PREFIX}_contamination.table \
--tumor-segmentation ${OUTPUT_PREFIX}_segments.table
echo "=== Step 5: Filter variants ==="
gatk FilterMutectCalls \
-R $REFERENCE \
-V ${OUTPUT_PREFIX}_unfiltered.vcf.gz \
--contamination-table ${OUTPUT_PREFIX}_contamination.table \
--tumor-segmentation ${OUTPUT_PREFIX}_segments.table \
--ob-priors ${OUTPUT_PREFIX}_orientation.tar.gz \
-O ${OUTPUT_PREFIX}_filtered.vcf.gz
echo "=== Step 6: Extract PASS variants ==="
bcftools view -f PASS ${OUTPUT_PREFIX}_filtered.vcf.gz \
-Oz -o ${OUTPUT_PREFIX}_somatic.vcf.gz
bcftools index -t ${OUTPUT_PREFIX}_somatic.vcf.gz
echo "=== Step 7: Statistics ==="
bcftools stats ${OUTPUT_PREFIX}_somatic.vcf.gz > ${OUTPUT_PREFIX}_stats.txt
echo "=== Pipeline complete ==="
echo "Somatic variants: ${OUTPUT_PREFIX}_somatic.vcf.gz"
echo "Stats: ${OUTPUT_PREFIX}_stats.txt"
```
## Tumor-Only Mode
When matched normal is unavailable (e.g., archival FFPE, cell lines):
```bash
gatk Mutect2 \
-R reference.fa \
-I tumor.bam \
--germline-resource af-only-gnomad.vcf.gz \
--panel-of-normals pon.vcf.gz \
-O tumor_only.vcf.gz
```
Higher false positive rate without matched normal -- many germline variants will pass filters. The PoN and gnomAD germline resource become critical for artifact and germline removal respectively.
## Consensus Calling (Improved Accuracy)
Running multiple callers and requiring agreement improves both precision and recall:
```bash
# Run Mutect2, Strelka2, and MuSE independently, then intersect
# Majority voting (2/3 agreement) achieves F1 ~0.927 for SNVs
bcftools isec -n+2 -p consensus_dir \
mutect2_pass.vcf.gz strelka2_pass.vcf.gz muse_pass.vcf.gz
# For indels: Mutect2 + Strelka2 + VarScan2 with 2/3 agreement
```
Strict intersection (all agree) sacrifices too much recall; union includes too many false positives. Majority voting provides the best balance.
## Emerging: DeepSomatic
DeepSomatic extends DeepVariant's CNN approach to somatic calling with platform-specific models (Illumina, PacBio HiFi, ONT). Published Nature Biotechnology 2025, it achieves higher F1 than existing callers across all platforms and supports tumor-only and FFPE modes.
## Key Resources
| Resource | Purpose |
|----------|---------|
| gnomAD AF-only | Germline filtering |
| Panel of Normals | Technical artifact removal |
| COSMIC | Known cancer mutations |
| Funcotator data sources | Functional annotation |
## Quality Metrics
```bash
# Variant counts by filter status
bcftools query -f '%FILTER\n' filtered.vcf.gz | sort | uniq -c
# Ti/Tv ratio (expect ~2-3 for somatic)
bcftools stats filtered.vcf.gz | grep TSTV
# Variant allele frequency distribution
bcftools query -f '%AF\n' somatic_final.vcf.gz | \
awk '{print int($1*100)/100}' | sort -n | uniq -c
```
## Related Skills
- variant-calling/gatk-variant-calling - Germline variant calling
- variant-calling/filtering-best-practices - Filtering strategies
- variant-calling/variant-annotation - VEP/SnpEff annotation
- variant-calling/structural-variant-calling - Somatic SV detection (Manta tumor-normal mode)
- copy-number/cnvkit-analysis - Somatic CNV calling
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.