bio-workflows-somatic-variant-pipeline

Name: bio-workflows-somatic-variant-pipeline
Author: GPTomics/bioSkills
$npx mdskill add GPTomics/bioSkills/bio-workflows-somatic-variant-pipeline
Calls somatic mutations from tumor-normal paired samples using Mutect2 or Strelka2
Solves the task of identifying cancer-specific genetic mutations from paired sequencing data
Uses GATK Mutect2, Strelka2, and tools like BQSR, Funcotator, and VEP for variant calling and annotation
Applies contamination filtering, orientation bias correction, and quality thresholds to prioritize true variants
Delivers annotated variant calls, TMB scores, and mutational signatures in standard formats like VCF
SKILL.md
.github/skills/bio-workflows-somatic-variant-pipelineView on GitHub ↗
---
name: bio-workflows-somatic-variant-pipeline
description: End-to-end somatic variant calling from tumor-normal paired samples using Mutect2 or Strelka2. Covers preprocessing, variant calling, filtering, and annotation for cancer genomics. Use when calling somatic mutations from tumor-normal pairs.
tool_type: cli
primary_tool: GATK Mutect2
---

## Version Compatibility

Reference examples tested with: CNVkit 0.9+, Ensembl VEP 111+, GATK 4.5+, SnpEff 5.2+, bcftools 1.19+, picard 3.1+

Before using code patterns, verify installed versions match. If versions differ:
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Somatic Variant Pipeline

**"Call somatic mutations from my tumor-normal pair"** → Orchestrate alignment, Mutect2 somatic calling, contamination filtering, variant annotation (Funcotator/VEP), TMB calculation, and mutational signature analysis.

Complete workflow for calling somatic mutations from tumor-normal paired samples.

## Pipeline Overview

```
Tumor BAM + Normal BAM
    │
    ├── Preprocessing (if needed)
    │   └── MarkDuplicates, BQSR
    │
    ├── Variant Calling
    │   ├── Mutect2 (GATK) - SNVs + indels
    │   └── Strelka2 - SNVs + indels (faster)
    │
    ├── Filtering
    │   ├── FilterMutectCalls
    │   ├── Contamination estimation
    │   └── Orientation bias filtering
    │
    ├── Annotation
    │   ├── Funcotator / VEP
    │   └── Cancer-specific databases
    │
    └── Output: Filtered somatic VCF
```

## Mutect2 Workflow (GATK)

### Step 1: Panel of Normals (Optional but Recommended)

```bash
# Create PON from multiple normal samples
for normal in normal1.bam normal2.bam normal3.bam; do
    sample=$(basename $normal .bam)
    gatk Mutect2 \
        -R reference.fa \
        -I $normal \
        --max-mnp-distance 0 \
        -O ${sample}.vcf.gz
done

# Combine into PON
gatk GenomicsDBImport \
    -R reference.fa \
    --genomicsdb-workspace-path pon_db \
    -V normal1.vcf.gz \
    -V normal2.vcf.gz \
    -V normal3.vcf.gz \
    -L intervals.bed

gatk CreateSomaticPanelOfNormals \
    -R reference.fa \
    -V gendb://pon_db \
    -O pon.vcf.gz
```

### Step 2: Call Somatic Variants

```bash
gatk Mutect2 \
    -R reference.fa \
    -I tumor.bam \
    -I normal.bam \
    -normal normal_sample_name \
    --germline-resource af-only-gnomad.vcf.gz \
    --panel-of-normals pon.vcf.gz \
    --f1r2-tar-gz f1r2.tar.gz \
    -O unfiltered.vcf.gz
```

### Step 3: Learn Orientation Bias

```bash
gatk LearnReadOrientationModel \
    -I f1r2.tar.gz \
    -O read-orientation-model.tar.gz
```

### Step 4: Calculate Contamination

```bash
gatk GetPileupSummaries \
    -I tumor.bam \
    -V small_exac_common.vcf.gz \
    -L small_exac_common.vcf.gz \
    -O tumor_pileups.table

gatk GetPileupSummaries \
    -I normal.bam \
    -V small_exac_common.vcf.gz \
    -L small_exac_common.vcf.gz \
    -O normal_pileups.table

gatk CalculateContamination \
    -I tumor_pileups.table \
    -matched normal_pileups.table \
    -O contamination.table \
    --tumor-segmentation segments.table
```

### Step 5: Filter Variants

```bash
gatk FilterMutectCalls \
    -R reference.fa \
    -V unfiltered.vcf.gz \
    --contamination-table contamination.table \
    --tumor-segmentation segments.table \
    --ob-priors read-orientation-model.tar.gz \
    -O filtered.vcf.gz

# Extract PASS variants
bcftools view -f PASS filtered.vcf.gz -Oz -o somatic_final.vcf.gz
```

## Strelka2 Workflow (Faster Alternative)

```bash
# Configure
configureStrelkaSomaticWorkflow.py \
    --normalBam normal.bam \
    --tumorBam tumor.bam \
    --referenceFasta reference.fa \
    --runDir strelka_run

# Execute
strelka_run/runWorkflow.py -m local -j 16

# Output files
# strelka_run/results/variants/somatic.snvs.vcf.gz
# strelka_run/results/variants/somatic.indels.vcf.gz

# Merge SNVs and indels
bcftools concat \
    strelka_run/results/variants/somatic.snvs.vcf.gz \
    strelka_run/results/variants/somatic.indels.vcf.gz \
    -a -Oz -o strelka_somatic.vcf.gz
```

## Annotation

### Funcotator (GATK)

```bash
gatk Funcotator \
    -R reference.fa \
    -V somatic_final.vcf.gz \
    -O annotated.vcf.gz \
    --output-file-format VCF \
    --data-sources-path funcotator_dataSources.v1.7 \
    --ref-version hg38
```

### VEP with Cancer Databases

```bash
vep -i somatic_final.vcf.gz -o annotated.vcf \
    --vcf --cache --offline \
    --assembly GRCh38 \
    --everything \
    --plugin CADD,cadd_scores.tsv.gz \
    --custom cosmic.vcf.gz,COSMIC,vcf,exact,0,CNT \
    --fork 4
```

## Complete Pipeline Script

```bash
#!/bin/bash
set -euo pipefail

TUMOR_BAM=$1
NORMAL_BAM=$2
NORMAL_NAME=$3
REFERENCE=$4
OUTPUT_PREFIX=$5
GNOMAD=$6
PON=$7
THREADS=16

echo "=== Step 1: Mutect2 calling ==="
gatk Mutect2 \
    -R $REFERENCE \
    -I $TUMOR_BAM \
    -I $NORMAL_BAM \
    -normal $NORMAL_NAME \
    --germline-resource $GNOMAD \
    --panel-of-normals $PON \
    --f1r2-tar-gz ${OUTPUT_PREFIX}_f1r2.tar.gz \
    --native-pair-hmm-threads $THREADS \
    -O ${OUTPUT_PREFIX}_unfiltered.vcf.gz

echo "=== Step 2: Learn orientation bias ==="
gatk LearnReadOrientationModel \
    -I ${OUTPUT_PREFIX}_f1r2.tar.gz \
    -O ${OUTPUT_PREFIX}_orientation.tar.gz

echo "=== Step 3: Pileup summaries ==="
gatk GetPileupSummaries \
    -I $TUMOR_BAM \
    -V $GNOMAD \
    -L $GNOMAD \
    -O ${OUTPUT_PREFIX}_tumor_pileups.table

gatk GetPileupSummaries \
    -I $NORMAL_BAM \
    -V $GNOMAD \
    -L $GNOMAD \
    -O ${OUTPUT_PREFIX}_normal_pileups.table

echo "=== Step 4: Calculate contamination ==="
gatk CalculateContamination \
    -I ${OUTPUT_PREFIX}_tumor_pileups.table \
    -matched ${OUTPUT_PREFIX}_normal_pileups.table \
    -O ${OUTPUT_PREFIX}_contamination.table \
    --tumor-segmentation ${OUTPUT_PREFIX}_segments.table

echo "=== Step 5: Filter variants ==="
gatk FilterMutectCalls \
    -R $REFERENCE \
    -V ${OUTPUT_PREFIX}_unfiltered.vcf.gz \
    --contamination-table ${OUTPUT_PREFIX}_contamination.table \
    --tumor-segmentation ${OUTPUT_PREFIX}_segments.table \
    --ob-priors ${OUTPUT_PREFIX}_orientation.tar.gz \
    -O ${OUTPUT_PREFIX}_filtered.vcf.gz

echo "=== Step 6: Extract PASS variants ==="
bcftools view -f PASS ${OUTPUT_PREFIX}_filtered.vcf.gz \
    -Oz -o ${OUTPUT_PREFIX}_somatic.vcf.gz
bcftools index -t ${OUTPUT_PREFIX}_somatic.vcf.gz

echo "=== Step 7: Statistics ==="
bcftools stats ${OUTPUT_PREFIX}_somatic.vcf.gz > ${OUTPUT_PREFIX}_stats.txt

echo "=== Pipeline complete ==="
echo "Somatic variants: ${OUTPUT_PREFIX}_somatic.vcf.gz"
echo "Stats: ${OUTPUT_PREFIX}_stats.txt"
```

## Tumor-Only Mode

When matched normal is unavailable (e.g., archival FFPE, cell lines):

```bash
gatk Mutect2 \
    -R reference.fa \
    -I tumor.bam \
    --germline-resource af-only-gnomad.vcf.gz \
    --panel-of-normals pon.vcf.gz \
    -O tumor_only.vcf.gz
```

Higher false positive rate without matched normal -- many germline variants will pass filters. The PoN and gnomAD germline resource become critical for artifact and germline removal respectively.

## Consensus Calling (Improved Accuracy)

Running multiple callers and requiring agreement improves both precision and recall:

```bash
# Run Mutect2, Strelka2, and MuSE independently, then intersect
# Majority voting (2/3 agreement) achieves F1 ~0.927 for SNVs
bcftools isec -n+2 -p consensus_dir \
    mutect2_pass.vcf.gz strelka2_pass.vcf.gz muse_pass.vcf.gz

# For indels: Mutect2 + Strelka2 + VarScan2 with 2/3 agreement
```

Strict intersection (all agree) sacrifices too much recall; union includes too many false positives. Majority voting provides the best balance.

## Emerging: DeepSomatic

DeepSomatic extends DeepVariant's CNN approach to somatic calling with platform-specific models (Illumina, PacBio HiFi, ONT). Published Nature Biotechnology 2025, it achieves higher F1 than existing callers across all platforms and supports tumor-only and FFPE modes.

## Key Resources

| Resource | Purpose |
|----------|---------|
| gnomAD AF-only | Germline filtering |
| Panel of Normals | Technical artifact removal |
| COSMIC | Known cancer mutations |
| Funcotator data sources | Functional annotation |

## Quality Metrics

```bash
# Variant counts by filter status
bcftools query -f '%FILTER\n' filtered.vcf.gz | sort | uniq -c

# Ti/Tv ratio (expect ~2-3 for somatic)
bcftools stats filtered.vcf.gz | grep TSTV

# Variant allele frequency distribution
bcftools query -f '%AF\n' somatic_final.vcf.gz | \
    awk '{print int($1*100)/100}' | sort -n | uniq -c
```

## Related Skills

- variant-calling/gatk-variant-calling - Germline variant calling
- variant-calling/filtering-best-practices - Filtering strategies
- variant-calling/variant-annotation - VEP/SnpEff annotation
- variant-calling/structural-variant-calling - Somatic SV detection (Manta tumor-normal mode)
- copy-number/cnvkit-analysis - Somatic CNV calling