bio-ctdna-mutation-detection
$
npx mdskill add GPTomics/bioSkills/bio-ctdna-mutation-detectionDetects low-allele ctDNA mutations using VarDict and UMI suppression.
- Identifies somatic variants in plasma DNA at variant allele fractions above 0.5 percent.
- Integrates with VarDict, Ensembl VEP, SnpEff, pandas, and pysam for analysis.
- Requires targeted panels or whole-exome sequencing with depth exceeding 1000x.
- Outputs detected mutation calls for tracking specific variants in tumor samples.
SKILL.md
.github/skills/bio-ctdna-mutation-detectionView on GitHub ↗
---
name: bio-ctdna-mutation-detection
description: Detects somatic mutations in circulating tumor DNA using variant callers optimized for low allele fractions with UMI-based error suppression. Reliably detects mutations at VAF above 0.5 percent using consensus-based approaches. Use when identifying tumor mutations from plasma DNA or tracking specific variants.
tool_type: python
primary_tool: VarDict
---
## Version Compatibility
Reference examples tested with: Ensembl VEP 111+, SnpEff 5.2+, VarDict 1.8+, pandas 2.2+, pysam 0.22+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# ctDNA Mutation Detection
**"Detect mutations in my cfDNA sample"** → Identify somatic variants at low allele fractions (0.1-1%) from cell-free DNA using error-suppressed consensus calling and specialized callers.
- CLI: `vardict-java` for low-VAF variant calling from cfDNA
Detect somatic mutations in cfDNA at low variant allele fractions.
## Input Requirements
| Requirement | Specification |
|-------------|---------------|
| Data type | Targeted panel or WES (NOT sWGS) |
| Depth | >= 1000x for low VAF detection |
| UMIs | Highly recommended for < 1% VAF |
| Input | Preprocessed BAM (UMI consensus if available) |
## VAF Detection Limits
| VAF Range | Reliability | Notes |
|-----------|-------------|-------|
| > 1% | Reliable | Standard callers work |
| 0.5-1% | Good with UMIs | Requires error suppression |
| 0.1-0.5% | Challenging | Needs deep UMI consensus |
| < 0.1% | Unreliable | Near noise floor |
## VarDict for High Sensitivity (Ensembl VEP 111+)
```bash
# VarDict is highly sensitive for low VAF
# Use on UMI-consensus BAM for best results
vardict-java \
-G reference.fa \
-f 0.005 \ # Min VAF 0.5%
-N sample_id \
-b sample.bam \
-c 1 -S 2 -E 3 -g 4 \
regions.bed | \
teststrandbias.R | \
var2vcf_valid.pl \
-N sample_id \
-E \
-f 0.005 \
> sample.vcf
```
## Python Implementation
```python
import subprocess
import pandas as pd
import pysam
def call_variants_vardict(bam_file, reference, bed_file, output_vcf, min_vaf=0.005, min_depth=100):
'''
Call variants with VarDict.
Args:
bam_file: UMI-consensus BAM preferred
reference: Reference FASTA
bed_file: Target regions BED
output_vcf: Output VCF path
min_vaf: Minimum VAF (0.005 = 0.5%)
min_depth: Minimum read depth
'''
sample_id = bam_file.split('/')[-1].replace('.bam', '')
cmd = f'''
vardict-java \
-G {reference} \
-f {min_vaf} \
-N {sample_id} \
-b {bam_file} \
-c 1 -S 2 -E 3 -g 4 \
{bed_file} | \
teststrandbias.R | \
var2vcf_valid.pl \
-N {sample_id} \
-E \
-f {min_vaf} \
> {output_vcf}
'''
subprocess.run(cmd, shell=True, check=True)
return output_vcf
def filter_ctdna_variants(vcf_file, chip_genes=None):
'''
Filter ctDNA variants, removing CHIP.
CHIP genes commonly mutated in elderly:
DNMT3A, TET2, ASXL1, PPM1D, TP53, SF3B1, etc.
'''
if chip_genes is None:
chip_genes = ['DNMT3A', 'TET2', 'ASXL1', 'PPM1D', 'JAK2',
'SF3B1', 'SRSF2', 'TP53', 'CBL', 'BCOR']
import vcfpy
reader = vcfpy.Reader.from_path(vcf_file)
somatic = []
chip = []
for record in reader:
gene = record.INFO.get('GENE', [''])[0]
if gene in chip_genes:
chip.append(record)
else:
somatic.append(record)
print(f'Somatic variants: {len(somatic)}')
print(f'Potential CHIP variants: {len(chip)}')
return somatic, chip
```
## UMI-VarCal for Best Specificity (Ensembl VEP 111+)
```python
def call_with_umi_varcal(bam_file, reference, bed_file, output_vcf, min_vaf=0.005):
'''
UMI-VarCal: Best specificity with UMI data.
'''
subprocess.run([
'umi-varcal',
'--bam', bam_file,
'--ref', reference,
'--bed', bed_file,
'--out', output_vcf,
'--min-vaf', str(min_vaf),
'--min-alt-reads', '3',
'--min-depth', '100'
], check=True)
```
## Variant Annotation (Ensembl VEP 111+)
```python
def annotate_ctdna_variants(vcf_file, output_vcf):
'''Annotate variants with clinically relevant information.'''
# Use VEP or snpEff for annotation
subprocess.run([
'vep',
'--input_file', vcf_file,
'--output_file', output_vcf,
'--format', 'vcf',
'--vcf',
'--cache',
'--canonical',
'--protein',
'--sift', 'b',
'--polyphen', 'b',
'--af_gnomad'
], check=True)
```
## Tracking Known Mutations
**Goal:** Quantify the variant allele fraction of specific known mutations across serial liquid biopsy samples for minimal residual disease monitoring.
**Approach:** For each target mutation, pileup reads at the variant position, count reference and alternative alleles, and compute VAF with depth statistics.
```python
def track_specific_mutations(bam_file, mutations, min_depth=100):
'''
Track specific known mutations across samples.
Useful for MRD monitoring.
Args:
bam_file: Aligned BAM
mutations: List of (chrom, pos, ref, alt) tuples
'''
import pysam
bam = pysam.AlignmentFile(bam_file, 'rb')
results = []
for chrom, pos, ref, alt in mutations:
counts = {'ref': 0, 'alt': 0, 'other': 0}
for pileupcolumn in bam.pileup(chrom, pos-1, pos):
if pileupcolumn.pos != pos - 1:
continue
for read in pileupcolumn.pileups:
if read.is_del or read.is_refskip:
continue
base = read.alignment.query_sequence[read.query_position]
if base == ref:
counts['ref'] += 1
elif base == alt:
counts['alt'] += 1
else:
counts['other'] += 1
total = counts['ref'] + counts['alt'] + counts['other']
vaf = counts['alt'] / total if total > 0 else 0
results.append({
'chrom': chrom, 'pos': pos, 'ref': ref, 'alt': alt,
'depth': total, 'alt_count': counts['alt'], 'vaf': vaf
})
bam.close()
return pd.DataFrame(results)
```
## Related Skills
- cfdna-preprocessing - Preprocess with UMI consensus
- tumor-fraction-estimation - Estimate overall tumor burden
- longitudinal-monitoring - Track mutations over time
- variant-calling/variant-calling - General variant calling concepts
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.