bio-variant-calling
$
npx mdskill add GPTomics/bioSkills/bio-variant-callingCall SNPs and indels from BAM files using bcftools for variant detection
- Detect genetic variants from aligned sequencing reads in BAM format
- Uses bcftools mpileup and call for SNP and indel discovery
- Chooses calling parameters based on ploidy and organism type
- Outputs variant calls in VCF format for downstream analysis
SKILL.md
.github/skills/bio-variant-callingView on GitHub ↗
---
name: bio-variant-calling
description: Call SNPs and indels from aligned reads using bcftools mpileup and call. Use when detecting variants from BAM files or generating VCF from alignments.
tool_type: cli
primary_tool: bcftools
---
## Version Compatibility
Reference examples tested with: bcftools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Variant Calling with bcftools
Call SNPs and indels from aligned reads using bcftools mpileup/call.
## When to Use bcftools vs Other Callers
| Scenario | Recommended Caller | Rationale |
|----------|-------------------|-----------|
| Quick exploratory analysis | bcftools | Fast, minimal setup, good accuracy for SNPs |
| Non-model organisms | bcftools or FreeBayes | No training data required (unlike VQSR); adjustable ploidy |
| Haploid/bacterial genomes | bcftools with `--ploidy 1` | Native ploidy support; simpler model appropriate for haploids |
| Highest accuracy (human) | DeepVariant or GATK DRAGEN-mode | Deep learning or local assembly outperforms pileup-based calling |
| Cohort joint genotyping | GATK (GVCF workflow) | bcftools multi-sample calling does not scale well beyond ~100 samples |
| Somatic variants | Mutect2 or DeepSomatic | bcftools is not designed for low-VAF somatic detection |
bcftools mpileup/call uses a Bayesian statistical model on pileup data. It is fast and effective for germline SNPs, but its indel calling is less accurate than local-assembly callers (GATK HaplotypeCaller, DeepVariant) in homopolymer and low-complexity regions. For production human germline pipelines, prefer DeepVariant or GATK; use bcftools for rapid exploration, non-model organisms, or when computational resources are limited.
## Basic Workflow
```
BAM file + Reference FASTA
|
v
bcftools mpileup (generate pileup)
|
v
bcftools call (call variants)
|
v
VCF file
```
## bcftools mpileup + call
**Goal:** Detect SNPs and indels from aligned reads using the bcftools pileup-and-call pipeline.
**Approach:** Generate per-position pileup likelihoods with mpileup, then call genotypes with the multiallelic caller.
**"Call variants from my BAM file"** → Generate genotype likelihoods from aligned reads and identify variant sites using a Bayesian caller.
### Basic Variant Calling
```bash
bcftools mpileup -f reference.fa input.bam | bcftools call -mv -o variants.vcf
```
### Output Compressed VCF
```bash
bcftools mpileup -f reference.fa input.bam | bcftools call -mv -Oz -o variants.vcf.gz
bcftools index variants.vcf.gz
```
### Call Specific Region
```bash
bcftools mpileup -f reference.fa -r chr1:1000000-2000000 input.bam | \
bcftools call -mv -o region.vcf
```
### Call from Multiple BAMs
```bash
bcftools mpileup -f reference.fa sample1.bam sample2.bam sample3.bam | \
bcftools call -mv -o variants.vcf
```
### BAM List File
```bash
# bams.txt: one BAM path per line
bcftools mpileup -f reference.fa -b bams.txt | bcftools call -mv -o variants.vcf
```
## mpileup Options
**Goal:** Control pileup generation with quality thresholds, annotations, and region restrictions.
**Approach:** Set minimum mapping/base quality, request specific FORMAT/INFO tags, and restrict to target regions.
### Quality Filtering
```bash
bcftools mpileup -f reference.fa \
-q 20 \ # Min mapping quality
-Q 20 \ # Min base quality
input.bam | bcftools call -mv -o variants.vcf
```
### Annotate with Read Depth
```bash
bcftools mpileup -f reference.fa -a DP,AD input.bam | bcftools call -mv -o variants.vcf
```
### Full Annotation Set
```bash
bcftools mpileup -f reference.fa \
-a FORMAT/DP,FORMAT/AD,FORMAT/ADF,FORMAT/ADR,INFO/AD \
input.bam | bcftools call -mv -o variants.vcf
```
### Target Regions (BED)
```bash
bcftools mpileup -f reference.fa -R targets.bed input.bam | \
bcftools call -mv -o variants.vcf
```
### Max Depth
```bash
bcftools mpileup -f reference.fa -d 1000 input.bam | bcftools call -mv -o variants.vcf
```
## call Options
### Calling Models
| Flag | Model | Use Case |
|------|-------|----------|
| `-m` | Multiallelic caller | Default, recommended for all new analyses |
| `-c` | Consensus caller | Legacy; only for backward compatibility with older pipelines |
The multiallelic caller (`-m`) handles sites with multiple ALT alleles natively and is statistically superior. The consensus caller (`-c`) is retained only for reproducibility of legacy analyses.
### Output Variants Only
```bash
bcftools mpileup -f reference.fa input.bam | bcftools call -mv -o variants.vcf
# -v outputs variant sites only (not reference calls)
```
### Output All Sites
```bash
bcftools mpileup -f reference.fa input.bam | bcftools call -m -o all_sites.vcf
# Without -v, outputs all sites including reference
```
### Ploidy
Correct ploidy is critical -- calling a diploid organism as haploid halves heterozygous sensitivity; calling haploid as diploid produces false heterozygous calls.
```bash
# Haploid calling (bacteria, mitochondria, chrY in males)
bcftools mpileup -f reference.fa input.bam | bcftools call -m --ploidy 1 -o variants.vcf
# Ploidy file for mixed-ploidy organisms (e.g., sex chromosomes, polyploids)
# Format: CHROM FROM TO SEX PLOIDY
bcftools mpileup -f reference.fa input.bam | bcftools call -m --ploidy-file ploidy.txt -o variants.vcf
# Built-in ploidy presets
bcftools call -m --ploidy GRCh38 ... # Human with sex chromosome handling
```
### Prior Probability
```bash
# Adjust variant prior (default 1.1e-3 for human)
# Lower for highly inbred lines; higher for diverse/outbred populations
bcftools mpileup -f reference.fa input.bam | bcftools call -m -P 0.001 -o variants.vcf
```
## Common Pipelines
**Goal:** Run production-ready variant calling workflows for single-sample and multi-sample analyses.
**Approach:** Chain mpileup and call with quality filters, annotations, and compressed output, optionally parallelized by chromosome.
### Standard SNP/Indel Calling
```bash
bcftools mpileup -Ou -f reference.fa \
-q 20 -Q 20 \
-a FORMAT/DP,FORMAT/AD \
input.bam | \
bcftools call -mv -Oz -o variants.vcf.gz
bcftools index variants.vcf.gz
```
### Multi-sample Calling
```bash
bcftools mpileup -Ou -f reference.fa \
-a FORMAT/DP,FORMAT/AD \
sample1.bam sample2.bam sample3.bam | \
bcftools call -mv -Oz -o cohort.vcf.gz
bcftools index cohort.vcf.gz
```
### Calling with Regions
```bash
bcftools mpileup -Ou -f reference.fa \
-R targets.bed \
-a FORMAT/DP,FORMAT/AD \
input.bam | \
bcftools call -mv -Oz -o targets.vcf.gz
```
### Parallel by Chromosome
```bash
for chr in chr1 chr2 chr3; do
bcftools mpileup -Ou -f reference.fa -r "$chr" input.bam | \
bcftools call -mv -Oz -o "${chr}.vcf.gz" &
done
wait
# Concatenate results
bcftools concat -Oz -o all.vcf.gz chr*.vcf.gz
bcftools index all.vcf.gz
```
## Annotation Tags
### INFO Tags
| Tag | Description |
|-----|-------------|
| `DP` | Total read depth |
| `AD` | Allelic depths |
| `MQ` | Mapping quality |
| `FS` | Fisher strand bias |
| `SGB` | Segregation based metric |
### FORMAT Tags
| Tag | Description |
|-----|-------------|
| `GT` | Genotype |
| `DP` | Read depth per sample |
| `AD` | Allelic depths per sample |
| `ADF` | Forward strand allelic depths |
| `ADR` | Reverse strand allelic depths |
| `GQ` | Genotype quality |
| `PL` | Phred-scaled likelihoods |
### Request Specific Annotations
```bash
bcftools mpileup -f reference.fa \
-a FORMAT/DP,FORMAT/AD,FORMAT/SP,INFO/AD \
input.bam | bcftools call -mv -o variants.vcf
```
## Performance Options
**Goal:** Speed up variant calling for large datasets.
**Approach:** Use multi-threading and uncompressed BCF piping to reduce I/O overhead.
### Multi-threading
```bash
bcftools mpileup -f reference.fa --threads 4 input.bam | \
bcftools call -mv --threads 4 -o variants.vcf
```
### Uncompressed BCF for Speed
```bash
bcftools mpileup -Ou -f reference.fa input.bam | bcftools call -mv -Ou | \
bcftools filter -Oz -o filtered.vcf.gz
```
## Quick Reference
| Task | Command |
|------|---------|
| Basic calling | `bcftools mpileup -f ref.fa in.bam \| bcftools call -mv -o out.vcf` |
| With quality filter | `bcftools mpileup -f ref.fa -q 20 -Q 20 in.bam \| bcftools call -mv` |
| Region | `bcftools mpileup -f ref.fa -r chr1:1-1000 in.bam \| bcftools call -mv` |
| Multi-sample | `bcftools mpileup -f ref.fa s1.bam s2.bam \| bcftools call -mv` |
| With annotations | `bcftools mpileup -f ref.fa -a DP,AD in.bam \| bcftools call -mv` |
## Difficult Region Considerations
bcftools pileup-based calling is particularly vulnerable in:
- **Homopolymers**: Runs of identical bases cause indel false positives. Validate indel calls in homopolymers >6bp with a second caller or manual review.
- **Low-complexity regions**: ~35 Mb of GRCh38 is flagged as LCR. Consider excluding with `-T ^lcr.bed` or flagging calls in these regions.
- **Segmental duplications**: Reads from paralogs pile up, creating false SNPs. Check MQ values -- sites with mean MQ <40 suggest ambiguous mapping.
- **High-depth regions**: Set `-d` (max depth) to 3-4x the expected mean coverage. The default (250) may be too low for targeted panels or too high for low-coverage WGS.
## Common Errors
| Error | Cause | Solution |
|-------|-------|----------|
| `no FASTA reference` | Missing -f | Add `-f reference.fa` |
| `reference mismatch` | Wrong reference | Use same reference as alignment |
| `no variants called` | Low quality/depth | Lower quality thresholds or check BAM is not empty |
## Related Skills
- vcf-basics - View and query resulting VCF
- filtering-best-practices - Filter variants by quality
- variant-normalization - Normalize indels
- variant-calling/gatk-variant-calling - Higher accuracy with local assembly
- variant-calling/deepvariant - Deep learning alternative
- alignment-files/pileup-generation - Alternative pileup generation
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.