bio-variant-calling

$npx mdskill add GPTomics/bioSkills/bio-variant-calling

Call SNPs and indels from BAM files using bcftools for variant detection

  • Detect genetic variants from aligned sequencing reads in BAM format
  • Uses bcftools mpileup and call for SNP and indel discovery
  • Chooses calling parameters based on ploidy and organism type
  • Outputs variant calls in VCF format for downstream analysis
SKILL.md
.github/skills/bio-variant-callingView on GitHub ↗
---
name: bio-variant-calling
description: Call SNPs and indels from aligned reads using bcftools mpileup and call. Use when detecting variants from BAM files or generating VCF from alignments.
tool_type: cli
primary_tool: bcftools
---

## Version Compatibility

Reference examples tested with: bcftools 1.19+

Before using code patterns, verify installed versions match. If versions differ:
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Variant Calling with bcftools

Call SNPs and indels from aligned reads using bcftools mpileup/call.

## When to Use bcftools vs Other Callers

| Scenario | Recommended Caller | Rationale |
|----------|-------------------|-----------|
| Quick exploratory analysis | bcftools | Fast, minimal setup, good accuracy for SNPs |
| Non-model organisms | bcftools or FreeBayes | No training data required (unlike VQSR); adjustable ploidy |
| Haploid/bacterial genomes | bcftools with `--ploidy 1` | Native ploidy support; simpler model appropriate for haploids |
| Highest accuracy (human) | DeepVariant or GATK DRAGEN-mode | Deep learning or local assembly outperforms pileup-based calling |
| Cohort joint genotyping | GATK (GVCF workflow) | bcftools multi-sample calling does not scale well beyond ~100 samples |
| Somatic variants | Mutect2 or DeepSomatic | bcftools is not designed for low-VAF somatic detection |

bcftools mpileup/call uses a Bayesian statistical model on pileup data. It is fast and effective for germline SNPs, but its indel calling is less accurate than local-assembly callers (GATK HaplotypeCaller, DeepVariant) in homopolymer and low-complexity regions. For production human germline pipelines, prefer DeepVariant or GATK; use bcftools for rapid exploration, non-model organisms, or when computational resources are limited.

## Basic Workflow

```
BAM file + Reference FASTA
         |
         v
   bcftools mpileup (generate pileup)
         |
         v
   bcftools call (call variants)
         |
         v
   VCF file
```

## bcftools mpileup + call

**Goal:** Detect SNPs and indels from aligned reads using the bcftools pileup-and-call pipeline.

**Approach:** Generate per-position pileup likelihoods with mpileup, then call genotypes with the multiallelic caller.

**"Call variants from my BAM file"** → Generate genotype likelihoods from aligned reads and identify variant sites using a Bayesian caller.

### Basic Variant Calling
```bash
bcftools mpileup -f reference.fa input.bam | bcftools call -mv -o variants.vcf
```

### Output Compressed VCF
```bash
bcftools mpileup -f reference.fa input.bam | bcftools call -mv -Oz -o variants.vcf.gz
bcftools index variants.vcf.gz
```

### Call Specific Region
```bash
bcftools mpileup -f reference.fa -r chr1:1000000-2000000 input.bam | \
    bcftools call -mv -o region.vcf
```

### Call from Multiple BAMs
```bash
bcftools mpileup -f reference.fa sample1.bam sample2.bam sample3.bam | \
    bcftools call -mv -o variants.vcf
```

### BAM List File
```bash
# bams.txt: one BAM path per line
bcftools mpileup -f reference.fa -b bams.txt | bcftools call -mv -o variants.vcf
```

## mpileup Options

**Goal:** Control pileup generation with quality thresholds, annotations, and region restrictions.

**Approach:** Set minimum mapping/base quality, request specific FORMAT/INFO tags, and restrict to target regions.

### Quality Filtering
```bash
bcftools mpileup -f reference.fa \
    -q 20 \           # Min mapping quality
    -Q 20 \           # Min base quality
    input.bam | bcftools call -mv -o variants.vcf
```

### Annotate with Read Depth
```bash
bcftools mpileup -f reference.fa -a DP,AD input.bam | bcftools call -mv -o variants.vcf
```

### Full Annotation Set
```bash
bcftools mpileup -f reference.fa \
    -a FORMAT/DP,FORMAT/AD,FORMAT/ADF,FORMAT/ADR,INFO/AD \
    input.bam | bcftools call -mv -o variants.vcf
```

### Target Regions (BED)
```bash
bcftools mpileup -f reference.fa -R targets.bed input.bam | \
    bcftools call -mv -o variants.vcf
```

### Max Depth
```bash
bcftools mpileup -f reference.fa -d 1000 input.bam | bcftools call -mv -o variants.vcf
```

## call Options

### Calling Models

| Flag | Model | Use Case |
|------|-------|----------|
| `-m` | Multiallelic caller | Default, recommended for all new analyses |
| `-c` | Consensus caller | Legacy; only for backward compatibility with older pipelines |

The multiallelic caller (`-m`) handles sites with multiple ALT alleles natively and is statistically superior. The consensus caller (`-c`) is retained only for reproducibility of legacy analyses.

### Output Variants Only
```bash
bcftools mpileup -f reference.fa input.bam | bcftools call -mv -o variants.vcf
# -v outputs variant sites only (not reference calls)
```

### Output All Sites
```bash
bcftools mpileup -f reference.fa input.bam | bcftools call -m -o all_sites.vcf
# Without -v, outputs all sites including reference
```

### Ploidy

Correct ploidy is critical -- calling a diploid organism as haploid halves heterozygous sensitivity; calling haploid as diploid produces false heterozygous calls.

```bash
# Haploid calling (bacteria, mitochondria, chrY in males)
bcftools mpileup -f reference.fa input.bam | bcftools call -m --ploidy 1 -o variants.vcf

# Ploidy file for mixed-ploidy organisms (e.g., sex chromosomes, polyploids)
# Format: CHROM FROM TO SEX PLOIDY
bcftools mpileup -f reference.fa input.bam | bcftools call -m --ploidy-file ploidy.txt -o variants.vcf

# Built-in ploidy presets
bcftools call -m --ploidy GRCh38 ...  # Human with sex chromosome handling
```

### Prior Probability
```bash
# Adjust variant prior (default 1.1e-3 for human)
# Lower for highly inbred lines; higher for diverse/outbred populations
bcftools mpileup -f reference.fa input.bam | bcftools call -m -P 0.001 -o variants.vcf
```

## Common Pipelines

**Goal:** Run production-ready variant calling workflows for single-sample and multi-sample analyses.

**Approach:** Chain mpileup and call with quality filters, annotations, and compressed output, optionally parallelized by chromosome.

### Standard SNP/Indel Calling
```bash
bcftools mpileup -Ou -f reference.fa \
    -q 20 -Q 20 \
    -a FORMAT/DP,FORMAT/AD \
    input.bam | \
bcftools call -mv -Oz -o variants.vcf.gz

bcftools index variants.vcf.gz
```

### Multi-sample Calling
```bash
bcftools mpileup -Ou -f reference.fa \
    -a FORMAT/DP,FORMAT/AD \
    sample1.bam sample2.bam sample3.bam | \
bcftools call -mv -Oz -o cohort.vcf.gz

bcftools index cohort.vcf.gz
```

### Calling with Regions
```bash
bcftools mpileup -Ou -f reference.fa \
    -R targets.bed \
    -a FORMAT/DP,FORMAT/AD \
    input.bam | \
bcftools call -mv -Oz -o targets.vcf.gz
```

### Parallel by Chromosome
```bash
for chr in chr1 chr2 chr3; do
    bcftools mpileup -Ou -f reference.fa -r "$chr" input.bam | \
        bcftools call -mv -Oz -o "${chr}.vcf.gz" &
done
wait

# Concatenate results
bcftools concat -Oz -o all.vcf.gz chr*.vcf.gz
bcftools index all.vcf.gz
```

## Annotation Tags

### INFO Tags
| Tag | Description |
|-----|-------------|
| `DP` | Total read depth |
| `AD` | Allelic depths |
| `MQ` | Mapping quality |
| `FS` | Fisher strand bias |
| `SGB` | Segregation based metric |

### FORMAT Tags
| Tag | Description |
|-----|-------------|
| `GT` | Genotype |
| `DP` | Read depth per sample |
| `AD` | Allelic depths per sample |
| `ADF` | Forward strand allelic depths |
| `ADR` | Reverse strand allelic depths |
| `GQ` | Genotype quality |
| `PL` | Phred-scaled likelihoods |

### Request Specific Annotations
```bash
bcftools mpileup -f reference.fa \
    -a FORMAT/DP,FORMAT/AD,FORMAT/SP,INFO/AD \
    input.bam | bcftools call -mv -o variants.vcf
```

## Performance Options

**Goal:** Speed up variant calling for large datasets.

**Approach:** Use multi-threading and uncompressed BCF piping to reduce I/O overhead.

### Multi-threading
```bash
bcftools mpileup -f reference.fa --threads 4 input.bam | \
    bcftools call -mv --threads 4 -o variants.vcf
```

### Uncompressed BCF for Speed
```bash
bcftools mpileup -Ou -f reference.fa input.bam | bcftools call -mv -Ou | \
    bcftools filter -Oz -o filtered.vcf.gz
```

## Quick Reference

| Task | Command |
|------|---------|
| Basic calling | `bcftools mpileup -f ref.fa in.bam \| bcftools call -mv -o out.vcf` |
| With quality filter | `bcftools mpileup -f ref.fa -q 20 -Q 20 in.bam \| bcftools call -mv` |
| Region | `bcftools mpileup -f ref.fa -r chr1:1-1000 in.bam \| bcftools call -mv` |
| Multi-sample | `bcftools mpileup -f ref.fa s1.bam s2.bam \| bcftools call -mv` |
| With annotations | `bcftools mpileup -f ref.fa -a DP,AD in.bam \| bcftools call -mv` |

## Difficult Region Considerations

bcftools pileup-based calling is particularly vulnerable in:
- **Homopolymers**: Runs of identical bases cause indel false positives. Validate indel calls in homopolymers >6bp with a second caller or manual review.
- **Low-complexity regions**: ~35 Mb of GRCh38 is flagged as LCR. Consider excluding with `-T ^lcr.bed` or flagging calls in these regions.
- **Segmental duplications**: Reads from paralogs pile up, creating false SNPs. Check MQ values -- sites with mean MQ <40 suggest ambiguous mapping.
- **High-depth regions**: Set `-d` (max depth) to 3-4x the expected mean coverage. The default (250) may be too low for targeted panels or too high for low-coverage WGS.

## Common Errors

| Error | Cause | Solution |
|-------|-------|----------|
| `no FASTA reference` | Missing -f | Add `-f reference.fa` |
| `reference mismatch` | Wrong reference | Use same reference as alignment |
| `no variants called` | Low quality/depth | Lower quality thresholds or check BAM is not empty |

## Related Skills

- vcf-basics - View and query resulting VCF
- filtering-best-practices - Filter variants by quality
- variant-normalization - Normalize indels
- variant-calling/gatk-variant-calling - Higher accuracy with local assembly
- variant-calling/deepvariant - Deep learning alternative
- alignment-files/pileup-generation - Alternative pileup generation
More from GPTomics/bioSkills