bio-pileup-generation
$
npx mdskill add GPTomics/bioSkills/bio-pileup-generationGenerate per-position read summaries for variant calling and analysis.
- Enables preparation of data for variant calling and allele frequency calculation.
- Integrates samtools mpileup and pysam for read data processing.
- Executes commands based on input BAM files and reference sequences.
- Outputs depth, bases, and quality metrics for each genomic position.
SKILL.md
.github/skills/bio-pileup-generationView on GitHub ↗
---
name: bio-pileup-generation
description: Generate pileup data for variant calling using samtools mpileup and pysam. Use when preparing data for variant calling, analyzing per-position read data, or calculating allele frequencies.
tool_type: cli
primary_tool: samtools
---
## Version Compatibility
Reference examples tested with: bcftools 1.19+, pysam 0.22+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Pileup Generation
Generate pileup data for variant calling and position-level analysis.
**"Generate pileup from BAM"** → Produce per-position read summaries showing depth, bases, and qualities.
- CLI: `samtools mpileup -f ref.fa input.bam`
- Python: `bam.pileup(chrom, start, end)` (pysam)
**"Count alleles at a position"** → Extract per-base read support at a specific genomic coordinate.
- Python: iterate `pileup_column.pileups` and count bases (pysam)
## What is Pileup?
Pileup shows all reads covering each position in the reference, used for:
- Variant calling (with bcftools)
- Coverage analysis
- Allele frequency calculation
- SNP/indel detection
## samtools mpileup vs bcftools mpileup (Deprecation)
`samtools mpileup -g/-u` (BCF output for variant calling) has been **deprecated since samtools 1.9** -- the genotype-likelihood code now lives in `bcftools mpileup`, which keeps mpileup logic versioned alongside `bcftools call` and avoids version-skew bugs.
| Use case | Recommended tool |
|----------|------------------|
| Quick allele counts at known sites | `samtools mpileup` or pysam pileup |
| Germline variant calling (small genomes, simple cohorts) | `bcftools mpileup` -> `bcftools call` |
| Germline WGS / WES production | DeepVariant or HaplotypeCaller (not mpileup) |
| Somatic SNV/indel | Mutect2 / VarDict / VarScan2 (direct from BAM) |
| Long-read small variants | clair3 / DeepVariant ONT (direct from BAM) |
| Long-read SV | Sniffles / cuteSV (direct from BAM) |
| Ultra-low-frequency (ctDNA / MRD) | fgbio consensus -> `bcftools call` or hot-spot Mutect2 |
| Per-position allele counts (custom) | pysam pileup |
`samtools mpileup` (without `-g`) is still the standard tool for human-readable per-position read summaries.
### Basic Pileup
```bash
samtools mpileup -f reference.fa input.bam > pileup.txt
```
### Pileup Specific Region
```bash
samtools mpileup -f reference.fa -r chr1:1000000-2000000 input.bam
```
### Regions from BED
```bash
samtools mpileup -f reference.fa -l targets.bed input.bam
```
### Multiple BAM Files
```bash
samtools mpileup -f reference.fa sample1.bam sample2.bam sample3.bam > pileup.txt
```
## Output Format
Text pileup format (6 columns per sample):
```
chr1 1000 A 15 ............... FFFFFFFFFFF
chr1 1001 T 12 ............ FFFFFFFFFFFF
```
| Column | Description |
|--------|-------------|
| 1 | Chromosome |
| 2 | Position (1-based) |
| 3 | Reference base |
| 4 | Read depth |
| 5 | Read bases |
| 6 | Base qualities |
### Read Bases Encoding
| Symbol | Meaning |
|--------|---------|
| `.` | Match on forward strand |
| `,` | Match on reverse strand |
| `ACGT` | Mismatch (uppercase = forward) |
| `acgt` | Mismatch (lowercase = reverse) |
| `^Q` | Start of read (Q = MAPQ as ASCII) |
| `$` | End of read |
| `+NNN` | Insertion of N bases |
| `-NNN` | Deletion of N bases |
| `*` | Deleted base |
| `>` / `<` | Reference skip (intron) |
## Quality Filtering Options
### Minimum Mapping Quality
```bash
samtools mpileup -f reference.fa -q 20 input.bam
```
### Minimum Base Quality
```bash
samtools mpileup -f reference.fa -Q 20 input.bam
```
### Combined Quality Filters
```bash
samtools mpileup -f reference.fa -q 20 -Q 20 input.bam
```
### Maximum Depth (Critical Trap)
```bash
# samtools mpileup default -d 8000 silently truncates targeted / mt-DNA / amplicon / UMI-deduped data
# bcftools mpileup default -d 250 is far lower; both must be set explicitly when piping
samtools mpileup -f reference.fa -d 0 input.bam # no cap
samtools mpileup -f reference.fa -d 1000000 input.bam # explicit high cap
# WRONG -- samtools 8000 cap, then bcftools 250 cap re-applied
samtools mpileup -f ref.fa in.bam | bcftools call -mv
# RIGHT -- single tool, explicit -d
bcftools mpileup -d 1000000 -f ref.fa in.bam | bcftools call -mv
```
## BAQ: Base Alignment Quality (Critical Default)
When `-f ref.fa` is passed, BAQ is enabled by default. BAQ Phred-scales the probability that a base is misaligned (HMM realignment over a small window) and reduces base quality near indels. Tradeoffs: ~30% slower; suppresses FP SNVs near indels; hurts indel detection sensitivity.
| Flag | Behavior |
|------|----------|
| (default with `-f`) | BAQ on (computed from CIGAR if MD missing) |
| `-B` / `--no-BAQ` | Disable BAQ -- raw qualities |
| `-E` / `--redo-BAQ` | Force recompute (after BQSR; if MD stale) |
**BAQ ON for:** short-read germline SNV (BWA, Bowtie2, HISAT2), short-read somatic SNV.
**BAQ OFF (`-B`) for:** long-read variant calling (ONT, PacBio HiFi), SV calling, RNA-seq near splice junctions, viral / amplicon, ultra-deep ctDNA from consensus reads (consensus quality already inflated), aDNA (qualities pre-rescaled by mapDamage).
`-A` (count anomalous read pairs / orphans) is required for amplicon -- amplicon reads are by design not properly paired.
`-aa` (output all positions, including zero-coverage) is required for ARTIC SARS-CoV-2 consensus generation.
### Library-Typed Flags Cheat Sheet
| Library | Flags |
|---------|-------|
| Short-read germline WGS (BWA) | `-q 20 -Q 20 -d 0` (BAQ on default) |
| Short-read tumor WGS | `-q 1 -Q 13 -d 0 -B` (low MAPQ kept; BAQ off) |
| Amplicon viral (ARTIC) | `-aa -A -d 600000 -B -Q 20` |
| Capture / exome | `-q 20 -Q 20 -d 250` |
| Long-read ONT R10.4+ | `-q 30 -Q 0 -B -d 0 --max-BQ 50` |
| PacBio HiFi | `-q 20 -Q 0 -B -d 0` |
| RNA-seq variants | `-q 20 -Q 20 -B -d 0` |
| Forensic / aDNA | `-q 0 -Q 0 -A -d 0 -B` |
## Variant Calling Pipeline (Modern: bcftools mpileup)
**Goal:** Call variants from alignment data using the pileup-based approach.
**Approach:** Use `bcftools mpileup` (not `samtools mpileup -g`) so genotype-likelihood code is co-versioned with `bcftools call`. Apply quality and depth caps explicitly; annotate FORMAT fields needed for downstream filtering.
### Modern Germline Calling
```bash
bcftools mpileup -f reference.fa -d 1000000 -q 20 -Q 20 \
--annotate FORMAT/AD,FORMAT/DP,FORMAT/SP,INFO/AD \
input.bam | \
bcftools call -mv -Oz -o variants.vcf.gz
bcftools index -t variants.vcf.gz
```
### Multi-Sample Joint Calling
```bash
bcftools mpileup -f reference.fa --threads 4 -d 250 -q 20 -Q 20 \
-a FORMAT/AD,FORMAT/DP s1.bam s2.bam s3.bam | \
bcftools call -mv --threads 4 -Oz -o joint.vcf.gz
```
For somatic / low-VAF, prefer Mutect2 / Strelka2 / DeepVariant -- materially better than mpileup-based callers.
### Overlap Detection Defaults
When fragment length < 2 * read_length, R1 and R2 overlap. `bcftools mpileup` detects overlap and counts once by default; `samtools mpileup` does not. Doubled support inflates somatic VAFs at sites covered by overlapping pairs (especially in cfDNA / FFPE). Prefer `bcftools mpileup` or use `samtools mpileup --ignore-overlaps`.
## pysam Python Alternative
### Basic Pileup
```python
import pysam
with pysam.AlignmentFile('input.bam', 'rb') as bam:
for pileup_column in bam.pileup('chr1', 1000000, 1001000):
print(f'{pileup_column.reference_name}:{pileup_column.pos} depth={pileup_column.n}')
```
### Access Reads at Position
```python
import pysam
with pysam.AlignmentFile('input.bam', 'rb') as bam:
for pileup_column in bam.pileup('chr1', 1000000, 1000001, truncate=True):
print(f'Position: {pileup_column.pos}')
print(f'Depth: {pileup_column.n}')
for pileup_read in pileup_column.pileups:
if pileup_read.is_del:
print(' Deletion')
elif pileup_read.is_refskip:
print(' Reference skip')
else:
qpos = pileup_read.query_position
base = pileup_read.alignment.query_sequence[qpos]
qual = pileup_read.alignment.query_qualities[qpos]
print(f' {base} (Q{qual})')
```
### Count Alleles at Position
```python
import pysam
from collections import Counter
def allele_counts(bam_path, chrom, pos):
counts = Counter()
with pysam.AlignmentFile(bam_path, 'rb') as bam:
for pileup_column in bam.pileup(chrom, pos, pos + 1, truncate=True):
if pileup_column.pos != pos:
continue
for pileup_read in pileup_column.pileups:
if pileup_read.is_del:
counts['DEL'] += 1
elif pileup_read.is_refskip:
continue
else:
qpos = pileup_read.query_position
base = pileup_read.alignment.query_sequence[qpos]
counts[base.upper()] += 1
return dict(counts)
counts = allele_counts('input.bam', 'chr1', 1000000)
print(counts) # {'A': 45, 'G': 5}
```
### Calculate Allele Frequency
```python
import pysam
from collections import Counter
def allele_frequency(bam_path, chrom, pos, min_qual=20):
counts = Counter()
with pysam.AlignmentFile(bam_path, 'rb') as bam:
for pileup_column in bam.pileup(chrom, pos, pos + 1, truncate=True,
min_base_quality=min_qual):
if pileup_column.pos != pos:
continue
for pileup_read in pileup_column.pileups:
if pileup_read.is_del or pileup_read.is_refskip:
continue
qpos = pileup_read.query_position
base = pileup_read.alignment.query_sequence[qpos]
counts[base.upper()] += 1
total = sum(counts.values())
if total == 0:
return {}
return {base: count / total for base, count in counts.items()}
freq = allele_frequency('input.bam', 'chr1', 1000000)
for base, f in sorted(freq.items(), key=lambda x: -x[1]):
print(f'{base}: {f:.1%}')
```
### Pileup with Quality Filtering
```python
import pysam
with pysam.AlignmentFile('input.bam', 'rb') as bam:
for pileup_column in bam.pileup('chr1', 1000000, 1001000,
truncate=True,
min_mapping_quality=20,
min_base_quality=20):
print(f'{pileup_column.pos}: {pileup_column.n}')
```
### Generate Pileup Text
```python
import pysam
def pileup_text(bam_path, ref_path, chrom, start, end):
with pysam.AlignmentFile(bam_path, 'rb') as bam:
with pysam.FastaFile(ref_path) as ref:
for pileup_column in bam.pileup(chrom, start, end, truncate=True):
pos = pileup_column.pos
ref_base = ref.fetch(chrom, pos, pos + 1)
depth = pileup_column.n
bases = []
for pileup_read in pileup_column.pileups:
if pileup_read.is_del:
bases.append('*')
elif pileup_read.is_refskip:
bases.append('>')
else:
qpos = pileup_read.query_position
base = pileup_read.alignment.query_sequence[qpos]
if base.upper() == ref_base.upper():
bases.append('.' if not pileup_read.alignment.is_reverse else ',')
else:
bases.append(base.upper() if not pileup_read.alignment.is_reverse else base.lower())
print(f'{chrom}\t{pos+1}\t{ref_base}\t{depth}\t{"".join(bases)}')
pileup_text('input.bam', 'reference.fa', 'chr1', 1000000, 1000100)
```
## Pileup Options Summary
| Option | Description | Common pitfall |
|--------|-------------|----------------|
| `-f FILE` | Reference FASTA | Triggers BAQ ON by default |
| `-r REGION` | Restrict to region | |
| `-l FILE` | BED file of regions | |
| `-q INT` | Min mapping quality | Aligner-dependent semantics |
| `-Q INT` | Min base quality | `-Q 0` with default overlap detection has subtle behavior |
| `-d INT` | Max depth | **Default 8000 silently truncates**; bcftools mpileup default is 250 |
| `-B` | Disable BAQ | Often correct for long reads, SV, viral, consensus |
| `-A` | Count anomalous pairs | Required for amplicon |
| `-aa` | Output all positions | Required for consensus generation |
| `--ignore-overlaps` | Disable mate-overlap correction | Rarely correct |
| `--max-BQ INT` | Cap BQ (default 60) | Useful for ONT (Q values inflated) |
| `-g` (DEPRECATED) | Old BCF output | Use `bcftools mpileup` instead |
## Quick Reference
| Task | Command |
|------|---------|
| Basic pileup | `samtools mpileup -f ref.fa in.bam` |
| Quality filter | `samtools mpileup -f ref.fa -q 20 -Q 20 in.bam` |
| Region | `samtools mpileup -f ref.fa -r chr1:1-1000 in.bam` |
| To bcftools | `bcftools mpileup -f ref.fa -d 1000000 in.bam \| bcftools call -mv` |
## Common Errors
| Error | Cause | Solution |
|-------|-------|----------|
| `No FASTA reference` | Missing -f option | Add `-f reference.fa` |
| `Reference mismatch` | Wrong reference | Use same reference as alignment |
| Out of memory | High coverage region | Use `-d` to cap depth |
## Related Skills
- alignment-filtering - Filter BAM before pileup
- reference-operations - Index reference for pileup; M5 cross-check
- bam-statistics - mosdepth, depth tool selection
- variant-calling/variant-calling - Full variant calling workflows
- variant-calling/vcf-basics - VCF/BCF I/O
- variant-calling/joint-calling - Multi-sample joint calling
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.