bio-consensus-sequences
$
npx mdskill add GPTomics/bioSkills/bio-consensus-sequencesApply VCF variants to reference FASTA for sample-specific genomes
- Creates haplotype-reconstructed sequences from genetic variant calls
- Depends on bcftools CLI and Python BioPython libraries
- Executes consensus generation by parsing VCF files against reference
- Outputs final FASTA sequences to standard output or files
SKILL.md
.github/skills/bio-consensus-sequencesView on GitHub ↗
---
name: bio-consensus-sequences
description: Generate consensus FASTA sequences by applying VCF variants to a reference using bcftools consensus. Use when creating sample-specific reference sequences or reconstructing haplotypes.
tool_type: cli
primary_tool: bcftools
---
## Version Compatibility
Reference examples tested with: BioPython 1.83+, bcftools 1.19+, bedtools 2.31+, minimap2 2.26+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Consensus Sequences
**"Generate a consensus sequence from my VCF"** → Apply called variants to a reference FASTA, producing a sample-specific genome with optional haplotype selection and low-coverage masking.
- CLI: `bcftools consensus -f reference.fa input.vcf.gz`
- Python: `cyvcf2` + `Bio.SeqIO` for simple SNP-only cases
## Basic Usage
### Generate Consensus
```bash
bcftools consensus -f reference.fa input.vcf.gz > consensus.fa
```
### Specify Sample
```bash
bcftools consensus -f reference.fa -s sample1 input.vcf.gz > sample1.fa
```
### Output to File
```bash
bcftools consensus -f reference.fa -o consensus.fa input.vcf.gz
```
## Haplotype Selection
### First Haplotype Only
```bash
bcftools consensus -f reference.fa -H 1 input.vcf.gz > haplotype1.fa
```
### Second Haplotype Only
```bash
bcftools consensus -f reference.fa -H 2 input.vcf.gz > haplotype2.fa
```
### Haplotype Options
| Option | Description |
|--------|-------------|
| `-H 1` | First haplotype |
| `-H 2` | Second haplotype |
| `-H A` | Apply all ALT alleles |
| `-H R` | Apply REF alleles where heterozygous |
| `-I` | Apply IUPAC ambiguity codes (separate flag) |
### Phasing Requirements
The `-H 1` and `-H 2` flags select haplotypes based on the phased genotype separator (`|`). With unphased genotypes (using `/`, e.g. `0/1`), the assignment of alleles to haplotype 1 vs 2 is arbitrary and does not reflect true chromosomal phase. Verify phasing status before haplotype extraction:
```bash
bcftools query -f '%CHROM\t%POS[\t%GT]\n' input.vcf.gz | head
```
Phased genotypes appear as `0|1` or `1|0`; unphased as `0/1`. Sources of phased genotypes:
- Read-backed phasing: WhatsHap, HapCUT2 (requires aligned reads)
- Trio phasing: Mendelian inheritance with parental genotypes
- Statistical phasing: SHAPEIT, Eagle (population-level, less accurate for rare variants)
- Long-read phasing: direct observation of haplotype blocks from PacBio/ONT reads
## IUPAC Codes for Heterozygous Sites
```bash
bcftools consensus -f reference.fa -I input.vcf.gz > consensus_iupac.fa
```
Heterozygous sites encoded with IUPAC ambiguity codes:
- A/G → R
- C/T → Y
- A/C → M
- G/T → K
- A/T → W
- C/G → S
## Missing Data Handling
### Mark Missing as N
```bash
bcftools consensus -f reference.fa -M N input.vcf.gz > consensus.fa
```
### Mark Low Coverage as N
Using a mask BED file:
```bash
# Create mask from depth
samtools depth input.bam | awk '$3<10 {print $1"\t"$2-1"\t"$2}' > low_coverage.bed
# Apply mask
bcftools consensus -f reference.fa -m low_coverage.bed input.vcf.gz > consensus.fa
```
### Mask Options
| Option | Description |
|--------|-------------|
| `-m FILE` | Mask regions in BED file with N |
| `-M CHAR` | Character for masked regions (default N) |
## Region Selection
### Specific Region
```bash
bcftools consensus -f reference.fa -r chr1:1000-2000 input.vcf.gz > region.fa
```
### Multiple Regions
Use with BED file to extract multiple regions.
## Chain Files
### Generate Chain File
```bash
bcftools consensus -f reference.fa -c chain.txt input.vcf.gz > consensus.fa
```
Chain files map coordinates between reference and consensus:
- Useful for liftover of annotations
- Required when indels change sequence length
### Chain File Format
```
chain score ref_name ref_size ref_strand ref_start ref_end query_name query_size query_strand query_start query_end id
```
## Sample-Specific Consensus
### For Each Sample
```bash
for sample in $(bcftools query -l input.vcf.gz); do
bcftools consensus -f reference.fa -s "$sample" input.vcf.gz > "${sample}.fa"
done
```
### Both Haplotypes
```bash
sample="sample1"
bcftools consensus -f reference.fa -s "$sample" -H 1 input.vcf.gz > "${sample}_hap1.fa"
bcftools consensus -f reference.fa -s "$sample" -H 2 input.vcf.gz > "${sample}_hap2.fa"
```
## VCF Normalization Before Consensus
Normalize the VCF before applying variants to the reference. Non-normalized indel representations (left-aligned vs right-aligned, or decomposed vs multi-allelic) can produce incorrect consensus sequences:
```bash
bcftools norm -f reference.fa input.vcf.gz | bcftools consensus -f reference.fa > consensus.fa
```
Normalization left-aligns indels and splits multi-allelic records, ensuring variant positions match the reference context exactly. Without normalization, overlapping or adjacent indels are more likely to conflict, and bcftools consensus may silently produce unexpected sequence at those sites despite logging warnings to stderr.
## Diploid Consensus Considerations
For diploid organisms, a single consensus sequence is inherently a simplification -- the organism carries two distinct haplotype sequences. The choice of representation depends on downstream use:
| Strategy | Flag | Best for |
|----------|------|----------|
| Both haplotypes separately | `-H 1`, `-H 2` | Phasing-aware analyses, allele-specific expression |
| IUPAC ambiguity codes | `-I` | Retaining heterozygosity information |
| All ALT alleles | `-H A` | Maximum divergence from reference |
| Majority/reference allele | `-H R` | Conservative consensus |
For phylogenetic applications, IUPAC codes can cause issues with some alignment and tree-building tools that do not handle ambiguity codes (or treat them as missing data). Using a single haplotype or applying only homozygous ALT alleles (`bcftools view -i 'GT="1/1" || GT="1|1"'`) produces cleaner input for tree inference.
## Filtering Before Consensus
### PASS Variants Only
```bash
bcftools view -f PASS input.vcf.gz | \
bcftools consensus -f reference.fa > consensus.fa
```
### High-Quality Variants Only
```bash
bcftools filter -i 'QUAL>=30 && INFO/DP>=10' input.vcf.gz | \
bcftools consensus -f reference.fa > consensus.fa
```
### SNPs Only
```bash
bcftools view -v snps input.vcf.gz | \
bcftools consensus -f reference.fa > consensus_snps.fa
```
## Sequence Naming
### Default Naming
Output uses reference sequence names.
### Custom Prefix
```bash
bcftools consensus -f reference.fa -p "sample1_" input.vcf.gz > consensus.fa
```
Sequences named: `sample1_chr1`, `sample1_chr2`, etc.
## Common Workflows
**Goal:** Generate consensus sequences for downstream analyses like phylogenetics, viral surveillance, or gene-level comparison.
**Approach:** Filter variants to high-quality calls, apply per-sample consensus generation, mask low-coverage regions with N, then combine for multi-sample workflows.
### Phylogenetic Analysis Preparation
```bash
# For each sample, generate consensus
mkdir -p consensus
for sample in $(bcftools query -l cohort.vcf.gz); do
bcftools view -s "$sample" cohort.vcf.gz | \
bcftools view -c 1 | \
bcftools consensus -f reference.fa > "consensus/${sample}.fa"
done
# Combine for alignment
cat consensus/*.fa > all_samples.fa
```
### Viral Genome Assembly
```bash
# Apply high-quality variants only
bcftools filter -i 'QUAL>=30 && INFO/DP>=20' variants.vcf.gz | \
bcftools view -f PASS | \
bcftools consensus -f reference.fa -M N > consensus.fa
```
### Gene-Specific Consensus
```bash
# Extract gene region
bcftools consensus -f reference.fa -r chr1:1000000-1010000 \
-s sample1 variants.vcf.gz > gene.fa
```
### Masked Low-Coverage Regions
```bash
# Create mask from coverage
samtools depth -a input.bam | \
awk '$3<5 {print $1"\t"$2-1"\t"$2}' | \
bedtools merge > low_coverage.bed
# Generate consensus with mask
bcftools consensus -f reference.fa -m low_coverage.bed \
variants.vcf.gz > consensus.fa
```
## Verify Consensus
### Check Differences
```bash
# Align consensus to reference
minimap2 -a reference.fa consensus.fa | samtools view -bS > alignment.bam
# Or simple comparison
diff <(grep -v "^>" reference.fa) <(grep -v "^>" consensus.fa) | head
```
### Count Changes
```bash
# Number of differences
bcftools view -H input.vcf.gz | wc -l
```
## Handling Overlapping Variants
bcftools consensus processes variants in coordinate order. When variants overlap (particularly indels whose reference alleles span the same positions), later variants may conflict with already-applied changes. bcftools consensus logs warnings to stderr but still produces output -- the result at conflicting sites may not reflect the intended genotype. Normalizing the VCF beforehand (see above) reduces but does not eliminate this issue.
Check for warnings:
```bash
bcftools consensus -f reference.fa input.vcf.gz 2>&1 | grep -i warn
```
If overlapping variant warnings appear, inspect the affected regions and consider filtering one of the conflicting records or resolving manually.
## cyvcf2 Consensus (Simple Cases)
### Manual Consensus Generation
```python
from cyvcf2 import VCF
from Bio import SeqIO
# Load reference
ref_dict = {rec.id: str(rec.seq) for rec in SeqIO.parse('reference.fa', 'fasta')}
# Apply variants (SNPs only, simplified)
vcf = VCF('input.vcf.gz')
changes = {}
for variant in vcf:
if variant.is_snp and len(variant.ALT) == 1:
chrom = variant.CHROM
pos = variant.POS - 1 # 0-based
if chrom not in changes:
changes[chrom] = {}
changes[chrom][pos] = variant.ALT[0]
# Apply changes
for chrom, positions in changes.items():
seq = list(ref_dict[chrom])
for pos, alt in positions.items():
seq[pos] = alt
ref_dict[chrom] = ''.join(seq)
# Write output
with open('consensus.fa', 'w') as f:
for chrom, seq in ref_dict.items():
f.write(f'>{chrom}\n{seq}\n')
```
Note: Use `bcftools consensus` for production - handles indels and edge cases properly.
## Quick Reference
| Task | Command |
|------|---------|
| Basic consensus | `bcftools consensus -f ref.fa in.vcf.gz` |
| Specific sample | `bcftools consensus -f ref.fa -s sample in.vcf.gz` |
| Haplotype 1 | `bcftools consensus -f ref.fa -H 1 in.vcf.gz` |
| IUPAC codes | `bcftools consensus -f ref.fa -I in.vcf.gz` |
| With mask | `bcftools consensus -f ref.fa -m mask.bed in.vcf.gz` |
| Generate chain | `bcftools consensus -f ref.fa -c chain.txt in.vcf.gz` |
| Specific region | `bcftools consensus -f ref.fa -r chr1:1-1000 in.vcf.gz` |
## Common Errors
| Error | Cause | Solution |
|-------|-------|----------|
| `not indexed` | VCF not indexed | Run `bcftools index` |
| `sequence not found` | Chromosome mismatch | Check chromosome names |
| `overlapping records` | Variants overlap | Usually OK, check warnings |
| `REF does not match` | Wrong reference | Use same reference as caller |
## Related Skills
- variant-calling/variant-calling - Generate VCF for consensus
- variant-calling/filtering-best-practices - Filter variants before consensus
- variant-calling/variant-normalization - Normalize indels first
- alignment-files/reference-operations - Reference manipulation
- phylogenetics/tree-inference - Tree building from consensus alignments
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.