bio-phasing-imputation-haplotype-phasing
$
npx mdskill add GPTomics/bioSkills/bio-phasing-imputation-haplotype-phasingPhases genotypes into haplotypes using Beagle or SHAPEIT for downstream analysis
- Resolves inherited alleles for imputation, HLA typing, or population genetics
- Uses Beagle 5.4 or SHAPEIT as primary phasing tools
- Applies genetic maps to improve phasing accuracy when available
- Produces phased VCF files for use in subsequent genomic workflows
SKILL.md
.github/skills/bio-phasing-imputation-haplotype-phasingView on GitHub ↗
---
name: bio-phasing-imputation-haplotype-phasing
description: Phase genotypes into haplotypes using Beagle or SHAPEIT. Resolves which alleles are inherited together on each chromosome. Use when preparing VCF files for imputation, HLA typing, or population genetic analyses requiring phased haplotypes.
tool_type: cli
primary_tool: beagle
---
## Version Compatibility
Reference examples tested with: bcftools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Haplotype Phasing
**"Phase my genotypes into haplotypes"** → Resolve which alleles are inherited together on each chromosome for downstream imputation, HLA typing, or population genetic analyses.
- CLI: `java -jar beagle.jar gt=input.vcf out=phased` (Beagle 5.4)
- CLI: `shapeit4 --input input.vcf --output phased.vcf` (SHAPEIT)
## Beagle 5.4 Phasing (Recommended)
```bash
# Download Beagle 5.4
wget https://faculty.washington.edu/browning/beagle/beagle.22Jul22.46e.jar
# Basic phasing
java -jar beagle.22Jul22.46e.jar \
gt=input.vcf.gz \
out=phased
# Output: phased.vcf.gz (phased genotypes)
# With genetic map (improves accuracy)
java -jar beagle.22Jul22.46e.jar \
gt=input.vcf.gz \
map=plink.chr22.GRCh38.map \
out=phased
```
## Beagle Options
```bash
java -jar beagle.22Jul22.46e.jar \
gt=input.vcf.gz \
out=phased \
map=genetic_map.txt \
nthreads=8 \
window=40 \
overlap=4 \
ne=20000 \ # Effective population size
seed=12345 # For reproducibility
```
## Phase Per Chromosome
```bash
# Process each chromosome separately
for chr in {1..22}; do
java -Xmx16g -jar beagle.jar \
gt=input.chr${chr}.vcf.gz \
map=genetic_maps/plink.chr${chr}.GRCh38.map \
out=phased.chr${chr} \
nthreads=8
done
# Concatenate chromosomes
bcftools concat phased.chr*.vcf.gz -Oz -o phased.all.vcf.gz
bcftools index phased.all.vcf.gz
```
## SHAPEIT5 Phasing (for Large Datasets)
**Goal:** Phase large biobank-scale datasets using SHAPEIT5's two-stage approach for accurate haplotype resolution of both common and rare variants.
**Approach:** First phase common variants to build a haplotype scaffold, then phase rare variants onto that scaffold using the common-variant structure as a guide.
```bash
# Phase common variants first
shapeit5_phase_common \
--input input.vcf.gz \
--map genetic_map.txt \
--output phased_common.bcf \
--thread 8 \
--log phased.log
# Then phase rare variants
shapeit5_phase_rare \
--input input.vcf.gz \
--scaffold phased_common.bcf \
--map genetic_map.txt \
--output phased.bcf \
--thread 8
```
## SHAPEIT5 with Reference Panel
```bash
# Improves phasing using reference haplotypes
shapeit5_phase_common \
--input input.vcf.gz \
--reference reference_panel.bcf \
--map genetic_map.txt \
--output phased.bcf \
--thread 8
```
## Beagle with Reference Panel
```bash
# Use reference panel for better phasing
java -jar beagle.22Jul22.46e.jar \
gt=input.vcf.gz \
ref=reference.vcf.gz \
map=genetic_map.txt \
out=phased \
nthreads=8
```
## Input Preparation
```bash
# Filter variants before phasing
bcftools view -m2 -M2 -v snps input.vcf.gz -Oz -o biallelic_snps.vcf.gz
# Remove missing genotypes (optional)
bcftools view -g ^miss biallelic_snps.vcf.gz -Oz -o no_missing.vcf.gz
# Normalize (important!)
bcftools norm -f reference.fa -Oz -o normalized.vcf.gz input.vcf.gz
```
## Check Phasing Results
```bash
# View phased genotypes (| instead of /)
bcftools query -f '%CHROM\t%POS\t[%GT\t]\n' phased.vcf.gz | head
# Unphased: 0/1
# Phased: 0|1 or 1|0
# Count phased vs unphased
bcftools query -f '[%GT\n]' phased.vcf.gz | grep -c '|'
```
## Genetic Maps
```bash
# Download genetic maps (GRCh38)
wget https://faculty.washington.edu/browning/beagle/genetic_maps/plink.GRCh38.map.zip
unzip plink.GRCh38.map.zip
# Format: chromosome position rate(cM/Mb) genetic_position(cM)
# chr1 55550 2.981822 0.000000
```
## Key Parameters
| Parameter | Beagle | SHAPEIT5 | Description |
|-----------|--------|----------|-------------|
| Threads | nthreads | --thread | CPU threads |
| Window | window | --window | Analysis window size |
| Eff. pop size | ne | --effective-size | For LD modeling |
| Seed | seed | --seed | Random seed |
## Memory Requirements
| Dataset Size | Beagle Memory | SHAPEIT5 Memory |
|--------------|--------------|-----------------|
| 1,000 samples | 8 GB | 4 GB |
| 10,000 samples | 32 GB | 16 GB |
| 100,000 samples | 64+ GB | 32 GB |
## Phasing Accuracy Metrics
- **Switch error rate**: Rate of phase switches vs truth
- **Mismatch error rate**: Overall haplotype differences
- Measure using trio data or known haplotypes
## Related Skills
- phasing-imputation/genotype-imputation - Impute after phasing
- phasing-imputation/reference-panels - Get reference data
- variant-calling/filtering-best-practices - Prepare input VCF
- population-genetics/linkage-disequilibrium - LD analysis
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.