---
name: bio-phasing-imputation-genotype-imputation
description: Impute missing genotypes using reference panels with Beagle or Minimac4. Use when increasing variant density for GWAS, harmonizing data across genotyping platforms, or inferring variants not directly typed in array data.
tool_type: cli
primary_tool: beagle
---
## Version Compatibility
Reference examples tested with: bcftools 1.19+, pandas 2.2+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Genotype Imputation
**"Impute missing genotypes using a reference panel"** → Fill in untyped variants by leveraging LD patterns from a reference panel to increase variant density for GWAS or cross-platform harmonization.
- CLI: `java -jar beagle.jar gt=input.vcf ref=panel.vcf out=imputed`
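- CLI: `minimac4 --refHaps panel.m3vcf --haps input.vcf --prefix imputed` (Minimac4 1.x flags; confirm with `minimac4 --help`, since later releases changed the command-line interface)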
## Beagle Imputation
```bash
# Basic imputation
java -jar beagle.jar \
gt=study.vcf.gz \
ref=reference_panel.vcf.gz \
map=genetic_map.txt \
out=imputed
# Output: imputed.vcf.gz with imputed genotypes
```
## Beagle with Options
```bash
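# Comments sit above the command: bash treats anything after a
# line-continuation backslash as the end of the command.
#   gp=true      output genotype probabilities
#   ap=true      output allele probabilities
#   impute=true  perform imputation (default)
#   ne=20000     effective population size
java -Xmx32g -jar beagle.jar \
    gt=study.vcf.gz \
    ref=reference_panel.vcf.gz \
    map=genetic_map.txt \
    out=imputed \
    nthreads=8 \
    gp=true \
    ap=true \
    impute=true \
    ne=20000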
```
## Impute Per Chromosome
```bash
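for chr in {1..22}; do
    java -Xmx32g -jar beagle.jar \
        gt=study.chr${chr}.vcf.gz \
        ref=ref.chr${chr}.vcf.gz \
        map=genetic_maps/plink.chr${chr}.GRCh38.map \
        out=imputed.chr${chr} \
        gp=true \
        nthreads=8
done

# Concatenate in numeric chromosome order
# (a glob such as imputed.chr*.vcf.gz would sort chr10 before chr2)
files=()
for chr in {1..22}; do files+=("imputed.chr${chr}.vcf.gz"); done
bcftools concat "${files[@]}" -Oz -o imputed.all.vcf.gz
bcftools index imputed.all.vcf.gz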
```
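The per-chromosome loop can also be driven from Python when the commands need to be logged or submitted to a scheduler. The following is a minimal sketch, not part of the original skill: the helper names `beagle_command` and `concat_command` are hypothetical, and the file names simply mirror the ones used in the loop above. Listing the output files explicitly keeps chromosomes in numeric order, which a shell glob would not.

```python
from typing import List, Sequence

AUTOSOMES = range(1, 23)


def beagle_command(chrom: int, threads: int = 8, mem_gb: int = 32) -> List[str]:
    """Build the Beagle argument list for one autosome.

    File names follow the naming used in the bash loop; adjust to your layout.
    """
    return [
        "java", f"-Xmx{mem_gb}g", "-jar", "beagle.jar",
        f"gt=study.chr{chrom}.vcf.gz",
        f"ref=ref.chr{chrom}.vcf.gz",
        f"map=genetic_maps/plink.chr{chrom}.GRCh38.map",
        f"out=imputed.chr{chrom}",
        "gp=true",
        f"nthreads={threads}",
    ]


def concat_command(chroms: Sequence[int] = AUTOSOMES) -> List[str]:
    """Build a bcftools concat call with inputs in numeric chromosome order."""
    files = [f"imputed.chr{c}.vcf.gz" for c in chroms]
    return ["bcftools", "concat", *files, "-Oz", "-o", "imputed.all.vcf.gz"]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch has no side effects.
    print(" ".join(beagle_command(22)))
    print(" ".join(concat_command()))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the paths have been checked against the local file layout.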
## IMPUTE5 (Alternative)
```bash
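# IMPUTE5, the newer IMPUTE release; the study file passed to --g must already be phased.
# Confirm flags with `impute5 --help` for the installed version.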
impute5 \
--h reference.bcf \
--m genetic_map.txt \
--g study.vcf.gz \
--r chr22 \
--o imputed.chr22.vcf.gz \
--threads 8
```
## Minimac4 (Michigan Imputation Server)
```bash
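# Often used via the Michigan Imputation Server, but can run locally.
# Expects pre-phased study haplotypes; flags below follow the Minimac4 1.x interface.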
minimac4 \
--refHaps reference.m3vcf.gz \
--haps study.vcf.gz \
--prefix imputed \
--format GT,DS,GP \
--cpus 8
```
## Input Preparation
**Goal:** Prepare study genotypes for imputation by fixing strand orientation, filtering to overlapping sites, and pre-phasing.
**Approach:** Align alleles to the reference genome with fixref, intersect with reference panel sites, phase with Beagle, then impute against the full reference panel.
```bash
# 1. Align to reference (strand, allele order)
bcftools +fixref study.vcf.gz -Oz -o fixed.vcf.gz -- \
    -f reference.fa -m flip
bcftools index fixed.vcf.gz  # isec requires indexed inputs

# 2. Keep only sites that are also in the reference panel
bcftools isec -n=2 -w1 fixed.vcf.gz reference_sites.vcf.gz \
    -Oz -o study_overlap.vcf.gz

# 3. Phase first (if not already phased)
java -jar beagle.jar gt=study_overlap.vcf.gz out=phased

# 4. Then impute
java -jar beagle.jar gt=phased.vcf.gz ref=reference.vcf.gz out=imputed
```
## Extract Imputation Quality
```bash
# Imputation quality: Beagle writes INFO/DR2, Minimac4 writes INFO/R2 (swap the tag as needed)
bcftools query -f '%CHROM\t%POS\t%ID\t%INFO/DR2\n' imputed.vcf.gz > info_scores.txt
# Filter by quality
bcftools view -i 'INFO/DR2 > 0.3' imputed.vcf.gz -Oz -o imputed_filtered.vcf.gz
```
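Before settling on a cutoff, it can help to see how many variants survive it. The sketch below reads the four-column `info_scores.txt` layout produced by the `bcftools query` call above and reports the fraction of variants above a threshold. The function names are hypothetical. It assumes a missing tag is written as `.`, and for multiallelic records with comma-separated scores it takes the lowest value, which is a conservative choice rather than an established convention.

```python
import csv
import io


def read_info_scores(handle):
    """Yield (variant_key, score) pairs from a CHROM/POS/ID/DR2 tab-separated stream."""
    for chrom, pos, _variant_id, raw in csv.reader(handle, delimiter="\t"):
        if raw == ".":
            score = None  # tag absent for this record
        else:
            # Comma-separated values appear on multiallelic records; keep the lowest.
            score = min(float(value) for value in raw.split(","))
        yield f"{chrom}:{pos}", score


def fraction_passing(scores, threshold=0.3):
    """Fraction of variants whose score exceeds the threshold (missing counts as failing)."""
    scores = list(scores)
    if not scores:
        return 0.0
    passing = sum(1 for _, score in scores if score is not None and score > threshold)
    return passing / len(scores)


if __name__ == "__main__":
    # Tiny in-memory stand-in for info_scores.txt (made-up values).
    example = io.StringIO(
        "chr22\t100\trs1\t0.95\n"
        "chr22\t200\trs2\t0.25\n"
        "chr22\t300\t.\t0.50\n"
        "chr22\t400\trs4\t.\n"
    )
    print(fraction_passing(read_info_scores(example), threshold=0.3))
```

With a real file, replace the `io.StringIO` stand-in with `open("info_scores.txt", newline="")`.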
## Output Formats
| Field | Values | Description |
|-------|--------|-------------|
| GT | 0\|0, 0\|1, 1\|0, 1\|1 | Hard-called, phased genotype (FORMAT) |
| DS | 0.0-2.0 | Dosage: expected ALT allele count (FORMAT) |
| GP | three values, each 0.0-1.0 | Genotype probabilities for REF/REF, REF/ALT, ALT/ALT (FORMAT) |
| DR2/R2 | 0.0-1.0 | Imputation quality score (INFO; DR2 from Beagle, R2 from Minimac4) |
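The DS and GP fields are related: for a biallelic site, the dosage is the expected ALT allele count, i.e. P(het) + 2 × P(hom ALT). The short sketch below makes that relationship concrete; `dosage_from_gp` is a hypothetical helper name, and the input is assumed to be the three GP values in the order listed in the table.

```python
def dosage_from_gp(gp):
    """Expected ALT allele count from (P(hom REF), P(het), P(hom ALT))."""
    _p_hom_ref, p_het, p_hom_alt = gp
    return p_het + 2.0 * p_hom_alt


if __name__ == "__main__":
    # A confident heterozygote has dosage 1.0; an uncertain call falls in between.
    print(dosage_from_gp((0.0, 1.0, 0.0)))
    print(dosage_from_gp((0.0, 0.5, 0.5)))
```

An uncertain genotype therefore yields a fractional dosage rather than being forced to 0, 1, or 2, which is why dosages are usually preferred over hard calls for association testing.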
## Using Dosages for GWAS
```python
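import pandas as pd

# Extract dosages and sample IDs first:
# bcftools query -f '%CHROM\t%POS\t%ID[\t%DS]\n' imputed.vcf.gz > dosages.txt
# bcftools query -l imputed.vcf.gz > samples.txt

# bcftools query writes no header row, so supply the column names explicitly
samples = pd.read_csv('samples.txt', header=None)[0].astype(str).tolist()
dosages = pd.read_csv('dosages.txt', sep='\t', header=None,
                      names=['CHROM', 'POS', 'ID'] + samples, na_values='.')

# Dosage-based association accounts for imputation uncertainty;
# in PLINK2, read dosages with `--vcf ... dosage=DS`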
```
```bash
# PLINK2 with dosages
plink2 --vcf imputed.vcf.gz dosage=DS \
--glm \
--pheno phenotypes.txt \
--out gwas_results
```
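To illustrate what "dosage-based" means in practice, the sketch below regresses a quantitative phenotype on the per-sample dosage for one variant using ordinary least squares. It is a deliberately simplified illustration with made-up numbers and a hypothetical function name, not a reproduction of what `plink2 --glm` computes (no covariates, standard errors, or p-values).

```python
def dosage_slope(dosages, phenotypes):
    """OLS slope of phenotype on dosage, skipping samples with a missing dosage (None)."""
    pairs = [(d, y) for d, y in zip(dosages, phenotypes) if d is not None]
    if len(pairs) < 2:
        raise ValueError("need at least two samples with a dosage")
    mean_d = sum(d for d, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    sxx = sum((d - mean_d) ** 2 for d, _ in pairs)
    sxy = sum((d - mean_d) * (y - mean_y) for d, y in pairs)
    if sxx == 0:
        raise ValueError("dosage has no variance across samples")
    return sxy / sxx


if __name__ == "__main__":
    # Made-up example: phenotype rises by one unit per ALT allele.
    dosages = [0.0, 1.0, 2.0, 1.0, None]
    phenotypes = [1.0, 2.0, 3.0, 2.0, 5.0]
    print(dosage_slope(dosages, phenotypes))
```

Because the predictor is the continuous dosage rather than a hard call, poorly imputed genotypes contribute intermediate values instead of confidently wrong ones.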
## Quality Thresholds
| Analysis | Minimum INFO/R2 |
|----------|-----------------|
| GWAS discovery | 0.3 |
| GWAS fine-mapping | 0.8 |
| Meta-analysis | 0.5 |
| Polygenic scores | 0.9 |
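The thresholds in the table can be turned into a filter command programmatically. The sketch below is one possible way to do that: `MIN_R2` and `filter_command` are hypothetical names, the file names match the earlier filtering example, and the comparison uses `>=` because the table lists minimum values. Pass `tag="R2"` for Minimac4 output.

```python
from typing import List

# Minimum imputation quality per analysis, copied from the table above.
MIN_R2 = {
    "GWAS discovery": 0.3,
    "GWAS fine-mapping": 0.8,
    "Meta-analysis": 0.5,
    "Polygenic scores": 0.9,
}


def filter_command(analysis: str, tag: str = "DR2",
                   vcf: str = "imputed.vcf.gz",
                   out: str = "imputed_filtered.vcf.gz") -> List[str]:
    """Build a bcftools view call that keeps variants meeting the analysis threshold."""
    threshold = MIN_R2[analysis]  # raises KeyError for an unknown analysis name
    return ["bcftools", "view", "-i", f"INFO/{tag} >= {threshold}",
            vcf, "-Oz", "-o", out]


if __name__ == "__main__":
    print(filter_command("Meta-analysis"))
```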
## Key Parameters
| Beagle parameter | Value | Description |
|------------------|-------|-------------|
| gt | input VCF | Study genotypes |
| ref | reference VCF | Reference panel |
| map | genetic map | Recombination map (PLINK format) |
| gp | true/false | Output genotype probabilities |
| ne | 20000 | Effective population size (value used in the example above) |
| nthreads | N | CPU threads |
| window | 40 | Window length (cM) |
## Imputation Servers
For large-scale imputation, consider web-based servers:
- **Michigan Imputation Server**: imputationserver.sph.umich.edu
- **TOPMed Imputation Server**: imputation.biodatacatalyst.nhlbi.nih.gov
- **Sanger Imputation Server**: imputation.sanger.ac.uk
```bash
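# Prepare input for a server: most require one bgzipped VCF per chromosome.
# -r needs an indexed input (bcftools index study.vcf.gz) and contig names that
# match the file (chr1 vs 1).
for chr in {1..22}; do
    bcftools view -r chr${chr} study.vcf.gz -Oz -o study.chr${chr}.vcf.gz
done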
```
## Related Skills
- phasing-imputation/haplotype-phasing - Pre-phasing step
- phasing-imputation/reference-panels - Reference panel setup
- phasing-imputation/imputation-qc - Quality control
- population-genetics/association-testing - GWAS with imputed data