bio-phasing-imputation-reference-panels
$ npx mdskill add GPTomics/bioSkills/bio-phasing-imputation-reference-panels
Download and manage reference panels for genotype phasing and imputation
- Solves the task of setting up reference panels for imputation workflows
- Uses CLI tools like bcftools and picard for data processing
- Selects appropriate panels like 1000 Genomes, HRC, and TOPMed for target populations
- Delivers prepared reference panels in standardized formats for downstream analysis
SKILL.md
.github/skills/bio-phasing-imputation-reference-panels
---
name: bio-phasing-imputation-reference-panels
description: Download, prepare, and manage reference panels for phasing and imputation. Covers 1000 Genomes, HRC, and TOPMed panels. Use when setting up imputation infrastructure or selecting appropriate reference panels for target populations.
tool_type: cli
primary_tool: bcftools
---
## Version Compatibility
Reference examples tested with: bcftools 1.19+, picard 3.1+
Before using code patterns, verify installed versions match. If versions differ:
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If a command fails with an unrecognized-option or usage error, check the installed
tool's `--help` output and adapt the example to the actual flags rather than retrying.
# Reference Panels
**"Set up reference panels for imputation"** → Download, prepare, and manage reference panels (1000 Genomes, HRC, TOPMed) for genotype phasing and imputation, including population subsetting and format conversion.
- CLI: `bcftools view -S panel_samples.txt` for subsetting, `bcftools norm` for normalization
## 1000 Genomes High-Coverage Panel (GRCh38)
```bash
# Download the phased 30x high-coverage release from IGSR
# (3,202 samples: the 2,504 Phase 3 samples plus 698 related individuals)
BASE_URL="http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_phased"
PREFIX="CCDG_14151_B01_GRM_WGS_2020-08-05"
for chr in {1..22}; do
    vcf="${PREFIX}_chr${chr}.filtered.shapeit2-duohmm-phased.vcf.gz"
    # -c resumes partial downloads; fetch the VCF and its tabix index
    wget -c "${BASE_URL}/${vcf}" "${BASE_URL}/${vcf}.tbi"
done
```
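Interrupted transfers are common with files of this size, so it can help to generate the URL list once and check which files are still missing. The sketch below reuses the base URL and file-name pattern from the download loop above; the helper names `panel_urls` and `missing_files` are hypothetical.

```bash
# Base URL and file-name pattern copied from the download loop above.
BASE_URL="http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_phased"

# Print the VCF and index URLs for chromosomes 1-22, one per line.
panel_urls() {
    chr=1
    while [ "$chr" -le 22 ]; do
        f="CCDG_14151_B01_GRM_WGS_2020-08-05_chr${chr}.filtered.shapeit2-duohmm-phased.vcf.gz"
        printf '%s/%s\n%s/%s.tbi\n' "$BASE_URL" "$f" "$BASE_URL" "$f"
        chr=$((chr + 1))
    done
}

# Print the expected local file names that are absent or empty
# in the current directory (candidates for re-download).
missing_files() {
    panel_urls | while read -r url; do
        f=${url##*/}
        [ -s "$f" ] || printf '%s\n' "$f"
    done
}
```

A possible use is `panel_urls > urls.txt` followed by `wget -c -i urls.txt`, then `missing_files` to confirm nothing was skipped. A non-empty file is not proof of a complete download, so a checksum or `bcftools index -s` check is still worthwhile.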
## Subset by Population
```bash
# Download sample metadata (tab-delimited columns: sample, pop, super_pop, gender)
wget -O samples.txt http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel
# Create population sample lists by matching the super_pop column exactly
awk '$3 == "EUR" {print $1}' samples.txt > european_samples.txt
awk '$3 == "AFR" {print $1}' samples.txt > african_samples.txt
awk '$3 == "EAS" {print $1}' samples.txt > east_asian_samples.txt
# Subset reference to specific population, then index the result
bcftools view -S european_samples.txt \
    1000GP.chr22.vcf.gz \
    -Oz -o 1000GP_EUR.chr22.vcf.gz
bcftools index -t 1000GP_EUR.chr22.vcf.gz
```
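When several populations are needed, the per-population lists can be written in one pass. This is a minimal sketch that assumes a whitespace-delimited metadata file with a header row and the columns sample, pop, super_pop; the function name `split_by_superpop` and the output naming are hypothetical.

```bash
# split_by_superpop PANEL OUTDIR
# Writes OUTDIR/<SUPERPOP>_samples.txt with one sample ID per line.
# Assumed layout: header row, then sample (col 1), pop (col 2), super_pop (col 3).
split_by_superpop() {
    panel=$1
    outdir=$2
    mkdir -p "$outdir"
    awk -v dir="$outdir" '
        NR > 1 && NF >= 3 { print $1 > (dir "/" $3 "_samples.txt") }
    ' "$panel"
}
```

Calling `split_by_superpop samples.txt lists` would produce files such as `lists/EUR_samples.txt`, each usable with `bcftools view -S`.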
## Convert to Beagle Format
```bash
# Beagle reads bgzipped VCF directly; keep biallelic SNPs only and
# assign each record an ID built from its position and alleles
bcftools view -m2 -M2 -v snps reference.vcf.gz | \
    bcftools annotate --set-id '%CHROM:%POS:%REF:%ALT' -Oz -o reference_beagle.vcf.gz
bcftools index -t reference_beagle.vcf.gz
```
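Because the IDs are built from position and alleles, any record that appears twice in the panel ends up with a repeated ID. A small check like the following can surface those before the panel is used; the function name `duplicate_ids` is hypothetical.

```bash
# duplicate_ids: read one variant ID per line on stdin and print each ID
# that occurs more than once (each duplicate is printed a single time).
duplicate_ids() {
    sort | uniq -d
}
```

One way to feed it is `bcftools query -f '%ID\n' reference_beagle.vcf.gz | duplicate_ids`. If anything is printed, `bcftools norm --rm-dup` is one option for dropping the repeats; confirm the accepted values with `bcftools norm --help` on the installed version.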
## Convert to IMPUTE5 Format
```bash
# IMPUTE5 uses its own format
imp5Converter \
--h reference.vcf.gz \
--r chr22 \
--o reference.chr22.imp5
```
## HRC Reference Panel
```bash
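# HRC is distributed through the European Genome-phenome Archive (EGA)
# and requires an approved data access request. Coordinates are GRCh37.
# After access is granted, download with pyega3:
pip install pyega3
pyega3 -cf credentials.json fetch EGAD00001002729
# The full HRC r1.1 panel contains 32,470 samples (mostly European);
# the EGA download can contain fewer because some cohorts restrict redistribution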
```
## TOPMed Reference Panel
```bash
# TOPMed is available through the TOPMed Imputation Server
# Or request individual-level data from dbGaP with appropriate access
# Use via the TOPMed Imputation Server:
# 1. Upload study VCF
# 2. Select "TOPMed r2" as reference
# 3. Download imputed results
```
## Genetic Maps
```bash
# Beagle format (GRCh38) - from Browning lab
wget https://faculty.washington.edu/browning/beagle/genetic_maps/plink.GRCh38.map.zip
unzip plink.GRCh38.map.zip -d genetic_maps/
# SHAPEIT5 format (recommended for SHAPEIT5)
wget https://github.com/odelaneau/shapeit5/raw/main/maps/genetic_maps.b38.tar.gz
tar xzf genetic_maps.b38.tar.gz
```
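A genetic map that is out of order or on the wrong build tends to fail late, inside the phasing run. The sketch below checks that one map file is sorted, assuming one chromosome per file and the four-column PLINK map layout (chromosome, ID, cM, base-pair position); `check_map_sorted` is a hypothetical helper name.

```bash
# check_map_sorted MAPFILE
# Returns 0 if the cM (column 3) and base-pair (column 4) values never
# decrease from one line to the next; otherwise reports the first
# offending line and returns 1.
check_map_sorted() {
    awk '
        NR > 1 && ($4 + 0 < prev_bp || $3 + 0 < prev_cm) {
            printf "line %d: position goes backwards\n", NR
            bad = 1
            exit
        }
        { prev_bp = $4 + 0; prev_cm = $3 + 0 }
        END { exit bad + 0 }
    ' "$1"
}
```

It could be run over each extracted file, for example `for f in genetic_maps/*.map; do check_map_sorted "$f" || echo "check $f"; done`. Files in other layouts, such as three-column maps with a header, would need the column numbers adjusted.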
## Check Reference Panel
```bash
# Basic stats
bcftools stats reference.vcf.gz | head -50
# Sample count
bcftools query -l reference.vcf.gz | wc -l
# Variant count
bcftools view -H reference.vcf.gz | wc -l
# Check chromosomes
bcftools index -s reference.vcf.gz
```
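A frequent cause of empty imputation output is a contig-naming mismatch between panel and study (`chr22` versus `22`). The following sketch classifies the naming style from a list of contig names; it reads the first column of each input line, which matches the tab-delimited output of `bcftools index -s`. The function name `contig_style` is hypothetical.

```bash
# contig_style: read contig names on stdin (first column of each line) and
# print "chr" if all start with "chr", "plain" if none do, "mixed" if both
# styles occur, or "empty" if there was no input.
contig_style() {
    awk '
        NF == 0 { next }
        $1 ~ /^chr/ { n_chr++; next }
        { n_plain++ }
        END {
            if (n_chr > 0 && n_plain > 0) print "mixed"
            else if (n_chr > 0) print "chr"
            else if (n_plain > 0) print "plain"
            else print "empty"
        }
    '
}
```

Comparing `bcftools index -s reference.vcf.gz | contig_style` with the same command on the study VCF shows whether renaming is needed; `bcftools annotate --rename-chrs` can then apply a mapping file.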
## Lift Over Reference Panel
```bash
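# GRCh37 to GRCh38
# Using Picard (R is the target-build FASTA and needs a .dict sequence dictionary)
java -jar picard.jar LiftoverVcf \
    I=reference_hg19.vcf.gz \
    O=reference_hg38.vcf.gz \
    CHAIN=hg19ToHg38.over.chain.gz \
    REJECT=rejected.vcf \
    R=hg38.fa
# Or using CrossMap with the same chain file
# (arguments: chain, input VCF, target FASTA, output VCF)
CrossMap.py vcf hg19ToHg38.over.chain.gz reference_hg19.vcf hg38.fa reference_hg38.vcf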
```
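After a liftover it is worth looking at why records were rejected, not just how many. The sketch below tallies the FILTER column of an uncompressed VCF, on the assumption that the reject file records the reason for each failed record there; `summarize_filters` is a hypothetical helper name.

```bash
# summarize_filters VCF
# Count records per FILTER value (column 7) in an uncompressed VCF and
# print "<filter><TAB><count>" lines sorted by filter name.
summarize_filters() {
    awk -F '\t' '
        !/^#/ { n[$7]++ }
        END { for (f in n) printf "%s\t%d\n", f, n[f] }
    ' "$1" | sort
}
```

Running `summarize_filters rejected.vcf` gives a quick breakdown. A large share of reference-allele mismatches, for instance, may point to a wrong chain file or FASTA rather than to genuinely unmappable sites.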
## Align Study to Reference
```bash
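# Collect statistics first to see how many sites disagree with the reference
bcftools +fixref study.vcf.gz -- -f reference.fa -m stats
# Flip REF/ALT and genotypes for non-ambiguous SNPs (A/T and C/G sites are ignored)
# Note: -i expects a dbSNP VCF and applies only to -m id, so it is not used here
bcftools +fixref study.vcf.gz -Oz -o study_fixed.vcf.gz -- \
    -f reference.fa \
    -m flip
# Re-run statistics on the fixed file to confirm the result
bcftools +fixref study_fixed.vcf.gz -- -f reference.fa -m stats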
```
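Strand-ambiguous SNPs (A/T and C/G) cannot be resolved by comparing alleles alone, so it helps to know how many the study contains before deciding whether to drop them. A minimal sketch, assuming tab- or space-delimited input with the columns CHROM, POS, REF, ALT; `ambiguous_snps` is a hypothetical helper name.

```bash
# ambiguous_snps: read "CHROM POS REF ALT" lines on stdin and print those
# whose two alleles are complementary (A/T, T/A, C/G, G/C).
ambiguous_snps() {
    awk '{
        pair = toupper($3) toupper($4)
        if (pair == "AT" || pair == "TA" || pair == "CG" || pair == "GC") print
    }'
}
```

For example, `bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\n' study.vcf.gz | ambiguous_snps | wc -l` counts the affected sites. Multiallelic records, where ALT contains a comma, are never reported by this check.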
## Filter Reference Panel
```bash
# Remove singletons (alternate allele count of 1; -c 2 keeps sites with AC >= 2)
bcftools view -c 2 reference.vcf.gz -Oz -o reference_no_singletons.vcf.gz
# Filter by MAF
bcftools view -q 0.001:minor reference.vcf.gz -Oz -o reference_maf001.vcf.gz
# Remove indels (SNPs only)
bcftools view -v snps reference.vcf.gz -Oz -o reference_snps.vcf.gz
```
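A frequency threshold means different things at different panel sizes, so it can be useful to translate it into an allele count. The sketch below computes the smallest count that reaches a given frequency, assuming diploid samples, no missing genotypes, and an inclusive threshold; `min_ac_for_maf` is a hypothetical helper name.

```bash
# min_ac_for_maf N_SAMPLES MAF
# Smallest allele count whose frequency is >= MAF among 2 * N_SAMPLES
# haplotypes (diploid samples, no missing genotypes assumed).
min_ac_for_maf() {
    awk -v n="$1" -v maf="$2" 'BEGIN {
        x = 2 * n * maf
        ac = int(x)
        if (x - ac > 1e-9) ac++   # round up, tolerating floating-point noise
        if (ac < 1) ac = 1
        print ac
    }'
}
```

For a 2,504-sample panel and a threshold of 0.001, the product is 2 × 2,504 × 0.001 = 5.008, so `min_ac_for_maf 2504 0.001` prints 6. On a panel of that size the MAF filter therefore removes considerably more than singletons.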
## Merge Custom Panel with 1000G
```bash
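# If you have additional reference samples (same genome build, bgzipped and indexed)
# 1. Keep only sites present in both panels; shared records are written to
#    isec_output/0000.vcf.gz (from the first file) and 0001.vcf.gz (from the second)
bcftools isec -n=2 -Oz \
    1000GP.chr22.vcf.gz \
    custom_reference.chr22.vcf.gz \
    -p isec_output
# 2. Merge the samples at the shared sites without creating multiallelic records
bcftools merge -m none \
    isec_output/0000.vcf.gz \
    isec_output/0001.vcf.gz \
    -Oz -o combined_reference.chr22.vcf.gz
bcftools index -t combined_reference.chr22.vcf.gz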
```
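`bcftools merge` stops on duplicate sample names unless `--force-samples` is given, so checking for overlap first avoids a failed run. A minimal sketch, assuming two plain-text files with one sample ID per line; `shared_samples` is a hypothetical helper name.

```bash
# shared_samples LIST_A LIST_B
# Print the sample IDs that occur in both files (one ID per line in each).
shared_samples() {
    a=$(mktemp) && b=$(mktemp) || return 1
    sort -u "$1" > "$a"
    sort -u "$2" > "$b"
    comm -12 "$a" "$b"    # column 3 only: lines common to both sorted lists
    rm -f "$a" "$b"
}
```

The input lists can come from `bcftools query -l 1000GP.chr22.vcf.gz > a.txt` and the same command on the custom panel. Any ID printed by `shared_samples a.txt b.txt` is either a true duplicate to drop from one panel or a naming collision to rename before merging.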
## Reference Panel Comparison
| Panel | Samples | Variants | Populations |
|-------|---------|----------|-------------|
| 1000G Phase 3 | 2,504 | 88M | 26 global |
| HRC r1.1 | 32,470 | 40M | European-heavy |
| TOPMed r2 | 97,256 | 308M | 60% European, diverse |
| UK10K | 3,781 | 42M | British |
## Related Skills
- phasing-imputation/haplotype-phasing - Use panels for phasing
- phasing-imputation/genotype-imputation - Use panels for imputation
- variant-calling/vcf-manipulation - VCF file operations