---
name: bio-population-genetics-plink-basics
description: PLINK file formats, format conversion, and quality control filtering for population genetics. Convert between VCF, BED/BIM/FAM, and PED/MAP formats, apply MAF, genotyping rate, and HWE filters using PLINK 1.9 and 2.0. Use when working with PLINK format files or running QC.
tool_type: cli
primary_tool: plink
---
## Version Compatibility
Reference examples tested with: pandas 2.2+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# PLINK Basics
**"Convert my VCF to PLINK format and run QC"** → Handle PLINK file format conversions (VCF, BED/BIM/FAM, PED/MAP) and apply standard genotype QC filters for MAF, genotyping rate, and HWE.
- CLI: `plink2 --vcf input.vcf --make-bed` for format conversion
- CLI: `plink2 --maf 0.01 --geno 0.05 --hwe 1e-6` for QC filtering
File formats, conversion, and quality control filtering with PLINK 1.9 and 2.0.
## File Formats
### Binary Format (Recommended)
| File | Contents |
|------|----------|
| `.bed` | Binary genotype data |
| `.bim` | Variant information (chr, ID, cM, pos, A1, A2) |
| `.fam` | Sample information (FID, IID, father, mother, sex, pheno) |
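
The column layouts above can be read with plain whitespace splitting. The following is a minimal sketch, assuming six whitespace-separated fields per line as listed in the table; the record classes, function names, and sample values are hypothetical and only for illustration.

```python
from typing import NamedTuple


class BimRecord(NamedTuple):
    chrom: str
    variant_id: str
    cm: float
    pos: int
    a1: str
    a2: str


class FamRecord(NamedTuple):
    fid: str
    iid: str
    father: str
    mother: str
    sex: int      # 1=male, 2=female, 0=unknown
    pheno: str    # kept as text: may be a case/control code or a quantitative value


def parse_bim_line(line: str) -> BimRecord:
    """Split one .bim line into its six documented columns."""
    chrom, variant_id, cm, pos, a1, a2 = line.split()
    return BimRecord(chrom, variant_id, float(cm), int(pos), a1, a2)


def parse_fam_line(line: str) -> FamRecord:
    """Split one .fam line into its six documented columns."""
    fid, iid, father, mother, sex, pheno = line.split()
    return FamRecord(fid, iid, father, mother, int(sex), pheno)


# Hypothetical example lines, not taken from a real dataset
bim = parse_bim_line("1\trs123\t0\t752566\tA\tG")
fam = parse_fam_line("FAM1 S1 0 0 2 -9")
print(bim.variant_id, bim.pos, fam.iid, fam.sex)
```

The `.bed` file is binary and is not covered by this sketch.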
### PLINK 2.0 Format
| File | Contents |
|------|----------|
| `.pgen` | Binary genotype data (compressed) |
| `.pvar` | Variant information |
| `.psam` | Sample information |
### Text Format (Legacy)
| File | Contents |
|------|----------|
| `.ped` | Genotypes (FID, IID, father, mother, sex, pheno, genotypes) |
| `.map` | Variant positions (chr, ID, cM, pos) |
## Format Conversion
### VCF to PLINK Binary
```bash
# PLINK 1.9
plink --vcf input.vcf.gz --make-bed --out output
# PLINK 2.0
plink2 --vcf input.vcf.gz --make-bed --out output
# With sample ID handling
plink2 --vcf input.vcf.gz --double-id --make-bed --out output
```
### PLINK Binary to VCF
```bash
# PLINK 1.9
plink --bfile input --recode vcf --out output
# PLINK 2.0
plink2 --bfile input --export vcf --out output
# Compressed VCF
plink2 --bfile input --export vcf bgz --out output
```
### PED/MAP to Binary (PLINK 1.9 Only)
```bash
# PLINK 1.9 (older PLINK 2.0 builds cannot read .ped/.map; newer builds may accept --pedmap, confirm with `plink2 --help pedmap`)
plink --file input --make-bed --out output
```
### Binary to PED/MAP
```bash
# PLINK 1.9
plink --bfile input --recode --out output
# PLINK 2.0
plink2 --bfile input --export ped --out output
```
### PLINK 1.9 to 2.0 Format
```bash
# Convert to PGEN format
plink2 --bfile input --make-pgen --out output
# Convert back to BED
plink2 --pfile input --make-bed --out output
```
## Quality Control Filtering
### MAF Filter (Minor Allele Frequency)
```bash
# PLINK 1.9: remove variants with MAF < 0.01
plink --bfile input --maf 0.01 --make-bed --out output
# PLINK 2.0: same threshold
plink2 --bfile input --maf 0.01 --make-bed --out output
# Stricter threshold: remove variants with MAF < 0.05
plink2 --bfile input --maf 0.05 --make-bed --out output
```
### Genotyping Rate Filters
```bash
# Per-variant missing rate (remove if >5% missing)
plink2 --bfile input --geno 0.05 --make-bed --out output
# Per-sample missing rate (remove if >5% missing)
plink2 --bfile input --mind 0.05 --make-bed --out output
```
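To make the two thresholds concrete, here is a small sketch of what the per-variant and per-sample missing rates measure. It uses a toy in-memory genotype matrix (hypothetical data, with `None` for a missing call) rather than real PLINK files, and the function names are illustrative only.

```python
# Rows are variants, columns are samples; None marks a missing genotype call.
# Toy data for illustration only.
genotypes = [
    [0, 1, None, 2],
    [0, None, None, 1],
    [2, 2, 1, 0],
]


def variant_missing_rates(matrix):
    """Fraction of samples with a missing call, per variant (what --geno thresholds)."""
    return [sum(g is None for g in row) / len(row) for row in matrix]


def sample_missing_rates(matrix):
    """Fraction of variants with a missing call, per sample (what --mind thresholds)."""
    n_variants = len(matrix)
    n_samples = len(matrix[0])
    return [
        sum(row[j] is None for row in matrix) / n_variants
        for j in range(n_samples)
    ]


vmiss = variant_missing_rates(genotypes)
smiss = sample_missing_rates(genotypes)

# Keep variants whose missing rate does not exceed the threshold
geno_threshold = 0.3
kept_variants = [i for i, rate in enumerate(vmiss) if rate <= geno_threshold]
print(vmiss, smiss, kept_variants)
```

With real data the same rates come from PLINK itself; the sketch only shows which axis each flag filters on.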
### Hardy-Weinberg Equilibrium Filter
```bash
# Remove variants with HWE p-value < 1e-6
plink2 --bfile input --hwe 1e-6 --make-bed --out output
# PLINK 1.9 tests only controls in case/control data by default;
# include-nonctrl also tests cases and samples with missing phenotype
plink --bfile input --hwe 1e-6 include-nonctrl --make-bed --out output
```
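As a rough illustration of what an HWE filter checks, the sketch below compares observed genotype counts with the counts expected from the allele frequency. Note that it uses a chi-square approximation with one degree of freedom, not the exact test that PLINK applies, so p-values will differ, especially at low counts. The function name and the example counts are hypothetical.

```python
import math


def hwe_chi2_pvalue(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square (1 df) p-value for departure from Hardy-Weinberg proportions.

    This is an approximation for illustration; it is NOT PLINK's exact test.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0                    # monomorphic: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square with 1 df: erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi2 / 2.0))


balanced = hwe_chi2_pvalue(25, 50, 25)   # matches HWE expectations
no_hets = hwe_chi2_pvalue(50, 0, 50)     # complete heterozygote deficit
print(balanced, no_hets < 1e-6)
```

A variant like the second example would fall below a `1e-6` threshold and be removed.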
### Combined QC Pipeline
```bash
# Standard QC filtering
plink2 --bfile input \
--maf 0.01 \
--geno 0.05 \
--mind 0.05 \
--hwe 1e-6 \
--make-bed --out qc_filtered
```
## Sample and Variant Selection
### Keep/Remove Samples
```bash
# Keep specific samples (samples.txt: FID IID per line)
plink2 --bfile input --keep samples.txt --make-bed --out output
# Remove specific samples
plink2 --bfile input --remove samples.txt --make-bed --out output
# Keep all samples in the listed families (families.txt: one FID per line)
plink2 --bfile input --keep-fam families.txt --make-bed --out output
```
### Extract/Exclude Variants
```bash
# Extract specific variants (variants.txt: variant IDs)
plink2 --bfile input --extract variants.txt --make-bed --out output
# Exclude specific variants
plink2 --bfile input --exclude variants.txt --make-bed --out output
# Extract by position range (--extract range expects a file of ranges, not a region string)
plink2 --bfile input --chr 1 --from-bp 1000000 --to-bp 2000000 --make-bed --out output
```
### Chromosome Selection
```bash
# Single chromosome
plink2 --bfile input --chr 22 --make-bed --out chr22
# Multiple chromosomes
plink2 --bfile input --chr 1-22 --make-bed --out autosomes
# Exclude X (23), Y (24), XY pseudo-autosomal (25), and MT (26)
plink2 --bfile input --not-chr 23,24,25,26 --make-bed --out autosomes
```
## Allele Frequency
```bash
# PLINK 1.9 (MAF-based)
plink --bfile input --freq --out output
# PLINK 2.0 (ALT allele frequency - not MAF!)
plink2 --bfile input --freq --out output
# PLINK 2.0 MAF: not reported directly; for biallelic variants derive it
# from ALT_FREQS in output.afreq as min(ALT_FREQS, 1 - ALT_FREQS)
```
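Because PLINK 2.0 reports ALT allele frequency, a minor allele frequency has to be derived when one is needed. The sketch below does this for biallelic variants. It assumes a tab-separated `.afreq` file with `ID` and `ALT_FREQS` columns; the exact header can vary between PLINK 2.0 builds, so check it against your own output. The inline text and function name are hypothetical.

```python
import csv
import io

# Hypothetical .afreq content; the header layout is an assumption
afreq_text = (
    "#CHROM\tID\tREF\tALT\tALT_FREQS\tOBS_CT\n"
    "1\trs1\tA\tG\t0.12\t200\n"
    "1\trs2\tC\tT\t0.93\t198\n"
)


def minor_allele_freqs(handle):
    """Map variant ID to MAF = min(f, 1 - f), where f is the ALT frequency.

    Only valid for biallelic variants; multiallelic rows carry
    comma-separated frequencies and are not handled here.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    result = {}
    for row in reader:
        alt_freq = float(row["ALT_FREQS"])
        result[row["ID"]] = min(alt_freq, 1.0 - alt_freq)
    return result


maf = minor_allele_freqs(io.StringIO(afreq_text))
print(maf)
```

For a real file, replace the `io.StringIO` wrapper with `open('output.afreq', newline='')`.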
## Missing Data Statistics
```bash
# Per-sample and per-variant missing rates
plink2 --bfile input --missing --out output
# Output files:
# output.smiss - sample missing rates
# output.vmiss - variant missing rates
```
## Sex Check
Verify that each sample's reported sex matches the sex inferred from X chromosome heterozygosity (the inbreeding coefficient F).
```bash
# PLINK 1.9
plink --bfile input --check-sex --out sex_check
# PLINK 2.0
plink2 --bfile input --split-par hg38 --check-sex --out sex_check
```
### Interpret Results
```python
import pandas as pd
# Raw string avoids an invalid-escape warning for the whitespace regex
sex = pd.read_csv('sex_check.sexcheck', sep=r'\s+')
problems = sex[sex['STATUS'] == 'PROBLEM']
print(f'Sex mismatches: {len(problems)}')
# F statistic: <0.2 = female, >0.8 = male, between = ambiguous
# PEDSEX: reported sex (1=male, 2=female, 0=unknown)
# SNPSEX: inferred sex (1=male, 2=female, 0=undetermined)
```
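The thresholds in the comments above can be written out as a small classifier. This is a sketch of the decision rule only, assuming the 0.2 and 0.8 cutoffs quoted above and the rule that a sample is OK when reported and inferred sex agree and are both known. The function names are hypothetical, and PLINK's own output remains the source of truth.

```python
def infer_sex_from_f(f_stat: float, female_max: float = 0.2, male_min: float = 0.8) -> int:
    """Return 1 (male), 2 (female), or 0 (undetermined) from the X-chromosome F statistic."""
    if f_stat < female_max:
        return 2
    if f_stat > male_min:
        return 1
    return 0


def sex_status(pedsex: int, f_stat: float) -> str:
    """Return 'OK' when reported and inferred sex agree and are known, else 'PROBLEM'."""
    snpsex = infer_sex_from_f(f_stat)
    return "OK" if pedsex != 0 and pedsex == snpsex else "PROBLEM"


# Hypothetical samples: (reported sex, F statistic)
examples = [(1, 0.95), (2, 0.95), (2, 0.01), (1, 0.5), (0, 0.01)]
statuses = [sex_status(pedsex, f) for pedsex, f in examples]
print(statuses)
```

Samples in the ambiguous band (F between the cutoffs) are flagged rather than assigned, which is why they are usually reviewed by hand before removal.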
### Update or Remove
```bash
# Update sex from check results
plink2 --bfile input --update-sex sex_check.sexcheck col-num=4 --make-bed --out updated
# Remove sex mismatches
awk '$5 == "PROBLEM" {print $1, $2}' sex_check.sexcheck > sex_problems.txt
plink2 --bfile input --remove sex_problems.txt --make-bed --out output
```
## Sample Information
### Update Phenotypes
```bash
# phenotypes.txt: FID IID pheno (1=control, 2=case, -9=missing)
plink2 --bfile input --pheno phenotypes.txt --make-bed --out output
# Quantitative phenotype: same flag; values outside the case/control codes are read as quantitative
plink2 --bfile input --pheno phenotypes.txt --make-bed --out output
```
### Update Sex
```bash
# sex.txt: FID IID sex (1=male, 2=female, 0=unknown)
plink2 --bfile input --update-sex sex.txt --make-bed --out output
```
### Update Sample IDs
```bash
# ids.txt: old_FID old_IID new_FID new_IID
plink2 --bfile input --update-ids ids.txt --make-bed --out output
```
## Merging Datasets
```bash
# Merge two datasets (PLINK 1.9)
plink --bfile data1 --bmerge data2.bed data2.bim data2.fam --make-bed --out merged
# Merge list of datasets
plink --bfile data1 --merge-list merge_list.txt --make-bed --out merged
# merge_list.txt contains: data2.bed data2.bim data2.fam (one set per line)
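# Handle strand flips: a failed merge writes the offending variant IDs
# to merged-merge.missnp (named after the --out prefix)
plink --bfile data1 --bmerge data2 --make-bed --out merged
# If the merge fails, flip those variants in data2 and merge again:
plink --bfile data2 --flip merged-merge.missnp --make-bed --out data2_flipped
plink --bfile data1 --bmerge data2_flipped --make-bed --out merged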
```
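Before merging, it can help to see why a variant would need flipping. The sketch below compares the allele pair of one variant across two datasets and labels it as a match, a strand flip, ambiguous (A/T or C/G, where strand cannot be resolved from alleles alone), or a plain mismatch. It handles single-base alleles only, and the function name and labels are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_alleles(ref_pair, other_pair):
    """Compare two (A1, A2) allele pairs for the same variant, ignoring allele order."""
    alleles = list(ref_pair) + list(other_pair)
    if any(a not in COMPLEMENT for a in alleles):
        return "unsupported"          # indels, missing codes, etc.
    ref = frozenset(ref_pair)
    other = frozenset(other_pair)
    flipped = frozenset(COMPLEMENT[a] for a in other_pair)
    if ref == other and ref == flipped:
        return "ambiguous"            # A/T or C/G: flipping gives the same pair
    if ref == other:
        return "match"
    if ref == flipped:
        return "flip"
    return "mismatch"


print(classify_alleles(("A", "G"), ("A", "G")))   # same strand
print(classify_alleles(("A", "G"), ("T", "C")))   # opposite strand
print(classify_alleles(("A", "T"), ("T", "A")))   # cannot tell
print(classify_alleles(("A", "G"), ("A", "C")))   # different variant or error
```

Variants labelled ambiguous or mismatch are usually excluded before merging rather than flipped.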
## Variant Information
### Set Variant IDs
```bash
# Set ID based on position
plink2 --bfile input --set-all-var-ids @:#:\$r:\$a --make-bed --out output
# Format: chr:pos:ref:alt
```
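The template characters map onto variant fields as `@` = chromosome, `#` = position, `$r` = reference allele, `$a` = alternate allele. The backslashes in the command only stop the shell from expanding `$r` and `$a`. A minimal sketch of the same substitution, useful for building matching IDs in a lookup table, follows; the function name and example values are hypothetical.

```python
def expand_var_id(template: str, chrom: str, pos: int, ref: str, alt: str) -> str:
    """Fill a --set-all-var-ids style template for a single biallelic variant."""
    return (
        template.replace("$r", ref)
        .replace("$a", alt)
        .replace("@", chrom)
        .replace("#", str(pos))
    )


# Hypothetical variant
variant_id = expand_var_id("@:#:$r:$a", "1", 752566, "A", "G")
print(variant_id)
```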
### Update Variant Names
```bash
# update.txt: old_id new_id
plink2 --bfile input --update-name update.txt --make-bed --out output
```
## PLINK 2.0 vs 1.9 Summary
| Feature | PLINK 2.0 | PLINK 1.9 |
|---------|-----------|-----------|
| Status | Current | Legacy |
| Command | `plink2` | `plink` |
| Format | `.pgen/.pvar/.psam` | `.bed/.bim/.fam` |
| Speed | Faster | Baseline |
| Memory | More efficient | Higher for large data |
| Export VCF | `--export vcf` | `--recode vcf` |
| Frequency output | ALT frequency | MAF |
| Missing output | `.smiss/.vmiss` | `.imiss/.lmiss` |
| PED/MAP support | No (convert via 1.9) | Yes (`--file`) |
## Related Skills
- association-testing - GWAS with filtered data
- population-structure - PCA after QC
- variant-calling/vcf-basics - VCF format before conversion