bio-population-genetics-linkage-disequilibrium
$
npx mdskill add GPTomics/bioSkills/bio-population-genetics-linkage-disequilibriumCalculate LD statistics, prune variants, and visualize haplotype blocks.
- Computes r² and D' metrics to identify correlated genetic variants.
- Integrates PLINK2, scikit-allel, and LDBlockShow for analysis.
- Executes pruning algorithms based on user-defined r² thresholds.
- Outputs visualizations and statistical tables for population structure.
SKILL.md
.github/skills/bio-population-genetics-linkage-disequilibriumView on GitHub ↗
---
name: bio-population-genetics-linkage-disequilibrium
description: Calculate linkage disequilibrium statistics (r², D'), perform LD pruning for population structure analysis, identify haplotype blocks, and visualize LD patterns using PLINK, scikit-allel, and LDBlockShow. Use when calculating LD or pruning variants.
tool_type: mixed
primary_tool: plink2
---
## Version Compatibility
Reference examples tested with: matplotlib 3.8+, numpy 1.26+, pandas 2.2+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Linkage Disequilibrium
**"Calculate LD between my variants"** → Compute pairwise LD statistics (r², D'), prune correlated variants for independent sets, and identify haplotype blocks from genotype data.
- CLI: `plink2 --r2` for LD calculation, `--indep-pairwise` for pruning
- Python: `allel.rogers_huff_r()` for windowed LD in scikit-allel
Calculate LD statistics, prune correlated variants, and identify haplotype blocks.
## PLINK LD Calculations
### Pairwise r²
```bash
# All pairs within window
plink2 --bfile data --r2 --ld-window-kb 1000 --ld-window-r2 0.2 --out ld_results
# With SNP names in output
plink2 --bfile data --r2 inter-chr --ld-window-r2 0 --out all_pairs
# Squared correlation matrix
plink2 --bfile data --r2-phased square --out ld_matrix
```
### Output Format
```
# ld_results.ld contains:
CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
```
### PLINK 1.9 Options
```bash
# r² with D' statistics
plink --bfile data --r2 dprime --ld-window-kb 500 --out ld_with_dprime
# Inter-chromosome LD
plink --bfile data --r2 inter-chr --ld-snp-list target_snps.txt --out target_ld
```
## LD Pruning
### Standard Pruning
```bash
# Calculate pruning list
plink2 --bfile data --indep-pairwise 50 10 0.1 --out prune
# Output files:
# prune.prune.in - Variants to keep
# prune.prune.out - Variants to remove
# Extract pruned set
plink2 --bfile data --extract prune.prune.in --make-bed --out data_pruned
```
### Pruning Parameters
| Parameter | Description | Common Values |
|-----------|-------------|---------------|
| Window (50) | Variants per window | 50-200 |
| Step (10) | Variants to shift | 5-50 |
| r² threshold (0.1) | Max LD allowed | 0.1-0.5 |
### Use Cases
```bash
# Strict pruning for PCA/Admixture
plink2 --bfile data --indep-pairwise 50 10 0.1 --out strict_prune
# Moderate pruning for polygenic scores
plink2 --bfile data --indep-pairwise 200 50 0.5 --out moderate_prune
# Region-based pruning
plink2 --bfile data --indep-pairwise 50 10 0.2 --chr 6 --from-mb 25 --to-mb 35 --out mhc_prune
```
## scikit-allel LD
### Pairwise r²
```python
import allel
import numpy as np
callset = allel.read_vcf('data.vcf.gz')
gt = allel.GenotypeArray(callset['calldata/GT'])
pos = callset['variants/POS']
gn = gt.to_n_alt()
r2 = allel.rogers_huff_r(gn[:100]) ** 2
```
### LD Decay
**Goal:** Plot LD decay as a function of physical distance to characterize the extent of linkage in the population.
**Approach:** Compute pairwise r² values between nearby variants using scikit-allel, bin by physical distance, and plot mean r² per distance bin.
```python
import allel
import numpy as np
import matplotlib.pyplot as plt
gn = gt.to_n_alt()
r2, dist = [], []
n_variants = min(1000, gn.shape[0])
for i in range(n_variants):
for j in range(i + 1, min(i + 100, n_variants)):
r = allel.rogers_huff_r(gn[[i, j]])[0, 1] ** 2
d = pos[j] - pos[i]
r2.append(r)
dist.append(d)
r2 = np.array(r2)
dist = np.array(dist)
bins = np.arange(0, 100001, 1000)
bin_means = []
for i in range(len(bins) - 1):
mask = (dist >= bins[i]) & (dist < bins[i + 1])
if mask.sum() > 0:
bin_means.append(np.mean(r2[mask]))
else:
bin_means.append(np.nan)
plt.figure(figsize=(10, 6))
plt.plot(bins[:-1] / 1000, bin_means)
plt.xlabel('Distance (kb)')
plt.ylabel('Mean r²')
plt.title('LD Decay')
plt.savefig('ld_decay.png')
```
## Haplotype Blocks
### PLINK
```bash
# Identify haplotype blocks (Gabriel et al.)
plink --bfile data --blocks no-pheno-req --out blocks
# Output: blocks.blocks (block boundaries)
# Output: blocks.blocks.det (block details)
```
### Block Statistics
```python
import pandas as pd
blocks = pd.read_csv('blocks.blocks.det', sep='\s+')
print(f'Number of blocks: {len(blocks)}')
print(f'Mean block size: {blocks["KB"].mean():.1f} kb')
print(f'Mean SNPs per block: {blocks["NSNPS"].mean():.1f}')
```
## LD Matrix Visualization
```python
import allel
import numpy as np
import matplotlib.pyplot as plt
gn = gt.to_n_alt()[:200]
r = allel.rogers_huff_r(gn)
r2_matrix = r ** 2
plt.figure(figsize=(10, 10))
plt.imshow(r2_matrix, cmap='hot', vmin=0, vmax=1)
plt.colorbar(label='r²')
plt.xlabel('Variant index')
plt.ylabel('Variant index')
plt.title('LD Matrix')
plt.savefig('ld_matrix.png', dpi=150)
```
## LD-based Clumping (GWAS)
```bash
# Clump GWAS results by LD
plink --bfile data \
--clump gwas_results.txt \
--clump-p1 5e-8 \
--clump-p2 1e-5 \
--clump-r2 0.1 \
--clump-kb 250 \
--out clumped
# Output: clumped.clumped (independent signals)
```
### Clump Parameters
| Parameter | Description |
|-----------|-------------|
| --clump-p1 | Index SNP p-value threshold |
| --clump-p2 | Clumped SNP p-value threshold |
| --clump-r2 | LD threshold for clumping |
| --clump-kb | Physical distance threshold |
## vcftools LD
```bash
# Pairwise LD for region
vcftools --vcf data.vcf --geno-r2 --ld-window-bp 100000 --out ld_results
# Output: ld_results.geno.ld
# Haplotype-based r²
vcftools --vcf data.vcf --hap-r2 --ld-window-bp 100000 --out hap_ld
```
## Complete Workflow
```bash
# 1. Calculate genome-wide LD
plink2 --bfile data --r2 --ld-window-kb 500 --ld-window-r2 0.2 --out ld_genome
# 2. Generate pruned set for PCA
plink2 --bfile data --indep-pairwise 50 10 0.1 --out prune
plink2 --bfile data --extract prune.prune.in --make-bed --out pruned
# 3. Identify haplotype blocks
plink --bfile data --blocks no-pheno-req --out blocks
# 4. Visualize LD for specific region
plink --bfile data --r2 dprime --chr 6 --from-mb 28 --to-mb 34 --out mhc_ld
```
## Related Skills
- plink-basics - File format handling
- population-structure - Use pruned data for PCA
- association-testing - LD clumping for GWAS
- selection-statistics - LD affects EHH statistics
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.