bio-population-genetics-population-structure

$npx mdskill add GPTomics/bioSkills/bio-population-genetics-population-structure

Detect population stratification using PCA and ADMIXTURE analysis.

  • Identify genetic clusters and ancestry proportions in genotype data.
  • Integrates PLINK 2.0 for PCA and ADMIXTURE software for modeling.
  • Selects optimal K values based on admixture model performance metrics.
  • Outputs cluster assignments and ancestry proportions for downstream analysis.
SKILL.md
.github/skills/bio-population-genetics-population-structureView on GitHub ↗
---
name: bio-population-genetics-population-structure
description: Analyze population structure using PCA and admixture analysis with PLINK and ADMIXTURE. Identify population clusters, assess ancestry proportions, visualize genetic structure, and choose optimal K for admixture models. Use when analyzing population stratification with PCA or admixture.
tool_type: cli
primary_tool: plink2
---

## Version Compatibility

Reference examples tested with: matplotlib 3.8+, numpy 1.26+, pandas 2.2+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Population Structure

**"Analyze population structure in my genotype data"** → Detect population stratification using PCA of genotypes and estimate ancestry proportions with ADMIXTURE modeling.
- CLI: `plink2 --pca 20` for principal component analysis
- CLI: `admixture genotypes.bed K` for admixture proportions

Analyze genetic ancestry and population stratification using PCA and ADMIXTURE.

## Principal Component Analysis (PCA)

### PLINK 2.0 PCA

```bash
# Basic PCA (10 PCs)
plink2 --bfile data --pca 10 --out pca_results

# More PCs
plink2 --bfile data --pca 20 --out pca_results

# Approximate PCA (faster for large datasets)
plink2 --bfile data --pca 10 approx --out pca_results

# Output variant loadings
plink2 --bfile data --pca 10 var-wts --out pca_results
```

### Output Files

| File | Contents |
|------|----------|
| `.eigenvec` | PC scores per sample (FID, IID, PC1, PC2, ...) |
| `.eigenval` | Eigenvalues (variance explained) |
| `.eigenvec.var` | Variant loadings (if var-wts) |

### Variance Explained

```python
import numpy as np

eigenvalues = np.loadtxt('pca_results.eigenval')
variance_explained = eigenvalues / eigenvalues.sum() * 100
cumulative = np.cumsum(variance_explained)

for i, (ve, cum) in enumerate(zip(variance_explained, cumulative), 1):
    print(f'PC{i}: {ve:.2f}% (cumulative: {cum:.2f}%)')
```

### PCA Visualization

```python
import pandas as pd
import matplotlib.pyplot as plt

eigenvec = pd.read_csv('pca_results.eigenvec', sep='\s+', header=None)
eigenvec.columns = ['FID', 'IID'] + [f'PC{i}' for i in range(1, len(eigenvec.columns) - 1)]

pop_info = pd.read_csv('population_labels.txt', sep='\t')  # FID, IID, Population
eigenvec = eigenvec.merge(pop_info, on=['FID', 'IID'])

plt.figure(figsize=(10, 8))
for pop in eigenvec['Population'].unique():
    subset = eigenvec[eigenvec['Population'] == pop]
    plt.scatter(subset['PC1'], subset['PC2'], label=pop, s=20, alpha=0.7)

plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.savefig('pca_plot.png', dpi=150)
```

## LD Pruning (Before Admixture)

ADMIXTURE requires LD-pruned SNPs:

```bash
# Calculate LD and identify pruned set
plink2 --bfile data --indep-pairwise 50 10 0.1 --out prune

# Extract pruned variants
plink2 --bfile data --extract prune.prune.in --make-bed --out data_pruned
```

### Pruning Parameters

| Parameter | Description |
|-----------|-------------|
| Window (50) | SNPs in each window |
| Step (10) | SNPs to shift per step |
| r² threshold (0.1) | Max LD allowed |

## ADMIXTURE Analysis

### Basic Usage

```bash
# Run ADMIXTURE for K=3 clusters
admixture data_pruned.bed 3

# With cross-validation
admixture --cv data_pruned.bed 3

# Multithreaded
admixture -j4 data_pruned.bed 3
```

### Output Files

| File | Contents |
|------|----------|
| `.Q` | Ancestry proportions (samples × K) |
| `.P` | Allele frequencies per cluster |

### Testing Multiple K Values

```bash
# Run for K=2 through K=10
for K in $(seq 2 10); do
    admixture --cv -j4 data_pruned.bed $K 2>&1 | tee log${K}.out
done

# Extract CV errors
grep -h "CV" log*.out | awk '{print NR+1, $4}' > cv_errors.txt
```

### Choose Optimal K

```python
import matplotlib.pyplot as plt

cv_errors = []
with open('cv_errors.txt') as f:
    for line in f:
        k, cv = line.strip().split()
        cv_errors.append((int(k), float(cv)))

ks, cvs = zip(*cv_errors)
plt.figure(figsize=(8, 5))
plt.plot(ks, cvs, 'o-')
plt.xlabel('K')
plt.ylabel('Cross-validation error')
plt.title('Admixture CV Error')
plt.savefig('admixture_cv.png', dpi=150)

optimal_k = ks[cvs.index(min(cvs))]
print(f'Optimal K: {optimal_k}')
```

### Visualize Admixture

```python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

K = 3
Q = pd.read_csv(f'data_pruned.{K}.Q', sep='\s+', header=None)
fam = pd.read_csv('data_pruned.fam', sep='\s+', header=None)
Q.columns = [f'Cluster{i}' for i in range(1, K + 1)]
Q['IID'] = fam[1].values

pop_info = pd.read_csv('population_labels.txt', sep='\t')
Q = Q.merge(pop_info, on='IID')
Q = Q.sort_values('Population')

colors = plt.cm.Set1(range(K))
fig, ax = plt.subplots(figsize=(14, 4))

bottom = np.zeros(len(Q))
for i in range(K):
    ax.bar(range(len(Q)), Q[f'Cluster{i+1}'], bottom=bottom, color=colors[i], width=1)
    bottom += Q[f'Cluster{i+1}'].values

ax.set_xlim(0, len(Q))
ax.set_ylim(0, 1)
ax.set_ylabel('Ancestry proportion')
plt.savefig('admixture_barplot.png', dpi=150, bbox_inches='tight')
```

## FlashPCA2 (Fast PCA for Large Datasets)

FlashPCA2 is optimized for very large datasets (100,000+ samples). Uses randomized algorithms for speed.

### Installation

```bash
# From conda
conda install -c bioconda flashpca

# Or download binaries from GitHub
# https://github.com/gabraham/flashpca
```

### Basic Usage

```bash
# Standard PCA
flashpca2 --bfile data --ndim 10 --outpc pcs.txt --outvec loadings.txt --outval eigenvalues.txt

# --ndim 10: Number of PCs to compute
# --outpc: Principal components output
# --outvec: Eigenvectors (variant loadings)
# --outval: Eigenvalues
```

### FlashPCA2 Options

| Option | Description |
|--------|-------------|
| --bfile | PLINK binary prefix |
| --ndim | Number of PCs (default 10) |
| --outpc | PC scores output file |
| --outvec | Eigenvectors output |
| --outval | Eigenvalues output |
| --numthreads | CPU threads to use |
| --mem | Memory limit (GB) |
| --seed | Random seed for reproducibility |

### Large Dataset Settings

```bash
# For biobank-scale data (>100k samples)
# numthreads=16: Adjust to available cores.
# mem=64: Memory in GB. Increase for larger datasets.
flashpca2 \
    --bfile large_data \
    --ndim 20 \
    --numthreads 16 \
    --mem 64 \
    --outpc pcs.txt \
    --outval eigenvalues.txt \
    --seed 42
```

### FlashPCA2 vs PLINK2

| Feature | FlashPCA2 | PLINK2 |
|---------|-----------|--------|
| Speed (100k samples) | Faster | Good |
| Memory efficiency | Better | Good |
| Randomized algorithm | Yes | Optional (approx) |
| Part of standard toolkit | No | Yes |

Use FlashPCA2 for biobank-scale data; PLINK2 sufficient for most studies.

### Parse FlashPCA2 Output

```python
import pandas as pd

# Load PCs
pcs = pd.read_csv('pcs.txt', sep='\t', header=None)
pcs.columns = ['FID', 'IID'] + [f'PC{i}' for i in range(1, len(pcs.columns) - 1)]

# Load eigenvalues
eigenvals = pd.read_csv('eigenvalues.txt', header=None)[0].values
var_explained = eigenvals / eigenvals.sum() * 100

print('Variance explained:')
for i, ve in enumerate(var_explained[:10], 1):
    print(f'  PC{i}: {ve:.2f}%')
```

## MDS (Alternative to PCA)

```bash
# PLINK 1.9 MDS
plink --bfile data --cluster --mds-plot 10 --out mds_results

# Output: mds_results.mds (sample coordinates)
```

## Kinship/Relatedness

### PLINK 2.0 KING-robust

```bash
# Calculate kinship matrix
plink2 --bfile data --make-king-table --out kinship

# Output: kinship.kin0 (pairs with kinship > 0.0442)
```

### Identify Related Individuals

```python
import pandas as pd

kin = pd.read_csv('kinship.kin0', sep='\t')
related = kin[kin['KINSHIP'] > 0.0884]  # First-degree relatives
print(f'Related pairs (1st degree): {len(related)}')

related = kin[kin['KINSHIP'] > 0.0442]  # Second-degree
print(f'Related pairs (2nd degree): {len(related)}')
```

### Remove Related Individuals

```bash
# Create list to remove (keep one per pair)
plink2 --bfile data --king-cutoff 0.0884 --out unrelated

# Filter to unrelated
plink2 --bfile data --keep unrelated.king.cutoff.in.id --make-bed --out unrelated
```

## Complete Workflow

**Goal:** Analyze population structure from raw genotypes through PCA and admixture modeling with optimal K selection.

**Approach:** Apply QC filters, LD-prune for independent SNPs, run PCA for visual stratification assessment, then fit ADMIXTURE models across multiple K values and select the best fit by cross-validation error.

```bash
# 1. QC filtering
plink2 --bfile raw --maf 0.01 --geno 0.05 --hwe 1e-6 --make-bed --out qc

# 2. LD pruning
plink2 --bfile qc --indep-pairwise 50 10 0.1 --out prune
plink2 --bfile qc --extract prune.prune.in --make-bed --out pruned

# 3. PCA
plink2 --bfile pruned --pca 20 --out pca

# 4. Admixture (multiple K)
for K in 2 3 4 5 6; do
    admixture --cv -j4 pruned.bed $K 2>&1 | tee log${K}.out
done
```

## Related Skills

- plink-basics - Data preparation and QC
- linkage-disequilibrium - LD pruning details
- association-testing - Use PCs as covariates
- ecological-genomics/landscape-genomics - Population structure correction for GEA
More from GPTomics/bioSkills