bio-population-genetics-association-testing

Name: bio-population-genetics-association-testing
Author: GPTomics/bioSkills

$npx mdskill add GPTomics/bioSkills/bio-population-genetics-association-testing

Run GWAS association tests with PLINK 2.0 and visualize results.

Executes logistic and linear regression for case-control and quantitative traits.
Depends on PLINK 2.0 CLI and requires matplotlib, numpy, pandas, and scipy.
Selects analysis type based on phenotype file structure and user intent.
Outputs Manhattan and QQ plots alongside statistical association results.

SKILL.md

.github/skills/bio-population-genetics-association-testingView on GitHub ↗

---
name: bio-population-genetics-association-testing
description: Genome-wide association studies (GWAS) with PLINK. Perform case-control and quantitative trait association testing using logistic/linear regression with covariates, generate Manhattan and QQ plots for result visualization. Use when running GWAS or association tests.
tool_type: cli
primary_tool: plink2
---

## Version Compatibility

Reference examples tested with: matplotlib 3.8+, numpy 1.26+, pandas 2.2+, scipy 1.12+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Association Testing

**"Run a GWAS on my genotyping data"** → Perform genome-wide association testing using logistic (case-control) or linear (quantitative) regression with covariates, then visualize results with Manhattan and QQ plots.
- CLI: `plink2 --glm` for association testing with covariates

GWAS analysis using PLINK 2.0's unified `--glm` command for case-control and quantitative traits.

## PLINK 2.0 Association Testing

### Basic Case-Control (Binary Phenotype)

```bash
# Basic logistic regression
plink2 --bfile data --glm --out results

# With phenotype file
plink2 --bfile data --pheno pheno.txt --glm --out results
```

### Quantitative Trait (Continuous Phenotype)

```bash
# Linear regression for quantitative traits
plink2 --bfile data --pheno pheno.txt --glm --out results
```

### With Covariates

```bash
# Include covariates (sex, age, PCs)
plink2 --bfile data \
    --pheno pheno.txt \
    --covar covariates.txt \
    --glm --out results

# Specify which covariates to use
plink2 --bfile data \
    --pheno pheno.txt \
    --covar covariates.txt \
    --covar-name PC1,PC2,PC3,age,sex \
    --glm --out results
```

## Covariate Files

### Phenotype File Format

```
# pheno.txt: FID IID pheno
# For binary: 1=control, 2=case, -9=missing
# For quantitative: continuous values
FAM001 IND001 2
FAM002 IND002 1
FAM003 IND003 1.5
```

### Covariate File Format

```
# covariates.txt: FID IID cov1 cov2 ...
FAM001 IND001 0.15 35 1
FAM002 IND002 -0.22 42 2
FAM003 IND003 0.08 28 1
```

## GLM Options

### Phenotype Handling

```bash
# Multiple phenotypes (test all)
plink2 --bfile data --pheno pheno_multi.txt --glm --out results

# Specific phenotype column
plink2 --bfile data --pheno pheno_multi.txt --pheno-name trait1 --glm --out results

# Missing phenotype handling
plink2 --bfile data --glm allow-no-covars --out results
```

### Model Options

```bash
# Additive model (default)
plink2 --bfile data --glm --out results

# Dominant model
plink2 --bfile data --glm dominant --out results

# Recessive model
plink2 --bfile data --glm recessive --out results

# Genotypic (2df test)
plink2 --bfile data --glm genotypic --out results

# Hide covariates from output (cleaner output)
plink2 --bfile data --covar cov.txt --glm hide-covar --out results
```

### Firth Regression (Rare Variants)

```bash
# Enable Firth fallback for case-control (default in PLINK 2.0)
plink2 --bfile data --glm firth-fallback --out results

# Force Firth regression
plink2 --bfile data --glm firth --out results

# Disable Firth
plink2 --bfile data --glm no-firth --out results
```

## Output Format

### Output Columns

```bash
# Default output: results.PHENO1.glm.logistic or results.PHENO1.glm.linear
# Columns: CHROM, POS, ID, REF, ALT, A1, FIRTH?, TEST, OBS_CT, OR/BETA, SE, Tstat, P
```

### Custom Output Columns

```bash
# Add specific columns
plink2 --bfile data --glm cols=+a1freq,+machr2 --out results

# Available columns:
# +a1freq: A1 allele frequency
# +machr2: MaCH R-squared
# +ax: Reference allele dosage
# +err: Standard errors
```

## Population Stratification Control

### Include Principal Components

```bash
# 1. Run PCA
plink2 --bfile data --pca 10 --out pca_results

# 2. Use PCs as covariates
plink2 --bfile data \
    --pheno pheno.txt \
    --covar pca_results.eigenvec \
    --covar-name PC1,PC2,PC3,PC4,PC5 \
    --glm --out results
```

### Combined Workflow

**Goal:** Run a complete GWAS pipeline from raw genotypes through population stratification correction to association testing.

**Approach:** Apply MAF, genotyping rate, and HWE quality filters, compute principal components for population structure correction, then run logistic/linear regression with PCs as covariates.

```bash
# QC, PCA, and GWAS in sequence
plink2 --bfile raw --maf 0.01 --geno 0.05 --hwe 1e-6 --make-bed --out qc
plink2 --bfile qc --pca 10 --out pca
plink2 --bfile qc \
    --pheno pheno.txt \
    --covar pca.eigenvec \
    --covar-name PC1-PC5 \
    --glm hide-covar --out gwas
```

## Result Filtering

### Command Line Filtering

```bash
# Filter significant results
awk 'NR==1 || $13 < 5e-8' results.PHENO1.glm.logistic > significant.txt

# Extract top hits
sort -k13 -g results.PHENO1.glm.logistic | head -100 > top_hits.txt
```

### Python Analysis

```python
import pandas as pd

results = pd.read_csv('results.PHENO1.glm.logistic', sep='\t')

significant = results[results['P'] < 5e-8]
print(f'Genome-wide significant hits: {len(significant)}')

suggestive = results[results['P'] < 1e-5]
print(f'Suggestive hits: {len(suggestive)}')
```

## Visualization

### Manhattan Plot (Python)

```python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

results = pd.read_csv('results.PHENO1.glm.logistic', sep='\t')
results = results[results['TEST'] == 'ADD']
results['-log10P'] = -np.log10(results['P'])

chrom_colors = ['#1f77b4', '#ff7f0e']
results['color'] = results['#CHROM'].apply(lambda x: chrom_colors[x % 2])

cumulative_pos = []
offset = 0
for chrom in sorted(results['#CHROM'].unique()):
    chrom_data = results[results['#CHROM'] == chrom]
    cumulative_pos.extend(chrom_data['POS'] + offset)
    offset += chrom_data['POS'].max()

results['cumulative_pos'] = cumulative_pos

plt.figure(figsize=(14, 6))
plt.scatter(results['cumulative_pos'], results['-log10P'], c=results['color'], s=1)
plt.axhline(y=-np.log10(5e-8), color='red', linestyle='--', label='Genome-wide (5e-8)')
plt.axhline(y=-np.log10(1e-5), color='blue', linestyle='--', label='Suggestive (1e-5)')
plt.xlabel('Chromosome')
plt.ylabel('-log10(P)')
plt.legend()
plt.savefig('manhattan.png', dpi=150)
```

### QQ Plot (Python)

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

results = pd.read_csv('results.PHENO1.glm.logistic', sep='\t')
observed_p = results[results['TEST'] == 'ADD']['P'].dropna().sort_values()

n = len(observed_p)
expected_p = np.arange(1, n + 1) / (n + 1)

plt.figure(figsize=(6, 6))
plt.scatter(-np.log10(expected_p), -np.log10(observed_p), s=1)
plt.plot([0, 8], [0, 8], 'r--')
plt.xlabel('Expected -log10(P)')
plt.ylabel('Observed -log10(P)')

lambda_gc = np.median(stats.chi2.ppf(1 - observed_p, 1)) / stats.chi2.ppf(0.5, 1)
plt.title(f'QQ Plot (λ = {lambda_gc:.3f})')
plt.savefig('qqplot.png', dpi=150)
```

## Genomic Inflation

```python
from scipy import stats
import numpy as np

results = pd.read_csv('results.PHENO1.glm.logistic', sep='\t')
pvalues = results[results['TEST'] == 'ADD']['P'].dropna()

chisq = stats.chi2.ppf(1 - pvalues, 1)
lambda_gc = np.median(chisq) / stats.chi2.ppf(0.5, 1)
print(f'Genomic inflation factor: {lambda_gc:.3f}')
# Good: 1.0-1.05, Acceptable: 1.05-1.1, Concerning: >1.1
```

## Related Skills

- plink-basics - Data preparation and QC
- population-structure - PCA for stratification control
- linkage-disequilibrium - LD pruning before analysis