bio-population-genetics-association-testing
$
npx mdskill add GPTomics/bioSkills/bio-population-genetics-association-testingRun GWAS association tests with PLINK 2.0 and visualize results.
- Executes logistic and linear regression for case-control and quantitative traits.
- Depends on PLINK 2.0 CLI and requires matplotlib, numpy, pandas, and scipy.
- Selects analysis type based on phenotype file structure and user intent.
- Outputs Manhattan and QQ plots alongside statistical association results.
SKILL.md
.github/skills/bio-population-genetics-association-testingView on GitHub ↗
---
name: bio-population-genetics-association-testing
description: Genome-wide association studies (GWAS) with PLINK. Perform case-control and quantitative trait association testing using logistic/linear regression with covariates, generate Manhattan and QQ plots for result visualization. Use when running GWAS or association tests.
tool_type: cli
primary_tool: plink2
---
## Version Compatibility
Reference examples tested with: matplotlib 3.8+, numpy 1.26+, pandas 2.2+, scipy 1.12+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Association Testing
**"Run a GWAS on my genotyping data"** → Perform genome-wide association testing using logistic (case-control) or linear (quantitative) regression with covariates, then visualize results with Manhattan and QQ plots.
- CLI: `plink2 --glm` for association testing with covariates
GWAS analysis using PLINK 2.0's unified `--glm` command for case-control and quantitative traits.
## PLINK 2.0 Association Testing
### Basic Case-Control (Binary Phenotype)
```bash
# Basic logistic regression
plink2 --bfile data --glm --out results
# With phenotype file
plink2 --bfile data --pheno pheno.txt --glm --out results
```
### Quantitative Trait (Continuous Phenotype)
```bash
# Linear regression for quantitative traits
plink2 --bfile data --pheno pheno.txt --glm --out results
```
### With Covariates
```bash
# Include covariates (sex, age, PCs)
plink2 --bfile data \
--pheno pheno.txt \
--covar covariates.txt \
--glm --out results
# Specify which covariates to use
plink2 --bfile data \
--pheno pheno.txt \
--covar covariates.txt \
--covar-name PC1,PC2,PC3,age,sex \
--glm --out results
```
## Covariate Files
### Phenotype File Format
```
# pheno.txt: FID IID pheno
# For binary: 1=control, 2=case, -9=missing
# For quantitative: continuous values
FAM001 IND001 2
FAM002 IND002 1
FAM003 IND003 1.5
```
### Covariate File Format
```
# covariates.txt: FID IID cov1 cov2 ...
FAM001 IND001 0.15 35 1
FAM002 IND002 -0.22 42 2
FAM003 IND003 0.08 28 1
```
## GLM Options
### Phenotype Handling
```bash
# Multiple phenotypes (test all)
plink2 --bfile data --pheno pheno_multi.txt --glm --out results
# Specific phenotype column
plink2 --bfile data --pheno pheno_multi.txt --pheno-name trait1 --glm --out results
# Missing phenotype handling
plink2 --bfile data --glm allow-no-covars --out results
```
### Model Options
```bash
# Additive model (default)
plink2 --bfile data --glm --out results
# Dominant model
plink2 --bfile data --glm dominant --out results
# Recessive model
plink2 --bfile data --glm recessive --out results
# Genotypic (2df test)
plink2 --bfile data --glm genotypic --out results
# Hide covariates from output (cleaner output)
plink2 --bfile data --covar cov.txt --glm hide-covar --out results
```
### Firth Regression (Rare Variants)
```bash
# Enable Firth fallback for case-control (default in PLINK 2.0)
plink2 --bfile data --glm firth-fallback --out results
# Force Firth regression
plink2 --bfile data --glm firth --out results
# Disable Firth
plink2 --bfile data --glm no-firth --out results
```
## Output Format
### Output Columns
```bash
# Default output: results.PHENO1.glm.logistic or results.PHENO1.glm.linear
# Columns: CHROM, POS, ID, REF, ALT, A1, FIRTH?, TEST, OBS_CT, OR/BETA, SE, Tstat, P
```
### Custom Output Columns
```bash
# Add specific columns
plink2 --bfile data --glm cols=+a1freq,+machr2 --out results
# Available columns:
# +a1freq: A1 allele frequency
# +machr2: MaCH R-squared
# +ax: Reference allele dosage
# +err: Standard errors
```
## Population Stratification Control
### Include Principal Components
```bash
# 1. Run PCA
plink2 --bfile data --pca 10 --out pca_results
# 2. Use PCs as covariates
plink2 --bfile data \
--pheno pheno.txt \
--covar pca_results.eigenvec \
--covar-name PC1,PC2,PC3,PC4,PC5 \
--glm --out results
```
### Combined Workflow
**Goal:** Run a complete GWAS pipeline from raw genotypes through population stratification correction to association testing.
**Approach:** Apply MAF, genotyping rate, and HWE quality filters, compute principal components for population structure correction, then run logistic/linear regression with PCs as covariates.
```bash
# QC, PCA, and GWAS in sequence
plink2 --bfile raw --maf 0.01 --geno 0.05 --hwe 1e-6 --make-bed --out qc
plink2 --bfile qc --pca 10 --out pca
plink2 --bfile qc \
--pheno pheno.txt \
--covar pca.eigenvec \
--covar-name PC1-PC5 \
--glm hide-covar --out gwas
```
## Result Filtering
### Command Line Filtering
```bash
# Filter significant results
awk 'NR==1 || $13 < 5e-8' results.PHENO1.glm.logistic > significant.txt
# Extract top hits
sort -k13 -g results.PHENO1.glm.logistic | head -100 > top_hits.txt
```
### Python Analysis
```python
import pandas as pd
results = pd.read_csv('results.PHENO1.glm.logistic', sep='\t')
significant = results[results['P'] < 5e-8]
print(f'Genome-wide significant hits: {len(significant)}')
suggestive = results[results['P'] < 1e-5]
print(f'Suggestive hits: {len(suggestive)}')
```
## Visualization
### Manhattan Plot (Python)
```python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
results = pd.read_csv('results.PHENO1.glm.logistic', sep='\t')
results = results[results['TEST'] == 'ADD']
results['-log10P'] = -np.log10(results['P'])
chrom_colors = ['#1f77b4', '#ff7f0e']
results['color'] = results['#CHROM'].apply(lambda x: chrom_colors[x % 2])
cumulative_pos = []
offset = 0
for chrom in sorted(results['#CHROM'].unique()):
chrom_data = results[results['#CHROM'] == chrom]
cumulative_pos.extend(chrom_data['POS'] + offset)
offset += chrom_data['POS'].max()
results['cumulative_pos'] = cumulative_pos
plt.figure(figsize=(14, 6))
plt.scatter(results['cumulative_pos'], results['-log10P'], c=results['color'], s=1)
plt.axhline(y=-np.log10(5e-8), color='red', linestyle='--', label='Genome-wide (5e-8)')
plt.axhline(y=-np.log10(1e-5), color='blue', linestyle='--', label='Suggestive (1e-5)')
plt.xlabel('Chromosome')
plt.ylabel('-log10(P)')
plt.legend()
plt.savefig('manhattan.png', dpi=150)
```
### QQ Plot (Python)
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
results = pd.read_csv('results.PHENO1.glm.logistic', sep='\t')
observed_p = results[results['TEST'] == 'ADD']['P'].dropna().sort_values()
n = len(observed_p)
expected_p = np.arange(1, n + 1) / (n + 1)
plt.figure(figsize=(6, 6))
plt.scatter(-np.log10(expected_p), -np.log10(observed_p), s=1)
plt.plot([0, 8], [0, 8], 'r--')
plt.xlabel('Expected -log10(P)')
plt.ylabel('Observed -log10(P)')
lambda_gc = np.median(stats.chi2.ppf(1 - observed_p, 1)) / stats.chi2.ppf(0.5, 1)
plt.title(f'QQ Plot (λ = {lambda_gc:.3f})')
plt.savefig('qqplot.png', dpi=150)
```
## Genomic Inflation
```python
from scipy import stats
import numpy as np
results = pd.read_csv('results.PHENO1.glm.logistic', sep='\t')
pvalues = results[results['TEST'] == 'ADD']['P'].dropna()
chisq = stats.chi2.ppf(1 - pvalues, 1)
lambda_gc = np.median(chisq) / stats.chi2.ppf(0.5, 1)
print(f'Genomic inflation factor: {lambda_gc:.3f}')
# Good: 1.0-1.05, Acceptable: 1.05-1.1, Concerning: >1.1
```
## Related Skills
- plink-basics - Data preparation and QC
- population-structure - PCA for stratification control
- linkage-disequilibrium - LD pruning before analysis
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.