bio-workflows-gwas-pipeline
$
npx mdskill add GPTomics/bioSkills/bio-workflows-gwas-pipelineExecute end-to-end GWAS pipelines from VCF to association results.
- Handles PLINK QC, population structure correction, and association testing.
- Depends on population genetics tools for linkage disequilibrium and structure.
- Validates QC checkpoints like call rates, HWE p-values, and QQ plots.
- Delivers Manhattan plots and association results for case-control or quantitative traits.
SKILL.md
.github/skills/bio-workflows-gwas-pipelineView on GitHub ↗
---
name: bio-workflows-gwas-pipeline
description: End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
tool_type: mixed
primary_tool: PLINK2
workflow: true
depends_on:
- population-genetics/plink-basics
- population-genetics/population-structure
- population-genetics/association-testing
- population-genetics/linkage-disequilibrium
qc_checkpoints:
- after_qc: "Sample/variant call rates >95%, HWE p>1e-6"
- after_structure: "No population stratification bias"
- after_association: "Lambda ~1.0, expected QQ plot"
---
## Version Compatibility
Reference examples tested with: ggplot2 3.5+
Before using code patterns, verify installed versions match. If versions differ:
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# GWAS Pipeline
**"Run a GWAS from my genotype data"** → Orchestrate sample/variant QC (PLINK2), population stratification (PCA), association testing (linear/logistic regression), multiple testing correction, and Manhattan/QQ plot visualization.
Complete workflow for genome-wide association studies from genotype data to significant associations.
## Workflow Overview
```
VCF/PLINK files
|
v
[1. QC Filtering] ------> Sample and variant QC
|
v
[2. LD Pruning] --------> Independent variants for PCA
|
v
[3. Population Structure] --> PCA for covariates
|
v
[4. Association Testing] --> Logistic/linear regression
|
v
[5. Results] -----------> Manhattan plot, QQ plot
|
v
Significant associations
```
## Step 1: Data Import and QC
### Convert VCF to PLINK
```bash
# VCF to PLINK binary format
plink2 --vcf input.vcf.gz \
--make-bed \
--out study
# Or with phenotype/covariate files
plink2 --vcf input.vcf.gz \
--pheno phenotypes.txt \
--make-bed \
--out study
```
### Sample QC
```bash
# Calculate sample statistics
plink2 --bfile study \
--missing \
--out study_stats
# Remove samples with high missing rate (>5%)
plink2 --bfile study \
--mind 0.05 \
--make-bed \
--out study_sample_qc
# Check for sex discrepancies (if sex chromosome data available)
plink2 --bfile study_sample_qc \
--check-sex \
--out study_sex_check
# Remove related individuals (optional, requires IBD)
plink2 --bfile study_sample_qc \
--king-cutoff 0.0884 \
--make-bed \
--out study_unrelated
```
### Variant QC
```bash
# Apply standard variant filters
plink2 --bfile study_sample_qc \
--geno 0.05 \
--maf 0.01 \
--hwe 1e-6 \
--make-bed \
--out study_qc
# Summary
plink2 --bfile study_qc --freq --out study_qc
```
**QC Checkpoint:**
- Sample call rate >95%
- Variant call rate >95%
- MAF >1%
- HWE p-value >1e-6 (controls only for case-control)
## Step 2: LD Pruning for PCA
```bash
# Identify independent variants
plink2 --bfile study_qc \
--indep-pairwise 50 5 0.2 \
--out pruned
# Extract pruned variants
plink2 --bfile study_qc \
--extract pruned.prune.in \
--make-bed \
--out study_pruned
```
## Step 3: Population Structure (PCA)
```bash
# Calculate principal components
plink2 --bfile study_pruned \
--pca 10 \
--out study_pca
# The eigenvec file contains PCs for use as covariates
```
### Visualize PCA
```r
library(ggplot2)
# Load PCA results
pca <- read.table('study_pca.eigenvec', header = FALSE)
colnames(pca) <- c('FID', 'IID', paste0('PC', 1:10))
# Load phenotype for coloring
pheno <- read.table('phenotypes.txt', header = TRUE)
pca <- merge(pca, pheno, by = c('FID', 'IID'))
# Plot
ggplot(pca, aes(x = PC1, y = PC2, color = as.factor(PHENO))) +
geom_point(alpha = 0.5) +
labs(title = 'PCA of Study Samples', color = 'Phenotype') +
theme_minimal()
ggsave('pca_plot.pdf', width = 8, height = 6)
```
## Step 4: Association Testing
### Case-Control (Binary Trait)
```bash
# Logistic regression with PCA covariates
plink2 --bfile study_qc \
--pheno phenotypes.txt \
--covar study_pca.eigenvec \
--covar-col-nums 3-12 \
--glm hide-covar \
--out gwas_results
# Results in gwas_results.PHENO.glm.logistic
```
### Quantitative Trait
```bash
# Linear regression
plink2 --bfile study_qc \
--pheno phenotypes.txt \
--pheno-name BMI \
--covar study_pca.eigenvec \
--covar-col-nums 3-12 \
--glm hide-covar \
--out gwas_bmi
# Results in gwas_bmi.BMI.glm.linear
```
### With Additional Covariates
```bash
# Include age, sex, and PCs
plink2 --bfile study_qc \
--pheno phenotypes.txt \
--covar covariates.txt \
--covar-name AGE,SEX,PC1-PC10 \
--glm hide-covar \
--out gwas_adjusted
```
## Step 5: Results Visualization
### Manhattan Plot
```r
library(qqman)
# Load results
results <- read.table('gwas_results.PHENO.glm.logistic', header = TRUE)
results <- results[!is.na(results$P),]
# Manhattan plot
png('manhattan.png', width = 1200, height = 600)
manhattan(results, chr = 'X.CHROM', bp = 'POS', snp = 'ID', p = 'P',
suggestiveline = -log10(1e-5), genomewideline = -log10(5e-8))
dev.off()
# QQ plot
png('qq_plot.png', width = 600, height = 600)
qq(results$P)
dev.off()
```
### Calculate Genomic Inflation
```r
# Lambda (genomic inflation factor)
chisq <- qchisq(1 - results$P, 1)
lambda <- median(chisq) / qchisq(0.5, 1)
cat('Lambda:', round(lambda, 3), '\n')
# Lambda should be close to 1.0 (1.0-1.1 acceptable)
```
### Extract Significant Hits
```bash
# Genome-wide significant (p < 5e-8)
awk '$12 < 5e-8' gwas_results.PHENO.glm.logistic > significant_hits.txt
# Suggestive (p < 1e-5)
awk '$12 < 1e-5' gwas_results.PHENO.glm.logistic > suggestive_hits.txt
```
## Parameter Recommendations
| Step | Parameter | Value |
|------|-----------|-------|
| Sample QC | --mind | 0.05 |
| Variant QC | --geno | 0.05 |
| Variant QC | --maf | 0.01 |
| Variant QC | --hwe | 1e-6 |
| LD pruning | --indep-pairwise | 50 5 0.2 |
| PCA | --pca | 10 |
| Significance | p-value | 5e-8 |
## Troubleshooting
| Issue | Likely Cause | Solution |
|-------|--------------|----------|
| High lambda (>1.1) | Population stratification | Add more PCs, check ancestry |
| No significant hits | Low power | Increase sample size, meta-analysis |
| Deflated lambda (<1) | Over-correction | Reduce PC covariates |
| QQ deviation at low end | Batch effects | Check for technical artifacts |
## Complete Pipeline Script
```bash
#!/bin/bash
set -e
INPUT_VCF="genotypes.vcf.gz"
PHENO="phenotypes.txt"
OUTDIR="gwas_results"
mkdir -p ${OUTDIR}
# Step 1: Convert and QC
plink2 --vcf ${INPUT_VCF} --make-bed --out ${OUTDIR}/raw
plink2 --bfile ${OUTDIR}/raw --mind 0.05 --geno 0.05 --maf 0.01 --hwe 1e-6 \
--make-bed --out ${OUTDIR}/qc
# Step 2: LD pruning
plink2 --bfile ${OUTDIR}/qc --indep-pairwise 50 5 0.2 --out ${OUTDIR}/pruned
plink2 --bfile ${OUTDIR}/qc --extract ${OUTDIR}/pruned.prune.in \
--make-bed --out ${OUTDIR}/pruned
# Step 3: PCA
plink2 --bfile ${OUTDIR}/pruned --pca 10 --out ${OUTDIR}/pca
# Step 4: Association
plink2 --bfile ${OUTDIR}/qc --pheno ${PHENO} \
--covar ${OUTDIR}/pca.eigenvec --covar-col-nums 3-12 \
--glm hide-covar --out ${OUTDIR}/gwas
echo "=== GWAS Complete ==="
echo "Results: ${OUTDIR}/gwas.*.glm.*"
```
## Related Skills
- population-genetics/plink-basics - PLINK file formats and commands
- population-genetics/population-structure - PCA and admixture
- population-genetics/association-testing - Statistical models
- population-genetics/linkage-disequilibrium - LD concepts
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.