bio-experimental-design-multiple-testing
$
npx mdskill add GPTomics/bioSkills/bio-experimental-design-multiple-testingApplies multiple testing corrections for genomics data using FDR, Bonferroni, and q-value methods
- Reduces false positives in differential expression analysis with thousands of tests
- Uses R (p.adjust, qvalue) and Python (statsmodels) for correction calculations
- Chooses correction method based on study design and false discovery control needs
- Returns adjusted p-values or q-values for downstream filtering and reporting
SKILL.md
.github/skills/bio-experimental-design-multiple-testingView on GitHub ↗
---
name: bio-experimental-design-multiple-testing
description: Applies multiple testing correction methods including FDR, Bonferroni, and q-value for genomics data. Use when filtering differential expression results, setting significance thresholds, or choosing between correction methods for different study designs.
tool_type: r
primary_tool: qvalue
---
## Version Compatibility
Reference examples tested with: R stats (base), statsmodels 0.14+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Multiple Testing Correction
**"Correct p-values for multiple testing"** → Adjust raw p-values from thousands of simultaneous tests to control false discovery rate or family-wise error rate.
- R: `p.adjust(pvalues, method = 'BH')`, `qvalue::qvalue()`
- Python: `statsmodels.stats.multitest.multipletests()`
## The Problem
Testing 20,000 genes at p < 0.05 yields ~1,000 false positives by chance. Correction is essential.
## Common Methods
### Bonferroni (Most Conservative)
```r
# Strict family-wise error rate control
p_adj <- p.adjust(pvalues, method = 'bonferroni')
# Threshold: alpha / n_tests
# Use for: small gene sets, confirmatory studies
```
### Benjamini-Hochberg FDR (Standard)
```r
# Controls false discovery rate
p_adj <- p.adjust(pvalues, method = 'BH')
# Most common for genomics
# FDR 0.05 = expect 5% of significant results to be false
```
### q-value (Recommended for Large-Scale)
**Goal:** Estimate the false discovery rate for each gene in a genome-wide test while maximizing detection power by estimating the proportion of true nulls.
**Approach:** Fit the q-value model to the p-value distribution, which estimates pi0 (fraction of true null hypotheses) and converts each p-value to a q-value representing the minimum FDR at which that gene would be called significant.
```r
library(qvalue)
qobj <- qvalue(pvalues)
qvalues <- qobj$qvalues
pi0 <- qobj$pi0 # Estimated proportion of true nulls
# q-value directly estimates FDR for each gene
# More powerful than BH when many true positives exist
```
## Method Selection Guide
| Scenario | Recommended Method | Threshold |
|----------|-------------------|-----------|
| Genome-wide DE | BH or q-value | FDR < 0.05 |
| Candidate genes | Bonferroni | p < 0.05/n |
| Exploratory | BH | FDR < 0.10 |
| Validation study | Bonferroni | p < 0.05/n |
| GWAS | Bonferroni | p < 5e-8 |
## Python Equivalent
```python
from statsmodels.stats.multitest import multipletests
# Benjamini-Hochberg
rejected, pvals_corrected, _, _ = multipletests(pvalues, method='fdr_bh')
# Bonferroni
rejected, pvals_corrected, _, _ = multipletests(pvalues, method='bonferroni')
```
## Interpreting Results
- **FDR 0.05**: Among genes called significant, ~5% are false positives
- **FDR 0.01**: More stringent, fewer false positives but more false negatives
- **padj vs qvalue**: Both estimate FDR; q-value is slightly more powerful
## Related Skills
- differential-expression/de-results - Applying corrections to DE output
- population-genetics/association-testing - GWAS significance thresholds
- pathway-analysis/go-enrichment - Correcting enrichment p-values
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.