bio-workflows-expression-to-pathways
$
npx mdskill add GPTomics/bioSkills/bio-workflows-expression-to-pathwaysTransform differential expression data into pathway enrichment insights using R
- Solve the task of identifying enriched biological pathways from DE gene lists
- Relies on clusterProfiler, ReactomePA, DESeq2, and visualization tools for analysis
- Validates input gene IDs and checks enrichment results for statistical reasonableness
- Delivers GO, KEGG, Reactome, and GSEA results with visual summaries for downstream interpretation
SKILL.md
.github/skills/bio-workflows-expression-to-pathwaysView on GitHub ↗
---
name: bio-workflows-expression-to-pathways
description: Workflow from differential expression results to functional enrichment analysis. Covers GO, KEGG, Reactome enrichment with clusterProfiler and visualization. Use when taking DE results to pathway enrichment.
tool_type: r
primary_tool: clusterProfiler
workflow: true
depends_on:
- pathway-analysis/go-enrichment
- pathway-analysis/kegg-pathways
- pathway-analysis/reactome-pathways
- pathway-analysis/gsea
- pathway-analysis/enrichment-visualization
qc_checkpoints:
- input_validation: "Valid gene IDs, sufficient DE genes"
- enrichment_qc: "Reasonable number of terms, p-values not all significant"
---
## Version Compatibility
Reference examples tested with: DESeq2 1.42+, R stats (base), ReactomePA 1.46+, clusterProfiler 4.10+, ggplot2 3.5+
Before using code patterns, verify installed versions match. If versions differ:
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Expression to Pathways Workflow
**"Find enriched pathways from my differential expression results"** → Orchestrate GO enrichment (clusterProfiler), GSEA, KEGG/Reactome pathway mapping, and enrichment visualization from DE gene lists or ranked gene lists.
Convert differential expression results into biological insights through functional enrichment analysis.
## Method Selection
| Scenario | Method | Why |
|----------|--------|-----|
| Have DE results with Wald stat / t-stat for all genes | GSEA (Step 4) | Uses full ranking; no arbitrary cutoff; ~35% higher F1 than ORA |
| Clear gene list from non-DE source (co-expression, GWAS) | ORA (Steps 1-3) | No ranking available |
| RNA-seq with known gene length bias | GOseq (goseq package) | Standard ORA ignores length bias |
| Bacterial / prokaryotic data | KEGG with locus tags | No org.*.eg.db; use keyType='kegg' |
| Multiple conditions to compare | compareCluster or mitch | Never compare p-values across separate enrichments |
When in doubt, run both ORA and GSEA and compare. Concordant results are more trustworthy.
## Workflow Overview
```
DE Results (gene list or ranked list)
|
v
[1. Gene ID Conversion] --> Convert to Entrez/Ensembl
|
v
[2. Over-representation Analysis]
|
+---> GO Enrichment (BP, MF, CC)
|
+---> KEGG Pathways
|
+---> Reactome Pathways
|
v
[3. GSEA (ranked genes)]
|
v
[4. Visualization] -----> Dot plots, networks, bar plots
|
v
Functional annotations and pathway insights
```
## Input Preparation
### From DESeq2 Results
```r
library(DESeq2)
library(clusterProfiler)
library(org.Hs.eg.db)
# Load DE results
res <- read.csv('deseq2_results.csv', row.names = 1)
# Significant genes for ORA
sig_genes <- rownames(subset(res, padj < 0.05 & abs(log2FoldChange) > 1))
# Background = all tested genes (NOT the full genome)
# Pre-filtering and independent filtering reduce the tested set; use only genes that were tested
background_genes <- rownames(res[!is.na(res$pvalue), ])
# Ranked list for GSEA — prefer Wald statistic (combines magnitude + precision)
# Alternatives: shrunken LFC, or sign(logFC) * -log10(PValue) for edgeR
ranked_genes <- res$stat
names(ranked_genes) <- rownames(res)
ranked_genes <- sort(ranked_genes[!is.na(ranked_genes)], decreasing = TRUE)
```
### Gene ID Conversion
```r
# Convert gene symbols to Entrez IDs
sig_entrez <- bitr(sig_genes, fromType = 'SYMBOL', toType = 'ENTREZID',
OrgDb = org.Hs.eg.db)
# For ranked list
ranked_entrez <- bitr(names(ranked_genes), fromType = 'SYMBOL', toType = 'ENTREZID',
OrgDb = org.Hs.eg.db)
ranked_list <- ranked_genes[ranked_entrez$SYMBOL]
names(ranked_list) <- ranked_entrez$ENTREZID
```
## Step 1: GO Over-representation Analysis
```r
# Convert background genes too
bg_entrez <- bitr(background_genes, fromType = 'SYMBOL', toType = 'ENTREZID', OrgDb = org.Hs.eg.db)
# Biological Process — always specify universe (background)
go_bp <- enrichGO(gene = sig_entrez$ENTREZID,
universe = bg_entrez$ENTREZID,
OrgDb = org.Hs.eg.db,
ont = 'BP',
pAdjustMethod = 'BH',
pvalueCutoff = 0.05,
qvalueCutoff = 0.1,
readable = TRUE)
# Molecular Function
go_mf <- enrichGO(gene = sig_entrez$ENTREZID,
universe = bg_entrez$ENTREZID,
OrgDb = org.Hs.eg.db,
ont = 'MF',
pAdjustMethod = 'BH',
pvalueCutoff = 0.05,
readable = TRUE)
# Cellular Component
go_cc <- enrichGO(gene = sig_entrez$ENTREZID,
universe = bg_entrez$ENTREZID,
OrgDb = org.Hs.eg.db,
ont = 'CC',
pAdjustMethod = 'BH',
pvalueCutoff = 0.05,
readable = TRUE)
# Simplify redundant terms
go_bp_simple <- simplify(go_bp, cutoff = 0.7, by = 'p.adjust')
```
## Step 2: KEGG Pathway Enrichment
```r
kegg <- enrichKEGG(gene = sig_entrez$ENTREZID,
organism = 'hsa',
pvalueCutoff = 0.05,
qvalueCutoff = 0.1)
# Convert KEGG IDs to readable names
kegg <- setReadable(kegg, OrgDb = org.Hs.eg.db, keyType = 'ENTREZID')
```
## Step 3: Reactome Pathway Enrichment
```r
library(ReactomePA)
reactome <- enrichPathway(gene = sig_entrez$ENTREZID,
organism = 'human',
pvalueCutoff = 0.05,
readable = TRUE)
```
## Step 4: Gene Set Enrichment Analysis (GSEA)
```r
# GO GSEA
gsea_go <- gseGO(geneList = ranked_list,
OrgDb = org.Hs.eg.db,
ont = 'BP',
minGSSize = 10,
maxGSSize = 500,
pvalueCutoff = 0.05,
verbose = FALSE)
# KEGG GSEA
gsea_kegg <- gseKEGG(geneList = ranked_list,
organism = 'hsa',
minGSSize = 10,
maxGSSize = 500,
pvalueCutoff = 0.05,
verbose = FALSE)
```
## Step 5: Visualization
```r
library(enrichplot)
library(ggplot2)
# Dot plot
dotplot(go_bp_simple, showCategory = 20) +
ggtitle('GO Biological Process Enrichment')
ggsave('go_bp_dotplot.pdf', width = 10, height = 8)
# Bar plot
barplot(kegg, showCategory = 15) +
ggtitle('KEGG Pathway Enrichment')
ggsave('kegg_barplot.pdf', width = 9, height = 6)
# Enrichment map (network of related terms)
go_bp_simple <- pairwise_termsim(go_bp_simple)
emapplot(go_bp_simple, showCategory = 30) +
ggtitle('GO Term Similarity Network')
ggsave('go_network.pdf', width = 10, height = 10)
# Concept network (gene-term connections)
cnetplot(go_bp, showCategory = 5, categorySize = 'pvalue') +
ggtitle('Gene-Concept Network')
ggsave('cnet_plot.pdf', width = 12, height = 10)
# GSEA plot for specific pathway
gseaplot2(gsea_kegg, geneSetID = 1:3, pvalue_table = TRUE)
ggsave('gsea_plot.pdf', width = 10, height = 8)
# Ridge plot for GSEA
ridgeplot(gsea_go, showCategory = 15)
ggsave('gsea_ridge.pdf', width = 8, height = 10)
```
## Step 6: Export Results
```r
# Export enrichment results
write.csv(as.data.frame(go_bp), 'go_bp_enrichment.csv', row.names = FALSE)
write.csv(as.data.frame(kegg), 'kegg_enrichment.csv', row.names = FALSE)
write.csv(as.data.frame(reactome), 'reactome_enrichment.csv', row.names = FALSE)
write.csv(as.data.frame(gsea_go), 'gsea_go_results.csv', row.names = FALSE)
# Combine key results
combined <- rbind(
data.frame(Database = 'GO_BP', as.data.frame(go_bp_simple)[1:10,]),
data.frame(Database = 'KEGG', as.data.frame(kegg)[1:10,]),
data.frame(Database = 'Reactome', as.data.frame(reactome)[1:10,])
)
write.csv(combined, 'top_enriched_pathways.csv', row.names = FALSE)
```
## Parameter Recommendations
| Analysis | Parameter | Value |
|----------|-----------|-------|
| enrichGO | pvalueCutoff | 0.05 |
| enrichGO | qvalueCutoff | 0.1 |
| simplify | cutoff | 0.7 |
| gseGO | minGSSize | 10 |
| gseGO | maxGSSize | 500 |
| GSEA | perm | 1000 (default) |
## Troubleshooting
| Issue | Likely Cause | Solution |
|-------|--------------|----------|
| No enriched terms | Too few genes, wrong IDs | Check gene IDs, relax thresholds |
| All terms significant | Too many genes | Be more stringent with DE cutoffs |
| Gene ID conversion fails | Wrong organism, format | Check OrgDb package, gene format |
| GSEA no results | Poor ranking, small gene sets | Check ranked list, adjust minGSSize |
## Complete Workflow Script
```r
library(clusterProfiler)
library(org.Hs.eg.db)
library(ReactomePA)
library(enrichplot)
library(ggplot2)
# Configuration
de_file <- 'deseq2_results.csv'
output_dir <- 'pathway_analysis'
dir.create(output_dir, showWarnings = FALSE)
# Load and prepare data
res <- read.csv(de_file, row.names = 1)
sig_genes <- rownames(subset(res, padj < 0.05 & abs(log2FoldChange) > 1))
cat('Significant genes:', length(sig_genes), '\n')
# Convert IDs
sig_entrez <- bitr(sig_genes, fromType = 'SYMBOL', toType = 'ENTREZID', OrgDb = org.Hs.eg.db)
cat('Converted to Entrez:', nrow(sig_entrez), '\n')
# Ranked list for GSEA (Wald statistic preferred over LFC)
ranked <- res$stat
names(ranked) <- rownames(res)
ranked <- sort(ranked[!is.na(ranked)], decreasing = TRUE)
ranked_entrez <- bitr(names(ranked), fromType = 'SYMBOL', toType = 'ENTREZID', OrgDb = org.Hs.eg.db)
ranked_list <- ranked[ranked_entrez$SYMBOL]
names(ranked_list) <- ranked_entrez$ENTREZID
# Background genes (all tested, not full genome)
bg_entrez <- bitr(rownames(res[!is.na(res$pvalue), ]), fromType = 'SYMBOL', toType = 'ENTREZID', OrgDb = org.Hs.eg.db)
# GO enrichment with background
go_bp <- enrichGO(sig_entrez$ENTREZID, universe = bg_entrez$ENTREZID, OrgDb = org.Hs.eg.db, ont = 'BP', readable = TRUE)
go_bp_simple <- simplify(go_bp, cutoff = 0.7)
# KEGG
kegg <- enrichKEGG(sig_entrez$ENTREZID, organism = 'hsa')
kegg <- setReadable(kegg, OrgDb = org.Hs.eg.db, keyType = 'ENTREZID')
# Reactome
reactome <- enrichPathway(sig_entrez$ENTREZID, organism = 'human', readable = TRUE)
# GSEA
gsea_go <- gseGO(ranked_list, OrgDb = org.Hs.eg.db, ont = 'BP', verbose = FALSE)
# Plots
pdf(file.path(output_dir, 'enrichment_plots.pdf'), width = 10, height = 8)
print(dotplot(go_bp_simple, showCategory = 20) + ggtitle('GO Biological Process'))
print(barplot(kegg, showCategory = 15) + ggtitle('KEGG Pathways'))
if (nrow(as.data.frame(reactome)) > 0) {
print(dotplot(reactome, showCategory = 15) + ggtitle('Reactome Pathways'))
}
dev.off()
# Export
write.csv(as.data.frame(go_bp_simple), file.path(output_dir, 'go_bp.csv'), row.names = FALSE)
write.csv(as.data.frame(kegg), file.path(output_dir, 'kegg.csv'), row.names = FALSE)
write.csv(as.data.frame(reactome), file.path(output_dir, 'reactome.csv'), row.names = FALSE)
cat('\nResults saved to:', output_dir, '\n')
cat('GO BP terms:', nrow(as.data.frame(go_bp_simple)), '\n')
cat('KEGG pathways:', nrow(as.data.frame(kegg)), '\n')
cat('Reactome pathways:', nrow(as.data.frame(reactome)), '\n')
```
## Prokaryotic Organisms
For bacteria/archaea, standard org.db annotation packages are unavailable. Use KEGG directly with strain-specific organism codes:
```r
# Find organism code
search_kegg_organism('Pseudomonas aeruginosa', by = 'scientific_name')
# KEGG ORA with bacterial organism code (e.g., 'pae' for P. aeruginosa PAO1)
kegg_bac <- enrichKEGG(gene = sig_gene_ids, organism = 'pae', keyType = 'kegg',
pvalueCutoff = 0.05)
# For organisms without KEGG annotation, use KEGG Orthology
# Map genes to KO IDs via eggNOG-mapper or KOALA, then:
kegg_ko <- enrichKEGG(gene = ko_ids, organism = 'ko', keyType = 'kegg')
```
GO enrichment for prokaryotes: use `enricher()` with custom GO-to-gene mapping from eggNOG-mapper or InterProScan output, rather than org.db packages.
## Multi-Condition Enrichment Comparison
When comparing enrichment across conditions (e.g., treatment A vs B vs C):
```r
# compareCluster: run ORA across multiple gene lists
gene_clusters <- list(
ConditionA = sig_genes_A,
ConditionB = sig_genes_B,
ConditionC = sig_genes_C
)
cc <- compareCluster(gene_clusters, fun = 'enrichKEGG', organism = 'hsa')
dotplot(cc, showCategory = 10) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
Do not compare raw -log10(p-values) across conditions — they scale with sample size. Compare NES (normalized enrichment scores) for GSEA, or use compareCluster for ORA.
## Related Skills
- pathway-analysis/go-enrichment - GO enrichment details
- pathway-analysis/kegg-pathways - KEGG analysis
- pathway-analysis/reactome-pathways - Reactome analysis
- pathway-analysis/gsea - GSEA methods
- pathway-analysis/enrichment-visualization - Visualization options
- differential-expression/de-results - Preparing gene lists for enrichment
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.