bio-workflows-timecourse-pipeline
$
npx mdskill add GPTomics/bioSkills/bio-workflows-timecourse-pipelineExecutes end-to-end time-course analysis from expression matrices to enriched pathways.
- Transforms bulk time-series omics data into temporal patterns and biological insights.
- Depends on differential expression, clustering, rhythm detection, and enrichment tools.
- Validates outputs against statistical thresholds and model fit quality metrics.
- Delivers cluster-specific pathway enrichments with confidence scores and residuals.
SKILL.md
.github/skills/bio-workflows-timecourse-pipelineView on GitHub ↗
---
name: bio-workflows-timecourse-pipeline
description: End-to-end time-course analysis from expression matrix to temporal patterns and enrichment. Covers temporal DE, Mfuzz soft clustering, optional rhythm detection, GAM trajectory fitting, and per-cluster pathway enrichment. Use when analyzing bulk time-series expression experiments from any omics platform.
tool_type: mixed
primary_tool: Mfuzz
workflow: true
depends_on:
- differential-expression/timeseries-de
- temporal-genomics/temporal-clustering
- temporal-genomics/circadian-rhythms
- temporal-genomics/trajectory-modeling
- pathway-analysis/go-enrichment
qc_checkpoints:
- after_de: "Significant temporal genes >100 at FDR <0.05; model fit residuals reasonable"
- after_clustering: "Membership >0.5 for soft clustering; no empty clusters; validated by gap statistic or silhouette (typical range 4-20 clusters depending on complexity)"
- after_enrichment: "At least 3 clusters with significant GO terms at FDR <0.05"
---
## Version Compatibility
Reference examples tested with: DESeq2 1.42+, clusterProfiler 4.10+, limma 3.58+, numpy 1.26+, pandas 2.2+, scanpy 1.10+, scipy 1.12+, statsmodels 0.14+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Time-Course Analysis Pipeline
**"Analyze my time-course expression data end-to-end"** → Orchestrate temporal differential expression, Mfuzz soft clustering, optional circadian rhythm detection, GAM trajectory fitting, changepoint detection, and per-cluster pathway enrichment.
Complete workflow from expression matrix through temporal differential expression, soft clustering,
optional rhythm detection, trajectory fitting, and per-cluster pathway enrichment.
## Pipeline Overview
```
Expression matrix + time metadata
|
v
[1. Temporal DE] ---------> limma splines / DESeq2 LRT
|
v
[2. Filter] --------------> Significant temporal genes (FDR <0.05)
|
v
[3. Mfuzz Clustering] ----> Soft clustering of expression profiles
| |
| +---> QC: membership >0.5, no empty clusters
|
+--- Circadian design? ---> [4a. Rhythm Detection] (MetaCycle / CosinorPy)
| |
| v
| Rhythmic genes + period/phase estimates
|
v
[4b. GAM Trajectory] -----> mgcv GAM fitting for top clusters
|
v
[5. Pathway Enrichment] --> clusterProfiler per-cluster GO/KEGG
|
v
Temporal gene modules + enriched pathways + trajectory plots
```
## Step 1: Temporal Differential Expression
### R (limma splines)
```r
library(limma)
library(splines)
expr <- as.matrix(read.csv('counts_normalized.csv', row.names = 1))
meta <- read.csv('metadata.csv')
time_points <- meta$time
design <- model.matrix(~ ns(time_points, df = 3))
fit <- lmFit(expr, design)
fit <- eBayes(fit)
# Test all spline coefficients jointly for temporal significance
temporal_results <- topTable(fit, coef = 2:ncol(design), number = Inf, sort.by = 'F')
# topTable already returns adj.P.Val (BH-corrected); use it directly
```
### R (DESeq2 LRT)
```r
library(DESeq2)
counts <- as.matrix(read.csv('raw_counts.csv', row.names = 1))
meta <- read.csv('metadata.csv')
meta$time <- factor(meta$time)
dds <- DESeqDataSetFromMatrix(counts, colData = meta, design = ~ time)
dds <- DESeq(dds, test = 'LRT', reduced = ~ 1)
res <- results(dds)
```
### Python (statsmodels)
```python
import pandas as pd
import numpy as np
from statsmodels.stats.multitest import multipletests
from patsy import dmatrix
from scipy import stats
expr = pd.read_csv('counts_normalized.csv', index_col=0)
meta = pd.read_csv('metadata.csv')
spline_basis = dmatrix('bs(time, df=3)', data=meta, return_type='dataframe')
design_full = np.column_stack([np.ones(len(meta)), spline_basis.values])
design_reduced = np.ones((len(meta), 1))
pvals = []
for gene in expr.index:
y = expr.loc[gene].values
ss_full = np.sum((y - design_full @ np.linalg.lstsq(design_full, y, rcond=None)[0]) ** 2)
ss_red = np.sum((y - design_reduced @ np.linalg.lstsq(design_reduced, y, rcond=None)[0]) ** 2)
df_diff = design_full.shape[1] - design_reduced.shape[1]
df_resid = len(y) - design_full.shape[1]
f_stat = ((ss_red - ss_full) / df_diff) / (ss_full / df_resid)
pvals.append(1 - stats.f.cdf(f_stat, df_diff, df_resid))
_, fdr, _, _ = multipletests(pvals, method='fdr_bh')
temporal_genes = expr.index[fdr < 0.05].tolist()
```
### QC Checkpoint: Temporal DE
```r
# Gate 1: Sufficient temporal genes detected
sig_genes <- temporal_results[temporal_results$adj.P.Val < 0.05, ]
n_sig <- nrow(sig_genes)
message(sprintf('Significant temporal genes: %d', n_sig))
if (n_sig < 100) message('WARNING: Few temporal genes. Check time point spacing or consider relaxing FDR.')
if (n_sig > 10000) message('WARNING: Many temporal genes. Consider stricter FDR or inspect batch effects.')
# Gate 2: Residual distribution check
residuals <- residuals(fit, expr)
message(sprintf('Residual mean: %.4f, SD: %.4f', mean(residuals), sd(residuals)))
```
## Step 2: Filter Significant Genes
```r
# FDR <0.05: Standard threshold for temporal DE
# More permissive (0.1) acceptable for exploratory clustering
sig_genes <- rownames(temporal_results[temporal_results$adj.P.Val < 0.05, ])
expr_sig <- expr[sig_genes, ]
message(sprintf('Genes passing FDR <0.05: %d', length(sig_genes)))
```
## Step 3: Mfuzz Soft Clustering
```r
library(Mfuzz)
eset <- ExpressionSet(assayData = as.matrix(expr_sig))
# Standardize expression profiles (mean=0, sd=1 per gene)
eset <- standardise(eset)
# Estimate fuzzifier m
# mestimate() calculates optimal m from data geometry; typical range 1.5-2.5
m <- mestimate(eset)
message(sprintf('Estimated fuzzifier m = %.2f', m))
# Cluster count: start with sqrt(n_genes/2), refine with gap statistic
# Typical range 4-20 depending on temporal complexity
n_clusters <- 8
cl <- mfuzz(eset, c = n_clusters, m = m)
# Filter low-membership genes
# Membership >0.5: gene clearly belongs to one cluster
# Lower to 0.3 for exploratory analysis with more overlap
core_genes <- acore(eset, cl, min.acore = 0.5)
```
### Python Alternative (tslearn)
```python
from tslearn.clustering import TimeSeriesKMeans
# Row-wise z-scoring: normalize each gene across its timepoints (not per-timepoint)
expr_scaled = (expr_sig.values - expr_sig.values.mean(axis=1, keepdims=True)) / expr_sig.values.std(axis=1, keepdims=True)
# n_clusters: 4-20 depending on complexity; evaluate with silhouette score
model = TimeSeriesKMeans(n_clusters=8, metric='softdtw', metric_params={'gamma': 0.01},
max_iter=50, random_state=42)
labels = model.fit_predict(expr_scaled.reshape(expr_scaled.shape[0], expr_scaled.shape[1], 1))
```
### QC Checkpoint: Clustering
```r
# Gate 1: No empty clusters
cluster_sizes <- table(cl$cluster)
message('Cluster sizes:')
print(cluster_sizes)
if (any(cluster_sizes == 0)) message('WARNING: Empty clusters found. Reduce n_clusters.')
# Gate 2: Membership quality
for (i in seq_along(core_genes)) {
n_core <- nrow(core_genes[[i]])
message(sprintf('Cluster %d: %d core genes (membership >0.5)', i, n_core))
}
# Gate 3: Silhouette score (optional validation)
library(cluster)
hard_labels <- cl$cluster
sil <- silhouette(hard_labels, dist(exprs(eset)))
message(sprintf('Mean silhouette: %.3f', mean(sil[, 3])))
```
## Step 4a: Rhythm Detection (Optional - Circadian Designs)
Only applicable when sampling covers 24h+ cycles with sufficient resolution (every 2-4h).
### R (MetaCycle)
```r
library(MetaCycle)
# Expects genes as rows, time points as columns
# Column names must be numeric time values (hours)
expr_for_meta <- expr_sig
colnames(expr_for_meta) <- meta$time_hours
write.csv(expr_for_meta, 'expr_for_metacycle.csv')
# Period range 20-28h: standard circadian search window
# Adjust for ultradian (4-12h) or infradian (>28h) rhythms
meta2d('expr_for_metacycle.csv', filestyle = 'csv',
minper = 20, maxper = 28,
timepoints = sort(unique(meta$time_hours)),
outdir = 'metacycle_results')
```
### Python (CosinorPy)
```python
from cosinorpy import file_parser, cosinor
# fit_group expects long-format DataFrame with columns 'x' (time), 'y' (expression), 'test' (gene name)
# Reshape expression matrix to long format before passing
# period=24: standard circadian; adjust for other periodicities
results = cosinor.fit_group(expr_long, period=24, n_components=1)
rhythmic = results[results['p'] < 0.05]
```
## Step 4b: GAM Trajectory Fitting
### R (mgcv)
```r
library(mgcv)
cluster_trajectories <- list()
for (cl_id in 1:n_clusters) {
cl_genes <- names(cl$cluster[cl$cluster == cl_id])
mean_profile <- colMeans(expr_sig[cl_genes, ])
df_gam <- data.frame(time = meta$time, expr = mean_profile)
# k: basis dimension; k=5 sufficient for most time courses
# Increase to k=10 for >20 time points; decrease to k=3 for <6 time points
gam_fit <- gam(expr ~ s(time, k = 5), data = df_gam)
cluster_trajectories[[cl_id]] <- list(
fit = gam_fit,
r_squared = summary(gam_fit)$r.sq,
edf = summary(gam_fit)$edf
)
message(sprintf('Cluster %d: R^2 = %.3f, EDF = %.2f', cl_id,
summary(gam_fit)$r.sq, summary(gam_fit)$edf))
}
```
### Python (pygam)
```python
from pygam import LinearGAM, s
import numpy as np
for cl_id in range(n_clusters):
cl_mask = labels == cl_id
mean_profile = expr_scaled[cl_mask].mean(axis=0)
# n_splines=5: sufficient for most time courses
gam = LinearGAM(s(0, n_splines=5)).fit(meta['time'].values.reshape(-1, 1), mean_profile)
print(f'Cluster {cl_id}: GCV = {gam.statistics_["GCV"]:.4f}')
```
## Step 5: Per-Cluster Pathway Enrichment
### R (clusterProfiler)
```r
library(clusterProfiler)
library(org.Hs.eg.db)
# Background = all temporal genes tested (not the full genome)
all_temporal_entrez <- bitr(rownames(expr_sig), fromType = 'SYMBOL', toType = 'ENTREZID',
OrgDb = org.Hs.eg.db)
enrichment_results <- list()
for (i in seq_along(core_genes)) {
genes <- core_genes[[i]]$NAME
entrez <- bitr(genes, fromType = 'SYMBOL', toType = 'ENTREZID', OrgDb = org.Hs.eg.db)
# GO Biological Process enrichment with proper background
ego <- enrichGO(gene = entrez$ENTREZID,
universe = all_temporal_entrez$ENTREZID,
OrgDb = org.Hs.eg.db,
ont = 'BP', pAdjustMethod = 'BH',
pvalueCutoff = 0.05, qvalueCutoff = 0.05,
readable = TRUE)
# Simplify redundant GO terms (parent-child hierarchy creates redundancy)
if (nrow(as.data.frame(ego)) > 0) {
ego <- simplify(ego, cutoff = 0.7, by = 'p.adjust')
}
enrichment_results[[i]] <- ego
message(sprintf('Cluster %d: %d significant GO terms', i, nrow(as.data.frame(ego))))
}
```
### Python (gseapy)
```python
import gseapy as gp
# Use all temporal genes as background for proper enrichment statistics
all_temporal_genes = list(expr_sig.index)
for cl_id in range(n_clusters):
cl_genes = [g for g, l in zip(expr_sig.index, labels) if l == cl_id]
enr = gp.enrichr(gene_list=cl_genes, gene_sets='GO_Biological_Process_2023',
organism='human', background=all_temporal_genes,
outdir=f'enrichr_cluster_{cl_id}')
sig_terms = enr.results[enr.results['Adjusted P-value'] < 0.05]
print(f'Cluster {cl_id}: {len(sig_terms)} significant GO terms')
```
### QC Checkpoint: Enrichment
```r
# Gate: At least 3 clusters should have significant GO terms
clusters_with_terms <- sum(sapply(enrichment_results, function(x) nrow(as.data.frame(x)) > 0))
message(sprintf('Clusters with significant GO terms: %d / %d', clusters_with_terms, length(enrichment_results)))
if (clusters_with_terms < 3) {
message('WARNING: Few clusters enriched. Check gene ID conversion or relax thresholds.')
}
```
## Parameter Recommendations
| Step | Parameter | Recommendation |
|------|-----------|----------------|
| Temporal DE | Spline df | 3 (default); increase to 4-5 for >10 time points |
| Temporal DE | FDR | 0.05 (standard); 0.1 for exploratory clustering |
| Mfuzz | fuzzifier m | Use mestimate(); typical range 1.5-2.5 |
| Mfuzz | n_clusters | 4-20; start with sqrt(n_genes/2), refine with gap statistic |
| Mfuzz | min membership | 0.5 (core genes); 0.3 (exploratory) |
| MetaCycle | period range | 20-28h (circadian); adjust for other periodicities |
| GAM | k (basis dim) | 5 (default); 3 for <6 time points; 10 for >20 |
| clusterProfiler | pvalueCutoff | 0.05 (standard); 0.1 (permissive) |
## Troubleshooting
| Issue | Likely Cause | Solution |
|-------|--------------|----------|
| < 100 temporal genes | Insufficient replicates or noisy data | Add replicates; use DESeq2 LRT instead of limma |
| Empty Mfuzz clusters | Too many clusters | Reduce n_clusters; check gap statistic |
| All genes in one cluster | Fuzzifier too low or too few clusters | Increase m or n_clusters |
| No rhythmic genes | Non-circadian design or low power | Verify 24h+ sampling; increase resolution |
| GAM overfitting | k too high for time points | Set k = min(n_timepoints - 1, 5) |
| Few enriched clusters | Gene ID conversion failure | Check species; verify Entrez ID mapping |
| Low membership scores | High expression noise | Increase fuzzifier m; apply stricter gene filtering |
## Related Skills
- differential-expression/timeseries-de - Temporal DE methods
- temporal-genomics/temporal-clustering - Cluster analysis details
- temporal-genomics/circadian-rhythms - Rhythm detection details
- temporal-genomics/trajectory-modeling - GAM fitting details
- pathway-analysis/go-enrichment - Enrichment analysis details
- temporal-genomics/periodicity-detection - Unknown-period discovery
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.