bio-workflows-timecourse-pipeline

$npx mdskill add GPTomics/bioSkills/bio-workflows-timecourse-pipeline

Executes end-to-end time-course analysis from expression matrices to enriched pathways.

  • Transforms bulk time-series omics data into temporal patterns and biological insights.
  • Depends on differential expression, clustering, rhythm detection, and enrichment tools.
  • Validates outputs against statistical thresholds and model fit quality metrics.
  • Delivers cluster-specific pathway enrichments with confidence scores and residuals.
SKILL.md
.github/skills/bio-workflows-timecourse-pipelineView on GitHub ↗
---
name: bio-workflows-timecourse-pipeline
description: End-to-end time-course analysis from expression matrix to temporal patterns and enrichment. Covers temporal DE, Mfuzz soft clustering, optional rhythm detection, GAM trajectory fitting, and per-cluster pathway enrichment. Use when analyzing bulk time-series expression experiments from any omics platform.
tool_type: mixed
primary_tool: Mfuzz
workflow: true
depends_on:
  - differential-expression/timeseries-de
  - temporal-genomics/temporal-clustering
  - temporal-genomics/circadian-rhythms
  - temporal-genomics/trajectory-modeling
  - pathway-analysis/go-enrichment
qc_checkpoints:
  - after_de: "Significant temporal genes >100 at FDR <0.05; model fit residuals reasonable"
  - after_clustering: "Membership >0.5 for soft clustering; no empty clusters; validated by gap statistic or silhouette (typical range 4-20 clusters depending on complexity)"
  - after_enrichment: "At least 3 clusters with significant GO terms at FDR <0.05"
---

## Version Compatibility

Reference examples tested with: DESeq2 1.42+, clusterProfiler 4.10+, limma 3.58+, numpy 1.26+, pandas 2.2+, scanpy 1.10+, scipy 1.12+, statsmodels 0.14+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Time-Course Analysis Pipeline

**"Analyze my time-course expression data end-to-end"** → Orchestrate temporal differential expression, Mfuzz soft clustering, optional circadian rhythm detection, GAM trajectory fitting, changepoint detection, and per-cluster pathway enrichment.

Complete workflow from expression matrix through temporal differential expression, soft clustering,
optional rhythm detection, trajectory fitting, and per-cluster pathway enrichment.

## Pipeline Overview

```
Expression matrix + time metadata
    |
    v
[1. Temporal DE] ---------> limma splines / DESeq2 LRT
    |
    v
[2. Filter] --------------> Significant temporal genes (FDR <0.05)
    |
    v
[3. Mfuzz Clustering] ----> Soft clustering of expression profiles
    |                            |
    |                            +---> QC: membership >0.5, no empty clusters
    |
    +--- Circadian design? ---> [4a. Rhythm Detection] (MetaCycle / CosinorPy)
    |                               |
    |                               v
    |                           Rhythmic genes + period/phase estimates
    |
    v
[4b. GAM Trajectory] -----> mgcv GAM fitting for top clusters
    |
    v
[5. Pathway Enrichment] --> clusterProfiler per-cluster GO/KEGG
    |
    v
Temporal gene modules + enriched pathways + trajectory plots
```

## Step 1: Temporal Differential Expression

### R (limma splines)

```r
library(limma)
library(splines)

expr <- as.matrix(read.csv('counts_normalized.csv', row.names = 1))
meta <- read.csv('metadata.csv')

time_points <- meta$time
design <- model.matrix(~ ns(time_points, df = 3))

fit <- lmFit(expr, design)
fit <- eBayes(fit)

# Test all spline coefficients jointly for temporal significance
temporal_results <- topTable(fit, coef = 2:ncol(design), number = Inf, sort.by = 'F')
# topTable already returns adj.P.Val (BH-corrected); use it directly
```

### R (DESeq2 LRT)

```r
library(DESeq2)

counts <- as.matrix(read.csv('raw_counts.csv', row.names = 1))
meta <- read.csv('metadata.csv')
meta$time <- factor(meta$time)

dds <- DESeqDataSetFromMatrix(counts, colData = meta, design = ~ time)
dds <- DESeq(dds, test = 'LRT', reduced = ~ 1)
res <- results(dds)
```

### Python (statsmodels)

```python
import pandas as pd
import numpy as np
from statsmodels.stats.multitest import multipletests
from patsy import dmatrix
from scipy import stats

expr = pd.read_csv('counts_normalized.csv', index_col=0)
meta = pd.read_csv('metadata.csv')

spline_basis = dmatrix('bs(time, df=3)', data=meta, return_type='dataframe')
design_full = np.column_stack([np.ones(len(meta)), spline_basis.values])
design_reduced = np.ones((len(meta), 1))

pvals = []
for gene in expr.index:
    y = expr.loc[gene].values
    ss_full = np.sum((y - design_full @ np.linalg.lstsq(design_full, y, rcond=None)[0]) ** 2)
    ss_red = np.sum((y - design_reduced @ np.linalg.lstsq(design_reduced, y, rcond=None)[0]) ** 2)
    df_diff = design_full.shape[1] - design_reduced.shape[1]
    df_resid = len(y) - design_full.shape[1]
    f_stat = ((ss_red - ss_full) / df_diff) / (ss_full / df_resid)
    pvals.append(1 - stats.f.cdf(f_stat, df_diff, df_resid))

_, fdr, _, _ = multipletests(pvals, method='fdr_bh')
temporal_genes = expr.index[fdr < 0.05].tolist()
```

### QC Checkpoint: Temporal DE

```r
# Gate 1: Sufficient temporal genes detected
sig_genes <- temporal_results[temporal_results$adj.P.Val < 0.05, ]
n_sig <- nrow(sig_genes)
message(sprintf('Significant temporal genes: %d', n_sig))
if (n_sig < 100) message('WARNING: Few temporal genes. Check time point spacing or consider relaxing FDR.')
if (n_sig > 10000) message('WARNING: Many temporal genes. Consider stricter FDR or inspect batch effects.')

# Gate 2: Residual distribution check
residuals <- residuals(fit, expr)
message(sprintf('Residual mean: %.4f, SD: %.4f', mean(residuals), sd(residuals)))
```

## Step 2: Filter Significant Genes

```r
# FDR <0.05: Standard threshold for temporal DE
# More permissive (0.1) acceptable for exploratory clustering
sig_genes <- rownames(temporal_results[temporal_results$adj.P.Val < 0.05, ])
expr_sig <- expr[sig_genes, ]
message(sprintf('Genes passing FDR <0.05: %d', length(sig_genes)))
```

## Step 3: Mfuzz Soft Clustering

```r
library(Mfuzz)

eset <- ExpressionSet(assayData = as.matrix(expr_sig))

# Standardize expression profiles (mean=0, sd=1 per gene)
eset <- standardise(eset)

# Estimate fuzzifier m
# mestimate() calculates optimal m from data geometry; typical range 1.5-2.5
m <- mestimate(eset)
message(sprintf('Estimated fuzzifier m = %.2f', m))

# Cluster count: start with sqrt(n_genes/2), refine with gap statistic
# Typical range 4-20 depending on temporal complexity
n_clusters <- 8
cl <- mfuzz(eset, c = n_clusters, m = m)

# Filter low-membership genes
# Membership >0.5: gene clearly belongs to one cluster
# Lower to 0.3 for exploratory analysis with more overlap
core_genes <- acore(eset, cl, min.acore = 0.5)
```

### Python Alternative (tslearn)

```python
from tslearn.clustering import TimeSeriesKMeans

# Row-wise z-scoring: normalize each gene across its timepoints (not per-timepoint)
expr_scaled = (expr_sig.values - expr_sig.values.mean(axis=1, keepdims=True)) / expr_sig.values.std(axis=1, keepdims=True)

# n_clusters: 4-20 depending on complexity; evaluate with silhouette score
model = TimeSeriesKMeans(n_clusters=8, metric='softdtw', metric_params={'gamma': 0.01},
                         max_iter=50, random_state=42)
labels = model.fit_predict(expr_scaled.reshape(expr_scaled.shape[0], expr_scaled.shape[1], 1))
```

### QC Checkpoint: Clustering

```r
# Gate 1: No empty clusters
cluster_sizes <- table(cl$cluster)
message('Cluster sizes:')
print(cluster_sizes)
if (any(cluster_sizes == 0)) message('WARNING: Empty clusters found. Reduce n_clusters.')

# Gate 2: Membership quality
for (i in seq_along(core_genes)) {
    n_core <- nrow(core_genes[[i]])
    message(sprintf('Cluster %d: %d core genes (membership >0.5)', i, n_core))
}

# Gate 3: Silhouette score (optional validation)
library(cluster)
hard_labels <- cl$cluster
sil <- silhouette(hard_labels, dist(exprs(eset)))
message(sprintf('Mean silhouette: %.3f', mean(sil[, 3])))
```

## Step 4a: Rhythm Detection (Optional - Circadian Designs)

Only applicable when sampling covers 24h+ cycles with sufficient resolution (every 2-4h).

### R (MetaCycle)

```r
library(MetaCycle)

# Expects genes as rows, time points as columns
# Column names must be numeric time values (hours)
expr_for_meta <- expr_sig
colnames(expr_for_meta) <- meta$time_hours

write.csv(expr_for_meta, 'expr_for_metacycle.csv')

# Period range 20-28h: standard circadian search window
# Adjust for ultradian (4-12h) or infradian (>28h) rhythms
meta2d('expr_for_metacycle.csv', filestyle = 'csv',
       minper = 20, maxper = 28,
       timepoints = sort(unique(meta$time_hours)),
       outdir = 'metacycle_results')
```

### Python (CosinorPy)

```python
from cosinorpy import file_parser, cosinor

# fit_group expects long-format DataFrame with columns 'x' (time), 'y' (expression), 'test' (gene name)
# Reshape expression matrix to long format before passing
# period=24: standard circadian; adjust for other periodicities
results = cosinor.fit_group(expr_long, period=24, n_components=1)
rhythmic = results[results['p'] < 0.05]
```

## Step 4b: GAM Trajectory Fitting

### R (mgcv)

```r
library(mgcv)

cluster_trajectories <- list()
for (cl_id in 1:n_clusters) {
    cl_genes <- names(cl$cluster[cl$cluster == cl_id])
    mean_profile <- colMeans(expr_sig[cl_genes, ])

    df_gam <- data.frame(time = meta$time, expr = mean_profile)

    # k: basis dimension; k=5 sufficient for most time courses
    # Increase to k=10 for >20 time points; decrease to k=3 for <6 time points
    gam_fit <- gam(expr ~ s(time, k = 5), data = df_gam)

    cluster_trajectories[[cl_id]] <- list(
        fit = gam_fit,
        r_squared = summary(gam_fit)$r.sq,
        edf = summary(gam_fit)$edf
    )
    message(sprintf('Cluster %d: R^2 = %.3f, EDF = %.2f', cl_id,
                    summary(gam_fit)$r.sq, summary(gam_fit)$edf))
}
```

### Python (pygam)

```python
from pygam import LinearGAM, s
import numpy as np

for cl_id in range(n_clusters):
    cl_mask = labels == cl_id
    mean_profile = expr_scaled[cl_mask].mean(axis=0)

    # n_splines=5: sufficient for most time courses
    gam = LinearGAM(s(0, n_splines=5)).fit(meta['time'].values.reshape(-1, 1), mean_profile)
    print(f'Cluster {cl_id}: GCV = {gam.statistics_["GCV"]:.4f}')
```

## Step 5: Per-Cluster Pathway Enrichment

### R (clusterProfiler)

```r
library(clusterProfiler)
library(org.Hs.eg.db)

# Background = all temporal genes tested (not the full genome)
all_temporal_entrez <- bitr(rownames(expr_sig), fromType = 'SYMBOL', toType = 'ENTREZID',
                            OrgDb = org.Hs.eg.db)

enrichment_results <- list()
for (i in seq_along(core_genes)) {
    genes <- core_genes[[i]]$NAME

    entrez <- bitr(genes, fromType = 'SYMBOL', toType = 'ENTREZID', OrgDb = org.Hs.eg.db)

    # GO Biological Process enrichment with proper background
    ego <- enrichGO(gene = entrez$ENTREZID,
                    universe = all_temporal_entrez$ENTREZID,
                    OrgDb = org.Hs.eg.db,
                    ont = 'BP', pAdjustMethod = 'BH',
                    pvalueCutoff = 0.05, qvalueCutoff = 0.05,
                    readable = TRUE)

    # Simplify redundant GO terms (parent-child hierarchy creates redundancy)
    if (nrow(as.data.frame(ego)) > 0) {
        ego <- simplify(ego, cutoff = 0.7, by = 'p.adjust')
    }

    enrichment_results[[i]] <- ego
    message(sprintf('Cluster %d: %d significant GO terms', i, nrow(as.data.frame(ego))))
}
```

### Python (gseapy)

```python
import gseapy as gp

# Use all temporal genes as background for proper enrichment statistics
all_temporal_genes = list(expr_sig.index)

for cl_id in range(n_clusters):
    cl_genes = [g for g, l in zip(expr_sig.index, labels) if l == cl_id]

    enr = gp.enrichr(gene_list=cl_genes, gene_sets='GO_Biological_Process_2023',
                     organism='human', background=all_temporal_genes,
                     outdir=f'enrichr_cluster_{cl_id}')
    sig_terms = enr.results[enr.results['Adjusted P-value'] < 0.05]
    print(f'Cluster {cl_id}: {len(sig_terms)} significant GO terms')
```

### QC Checkpoint: Enrichment

```r
# Gate: At least 3 clusters should have significant GO terms
clusters_with_terms <- sum(sapply(enrichment_results, function(x) nrow(as.data.frame(x)) > 0))
message(sprintf('Clusters with significant GO terms: %d / %d', clusters_with_terms, length(enrichment_results)))
if (clusters_with_terms < 3) {
    message('WARNING: Few clusters enriched. Check gene ID conversion or relax thresholds.')
}
```

## Parameter Recommendations

| Step | Parameter | Recommendation |
|------|-----------|----------------|
| Temporal DE | Spline df | 3 (default); increase to 4-5 for >10 time points |
| Temporal DE | FDR | 0.05 (standard); 0.1 for exploratory clustering |
| Mfuzz | fuzzifier m | Use mestimate(); typical range 1.5-2.5 |
| Mfuzz | n_clusters | 4-20; start with sqrt(n_genes/2), refine with gap statistic |
| Mfuzz | min membership | 0.5 (core genes); 0.3 (exploratory) |
| MetaCycle | period range | 20-28h (circadian); adjust for other periodicities |
| GAM | k (basis dim) | 5 (default); 3 for <6 time points; 10 for >20 |
| clusterProfiler | pvalueCutoff | 0.05 (standard); 0.1 (permissive) |

## Troubleshooting

| Issue | Likely Cause | Solution |
|-------|--------------|----------|
| < 100 temporal genes | Insufficient replicates or noisy data | Add replicates; use DESeq2 LRT instead of limma |
| Empty Mfuzz clusters | Too many clusters | Reduce n_clusters; check gap statistic |
| All genes in one cluster | Fuzzifier too low or too few clusters | Increase m or n_clusters |
| No rhythmic genes | Non-circadian design or low power | Verify 24h+ sampling; increase resolution |
| GAM overfitting | k too high for time points | Set k = min(n_timepoints - 1, 5) |
| Few enriched clusters | Gene ID conversion failure | Check species; verify Entrez ID mapping |
| Low membership scores | High expression noise | Increase fuzzifier m; apply stricter gene filtering |

## Related Skills

- differential-expression/timeseries-de - Temporal DE methods
- temporal-genomics/temporal-clustering - Cluster analysis details
- temporal-genomics/circadian-rhythms - Rhythm detection details
- temporal-genomics/trajectory-modeling - GAM fitting details
- pathway-analysis/go-enrichment - Enrichment analysis details
- temporal-genomics/periodicity-detection - Unknown-period discovery
More from GPTomics/bioSkills