bio-workflows-biomarker-pipeline
$
npx mdskill add GPTomics/bioSkills/bio-workflows-biomarker-pipelineBuilds validated biomarker panels from omics data using machine learning pipelines
- Solves the task of discovering diagnostic or prognostic biomarker signatures from expression data
- Depends on scikit-learn, pandas, numpy, and machine learning subskills for feature selection and model validation
- Uses Boruta/LASSO for feature selection, nested CV for model training, and SHAP for interpretation
- Delivers validated biomarker panels with performance metrics and interpretable feature importance
SKILL.md
.github/skills/bio-workflows-biomarker-pipelineView on GitHub ↗
---
name: bio-workflows-biomarker-pipeline
description: End-to-end biomarker discovery workflow from expression data to validated biomarker panels. Covers feature selection with Boruta/LASSO, classifier training with nested CV, and SHAP interpretation. Use when building and validating diagnostic or prognostic biomarker signatures from omics data.
tool_type: python
primary_tool: sklearn
workflow: true
depends_on:
- machine-learning/biomarker-discovery
- machine-learning/model-validation
- machine-learning/omics-classifiers
- machine-learning/prediction-explanation
qc_checkpoints:
- after_selection: "Selected features >5 and <200, stability >0.6"
- after_cv: "Nested CV AUC >0.7, fold variance <0.1"
- after_interpretation: "Top 20 SHAP features: >50% overlap with selected features"
---
## Version Compatibility
Reference examples tested with: matplotlib 3.8+, numpy 1.26+, pandas 2.2+, scanpy 1.10+, scikit-learn 1.4+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Biomarker Discovery Pipeline
**"Build a validated biomarker panel from my omics data"** → Orchestrate feature selection (Boruta/LASSO), nested cross-validation classifier training, and SHAP interpretation to produce a robust, validated biomarker signature.
Complete pipeline from expression data to validated biomarker panels with classifier.
## Workflow Overview
```
Expression matrix + Metadata
|
v
[1. Data Preparation] -----> StandardScaler, train/test split
|
v
[2. Feature Selection] ----> Boruta or LASSO stability selection
|
v
[3. Model Training] -------> RandomForest/XGBoost with nested CV
|
v
[4. Model Interpretation] -> SHAP values, feature importance
|
v
[5. Validation] -----------> Hold-out test, bootstrap CI
|
v
Validated biomarker panel + classifier
```
## Step 1: Data Preparation
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
expr = pd.read_csv('expression.csv', index_col=0)
meta = pd.read_csv('metadata.csv', index_col=0)
X = expr.T # samples x genes
y = meta.loc[X.index, 'condition'].values
# test_size=0.2: Standard 80/20 split; use 0.3 for <100 samples
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
# Fit scaler on training only to prevent data leakage
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
**QC Checkpoint 1:** Check class balance, sample counts per group
- Minimum 10 samples per class recommended
- Classes should be reasonably balanced (ratio <3:1)
## Step 2: Feature Selection
### Option A: Boruta (All-Relevant Selection)
```python
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
# Pre-filter if >10k features
if X_train_scaled.shape[1] > 10000:
selector = SelectKBest(f_classif, k=5000)
selector.fit(X_train_scaled, y_train)
X_train_filt = X_train_scaled[:, selector.get_support()]
feature_mask = selector.get_support()
else:
X_train_filt = X_train_scaled
feature_mask = None
# max_depth=5: Shallow trees for stable importances
rf = RandomForestClassifier(n_estimators=100, max_depth=5, n_jobs=-1, random_state=42)
# max_iter=100: Usually sufficient; 200 if many tentative
boruta = BorutaPy(rf, n_estimators='auto', max_iter=100, random_state=42, verbose=0)
boruta.fit(X_train_filt, y_train)
selected_idx = boruta.support_
print(f'Selected {selected_idx.sum()} features')
```
### Option B: LASSO Stability Selection
```python
from sklearn.linear_model import LogisticRegressionCV
import numpy as np
# n_bootstrap=100: Quick; use 500 for publication
n_bootstrap = 100
stability_scores = np.zeros(X_train_scaled.shape[1])
for i in range(n_bootstrap):
idx = np.random.choice(len(y_train), size=len(y_train), replace=True)
# Cs=10: 10 regularization values to search
model = LogisticRegressionCV(penalty='l1', solver='saga', Cs=10, cv=3, random_state=i, max_iter=1000)
model.fit(X_train_scaled[idx], y_train[idx])
stability_scores += (model.coef_[0] != 0).astype(int)
stability_scores /= n_bootstrap
# stability_threshold=0.6: Standard; 0.8 for strict
selected_idx = stability_scores > 0.6
print(f'Selected {selected_idx.sum()} features (stability >0.6)')
```
**QC Checkpoint 2:**
- Selected features: 5-200 range
- Too few (<5): lower threshold, increase iterations
- Too many (>200): increase threshold, add pre-filtering
## Step 3: Model Training with Nested CV
```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
X_train_sel = X_train_scaled[:, selected_idx]
X_test_sel = X_test_scaled[:, selected_idx]
# outer_cv=5: Standard for performance estimation
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# n_estimators=100: Sufficient for most omics
clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
cv_scores = cross_val_score(clf, X_train_sel, y_train, cv=outer_cv, scoring='roc_auc')
print(f'Nested CV AUC: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}')
```
**QC Checkpoint 3:**
- AUC >0.7 acceptable, >0.8 good
- Fold variance <0.1 (stable performance)
- Check for overfitting: train AUC should not be >>test AUC
## Step 4: Model Interpretation
```python
import shap
import matplotlib.pyplot as plt
clf.fit(X_train_sel, y_train)
# SHAP v0.47+: call explainer directly
explainer = shap.TreeExplainer(clf)
shap_values = explainer(X_train_sel)
# Beeswarm: shows importance AND direction
shap.plots.beeswarm(shap_values, max_display=20, show=False)
plt.tight_layout()
plt.savefig('shap_beeswarm.png', dpi=150, bbox_inches='tight')
plt.close()
# Extract top features
import numpy as np
mean_shap = np.abs(shap_values.values).mean(axis=0)
top_shap_idx = np.argsort(mean_shap)[-20:]
```
**QC Checkpoint 4:**
- Top 20 SHAP features should have >50% overlap with selected features
- SHAP directions should be biologically plausible
## Step 5: Final Validation
```python
from sklearn.metrics import roc_auc_score, classification_report
import numpy as np
y_prob = clf.predict_proba(X_test_sel)[:, 1]
test_auc = roc_auc_score(y_test, y_prob)
print(f'Hold-out test AUC: {test_auc:.3f}')
# Bootstrap CI for AUC
# n_bootstrap=1000: Standard for publication-quality CI
n_bootstrap = 1000
boot_aucs = []
for i in range(n_bootstrap):
idx = np.random.choice(len(y_test), size=len(y_test), replace=True)
boot_aucs.append(roc_auc_score(y_test[idx], y_prob[idx]))
ci_lower, ci_upper = np.percentile(boot_aucs, [2.5, 97.5])
print(f'95% CI: [{ci_lower:.3f}, {ci_upper:.3f}]')
print(classification_report(y_test, clf.predict(X_test_sel)))
```
## Parameter Recommendations
| Step | Parameter | Recommendation |
|------|-----------|----------------|
| Split | test_size | 0.2 (standard), 0.3 for small datasets |
| Boruta | max_iter | 100 (sufficient), 200 if tentative features |
| LASSO | n_bootstrap | 100 (quick), 500 for publication |
| LASSO | stability_threshold | 0.6 (standard), 0.8 for strict |
| Nested CV | outer_folds | 5 (standard), 10 for small datasets |
| Nested CV | inner_folds | 3 (sufficient for tuning) |
| RF | n_estimators | 100-500 |
| XGBoost | learning_rate | 0.1 (conservative) |
## Troubleshooting
| Issue | Likely Cause | Solution |
|-------|--------------|----------|
| No features selected | Too strict threshold | Lower stability threshold, increase iterations |
| Too many features (>200) | Noisy data | Add pre-filtering, increase regularization |
| Low CV AUC (<0.6) | No signal, low power | Check data quality, add samples |
| High variance across folds | Small sample size | Use more folds, LOOCV |
| SHAP features differ from selected | Model using different signal | Review feature correlations |
## Export Results
```python
import pandas as pd
import joblib
# Save biomarker panel
feature_names = X_train.columns[selected_idx].tolist()
pd.DataFrame({'feature': feature_names}).to_csv('biomarker_panel.csv', index=False)
# Save model and scaler for deployment
joblib.dump(clf, 'biomarker_classifier.joblib')
joblib.dump(scaler, 'feature_scaler.joblib')
```
## Related Skills
- machine-learning/biomarker-discovery - Detailed feature selection methods
- machine-learning/model-validation - Nested CV implementation details
- machine-learning/omics-classifiers - Classifier options and tuning
- machine-learning/prediction-explanation - SHAP and LIME interpretation
- differential-expression/de-results - Pre-filter with DE genes
- pathway-analysis/go-enrichment - Functional enrichment of biomarkers
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.