bio-machine-learning-prediction-explanation
$
npx mdskill add GPTomics/bioSkills/bio-machine-learning-prediction-explanationExplains ML predictions on omics data using SHAP and LIME
- Identifies which genes or features drive classifier decisions
- Uses SHAP values and LIME for feature attribution
- Analyzes tree-based models with TreeExplainer for exact Shapley values
- Returns feature importance scores and visual explanations
SKILL.md
.github/skills/bio-machine-learning-prediction-explanationView on GitHub ↗
---
name: bio-machine-learning-prediction-explanation
description: Explains machine learning predictions on omics data using SHAP values and LIME for feature attribution. Identifies which genes or features drive classifier decisions. Use when interpreting biomarker classifiers or understanding model predictions.
tool_type: python
primary_tool: shap
---
## Version Compatibility
Reference examples tested with: matplotlib 3.8+, numpy 1.26+, pandas 2.2+, scikit-learn 1.4+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Model Interpretation for Omics Classifiers
**"Which genes drive my classifier's predictions?"** → Compute per-feature attribution scores using SHAP values or LIME to explain which genes or features contribute most to model decisions.
- Python: `shap.TreeExplainer(model).shap_values(X)`, `lime.lime_tabular.LimeTabularExplainer()`
## SHAP TreeExplainer
**Goal:** Compute exact SHAP values for tree-based models to quantify each feature's contribution to predictions.
**Approach:** Use TreeExplainer for polynomial-time exact Shapley value computation on Random Forest or boosted tree models.
```python
import shap
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
explainer = shap.TreeExplainer(model)
# CORRECT (v0.47+): Call explainer directly, NOT .shap_values()
shap_values = explainer(X_test)
# shap_values is an Explanation object
# .values has shape (n_samples, n_features) for binary
# .base_values has expected value
print(f'SHAP values shape: {shap_values.values.shape}')
```
## Summary Plot (Global Feature Importance)
```python
import shap
import matplotlib.pyplot as plt
# Beeswarm plot: shows impact direction and magnitude
shap.plots.beeswarm(shap_values, max_display=20, show=False)
plt.tight_layout()
plt.savefig('shap_summary.png', dpi=150, bbox_inches='tight')
plt.close()
# Bar plot: mean absolute SHAP values
shap.plots.bar(shap_values, max_display=20, show=False)
plt.savefig('shap_bar.png', dpi=150, bbox_inches='tight')
```
## Force Plot (Individual Prediction)
```python
# Explain single prediction
sample_idx = 0
shap.plots.force(shap_values[sample_idx], matplotlib=True, show=False)
plt.savefig('shap_force_single.png', dpi=150, bbox_inches='tight')
# Waterfall plot (cleaner alternative)
shap.plots.waterfall(shap_values[sample_idx], max_display=15, show=False)
plt.savefig('shap_waterfall.png', dpi=150, bbox_inches='tight')
```
## SHAP for XGBoost
```python
from xgboost import XGBClassifier
import shap
xgb = XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss')
xgb.fit(X_train, y_train)
explainer = shap.TreeExplainer(xgb)
shap_values = explainer(X_test)
# For XGBoost, shap_values contains log-odds contributions
shap.plots.beeswarm(shap_values, max_display=20)
```
## LIME (Local Interpretable Model-agnostic Explanations)
```python
from lime.lime_tabular import LimeTabularExplainer
import numpy as np
explainer = LimeTabularExplainer(
X_train.values,
feature_names=X_train.columns.tolist(),
class_names=['control', 'disease'],
mode='classification'
)
# Explain single instance
sample_idx = 0
exp = explainer.explain_instance(
X_test.iloc[sample_idx].values,
model.predict_proba,
num_features=20
)
exp.save_to_file('lime_explanation.html')
# Or get as list: exp.as_list()
```
## Extract Top Features from SHAP
```python
import pandas as pd
import numpy as np
# Mean absolute SHAP value per feature
mean_shap = np.abs(shap_values.values).mean(axis=0)
feature_importance = pd.DataFrame({
'feature': X_test.columns,
'mean_shap': mean_shap
}).sort_values('mean_shap', ascending=False)
top_features = feature_importance.head(20)
top_features.to_csv('shap_top_features.csv', index=False)
```
## Dependence Plot (Feature Interactions)
```python
# Shows how SHAP value varies with feature value
# Automatically colors by interacting feature
shap.plots.scatter(shap_values[:, 'GENE1'], color=shap_values, show=False)
plt.savefig('shap_dependence.png', dpi=150, bbox_inches='tight')
```
## Multi-class SHAP
```python
explainer = shap.TreeExplainer(model)
shap_values = explainer(X_test)
# For multi-class, shap_values.values has shape (n_samples, n_features, n_classes)
# Access class-specific values:
class_idx = 1
shap.plots.beeswarm(shap_values[:, :, class_idx], max_display=20)
```
## Related Skills
- machine-learning/omics-classifiers - Train models to interpret
- machine-learning/biomarker-discovery - Compare with selection-based importance
- data-visualization/heatmaps-clustering - Visualize SHAP values as heatmap
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.