bio-machine-learning-survival-analysis

Name: bio-machine-learning-survival-analysis
Author: GPTomics/bioSkills

$npx mdskill add GPTomics/bioSkills/bio-machine-learning-survival-analysis

Estimate patient survival using Kaplan-Meier curves and Cox regression.

Predicts time-to-event outcomes from clinical and omics features.
Depends on the lifelines Python library for survival analysis.
Selects statistical methods based on data type and censoring patterns.
Delivers survival curves, p-values, and hazard ratios via plots.

SKILL.md

.github/skills/bio-machine-learning-survival-analysisView on GitHub ↗

---
name: bio-machine-learning-survival-analysis
description: Analyzes time-to-event data using Kaplan-Meier curves, log-rank tests, and Cox proportional hazards regression with lifelines. Builds survival models from clinical and omics features. Use when predicting patient survival or modeling time-to-event outcomes.
tool_type: python
primary_tool: lifelines
---

## Version Compatibility

Reference examples tested with: matplotlib 3.8+, pandas 2.2+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Survival Prediction with lifelines

**"Analyze patient survival data"** → Estimate survival curves (Kaplan-Meier), compare groups (log-rank test), and model time-to-event outcomes with Cox proportional hazards regression.
- Python: `lifelines.KaplanMeierFitter()`, `lifelines.CoxPHFitter()`

## Kaplan-Meier Curves

**Goal:** Estimate and visualize the survival probability function from time-to-event data.

**Approach:** Fit a nonparametric Kaplan-Meier estimator to censored survival data and plot the step function.

```python
from lifelines import KaplanMeierFitter
import matplotlib.pyplot as plt

kmf = KaplanMeierFitter()

# T: time to event or censoring
# E: event indicator (1=event occurred, 0=censored)
kmf.fit(T, event_observed=E)

# Plot survival curve
kmf.plot_survival_function()
plt.xlabel('Time (months)')
plt.ylabel('Survival probability')
plt.savefig('km_curve.png', dpi=150)
```

## Compare Groups with Log-Rank Test

**Goal:** Test whether survival distributions differ significantly between risk groups.

**Approach:** Fit separate Kaplan-Meier curves per group, overlay them, and apply a log-rank test for statistical comparison.

```python
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 6))

for group, color in zip(['high', 'low'], ['red', 'blue']):
    mask = df['risk_group'] == group
    kmf = KaplanMeierFitter()
    kmf.fit(df.loc[mask, 'time'], event_observed=df.loc[mask, 'event'], label=group)
    kmf.plot_survival_function(ax=ax, color=color)

# Log-rank test
high = df[df['risk_group'] == 'high']
low = df[df['risk_group'] == 'low']
results = logrank_test(high['time'], low['time'], event_observed_A=high['event'], event_observed_B=low['event'])
print(f'Log-rank p-value: {results.p_value:.4e}')

ax.set_xlabel('Time (months)')
ax.set_ylabel('Survival probability')
ax.set_title(f'Log-rank p = {results.p_value:.4e}')
plt.savefig('km_comparison.png', dpi=150)
```

## Cox Proportional Hazards Regression

**Goal:** Model the effect of covariates on survival time using a semi-parametric hazard model.

**Approach:** Fit a Cox PH model to extract hazard ratios, confidence intervals, and a concordance index for predictive accuracy.

```python
from lifelines import CoxPHFitter

# Prepare data: must have 'time' and 'event' columns
# Include covariates as additional columns
cph = CoxPHFitter()
cph.fit(df, duration_col='time', event_col='event')

# Summary with hazard ratios
cph.print_summary()

# Get hazard ratios as DataFrame
hr = cph.summary[['exp(coef)', 'exp(coef) lower 95%', 'exp(coef) upper 95%', 'p']]
print(hr)

# Concordance index (c-index): 0.5=random, 1.0=perfect
print(f'C-index: {cph.concordance_index_:.3f}')
```

## Multivariate Cox Model

**Goal:** Assess the independent prognostic value of clinical and molecular features in a single model.

**Approach:** Combine clinical covariates and gene expression values into a regularized Cox model to identify independently prognostic variables.

```python
from lifelines import CoxPHFitter
import pandas as pd

# Combine clinical and omics features
cox_df = pd.DataFrame({
    'time': meta['survival_months'],
    'event': meta['vital_status'],
    'age': meta['age'],
    'stage': meta['stage_numeric'],
    'GENE1': expr.loc['GENE1'],
    'GENE2': expr.loc['GENE2']
})

cph = CoxPHFitter(penalizer=0.1)  # L2 regularization for stability
cph.fit(cox_df, duration_col='time', event_col='event')
cph.print_summary()
```

## Predict Risk Scores

**Goal:** Stratify patients into risk groups based on a fitted Cox model.

**Approach:** Compute partial hazard scores from model coefficients and split at the median to define high/low risk groups for downstream KM visualization.

```python
# Partial hazard (risk score)
risk_scores = cph.predict_partial_hazard(cox_df)

# Median risk split for KM plot
df['risk_group'] = (risk_scores > risk_scores.median()).map({True: 'high', False: 'low'})
```

## Check Proportional Hazards Assumption

**Goal:** Verify that the proportional hazards assumption holds for all covariates.

**Approach:** Run the built-in Schoenfeld residual tests and inspect diagnostic plots for time-varying effects.

```python
# Test PH assumption
cph.check_assumptions(df, p_value_threshold=0.05, show_plots=True)
```

## Survival at Specific Time

**Goal:** Extract survival probability estimates at clinically meaningful time points.

**Approach:** Query the fitted Kaplan-Meier survival function at specific durations and report the median survival time.

```python
# Survival probability at specific times
survival_probs = kmf.survival_function_at_times([12, 24, 60])
print(survival_probs)

# Median survival
print(f'Median survival: {kmf.median_survival_time_:.1f}')
```

## Feature Selection for Survival

**Goal:** Screen thousands of genes to identify those significantly associated with patient survival.

**Approach:** Fit univariate Cox models for each gene, extract hazard ratios and p-values, and rank candidates for multivariate modeling.

```python
from lifelines import CoxPHFitter
import pandas as pd

# Univariate screening
results = []
for gene in expr.index[:1000]:
    cox_df = pd.DataFrame({
        'time': meta['survival_months'],
        'event': meta['vital_status'],
        'gene': expr.loc[gene]
    })
    cph = CoxPHFitter()
    cph.fit(cox_df, duration_col='time', event_col='event')
    results.append({
        'gene': gene,
        'hr': cph.hazard_ratios_['gene'],
        'p': cph.summary.loc['gene', 'p']
    })

results_df = pd.DataFrame(results)
sig_genes = results_df[results_df['p'] < 0.05].sort_values('p')
```

## Related Skills

- clinical-databases/variant-prioritization - Clinical variant interpretation
- differential-expression/de-results - Find DE genes for survival model
- machine-learning/biomarker-discovery - Select predictive features