score-trajectory-analysis

Name: score-trajectory-analysis
Author: yogsoth-ai/de-anthropocentric-research-engine

$npx mdskill add yogsoth-ai/de-anthropocentric-research-engine/score-trajectory-analysis

Analyzes historical benchmark scores to detect saturation and inflection points

Tracks progress of SOTA scores over time to identify trends and phase shifts
Uses Papers With Code, official leaderboards, and web search tools to gather data
Fits saturation curves and identifies inflection points using statistical models
Returns time-series data, fitted curves, and insights about benchmark saturation

SKILL.md

.github/skills/score-trajectory-analysisView on GitHub ↗

---
name: score-trajectory-analysis
description: Collect historical scores, fit saturation curves, detect inflection points
execution: tactic
used-by: benchmark-archaeology
---

# Score Trajectory Analysis Tactic

Collect historical SOTA scores for a benchmark, arrange as time-series, fit saturation curves, and detect inflection points indicating phase transitions in benchmark difficulty.

## Stages

### Stage 1: Multi-Source Score Collection

Gather historical scores from multiple sources to build comprehensive timeline.

**Sources** (search in order):
1. Papers With Code leaderboards (primary)
2. Official benchmark leaderboards/websites
3. Individual papers reporting SOTA (via dare-ss, dare-scholar)
4. Blog posts and technical reports (via brave-search, dare-web)

**Per data point, collect**:
- Model name and family
- Score (primary metric)
- Date (publication/release date)
- Paper/source reference
- Model size (parameters) if available
- Training data scale if available

**Minimum**: 10 data points spanning at least 2 years.

### Stage 2: Time-Series Arrangement

- Sort by date
- Compute SOTA envelope (monotonically non-decreasing maximum)
- Identify score jumps > 2 standard deviations
- Note human baseline and theoretical ceiling positions
- Flag suspicious entries (unreplicated claims, withdrawn papers)

### Stage 3: Curve Fitting

Fit multiple saturation models to the SOTA envelope:
- **Logistic**: S(t) = L / (1 + e^(-k(t-t0)))
- **Exponential decay to ceiling**: S(t) = C - (C-S0) * e^(-lambda*t)
- **Linear** (for pre-saturation benchmarks)
- **Piecewise** (for benchmarks with phase transitions)

Report goodness-of-fit (R-squared) for each model. Select best-fit.

### Stage 4: Saturation/Inflection Detection

Classify benchmark status:
- **Pre-saturation**: Linear or early logistic, >20% headroom remaining
- **Approaching**: Mid-logistic, 5-20% headroom, decelerating gains
- **Saturated**: <5% headroom, gains < noise level
- **Supersaturated**: Multiple models at ceiling, benchmark no longer discriminative

Detect inflection points:
- Acceleration phases (new paradigm unlocks rapid progress)
- Deceleration phases (diminishing returns begin)
- Step functions (single breakthrough causes discontinuous jump)

## Output

```yaml
trajectory:
  benchmark: string
  metric: string
  data_points: int
  time_span: string
  sota_envelope:
    - {date, score, model, source}
  best_fit_model: logistic|exponential|linear|piecewise
  fit_r_squared: float
  saturation_status: pre-saturation|approaching|saturated|supersaturated
  headroom: float
  inflection_points:
    - {date, type: acceleration|deceleration|step, cause: string}
  estimated_ceiling: float
  time_to_ceiling: string
```

## Yield Report

| Metric | Minimum |
|--------|---------|
| Data points collected | 10 |
| Sources consulted | 3 |
| Curve models fitted | 3 |
| Saturation classification produced | 1 |

More from yogsoth-ai/de-anthropocentric-research-engine