score-trajectory-analysis
$
npx mdskill add yogsoth-ai/de-anthropocentric-research-engine/score-trajectory-analysisAnalyzes historical benchmark scores to detect saturation and inflection points
- Tracks progress of SOTA scores over time to identify trends and phase shifts
- Uses Papers With Code, official leaderboards, and web search tools to gather data
- Fits saturation curves and identifies inflection points using statistical models
- Returns time-series data, fitted curves, and insights about benchmark saturation
SKILL.md
.github/skills/score-trajectory-analysisView on GitHub ↗
---
name: score-trajectory-analysis
description: Collect historical scores, fit saturation curves, detect inflection points
execution: tactic
used-by: benchmark-archaeology
---
# Score Trajectory Analysis Tactic
Collect historical SOTA scores for a benchmark, arrange as time-series, fit saturation curves, and detect inflection points indicating phase transitions in benchmark difficulty.
## Stages
### Stage 1: Multi-Source Score Collection
Gather historical scores from multiple sources to build comprehensive timeline.
**Sources** (search in order):
1. Papers With Code leaderboards (primary)
2. Official benchmark leaderboards/websites
3. Individual papers reporting SOTA (via dare-ss, dare-scholar)
4. Blog posts and technical reports (via brave-search, dare-web)
**Per data point, collect**:
- Model name and family
- Score (primary metric)
- Date (publication/release date)
- Paper/source reference
- Model size (parameters) if available
- Training data scale if available
**Minimum**: 10 data points spanning at least 2 years.
### Stage 2: Time-Series Arrangement
- Sort by date
- Compute SOTA envelope (monotonically non-decreasing maximum)
- Identify score jumps > 2 standard deviations
- Note human baseline and theoretical ceiling positions
- Flag suspicious entries (unreplicated claims, withdrawn papers)
### Stage 3: Curve Fitting
Fit multiple saturation models to the SOTA envelope:
- **Logistic**: S(t) = L / (1 + e^(-k(t-t0)))
- **Exponential decay to ceiling**: S(t) = C - (C-S0) * e^(-lambda*t)
- **Linear** (for pre-saturation benchmarks)
- **Piecewise** (for benchmarks with phase transitions)
Report goodness-of-fit (R-squared) for each model. Select best-fit.
### Stage 4: Saturation/Inflection Detection
Classify benchmark status:
- **Pre-saturation**: Linear or early logistic, >20% headroom remaining
- **Approaching**: Mid-logistic, 5-20% headroom, decelerating gains
- **Saturated**: <5% headroom, gains < noise level
- **Supersaturated**: Multiple models at ceiling, benchmark no longer discriminative
Detect inflection points:
- Acceleration phases (new paradigm unlocks rapid progress)
- Deceleration phases (diminishing returns begin)
- Step functions (single breakthrough causes discontinuous jump)
## Output
```yaml
trajectory:
benchmark: string
metric: string
data_points: int
time_span: string
sota_envelope:
- {date, score, model, source}
best_fit_model: logistic|exponential|linear|piecewise
fit_r_squared: float
saturation_status: pre-saturation|approaching|saturated|supersaturated
headroom: float
inflection_points:
- {date, type: acceleration|deceleration|step, cause: string}
estimated_ceiling: float
time_to_ceiling: string
```
## Yield Report
| Metric | Minimum |
|--------|---------|
| Data points collected | 10 |
| Sources consulted | 3 |
| Curve models fitted | 3 |
| Saturation classification produced | 1 |
More from yogsoth-ai/de-anthropocentric-research-engine
- abductive-hypothesis-generationStrategy: 面对异常的最佳解释推理
- ablation-brainstormRemove components one by one, observe system changes to reveal hidden dependencies and generate ideas from structural gaps.
- ablation-component-mappingMap system architecture to ablatable units for ablation studies
- ablation-designDesign ablation studies to isolate component contributions in ML systems
- ablation-executionRemove components one by one from a system, record the response/impact of each removal.
- abp-vulnerability-classificationClassify assumptions on 2 axes — load-bearing (how much conclusion depends on it) × vulnerable (how likely to be false). Focuses attention on High-Load × High-Vulnerable quadrant.
- abstraction-extractionExtract abstract principles from concrete domain cases. Strips domain-specific details to reveal transferable mechanisms.
- abstraction-ladderPerform bisociation at multiple abstraction levels
- abstraction-ladderingMove between concrete and abstract framings — 3 levels up (Why?) and 3 levels down (How?) to find the most productive research level.
- abstraction-to-designAbstract biological principle to design principle. Bridge from biology to engineering.