performance-extraction

$ npx mdskill add yogsoth-ai/de-anthropocentric-research-engine/performance-extraction

Extract structured performance data from research papers and leaderboards

  • Gathers performance metrics for AI methods together with the conditions under which they were measured
  • Draws on original papers, leaderboards, and reproducibility studies
  • Prioritizes primary sources (original papers) and cross-references them against leaderboards and third-party reproductions
  • Delivers tuples of (Task, Dataset, Metric, Score, Conditions) with full provenance

SKILL.md

.github/skills/performance-extraction
---
name: performance-extraction
description: Systematically extract performance data and conditions from papers — 30 methods, 150 data points, 40 web searches budget
used-by: baseline-establishment
---

# Performance Extraction


## Purpose

Extract structured performance data from papers, leaderboards, and reproducibility studies. Each data point is a (Task, Dataset, Metric, Score, Conditions) tuple with full provenance. Prioritizes primary sources (original papers) but cross-references against leaderboards and third-party reproductions.

## Budget

| Resource | Floor | Target |
|----------|-------|--------|
| Methods covered | 20 | 30 |
| Data points extracted | 100 | 150 |
| Web searches | 25 | 40 |
| Papers read | 15 | 30 |

## State Ledger

```
<HARD-GATE>
| Metric | Current | Target | Status |
|--------|---------|--------|--------|
| Methods covered | 0 | 30 | BLOCKED |
| Data points extracted | 0 | 150 | BLOCKED |
| Web searches used | 0 | 40 | — |
| Papers read | 0 | 30 | — |
| Datasets covered | 0 | 5 | — |
| Metrics tracked | 0 | 3 | — |
</HARD-GATE>
```

Cannot exit until at least 120 data points have been extracted (`data_points >= 120`, which is 80% of the 150-point target).
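The exit rule can be sketched as a small check. This is a minimal illustration, not part of the skill itself: the names `exit_threshold` and `can_exit` are hypothetical, and only the 150-point target and the 80% ratio come from the ledger above.

```python
# Values taken from the State Ledger and the exit rule.
TARGET_DATA_POINTS = 150
EXIT_PERCENT = 80  # exit is allowed at 80% of the target


def exit_threshold(target: int = TARGET_DATA_POINTS, percent: int = EXIT_PERCENT) -> int:
    """Smallest data-point count that satisfies the exit rule (integer ceiling)."""
    return (target * percent + 99) // 100


def can_exit(data_points: int) -> bool:
    """Hypothetical helper: True once the extracted count reaches the threshold."""
    return data_points >= exit_threshold()


if __name__ == "__main__":
    for count in (0, 119, 120, 150):
        status = "OK" if can_exit(count) else "BLOCKED"
        print(f"data_points={count}: {status}")
```

Integer arithmetic is used for the threshold so the result does not depend on floating-point rounding.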

## Available Tactics

- **leaderboard-harvesting** — Bulk data collection from structured platforms

## Available SOPs

- **score-extraction** — Extract tuples from individual papers
- **condition-cataloging** — Record conditions alongside scores

## Execution Guidance

1. For each method in the inventory, locate the original paper
2. Apply the score-extraction SOP to each paper to pull all reported results
3. Cross-reference the extracted scores against Papers With Code and other benchmark leaderboards
4. Use condition-cataloging to record the experimental setup for each score
5. Flag scores that lack essential condition information
6. Track provenance: which table or figure in which paper each score comes from
7. Prefer results from official implementations over third-party reproductions
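Steps 3 and 5 above could look roughly like the sketch below. It assumes data points shaped like the Output Format that follows. Which condition fields count as "essential" and what tolerance counts as agreement are not specified by this skill, so `ESSENTIAL_CONDITIONS`, `missing_conditions`, `scores_agree`, and the default `tolerance` are all illustrative assumptions.

```python
from typing import Dict, List

# Assumption: these condition fields are treated as essential for illustration only.
ESSENTIAL_CONDITIONS = ("hardware", "training_data_size", "compute_budget")


def missing_conditions(point: Dict) -> List[str]:
    """Return the essential condition fields that are absent or empty (step 5)."""
    conditions = point.get("conditions", {})
    return [name for name in ESSENTIAL_CONDITIONS if not conditions.get(name)]


def scores_agree(paper_score: float, leaderboard_score: float, tolerance: float = 0.5) -> bool:
    """Cross-reference check (step 3): True if the two scores differ by at most
    `tolerance`. The default tolerance is an arbitrary placeholder."""
    return abs(paper_score - leaderboard_score) <= tolerance


if __name__ == "__main__":
    point = {
        "method": "example-method",
        "score": 92.0,
        "conditions": {"hardware": "8x GPU", "training_data_size": "", "compute_budget": None},
    }
    print("missing:", missing_conditions(point))
    print("agrees with leaderboard:", scores_agree(point["score"], 92.0))
```

A point with a non-empty `missing_conditions` result would be a candidate for `missing_data_flags` in the coverage summary.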

## Output Format

```json
{
  "data_points": [
    {
      "method": "string",
      "task": "string",
      "dataset": "string",
      "split": "test|val|dev",
      "metric": "string",
      "score": 0.0,
      "confidence_interval": [0.0, 0.0],
      "conditions": {
        "hardware": "string",
        "training_data_size": "string",
        "hyperparams_reported": true,
        "seeds_reported": true,
        "compute_budget": "string"
      },
      "provenance": {
        "paper_id": "string",
        "table_or_figure": "string",
        "is_primary_source": true
      }
    }
  ],
  "coverage_summary": {
    "methods_covered": 0,
    "datasets_covered": 0,
    "metrics_tracked": [],
    "missing_data_flags": []
  }
}
```
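As a rough illustration of how `coverage_summary` relates to `data_points`, the sketch below derives one from the other. The function name `summarize` and the flag wording are hypothetical. The schema above does not define what a `missing_data_flags` entry looks like, so the string format here is an assumption.

```python
from typing import Dict, List


def summarize(data_points: List[Dict]) -> Dict:
    """Build a coverage_summary dict from a list of data points."""
    methods = {p["method"] for p in data_points}
    datasets = {p["dataset"] for p in data_points}
    metrics = sorted({p["metric"] for p in data_points})

    # Assumed flag format: one string per point whose seeds or
    # hyperparameters are marked as not reported.
    flags = []
    for p in data_points:
        conditions = p.get("conditions", {})
        for field in ("hyperparams_reported", "seeds_reported"):
            if not conditions.get(field, False):
                flags.append(f"{p['method']}/{p['dataset']}: {field} is false or missing")

    return {
        "methods_covered": len(methods),
        "datasets_covered": len(datasets),
        "metrics_tracked": metrics,
        "missing_data_flags": flags,
    }


if __name__ == "__main__":
    sample = [
        {"method": "A", "dataset": "D1", "metric": "accuracy",
         "conditions": {"hyperparams_reported": True, "seeds_reported": True}},
        {"method": "B", "dataset": "D1", "metric": "f1",
         "conditions": {"hyperparams_reported": True, "seeds_reported": False}},
    ]
    print(summarize(sample))
```

Deriving the summary from the data points, rather than maintaining it by hand, keeps the counts consistent with the State Ledger.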
