performance-extraction

Name: performance-extraction
Author: yogsoth-ai/de-anthropocentric-research-engine

$ npx mdskill add yogsoth-ai/de-anthropocentric-research-engine/performance-extraction

Extract structured performance data from research papers and leaderboards

Gathers performance metrics for AI methods together with the conditions under which they were measured
Draws on original papers, leaderboards, and reproducibility studies
Treats original papers as the primary source and cross-references them against leaderboards and third-party reproductions
Delivers tuples of (Task, Dataset, Metric, Score, Conditions) with full provenance

SKILL.md

.github/skills/performance-extraction

---
name: performance-extraction
description: Systematically extract performance data and conditions from papers — 30 methods, 150 data points, 40 web searches budget
used-by: baseline-establishment
---

# Performance Extraction


## Purpose

Extract structured performance data from papers, leaderboards, and reproducibility studies. Each data point is a (Task, Dataset, Metric, Score, Conditions) tuple with full provenance. Original papers are treated as the primary source, and their numbers are cross-referenced against leaderboards and third-party reproductions.

## Budget

| Resource | Floor | Target |
|----------|-------|--------|
| Methods covered | 20 | 30 |
| Data points extracted | 100 | 150 |
| Web searches | 25 | 40 |
| Papers read | 15 | 30 |

## State Ledger

```
<HARD-GATE>
| Metric | Current | Target | Status |
|--------|---------|--------|--------|
| Methods covered | 0 | 30 | BLOCKED |
| Data points extracted | 0 | 150 | BLOCKED |
| Web searches used | 0 | 40 | — |
| Papers read | 0 | 30 | — |
| Datasets covered | 0 | 5 | — |
| Metrics tracked | 0 | 3 | — |
</HARD-GATE>
```

Exit is blocked until `data_points >= 120`, which is 80% of the 150-point target.
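
The exit rule above can be written as a small check. This is a minimal sketch and not part of the skill itself: the `Ledger` class and the `exit_threshold` and `can_exit` functions are hypothetical names. It models only the data-point rule stated here, not any gating on the other ledger rows.

```python
from dataclasses import dataclass

# Values taken from the State Ledger and the exit rule above.
TARGET_DATA_POINTS = 150
EXIT_PERCENT = 80


@dataclass
class Ledger:
    """Hypothetical in-memory mirror of the ledger counters."""
    methods_covered: int = 0
    data_points: int = 0
    web_searches_used: int = 0
    papers_read: int = 0


def exit_threshold(target: int = TARGET_DATA_POINTS, percent: int = EXIT_PERCENT) -> int:
    """Smallest data-point count that reaches `percent` of `target` (rounded up)."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return (target * percent + 99) // 100


def can_exit(ledger: Ledger) -> bool:
    """True once the data-point count reaches the exit threshold."""
    return ledger.data_points >= exit_threshold()
```

With the ledger's numbers, `exit_threshold()` evaluates to 120, so a ledger holding 119 data points stays blocked and one holding 120 may exit.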

## Available Tactics

- **leaderboard-harvesting** — Bulk data collection from structured platforms

## Available SOPs

- **score-extraction** — Extract tuples from individual papers
- **condition-cataloging** — Record conditions alongside scores

## Execution Guidance

1. For each method in the inventory, locate the original paper
2. Apply the score-extraction SOP to each paper to pull all reported results
3. Cross-reference the extracted scores against Papers With Code and other benchmark leaderboards
4. Apply the condition-cataloging SOP to record the experimental setup for each score
5. Flag scores that lack essential condition information
6. Track provenance: record which table or figure in which paper each score comes from
7. Prefer results from official implementations over third-party reimplementations
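
Step 5 can be sketched as a simple filter over extracted data points. The skill does not say which conditions count as essential, so this sketch assumes that every key of the `conditions` object in the Output Format is essential. The names `ESSENTIAL_CONDITIONS`, `missing_conditions`, and `flag_points` are hypothetical.

```python
from typing import Any

# Assumption: every key of the "conditions" object in the Output Format is
# treated as essential. The skill does not define "essential" itself.
ESSENTIAL_CONDITIONS = (
    "hardware",
    "training_data_size",
    "hyperparams_reported",
    "seeds_reported",
    "compute_budget",
)


def missing_conditions(point: dict[str, Any]) -> list[str]:
    """Return the essential condition keys that are absent, None, or blank."""
    conditions = point.get("conditions") or {}
    missing = []
    for key in ESSENTIAL_CONDITIONS:
        value = conditions.get(key)
        # False is a valid answer for the *_reported booleans, so only
        # None and blank strings count as missing.
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(key)
    return missing


def flag_points(points: list[dict[str, Any]]) -> list[str]:
    """Build one human-readable flag per data point that has gaps."""
    flags = []
    for point in points:
        gaps = missing_conditions(point)
        if gaps:
            method = point.get("method", "?")
            dataset = point.get("dataset", "?")
            flags.append(f"{method} on {dataset}: missing {', '.join(gaps)}")
    return flags
```

The strings returned by `flag_points` are one possible shape for the `missing_data_flags` entries in the output. The skill does not fix their format.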

## Output Format

```json
{
  "data_points": [
    {
      "method": "string",
      "task": "string",
      "dataset": "string",
      "split": "test|val|dev",
      "metric": "string",
      "score": 0.0,
      "confidence_interval": [0.0, 0.0],
      "conditions": {
        "hardware": "string",
        "training_data_size": "string",
        "hyperparams_reported": true,
        "seeds_reported": true,
        "compute_budget": "string"
      },
      "provenance": {
        "paper_id": "string",
        "table_or_figure": "string",
        "is_primary_source": true
      }
    }
  ],
  "coverage_summary": {
    "methods_covered": 0,
    "datasets_covered": 0,
    "metrics_tracked": [],
    "missing_data_flags": []
  }
}
```
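
The `coverage_summary` block can be derived from the `data_points` list rather than maintained by hand. The following is a rough sketch under two assumptions: `metrics_tracked` holds distinct metric names (the template only shows an empty array), and `missing_data_flags` is filled elsewhere. The function names `summarize_coverage` and `build_report` are hypothetical.

```python
import json


def summarize_coverage(data_points: list[dict]) -> dict:
    """Derive the coverage_summary counters from a list of data points."""
    methods = {p["method"] for p in data_points}
    datasets = {p["dataset"] for p in data_points}
    # Assumption: metrics_tracked lists distinct metric names, sorted for stable output.
    metrics = sorted({p["metric"] for p in data_points})
    return {
        "methods_covered": len(methods),
        "datasets_covered": len(datasets),
        "metrics_tracked": metrics,
        "missing_data_flags": [],  # left empty; flagging is out of scope for this sketch
    }


def build_report(data_points: list[dict]) -> str:
    """Serialize data points plus their derived summary in the Output Format layout."""
    report = {
        "data_points": data_points,
        "coverage_summary": summarize_coverage(data_points),
    }
    return json.dumps(report, indent=2)
```

Deriving the summary this way keeps the counters consistent with the extracted tuples, so the ledger's "Methods covered" and "Datasets covered" rows can be read from the same data.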
