leaderboard-harvesting

Name: leaderboard-harvesting
Author: yogsoth-ai/de-anthropocentric-research-engine

$ npx mdskill add yogsoth-ai/de-anthropocentric-research-engine/leaderboard-harvesting

Collects performance data from leaderboards and research papers

Solves the problem of finding and organizing benchmark results across platforms
Uses Papers With Code, GitHub, and academic papers as primary data sources
Validates scores by cross-checking them across multiple sources
Produces a deduplicated dataset with provenance tracking for each score

SKILL.md

.github/skills/leaderboard-harvesting

---
name: leaderboard-harvesting
description: Systematically collect performance data from platforms and papers
execution: tactic
used-by: baseline-establishment
---

# Leaderboard Harvesting


## Purpose

Harvest structured performance data from leaderboard platforms (Papers With Code, benchmark-specific sites), survey papers, and official benchmark repositories. Produces deduplicated, provenance-tracked score collections.

## Stages

### Stage 1: Platform Scan

Identify and scrape all relevant leaderboard sources:
- Papers With Code task pages
- Benchmark-specific leaderboards (e.g., GLUE, ImageNet, WMT)
- GitHub benchmark repositories
- Survey papers with comprehensive comparison tables

**Yield**: List of leaderboard URLs + initial method counts per source.
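The Stage 1 yield can be kept in a small record per source. This is a minimal sketch that assumes the scraping itself has already been done; `LeaderboardSource`, its `kind` labels, and `scan_summary` are hypothetical names, and the example.org URLs are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class LeaderboardSource:
    """One scanned leaderboard source (hypothetical record type)."""
    url: str
    kind: str  # e.g. "task_page", "benchmark_site", "github_repo", "survey_table"
    methods: set[str] = field(default_factory=set)


def scan_summary(sources: list[LeaderboardSource]) -> list[tuple[str, int]]:
    """Return the Stage 1 yield: (URL, initial method count) for each source."""
    # Sorted by method count, largest first; ties keep their input order.
    return sorted(
        ((s.url, len(s.methods)) for s in sources),
        key=lambda pair: pair[1],
        reverse=True,
    )


if __name__ == "__main__":
    sources = [
        LeaderboardSource("https://example.org/task-page", "task_page", {"A", "B", "C"}),
        LeaderboardSource("https://example.org/benchmark", "benchmark_site", {"A", "D"}),
    ]
    for url, count in scan_summary(sources):
        print(url, count)
```

Keeping the method names (not just counts) per source makes the later overlap checks in Stage 3 possible without re-scraping.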

### Stage 2: Paper Extraction

For methods not covered by leaderboards, extract scores directly from papers:
- Original method papers (primary source)
- Ablation studies and follow-up papers
- Reproduction studies and benchmarking papers

**Yield**: Raw score tuples with paper provenance.
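Paper extraction can be sketched as turning a results table into raw score tuples that carry their source. The sketch assumes the table has already been converted to pipe-delimited text with methods as rows, datasets as columns, and a single metric for every cell; `ScoreTuple` and `extract_scores` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreTuple:
    method: str
    dataset: str
    metric: str
    value: float
    source: str  # provenance, e.g. a paper identifier


def extract_scores(table: str, metric: str, source: str) -> list[ScoreTuple]:
    """Parse a pipe-delimited results table (methods x datasets) into score tuples."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in table.strip().splitlines()
    ]
    header, body = rows[0], rows[1:]
    datasets = header[1:]
    tuples = []
    for row in body:
        # Skip markdown separator rows such as |---|:---:|.
        if all(set(cell) <= set("-: ") for cell in row):
            continue
        method, cells = row[0], row[1:]
        for dataset, cell in zip(datasets, cells):
            try:
                # Keep only the mean when a cell is written as "mean ± std".
                value = float(cell.split("±")[0])
            except ValueError:
                continue  # blanks and placeholders such as "-" or "n/a"
            tuples.append(ScoreTuple(method, dataset, metric, value, source))
    return tuples
```

Dropping non-numeric cells silently is a design choice; a stricter harvester might log them so that missing scores can be looked up in follow-up papers.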

### Stage 3: Cross-Validation

Compare scores across sources for the same method-dataset-metric triple:
- Flag discrepancies larger than one standard deviation (for example, the run-to-run standard deviation reported with the score)
- Prefer primary sources when conflicts exist
- Note which scores come from official vs. unofficial implementations

**Yield**: Validated score set with confidence annotations.
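The cross-validation rules can be sketched as grouping observations by their method-dataset-metric triple and comparing each against a preferred value. This assumes every observation records whether it comes from the method's original paper and, when available, a reported standard deviation; the `Observation` type, the `fallback_tol` used when no deviation is reported, and the confidence labels are illustrative assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    method: str
    dataset: str
    metric: str
    value: float
    source: str
    primary: bool  # True if taken from the method's original paper
    std: Optional[float] = None  # reported run-to-run standard deviation, if any


def cross_validate(obs: list[Observation], fallback_tol: float = 0.5) -> dict:
    """Group by (method, dataset, metric), flag discrepancies, pick a preferred value."""
    groups: dict[tuple[str, str, str], list[Observation]] = defaultdict(list)
    for o in obs:
        groups[(o.method, o.dataset, o.metric)].append(o)

    results = {}
    for triple, items in groups.items():
        # Prefer a primary source when one exists; otherwise take the first seen.
        chosen = next((o for o in items if o.primary), items[0])
        # One standard deviation as the threshold, or a fixed tolerance if none is reported.
        tol = chosen.std if chosen.std is not None else fallback_tol
        conflicts = [o.source for o in items if abs(o.value - chosen.value) > tol]
        if len(items) == 1:
            confidence = "single-source"
        elif conflicts:
            confidence = "disputed"
        else:
            confidence = "corroborated"
        results[triple] = {
            "value": chosen.value,
            "source": chosen.source,
            "conflicts": conflicts,
            "confidence": confidence,
        }
    return results
```

Whether a score came from an official or unofficial implementation could be tracked as one more field on `Observation` in the same way as `primary`.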

### Stage 4: Dedup and Merge

Consolidate all sources into a single canonical dataset:
- Resolve method name aliases
- Merge duplicate entries with provenance tracking
- Assign confidence levels based on source agreement

**Yield**: Unified performance dataset ready for analysis.
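Stage 4 can be sketched as mapping every method name to a canonical form and then merging records that share a triple. This assumes aliases are resolved through a hand-maintained table keyed on a normalized name (lowercased, punctuation stripped); the normalization rule, the `tol` agreement threshold, and the three confidence levels are assumptions, not part of the skill's specification.

```python
import re
from collections import defaultdict


def normalize(name: str) -> str:
    """Lowercase and drop non-alphanumerics so 'Model-A' and 'model_a' collide."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def merge_scores(
    records: list[tuple[str, str, str, float, str]],
    aliases: dict[str, str],
    tol: float = 0.1,
) -> list[dict]:
    """Merge (method, dataset, metric, value, source) records into canonical entries.

    `aliases` maps a normalized name to its canonical method name; names without
    an alias entry are kept as written.
    """
    merged = defaultdict(list)
    for method, dataset, metric, value, source in records:
        canonical = aliases.get(normalize(method), method)
        merged[(canonical, dataset, metric)].append((value, source))

    out = []
    for (method, dataset, metric), entries in merged.items():
        values = [v for v, _ in entries]
        spread = max(values) - min(values)
        # Several sources within tol -> high; one source -> medium; disagreement -> low.
        if len(entries) >= 2 and spread <= tol:
            confidence = "high"
        elif len(entries) == 1:
            confidence = "medium"
        else:
            confidence = "low"
        out.append({
            "method": method,
            "dataset": dataset,
            "metric": metric,
            "value": entries[0][0],  # first record seen; Stage 3 picks the preferred value
            "provenance": [s for _, s in entries],
            "confidence": confidence,
        })
    return out
```

Keeping the full `provenance` list instead of only the winning source lets a later reader see exactly which sources were merged into each entry.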

## Minimum Yield

| Metric | Floor |
|--------|-------|
| Leaderboard sources checked | 3 |
| Methods with scores | 15 |
| Cross-validated score pairs | 10 |
| Deduplication conflicts resolved | 5 |
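The floors in this table can be checked mechanically at the end of a run. A minimal sketch; the dictionary keys are hypothetical names for the table's metrics, and the floor values are copied from the table.

```python
# Floors from the Minimum Yield table; the key names are hypothetical.
MINIMUM_YIELD = {
    "leaderboard_sources_checked": 3,
    "methods_with_scores": 15,
    "cross_validated_score_pairs": 10,
    "dedup_conflicts_resolved": 5,
}


def yield_shortfalls(counts: dict[str, int]) -> dict[str, int]:
    """Return how far each metric falls below its floor; an empty dict means the run passes."""
    return {
        metric: floor - counts.get(metric, 0)
        for metric, floor in MINIMUM_YIELD.items()
        if counts.get(metric, 0) < floor
    }
```

A metric missing from `counts` is treated as zero, so an unreported stage shows up as a full shortfall rather than passing silently.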

## SOPs Used

- method-discovery (for finding methods on leaderboards)
- score-extraction (for paper-based extraction)
- discrepancy-identification (for cross-validation)
