leaderboard-harvesting

$npx mdskill add yogsoth-ai/de-anthropocentric-research-engine/leaderboard-harvesting

Collects performance data from leaderboards and research papers

  • Solves the problem of finding and organizing benchmark results across platforms
  • Uses Papers With Code, GitHub, and academic papers as primary data sources
  • Validates scores through cross-checking across multiple sources
  • Produces a deduplicated dataset with provenance tracking for each score

SKILL.md

.github/skills/leaderboard-harvestingView on GitHub ↗
---
name: leaderboard-harvesting
description: Systematically collect performance data from platforms and papers
execution: tactic
used-by: baseline-establishment
---

# Leaderboard Harvesting


## Purpose

Harvest structured performance data from leaderboard platforms (Papers With Code, benchmark-specific sites), survey papers, and official benchmark repositories. Produces deduplicated, provenance-tracked score collections.

## Stages

### Stage 1: Platform Scan

Identify and scrape all relevant leaderboard sources:
- Papers With Code task pages
- Benchmark-specific leaderboards (e.g., GLUE, ImageNet, WMT)
- GitHub benchmark repositories
- Survey papers with comprehensive comparison tables

**Yield**: List of leaderboard URLs + initial method counts per source.

### Stage 2: Paper Extraction

For methods not covered by leaderboards, extract scores directly from papers:
- Original method papers (primary source)
- Ablation studies and follow-up papers
- Reproduction studies and benchmarking papers

**Yield**: Raw score tuples with paper provenance.

### Stage 3: Cross-Validation

Compare scores across sources for the same method-dataset-metric triple:
- Flag discrepancies > 1 standard deviation
- Prefer primary sources when conflicts exist
- Note which scores come from official vs. unofficial implementations

**Yield**: Validated score set with confidence annotations.

### Stage 4: Dedup and Merge

Consolidate all sources into a single canonical dataset:
- Resolve method name aliases
- Merge duplicate entries with provenance tracking
- Assign confidence levels based on source agreement

**Yield**: Unified performance dataset ready for analysis.

## Minimum Yield

| Metric | Floor |
|--------|-------|
| Leaderboard sources checked | 3 |
| Methods with scores | 15 |
| Cross-validated score pairs | 10 |
| Deduplication conflicts resolved | 5 |

## SOPs Used

- method-discovery (for finding methods on leaderboards)
- score-extraction (for paper-based extraction)
- discrepancy-identification (for cross-validation)

More from yogsoth-ai/de-anthropocentric-research-engine