evaluation-protocol-comparison

Name: evaluation-protocol-comparison
Author: yogsoth-ai/de-anthropocentric-research-engine

$ npx mdskill add yogsoth-ai/de-anthropocentric-research-engine/evaluation-protocol-comparison

Compares benchmark implementations across research papers to highlight protocol differences

Identifies inconsistencies in benchmark evaluation protocols between studies
Leverages paper search tools like dare-ss and dare-scholar for data collection
Extracts and aligns protocol elements like data splits, prompting, and metrics
Produces structured comparison to inform reproducibility and fairness analysis

SKILL.md

.github/skills/evaluation-protocol-comparison

---
name: evaluation-protocol-comparison
description: Compare implementation differences of the same benchmark across papers
execution: tactic
used-by: benchmark-archaeology
---

# Evaluation Protocol Comparison Tactic

Compare how different papers implement the same benchmark to expose hidden protocol variance that undermines cross-paper score comparability.

## Stages

### Stage 1: Paper Collection (Same Benchmark)

Collect 10-15 papers that report results on the target benchmark (the Yield Report requires at least 8 to be compared):
- Prioritize diversity: different labs, years, model families
- Include the original benchmark paper as the reference protocol
- Include papers from different venues (top conferences, workshops, preprints)
- Search via dare-ss (ss_relevance_search) and dare-scholar (paper_searching)

**Search queries**: "[benchmark name] evaluation", "[benchmark name] results", "[benchmark name] state-of-the-art"
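The query templates above can be expanded mechanically for any benchmark. The following is a minimal sketch, assuming a hypothetical helper named `build_queries`; it only produces query strings and does not call dare-ss or dare-scholar, whose interfaces are not described here.

```python
# The three Stage 1 query templates, with a placeholder for the benchmark name.
QUERY_TEMPLATES = (
    "{benchmark} evaluation",
    "{benchmark} results",
    "{benchmark} state-of-the-art",
)


def build_queries(benchmark: str) -> list[str]:
    """Fill each Stage 1 query template with the benchmark name."""
    name = benchmark.strip()
    if not name:
        raise ValueError("benchmark name must not be empty")
    return [template.format(benchmark=name) for template in QUERY_TEMPLATES]


# "ExampleBench" is a placeholder, not a real benchmark.
for query in build_queries("ExampleBench"):
    print(query)
```

Each resulting string can then be passed to whichever search tool is in use.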

### Stage 2: Protocol Element Extraction

For each paper, run the protocol-element-extraction SOP to extract:

| Element Category | Specific Parameters |
|-----------------|-------------------|
| **Data** | Split version, subset selection, preprocessing, filtering |
| **Prompting** | Template format, few-shot examples (count, selection), instruction wording |
| **Generation** | Decoding strategy, temperature, top-p/top-k, max tokens, stop criteria |
| **Evaluation** | Metric implementation, postprocessing, normalization, scoring script version |
| **Infrastructure** | Framework, precision (fp16/bf16/fp32), batch size, hardware |
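One way to hold the extracted elements is a per-paper record keyed by the five categories in the table. This is a sketch under assumptions: the class name `ProtocolRecord`, its methods, and the parameter keys are hypothetical and are not defined by the protocol-element-extraction SOP.

```python
from dataclasses import dataclass, field

# The five element categories from the extraction table.
CATEGORIES = ("data", "prompting", "generation", "evaluation", "infrastructure")


@dataclass
class ProtocolRecord:
    """Protocol elements extracted from one paper, grouped by category."""

    paper: str
    elements: dict[str, dict[str, str]] = field(default_factory=dict)

    def set(self, category: str, parameter: str, value: str) -> None:
        """Store one extracted parameter under a known category."""
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        self.elements.setdefault(category, {})[parameter] = value

    def count(self) -> int:
        """Number of extracted parameters across all categories."""
        return sum(len(params) for params in self.elements.values())


# Illustrative values only; "paper-a" is a placeholder identifier.
record = ProtocolRecord(paper="paper-a")
record.set("generation", "temperature", "0.0")
record.set("prompting", "few_shot_count", "5")
print(record.count())
```

Keeping values as strings avoids forcing a numeric type on parameters that papers report inconsistently.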

### Stage 3: Difference Matrix Construction

Build a comparison matrix:
- Rows = protocol elements
- Columns = papers
- Cells = specific value used
- Highlight deviations from the original protocol

Classify per-element variance:
- **None**: All papers use identical value
- **Low**: Minor variations (e.g., different random seeds)
- **Medium**: Substantive differences (e.g., different few-shot examples)
- **High**: Fundamental disagreements (e.g., different splits, different metrics)
- **Extreme**: Papers appear to evaluate different things under the same name
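The matrix and the deviation highlighting can be sketched as below. The function names, the use of `None` for unreported elements, and the sample records are assumptions for illustration. Only the **None** level is detected automatically; grading a difference as Low through Extreme still requires judgment.

```python
from typing import Optional

# One matrix row: paper identifier -> reported value (None if not reported).
Row = dict[str, Optional[str]]


def build_matrix(records: dict[str, dict[str, str]]) -> dict[str, Row]:
    """Pivot {paper: {element: value}} into {element: {paper: value}}."""
    elements = sorted({e for values in records.values() for e in values})
    return {
        element: {paper: values.get(element) for paper, values in records.items()}
        for element in elements
    }


def deviations(row: Row, reference: str) -> list[str]:
    """Papers whose reported value differs from the reference paper's value."""
    ref_value = row[reference]
    return [
        paper
        for paper, value in row.items()
        if paper != reference and value is not None and value != ref_value
    ]


def is_uniform(row: Row) -> bool:
    """True when every reported value is identical (variance level 'none')."""
    reported = {value for value in row.values() if value is not None}
    return len(reported) <= 1


# Illustrative, invented records; not taken from any real paper.
records = {
    "original": {"few_shot_count": "5", "metric": "exact match"},
    "paper-a": {"few_shot_count": "0", "metric": "exact match"},
    "paper-b": {"few_shot_count": "5"},
}
matrix = build_matrix(records)
for element, row in matrix.items():
    print(element, deviations(row, "original"), is_uniform(row))
```

Unreported elements are kept as `None` rather than counted as deviations, so that missing documentation can be tracked separately from actual protocol differences.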

### Stage 4: Impact Assessment

For each element rated High or Extreme:
- Search for ablation studies showing impact of that element
- Estimate score range attributable to protocol choice vs model quality
- Identify which protocol choices systematically favor certain model families
- Flag "protocol p-hacking": a suspicious correlation between protocol choice and reported improvement
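The comparison of protocol-driven and model-driven score ranges can be roughed out as follows. This is a crude heuristic, not a method prescribed by this tactic: the names `score_range` and `protocol_share` are hypothetical, the split into two ranges assumes an ablation that varies one protocol element for a fixed model, and all numbers are invented placeholders.

```python
def score_range(scores: dict[str, float]) -> float:
    """Spread between the best and worst score in a group."""
    return max(scores.values()) - min(scores.values())


def protocol_share(
    protocol_scores: dict[str, float], model_scores: dict[str, float]
) -> float:
    """Fraction of the combined spread that comes from the protocol choice.

    protocol_scores: one model evaluated under different protocol values.
    model_scores: different models evaluated under one fixed protocol.
    """
    protocol_spread = score_range(protocol_scores)
    model_spread = score_range(model_scores)
    total = protocol_spread + model_spread
    return protocol_spread / total if total else 0.0


# Invented placeholder numbers, for illustration only.
same_model_by_protocol = {"0-shot": 61.0, "5-shot": 67.0}
models_fixed_protocol = {"model-x": 64.0, "model-y": 66.0}

print(score_range(same_model_by_protocol))
print(protocol_share(same_model_by_protocol, models_fixed_protocol))
```

A share well above one half suggests that reported gaps between papers may reflect protocol choice more than model quality, which is worth recording in `impact_estimate`.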

## Output

```yaml
protocol_comparison:
  benchmark: string
  papers_compared: int
  reference_protocol: string  # original benchmark paper
  difference_matrix:
    - element: string
      category: data|prompting|generation|evaluation|infrastructure
      variance_level: none|low|medium|high|extreme
      values: list[{paper, value}]
      impact_estimate: string
  highest_variance_elements:
    - element: string
      score_impact: string
      favors: string  # which model family benefits
  protocol_p_hacking_flags:
    - paper: string
      suspicious_choice: string
      benefit: string
  cross_paper_comparability: high|moderate|low|unreliable
  standardization_recommendations:
    - element: string
      recommended_value: string
      rationale: string
```

## Yield Report

| Metric | Minimum |
|--------|---------|
| Papers compared | 8 |
| Protocol elements extracted per paper | 10 |
| High-variance elements identified | 2 |
| Impact estimates produced | 3 |
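The minimums in the table can be checked before a comparison is reported as complete. A minimal sketch, assuming hypothetical metric keys and a hypothetical helper `yield_shortfalls`; the thresholds are copied from the table above.

```python
# Minimum yields from the Yield Report table; key names are assumptions.
MINIMUMS = {
    "papers_compared": 8,
    "elements_per_paper": 10,
    "high_variance_elements": 2,
    "impact_estimates": 3,
}


def yield_shortfalls(report: dict[str, int]) -> dict[str, int]:
    """Return {metric: missing amount} for every metric below its minimum."""
    return {
        metric: minimum - report.get(metric, 0)
        for metric, minimum in MINIMUMS.items()
        if report.get(metric, 0) < minimum
    }


# Invented example run, for illustration only.
report = {
    "papers_compared": 12,
    "elements_per_paper": 9,
    "high_variance_elements": 2,
    "impact_estimates": 1,
}
print(yield_shortfalls(report))
```

An empty result means every minimum is met; a missing metric is treated as zero.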
