hugging-face-evaluation

Name: hugging-face-evaluation
Author: huggingface/community-evals

$ npx mdskill add huggingface/community-evals/hugging-face-evaluation

Submit structured evaluation metrics to Hugging Face model repositories using the official `.eval_results/` format.

Adds benchmark scores to model pages, enabling leaderboard participation and community review.
Integrates with the Hugging Face CLI for managing Pull Requests and uploading data.
Requires adherence to specific YAML file naming conventions for different scoring variants.
Manages submission by creating necessary files and interacting with the Hub via CLI commands.

SKILL.md

.github/skills/hugging-face-evaluation

---
name: hugging-face-evaluation
description: Add evaluation results to Hugging Face model repositories using the .eval_results/ format. Uses HF CLI for PR management and manual YAML creation.
---

# Overview

This skill adds structured evaluation results to Hugging Face model repositories using the [`.eval_results/` format](https://huggingface.co/docs/hub/eval-results).

**What This Enables:**
- Results appear on model pages with benchmark links
- Scores are aggregated into benchmark dataset leaderboards
- Community contributions via Pull Requests

# Important

Evaluation PRs can only be opened on the Hugging Face Hub. They cannot be opened on the GitHub repository.

# Version
3.0.0

# Workflow Overview

The workflow uses:
1. **HF CLI** (`hf upload`, `hf download`) for PR operations
2. **Manual YAML creation** in `/tmp/pr-reviews/`
3. **`check_prs.py`** script to check for existing PRs
4. **curl** to fetch model cards and leaderboard data

See `references/hf_cli_for_prs.md` for detailed CLI instructions.

---

# CRITICAL: Multiple Scores for One Benchmark

A model can report several scores for the same benchmark, for example one with tools and one without. **Each variant MUST be in a separate file.**

## File Naming Convention

| Condition | File Name | Notes Field |
|-----------|-----------|-------------|
| Default (no tools) | `hle.yaml` | None (omit notes) |
| With tools | `hle_with_tools.yaml` | `notes: "With tools"` |

## Notes Field Rules

1. **No tools = No notes field** - Default assumption is "without tools"
2. **With tools = Add notes** - Only add when tools ARE used
3. **Standardized format** - Always use `notes: "With tools"` (capital W)

**CORRECT:**
```yaml
# hle.yaml (no tools - DEFAULT)
- dataset:
    id: cais/hle
    task_id: hle
  value: 22.1
  source:
    url: https://huggingface.co/org/model
    name: Model Card
    user: username
```

```yaml
# hle_with_tools.yaml (with tools)
- dataset:
    id: cais/hle
    task_id: hle
  value: 44.9
  source:
    url: https://huggingface.co/org/model
    name: Model Card
    user: username
  notes: "With tools"
```

**INCORRECT:**
```yaml
notes: "Without tools"  # Don't add notes for default
notes: "w/ tools"       # Use standardized format
notes: "with tools"     # Capital W required
```
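The naming and notes rules above can be sketched as a small helper. This is a minimal illustration, not part of the skill: the function names and the `slug` parameter are hypothetical, and only the `hle` file names are confirmed by the table above.

```python
from typing import Optional


def variant_filename(slug: str, with_tools: bool) -> str:
    """Return the file name for a score variant (slug is a hypothetical benchmark prefix)."""
    suffix = "_with_tools" if with_tools else ""
    return f"{slug}{suffix}.yaml"


def variant_notes(with_tools: bool) -> Optional[str]:
    """Return the standardized notes value, or None when the field must be omitted."""
    return "With tools" if with_tools else None


def render_entry(dataset_id: str, task_id: str, value: float,
                 with_tools: bool, source_url: str, user: str) -> str:
    """Render one result entry as YAML text, adding notes only for the tools variant."""
    lines = [
        "- dataset:",
        f"    id: {dataset_id}",
        f"    task_id: {task_id}",
        f"  value: {value}",
        "  source:",
        f"    url: {source_url}",
        "    name: Model Card",
        f"    user: {user}",
    ]
    notes = variant_notes(with_tools)
    if notes is not None:
        lines.append(f'  notes: "{notes}"')
    return "\n".join(lines) + "\n"
```

Keeping the notes decision in one function means the default variant can never pick up a stray `notes` field.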

---

# Core Workflow

## Step 1: Check for Existing PRs

**ALWAYS check before creating new PRs:**

```bash
uv run scripts/check_prs.py --repo-id "org/model-name"
```

If PRs exist, update them instead of creating new ones.
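For readers who want the same check from Python, the sketch below lists open pull requests with `HfApi.get_repo_discussions` from `huggingface_hub`. It is an assumption about what an equivalent check could look like, not a description of how `check_prs.py` works internally; the helper names are hypothetical.

```python
from typing import Iterable, List


def open_pull_requests(discussions: Iterable) -> List:
    """Keep only the discussions that are pull requests and still open."""
    return [
        d for d in discussions
        if getattr(d, "is_pull_request", False) and getattr(d, "status", None) == "open"
    ]


def list_open_prs(repo_id: str) -> List:
    """Fetch discussions for a model repo from the Hub and return the open PRs."""
    # Imported lazily so the filter above can be used without the dependency installed.
    from huggingface_hub import HfApi

    api = HfApi()
    return open_pull_requests(
        api.get_repo_discussions(repo_id=repo_id, repo_type="model")
    )


# Example usage (performs a network call):
#   for pr in list_open_prs("org/model-name"):
#       print(pr.num, pr.title)
```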

## Step 2: Fetch Model Card and Extract Scores

```bash
# Fetch the model README and print 10 lines of context after each benchmark mention
curl -s "https://huggingface.co/org/model-name/raw/main/README.md" | grep -i -E -A10 "HLE|GPQA|MMLU"
```

Or use MCP tools:
```
mcp__hf-mcp-server__hub_repo_details
  repo_ids: ["org/model-name"]
  include_readme: true
```
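Once the README text is available, candidate scores still have to be located by hand or by a script. The sketch below assumes the model card reports scores in Markdown table rows whose first cell names the benchmark; model cards vary widely, so treat the output as candidates to verify rather than final values. The function and pattern names are hypothetical.

```python
import re
from typing import Dict, List

# Same benchmark names as the grep example above.
BENCHMARK_PATTERN = re.compile(r"HLE|GPQA|MMLU", re.IGNORECASE)
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def candidate_scores(readme_text: str) -> Dict[str, List[float]]:
    """Map each matching table row label to the numbers found in the rest of that row."""
    found: Dict[str, List[float]] = {}
    for line in readme_text.splitlines():
        # Split a Markdown table row such as "| HLE | 22.1 |" into trimmed cells.
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) < 2 or not BENCHMARK_PATTERN.search(cells[0]):
            continue
        numbers = [float(n) for cell in cells[1:] for n in NUMBER_PATTERN.findall(cell)]
        if numbers:
            found[cells[0]] = numbers
    return found
```

Rows that mention tools in their label, such as `HLE (with tools)`, stay separate keys, which maps directly onto the one-file-per-variant rule.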

## Step 3: Create YAML File

```bash
mkdir -p /tmp/pr-reviews/new-prs
cd /tmp/pr-reviews/new-prs

cat > hle.yaml << 'EOF'
- dataset:
    id: cais/hle
    task_id: hle
  value: 22.1
  date: '2026-02-03'
  source:
    url: https://huggingface.co/org/model-name
    name: Model Card
    user: username
EOF
```

## Step 4: Create PR

```bash
hf upload org/model-name hle.yaml .eval_results/hle.yaml \
  --repo-type model --create-pr \
  --commit-message "Add HLE evaluation result"
```
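The same upload can be done from Python with `HfApi.upload_file` and `create_pr=True`. This is a sketch of an alternative to the CLI command, not the skill's prescribed method; the helper names and the commit message format are illustrative.

```python
from typing import Any, Dict


def eval_upload_kwargs(repo_id: str, local_path: str, file_name: str) -> Dict[str, Any]:
    """Build keyword arguments for HfApi.upload_file that open a PR with one result file."""
    return {
        "path_or_fileobj": local_path,
        "path_in_repo": f".eval_results/{file_name}",
        "repo_id": repo_id,
        "repo_type": "model",
        "create_pr": True,
        "commit_message": f"Add {file_name} evaluation result",
    }


def open_eval_pr(repo_id: str, local_path: str, file_name: str) -> str:
    """Upload the file as a new PR and return the PR URL reported by the Hub."""
    # Lazy import: needs `huggingface_hub` installed and a valid token (e.g. HF_TOKEN).
    from huggingface_hub import HfApi

    commit = HfApi().upload_file(**eval_upload_kwargs(repo_id, local_path, file_name))
    return commit.pr_url


# Example usage (performs a network call and opens a real PR):
#   print(open_eval_pr("org/model-name", "hle.yaml", "hle.yaml"))
```

The returned URL identifies the new PR directly, which can save a separate lookup.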

## Step 5: Get PR Number

```bash
uv run scripts/check_prs.py --repo-id "org/model-name"
```

---

# Updating Existing PRs

```bash
# Download PR contents
hf download org/model-name --repo-type model \
  --revision refs/pr/<PR_NUMBER> \
  --include ".eval_results/*" \
  --local-dir /tmp/pr-reviews/model-pr<PR_NUMBER>

# Edit the downloaded YAML file in place, then upload it back to the PR branch
hf upload org/model-name /tmp/pr-reviews/model-pr<PR_NUMBER>/.eval_results/hle.yaml .eval_results/hle.yaml \
  --repo-type model \
  --revision refs/pr/<PR_NUMBER> \
  --commit-message "Update evaluation result"
```

---

# Deleting Files from PRs

Use the Python API:
```bash
uv run --with huggingface_hub python3 << 'EOF'
from huggingface_hub import HfApi
api = HfApi()
api.delete_file(
    path_in_repo=".eval_results/old_file.yaml",
    repo_id="org/model-name",
    repo_type="model",
    revision="refs/pr/<PR_NUMBER>",
    commit_message="Remove file"
)
EOF
```

---

# Fetching Leaderboard Data

```bash
# HLE leaderboard (requires auth for private datasets)
curl -s "https://huggingface.co/api/datasets/cais/hle/leaderboard" \
  -H "Authorization: Bearer $HF_TOKEN"

# MMLU-Pro leaderboard (public)
curl -s "https://huggingface.co/api/datasets/TIGER-Lab/MMLU-Pro/leaderboard"

# Model eval results
curl -sg "https://huggingface.co/api/models/org/model?expand[]=evalResults"  # -g stops curl from treating [] as a glob range
```
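The leaderboard requests above can also be issued from Python using only the standard library. The sketch below builds the same URL and optional `Authorization` header; it makes no assumption about the shape of the JSON response, which is returned unparsed beyond decoding. The function names are hypothetical.

```python
import json
import os
import urllib.request
from typing import Any, Optional


def leaderboard_request(dataset_id: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request for a benchmark dataset's leaderboard endpoint."""
    url = f"https://huggingface.co/api/datasets/{dataset_id}/leaderboard"
    # Send the bearer token only when one is provided (needed for restricted datasets).
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    return urllib.request.Request(url, headers=headers)


def fetch_leaderboard(dataset_id: str) -> Any:
    """Fetch and decode the leaderboard JSON, using HF_TOKEN from the environment if set."""
    request = leaderboard_request(dataset_id, os.environ.get("HF_TOKEN"))
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# Example usage (performs a network call):
#   data = fetch_leaderboard("TIGER-Lab/MMLU-Pro")
```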

---

# .eval_results/ Format

```yaml
# .eval_results/hle.yaml
- dataset:
    id: cais/hle              # Required: Hub Benchmark dataset ID
    task_id: hle              # Required: task id from dataset's eval.yaml
  value: 22.2                 # Required: metric value
  date: "2026-01-14"          # Optional: ISO-8601 date
  source:                     # Optional: attribution
    url: https://huggingface.co/org/model
    name: Model Card
    user: username
```
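Before uploading, an entry can be checked against the required and optional fields listed above. The validator below is a sketch under stated assumptions: it treats `value` as numeric (the format comment says only "metric value"), and it applies this skill's `notes` convention, which may be stricter than the Hub format itself. The function name is hypothetical.

```python
from datetime import date
from typing import Any, Dict, List


def validate_entry(entry: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one parsed .eval_results entry (empty if none)."""
    problems: List[str] = []

    dataset = entry.get("dataset") or {}
    for key in ("id", "task_id"):
        if not dataset.get(key):
            problems.append(f"missing required field dataset.{key}")

    value = entry.get("value")
    # Assumption: metric values are numbers; bool is excluded because it subclasses int.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        problems.append("missing or non-numeric required field value")

    if "date" in entry:
        try:
            date.fromisoformat(str(entry["date"]))
        except ValueError:
            problems.append("date is not an ISO-8601 calendar date")

    # Convention from this skill: notes is either omitted or exactly "With tools".
    if "notes" in entry and entry["notes"] != "With tools":
        problems.append('notes must be exactly "With tools" or be omitted')

    return problems
```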

---

# Supported Benchmarks

| Benchmark | Hub Dataset ID | Task ID |
|-----------|---------------|---------|
| HLE | cais/hle | hle |
| GPQA | Idavidrein/gpqa | diamond |
| MMLU-Pro | TIGER-Lab/MMLU-Pro | mmlu_pro |

---

# Tool-Using Agent Models

Models such as MiroThinker and Nemotron-Orchestrator are inherently tool-using agents. For these:

1. Use `hle_with_tools.yaml` as filename
2. Add `notes: "With tools"`
3. Look for terms: "search agent", "agentic", "orchestrator", "code-interpreter"
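The term check in step 3 can be sketched as a simple case-insensitive search over the model card text. A match is only a hint that the model may be a tool-using agent and should be confirmed by reading the card; the function name is hypothetical.

```python
from typing import List

# Terms listed in step 3 above, stored in lowercase for case-insensitive matching.
AGENT_TERMS = ("search agent", "agentic", "orchestrator", "code-interpreter")


def agent_terms_found(model_card_text: str) -> List[str]:
    """Return the agent-related terms that appear in the model card, in list order."""
    lowered = model_card_text.lower()
    return [term for term in AGENT_TERMS if term in lowered]
```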

---

# Environment Setup

```bash
export HF_TOKEN="your-huggingface-token"
```

---

# Scripts Reference

```bash
# Check for existing PRs (ALWAYS do this first)
uv run scripts/check_prs.py --repo-id "org/model-name"
```

See `references/hf_cli_for_prs.md` for complete HF CLI workflow documentation.

---

# Best Practices

1. **Always check for existing PRs** before creating new ones
2. **Separate files for variants** - `hle.yaml` for default, `hle_with_tools.yaml` for tools
3. **Notes only for non-default** - Omit notes for standard evaluations
4. **Standardized format** - Use `"With tools"` exactly (capital W)
5. **Verify scores** - Compare YAML against model card before submitting