hugging-face-evaluation

$npx mdskill add huggingface/community-evals/hugging-face-evaluation

Submit structured evaluation metrics to Hugging Face model repositories using the official `.eval_results/` format.

  • Adds benchmark scores to model pages, enabling leaderboard participation and community review.
  • Integrates with the Hugging Face CLI for managing Pull Requests and uploading data.
  • Requires adherence to specific YAML file naming conventions for different scoring variants.
  • Manages submission by creating necessary files and interacting with the Hub via CLI commands.

SKILL.md

.github/skills/hugging-face-evaluationView on GitHub ↗
---
name: hugging-face-evaluation
description: Add evaluation results to Hugging Face model repositories using the .eval_results/ format. Uses HF CLI for PR management and manual YAML creation.
---

# Overview

This skill adds structured evaluation results to HuggingFace model repositories using the [`.eval_results/` format](https://huggingface.co/docs/hub/eval-results).

**What This Enables:**
- Results appear on model pages with benchmark links
- Scores are aggregated into benchmark dataset leaderboards
- Community contributions via Pull Requests

# Important

Evaluation PRs can only be opened on the Hugging Face Hub. They cannot be opened on the GitHub repository.

# Version
3.0.0

# Workflow Overview

The actual workflow uses:
1. **HF CLI** (`hf upload`, `hf download`) for PR operations
2. **Manual YAML creation** in `/tmp/pr-reviews/`
3. **`check_prs.py`** script to check for existing PRs
4. **curl** to fetch model cards and leaderboard data

See `references/hf_cli_for_prs.md` for detailed CLI instructions.

---

# CRITICAL: Multiple Scores for One Benchmark

Models can have multiple scores for the same benchmark (with/without tools). **Each variant MUST be in a separate file.**

## File Naming Convention

| Condition | File Name | Notes Field |
|-----------|-----------|-------------|
| Default (no tools) | `hle.yaml` | None (omit notes) |
| With tools | `hle_with_tools.yaml` | `notes: "With tools"` |

## Notes Field Rules

1. **No tools = No notes field** - Default assumption is "without tools"
2. **With tools = Add notes** - Only add when tools ARE used
3. **Standardized format** - Always use `notes: "With tools"` (capital W)

**CORRECT:**
```yaml
# hle.yaml (no tools - DEFAULT)
- dataset:
    id: cais/hle
    task_id: hle
  value: 22.1
  source:
    url: https://huggingface.co/org/model
    name: Model Card
    user: username
```

```yaml
# hle_with_tools.yaml (with tools)
- dataset:
    id: cais/hle
    task_id: hle
  value: 44.9
  source:
    url: https://huggingface.co/org/model
    name: Model Card
    user: username
  notes: "With tools"
```

**INCORRECT:**
```yaml
notes: "Without tools"  # Don't add notes for default
notes: "w/ tools"       # Use standardized format
notes: "with tools"     # Capital W required
```

---

# Core Workflow

## Step 1: Check for Existing PRs

**ALWAYS check before creating new PRs:**

```bash
uv run scripts/check_prs.py --repo-id "org/model-name"
```

If PRs exist, update them instead of creating new ones.

## Step 2: Fetch Model Card and Extract Scores

```bash
# Get model README
curl -s "https://huggingface.co/org/model-name/raw/main/README.md" | grep -i -A10 "HLE\|GPQA\|MMLU"
```

Or use MCP tools:
```
mcp__hf-mcp-server__hub_repo_details
  repo_ids: ["org/model-name"]
  include_readme: true
```

## Step 3: Create YAML File

```bash
mkdir -p /tmp/pr-reviews/new-prs
cd /tmp/pr-reviews/new-prs

cat > hle.yaml << 'EOF'
- dataset:
    id: cais/hle
    task_id: hle
  value: 22.1
  date: '2026-02-03'
  source:
    url: https://huggingface.co/org/model-name
    name: Model Card
    user: burtenshaw
EOF
```

## Step 4: Create PR

```bash
hf upload org/model-name hle.yaml .eval_results/hle.yaml \
  --repo-type model --create-pr \
  --commit-message "Add HLE evaluation result"
```

## Step 5: Get PR Number

```bash
uv run scripts/check_prs.py --repo-id "org/model-name"
```

---

# Updating Existing PRs

```bash
# Download PR contents
hf download org/model-name --repo-type model \
  --revision refs/pr/<PR_NUMBER> \
  --include ".eval_results/*" \
  --local-dir /tmp/pr-reviews/model-pr<PR_NUMBER>

# Edit the YAML file, then upload
hf upload org/model-name /tmp/pr-reviews/updated.yaml .eval_results/hle.yaml \
  --repo-type model \
  --revision refs/pr/<PR_NUMBER> \
  --commit-message "Update evaluation result"
```

---

# Deleting Files from PRs

Use Python API:
```bash
uv run --with huggingface_hub python3 << 'EOF'
from huggingface_hub import HfApi
api = HfApi()
api.delete_file(
    path_in_repo=".eval_results/old_file.yaml",
    repo_id="org/model-name",
    repo_type="model",
    revision="refs/pr/<PR_NUMBER>",
    commit_message="Remove file"
)
EOF
```

---

# Fetching Leaderboard Data

```bash
# HLE leaderboard (requires auth for private datasets)
curl -s "https://huggingface.co/api/datasets/cais/hle/leaderboard" \
  -H "Authorization: Bearer $HF_TOKEN"

# MMLU-Pro leaderboard (public)
curl -s "https://huggingface.co/api/datasets/TIGER-Lab/MMLU-Pro/leaderboard"

# Model eval results
curl -s "https://huggingface.co/api/models/org/model?expand[]=evalResults"
```

---

# .eval_results/ Format

```yaml
# .eval_results/hle.yaml
- dataset:
    id: cais/hle              # Required: Hub Benchmark dataset ID
    task_id: hle              # Required: task id from dataset's eval.yaml
  value: 22.2                 # Required: metric value
  date: "2026-01-14"          # Optional: ISO-8601 date
  source:                     # Optional: attribution
    url: https://huggingface.co/org/model
    name: Model Card
    user: username
```

---

# Supported Benchmarks

| Benchmark | Hub Dataset ID | Task ID |
|-----------|---------------|---------|
| HLE | cais/hle | hle |
| GPQA | Idavidrein/gpqa | diamond |
| MMLU-Pro | TIGER-Lab/MMLU-Pro | mmlu_pro |

---

# Tool-Using Agent Models

Models like MiroThinker, Nemotron-Orchestrator are inherently tool-using agents. For these:

1. Use `hle_with_tools.yaml` as filename
2. Add `notes: "With tools"`
3. Look for terms: "search agent", "agentic", "orchestrator", "code-interpreter"

---

# Environment Setup

```bash
export HF_TOKEN="your-huggingface-token"
```

---

# Scripts Reference

```bash
# Check for existing PRs (ALWAYS do this first)
uv run scripts/check_prs.py --repo-id "org/model-name"
```

See `references/hf_cli_for_prs.md` for complete HF CLI workflow documentation.

---

# Best Practices

1. **Always check for existing PRs** before creating new ones
2. **Separate files for variants** - `hle.yaml` for default, `hle_with_tools.yaml` for tools
3. **Notes only for non-default** - Omit notes for standard evaluations
4. **Standardized format** - Use `"With tools"` exactly (capital W)
5. **Verify scores** - Compare YAML against model card before submitting