huggingface-community-evals

Name: huggingface-community-evals
Author: huggingface/skills

$ npx mdskill add huggingface/skills/huggingface-community-evals

Execute local model evaluations for Hugging Face Hub models using specialized inference backends.

- Performs local benchmarking for model selection and backend comparison.
- Integrates with `inspect-ai`, `lighteval`, and backends such as vLLM and Transformers.
- Selects the evaluation script that matches the required inference backend.
- Writes evaluation results to the local developer environment.

SKILL.md

.github/skills/huggingface-community-evals

---
name: huggingface-community-evals
description: Run evaluations for Hugging Face Hub models using inspect-ai and lighteval on local hardware. Use for backend selection, local GPU evals, and choosing between vLLM / Transformers / accelerate. Not for HF Jobs orchestration, model-card PRs, .eval_results publication, or community-evals automation.
---

# Overview

This skill is for **running evaluations of Hugging Face Hub models on local hardware**.

It covers:
- `inspect-ai` with local inference
- `lighteval` with local inference
- choosing between `vllm`, Hugging Face Transformers, and `accelerate`
- smoke tests, task selection, and backend fallback strategy

It does **not** cover:
- Hugging Face Jobs orchestration
- model-card or `model-index` edits
- README table extraction
- Artificial Analysis imports
- `.eval_results` generation or publishing
- PR creation or community-evals automation

If the user wants to **run the same eval remotely on Hugging Face Jobs**, hand off to the `hugging-face-jobs` skill and pass it one of the local scripts in this skill.

If the user wants to **publish results into the community evals workflow**, stop once the evaluation run has finished and hand off the publishing step to `~/code/community-evals`.

> All paths below are relative to the directory containing this `SKILL.md`.

# When To Use Which Script

| Use case | Script |
|---|---|
| Local `inspect-ai` eval on a Hub model via Hugging Face Inference Providers | `scripts/inspect_eval_uv.py` |
| Local GPU eval with `inspect-ai` using `vllm` or Transformers | `scripts/inspect_vllm_uv.py` |
| Local GPU eval with `lighteval` using `vllm` or `accelerate` | `scripts/lighteval_vllm_uv.py` |
| Extra command patterns | `examples/USAGE_EXAMPLES.md` |
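
The table can also be read as a small lookup. The sketch below is illustrative only: the `pick_script` helper and the `"providers"` / `"local-gpu"` labels are hypothetical names, not part of this skill. Only the script paths come from the table.

```python
# Hypothetical helper: maps a (framework, execution path) pair to one of the
# scripts in the table above. Only the script paths come from this skill.
SCRIPTS = {
    ("inspect-ai", "providers"): "scripts/inspect_eval_uv.py",
    ("inspect-ai", "local-gpu"): "scripts/inspect_vllm_uv.py",
    ("lighteval", "local-gpu"): "scripts/lighteval_vllm_uv.py",
}


def pick_script(framework: str, path: str) -> str:
    """Return the script for a framework and execution path, or raise ValueError."""
    try:
        return SCRIPTS[(framework, path)]
    except KeyError:
        raise ValueError(
            f"no script for framework={framework!r}, path={path!r}"
        ) from None


print(pick_script("lighteval", "local-gpu"))
```

A combination that the table does not list, such as `lighteval` via Inference Providers, raises `ValueError` rather than guessing.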

# Prerequisites

- Prefer `uv run` for local execution.
- Set `HF_TOKEN` for gated/private models.
- For local GPU runs, verify GPU access before starting:

```bash
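uv --version                      # confirms uv is installed
printenv HF_TOKEN >/dev/null || echo "HF_TOKEN is not set" >&2   # required for gated/private models
nvidia-smi                        # confirms a GPU is visible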
```

If `nvidia-smi` is unavailable, either:
- use `scripts/inspect_eval_uv.py` for lighter provider-backed evaluation, or
- hand off to the `hugging-face-jobs` skill if the user wants remote compute.
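
The same checks can be scripted before launching a run. This is a minimal sketch, not part of the skill: `preflight` and `suggest_path` are hypothetical helpers, and finding `nvidia-smi` on `PATH` is treated only as a hint that a GPU is usable, not as proof.

```python
import os
import shutil


def preflight(env=None, which=shutil.which):
    """Report which prerequisites appear to be present. Nothing is executed."""
    env = os.environ if env is None else env
    return {
        "uv": which("uv") is not None,
        "hf_token": bool(env.get("HF_TOKEN")),
        # Assumption: nvidia-smi on PATH is used as a proxy for GPU access.
        "gpu": which("nvidia-smi") is not None,
    }


def suggest_path(checks):
    """Follow the fallback above: no GPU means providers or a remote hand-off."""
    if checks["gpu"]:
        return "local-gpu"
    return "scripts/inspect_eval_uv.py or hand off to hugging-face-jobs"


checks = preflight()
print(checks, suggest_path(checks))
```

The `env` and `which` parameters exist only so the checks can be exercised without touching the real environment.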

# Core Workflow

1. Choose the evaluation framework.
   - Use `inspect-ai` when you want explicit task control and inspect-native flows.
   - Use `lighteval` when the benchmark is naturally expressed as a lighteval task string, especially leaderboard-style tasks.
2. Choose the inference backend.
   - Prefer `vllm` for throughput on supported architectures.
   - When `vllm` does not support the model, fall back to Hugging Face Transformers (`--backend hf`) for `inspect-ai` or to `accelerate` (`--backend accelerate`) for `lighteval`.
3. Start with a smoke test.
   - `inspect-ai`: add `--limit 10` or similar.
   - `lighteval`: add `--max-samples 10`.
4. Scale up only after the smoke test passes.
5. If the user wants remote execution, hand off to `hugging-face-jobs` with the same script + args.
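
Step 3 can be sketched as a helper that appends the right sample cap to a command. The `with_smoke_cap` function is hypothetical and not part of the skill's scripts. Only the `--limit` and `--max-samples` flags and the example command come from this document.

```python
import shlex

# Sample-cap flag per framework, as listed in step 3 above.
SMOKE_FLAG = {"inspect-ai": "--limit", "lighteval": "--max-samples"}


def with_smoke_cap(argv, framework, n=10):
    """Return a copy of argv with the framework's sample cap appended."""
    flag = SMOKE_FLAG[framework]
    if flag in argv:
        raise ValueError(f"{flag} is already set")
    return [*argv, flag, str(n)]


base = [
    "uv", "run", "scripts/inspect_vllm_uv.py",
    "--model", "meta-llama/Llama-3.2-1B",
    "--task", "gsm8k",
]
# Print the command rather than running it, so it can be reviewed first.
print(shlex.join(with_smoke_cap(base, "inspect-ai")))
```

Dropping the cap from the printed command gives the full run for step 4, and the same script and arguments can be passed to `hugging-face-jobs` for step 5.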

# Quick Start

## Option A: inspect-ai via Inference Providers

Best when the model is already supported by Hugging Face Inference Providers and you want the lowest local setup overhead.

```bash
uv run scripts/inspect_eval_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task mmlu \
  --limit 20
```

Use this path when:
- you want a quick local smoke test
- you do not need direct GPU control
- the task already exists in `inspect-evals`

## Option B: inspect-ai on Local GPU

Best when you need to load the Hub model directly, use `vllm`, or fall back to Transformers for unsupported architectures.

Local GPU:

```bash
uv run scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task gsm8k \
  --limit 20
```

Transformers fallback:

```bash
uv run scripts/inspect_vllm_uv.py \
  --model microsoft/phi-2 \
  --task mmlu \
  --backend hf \
  --trust-remote-code \
  --limit 20
```

## Option C: lighteval on Local GPU

Best when the task is naturally expressed as a `lighteval` task string, especially Open LLM Leaderboard style benchmarks.

Local GPU:

```bash
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5" \
  --max-samples 20 \
  --use-chat-template
```

`accelerate` fallback:

```bash
uv run scripts/lighteval_vllm_uv.py \
  --model microsoft/phi-2 \
  --tasks "leaderboard|mmlu|5" \
  --backend accelerate \
  --trust-remote-code \
  --max-samples 20
```

# Remote Execution Boundary

This skill intentionally stops at **local execution and backend selection**.

If the user wants to:
- run these scripts on Hugging Face Jobs
- pick remote hardware
- pass secrets to remote jobs
- schedule recurring runs
- inspect / cancel / monitor jobs

then switch to the **`hugging-face-jobs`** skill and pass it one of these scripts plus the chosen arguments.

# Task Selection

`inspect-ai` examples:
- `mmlu`
- `gsm8k`
- `hellaswag`
- `arc_challenge`
- `truthfulqa`
- `winogrande`
- `humaneval`

`lighteval` task strings use `suite|task|num_fewshot`:
- `leaderboard|mmlu|5`
- `leaderboard|gsm8k|5`
- `leaderboard|arc_challenge|25`
- `lighteval|hellaswag|0`

Multiple `lighteval` tasks can be comma-separated in `--tasks`.
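
To make the format concrete, the sketch below parses and rebuilds a `--tasks` value using only the `suite|task|num_fewshot` layout described above. `LightevalTask`, `parse_tasks`, and `format_tasks` are hypothetical names. Other `lighteval` versions may accept different task-string layouts, which this sketch does not handle.

```python
from typing import NamedTuple


class LightevalTask(NamedTuple):
    suite: str
    task: str
    num_fewshot: int


def parse_tasks(spec: str) -> list[LightevalTask]:
    """Split a comma-separated --tasks value into (suite, task, num_fewshot)."""
    parsed = []
    for item in spec.split(","):
        parts = item.strip().split("|")
        if len(parts) != 3:
            raise ValueError(f"expected suite|task|num_fewshot, got {item!r}")
        suite, task, shots = parts
        parsed.append(LightevalTask(suite, task, int(shots)))
    return parsed


def format_tasks(tasks: list[LightevalTask]) -> str:
    """Join tasks back into the comma-separated form passed to --tasks."""
    return ",".join(f"{t.suite}|{t.task}|{t.num_fewshot}" for t in tasks)


tasks = parse_tasks("leaderboard|mmlu|5,leaderboard|gsm8k|5")
print(format_tasks(tasks))
```

Validating the string locally catches a malformed entry before a model is loaded.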

# Backend Selection

- Prefer `inspect_vllm_uv.py --backend vllm` for fast GPU inference on supported architectures.
- Use `inspect_vllm_uv.py --backend hf` when `vllm` does not support the model.
- Prefer `lighteval_vllm_uv.py --backend vllm` for throughput on supported models.
- Use `lighteval_vllm_uv.py --backend accelerate` as the compatibility fallback.
- Use `inspect_eval_uv.py` when Inference Providers already cover the model and you do not need direct GPU control.
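
The preference order above amounts to a short fallback chain per script. The following sketch encodes it as data. `BACKENDS` and `next_backend` are hypothetical names, and the sketch assumes the caller has already decided that the current backend failed.

```python
# Backend preference order per script, taken from the list above.
BACKENDS = {
    "scripts/inspect_vllm_uv.py": ["vllm", "hf"],
    "scripts/lighteval_vllm_uv.py": ["vllm", "accelerate"],
}


def next_backend(script, failed):
    """Return the backend to try after `failed`, or None if none remain."""
    order = BACKENDS[script]
    idx = order.index(failed)
    return order[idx + 1] if idx + 1 < len(order) else None


print(next_backend("scripts/inspect_vllm_uv.py", "vllm"))
```

When `next_backend` returns `None`, the remaining options are the provider-backed `scripts/inspect_eval_uv.py` or a hand-off to `hugging-face-jobs`.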

# Hardware Guidance

| Model size | Suggested local hardware |
|---|---|
| `< 3B` | consumer GPU / Apple Silicon / small dev GPU |
| `3B - 13B` | stronger local GPU |
| `13B+` | high-memory local GPU or hand off to `hugging-face-jobs` |

For smoke tests, prefer cheaper local runs plus `--limit` or `--max-samples`.
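
As a rough illustration, the table can be expressed as a threshold function over the parameter count in billions. `hardware_tier` is a hypothetical helper. The table lists 13B in both of its last two rows, so placing exactly 13B in the top tier is an assumption made here.

```python
def hardware_tier(params_billions: float) -> str:
    """Map a model size (billions of parameters) to the suggestion in the table."""
    if params_billions <= 0:
        raise ValueError("params_billions must be positive")
    if params_billions < 3:
        return "consumer GPU / Apple Silicon / small dev GPU"
    if params_billions < 13:
        return "stronger local GPU"
    # Assumption: exactly 13B falls in the top tier.
    return "high-memory local GPU or hand off to hugging-face-jobs"


for size in (1, 3, 8, 13, 70):
    print(f"{size}B -> {hardware_tier(size)}")
```

Parameter count is only a coarse guide, since precision, context length, and batch size also affect memory use.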

# Troubleshooting

- CUDA or vLLM OOM:
  - reduce `--batch-size`
  - reduce `--gpu-memory-utilization`
  - switch to a smaller model for the smoke test
  - if necessary, hand off to `hugging-face-jobs`
- Model unsupported by `vllm`:
  - switch to `--backend hf` for `inspect-ai`
  - switch to `--backend accelerate` for `lighteval`
- Gated/private repo access fails:
  - verify `HF_TOKEN`
- Custom model code required:
  - add `--trust-remote-code`

# Examples

See:
- `examples/USAGE_EXAMPLES.md` for local command patterns
- `scripts/inspect_eval_uv.py`
- `scripts/inspect_vllm_uv.py`
- `scripts/lighteval_vllm_uv.py`