lm-evaluation-harness

Name: lm-evaluation-harness
Author: mkurman/zorai

$ npx mdskill add mkurman/zorai/lm-evaluation-harness

Evaluate language models across 200+ standardized benchmarks.

Assesses reasoning, coding, and truthfulness via MMLU, GSM8K, and HumanEval.
Integrates with HuggingFace transformers, vLLM, SGLang, and OpenAI API.
Runs unified evaluation suites from the command line or the Python API.
Outputs aggregated scores and per-task metrics for model comparison.


---
name: lm-evaluation-harness
description: LLM evaluation framework (EleutherAI lm-evaluation-harness). Unified benchmark evaluation for language models with 200+ tasks, support for HuggingFace transformers, vLLM, SGLang, OpenAI API, GGUF, and custom models. Used by HuggingFace Open LLM Leaderboard. Covers MMLU, HellaSwag, ARC, GSM8K, HumanEval, BBH, TruthfulQA, and more.
license: MIT license
tags: [llm-evaluation, benchmark-suite, leaderboard-eval, few-shot-eval, lm-evaluation-harness]
metadata:
    skill-author: K-Dense Inc.
---

| Task | Contents | Measures |
|---|---|---|
| `leaderboard` | BBH, GPQA, IFEval, MATH (level 5), MMLU-Pro, MuSR | Open LLM Leaderboard v2 suite |
| `mmlu` | 57 subjects (STEM, humanities, social sciences) | World knowledge + reasoning |
| `gsm8k` | Grade-school math word problems | Mathematical reasoning |
| `hellaswag` | Commonsense NLI | Commonsense reasoning |
| `arc_challenge` | Science exam questions | Scientific reasoning |
| `truthfulqa` | Adversarial questions | Truthfulness/hallucination |
| `humaneval` | Python code generation | Coding ability |
| `bigbench` | 200+ BIG-Bench tasks | Broad capability assessment |
| `ifeval` | Instruction-following | Instruction adherence |

**Run leaderboard suite:**
```bash
lm-eval run \
    --model hf \
    --model_args pretrained=your-model \
    --tasks leaderboard \
    --device cuda:0 \
    --batch_size auto
```

### 4. Python API

```python
from lm_eval import simple_evaluate
from lm_eval.utils import make_table

results = simple_evaluate(
    model="hf",
    model_args={"pretrained": "meta-llama/Llama-3.2-1B"},
    tasks=["mmlu", "hellaswag", "gsm8k"],
    device="cuda:0",
    batch_size="auto",
    limit=100,  # Optional: cap samples per task (for smoke tests, not reported scores)
)

# Per-task metrics
for task, metrics in results["results"].items():
    print(f"{task}: {metrics}")

# Formatted table, as printed by the CLI
print(make_table(results))

# Task configurations used for the run
print(results["configs"])

# Per-sample outputs (present when log_samples=True)
print(results["samples"])
```
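To compare models it often helps to flatten `results["results"]` into one number per task. The sketch below assumes metric keys follow a `name,filter` pattern (for example `acc,none`), as in recent harness versions. The `pick_metric` helper is hypothetical and the numbers in `example` are made up, not real scores.

```python
def pick_metric(results, metric="acc"):
    """Return {task: value} for the first key whose name part equals `metric`."""
    picked = {}
    for task, metrics in results["results"].items():
        for key, value in metrics.items():
            # Keys are assumed to look like "acc,none": metric name, then filter name.
            name = key.split(",", 1)[0]
            if name == metric and isinstance(value, (int, float)):
                picked[task] = value
                break
    return picked


# Stand-in for the dict returned by simple_evaluate (values are made up).
example = {
    "results": {
        "hellaswag": {"alias": "hellaswag", "acc,none": 0.5, "acc_stderr,none": 0.01},
        "gsm8k": {"alias": "gsm8k", "exact_match,strict-match": 0.25},
    }
}

print(pick_metric(example))                 # {'hellaswag': 0.5}
print(pick_metric(example, "exact_match"))  # {'gsm8k': 0.25}
```

Tasks that do not report the requested metric are simply left out, so a missing entry is a cue to check which metric that task uses.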

### 5. Evaluating LoRA / PEFT Adapters

```bash
lm-eval run \
    --model hf \
    --model_args pretrained=meta-llama/Llama-3.2-1B,peft=/path/to/lora_adapter \
    --tasks mmlu \
    --device cuda:0
```

In Python:
```python
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args={
        "pretrained": "meta-llama/Llama-3.2-1B",
        "peft": "/path/to/lora_adapter",
    },
    tasks=["mmlu"],
)
```
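The CLI and Python forms above carry the same settings: the CLI takes a comma-separated `key=value` string, the Python call takes a dict. When launching runs from scripts, a small helper can keep the two in sync. `to_model_args` below is a hypothetical name, not part of the harness.

```python
def to_model_args(args):
    """Join a dict into the comma-separated key=value string used by --model_args."""
    parts = []
    for key, value in args.items():
        text = str(value)
        if "," in text:
            # A comma would be read as a separator, so refuse rather than guess at escaping.
            raise ValueError(f"value for {key!r} contains a comma: {text!r}")
        parts.append(f"{key}={text}")
    return ",".join(parts)


model_args = {
    "pretrained": "meta-llama/Llama-3.2-1B",
    "peft": "/path/to/lora_adapter",
}
print(to_model_args(model_args))
# pretrained=meta-llama/Llama-3.2-1B,peft=/path/to/lora_adapter
```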

### 6. Multi-GPU Evaluation

**Data-parallel (model fits on single GPU):**
```bash
accelerate launch -m lm_eval \
    --model hf \
    --model_args pretrained=model-name \
    --tasks lambada_openai,arc_easy \
    --batch_size 16
```

**Model-parallel (model too large for one GPU):**
```bash
lm-eval run \
    --model hf \
    --model_args pretrained=model-name,parallelize=True \
    --tasks mmlu \
    --batch_size 8
```

**Both (data + model parallel):**
```bash
accelerate launch --multi_gpu --num_processes 4 \
    -m lm_eval \
    --model hf \
    --model_args pretrained=model-name,parallelize=True \
    --tasks mmlu \
    --batch_size 8
```
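In the data-parallel case each process holds a full copy of the model and scores its own slice of the documents. The sketch below illustrates one simple striding scheme for that split. `shard` is a hypothetical helper, and the harness's actual partitioning may differ in detail.

```python
def shard(docs, rank, world_size):
    """Give process `rank` every world_size-th document, starting at index `rank`."""
    return docs[rank::world_size]


docs = list(range(10))  # stand-in for a task's documents
shards = [shard(docs, rank, 4) for rank in range(4)]
print(shards)  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

Every document lands in exactly one shard, so per-process results can be gathered and aggregated afterwards.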

### 7. API Model Evaluation

**OpenAI-compatible API:**
```bash
export OPENAI_API_KEY=your-key

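# The completions endpoint needs a completions-capable model; chat-only models
# such as gpt-4o need --model openai-chat-completions and generative tasks.
lm-eval run \
    --model openai-completions \
    --model_args model=davinci-002,base_url=https://api.openai.com/v1/completions \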
    --tasks mmlu \
    --batch_size 32
```

**Local server (vLLM served):**
```bash
lm-eval run \
    --model local-completions \
    --model_args model=local-model,base_url=http://localhost:8000/v1/completions \
    --tasks mmlu
```

### 8. Custom Tasks (YAML Config)

Create a custom task at `lm_eval/tasks/my_task/my_task.yaml`:
```yaml
task: my_custom_task
dataset_path: my-dataset
dataset_name: default
output_type: multiple_choice
training_split: train
validation_split: validation
doc_to_text: "Question: {{question}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:"
doc_to_target: "{{answer}}"
doc_to_choice: "{{choices}}"
metric_list:
  - metric: acc
```

Run it:
```bash
lm-eval run --model hf --model_args pretrained=model-name --tasks my_custom_task
```
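With `output_type: multiple_choice` the model is not asked to generate text. Each choice is scored as a continuation of the prompt, and the highest-scoring choice counts as the prediction. A minimal sketch of that scoring, assuming `answer` holds the index of the correct choice; the log-likelihood values are made up, and `render_prompt` and `score_doc` are hypothetical names.

```python
def render_prompt(doc):
    """Plain-Python equivalent of the doc_to_text template above."""
    letters = "ABCD"
    lines = [f"Question: {doc['question']}"]
    lines += [f"{letters[i]}. {choice}" for i, choice in enumerate(doc["choices"])]
    lines.append("Answer:")
    return "\n".join(lines)


def score_doc(doc, loglikelihoods):
    """Return 1.0 if the choice with the highest log-likelihood is the gold answer."""
    predicted = max(range(len(loglikelihoods)), key=loglikelihoods.__getitem__)
    return 1.0 if predicted == doc["answer"] else 0.0


doc = {"question": "What is 2 + 2?", "choices": ["3", "4", "5", "22"], "answer": 1}
fake_lls = [-4.1, -0.3, -5.0, -6.2]  # made-up scores, one per choice

print(render_prompt(doc))
print(score_doc(doc, fake_lls))  # 1.0
```

The `acc` metric is then the mean of these per-document scores.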

### 9. Few-Shot Configuration

```bash
lm-eval run \
    --model hf \
    --model_args pretrained=model-name \
    --tasks mmlu \
    --num_fewshot 5 \
    --seed 42  # a single integer also seeds few-shot sampling
```
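`--num_fewshot` sets how many solved examples are prepended to each prompt, and the random seed fixes which examples are drawn. A rough sketch of the idea, not the harness's actual sampler; `build_prompt` and the example pool are hypothetical.

```python
import random


def build_prompt(pool, query, num_fewshot, seed):
    """Prepend `num_fewshot` seeded-random solved examples to the query."""
    rng = random.Random(seed)  # private RNG, so the draw depends only on the seed
    shots = rng.sample(pool, num_fewshot)
    blocks = [f"Q: {question}\nA: {answer}" for question, answer in shots]
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)


pool = [("1 + 1?", "2"), ("2 + 3?", "5"), ("4 - 1?", "3"), ("6 / 2?", "3")]
first = build_prompt(pool, "5 + 5?", num_fewshot=2, seed=42)
second = build_prompt(pool, "5 + 5?", num_fewshot=2, seed=42)
print(first == second)  # True
```

With `num_fewshot=0` the prompt is just the query, which corresponds to a zero-shot run.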

### 10. Logging and Output Formats

```bash
# JSON output
lm-eval run --model hf --model_args pretrained=model-name \
    --tasks mmlu --output_path results/

# W&B logging
lm-eval run --model hf --model_args pretrained=model-name \
    --tasks mmlu --wandb_args project=eval-runs
```
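Once several runs have written to `--output_path`, the JSON files can be collected for side-by-side comparison. The sketch below assumes each results file is a JSON object with a top-level `results` key. Exact file names and subdirectories vary between versions, so it searches recursively instead of assuming a layout. `load_results` is a hypothetical helper and the demo file contents are made up.

```python
import json
import tempfile
from pathlib import Path


def load_results(output_dir):
    """Collect the 'results' section of every JSON file found under output_dir."""
    collected = {}
    for path in sorted(Path(output_dir).rglob("*.json")):
        with path.open() as fh:
            data = json.load(fh)
        # Skip JSON files that are not harness result files.
        if isinstance(data, dict) and "results" in data:
            collected[str(path.relative_to(output_dir))] = data["results"]
    return collected


# Demo on a temporary directory with one made-up results file.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "my-model"
    run_dir.mkdir()
    payload = {"results": {"mmlu": {"acc,none": 0.5}}}
    (run_dir / "results_demo.json").write_text(json.dumps(payload))
    print(load_results(tmp))
```

In practice the call would be `load_results("results/")`, matching the `--output_path` used above.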

## Key Evaluation Tips

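1. **Use `--batch_size auto`**: the harness probes for the largest batch size that fits in memory, which usually improves throughput
2. **Always set `--seed` for reproducibility** across evaluation runs
3. **Prefer the vLLM backend for models >7B**: it is often 5-10x faster than HF, though the gain depends on the task and hardware
4. **Use `--limit 100` during development** to test task setup quickly; do not report scores from limited runs
5. **MMLU is conventionally reported 5-shot**, but the harness does not apply that automatically: pass `--num_fewshot 5`, since tasks run 0-shot unless their config sets a default
6. **Install model backends separately**: the base package is lightweight by design
7. **Pass an explicit tokenizer path for GGUF evaluation**: without it, loading may hang or take very long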

## References

- [Full CLI Reference](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md)
- [Configuration Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/config_files.md)
- [Python API Documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/python-api.md)
- [Task Guide](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks)
- [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
