openai-evals

Name: openai-evals
Author: mkurman/zorai

$npx mdskill add mkurman/zorai/openai-evals

Evaluate LLM quality and detect regressions with community benchmarks.

Enables systematic testing of model performance across reasoning and truthfulness.
Integrates with OpenAI's eval registry for standardized benchmark suites.
Executes custom completion functions or model-graded scoring automatically.
Outputs structured accuracy metrics for CI/CD pipeline integration.

SKILL.md

.github/skills/openai-evalsView on GitHub ↗

---
name: openai-evals
description: LLM evaluation framework and registry (OpenAI Evals). Framework for evaluating LLMs and LLM-based systems with a registry of community-contributed eval templates. Supports model-graded evals, classification, simple completion matching, and custom completion functions. Use for systematic LLM quality testing, regression detection, and prompt engineering validation.
license: MIT license
tags: [model-graded-evals, regression-testing, prompt-validation, eval-registry, openai-evals]
metadata:
    skill-author: K-Dense Inc.
---|------|--------|
| `mmlu` | Match | Knowledge (57 subjects) |
| `hellaswag` | Match | Commonsense reasoning |
| `truthfulqa` | Model-graded | Truthfulness |
| `gsm8k` | Match | Math reasoning |
| `humaneval` | Custom | Code generation |
| `ifeval` | Model-graded | Instruction following |
| `bbq` | Model-graded | Bias detection |
| `factuality` | Model-graded | Factual accuracy |
| `translation` | Model-graded | Translation quality |

Browse full registry: `evals/registry/evals/`

### 10. Production Eval Pipeline Pattern

```python
# CI/CD eval pipeline
def run_eval_suite(model_name, eval_names):
    results = {}
    for eval_name in eval_names:
        cmd = f"oaieval {model_name} {eval_name} --max_samples 200"
        result = subprocess.run(cmd, shell=True, capture_output=True)
        results[eval_name] = parse_accuracy(result.stdout)
    return results

# Regression test
previous = {"mmlu": 0.86, "gsm8k": 0.92, "hellaswag": 0.85}
current = run_eval_suite("my-finetuned-model", ["mmlu", "gsm8k", "hellaswag"])

for name, score in current.items():
    if score < previous[name] - 0.02:  # 2% regression threshold
        alert(f"Regression in {name}: {previous[name]:.2f} → {score:.2f}")
```

## Key Patterns

1. **Start with templates** — most evals don't need custom Python code
2. **Use model-graded evals** for subjective quality (fluency, helpfulness, safety)
3. **Use match evals** for objective metrics (classification, multiple choice)
4. **`git lfs fetch --all`** is required before running community evals
5. **Custom completion functions** enable testing non-OpenAI models
6. **Record paths** enable debugging individual failures
7. **Private evals** can test proprietary data without exposing it

## References

- [OpenAI Evals Repo](https://github.com/openai/evals)
- [Build an Eval Guide](https://github.com/openai/evals/blob/main/docs/build-eval.md)
- [Run Evals Guide](https://github.com/openai/evals/blob/main/docs/run-evals.md)
- [Eval Templates Reference](https://github.com/openai/evals/blob/main/docs/eval-templates.md)
- [Custom Eval Example](https://github.com/openai/evals/blob/main/docs/custom-eval.md)
- [Completion Functions Guide](https://github.com/openai/evals/blob/main/docs/completion-fns.md)

More from mkurman/zorai

Skill	Description
account-management	>
agile-scrum	>
albumentations	Fast image augmentation library (Albumentations). 70+ transforms for classification, segmentation, object detection, keypoints, and pose estimation. Optimized OpenCV-based pipeline with unified API across all CV tasks. Supports images, masks, bounding boxes, and keypoints simultaneously. Note: classic Albumentations (MIT) is no longer maintained; successor AlbumentationsX uses AGPL-3.0. For torchvision-native augmentations, use torchvision.transforms.v2.
aml-compliance	Anti-Money Laundering (AML) and Know Your Customer (KYC) compliance workflow. Sanctions screening, PEP detection, transaction monitoring, suspicious activity reporting (SAR), and OFAC compliance.
anki-connect	This skill is for interacting with Anki through AnkiConnect, and should be used whenever a user asks to interact with Anki, including to read or modify decks, notes, cards, models, media, or sync operations.
approval-checkpoint-long-task	Canonical long-task pack for daemon-managed work with deliberate approval checkpoints, status summaries, rollback notes, and mobile-safe governance-aware updates.
auditing-goal-artifacts	Use when reviewing recent zorai goal run outputs, closure markers, ledgers, or evidence bundles to judge whether completion is credible or to identify remaining uncertainty.
autogen	AutoGen (Microsoft) — multi-agent conversation framework. Agent-to-agent chat, code generation & execution, tool use, group chat, and human-in-the-loop. Build collaborative AI systems with specialized agents.
backtrader	Python backtesting framework for trading strategies. Data feeds, brokers, analyzers, and live trading support. Strategy development with commission models, slippage, and signal-based execution.
beautiful-mermaid	Render Mermaid diagrams as SVG and PNG using the Beautiful Mermaid library. Use when the user asks to render a Mermaid diagram.