bare-eval
$ npx mdskill add yonatangross/orchestkit/bare-eval
Run isolated eval and grading calls using Claude's bare mode without plugin interference.
- Helps with grading skill outputs, benchmarking prompts, and testing triggers in isolation.
- Depends on Claude CLI version 2.1.81 or higher and requires an ANTHROPIC_API_KEY.
- Constructs claude -p --bare invocations to bypass hooks and plugins for clean execution.
- Delivers results via text output from single-turn calls for fast evaluation pipelines.
SKILL.md
.github/skills/bare-eval
---
name: bare-eval
description: "Run isolated eval and grading calls using CC 2.1.81 --bare mode. Constructs claude -p --bare invocations for skill evaluation, trigger testing, and LLM grading without plugin/hook interference. Use when running eval pipelines, grading skill outputs, benchmarking prompt quality, or testing trigger accuracy in isolation."
tags: [eval, bare, grading, pipeline, testing, ci]
version: 1.0.0
author: OrchestKit
user-invocable: false
complexity: medium
context: inherit
persuasion-type: discipline
effort: low
---
# Bare Eval — Isolated Evaluation Calls
Run `claude -p --bare` for fast, clean eval/grading without plugin overhead.
**CC 2.1.81 or later required.** The `--bare` flag skips hooks, LSP, plugin sync, and skill directory walks.
## When to Use
- Grading skill outputs against assertions
- Trigger classification (which skill matches a prompt)
- Description optimization iterations
- Any scripted `-p` call that doesn't need plugins
## When NOT to Use
- Testing skill routing (needs `--plugin-dir`)
- Testing agent orchestration (needs full plugin context)
- Interactive sessions
## Prerequisites
```bash
# --bare requires ANTHROPIC_API_KEY (OAuth/keychain disabled)
export ANTHROPIC_API_KEY="sk-ant-..."
# Verify CC version
claude --version # Must be >= 2.1.81
```
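The two prerequisites can also be checked from a script before any bare call is made. The sketch below is illustrative only: `version_at_least` and `check_bare_prereqs` are hypothetical helper names, not part of OrchestKit or the `claude` CLI, and it assumes that `claude --version` prints the version number as its first whitespace-separated field and that `sort -V` (version sort) is available.

```bash
# Hypothetical helper: succeeds when version $1 is >= version $2.
version_at_least() {
  local have="$1" need="$2"
  # sort -V orders version strings component by component; if the smaller
  # of the two lines is the requirement, the installed version is >= it.
  [ "$(printf '%s\n%s\n' "$need" "$have" | sort -V | head -n1)" = "$need" ]
}

# Hypothetical pre-flight check for bare mode.
check_bare_prereqs() {
  local installed
  # Assumption: the first field of `claude --version` is the version number.
  installed="$(claude --version 2>/dev/null | awk '{print $1}')"
  if [ -z "${ANTHROPIC_API_KEY:-}" ]; then
    echo "ANTHROPIC_API_KEY is not set" >&2
    return 1
  fi
  if ! version_at_least "$installed" "2.1.81"; then
    echo "claude ${installed:-(not found)} is older than 2.1.81" >&2
    return 1
  fi
}
```

Calling `check_bare_prereqs || exit 1` at the top of an eval script would stop early with a readable message instead of failing on the first `--bare` call.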
## Quick Reference
| Call Type | Command Pattern |
|-----------|----------------|
| Grading | `claude -p "$prompt" --bare --max-turns 1 --output-format text` |
| Trigger | `claude -p "$prompt" --bare --json-schema "$schema" --output-format json` |
| Optimize | `echo "$prompt" \| claude -p --bare --max-turns 1 --output-format text` |
| Force-skill | `claude -p "$prompt" --bare --append-system-prompt "$content"` |
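As a rough illustration of how the grading pattern from the table might be wrapped in a script, here is a minimal sketch. `grade_bare` and the `CLAUDE_BIN` override are hypothetical names introduced for this example; only the flags themselves come from the table. The override exists so the wrapper can be dry-run with a stub such as `echo`.

```bash
# Hypothetical wrapper around the "Grading" pattern from the table.
grade_bare() {
  local prompt="$1"
  # CLAUDE_BIN is an assumed override for dry runs; it defaults to the real CLI.
  "${CLAUDE_BIN:-claude}" -p "$prompt" --bare --max-turns 1 --output-format text
}

# Dry run: print the arguments instead of calling the CLI.
#   CLAUDE_BIN=echo
#   grade_bare "Does the output mention rate limits? Answer PASS or FAIL."
# Real call (requires the claude CLI and ANTHROPIC_API_KEY):
#   unset CLAUDE_BIN
#   verdict="$(grade_bare "Does the output mention rate limits? Answer PASS or FAIL.")"
```

Keeping the prompt as a single quoted argument avoids word splitting when the graded output contains spaces or newlines.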
## Invocation Patterns
Load detailed patterns and examples:
```
Read("${CLAUDE_SKILL_DIR}/references/invocation-patterns.md")
```
## Grading Schemas
JSON schemas for structured eval output:
```
Read("${CLAUDE_SKILL_DIR}/references/grading-schemas.md")
```
## Pipeline Integration
OrchestKit's eval scripts (`npm run eval:skill`) auto-detect bare mode:
```bash
# eval-common.sh detects ANTHROPIC_API_KEY → sets BARE_MODE=true
# Scripts add --bare to all non-plugin calls automatically
```
**Bare calls:** Trigger classification, force-skill, baseline, all grading.
**Never bare:** `run_with_skill` (needs plugin context for routing tests).
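The detection and the bare/never-bare split described above could look roughly like the following. This is a sketch under stated assumptions, not the contents of `eval-common.sh`, which is not shown here: `detect_bare_mode`, `bare_flags`, and the `needs_plugins` parameter are hypothetical names.

```bash
# Hypothetical sketch: set BARE_MODE based on whether an API key is present.
detect_bare_mode() {
  if [ -n "${ANTHROPIC_API_KEY:-}" ]; then
    BARE_MODE=true
  else
    BARE_MODE=false
  fi
}

# Print "--bare" only for calls that may run without plugins.
# needs_plugins is an assumed parameter: "yes" for calls such as
# run_with_skill that must keep plugin context, anything else otherwise.
bare_flags() {
  local needs_plugins="$1"
  if [ "$BARE_MODE" = true ] && [ "$needs_plugins" != yes ]; then
    echo "--bare"
  fi
}

# Usage (illustrative):
#   detect_bare_mode
#   claude -p "$prompt" $(bare_flags no) --max-turns 1 --output-format text
```

The unquoted `$(bare_flags no)` in the usage line is deliberate: when bare mode is off it expands to nothing rather than to an empty argument.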
## Performance
| Scenario | Without `--bare` | With `--bare` | Speedup |
|----------|---------------|-------------|---------|
| Single grading call | ~3-5s startup | ~0.5-1s | 2-4x |
| Trigger (per prompt) | ~3-5s | ~0.5-1s | 2-4x |
| Full eval (50 calls) | ~150-250s overhead | ~25-50s | 3-5x |
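The full-eval row is the per-call startup range multiplied by 50 calls. A small worked check, using a hypothetical `overhead` helper and only the figures from the table:

```bash
# Hypothetical helper: total startup overhead in seconds for N calls.
overhead() {
  awk -v n="$1" -v s="$2" 'BEGIN { printf "%g\n", n * s }'
}

overhead 50 3    # lower bound without --bare: 150
overhead 50 5    # upper bound without --bare: 250
overhead 50 0.5  # lower bound with --bare: 25
overhead 50 1    # upper bound with --bare: 50
```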
## Rules
```
Read("${CLAUDE_SKILL_DIR}/rules/_sections.md")
```
## Troubleshooting
```
Read("${CLAUDE_SKILL_DIR}/references/troubleshooting.md")
```
## Related
- `eval:skill` npm script — unified skill evaluation runner
- `eval:trigger` — trigger accuracy testing
- `eval:quality` — A/B quality comparison
- `optimize-description.sh` — iterative description improvement
- Version compatibility: `doctor/references/version-compatibility.md`