testing-llm
$ npx mdskill add yonatangross/orchestkit/testing-llm

Provides patterns for testing LLM integrations, evaluating AI outputs, and mocking responses for deterministic CI.
- Helps developers test AI features, validate LLM outputs, and build evaluation pipelines.
- Integrates with DeepEval and RAGAS for quality metrics and uses tools like VCR.py for mocking.
- Applies agentic workflows such as generator, healer, and planner to automate test processes.
- Delivers results through structured rules, API references, and automated test transformations.
SKILL.md
.github/skills/testing-llm
---
name: testing-llm
license: MIT
compatibility: "Claude Code 2.1.76+."
description: LLM and AI testing patterns — mock responses, evaluation with DeepEval/RAGAS, structured output validation, and agentic test patterns (generator, healer, planner). Use when testing AI features, validating LLM outputs, or building evaluation pipelines.
tags: [testing, llm, ai, deepeval, ragas, evaluation, mocking]
context: fork
agent: test-generator
version: 2.0.0
author: OrchestKit
user-invocable: false
disable-model-invocation: false
complexity: medium
persuasion-type: reference
metadata:
  category: document-asset-creation
allowed-tools:
  - Read
  - Glob
  - Grep
  - WebFetch
  - WebSearch
---
# LLM & AI Testing Patterns
Patterns and tools for testing LLM integrations, evaluating AI output quality, mocking responses for deterministic CI, and applying agentic test workflows (planner, generator, healer).
## Quick Reference
| Area | File | Purpose |
|------|------|---------|
| **Rules** | `rules/llm-evaluation.md` | DeepEval quality metrics, Pydantic schema validation, timeout testing |
| **Rules** | `rules/llm-mocking.md` | Mock LLM responses, VCR.py recording, custom request matchers |
| **Reference** | `references/deepeval-ragas-api.md` | Full API reference for DeepEval and RAGAS metrics |
| **Reference** | `references/generator-agent.md` | Transforms Markdown specs into Playwright tests |
| **Reference** | `references/healer-agent.md` | Auto-fixes failing tests (selectors, waits, dynamic content) |
| **Reference** | `references/planner-agent.md` | Explores app and produces Markdown test plans |
| **Checklist** | `checklists/llm-test-checklist.md` | Complete LLM testing checklist (setup, coverage, CI/CD) |
| **Example** | `examples/llm-test-patterns.md` | Full examples: mocking, structured output, DeepEval, VCR, golden datasets |
## When to Use This Skill
- Testing code that calls LLM APIs (OpenAI, Anthropic, etc.)
- Validating RAG pipeline output quality
- Setting up deterministic LLM tests in CI
- Building evaluation pipelines with quality gates
- Applying agentic test patterns (plan -> generate -> heal)
## LLM Mock Quick Start
Mock LLM responses for fast, deterministic unit tests:
```python
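from unittest.mock import AsyncMock, patch

import pytest

# synthesize_findings and sample_findings are placeholders for the
# function under test and its input; import them from your application.


@pytest.fixture
def mock_llm():
    # Awaiting the mock returns a fixed payload, so the test is deterministic.
    mock = AsyncMock()
    mock.return_value = {"content": "Mocked response", "confidence": 0.85}
    return mock


@pytest.mark.asyncio
async def test_with_mocked_llm(mock_llm):
    # Patch the model factory so no live API call is made.
    with patch("app.core.model_factory.get_model", return_value=mock_llm):
        result = await synthesize_findings(sample_findings)
        assert result["summary"] is not None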
```
**Key rule:** NEVER call live LLM APIs in CI. Use mocks for unit tests, VCR.py for integration tests.
## DeepEval Quality Quick Start
Validate LLM output quality with multi-dimensional metrics:
```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    retrieval_context=["Paris is the capital of France."],
)

# Fails if any metric scores below its threshold.
assert_test(test_case, [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.8),
])
```
## Quality Metrics Thresholds
| Metric | Threshold | Purpose |
|--------|-----------|---------|
| Answer Relevancy | >= 0.7 | Response addresses question |
| Faithfulness | >= 0.8 | Output matches context |
| Hallucination | <= 0.3 | No fabricated facts |
| Context Precision | >= 0.7 | Retrieved contexts relevant |
| Context Recall | >= 0.7 | All relevant contexts retrieved |
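The thresholds above can be encoded as a quality gate. The following is a minimal sketch, assuming metric scores arrive as a plain dict of floats keyed by metric name. `THRESHOLDS` and `failed_metrics` are hypothetical names for illustration, not part of DeepEval or RAGAS.

```python
import operator

# Thresholds from the table above: (comparison, limit) per metric.
# Hallucination is the only metric where lower is better.
THRESHOLDS = {
    "answer_relevancy": (operator.ge, 0.7),
    "faithfulness": (operator.ge, 0.8),
    "hallucination": (operator.le, 0.3),
    "context_precision": (operator.ge, 0.7),
    "context_recall": (operator.ge, 0.7),
}


def failed_metrics(scores: dict[str, float]) -> list[str]:
    """Return the names of metrics that are missing or miss their threshold."""
    failures = []
    for name, (compare, limit) in THRESHOLDS.items():
        score = scores.get(name)
        # A missing score counts as a failure so gaps cannot pass silently.
        if score is None or not compare(score, limit):
            failures.append(name)
    return failures
```

A CI step could fail the build whenever `failed_metrics(scores)` is non-empty. Treating a missing score as a failure is a design choice made here, not something the skill prescribes.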
## Structured Output Validation
Always validate LLM output with Pydantic schemas:
```python
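import pytest
from pydantic import BaseModel, Field


class LLMResponse(BaseModel):
    answer: str = Field(min_length=1)
    confidence: float = Field(ge=0.0, le=1.0)
    sources: list[str] = Field(default_factory=list)


@pytest.mark.asyncio
async def test_structured_output():
    # get_llm_response is a placeholder for your application's LLM call.
    result = await get_llm_response("test query")
    # model_validate raises ValidationError if the payload breaks the schema.
    parsed = LLMResponse.model_validate(result)
    assert 0.0 <= parsed.confidence <= 1.0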
```
## VCR.py for Integration Tests
Record and replay LLM API calls for deterministic integration tests:
```python
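import os

import pytest


@pytest.fixture(scope="module")
def vcr_config():
    return {
        # Replay only in CI; record new interactions locally.
        "record_mode": "none" if os.environ.get("CI") else "new_episodes",
        # Keep credentials out of recorded cassettes.
        "filter_headers": ["authorization", "x-api-key"],
    }


@pytest.mark.vcr()
@pytest.mark.asyncio
async def test_llm_integration():
    # llm_client is a placeholder for your application's client.
    response = await llm_client.complete("Say hello")
    assert "hello" in response.content.lower()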
```
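VCR.py does not match on the request body by default, so two different prompts sent to the same endpoint can replay the same recording. A custom request matcher can close that gap. The following is a minimal sketch of the comparison such a matcher might perform, using only the standard library and assuming each body is a JSON object. `json_bodies_match` and the `request_id` field are hypothetical, and registering the matcher with VCR.py is not shown here.

```python
import json


def json_bodies_match(body_a: bytes, body_b: bytes, ignore: tuple[str, ...] = ()) -> bool:
    """Compare two JSON object bodies, ignoring key order and top-level `ignore` keys."""

    def normalise(raw: bytes) -> dict:
        # An empty body is treated as an empty JSON object.
        payload = json.loads(raw or b"{}")
        return {key: value for key, value in payload.items() if key not in ignore}

    return normalise(body_a) == normalise(body_b)


# Key order does not matter, but a different prompt does.
same = json_bodies_match(b'{"model": "m", "prompt": "hi"}', b'{"prompt": "hi", "model": "m"}')
different = json_bodies_match(b'{"prompt": "hi"}', b'{"prompt": "bye"}')
```

Passing volatile fields through `ignore` (for example a per-request identifier) keeps recordings replayable when only those fields change.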
## Agentic Test Workflow
The three-agent pattern for end-to-end test automation:
```
Planner -> specs/*.md -> Generator -> tests/*.spec.ts -> Healer (auto-fix)
```
1. **Planner** (`references/planner-agent.md`): Explores your app, produces Markdown test plans from PRDs or natural language requests. Requires `seed.spec.ts` for app context.
2. **Generator** (`references/generator-agent.md`): Converts Markdown specs into Playwright tests. Actively validates selectors against the running app. Uses semantic locators (getByRole, getByLabel, getByText).
3. **Healer** (`references/healer-agent.md`): Automatically fixes failing tests by replaying failures, inspecting the DOM, and patching locators/waits. Max 3 healing attempts per test.
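The Healer's bounded retry behaviour can be sketched as a simple loop. This is an illustration only, not the Healer's actual implementation: `heal_test`, `run_test`, and `apply_fix` are hypothetical names, and the real agent's fixes (patching locators or waits) are represented by an opaque callback.

```python
from typing import Callable


def heal_test(
    run_test: Callable[[], bool],
    apply_fix: Callable[[int], None],
    max_attempts: int = 3,
) -> bool:
    """Re-run a failing test, applying one fix per attempt, up to max_attempts."""
    if run_test():
        return True  # Already passing; nothing to heal.
    for attempt in range(1, max_attempts + 1):
        apply_fix(attempt)  # For example, patch a locator or add a wait.
        if run_test():
            return True
    return False  # Give up and surface the failure for human review.
```

Capping the attempts keeps a genuinely broken test from being "healed" indefinitely, so a real regression still reaches a person.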
## Edge Cases to Always Test
For every LLM integration, cover these paths:
- **Empty/null inputs** -- empty strings, None values
- **Long inputs** -- truncation behavior near token limits
- **Timeouts** -- fail-open vs fail-closed behavior
- **Schema violations** -- invalid structured output
- **Prompt injection** -- adversarial input resistance
- **Unicode** -- non-ASCII characters in prompts and responses
See `checklists/llm-test-checklist.md` for the complete checklist.
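The timeout case is easy to exercise without a network. The following is a minimal sketch, assuming the LLM call is an async callable. `call_with_timeout` and `slow_llm` are hypothetical names, and `slow_llm` stands in for a hung provider call.

```python
import asyncio


async def call_with_timeout(llm_call, timeout_s, *, fail_open=False, fallback=None):
    """Await llm_call(); on timeout, return fallback (fail-open) or re-raise (fail-closed)."""
    try:
        return await asyncio.wait_for(llm_call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        if fail_open:
            return fallback  # Fail-open: degrade gracefully.
        raise  # Fail-closed: propagate the timeout to the caller.


async def slow_llm():
    # Stand-in for a hung provider call; no network is involved.
    await asyncio.sleep(1.0)
    return {"content": "too late"}
```

A test can then assert both behaviours: the fail-open path returns the fallback, and the fail-closed path raises `asyncio.TimeoutError`.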
## Anti-Patterns
| Anti-Pattern | Correct Approach |
|-------------|-----------------|
| Live LLM calls in CI | Mock for unit, VCR for integration |
| Random seeds | Fixed seeds or mocked responses |
| Single metric evaluation | 3-5 quality dimensions |
| No timeout handling | Always set a timeout under 1 s in tests |
| Hardcoded API keys | Environment variables, filtered in VCR |
| Asserting only `is not None` | Schema validation + quality metrics |
## Related Skills
- `ork:testing-unit` — Unit testing fundamentals, AAA pattern
- `ork:testing-integration` — Integration testing for AI pipelines
- `ork:golden-dataset` — Evaluation dataset management