golden-dataset

$ npx mdskill add yonatangross/orchestkit/golden-dataset

Manages the golden dataset lifecycle for AI/ML evaluation: curation, versioning, validation, and CI integration.

  • Helps with building evaluation datasets, managing versions, validating quality scores, and integrating golden tests into pipelines.
  • Uses the Read, Glob, Grep, WebFetch, and WebSearch tools.
  • Applies patterns for curation, management, validation, and the add workflow, loaded on demand from rule files.
  • Organizes its guidance into categories with quick reference tables for navigation.

SKILL.md

.github/skills/golden-dataset
---
name: golden-dataset
license: MIT
compatibility: "Claude Code 2.1.76+."
description: Golden dataset lifecycle patterns for curation, versioning, quality validation, and CI integration. Use when building evaluation datasets, managing dataset versions, validating quality scores, or integrating golden tests into pipelines.
tags: [golden-dataset, evaluation, dataset-curation, dataset-validation, quality, llm-testing]
context: fork
agent: data-pipeline-engineer
version: 2.0.0
author: OrchestKit
user-invocable: false
disable-model-invocation: true
complexity: medium
persuasion-type: guidance
metadata:
  category: document-asset-creation
allowed-tools:
  - Read
  - Glob
  - Grep
  - WebFetch
  - WebSearch
---

# Golden Dataset

Patterns for building, managing, and validating golden datasets for AI/ML evaluation. Each category has its own rule files in `rules/`, loaded on demand.

## Quick Reference

| Category | Rules | Impact | When to Use |
| -------- | ----- | ------ | ----------- |
| [Curation](#curation) | 3 | HIGH | Content collection, annotation pipelines, diversity analysis |
| [Management](#management) | 3 | HIGH | Versioning, backup/restore, CI/CD automation |
| [Validation](#validation) | 3 | CRITICAL | Quality scoring, drift detection, regression testing |
| [Add Workflow](#add-workflow) | 1 | HIGH | 9-phase curation, quality scoring, bias detection, silver-to-gold |

Total: 10 rules across 4 categories

## Curation

Content collection, multi-agent annotation, and diversity analysis for golden datasets.

| Rule | File | Key Pattern |
| ---- | ---- | ----------- |
| Collection | `rules/curation-collection.md` | Content type classification, quality thresholds, duplicate prevention |
| Annotation | `rules/curation-annotation.md` | Multi-agent pipeline, consensus aggregation, Langfuse tracing |
| Diversity | `rules/curation-diversity.md` | Difficulty stratification, domain coverage, balance guidelines |

## Management

Versioning, storage, and CI/CD automation for golden datasets.

| Rule | File | Key Pattern |
| ---- | ---- | ----------- |
| Versioning | `rules/management-versioning.md` | JSON backup format, embedding regeneration, disaster recovery |
| Storage | `rules/management-storage.md` | Backup strategies, URL contract, data integrity checks |
| CI Integration | `rules/management-ci.md` | GitHub Actions automation, pre-deployment validation, weekly backups |
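
The versioning pattern (JSON backup, embeddings regenerated on restore) can be sketched roughly as follows. This is a minimal illustration, not the contents of `rules/management-versioning.md`: the `embedding` and `content` field names, the top-level `documents` key, and the `embed` callable are all assumptions made for the example.

```python
import json
from typing import Callable, Iterable


def export_backup(documents: Iterable[dict]) -> str:
    """Serialize documents to JSON, leaving out the (assumed) 'embedding' field."""
    stripped = [
        {key: value for key, value in doc.items() if key != "embedding"}
        for doc in documents
    ]
    # Sorted keys and fixed indentation keep diffs small under version control.
    return json.dumps({"documents": stripped}, indent=2, sort_keys=True, ensure_ascii=False)


def restore_backup(payload: str, embed: Callable[[str], list]) -> list:
    """Load documents from a backup and regenerate each embedding with `embed`."""
    documents = json.loads(payload)["documents"]
    for doc in documents:
        # 'content' is a hypothetical field holding the text to embed.
        doc["embedding"] = embed(doc["content"])
    return documents
```

Passing the embedding function in as a parameter keeps the sketch independent of any particular embedding service; a real restore would supply the project's own embedding call.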

## Validation

Quality scoring, drift detection, and regression testing for golden datasets.

| Rule | File | Key Pattern |
| ---- | ---- | ----------- |
| Quality | `rules/validation-quality.md` | Schema validation, content quality, referential integrity |
| Drift | `rules/validation-drift.md` | Duplicate detection, semantic similarity, coverage gap analysis |
| Regression | `rules/validation-regression.md` | Difficulty distribution, pre-commit hooks, full dataset validation |
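
As a rough sketch of near-duplicate detection by semantic similarity, the check below compares a candidate embedding against existing ones using cosine similarity and the 0.90 / 0.85 thresholds from the Key Decisions table. The function names are hypothetical, and the use of cosine similarity is an assumption; the rule files may measure similarity differently.

```python
import math

BLOCK_THRESHOLD = 0.90  # similarity at or above this blocks the new entry
WARN_THRESHOLD = 0.85   # similarity at or above this only raises a warning


def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def duplicate_status(candidate: list, existing: list) -> str:
    """Return 'block', 'warn', or 'ok' based on the closest existing embedding."""
    best = max((cosine_similarity(candidate, other) for other in existing), default=0.0)
    if best >= BLOCK_THRESHOLD:
        return "block"
    if best >= WARN_THRESHOLD:
        return "warn"
    return "ok"
```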

## Add Workflow

Structured workflow for adding new documents to the golden dataset.

| Rule | File | Key Pattern |
| ---- | ---- | ----------- |
| Add Document | `rules/curation-add-workflow.md` | 9-phase curation, parallel quality analysis, bias detection |

## Quick Start Example

```python
async def validate_before_add(document: dict, source_url_map: dict) -> dict:
    """Pre-addition validation for golden dataset entries.

    Returns {"valid": bool, "errors": list of messages}.
    """
    # Note: source_url_map is accepted but not consulted by these quick checks.
    errors = []

    # 1. URL contract check (`or ""` guards against an explicit None value)
    if "placeholder" in (document.get("source_url") or ""):
        errors.append("URL must be canonical, not a placeholder")

    # 2. Content quality
    if len(document.get("title") or "") < 10:
        errors.append("Title too short (min 10 chars)")

    # 3. Tag requirements
    if len(document.get("tags") or []) < 2:
        errors.append("At least 2 domain tags required")

    return {"valid": len(errors) == 0, "errors": errors}
```

## Key Decisions

| Decision | Recommendation |
| -------- | -------------- |
| Backup format | JSON (version controlled, portable) |
| Embedding storage | Exclude from backup (regenerate on restore) |
| Quality threshold | >= 0.70 quality score for inclusion |
| Confidence threshold | >= 0.65 for auto-include |
| Duplicate threshold | Similarity >= 0.90 blocks the entry; >= 0.85 warns |
| Min tags per entry | 2 domain tags |
| Min test queries | 3 per document |
| Difficulty balance | Minimums: Trivial 3, Easy 3, Medium 5, Hard 3 |
| CI frequency | Weekly automated backup (Sunday 2am UTC) |
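
One way to read the quality and confidence thresholds above is as a two-step gate. The sketch below is an interpretation, not something the table specifies: how the two thresholds combine, and the `review` outcome for entries that pass on quality but fall short on confidence, are assumptions.

```python
QUALITY_THRESHOLD = 0.70     # minimum quality score for inclusion
CONFIDENCE_THRESHOLD = 0.65  # minimum confidence for auto-include


def inclusion_decision(quality_score: float, confidence: float) -> str:
    """Return 'include', 'review', or 'exclude' for a candidate entry."""
    if quality_score < QUALITY_THRESHOLD:
        return "exclude"
    if confidence >= CONFIDENCE_THRESHOLD:
        return "include"
    # Assumed fallback: good enough on quality, but not confident enough to auto-include.
    return "review"
```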

## Common Mistakes

1. Using placeholder URLs instead of canonical source URLs
2. Skipping embedding regeneration after restore
3. Not validating referential integrity between documents and queries
4. Over-indexing on articles (neglecting tutorials, research papers)
5. Missing difficulty distribution balance in test queries
6. Not running verification after backup/restore operations
7. Testing restore procedures in production instead of staging
8. Committing SQL dumps instead of JSON (not version-control friendly)
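
Mistakes 3 and 5 lend themselves to mechanical checks. The sketch below is illustrative only: the `id`, `expected_doc_ids`, and `difficulty` field names and the lowercase difficulty labels are assumptions, while the minimum counts come from the Key Decisions table.

```python
from collections import Counter

# Minimum test queries per difficulty level, from Key Decisions.
MIN_PER_DIFFICULTY = {"trivial": 3, "easy": 3, "medium": 5, "hard": 3}


def orphaned_queries(documents: list, queries: list) -> list:
    """Return IDs of queries that reference a document missing from the dataset."""
    known_ids = {doc["id"] for doc in documents}
    return [q["id"] for q in queries if not set(q["expected_doc_ids"]) <= known_ids]


def difficulty_shortfalls(queries: list) -> dict:
    """Return how many more queries each under-represented difficulty level needs."""
    counts = Counter(q["difficulty"] for q in queries)
    return {
        level: minimum - counts[level]
        for level, minimum in MIN_PER_DIFFICULTY.items()
        if counts[level] < minimum
    }
```

An empty list from `orphaned_queries` and an empty dict from `difficulty_shortfalls` would mean both checks pass.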

## Evaluations

See `test-cases.json` for 9 test cases across all categories.

## Related Skills

- `ork:rag-retrieval` - Retrieval evaluation using golden dataset
- `langfuse-observability` - Tracing patterns for curation workflows
- `ork:testing-unit` - Unit testing patterns and strategies
- `ai-native-development` - Embedding generation for restore

## Capability Details

### curation

**Keywords:** golden dataset, curation, content collection, annotation, quality criteria

**Solves:**

- Classify document content types for golden dataset
- Run multi-agent quality analysis pipelines
- Generate test queries for new documents

### management

**Keywords:** golden dataset, backup, restore, versioning, disaster recovery

**Solves:**

- Backup and restore golden datasets with JSON
- Regenerate embeddings after restore
- Automate backups with CI/CD

### validation

**Keywords:** golden dataset, validation, schema, duplicate detection, quality metrics

**Solves:**

- Validate entries against document schema
- Detect duplicate or near-duplicate entries
- Analyze dataset coverage and distribution gaps
