run-judges

$ npx mdskill add closedloop-ai/claude-plugins/run-judges

Run parallel judges to validate plans, code, or PRDs.

  • Evaluates implementation plans, code artifacts, or PRDs automatically.
  • Depends on specialized judge agents configured by artifact type.
  • Selects evaluation batches based on plan, code, or PRD categories.
  • Outputs structured JSON files with aggregated CaseScore results.
---
name: run-judges
description: Orchestrate parallel judge agent execution, aggregate CaseScore results, write plan-judges.json, code-judges.json, or prd-judges.json, and validate output. Supports evaluating implementation plans (16 judges), code artifacts (11 judges), or PRD artifacts (4 judges) via --artifact-type parameter.
context: fork
---

# Run Judges Skill

## Purpose

Execute specialized judge agents in parallel to evaluate implementation plan quality (16 judges), code quality (11 judges), or PRD quality (4 judges). Aggregates results into `$CLOSEDLOOP_WORKDIR/plan-judges.json` (plan), `$CLOSEDLOOP_WORKDIR/code-judges.json` (code), or `$CLOSEDLOOP_WORKDIR/prd-judges.json` (prd) with validated output format.

## Parameters

**--workdir**: Path to the working directory containing judge artifacts (optional)

- Resolved in order: `--workdir` argument → `$CLOSEDLOOP_WORKDIR` environment variable → `.closedloop-ai/judges` (default, relative to current working directory)
- The directory is created automatically if it does not exist
- All output files (`plan-judges.json`, `code-judges.json`, `prd-judges.json`, `judge-input.json`, `perf.jsonl`, etc.) are written to this resolved directory

**--artifact-type**: Artifact category to evaluate (plan | code | prd), default: plan

- **plan** (default): Evaluate implementation plan with 16 judges, 4 batches, output to plan-judges.json
- **code**: Evaluate implemented code with 11 judges, 3 batches, output to code-judges.json
- **prd**: Evaluate PRD document with 4 judges, single parallel batch, output to prd-judges.json

## Judge Input Contract (`judge-input.json`)

The judge input contract is maintained in:

`skills/run-judges/references/judge-input-contract.md` (resolve to an absolute path at runtime via `Glob`)

This keeps orchestration flow readable while preserving a single source of truth for contract fields and semantics.

## Task Context

You are orchestrating quality evaluation for a ClosedLoop artifact (implementation plan, code, or PRD). Your responsibilities:

**For plan artifacts (default):**
1. Launch context-manager-for-judges agent to prepare compressed plan context
2. Build `judge-input.json` with plan task/context mapping
3. Launch all 16 judge agents in parallel batches
4. Aggregate their CaseScore outputs into a valid EvaluationReport
5. Write the report to `$CLOSEDLOOP_WORKDIR/plan-judges.json`
6. Validate output structure and completeness

**For code artifacts (--artifact-type code):**
1. Launch context-manager-for-judges agent to prepare compressed context
2. Build `judge-input.json` with code task/context mapping
3. Launch 11 judge agents in parallel batches
4. Aggregate their CaseScore outputs into a valid EvaluationReport
5. Write the report to `$CLOSEDLOOP_WORKDIR/code-judges.json`
6. Validate output structure and completeness

**For PRD artifacts (--artifact-type prd):**
1. Check `$CLOSEDLOOP_WORKDIR/prd.md` exists (graceful exit if missing)
2. Build `judge-input.json` with evaluation_type: "prd" and primary_artifact pointing to prd.md
3. Launch all 4 PRD judges in a single parallel batch
4. Aggregate all 4 CaseScores into a valid EvaluationReport
5. Write the report to `$CLOSEDLOOP_WORKDIR/prd-judges.json`
6. Validate output structure and completeness

**Success criteria:**
- All judges executed (or error CaseScores generated for failures)
- Valid JSON written to appropriate output file
- Validation script passes with zero errors

---

## Threshold Overrides

The run-judges skill supports per-artifact-type threshold customization via JSON configuration files. This allows you to adjust evaluation strictness for different artifact types (e.g., applying a lower threshold for test-judge when evaluating code vs plan).

### Configuration Schema

Threshold overrides are defined in a JSON file with the following structure:

```json
{
  "overrides": {
    "artifact_type:judge_name": <threshold_float>
  }
}
```

Where:
- **Key format**: `"artifact_type:judge_name"` (e.g., `"code:test-judge"`, `"plan:technical-accuracy-judge"`)
- **Value**: Threshold as a float in range `[0.0, 1.0]`

**Example configuration:**
```json
{
  "overrides": {
    "code:test-judge": 0.75,
    "plan:technical-accuracy-judge": 0.85
  }
}
```

### Loading Precedence

The skill checks the following locations in order, using the first valid configuration found:

1. **Run-specific overrides** (highest precedence):
   - Path: `$CLOSEDLOOP_WORKDIR/.closedloop-ai/settings/threshold-overrides.json`
   - Use case: Override thresholds for a specific ClosedLoop run

2. **Repo-level defaults** (fallback):
   - Path: `<project-root>/.closedloop-ai/settings/threshold-overrides.json`
   - Use case: Set project-wide threshold defaults

3. **Hardcoded defaults** (graceful degradation):
   - If no configuration file exists at any location, use built-in defaults
   - No error is raised for missing configuration files
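
The precedence chain above can be sketched in Python. This is a minimal illustration under stated assumptions, not the skill's actual loader: the helper name `load_threshold_overrides` and its parameters are hypothetical, and only the relative file path comes from the text. It also assumes that an invalid file at one location falls through to the next location ("first valid configuration found").

```python
import json
from pathlib import Path

# Relative location of the overrides file under both roots (taken from the text above).
OVERRIDES_RELPATH = Path(".closedloop-ai/settings/threshold-overrides.json")


def load_threshold_overrides(workdir, project_root):
    """Return the "overrides" mapping from the first usable config file, else {}.

    Hypothetical helper: the run-specific file is checked first, then the repo-level file.
    An empty dict means "use the hardcoded defaults".
    """
    for root in (Path(workdir), Path(project_root)):
        candidate = root / OVERRIDES_RELPATH
        if not candidate.is_file():
            continue  # Missing files are skipped silently, with no warning.
        try:
            data = json.loads(candidate.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError) as error:
            print(f"Warning: Invalid threshold-overrides.json, skipping overrides: {error}")
            continue  # Assumption: fall through to the next location.
        if isinstance(data, dict) and isinstance(data.get("overrides"), dict):
            return data["overrides"]
        print("Warning: Invalid threshold-overrides.json, skipping overrides: no 'overrides' object")
    return {}
```

Returning an empty mapping instead of raising keeps the "graceful degradation" behavior: the caller can always merge the result with its built-in defaults.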

### Default Overrides

The following default overrides apply when evaluating code artifacts:

| Judge | Code Threshold | Plan Threshold | Rationale |
|-------|----------------|----------------|-----------|
| `test-judge` | 0.75 | 0.8 | Code may have tests written separately from the implementation; the lower threshold accounts for incremental test development |

All other judges use the same threshold (typically 0.8) across artifact types.

### Validation and Error Handling

When loading threshold overrides, the skill applies the following validation rules:

**Schema Validation:**
- Configuration must contain an `"overrides"` key
- Each key must match the pattern `artifact_type:judge_name`
- Each value must be a float in range `[0.0, 1.0]`
- Keys must reference valid artifact types (`plan`, `code`, `prd`) and judge names

**Error Behavior:**
- **Malformed JSON**: Log warning and continue with hardcoded defaults
  ```
  Warning: Invalid threshold-overrides.json, skipping overrides: {error}
  ```
- **Invalid schema**: Log warning and continue with hardcoded defaults
- **File not found**: Silently use defaults (no warning logged)

**Error recovery ensures the skill always completes judge execution**, even if threshold configuration is incorrect.
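
The schema rules above could be checked along these lines. This is a sketch only: `validate_overrides` and `known_judges` are hypothetical names, and accepting integer values such as `0` or `1` as thresholds is an assumption the text does not confirm.

```python
VALID_ARTIFACT_TYPES = {"plan", "code", "prd"}


def validate_overrides(config, known_judges):
    """Return a list of human-readable problems; an empty list means the config is valid."""
    overrides = config.get("overrides") if isinstance(config, dict) else None
    if not isinstance(overrides, dict):
        return ['configuration must contain an "overrides" object']
    problems = []
    for key, value in overrides.items():
        artifact_type, sep, judge_name = key.partition(":")
        if not sep or not judge_name:
            problems.append(f"{key!r}: key must look like artifact_type:judge_name")
            continue
        if artifact_type not in VALID_ARTIFACT_TYPES:
            problems.append(f"{key!r}: unknown artifact type {artifact_type!r}")
        if judge_name not in known_judges:
            problems.append(f"{key!r}: unknown judge {judge_name!r}")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{key!r}: threshold must be a number")
        elif not 0.0 <= value <= 1.0:
            problems.append(f"{key!r}: threshold {value} is outside [0.0, 1.0]")
    return problems
```

Collecting every problem instead of stopping at the first one makes the single warning line more useful when a file has several bad entries.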

### Integration with Judge Execution

When executing judges:

1. **Before launching judge batches**: Load threshold overrides from the precedence chain
2. **Merge with defaults**: Loaded overrides take precedence over hardcoded defaults
3. **Apply per-judge**: Each judge receives its artifact-type-specific threshold via the evaluation context
4. **CaseScore validation**: Thresholds are used to determine `final_status` (pass/fail) based on metric scores

**When artifact type is code**:
- Load threshold overrides before executing judge batches
- Merge loaded overrides with defaults (loaded values take precedence)
- Apply code-specific thresholds to each judge's evaluation criteria
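
As a rough sketch of the merge-then-apply order, the lookup for a single judge might look like this. The names `resolve_threshold`, `DEFAULT_THRESHOLD`, and `BUILTIN_OVERRIDES` are hypothetical; the 0.8 default and the `code:test-judge` value of 0.75 are taken from the Default Overrides table.

```python
DEFAULT_THRESHOLD = 0.8  # "typically 0.8" per the Default Overrides section
BUILTIN_OVERRIDES = {"code:test-judge": 0.75}  # the one default override listed there


def resolve_threshold(artifact_type, judge_name, loaded_overrides):
    """Pick the threshold for one judge: loaded overrides, then built-ins, then the default."""
    # Later entries win in a dict merge, so loaded values take precedence over built-ins.
    merged = {**BUILTIN_OVERRIDES, **loaded_overrides}
    return merged.get(f"{artifact_type}:{judge_name}", DEFAULT_THRESHOLD)
```

For example, `resolve_threshold("code", "test-judge", {})` yields the built-in 0.75, while the same judge under `plan` falls back to 0.8.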

---

## Performance Instrumentation (Mandatory)

You MUST emit a `pipeline_step` event to `$CLOSEDLOOP_WORKDIR/perf.jsonl` at the **end** of each phase below. This keeps perf telemetry in the canonical schema and adds nested metadata for judge/sub-agent work.

**Context:** `CLOSEDLOOP_WORKDIR`, `CLOSEDLOOP_RUN_ID`, and `CLOSEDLOOP_ITERATION` are set by the run-loop. `CLOSEDLOOP_PARENT_STEP` and `CLOSEDLOOP_PARENT_STEP_NAME` are set as env vars on the `claude` invocation by run-loop; they are inherited by all Bash tool calls — no sourcing needed.
Use `sub_step` as numeric phase order and optional `sub_step_name` to capture the judge/sub-agent name when applicable (for batch-level phases where many judges run, use the batch label).

**Sub-step numbering:**

| Artifact | sub_step | sub_step_name   |
|----------|----------|-----------------|
| plan     | 0        | context_manager |
| plan     | 1–4      | batch_1 … batch_4 |
| plan     | 5        | aggregate       |
| plan     | 6        | validate        |
| code     | 0        | context_manager |
| code     | 1–3      | batch_1 … batch_3 |
| code     | 4        | aggregate       |
| code     | 5        | validate        |
| prd      | 0        | context_prep (skipped — prd mode does not use context-manager-for-judges) |
| prd      | 1        | prd_judges      |
| prd      | 2        | aggregate       |
| prd      | 3        | validate        |
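
The numbering in the table follows the phase order for each artifact type, so it can be expressed as a list lookup. The sketch below is illustrative; `PHASES` and `sub_step_for` are hypothetical names, and the labels are copied from the table.

```python
# Ordered phase labels per artifact type; the list index is the numeric sub_step.
PHASES = {
    "plan": ["context_manager", "batch_1", "batch_2", "batch_3", "batch_4", "aggregate", "validate"],
    "code": ["context_manager", "batch_1", "batch_2", "batch_3", "aggregate", "validate"],
    "prd": ["context_prep", "prd_judges", "aggregate", "validate"],
}


def sub_step_for(artifact_type, sub_step_name):
    """Return the numeric sub_step for a phase label; raises ValueError for unknown labels."""
    return PHASES[artifact_type].index(sub_step_name)
```

Keeping the order in one place avoids hand-copying numbers such as "aggregate is 5 for plan but 4 for code" into each phase.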

**Start of phase (run Bash once at the beginning of each phase):** Set the two sub-step variables at the top for the current phase, then run the block. It writes start time to a temp file so the end-of-phase Bash can compute duration. `CLOSEDLOOP_PARENT_STEP` and `CLOSEDLOOP_PARENT_STEP_NAME` are already in the environment (set by run-loop on the `claude` invocation).

```bash
# Set these two values for the current phase:
SUB_STEP_NUM=0
SUB_STEP_LABEL="context_manager"   # context_manager | context_prep | batch_1 … | prd_judges | aggregate | validate

mkdir -p "$CLOSEDLOOP_WORKDIR/.closedloop-ai"
{
  echo "SUB_STEP=${SUB_STEP_NUM}"
  echo "SUB_STEP_NAME=${SUB_STEP_LABEL}"
  echo "PARENT_STEP=${CLOSEDLOOP_PARENT_STEP:-0}"
  echo "PARENT_STEP_NAME=${CLOSEDLOOP_PARENT_STEP_NAME:-unknown}"
  echo "STARTED_AT=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  echo "START_EPOCH=$(date +%s)"
} > "$CLOSEDLOOP_WORKDIR/.closedloop-ai/perf-substep-start.env"
```

**End of phase (run Bash once at the end of each phase, after the phase work is done):** Read start time, compute duration, append one line to `perf.jsonl`, then remove the temp file.

```bash
source "$CLOSEDLOOP_WORKDIR/.closedloop-ai/perf-substep-start.env"
END_EPOCH=$(date +%s)
ENDED_AT=$(date -u +%Y-%m-%dT%H:%M:%SZ)
DURATION=$((END_EPOCH - START_EPOCH))
jq -n -c \
  --arg event "pipeline_step" \
  --arg run_id "${CLOSEDLOOP_RUN_ID:-unknown}" \
  --argjson iteration "${CLOSEDLOOP_ITERATION:-0}" \
  --argjson step "$PARENT_STEP" \
  --arg step_name "$PARENT_STEP_NAME" \
  --argjson sub_step "$SUB_STEP" \
  --arg sub_step_name "$SUB_STEP_NAME" \
  --arg started_at "$STARTED_AT" \
  --arg ended_at "$ENDED_AT" \
  --argjson duration_s "$DURATION" \
  --argjson exit_code 0 \
  --argjson skipped false \
  '{event:$event,run_id:$run_id,iteration:$iteration,step:$step,step_name:$step_name,sub_step:$sub_step,sub_step_name:$sub_step_name,started_at:$started_at,ended_at:$ended_at,duration_s:$duration_s,exit_code:$exit_code,skipped:$skipped}' >> "$CLOSEDLOOP_WORKDIR/perf.jsonl"
rm -f "$CLOSEDLOOP_WORKDIR/.closedloop-ai/perf-substep-start.env"
```

**Order of operations per phase:** Run the "start of phase" Bash first (set `SUB_STEP_NUM` and `SUB_STEP_LABEL` at the top, then run the block), then perform the phase work, then run the "end of phase" Bash.

---

## Execution Workflow

### Working Directory Resolution

**Before any other step**, resolve the working directory and export it as `CLOSEDLOOP_WORKDIR`:

```bash
# Resolve working directory (precedence: --workdir arg > env var > default)
if [ -n "$ARG_WORKDIR" ]; then
  WORKDIR="$ARG_WORKDIR"
elif [ -n "$CLOSEDLOOP_WORKDIR" ]; then
  WORKDIR="$CLOSEDLOOP_WORKDIR"
else
  WORKDIR="$(pwd)/.closedloop-ai/judges"
fi

mkdir -p "$WORKDIR"
export CLOSEDLOOP_WORKDIR="$WORKDIR"
```

Where `$ARG_WORKDIR` is the value passed via `--workdir` in the invocation prompt. All subsequent references to `$CLOSEDLOOP_WORKDIR` use this resolved value.

---

### Agents Snapshot (Pre-Step)

**Before any judge execution**, ensure a snapshot of judge agent definitions exists in `$CLOSEDLOOP_WORKDIR/agents-snapshot/`. This preserves the exact agent versions used for each evaluation run.

**Action:** Run the snapshot script via Bash:

```bash
bash "${CLAUDE_PLUGIN_ROOT}/skills/run-judges/scripts/ensure_agents_snapshot.sh" "$CLOSEDLOOP_WORKDIR"
```

The script is idempotent — it skips if `manifest.json` already exists.

**Error handling:** If the script fails or is not found, log a warning and continue — snapshot failure must not block judge execution.

---

### Step 0: Mandatory Contract Pre-Read

Before any prerequisite checks or judge launches:

1. Resolve the contract file path using `Glob` with:
   - `**/skills/run-judges/references/judge-input-contract.md`
2. Read the resolved `judge-input-contract.md` file in full.
3. Apply the contract requirements when constructing `$CLOSEDLOOP_WORKDIR/judge-input.json`.
4. If the file is missing, ambiguous (multiple matches), or unreadable, fail fast with a clear error (do not proceed with judge execution).
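
The fail-fast rule in step 4 can be illustrated with a small Python sketch. The skill itself resolves the path with the `Glob` tool; here `pathlib` stands in for it, and `resolve_contract` is a hypothetical helper name.

```python
from pathlib import Path

CONTRACT_PATTERN = "**/skills/run-judges/references/judge-input-contract.md"


def resolve_contract(search_root):
    """Return the text of the single matching contract file, or raise if resolution is not clean."""
    matches = sorted(Path(search_root).glob(CONTRACT_PATTERN))
    if not matches:
        raise FileNotFoundError(f"judge-input-contract.md not found under {search_root}")
    if len(matches) > 1:
        raise RuntimeError(f"ambiguous contract path, {len(matches)} matches: {matches}")
    # An unreadable file raises OSError here, which also stops the run.
    return matches[0].read_text(encoding="utf-8")
```

All three failure cases raise before any judge is launched, which matches the "do not proceed with judge execution" requirement.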

### Prerequisites Check

**Performance:** At the start of this phase run the "start of phase" Bash with `SUB_STEP_NUM=0` and `SUB_STEP_LABEL=context_manager` for both plan and code modes. At the end of the phase run the "end of phase" Bash.

**Before starting, verify required inputs exist:**

**For plan artifacts (default):**
```bash
# Validate input files exist
if [ ! -f "$CLOSEDLOOP_WORKDIR/prd.md" ]; then
  echo "WARNING: $CLOSEDLOOP_WORKDIR/prd.md not found. Skipping judges."
  exit 0  # Graceful skip - do not fail workflow
fi

if [ ! -f "$CLOSEDLOOP_WORKDIR/plan.json" ]; then
  echo "WARNING: $CLOSEDLOOP_WORKDIR/plan.json not found. Skipping judges."
  exit 0
fi
```

**Investigation log resolution (plan mode):**

After validating `prd.md` and `plan.json`, resolve supporting context for plan judges:

1. **Use existing file first**
   - If `$CLOSEDLOOP_WORKDIR/investigation-log.md` exists, use it as-is.

2. **Check `@code:pre-explorer` availability before invoking**
   - Perform an explicit capability probe for `@code:pre-explorer` in the active Claude/plugin environment.
   - Treat "unknown agent", "agent not found", or plugin resolution errors as **pre-explorer unavailable**.
   - Recommended probe pattern:
     - Attempt a minimal `Task()` call targeting `@code:pre-explorer`.
     - If the platform rejects the agent type before execution, classify as unavailable and continue to internal fallback.

3. **If available, invoke pre-explorer**
   - Launch `@code:pre-explorer` with `WORKDIR=$CLOSEDLOOP_WORKDIR` to generate missing pre-exploration artifacts.
   - Re-check for `$CLOSEDLOOP_WORKDIR/investigation-log.md` after completion.

4. **If unavailable or invocation failed, run internal fallback**
   - Generate `investigation-log.md` with a lightweight local-only investigation.
   - Keep it fast and deterministic (no external web research).
   - Internal fallback should:
     - Read `prd.md` and extract top entities/actions as search seeds.
     - Run targeted `Glob`/`Grep` against the local repository for likely implementation files.
     - Record top relevant files and short rationale under `Files Discovered` / `Key Findings`.
     - Add requirement-to-code evidence links under `Requirements Mapping`.
   - Use the canonical sections:
     - `## Search Strategy`
     - `## Files Discovered`
     - `## Key Findings`
     - `## Requirements Mapping`
     - `## Uncertainties`

5. **Never block plan context preparation on investigation context**
   - If log generation still fails, emit a warning and continue.

6. **Prepare plan-context.json via context-manager-for-judges**
   - Launch `@judges:context-manager-for-judges` with `artifact_type=plan`.
   - Verify `$CLOSEDLOOP_WORKDIR/plan-context.json` exists.
   - If missing after invocation, log warning and activate **compatibility mode** for this run:
     - Compatibility mode allows one emergency fallback to raw `plan.json` + `prd.md`.
     - Use compatibility mode only when context generation fails.

7. **Plan-mode source-of-truth policy**
   - Normal mode: `plan-context.json` is primary and required.
   - Compatibility mode: `plan.json` + `prd.md` may be used for this run only.

8. **Build plan-mode `judge-input.json`**
   - Set `evaluation_type` = `plan`.
   - Set `task` to plan quality evaluation objective (16-plan-judge workflow).
   - Set `primary_artifact` to `plan-context.json` in normal mode.
   - In compatibility mode, set primary to `plan.json` and include `prd.md` as supporting.
   - Include `investigation-log.md` as supporting artifact when available.
   - Set `source_of_truth` ordering from primary to secondary artifacts.
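
For the internal fallback in item 4, the canonical section layout of `investigation-log.md` could be rendered as follows. This is a sketch: `render_investigation_log` and the placeholder line for empty sections are hypothetical, and only the five section titles come from the list above.

```python
CANONICAL_SECTIONS = [
    "Search Strategy",
    "Files Discovered",
    "Key Findings",
    "Requirements Mapping",
    "Uncertainties",
]


def render_investigation_log(section_bodies):
    """Render the five canonical sections in order; empty sections get a placeholder line."""
    parts = []
    for title in CANONICAL_SECTIONS:
        # The placeholder text is an illustrative choice, not part of the skill.
        body = section_bodies.get(title, "").strip() or "_No entries recorded._"
        parts.append(f"## {title}\n\n{body}\n")
    return "\n".join(parts)
```

Always emitting every heading keeps the file shape deterministic even when the local search finds little.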

**For code artifacts (--artifact-type code):**
```bash
# Resolve investigation context for code judges (best effort)
if [ ! -f "$CLOSEDLOOP_WORKDIR/investigation-log.md" ]; then
  echo "INFO: investigation-log.md missing. Attempting best-effort generation via @code:pre-explorer..."
  # Launch @code:pre-explorer with WORKDIR=$CLOSEDLOOP_WORKDIR
  # If unavailable/fails, continue with warning (non-blocking for code judges)
fi

# Launch context-manager-for-judges agent to prepare compressed context
# This agent reads code artifacts (git diff, changed-files.json, etc.)
# and produces code-context.json with token-budgeted compression

# investigation-log.md is optional secondary context for code judging
if [ ! -f "$CLOSEDLOOP_WORKDIR/investigation-log.md" ]; then
  echo "WARNING: investigation-log.md unavailable. Continuing code judges with code-context.json only."
fi

# Verify code-context.json exists after context manager completes
if [ ! -f "$CLOSEDLOOP_WORKDIR/code-context.json" ]; then
  echo "ERROR: Context preparation failed - code-context.json not found"
  # Abort with error CaseScore for all judges
  # Generate error report with final_status=3, justification="Context preparation failed"
  exit 1
fi

# Build code-mode judge-input.json
# - evaluation_type: "code"
# - task: code quality evaluation objective (11-code-judge workflow)
# - primary_artifact: code-context.json
# - supporting_artifacts: investigation-log.md (optional), plus any other run artifacts
# - source_of_truth: ["code_context", ...]
```

**For PRD artifacts (--artifact-type prd):**

PRD mode does NOT use context-manager-for-judges. Context preparation is lightweight: verify the PRD document exists, then build judge-input.json directly from it.

```bash
# PRD mode context prep: check prd.md exists
if [ ! -f "$CLOSEDLOOP_WORKDIR/prd.md" ]; then
  echo "WARNING: $CLOSEDLOOP_WORKDIR/prd.md not found. Skipping PRD judges."
  exit 0  # Graceful exit — do not fail parent workflow
fi

# Build prd-mode judge-input.json
# - evaluation_type: "prd"
# - task: PRD quality evaluation objective (prd-auditor + 3 critics)
# - primary_artifact: $CLOSEDLOOP_WORKDIR/prd.md
# - supporting_artifacts: [] (none required)
# - source_of_truth: ["prd"]
```

**PRD context prep notes:**
- Missing `prd.md` results in a WARNING and graceful exit (code 0), not an error
- No context manager is launched; `judge-input.json` is built directly with `primary_artifact` pointing to `$CLOSEDLOOP_WORKDIR/prd.md`
- Performance: emit sub_step=0 (context_prep, skipped=true) perf event immediately, then proceed to sub_step=1 (prd_judges)

**If required files are missing:**
- Plan mode: Exit gracefully with code 0 (do not fail parent workflow)
- Code mode: Exit with error if context preparation fails
- PRD mode: Exit gracefully with code 0 if prd.md is not found

## Artifact Type Configuration

The run-judges skill supports three artifact types with different judge configurations:

### Plan Artifacts (Default)
- **Judges**: 16 total
- **Batches**: 4 sequential batches (max 4 concurrent per batch)
- **Output**: `plan-judges.json`
- **Report ID**: `{RUN_ID}-plan-judges`
- **Validation**: `--category plan` (16 judges expected)

### Code Artifacts (--artifact-type code)
- **Judges**: 11 total (excludes goal-alignment-judge, verbosity-judge, brownfield-accuracy-judge, codebase-grounding-judge, convention-adherence-judge)
- **Batches**: 3 sequential batches (max 4 concurrent per batch)
- **Output**: `code-judges.json`
- **Report ID**: `{RUN_ID}-code-judges`
- **Validation**: `--category code` (11 judges expected)

**Code Judge Batches:**

**Batch 1: Core Principles (4 judges)**
- `judges:dry-judge`
- `judges:ssot-judge`
- `judges:kiss-judge`
- `judges:code-organization-judge`

**Batch 2: Best Practices + SOLID Principles (4 judges)**
- `judges:custom-best-practices-judge`
- `judges:readability-judge`
- `judges:solid-isp-dip-judge`
- `judges:solid-liskov-substitution-judge`

**Batch 3: Technical Quality + Testing (3 judges)**
- `judges:solid-open-closed-judge`
- `judges:technical-accuracy-judge`
- `judges:test-judge`

### PRD Artifacts (--artifact-type prd)
- **Judges**: 4 total
- **Execution**: single parallel batch
- **Output**: `prd-judges.json`
- **Report ID**: `{RUN_ID}-prd-judges`
- **Validation**: `--category prd` (4 judges expected)
- **Canonical input**: `$CLOSEDLOOP_WORKDIR/prd.md`

**PRD Execution:**

**Batch 1: All PRD Judges (sub_step=1)**
- `judges:prd-auditor` — structural completeness audit of the PRD
- `judges:prd-dependency-judge` — evaluates dependency clarity and completeness
- `judges:prd-testability-judge` — evaluates requirement testability
- `judges:prd-scope-judge` — evaluates scope definition and boundary clarity

---

### Step 1: Launch Judge Agents in Parallel

**Performance:** For each batch/phase, run "start of phase" Bash before launching the batch and "end of phase" Bash after the batch completes. Plan: batch_1=sub_step 1, batch_2=sub_step 2, batch_3=sub_step 3, batch_4=sub_step 4. Code: batch_1=sub_step 1, batch_2=sub_step 2, batch_3=sub_step 3. PRD: prd_judges=sub_step 1.

**Constraint:** The Task tool supports a maximum of 4 concurrent agents per batch.

**Action:** Launch judges in sequential batches based on artifact type.
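
The "sequential batches, parallel within a batch" pattern can be sketched with a thread pool. This is an analogy rather than the real mechanism: `launch_judge` is a hypothetical callable (judge name in, CaseScore dict out) standing in for the Task tool call, and `run_batches` is an illustrative name.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 4  # Task tool limit stated in the constraint above


def run_batches(batches, launch_judge):
    """Run batches one after another; judges inside a batch run concurrently."""
    results = []
    for batch in batches:
        if len(batch) > MAX_CONCURRENT:
            raise ValueError(f"batch has {len(batch)} judges, limit is {MAX_CONCURRENT}")
        # Leaving the with-block waits for every judge in this batch before the next starts.
        with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
            # map() preserves the order of `batch`, so results stay in launch order.
            results.extend(pool.map(launch_judge, batch))
    return results
```

In this sketch an exception from `launch_judge` propagates to the caller; a real orchestrator would wrap each call so that a failure becomes an error CaseScore instead.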

<judge_batches>

### Plan Artifact Judge Batches (16 judges, 4 batches)

**Batch 1: Core Principles (DRY/SSOT/KISS + Organization)**

| Agent Type | Evaluates |
|------------|-----------|
| `judges:dry-judge` | Don't Repeat Yourself violations |
| `judges:ssot-judge` | Single Source of Truth violations |
| `judges:kiss-judge` | Keep It Simple violations |
| `judges:code-organization-judge` | File and folder structure organization |

**Batch 2: Best Practices + Response Quality**

| Agent Type | Evaluates |
|------------|-----------|
| `judges:custom-best-practices-judge` | Adherence to custom best practices documents |
| `judges:goal-alignment-judge` | Alignment with stated health goals |
| `judges:readability-judge` | Plan readability, clarity, structure, template adherence |
| `judges:verbosity-judge` | Verbosity calibration to problem complexity |

**Batch 3: SOLID Principles**

| Agent Type | Evaluates |
|------------|-----------|
| `judges:solid-isp-dip-judge` | Interface Segregation & Dependency Inversion Principles |
| `judges:solid-liskov-substitution-judge` | Liskov Substitution Principle adherence |
| `judges:solid-open-closed-judge` | Open/Closed Principle adherence |
| `judges:technical-accuracy-judge` | Technical accuracy (API usage, algorithms) |

**Batch 4: Plan Grounding + Testing**

| Agent Type | Evaluates |
|------------|-----------|
| `judges:test-judge` | Test coverage, assertions, structure, best practices |
| `judges:brownfield-accuracy-judge` | Reuse vs reimplementation, integration-point accuracy, scope accuracy against investigation findings |
| `judges:codebase-grounding-judge` | File-path/module-reference accuracy and existing-code awareness grounded in investigation findings |
| `judges:convention-adherence-judge` | Alignment with established naming, structural, and tooling conventions in the codebase |

### PRD Artifact Judge Batch (4 judges, single parallel batch)

**Batch 1: All PRD Judges (sub_step=1)**

| Agent Type | Evaluates |
|------------|-----------|
| `judges:prd-auditor` | Structural completeness, section coverage, clarity |
| `judges:prd-dependency-judge` | Dependency clarity and completeness |
| `judges:prd-testability-judge` | Requirement testability and measurability |
| `judges:prd-scope-judge` | Scope definition and boundary clarity |

</judge_batches>

<prompt_template>

### Preamble Injection

**Before invoking each judge, prepend the common and artifact-specific preambles:**

1. **Locate preamble files**:
   - `skills/artifact-type-tailored-context/preambles/common_input_preamble.md`
   - `skills/artifact-type-tailored-context/preambles/{artifact_type}_preamble.md`
   - Use Glob tool to find: `**/artifact-type-tailored-context/preambles/*.md`
   - Validate both files exist (fail with error CaseScore if either is missing)

2. **Read preamble content**:
   - Read `common_input_preamble.md`
   - Read `{artifact_type}_preamble.md`
   - Validate combined preamble size is reasonable for judge context (target: < 8000 characters)

3. **Concatenate**:
   - `common_input_preamble + "\n\n---\n\n" + artifact_preamble + "\n\n---\n\n" + judge_prompt`
   - `common_input_preamble.md` is the only runtime source of judge input-loading contract text; judge-specific agent files should not duplicate that contract.

4. **Pass to judge**: Use concatenated prompt as judge's full prompt

**If either preamble file is missing:**
- Generate error CaseScore with `final_status=3`, `justification="Preamble file not found: {path}"`
- Continue with other judges
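
The concatenation in step 3 and the size target in step 2 can be sketched as below. `build_judge_prompt` is a hypothetical helper, and measuring the "combined preamble size" as the sum of the two preamble lengths (separators excluded) is an assumption.

```python
SEPARATOR = "\n\n---\n\n"
PREAMBLE_TARGET_CHARS = 8000  # "target: < 8000 characters" from step 2


def build_judge_prompt(common_preamble, artifact_preamble, judge_prompt):
    """Return (full_prompt, within_target) for one judge."""
    # Assumption: only the two preambles count toward the size target.
    within_target = len(common_preamble) + len(artifact_preamble) < PREAMBLE_TARGET_CHARS
    full_prompt = SEPARATOR.join([common_preamble, artifact_preamble, judge_prompt])
    return full_prompt, within_target
```

Returning the size flag separately lets the caller decide whether an oversized preamble is a warning or an error.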

### Prompt Templates

**For plan artifacts:**
```
WORKDIR=$CLOSEDLOOP_WORKDIR. Read $CLOSEDLOOP_WORKDIR/judge-input.json first.
Evaluate according to `task` and `source_of_truth` ordering.
Treat the envelope's `primary_artifact` as authoritative.
If `fallback_mode.active=true`, use fallback artifacts specified in the envelope.
```

**For code artifacts:**
```
WORKDIR=$CLOSEDLOOP_WORKDIR. Read $CLOSEDLOOP_WORKDIR/judge-input.json first.
Evaluate according to `task` and `source_of_truth` ordering.
Treat the envelope's `primary_artifact` as authoritative.
Apply your {judge_name} criteria to assess code quality.
```

**For PRD artifacts:**
```
WORKDIR=$CLOSEDLOOP_WORKDIR. Read $CLOSEDLOOP_WORKDIR/judge-input.json first.
Evaluate according to `task` and `source_of_truth` ordering.
Treat the envelope's `primary_artifact` ($CLOSEDLOOP_WORKDIR/prd.md) as the authoritative PRD document.
Apply your {judge_name} criteria to assess PRD quality.
```

</prompt_template>

---

### Expected Output Format

<expected_output>
Each judge returns a **CaseScore** JSON object:

```json
{
  "type": "case_score",
  "case_id": "dry-judge",
  "final_status": 1,
  "metrics": [
    {
      "metric_name": "dry_score",
      "threshold": 0.8,
      "score": 0.85,
      "justification": "Plan follows DRY principles..."
    }
  ]
}
```

**Status Code Semantics:**

| Code | Meaning | When to Use |
|------|---------|-------------|
| `1` | Pass | Score meets or exceeds threshold |
| `2` | Fail | Score below threshold |
| `3` | Error | Judge execution failed |
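
The status table can be read as a small decision function. The sketch below assumes that a judge with several metrics passes only when every metric meets its threshold, and that an empty metric list counts as an error; the text defines pass and fail for a single score and does not spell out either case. `final_status` here is a hypothetical helper, not the judges' own logic.

```python
PASS, FAIL, ERROR = 1, 2, 3


def final_status(metrics):
    """Derive a status code from a list of metric dicts with "score" and "threshold" keys."""
    if not metrics:
        return ERROR  # Assumption: no metrics means the judge produced nothing usable.
    # Assumption: every metric must meet or exceed its threshold for a pass.
    return PASS if all(m["score"] >= m["threshold"] for m in metrics) else FAIL
```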

</expected_output>

---

### Error Handling Protocol

<error_handling>

**CRITICAL REQUIREMENT:** If a judge Task call fails, you MUST construct an error CaseScore.

**Error CaseScore Template:**
```json
{
  "type": "case_score",
  "case_id": "{judge-name}",
  "final_status": 3,
  "metrics": [
    {
      "metric_name": "{metric}_score",
      "threshold": 0.8,
      "score": 0.0,
      "justification": "Judge execution failed: {error message}"
    }
  ]
}
```

**Continue-on-failure semantics:**
- Even if ALL judges fail, you MUST aggregate error CaseScores
- Always produce a complete report with 16 CaseScore entries (plan), 11 CaseScore entries (code), or 4 CaseScore entries (prd)
- Never abort the workflow due to judge failures
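
A minimal sketch of the continue-on-failure rule follows. The helper names `error_case_score` and `launch_or_error` are hypothetical, `launch_judge` is a stand-in for the Task tool call, and deriving the metric name from the judge name (`dry-judge` becomes `dry_score`) is an assumption based on the example CaseScore.

```python
def error_case_score(judge_name, error_message, threshold=0.8):
    """Build an error CaseScore following the template above."""
    # Assumption: metric name is the judge name without "-judge", with "_score" appended.
    metric = judge_name.removesuffix("-judge").replace("-", "_")
    return {
        "type": "case_score",
        "case_id": judge_name,
        "final_status": 3,
        "metrics": [
            {
                "metric_name": f"{metric}_score",
                "threshold": threshold,
                "score": 0.0,
                "justification": f"Judge execution failed: {error_message}",
            }
        ],
    }


def launch_or_error(judge_name, launch_judge):
    """Run one judge; any exception is converted into an error CaseScore."""
    try:
        return launch_judge(judge_name)
    except Exception as error:  # Broad on purpose: no judge failure may abort the run.
        return error_case_score(judge_name, str(error))
```

Because every call returns a CaseScore, the aggregation step always receives one entry per judge. `str.removesuffix` requires Python 3.9 or later.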

</error_handling>

---

### Step 2: Aggregate Results into EvaluationReport

**Performance:** Run "start of phase" with sub_step 5 (plan), 4 (code), or 2 (prd), sub_step_name=aggregate. Emit 'end of phase' after the aggregation step regardless of file write outcome.

**Task:** Collect all CaseScore outputs and structure them into an `EvaluationReport`.

<output_structure>

**Output file logic:**
```python
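import os
from pathlib import Path

# artifact_type and RUN_ID are provided by the orchestration context.
if artifact_type == 'code':
    report_filename = 'code-judges.json'
    report_id = f'{RUN_ID}-code-judges'
elif artifact_type == 'prd':
    report_filename = 'prd-judges.json'
    report_id = f'{RUN_ID}-prd-judges'
else:
    report_filename = 'plan-judges.json'
    report_id = f'{RUN_ID}-plan-judges'
# Reports are always written into the resolved working directory.
output_path = Path(os.environ['CLOSEDLOOP_WORKDIR']) / report_filename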
```

**Plan artifact report structure (plan-judges.json):**
```json
{
  "report_id": "{RUN_ID}-plan-judges",
  "timestamp": "2024-02-03T15:45:30Z",
  "stats": [
    { /* CaseScore from dry-judge */ },
    { /* CaseScore from ssot-judge */ },
    { /* CaseScore from kiss-judge */ },
    { /* CaseScore from code-organization-judge */ },
    { /* CaseScore from custom-best-practices-judge */ },
    { /* CaseScore from goal-alignment-judge */ },
    { /* CaseScore from readability-judge */ },
    { /* CaseScore from verbosity-judge */ },
    { /* CaseScore from solid-isp-dip-judge */ },
    { /* CaseScore from solid-liskov-substitution-judge */ },
    { /* CaseScore from solid-open-closed-judge */ },
    { /* CaseScore from technical-accuracy-judge */ },
    { /* CaseScore from test-judge */ },
    { /* CaseScore from brownfield-accuracy-judge */ },
    { /* CaseScore from codebase-grounding-judge */ },
    { /* CaseScore from convention-adherence-judge */ }
  ]
}
```

**Code artifact report structure (code-judges.json):**
```json
{
  "report_id": "{RUN_ID}-code-judges",
  "timestamp": "2024-02-03T15:45:30Z",
  "stats": [
    { /* CaseScore from dry-judge */ },
    { /* CaseScore from ssot-judge */ },
    { /* CaseScore from kiss-judge */ },
    { /* CaseScore from code-organization-judge */ },
    { /* CaseScore from custom-best-practices-judge */ },
    { /* CaseScore from readability-judge */ },
    { /* CaseScore from solid-isp-dip-judge */ },
    { /* CaseScore from solid-liskov-substitution-judge */ },
    { /* CaseScore from solid-open-closed-judge */ },
    { /* CaseScore from technical-accuracy-judge */ },
    { /* CaseScore from test-judge */ }
  ]
}
```

**PRD artifact report structure (prd-judges.json):**
```json
{
  "report_id": "{RUN_ID}-prd-judges",
  "timestamp": "2024-02-03T15:45:30Z",
  "stats": [
    { /* CaseScore from prd-auditor */ },
    { /* CaseScore from prd-dependency-judge */ },
    { /* CaseScore from prd-testability-judge */ },
    { /* CaseScore from prd-scope-judge */ }
  ]
}
```

**Field requirements:**

| Field | Format | How to Derive |
|-------|--------|---------------|
| `report_id` | `{RUN_ID}-plan-judges`, `{RUN_ID}-code-judges`, or `{RUN_ID}-prd-judges` | Extract RUN_ID from `$CLOSEDLOOP_WORKDIR` directory name, append suffix based on artifact type |
| `timestamp` | ISO 8601 | Generate with `date -u +%Y-%m-%dT%H:%M:%SZ` |
| `stats` | Array[CaseScore] | 16 CaseScore objects for plan, 11 for code, 4 for prd (one per judge) |
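
Putting the three fields together, report assembly might look like the following sketch. `build_report` and `REPORT_SUFFIX` are hypothetical names, and the run ID used in the usage note is made up for illustration.

```python
from datetime import datetime, timezone

REPORT_SUFFIX = {"plan": "plan-judges", "code": "code-judges", "prd": "prd-judges"}


def build_report(run_id, artifact_type, case_scores, now=None):
    """Assemble an EvaluationReport dict; `now` must be a UTC datetime when supplied."""
    now = now or datetime.now(timezone.utc)
    return {
        "report_id": f"{run_id}-{REPORT_SUFFIX[artifact_type]}",
        # Same layout as `date -u +%Y-%m-%dT%H:%M:%SZ`.
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "stats": list(case_scores),
    }
```

For instance, `build_report("run-42", "code", scores)` produces a `report_id` of `run-42-code-judges`; the caller then serializes the dict with `json.dump` to the output path.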

</output_structure>

---

### Step 3: Validate Output (MANDATORY)

**Performance:** Run "start of phase" with sub_step 6 (plan), 5 (code), or 3 (prd), sub_step_name=validate. Emit 'end of phase' after each validation attempt regardless of exit code, then apply failure recovery logic.

**CRITICAL:** You MUST run the validation script after writing the judge report. Do not consider the task complete until validation passes.

<validation_workflow>

**Step 3.1: Locate the Validation Script**

The script is in this skill's `scripts/` directory:

```bash
SCRIPT_PATH="scripts/validate_judge_report.py"
```

**Step 3.2: Ensure uv is Installed**

```bash
if ! command -v uv &> /dev/null; then
  # Install uv — alternatives: brew install uv, pip install uv
  curl -LsSf https://astral.sh/uv/install.sh | sh
fi
```

**Step 3.3: Run Validation**

```bash
# CRITICAL: Run from script's directory so uv can find inline dependencies
cd "$(dirname "$SCRIPT_PATH")"

# Determine category based on artifact type
CATEGORY="plan"  # default
if [ "$ARTIFACT_TYPE" = "code" ]; then
  CATEGORY="code"
elif [ "$ARTIFACT_TYPE" = "prd" ]; then
  CATEGORY="prd"
fi

# Run validation with appropriate category.
# The working directory is now the script's directory, so pass the file name only.
uv run "$(basename "$SCRIPT_PATH")" --workdir "$CLOSEDLOOP_WORKDIR" --category "$CATEGORY"
```

**Argument requirements:**
- `--workdir` must be the **absolute path** to `$CLOSEDLOOP_WORKDIR`, the directory that contains `plan-judges.json`, `code-judges.json`, or `prd-judges.json`
- `--category` must be `plan` (16 judges), `code` (11 judges), or `prd` (4 judges)

</validation_workflow>
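
For orchestrators that drive validation from Python instead of a shell, the same invocation can be assembled as an argument list. This is a sketch under stated assumptions: `build_validation_command` is a hypothetical helper, and it only mirrors the `uv run` call and the two flags documented above.

```python
from pathlib import Path

VALID_CATEGORIES = ("plan", "code", "prd")


def build_validation_command(script_path: str, workdir: str, artifact_type: str = "plan"):
    """Return (cwd, argv) for running the validation script through uv."""
    if artifact_type not in VALID_CATEGORIES:
        raise ValueError(f"Invalid --artifact-type value: {artifact_type!r}")
    script = Path(script_path)
    # The command runs from the script's directory, so --workdir must be absolute.
    abs_workdir = Path(workdir).resolve()
    argv = [
        "uv", "run", script.name,
        "--workdir", str(abs_workdir),
        "--category", artifact_type,
    ]
    return str(script.parent), argv
```

The returned pair can be passed to `subprocess.run(argv, cwd=cwd)`; the resulting `returncode` is the exit code that decides whether validation passed.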

---

### Validation Checks

<validation_checks>

The script validates using strict Pydantic models:

| Check | Requirement |
|-------|-------------|
| **JSON syntax** | Valid JSON format |
| **Required fields** | report_id, timestamp, stats array |
| **Judge coverage** | All expected judges present (16 for plan, 11 for code, 4 for prd) |
| **Status values** | final_status ∈ {1, 2, 3} |
| **Metric completeness** | Each judge has ≥1 metric |
| **Report ID format** | Ends with '-plan-judges' (plan), '-code-judges' (code), or '-prd-judges' (prd) |

**Expected judge case_ids for plan artifacts (16 total):**
```
brownfield-accuracy-judge
code-organization-judge
codebase-grounding-judge
convention-adherence-judge
custom-best-practices-judge
dry-judge
goal-alignment-judge
kiss-judge
readability-judge
solid-isp-dip-judge
solid-liskov-substitution-judge
solid-open-closed-judge
ssot-judge
technical-accuracy-judge
test-judge
verbosity-judge
```

**Expected judge case_ids for code artifacts (11 total):**
```
code-organization-judge
custom-best-practices-judge
dry-judge
kiss-judge
readability-judge
solid-isp-dip-judge
solid-liskov-substitution-judge
solid-open-closed-judge
ssot-judge
technical-accuracy-judge
test-judge
```

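**Note:** Code artifacts exclude five plan-only judges: brownfield-accuracy-judge, codebase-grounding-judge, convention-adherence-judge, goal-alignment-judge, verbosity-judge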

**Expected judge case_ids for PRD artifacts (4 total):**
```
prd-auditor
prd-dependency-judge
prd-testability-judge
prd-scope-judge
```

**Note:** All 4 PRD judges run in a single parallel batch.

</validation_checks>
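
The checks in the table can be approximated with plain dictionaries for a quick pre-check before running the real script. The sketch below is not the validation script itself: `check_report` is a hypothetical function, the error strings only loosely echo the troubleshooting messages, and the PRD judge set is used as the example.

```python
PRD_JUDGES = {
    "prd-auditor",
    "prd-dependency-judge",
    "prd-testability-judge",
    "prd-scope-judge",
}


def check_report(report: dict, expected_judges: set, id_suffix: str) -> list:
    """Return a list of human-readable problems; an empty list means no issue found."""
    errors = [
        f"missing required field: {field}"
        for field in ("report_id", "timestamp", "stats")
        if field not in report
    ]
    if errors:
        return errors  # later checks need all three fields

    if not report["report_id"].endswith(id_suffix):
        errors.append(f"report_id should end with '{id_suffix}'")

    seen = {case.get("case_id") for case in report["stats"]}
    missing = expected_judges - seen
    if missing:
        errors.append("Missing expected judges: " + ", ".join(sorted(missing)))

    for case in report["stats"]:
        status = case.get("final_status")
        # type() check rejects booleans, in the spirit of strict type matching.
        if type(status) is not int or status not in (1, 2, 3):
            errors.append("final_status must be 1, 2, or 3")
        if not case.get("metrics"):
            errors.append(f"Judge {case.get('case_id')} has no metrics")
    return errors
```

Passing the plan or code judge list and the matching suffix applies the same checks to the other artifact types.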

---

### Validation Exit Codes

| Code | Meaning | Action |
|------|---------|--------|
| `0` | Valid | Task complete ✓ |
| `1` | Invalid | Read error, fix report JSON, re-validate |

---

### If Validation Fails

<failure_recovery>

**Follow this sequence:**

1. **Read error message** - Understand what failed
2. **Fix report JSON** - Correct the specific validation error
3. **Re-run validation** - Repeat until exit code 0
4. **Never skip validation** - Do not mark task complete until validation passes

</failure_recovery>
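
The recovery sequence amounts to a validate, fix, re-validate loop. A minimal sketch follows; `validate_until_pass`, `run_validation`, and `fix_report` are hypothetical names, and the `max_attempts` cap is an assumption added here to avoid looping forever (the skill itself only says to repeat until exit code 0).

```python
from typing import Callable, Tuple


def validate_until_pass(
    run_validation: Callable[[], Tuple[int, str]],
    fix_report: Callable[[str], None],
    max_attempts: int = 5,
) -> int:
    """Re-run validation until it exits 0; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        exit_code, message = run_validation()
        if exit_code == 0:
            return attempt
        # Read the error message and correct the specific problem it names.
        fix_report(message)
    raise RuntimeError(f"Validation still failing after {max_attempts} attempts")
```

Here `run_validation` would wrap the `uv run` call and return its exit code and error output, while `fix_report` edits the report JSON in response.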

---

## Reference: Pydantic Models

<pydantic_schema>

The validation script uses these strict Pydantic models:

```python
from typing import List, Optional

from pydantic import BaseModel


class MetricStatistics(BaseModel):
    """A single metric evaluation result."""
    metric_name: str
    threshold: Optional[float] = None
    score: float
    justification: str

class CaseScore(BaseModel):
    """Score for a single judge evaluation."""
    type: Optional[str] = "case_score"
    case_id: str
    final_status: int  # 1=pass, 2=fail, 3=error
    metrics: List[MetricStatistics]

class EvaluationReport(BaseModel):
    """Top-level report containing all judge evaluations."""
    report_id: str
    timestamp: str
    stats: List[CaseScore]
```

**Model constraints:**
- `ConfigDict(strict=True)` enforces exact type matching
- `final_status` validator rejects values outside {1, 2, 3}

</pydantic_schema>

---

## Success Checklist

<completion_criteria>

Before marking this task complete, verify:

**For all artifact types:**
- [ ] **Agents snapshot** - `agents-snapshot/manifest.json` exists in `$CLOSEDLOOP_WORKDIR` (created if missing, skipped if present)

**For plan artifacts (default):**
- [ ] **Input validation** - prd.md and plan.json exist (or graceful skip)
- [ ] **Context preparation** - context-manager-for-judges launched with `artifact_type=plan`
- [ ] **Plan context validation** - `plan-context.json` exists, or compatibility mode explicitly activated
- [ ] **Judge input contract** - `judge-input.json` exists with required fields
- [ ] **Investigation context resolution** - `investigation-log.md` reused, generated via pre-explorer, or best-effort generated internally
- [ ] **Parallel execution** - All 16 judges launched in 4 batches (max 4 per batch)
- [ ] **Result aggregation** - Valid EvaluationReport with 16 CaseScore entries
- [ ] **File output** - `plan-judges.json` written to `$CLOSEDLOOP_WORKDIR`
- [ ] **Validation passed** - Script exits with code 0 using `--category plan`

**For code artifacts (--artifact-type code):**
- [ ] **Context preparation** - context-manager-for-judges agent launched successfully
- [ ] **Context validation** - code-context.json exists at `$CLOSEDLOOP_WORKDIR`
- [ ] **Judge input contract** - `judge-input.json` exists with required fields
- [ ] **Investigation context resolution** - `investigation-log.md` reused or generated best-effort; missing file does not block code judging
- [ ] **Preamble injection** - common_input_preamble.md + code_preamble.md prepended to all judge prompts
- [ ] **Parallel execution** - All 11 judges launched in 3 batches (max 4 per batch)
- [ ] **Result aggregation** - Valid EvaluationReport with 11 CaseScore entries
- [ ] **File output** - `code-judges.json` written to `$CLOSEDLOOP_WORKDIR`
- [ ] **Report ID format** - report_id ends with '-code-judges'
- [ ] **Validation passed** - Script exits with code 0 using `--category code`

**For PRD artifacts (--artifact-type prd):**
- [ ] **prd.md existence check** - `$CLOSEDLOOP_WORKDIR/prd.md` found, or graceful exit with WARNING (code 0)
- [ ] **No context manager** - context-manager-for-judges is NOT launched for prd mode
- [ ] **Judge input contract** - `judge-input.json` written with `evaluation_type="prd"` and `primary_artifact=$CLOSEDLOOP_WORKDIR/prd.md`
- [ ] **Parallel execution** - All 4 PRD judges launched in a single parallel batch (sub_step=1)
- [ ] **Result aggregation** - Valid EvaluationReport with 4 CaseScore entries (sub_step=2)
- [ ] **File output** - `prd-judges.json` written to `$CLOSEDLOOP_WORKDIR`
- [ ] **Report ID format** - report_id ends with '-prd-judges'
- [ ] **Validation passed** - Script exits with code 0 using `--category prd` (sub_step=3)

</completion_criteria>

---

## Troubleshooting Guide

<troubleshooting>

| Error Message | Root Cause | Solution |
|---------------|------------|----------|
| "Report file does not exist" | File not written to correct location | Verify `$CLOSEDLOOP_WORKDIR` is set; check write path matches artifact type (plan-judges.json, code-judges.json, or prd-judges.json) |
| "Invalid JSON" | Syntax error in output file | Run `python3 -m json.tool` on the report for the artifact type being evaluated, e.g. `python3 -m json.tool "$CLOSEDLOOP_WORKDIR/plan-judges.json"`, to locate the syntax error |
| "Missing expected judges" | Incomplete batch execution | Verify all batches launched (4 for plan, 3 for code, 1 for prd); check error CaseScores for failures; plan expects 16 judges, code expects 11, prd expects 4 |
| "final_status must be 1, 2, or 3" | Invalid status code | Use only: 1 (pass), 2 (fail), 3 (error) |
| "report_id should end with '-plan-judges'" | Incorrect ID format for plan | Use pattern: `{RUN_ID}-plan-judges` for plan artifacts |
| "report_id should end with '-code-judges'" | Incorrect ID format for code | Use pattern: `{RUN_ID}-code-judges` for code artifacts |
| "Judge {name} has no metrics" | Empty metrics array | Each CaseScore must have ≥1 MetricStatistics entry |
| "Context preparation failed" | context-manager-for-judges failed | Check context-manager agent output; verify artifact files exist |
| "judge-input.json missing" | Orchestrator did not generate envelope | Build `$CLOSEDLOOP_WORKDIR/judge-input.json` before launching judges |
| "judge-input schema invalid" | Missing required envelope fields | Ensure required fields: `evaluation_type`, `task`, `primary_artifact`, `supporting_artifacts`, `source_of_truth`, `fallback_mode`, `metadata` |
| "plan-context.json not found" | plan context manager did not produce output | Run `@judges:context-manager-for-judges` with `artifact_type=plan`; if still missing, activate one-run compatibility fallback to `plan.json` + `prd.md` |
| "Preamble file not found" | Missing common or artifact preamble .md file | Verify both `skills/artifact-type-tailored-context/preambles/common_input_preamble.md` and `skills/artifact-type-tailored-context/preambles/{artifact_type}_preamble.md` exist |
| "pre-explorer unavailable" | `@code:pre-explorer` not installed/resolvable | Log warning and use internal fallback investigation to create `investigation-log.md` |
| "investigation-log.md missing after fallback" | Both pre-explorer and internal fallback failed | Log warning and continue; do not block context preparation |
| "investigation-log.md missing in code mode" | pre-explorer unavailable or generation failed during code preflight | Log warning and continue with `code-context.json` only (non-blocking) |
| "Invalid --artifact-type value" | Unsupported artifact type | Use only 'plan', 'code', or 'prd' |
| "prd.md not found" | PRD document missing from workdir | Emit WARNING and exit gracefully (code 0); do not fail the parent workflow |
| "report_id should end with '-prd-judges'" | Incorrect ID format for prd | Use pattern: `{RUN_ID}-prd-judges` for PRD artifacts |

</troubleshooting>

---

## Error Handling Requirements

### Invalid Artifact Type

If `--artifact-type` value is not 'plan', 'code', or 'prd':
- Fail immediately with clear error message
- Do not attempt judge execution
- Exit with non-zero status
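
If the parameter were parsed by a Python entry point, `argparse` would give this fail-fast behavior directly. The sketch below is illustrative only: `parse_artifact_type` and the `run-judges` program name are assumptions, not part of the skill.

```python
import argparse


def parse_artifact_type(argv: list) -> str:
    """Parse --artifact-type, defaulting to 'plan' and rejecting unknown values."""
    parser = argparse.ArgumentParser(prog="run-judges")
    parser.add_argument(
        "--artifact-type",
        choices=["plan", "code", "prd"],
        default="plan",
        help="Artifact category to evaluate",
    )
    # On an invalid choice argparse prints an error and exits with status 2,
    # so no judge is launched.
    return parser.parse_args(argv).artifact_type
```

Because the check happens during argument parsing, no judge execution can start with an unsupported type.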

### Context Manager Timeout (Code Mode)

If context-manager-for-judges agent exceeds 5 minutes:
- Abort judge execution
- Generate error CaseScores for all 11 judges
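- Each error CaseScore: `final_status=3` and at least one metric entry whose `justification` is "Context preparation timeout" (the field lives on the metric, and validation requires ≥1 metric per judge)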
- Write complete report with all error CaseScores

### Context Manager Timeout (Plan Mode)

If context-manager-for-judges agent exceeds 5 minutes in plan mode:
- Attempt one emergency compatibility fallback to raw `plan.json` + `prd.md`
- If fallback files are unavailable, abort plan judge execution and emit clear error

### Individual Judge Failures

If a single judge Task call fails during execution:
- **Do not abort** the entire workflow
- Generate error CaseScore for that judge only
- Continue with remaining judges in batch and subsequent batches
- Include error CaseScore in final aggregated report
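
The isolation rule above can be sketched with a thread pool: each batch runs in parallel, and a failing judge is replaced by an error CaseScore instead of stopping the run. Everything here is an assumption made for illustration: `run_judge` stands in for the real Task call, batches are formed by simple sequential chunking, and the `metric_name` and `score` used in the error entry are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def error_case_score(case_id: str, reason: str) -> dict:
    """Build an error CaseScore (final_status=3) with one explanatory metric."""
    return {
        "type": "case_score",
        "case_id": case_id,
        "final_status": 3,
        "metrics": [
            # metric_name and score are placeholder values, not from the skill.
            {"metric_name": "execution_error", "threshold": None,
             "score": 0.0, "justification": reason}
        ],
    }


def run_in_batches(judges: List[str], run_judge: Callable[[str], dict],
                   batch_size: int = 4) -> List[dict]:
    """Run judges in parallel batches; one failure never aborts the others."""
    stats = []
    for start in range(0, len(judges), batch_size):
        batch = judges[start:start + batch_size]
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            futures = [(judge, pool.submit(run_judge, judge)) for judge in batch]
            for judge, future in futures:
                try:
                    stats.append(future.result())
                except Exception as exc:
                    # Record the failure for this judge only and keep going.
                    stats.append(error_case_score(judge, str(exc)))
    return stats
```

With a batch size of 4, sequential chunking yields 4 batches for 16 judges, 3 for 11, and 1 for 4, which matches the batch counts stated for plan, code, and PRD mode.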

### Plan Mode Execution Flow

When `--artifact-type` is not specified or equals 'plan':
- Execute standard 16-judge plan logic
- Launch 4 batches with existing judge assignments
- Write to `plan-judges.json` (not `code-judges.json`)
- Launch context-manager-for-judges for plan context preparation
- Use `plan-context.json` as primary input; use one-run compatibility fallback only if context preparation fails
- Build and pass `judge-input.json` envelope to judges
- Prepend preambles to judge prompts
- Use default validation with `--category plan`

This is the standard plan mode flow; orchestrators must support context-manager launch, judge-input.json construction, and preamble injection. The compatibility fallback (raw `plan.json` + `prd.md`) activates only when context preparation fails (e.g., context-manager timeout), not for orchestrators that have not been updated.

### PRD Mode Execution Flow

When `--artifact-type prd` is specified:
- Check `$CLOSEDLOOP_WORKDIR/prd.md` exists; emit WARNING and exit gracefully (code 0) if missing
- Do NOT launch context-manager-for-judges
- Build `judge-input.json` with `evaluation_type="prd"` and `primary_artifact=$CLOSEDLOOP_WORKDIR/prd.md`
- Launch all 4 PRD judges in a single parallel batch (sub_step=1)
- Aggregate all 4 CaseScores (sub_step=2) and write to `prd-judges.json`
- Validate with `--category prd` (sub_step=3)
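
The first three steps of the PRD flow can be sketched as a small preflight function. This is a hedged illustration: `prepare_prd_run` is a hypothetical name, and apart from `evaluation_type` and `primary_artifact` every envelope value is a placeholder whose real type and content are defined elsewhere in the skill.

```python
import json
from pathlib import Path
from typing import Optional


def prepare_prd_run(workdir: str) -> Optional[dict]:
    """Write judge-input.json for PRD mode, or return None if prd.md is missing."""
    root = Path(workdir)
    prd_path = root / "prd.md"
    if not prd_path.is_file():
        # Graceful skip: the caller should exit with code 0, not fail the workflow.
        print(f"WARNING: {prd_path} not found; skipping PRD judges")
        return None

    envelope = {
        "evaluation_type": "prd",
        "primary_artifact": str(prd_path),
        # Placeholder values below; only the field names come from the skill.
        "task": "",
        "supporting_artifacts": [],
        "source_of_truth": [],
        "fallback_mode": None,
        "metadata": {},
    }
    (root / "judge-input.json").write_text(
        json.dumps(envelope, indent=2), encoding="utf-8"
    )
    return envelope
```

A `None` result signals the graceful exit; any other result means the four PRD judges can be launched with the written envelope.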

---