toolkit-evolution
$ npx mdskill add notque/vexjoy-agent/toolkit-evolution
Evolve the toolkit by diagnosing gaps, building fixes, and testing improvements.
- Identifies missing capabilities and diagnoses root causes from system evidence.
- Uses Read, Write, Edit, Bash, Glob, Grep, Agent, and Skill tools.
- Proposes solutions via multi-persona critique before building isolated branches.
- Delivers validated fixes through A/B testing and automated pull requests.
---
name: toolkit-evolution
description: "Closed-loop toolkit self-improvement: discover gaps, diagnose, propose, critique, build, test, evolve."
user-invocable: true
argument-hint: "<optional: focus area like 'routing' or 'hooks'>"
command: evolve
context: fork
allowed-tools:
- Read
- Write
- Edit
- Bash
- Glob
- Grep
- Agent
- Skill
routing:
triggers:
- "evolve toolkit"
- "improve the system"
- "self-improve"
- "toolkit evolution"
- "what should we improve"
- "find improvement opportunities"
- "discover skill gaps"
- "what skills are missing"
- "systematic improvement"
pairs_with:
- multi-persona-critique
- skill-eval
complexity: Complex
category: meta-tooling
---
# Toolkit Evolution
Schedulable (nightly) or manually-invoked 7-phase pipeline for continuous toolkit self-improvement. Discovers gaps, diagnoses problems from evidence, proposes solutions, critiques via multi-persona review, builds winners on isolated branches, A/B tests, and promotes via PR.
Nightly sibling of `auto-dream`: at 2:07 AM `auto-dream` consolidates memories; at 3:07 AM this skill diagnoses and builds. They feed each other: dream's graduated learnings inform evolution's diagnosis; evolution's results become dream's next input.
Invoke: `/evolve`, `/evolve routing`, `/evolve hooks`, `/evolve --discover`. Cron setup in `references/evolve-preferred-patterns.md` § Scheduling.
## Reference Loading Table
| Signal | Load These Files | Why |
|---|---|---|
| Phase 0 DISCOVER or Phase 1 DIAGNOSE commands | `diagnose-scripts.md` | Frequency check, briefing data, and DIAGNOSE Steps 1-4c commands. |
| Writing or reading the evolution report | `evolution-report-template.md` | Template for the dated evolution report. |
| Phase 0 agent table, Phase 3 inline critique fallback, failure modes, cron scheduling | `evolve-preferred-patterns.md` | Agent table and proposal format, critique fallback prompts, error handling, cost estimate, scheduling. |
| Phase 3 early exit, Phase 4 BUILD, Phase 5 VALIDATE, Phase 6 EVOLVE | `evolve-scripts.md` | Learning DB records, build dispatch table, skill-eval command, and PR, merge, and cleanup commands. |
## Instructions
### Phase 0: DISCOVER -- Find what's missing
**Goal**: Identify skills, agents, or capability categories the toolkit should have but doesn't. While later phases improve existing components, this phase finds entirely new capabilities the toolkit is missing.
**Frequency**: Monthly, not every run. The DISCOVER phase only executes if:
- `--discover` flag is passed explicitly, OR
- It has been 30+ days since the last discovery run
Check the last discovery run date using the frequency check command from `references/diagnose-scripts.md` § Discovery Frequency Check.
If neither condition is met, skip directly to Phase 1.
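The gating rule above can be sketched as a small predicate. This is an illustration only, not the command in `references/diagnose-scripts.md`: the function name, its parameters, and the choice to treat a missing last-run date as overdue are all assumptions.

```python
from datetime import date
from typing import Optional

DISCOVERY_INTERVAL_DAYS = 30  # "30+ days since the last discovery run"


def should_run_discover(discover_flag: bool, last_run: Optional[date], today: date) -> bool:
    """Return True when Phase 0 should execute on this run (hypothetical helper)."""
    if discover_flag:
        return True  # --discover was passed explicitly
    if last_run is None:
        return True  # assumption: no recorded discovery run counts as overdue
    return (today - last_run).days >= DISCOVERY_INTERVAL_DAYS


print(should_run_discover(False, date(2025, 1, 1), date(2025, 1, 20)))  # False: only 19 days
print(should_run_discover(False, date(2025, 1, 1), date(2025, 1, 31)))  # True: exactly 30 days
```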
**Step 1: Gather briefing data**
Collect current toolkit state using the briefing data commands from `references/diagnose-scripts.md` § DISCOVER Step 1. Brief all 5 perspective agents with the same baseline.
**Step 2: Dispatch 5 perspective agents in parallel**
See `references/evolve-preferred-patterns.md` § Phase 0 DISCOVER for the full agent table and proposal format. Dispatch all 5 simultaneously.
**Step 3: Deduplicate and filter** -- remove duplicates of existing skills (check `skills/INDEX.json`), remove proposals with no evidence (require at least one concrete data point), group similar proposals and note convergent evidence.
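A minimal sketch of the filter in Step 3, under stated assumptions: the proposal dict shape (`name`, `evidence`) is hypothetical (the real proposal format lives in `references/evolve-preferred-patterns.md`), and "similar" is approximated here by an exact name match, which a real run would likely replace with a fuzzier comparison.

```python
from collections import defaultdict


def filter_proposals(proposals, existing_skills):
    """Drop duplicates of existing skills and unevidenced proposals, then group by name."""
    grouped = defaultdict(list)
    for proposal in proposals:
        if proposal["name"] in existing_skills:
            continue  # already listed in skills/INDEX.json
        if not proposal.get("evidence"):
            continue  # require at least one concrete data point
        grouped[proposal["name"]].append(proposal)

    survivors = []
    for name, group in grouped.items():
        survivors.append({
            "name": name,
            "evidence": [item for p in group for item in p["evidence"]],
            "convergent": len(group) > 1,  # more than one agent proposed it
        })
    return survivors
```

Here `existing_skills` stands in for the set of skill names read from `skills/INDEX.json`.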
**Step 4: Feed into DIAGNOSE** -- append surviving proposals to the Phase 1 opportunity list with source tagged `[DISCOVER]`.
**Step 5: Save discovery report** to `evolution-reports/discovery-{YYYY-MM-DD}.md` (run `mkdir -p evolution-reports` first). Include briefing data, all proposals, filtering rationale, forwarded proposals, and date stamp.
**Gate**: Discovery report saved. Proposals forwarded to Phase 1. Proceed to DIAGNOSE.
---
### Phase 1: DIAGNOSE -- Find improvement opportunities
**Goal**: Identify 5-10 evidence-backed improvement opportunities from multiple data sources.
**Step 1: Query the learning database for recent failures and routing mismatches**
Run the 4 search queries from `references/diagnose-scripts.md` § DIAGNOSE Step 1.
Look for: routing decision patterns, recurring routing failures and mismatches, skills that consistently underperform, error patterns without automated fixes.
**Step 2: Scan recent git history for patterns**
Run the git history commands from `references/diagnose-scripts.md` § DIAGNOSE Step 2.
**Step 3: Check auto-dream reports for accumulated insights**
Run the dream report check from `references/diagnose-scripts.md` § DIAGNOSE Step 3, then read the most recent dream-analysis file.
**Step 3b: Cross-validate dream insights against current state**
Before treating any dream insight as a proposal signal, verify it still reflects the current repo. Use the cross-validation commands from `references/diagnose-scripts.md` § DIAGNOSE Step 3b.
Mark an insight as STALE if: (a) it names a file that no longer exists, OR (b) it claims recent activity but `git log` shows nothing in the past 7 days.
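The STALE rule can be expressed as a pure function once the file check and the `git log` lookup have been done. This is a sketch, not the commands from `references/diagnose-scripts.md`; the function and parameter names are assumptions, as is counting a commit exactly 7 days old as still recent.

```python
from datetime import date, timedelta
from typing import Optional

RECENCY_WINDOW_DAYS = 7


def is_stale(file_exists: bool, claims_recent_activity: bool,
             last_commit: Optional[date], today: date) -> bool:
    """Apply conditions (a) and (b) to one dream insight (hypothetical helper)."""
    if not file_exists:
        return True  # (a) the insight names a file that no longer exists
    if claims_recent_activity:
        cutoff = today - timedelta(days=RECENCY_WINDOW_DAYS)
        # (b) claims recent activity, but no commit falls inside the window
        return last_commit is None or last_commit < cutoff
    return False
```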
**Step 4: Check routing-table drift**
Skills present in `skills/INDEX.json` but absent from the routing manifest represent a documentation gap. Run the routing-drift check from `references/diagnose-scripts.md` § DIAGNOSE Step 4.
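Conceptually, drift is a set difference. The sketch below assumes the skill names have already been extracted from both sources; how the routing manifest is parsed is not specified here, so both inputs are plain name lists and the function name is hypothetical.

```python
def routing_drift(indexed_skills, routed_skills):
    """Return skills that are indexed but missing from the routing manifest, sorted."""
    return sorted(set(indexed_skills) - set(routed_skills))


drift = routing_drift(
    ["toolkit-evolution", "auto-dream", "skill-eval"],  # names from skills/INDEX.json
    ["auto-dream"],                                     # names found in the manifest
)
print(drift)  # ['skill-eval', 'toolkit-evolution']
```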
**Step 4b: Check for orphaned ADR session files**
Run the orphaned session check from `references/diagnose-scripts.md` § DIAGNOSE Step 4b. Flag any found -- do not remove automatically.
**Step 4c: Scan for registered stub hooks**
Run the stub hook audit from `references/diagnose-scripts.md` § DIAGNOSE Step 4c. Flag any stub hook as a cleanup opportunity.
**Step 5: Narrow by focus area (if provided)**
If the user specified a focus area (e.g., "routing", "hooks", "agents"), filter all findings to that domain.
**Step 6: Compile opportunity list**
Output a numbered list of 5-10 improvement opportunities. Each entry must include:
- **What**: One-sentence description of the problem or gap
- **Evidence**: Which data source surfaced it (learning DB entry, git churn, dream report)
- **Impact**: Estimated user impact (High/Medium/Low)
**Gate**: At least 3 evidence-backed opportunities identified. If fewer than 3, expand the time window or broaden the data sources. Do not proceed with speculative opportunities that lack evidence.
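One way to model an opportunity entry and the Phase 1 gate is shown below. The class and function names are hypothetical, and treating an empty evidence string as "speculative" is an assumption about how missing evidence would be recorded.

```python
from dataclasses import dataclass

MIN_OPPORTUNITIES = 3  # gate threshold from Phase 1


@dataclass
class Opportunity:
    what: str      # one-sentence description of the problem or gap
    evidence: str  # data source that surfaced it; empty means speculative
    impact: str    # "High", "Medium", or "Low"


def diagnose_gate_passes(opportunities) -> bool:
    """True when at least 3 opportunities carry evidence."""
    backed = [o for o in opportunities if o.evidence.strip()]
    return len(backed) >= MIN_OPPORTUNITIES
```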
---
### Phase 2: PROPOSE -- Generate concrete solutions
**Goal**: Transform opportunities into actionable proposals with clear scope.
**Step 1: Generate proposals**
For each opportunity from Phase 1, propose 1-2 concrete solutions. Each proposal must be actionable:
- "Add failure mode X to agent Y's prompt" (not "improve agent Y")
- "Create a reference file for Z in skill W" (not "enhance skill W")
- "Modify Phase 3 of skill V to include check for Q" (not "make skill V better")
**Step 2: Estimate effort**
| Effort | Definition |
|--------|-----------|
| Small | Single file edit, <30 lines changed |
| Medium | 2-5 files, new reference or script, <200 lines |
| Large | New skill or agent, multiple components, >200 lines |
**Step 3: Check for duplicates**
```bash
cat skills/INDEX.json | python3 -c "import sys,json; idx=json.load(sys.stdin); [print(k,'-',v.get('description','')) for k,v in idx.get('skills',{}).items()]" 2>/dev/null || echo "INDEX.json parse failed -- check manually"
```
Drop any proposal that duplicates an existing skill or capability.
**Step 4: Rank proposals**
Rank by: (Impact score) x (1 / Effort score), where High=3, Medium=2, Low=1 and Small=1, Medium=2, Large=3.
Output: ranked list of 5-10 proposals, each with proposal description, scope, effort, and expected outcome.
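The ranking formula works out as follows. The proposal dict keys (`name`, `impact`, `effort`) are assumptions made for illustration; the score mapping comes directly from Step 4. Ties keep their input order because Python's sort is stable.

```python
IMPACT = {"High": 3, "Medium": 2, "Low": 1}
EFFORT = {"Small": 1, "Medium": 2, "Large": 3}


def rank_score(impact: str, effort: str) -> float:
    """(Impact score) x (1 / Effort score)."""
    return IMPACT[impact] / EFFORT[effort]


def rank_proposals(proposals):
    """Sort proposals from highest to lowest rank score."""
    return sorted(
        proposals,
        key=lambda p: rank_score(p["impact"], p["effort"]),
        reverse=True,
    )


print(rank_score("High", "Small"))    # 3.0
print(rank_score("Medium", "Small"))  # 2.0
print(rank_score("High", "Large"))    # 1.0
```

A Medium-impact, Small-effort proposal therefore outranks a High-impact, Large-effort one.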
**Gate**: All proposals are concrete (specific files/skills named), non-duplicative (verified against INDEX.json), and ranked. Proceed with the top 5.
---
### Phase 3: CRITIQUE -- Multi-persona evaluation
**Goal**: Evaluate proposals from multiple perspectives to surface blind spots.
**Step 1: Check for multi-persona-critique skill**
```bash
test -f skills/research/multi-persona-critique/SKILL.md && echo "AVAILABLE" || echo "NOT AVAILABLE"
```
**Step 2a: If multi-persona-critique is available**
```
Skill(skill="multi-persona-critique", args="Evaluate these toolkit improvement proposals: {proposals}")
```
**Step 2b: If NOT available -- use inline fallback**
See `references/evolve-preferred-patterns.md` § Phase 3 Inline Critique Fallback for the 3-agent dispatch prompts and scoring table.
**Step 3: Synthesize consensus**
For each proposal, average persona scores (STRONG=3, MODERATE=2, WEAK=1):
- Score >= 2.5 = STRONG consensus
- Score >= 1.5 and < 2.5 = MODERATE consensus
- Score < 1.5 = WEAK consensus (shelve)
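A minimal sketch of the synthesis step, treating every score from 1.5 up to (but not including) 2.5 as MODERATE. The function name is hypothetical, and the input is assumed to be one rating string per persona.

```python
from statistics import mean

RATING = {"STRONG": 3, "MODERATE": 2, "WEAK": 1}


def consensus(ratings):
    """Average the persona ratings for one proposal and map the score to a band."""
    score = mean(RATING[r] for r in ratings)
    if score >= 2.5:
        return "STRONG"
    if score >= 1.5:
        return "MODERATE"
    return "WEAK"  # shelve


print(consensus(["STRONG", "STRONG", "MODERATE"]))  # STRONG (average 2.67)
print(consensus(["STRONG", "MODERATE", "WEAK"]))    # MODERATE (average 2.0)
```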
**Gate**: All personas have reported. Synthesis complete. At least 1 proposal rated STRONG. If no STRONG proposals, revisit Phase 2 with the critique feedback, or report to user that no high-confidence improvements were found this cycle.
**On early exit (no STRONG proposals): always record to the learning DB before stopping.** See `references/evolve-scripts.md` § Early Exit Record for the learning-db command template.
---
### Phase 4: BUILD -- Implement winners
**Goal**: Implement the top 1-3 STRONG-rated proposals on isolated feature branches.
**Constraint**: Maximum 3 implementations per cycle. Focus over breadth.
**Step 1: Select winners**
Take the top 1-3 proposals rated STRONG by consensus. Do not pad with MODERATE proposals.
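The selection rule amounts to filtering on STRONG and capping at three. In this sketch the input is assumed to be a list of `(name, consensus)` pairs already ordered best first; the function name is hypothetical.

```python
MAX_WINNERS = 3  # maximum implementations per cycle


def select_winners(ranked):
    """Keep only STRONG proposals, in rank order, capped at MAX_WINNERS."""
    strong = [name for name, rating in ranked if rating == "STRONG"]
    return strong[:MAX_WINNERS]  # never padded with MODERATE proposals
```

An empty result corresponds to the early-exit path described in Phase 3.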
**Step 2: Dispatch implementation agents**
For each winner, dispatch an implementation agent in an isolated context. See `references/evolve-scripts.md` § Build Dispatch for the proposal-type to implementation-approach table.
Each implementation must create a feature branch `feat/evolve-{proposal-slug}` and commit with a descriptive message.
**Step 3: Validate** -- run the checks that apply to each implementation: `python3 -m scripts.skill_eval.quick_validate skills/{skill-name}` for skills, `python3 -m py_compile {script}` for Python scripts, and `bash -n {script}` for shell scripts.
**Gate**: All implementations committed on feature branches. Basic validation passed. Proceed to testing.
---
### Phase 5: VALIDATE -- A/B test implementations
**Goal**: Empirically verify that each implementation improves outcomes vs baseline.
**Step 1: Create test cases**
For each implementation, create 3-5 realistic test prompts that exercise the changed behavior.
**Step 2: Run comparisons**
See `references/evolve-scripts.md` § Validate Run for the skill-eval command and manual fallback pattern.
**Step 3: Evaluate results**
Win condition for each implementation:
- 60%+ of test cases show improvement on at least one dimension
- No dimension regressed by more than 1 point (on a 5-point scale)
- No new failures introduced
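The three win conditions can be checked together as sketched below. The data shape is an assumption: each test case maps a dimension name to a `(baseline, candidate)` score pair on the 5-point scale, and regression is measured per test case and per dimension rather than on averaged scores.

```python
def is_win(cases, new_failures=0):
    """Return True when an implementation meets all three win conditions."""
    if not cases or new_failures > 0:
        return False  # nothing tested, or a new failure was introduced

    # Test cases that improved on at least one dimension
    improved = sum(
        1 for case in cases
        if any(cand > base for base, cand in case.values())
    )
    # Largest single-dimension drop across all test cases
    worst_drop = max(
        (base - cand for case in cases for base, cand in case.values()),
        default=0,
    )
    return improved / len(cases) >= 0.6 and worst_drop <= 1
```

With three test cases, two must improve to clear the 60% bar, since one of three is only 33%.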
**Gate**: All implementations tested. Win/loss determined for each. Evidence recorded.
---
### Phase 6: EVOLVE -- Promote winners and record learnings
**Goal**: Ship winners via PR, record all outcomes in the learning database.
**Step 1: Handle winners (WIN status)**
For each winning implementation, create a PR using the template from `references/evolve-scripts.md` § Step 1, run pr-review to validate it, then merge.
The multi-persona critique + A/B testing gate is the review. Auto-merge is safe because the validation happened before this step.
**Step 1b: Clean up the feature branch after merge**
Use the cleanup commands from `references/evolve-scripts.md` § Step 1b.
**Step 2: Handle losers (LOSS status)**
Record what was tried and why it failed using the failure template from `references/evolve-scripts.md` § Step 2.
**Step 3: Record the full cycle**
Record using the full cycle template from `references/evolve-scripts.md` § Step 3.
**Step 4: Write evolution report**
Write the dated report to `evolution-reports/evolution-report-{YYYY-MM-DD}.md` using the template in `references/evolution-report-template.md`. See setup command in `references/evolve-scripts.md` § Step 4.
**Gate**: Winners merged. Learnings recorded for all proposals (wins and losses). Evolution report written. Cycle complete.
---
## Reference Loading
| Signal | Load |
|--------|------|
| Running Phase 0 DISCOVER (frequency check, briefing data commands needed) | `references/diagnose-scripts.md` |
| Running Phase 1 DIAGNOSE (Steps 1-4c commands needed) | `references/diagnose-scripts.md` |
| Phase 0 perspective agent table, proposal format | `references/evolve-preferred-patterns.md` |
| Phase 3 inline critique fallback (multi-persona not available) | `references/evolve-preferred-patterns.md` |
| Failure modes, error handling, cost estimate, cron scheduling | `references/evolve-preferred-patterns.md` |
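| Running Phase 3 early exit, Phase 4 BUILD, Phase 5 VALIDATE, or Phase 6 EVOLVE (early-exit record, build dispatch, skill-eval run, PR template, merge, cleanup, learning DB commands) | `references/evolve-scripts.md` |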
| Writing or reading the evolution report | `references/evolution-report-template.md` |
---
## References
- `references/evolution-report-template.md` -- Template for the evolution report
- `references/diagnose-scripts.md` -- Phase 0 and Phase 1 bash/Python commands
- `references/evolve-scripts.md` -- Phase 3 early-exit record, Phase 4 build dispatch, Phase 5 validate run, and Phase 6 PR, merge, cleanup, and learning DB commands
- `references/evolve-preferred-patterns.md` -- Failure modes, error handling, cost, critique fallback, scheduling
- `skills/meta/auto-dream/SKILL.md` -- Nightly sibling: memory consolidation and learning graduation
- `skills/meta/skill-eval/SKILL.md` -- Skill testing and benchmarking
- `skills/research/multi-persona-critique/SKILL.md` -- Multi-persona evaluation (may not exist yet; inline fallback in references)
- `skills/meta/skill-creator/SKILL.md` -- Skill creation methodology
- `skills/meta/agent-comparison/SKILL.md` -- A/B testing methodology
- `skills/infrastructure/headless-cron-creator/SKILL.md` -- Cron job creation patterns