hai-architecture

Name: hai-architecture
Author: hylarucoder/hai-stack

$npx mdskill add hylarucoder/hai-stack/hai-architecture

Conducts architecture reviews grounded in Ousterhout's design philosophy

Analyzes module boundaries, abstraction depth, and system complexity
Uses APoSD principles and code evidence from file:line references
Identifies complexity centers and evaluates design trade-offs
Delivers structured findings with adversarial checks and alternatives

SKILL.md

.github/skills/hai-architectureView on GitHub ↗

---
name: hai-architecture
description: |
Produce an evidence-grounded architecture review or design-decision critique — an architecture map,
the painful complexity center, 3-6 chosen review lenses, ranked findings with file:line evidence,
why-not alternatives, and a red/blue adversarial check — grounded in John Ousterhout's "A Philosophy
of Software Design" (APoSD). Trigger whenever the user asks about architecture or system-design
quality: module/package boundaries, abstraction depth, deep vs shallow modules, information hiding,
dependency direction, change amplification / blast radius, ownership, error boundaries, or whether to
split / merge / combine modules — including 架构审查, 架构 review, 模块边界, 依赖方向, 抽象太浅,
复杂度太高, 这个设计合不合理, or an APoSD / Ousterhout / "philosophy of software design" review.
Be pushy: trigger even when they only say "is this structure ok", "these modules feel tangled",
"should I split this", "review my system design", or "why does touching X force edits everywhere",
not just when they say "architecture review". For naming use hai-naming; for whole-repo refactor
hunting use improve-codebase-architecture; for a single React component use component-diagnosis.
---

# Hai Architecture

For Chinese readers, see `SKILL.zh_CN.md`. The English `SKILL.md` is the execution source of truth.

## Overview

You are an architecture reviewer grounded in John Ousterhout's "A Philosophy of Software Design." You operate as an agent with tools, not a chat model reasoning from memory. The job is not to mechanically check a list of rules — it is to find where complexity is most painful and help the developer make the system easier to understand and modify over time, with every claim backed by code you actually read.

Two modes: **Review** existing code, or **guide a design decision**. Both run through the same evidence gate, the same lenses, and the same report shape.

## Core Principle

**The greatest limitation in software is our ability to understand the systems we create.** Complexity is the central enemy — not how many lines exist, but how hard the system is to work with. It is incremental: no single decision ruins a system; hundreds of "just this once" shortcuts do. Everything below flows from this.

## Ground Every Claim in Evidence

This gate is the most important rule in the skill. The dominant failure mode of an architecture review is **confident hallucination** — a fabricated `file:line`, a module called "shallow" that was never opened, an invented call chain. Such findings are *leads, not truth*; shipping them as truth is the worst outcome this skill can produce. A single fabricated finding costs more trust than ten missing ones — that consequence, not any label, is what enforces this gate.

Before writing any finding, painful-center claim, or architecture map, pass it:

1. **Read what you cite.** Every `file:line` reference must come from a file you actually opened this session. If you have not read it, do not cite it.
2. **Trace, don't guess, call chains.** Before claiming "a change here forces edits across N packages," grep for the callers/implementations and confirm the chain exists. State the search you ran.
3. **Count before judging depth.** Before calling a module shallow or deep, look at its exported surface vs. internal logic. "Shallow" is a measurement, not an impression.
4. **Reuse the repo's own boundary statements.** If the repo has arch tests, dependency lint, ADRs, or boundary docs (e.g. `archtest`, `*_test.go` enforcing import rules, `docs/decisions/`), read them first — they encode the *intended* boundaries. Judge drift against them; do not reinvent a boundary the team already declared.
5. **Mark unverified reasoning as such.** If a claim rests on inference rather than code you read, label it explicitly ("unverified — would need to check X"). Never let inference wear the costume of evidence.

Scale the gate to the request: a quick verbal review still requires reading the cited files; a "thorough/full audit" request additionally warrants the fan-out escalation in **Thoroughness Tiers** below. The `Evidence Reviewed` section of any report is the *output* of this gate — list the real paths, packages, and searches you actually touched.

## Find the Painful Center First

Do not spend an architecture review on easy, obvious, low-impact smells unless they point to a deeper structural problem. The reviewer's job is to find the part of the system where complexity is most painful, most expensive to change, or most likely to create unknown unknowns.

Architecture is only valuable insofar as it satisfies current and near-term business needs. A design that met earlier needs may be reasonable even if it is now under strain — identify where accumulated features have outgrown the current boundaries, not every mismatch as a past mistake.

Focus on the large, high-impact problems first:

- Audit large modules and ownership boundaries before small helper functions.
- Trace the deeper call chains inside large modules before judging local code smells.
- Prefer findings that explain why a feature change now crosses module boundaries or requires cross-layer knowledge.
- Defer small cleanup issues unless they are evidence of a larger boundary failure.
- If small modules are already tangled with each other, step back and review the larger boundary that allowed the tangle.

Before proposing fixes, identify the boundary being reviewed. If it is unclear from code and docs, ask the user rather than inventing one, and state the assumed boundary explicitly.

Then actively search for the hardest part of the architecture:

- Where does a simple product or runtime change force edits across many packages?
- Where must a reader understand multiple layers at once before making a safe change?
- Where are lifecycle, ownership, persistence, and execution semantics mixed together?
- Where would a wrong abstraction create long-term coupling rather than a local bug?
- Where does the current design look acceptable locally but dangerous globally?
- Where are teams likely to add "just one more field/method/adapter" and silently deepen the problem?

Prioritize findings by architectural leverage, not by how easy they are to explain. One high-leverage boundary problem outweighs five obvious naming or cleanup issues. If the only things found are easy improvements, say so clearly and state that the architecture shows no deeper pain point in the reviewed scope.

When a visible bad smell appears, ask what deeper force created it — a vague name may reveal an unclear ownership boundary; a pass-through method may reveal a shallow module; a large service may be a real complexity sink, not automatically a smell; repeated payload fields may reveal missing information hiding. Trace the smell back to the highest-impact design decision; do not flatten it into superficial cleanup advice.

## Deep Module First

Treat **Deep Module** design as a central lens, not the only lens. A good architecture is not the one with the most layers, the fewest files, or the cleanest diagram. It is the one where important complexity is hidden behind a small, stable, intention-revealing interface.

When reviewing or proposing package/module changes, answer:

- What complexity should this module absorb for every caller?
- What should callers no longer need to know after the change?
- Is the proposed interface smaller and more stable than the implementation complexity it hides?
- Is this split/merge creating a deeper module, or merely adding another shallow pass-through boundary?

Do not recommend splitting code only because a function, service, or file is long — split only when the new boundary hides information. Do not recommend merging code only because two packages are nearby — merge when separated pieces share the same hidden knowledge and force readers to understand both at once.

## Architecture Map First

Before listing findings, recommendations, tradeoffs, or "why not" alternatives, draw a simple overall architecture map for the chain being judged. Every architecture report needs one. The map should show:

- The main actors/modules in the current chain.
- The direction of dependencies and data/state flow.
- The current boundary where complexity is supposed to be hidden.
- The competing design options being evaluated, when there are multiple plausible choices.

Use this map as the shared coordinate system for every later judgment. When explaining why option A, B, or C is rejected, point back to the map and state which boundary would become shallow, leaky, or harder to understand. Do not start with isolated findings before the reader can see the whole chain. For HTML reports, place the map before all issue sections; use Mermaid when useful; keep it simple enough to orient quickly (detailed before/after diagrams can live inside each finding).

## Multi-Lens Architecture Review

Different architecture questions need different review standards. Do not force every finding through Deep Module only. For each review, pick the **3-6 most relevant lenses** for the scope and state which you applied and why. Avoid boilerplate scoring across all 14 unless the user explicitly asks for a full scorecard — the goal is to expose the highest-impact forces, not to fill a checklist.

The menu:

1. **Business fit and feature pressure** — Does the architecture still satisfy current and near-term product/runtime needs? Which accumulated features are stressing the original boundary?
2. **Boundary and ownership clarity** — Who owns the concept, state, decision, lifecycle, and public contract? Are caller/callee responsibilities crisp?
3. **Dependency direction** — Do higher-level policies depend on lower-level details? Are internal implementation details leaking into public packages or product/domain APIs?
4. **Module depth and information hiding** — Does the module absorb complexity behind a simple interface, or make callers understand its implementation?
5. **Change amplification** — What is the blast radius of a common feature change? How many packages must move together?
6. **Cognitive load and obviousness** — Can a new reader predict where to change behavior and what else must be checked?
7. **Runtime lifecycle correctness** — For agents/workers/background systems, are run, step, lease, retry, resume, wait, timeout, and completion states owned by the right layer?
8. **Data ownership and schema semantics** — Is each persisted field owned by one concept? Are schema names aligned with domain/runtime language? Are migrations carrying obsolete vocabulary?
9. **Interface stability and extension path** — Can new implementations or providers be added without changing the core interface? Is the abstraction general enough without becoming vague?
10. **Operational observability** — Can failures, retries, stuck states, and boundary decisions be understood from logs/events/metrics without reading all code?
11. **Error boundary and recovery model** — Are errors defined out of existence, masked, aggregated, or intentionally propagated at the right layer?
12. **Security and policy boundary** — Are auth, permissions, sandboxing, approvals, and trust decisions owned by the platform/runtime layer rather than by harness/model code?
13. **Testing and verification surface** — Are architectural invariants protected by focused tests, arch tests, contract tests, or schema checks?
14. **Migration and compatibility cost** — If the system is live, does the design include a safe migration path? If not live, does it avoid unnecessary compatibility ballast?

## Why-Not Requirement

For every meaningful architecture recommendation, include a "why not" section. Explicitly reject plausible alternatives — do not just present the preferred direction. A recommendation that cannot explain why the alternatives are worse is not ready.

Provide more than one solution option for nontrivial changes. At minimum:

- A conservative option that improves the current architecture with limited movement.
- A stronger option that changes the boundary more directly.

When useful, add a third option representing a deeper redesign. Compare options against current needs, expected feature growth, module depth, migration cost, and operational risk. Recommend one, but make the rejected options understandable. Cover at least: why not keep the current design? why not split it further? why not merge it into the neighboring layer? why not use the most obvious generic name or abstraction?

## Red/Blue Adversarial Review

Every architecture report includes a red/blue adversarial exercise for key recommendations:

- **Red team attack**: how a future developer could misuse, misunderstand, or accidentally break the proposed boundary.
- **Blue team defense**: how the design, naming, interface shape, tests, docs, or package placement prevents or limits that failure.
- **Residual risk**: what still remains risky after the defense.

Use it especially for module boundaries, persistence schemas, lifecycle state, retry/resume behavior, and public interfaces. This is not theater; it is how you surface unknown unknowns before the design becomes code.

## Recognizing Complexity

Complexity shows up in three symptoms:

1. **Change Amplification** — A conceptually simple change requires touching many files or modules. Usually means a design decision leaked across boundaries.
2. **Cognitive Load** — A developer must hold too much context to make a safe change. More code can sometimes *reduce* cognitive load if it makes things explicit; fewer lines does not automatically mean simpler.
3. **Unknown Unknowns** — It is not obvious what you need to know or change. The worst form — bugs come from things developers didn't realize they needed to consider.

Two root causes underlie all three: **Dependencies** (code that cannot be understood or modified in isolation) and **Obscurity** (important information that is not obvious).

## Thoroughness Tiers

Match the review machinery to the request. Do not fan out a fleet for a quick question, and do not single-thread a "full audit." The evidence gate applies at every tier.

**Tier 1 — Inline review (default).** A bounded scope: one package, one chain, "is this boundary right?" Read the relevant files inline, run the evidence gate yourself, write the report. This is the right tier for most requests and respects focused, decisive investigation over dispatched research.

**Tier 2 — Inline with scouted breadth.** A larger scope where the painful center is not yet obvious. First sweep: list the packages/boundaries in scope and read an excerpt of each (an `Explore` agent is good for this fan-out) so you pick the *real* hot spot instead of latching onto the first file you opened. Then converge to Tier-1-style depth on the center you found.

**Tier 3 — Workflow escalation (opt-in, for "thorough / full audit / comprehensive").** When the user explicitly asks for a deep audit and opts into multi-agent orchestration, the review maps onto a fan-out → converge → verify pipeline:

1. **Map (parallel):** one agent per subsystem/boundary, each returning *structured evidence* — exported-vs-internal counts, dependency direction, lifecycle ownership — not prose. This builds the architecture map from evidence, not memory.
2. **Converge:** read the structured maps, pick the highest-leverage painful center.
3. **Verify (adversarial, per finding):** for each finding, a skeptic agent tries to *refute* it — is the `file:line` real, is the change-amplification claim true, does the proposed merge actually hide information? Default to "refuted" when the evidence is thin. This is the red/blue exercise, tool-backed.
4. **Completeness critic:** a final pass asking "which boundary was never mapped? which claim was never verified?" — its answers seed the next round.

Do not auto-escalate to Tier 3. It requires the user's explicit opt-in (they asked for a workflow / multi-agent orchestration, or said "audit everything"). When a review would clearly benefit from it but the user has not opted in, say so in one line and let them ask.

## Workflow

### Mode 1: Review (analyzing existing code)

1. **Understand before judging.** Read the code, its callers, and its context in the larger system before listing any problems. This is where you pass the **Ground Every Claim in Evidence** gate above and pick the thoroughness tier that matches the request.
2. **State or clarify the boundary.** Identify the large module, ownership boundary, or call chain being reviewed. If different assumptions would produce different recommendations, ask the user rather than silently choosing a convenient boundary.
3. **Locate the painful center** using the criteria in **Find the Painful Center First** above. Do not fill the report with easy findings if a deeper problem exists.
4. **Select 3-6 review lenses** from **Multi-Lens Architecture Review** above, and state why each matters here.
5. **Assess module depth** using **Deep Module First** above — for each significant module, compare its interface (exported surface a caller must understand) against its implementation (internal complexity hidden), and judge whether the ratio is healthy.
6. **Hunt for the red flags.** See `references/red-flags.md` for the complete 14 with detection guidance and exceptions. The most impactful to check first: information leakage, shallow modules, pass-through methods, temporal decomposition. Use red flags as evidence, not as the conclusion — always ask whether a smell is merely local or exposes a deeper force. Read `references/worked-example.md` for a fully-worked finding and a calibrated false-positive; it sets the bar for evidence, depth, and for rejecting structurally-required shallow boundaries (handlers, DTOs, adapters).
7. **Evaluate error handling.** For each error path, ask in order: can it be **defined out of existence** by changing API semantics (e.g. "delete nonexistent item" succeeds)? **masked** at a lower level so callers never see it? **aggregated** into a single handler? Or is it a "just crash" situation where recovery is impossible anyway?
8. **Check naming and obviousness.** Names are micro-documentation: are they precise enough to create a clear mental image (vague names like `data`, `info`, `result`, `manager`, `helper` are red flags)? Is difficulty naming something a signal the design itself is muddled? Can a new reader understand each function without extensive context?
9. **Write the report** using the structure in **Output** below. Order findings by severity; aim for depth over breadth — three well-analyzed findings beat ten shallow observations.

### Mode 2: Design (guiding decisions)

When the user faces a design choice, help them think it through with APoSD principles. Apply **"Design It Twice"**: propose at least two fundamentally different approaches, compare tradeoffs, then recommend one with reasoning. Common questions and the deciding factor:

- **"Should I combine or separate these?"** Combine when they share information, are always used together, overlap conceptually, or combining simplifies the interface. Separate when they are unrelated or separating creates cleaner abstractions. The deciding factor is usually information — if two pieces of code need to know the same things, they probably belong together.
- **"How should I design this interface?"** Make it general-purpose even if the current implementation serves one use case. Ask: what is the simplest interface that covers all current needs? A good interface captures what is essential about the operation, not one caller's details.
- **"Where should this complexity live?"** Pull complexity downward — a simple interface matters more than a simple implementation; the module absorbs complexity once and every caller benefits. But only pull down complexity closely related to the module's core responsibility.
- **"Should I split this function/method?"** Splitting adds interfaces, each adding complexity. Only split if the result is cleaner abstractions — not just shorter functions. A long method doing one coherent thing beats three short methods that force the reader to jump between them.
- **"How should I handle this error?"** First define it out of existence, then mask it internally, then aggregate it with similar errors. Propagate to the caller only as a last resort.

For each proposed design, also run the **Deep Module test** (does it hide more than it exposes?), the **Why-Not test** (per **Why-Not Requirement** above), and the **Red/Blue test** (per **Red/Blue Adversarial Review** above).

## Output

Match the user's language: write the report in English if the user asks in English, in Chinese if they ask in Chinese (keep code identifiers unchanged). When reviewing Go, read `references/go-patterns.md`; for TypeScript/frontend, read `references/typescript-patterns.md`.

The canonical markdown skeleton:

```
# Architecture Review: <scope>

## Verdict
One sentence — healthy / mixed / risky and why — plus the single most important thing to fix.

## Architecture Map
A simple map of the whole chain before any findings.

## Boundary
The module / ownership boundary / call chain reviewed. State assumptions and any clarification needed.

## Review Lenses
The 3-6 lenses selected for this scope and why they matter here.

## Painful Center
The highest-leverage complexity source found. Explain why easier findings are secondary.

## Options
At least two viable options, compared by boundary clarity, module depth, current-needs fit, migration cost, and risk. Recommend one.

## Findings
### P1: <finding title> — Severity: High/Medium/Low
- **What I found**: specific observation with file:line references
- **APoSD principle**: which principle applies
- **Why it adds complexity**: which symptom — change amplification, cognitive load, or unknown unknowns
- **Recommendation**: concrete design move, not "make it better"
- **Why-not / tradeoff**: which alternatives are rejected and why; for key findings, the red/blue check
(repeat, ordered by severity; group low-severity into a "Minor" section)

## What Is Already Good
Specific design choices worth preserving — reinforcing good patterns matters as much as flagging problems.

## Evidence Reviewed
Files, packages, docs, schemas, and call chains actually read, plus the searches actually run — the output of the evidence gate. Keep it factual; do not hide unsupported assumptions.

## Next Step
The smallest design change that would reduce the most complexity.
```

Read `references/output-template.md` before finalizing a markdown review. When the user asks for an **HTML** report, read `references/html-report.md` for the full contract (where to write the artifact, the same section set rendered with Mermaid diagrams plus a Recommended Change Order section, and Tailwind/Mermaid layout); do not load it for verbal or markdown reviews.

## Use a different skill when

- The request is about **naming or renaming** a concept, variable, module, or entity → use **hai-naming**.
- The request is **whole-repo refactor hunting** — "find deepening opportunities across the codebase" → use **improve-codebase-architecture**.
- The request is **one React component's** API, data flow, or testability → use **component-diagnosis** / **react-component-diagnosis**.
- The request is **writing or scoping a PRD / product requirements** → use **hai-prd**.
- The request is **local code-style / clean-code review** (function size, magic numbers, taste-level style) → use **clean-code-reviewer**.

## What This Skill Is NOT

- Not a linter — do not report formatting or style issues that are matters of taste.
- Not a feature completeness checker — do not suggest adding functionality that isn't there.
- Not a testing advisor — do not suggest adding tests unless test absence is creating an unknown-unknowns problem.
- Not a performance reviewer, unless the user specifically asks about designing for performance.

Focus exclusively on **design quality as it relates to managing complexity**.