root-cause-tracing
$
npx mdskill add microsoft/FluidFramework/root-cause-tracingBugs often manifest deep in the call stack (git init in wrong directory, file created in wrong location, database opened with wrong path). Your instinct is to fix where the error appears, but that's treating a symptom.
SKILL.md
.github/skills/root-cause-tracingView on GitHub ↗
---
name: root-cause-tracing
description: Use when errors occur deep in execution and you need to trace back to find the original trigger - systematically traces bugs backward through call stack, adding instrumentation when needed, to identify source of invalid data or incorrect behavior
---
# Root Cause Tracing
## Overview
Bugs often manifest deep in the call stack (git init in wrong directory, file created in wrong location, database opened with wrong path). Your instinct is to fix where the error appears, but that's treating a symptom.
**Core principle:** Trace backward through the call chain until you find the original trigger, then fix at the source.
## When to Use
```dot
digraph when_to_use {
"Bug appears deep in stack?" [shape=diamond];
"Can trace backwards?" [shape=diamond];
"Fix at symptom point" [shape=box];
"Trace to original trigger" [shape=box];
"Bug appears deep in stack?" -> "Can trace backwards?" [label="yes"];
"Can trace backwards?" -> "Trace to original trigger" [label="yes"];
"Can trace backwards?" -> "Fix at symptom point" [label="no - dead end"];
}
```
**Use when:**
- Error happens deep in execution (not at entry point)
- Stack trace shows long call chain
- Unclear where invalid data originated
- Need to find which test/code triggers the problem
## The Tracing Process
### 1. Observe the Symptom
```
Error: git init failed in /Users/jesse/project/packages/core
```
### 2. Find Immediate Cause
**What code directly causes this?**
```typescript
await execFileAsync('git', ['init'], { cwd: projectDir });
```
### 3. Ask: What Called This?
```typescript
WorktreeManager.createSessionWorktree(projectDir, sessionId)
→ called by Session.initializeWorkspace()
→ called by Session.create()
→ called by test at Project.create()
```
### 4. Keep Tracing Up
**What value was passed?**
- `projectDir = ''` (empty string!)
- Empty string as `cwd` resolves to `process.cwd()`
- That's the source code directory!
### 5. Find Original Trigger
**Where did empty string come from?**
```typescript
const context = setupCoreTest(); // Returns { tempDir: '' }
Project.create('name', context.tempDir); // Accessed before beforeEach!
```
## Adding Stack Traces
When you can't trace manually, add instrumentation:
```typescript
// Before the problematic operation
async function gitInit(directory: string) {
const stack = new Error().stack;
console.error('DEBUG git init:', {
directory,
cwd: process.cwd(),
nodeEnv: process.env.NODE_ENV,
stack,
});
await execFileAsync('git', ['init'], { cwd: directory });
}
```
**Critical:** Use `console.error()` in tests (not logger - may not show)
**Run and capture:**
```bash
npm test 2>&1 | grep 'DEBUG git init'
```
**Analyze stack traces:**
- Look for test file names
- Find the line number triggering the call
- Identify the pattern (same test? same parameter?)
## Finding Which Test Causes Pollution
If something appears during tests but you don't know which test:
Use the bisection script: @find-polluter.sh
```bash
./find-polluter.sh '.git' 'src/**/*.test.ts'
```
Runs tests one-by-one, stops at first polluter. See script for usage.
## Real Example: Empty projectDir
**Symptom:** `.git` created in `packages/core/` (source code)
**Trace chain:**
1. `git init` runs in `process.cwd()` ← empty cwd parameter
2. WorktreeManager called with empty projectDir
3. Session.create() passed empty string
4. Test accessed `context.tempDir` before beforeEach
5. setupCoreTest() returns `{ tempDir: '' }` initially
**Root cause:** Top-level variable initialization accessing empty value
**Fix:** Made tempDir a getter that throws if accessed before beforeEach
**Also added validation at multiple layers:**
- Layer 1: Project.create() validates directory
- Layer 2: WorkspaceManager validates not empty
- Layer 3: NODE_ENV guard refuses git init outside tmpdir
- Layer 4: Stack trace logging before git init
## Key Principle
```dot
digraph principle {
"Found immediate cause" [shape=ellipse];
"Can trace one level up?" [shape=diamond];
"Trace backwards" [shape=box];
"Is this the source?" [shape=diamond];
"Fix at source" [shape=box];
"NEVER fix just the symptom" [shape=octagon, style=filled, fillcolor=red, fontcolor=white];
"Found immediate cause" -> "Can trace one level up?";
"Can trace one level up?" -> "Trace backwards" [label="yes"];
"Can trace one level up?" -> "NEVER fix just the symptom" [label="no"];
"Trace backwards" -> "Is this the source?";
"Is this the source?" -> "Trace backwards" [label="no - keeps going"];
"Is this the source?" -> "Fix at source" [label="yes"];
}
```
**NEVER fix just where the error appears.** Trace back to find the original trigger.
## Stack Trace Tips
**In tests:** Use `console.error()` not logger - logger may be suppressed
**Before operation:** Log before the dangerous operation, not after it fails
**Include context:** Directory, cwd, environment variables, timestamps
**Capture stack:** `new Error().stack` shows complete call chain
## Real-World Impact
From debugging session (2025-10-03):
- Found root cause through 5-level trace
- Fixed at source (getter validation)
- Added 4 layers of defense
- 1847 tests passed, zero pollution
More from microsoft/FluidFramework
- api-changesUse when customer-facing API changes were made — i.e., API report .md files differ from main. Guides through release tag assignment, API Council review requirements, breaking change classification, deprecation process, and changeset guidance. Triggered automatically by ci-readiness-check when api-report diffs are detected.
- brainstormingIMMEDIATELY USE THIS SKILL when creating or develop anything and before writing code or implementation plans - refines rough ideas into fully-formed designs through structured Socratic questioning, alternative exploration, and incremental validation
- building-ui-uxUse when implementing user interfaces or user experiences - guides through exploration of design variations, frontend setup, iteration, and proper integration
- ci-readiness-checkUse when the user explicitly asks for a CI check or to push their branch — e.g. "ci readiness", "check ci", "pre-push check", "ready for CI", "ci check", "ready to push", "push my changes", "push the branch", "let's push". Catches common CI failures before pushing — formatting, stale API reports, missing changesets, policy violations.
- creating-debug-tests-and-iteratingUse this skill when faced with a difficult debugging task where you need to replicate some bug or behavior in order to see what is going wrong.
- creating-skillsUse when you need to create a new custom skill for a profile - guides through gathering requirements, creating directory structure, writing SKILL.md, and optionally adding bundled scripts
- ff-oce-dashboardGenerate the OCE shift status dashboard. Triggers on: 'generate shift dashboard', 'show dashboard', 'shift status', 'status dashboard', 'what's going on', or any request for a NON-SPECIFIC overview of current OCE status (incidents, pipelines, errors).
- ff-oce-kustoUse this skill for any Kusto query or telemetry investigation specifically related to Fluid Framework or its partners. Triggers include: writing or running a Kusto query against the Office Fluid database, investigating Fluid Framework telemetry or error rates, querying Office_Fluid_FluidRuntime_* tables, looking up a Fluid session by Session_Id or docId, investigating a Fluid-related error in Loop or Whiteboard telemetry, monitoring an FF bump or partner ring deployment, checking Fluid render reliability or Scriptor errors, or when the user mentions Fluid-specific tables (Office_Fluid_FluidRuntime_*, OwhLoads, HostTracker, Scriptor) or Fluid-specific error types (dataCorruptionError, dataProcessingError, DeltaConnectionFailureToConnect, ICE, ACE). Do NOT trigger for general Kusto questions that are not related to Fluid Framework.
- finishing-a-development-branchUse this when you have completed some feature implementation and have written passing tests, and you are ready to create a PR.
- fluid-prUse when creating a pull request in the Fluid Framework repo. Composes a PR title and body following Fluid Framework conventions, proposes them to the user, then pushes the branch and creates the PR on GitHub. Triggers on "create a PR", "make a PR", "open a PR", "submit a PR", or "push and create a PR".