crucible-research-foundations

Name: crucible-research-foundations
Author: terrylica/cc-skills
$npx mdskill add terrylica/cc-skills/crucible-research-foundations
Validates research foundations to prevent look-ahead bias and label leakage
Checks for look-ahead bias, label leakage, and causal feature validity
Uses Read, Grep, and Glob tools to analyze code and data
Applies causality tests and pattern validation to detect violations
Reports findings and ensures results are epistemically sound
SKILL.md
.github/skills/crucible-research-foundationsView on GitHub ↗
---
name: crucible-research-foundations
description: Validate findings, design shuffled nulls, check label leakage, review causal features. TRIGGERS - shuffled null, label leakage
allowed-tools: Read, Grep, Glob
---

# Research Foundations — 6 epistemic disciplines

> **Self-Evolving Skill**: This skill improves through use. If a discipline's guidance fails in practice or a new trap emerges, update the relevant section AND append to `references/evolution-log.md`. Don't defer.

Read these in order. The first three (causal, labels, nulls) are the hardest prerequisites — violating any of them silently invalidates every downstream result.

---

## 1. Causal-feature invariant (bars[:i])

Every feature `f[i]` used at trigger/decision bar `i` must be computable using only `bars[0:i]` — never `bars[i]`, never `bars[i+1:]`. Violation produces look-ahead bias; findings silently become worthless.

**Canonical pattern**:

```python
for i in range(n):
    lo = max(0, i - window)
    wind = values[lo:i]  # EXCLUSIVE upper bound — no peeking
    f[i] = compute(wind)
```

Note `lo:i` (exclusive), not `lo:i+1`. This discipline "feels off by one" but is correct.

**Verification test** (add to every new feature function):

```python
def test_causality(fn, n=1000):
    bars = generate_test_bars(n)
    f_orig = fn(bars)
    bars_mod = bars.copy()
    bars_mod[500:] *= 2  # perturb the FUTURE
    f_mod = fn(bars_mod)
    assert np.array_equal(f_orig[:500], f_mod[:500]), "look-ahead detected"
```

**Silent-bug signature**: impossibly clean results (tw > 10 bps on FX, win rate > 70%, OOS matches IS perfectly).

Full reference: `findings/methodology/10-causal-feature-invariant.md`.

---

## 2. Label-leakage (bar-local scaling kills window leakage)

Forward labels must be scaled to the **triggering bar's own range**, NEVER to a window-wide scale. Window-relative labels are tautological.

**Trap**: If you label `fwd+H = UP when close[i+H] - close[i] > window.span/20`, then when `close[i]` is near `window.min` (loc=B), `fwd=UP` is near-automatic. Agents will report spurious "signals".

**Fix**: use bar-local triple-barrier labels:

```python
r = high[i] - low[i]   # THIS bar's range, not window's
tp_level = close[i] + tp_mult * r
sl_level = close[i] - sl_mult * r
# walk forward, exit at first tp/sl/expiry
```

**Symptom that you fell into the trap**: apparent signal strengthens monotonically with `loc` quintile; collapses when you test adjacent cells.

Full reference: `findings/methodology/02-label-leakage-bar-local-scaling.md`.

---

## 3. Shuffled-null design (3 null types — get the right one)

Shuffled-null tests are mandatory before trust, but the **choice of what to shuffle** is a design decision.

| Hypothesis class                             | Shuffle WHAT                                                  | Session example                                           |
| -------------------------------------------- | ------------------------------------------------------------- | --------------------------------------------------------- |
| "Feature X predicts outcomes"                | Shuffle the feature values                                    | Phase F-B (used wrong null, "falsified" a real signal)    |
| "Trigger pattern fires at informative times" | Shuffle the trigger mask (preserve fire-rate, move locations) | Phase C (validated ngram_triple_fast_up at z=+5.74)       |
| "Filter improves selection"                  | Shuffle which trades pass the filter                          | Phase L-C (evaluated filters against N-size random draws) |

**Rule**: ask "what is the alternative hypothesis, in one sentence?" If you can't state it, you don't know what you're testing.

**Common mistakes**:

- Using feature-shuffle when testing a trigger pattern → destroys temporal structure the pattern depends on → real signal looks worse than shuffled noise
- Under-tight null (null std huge relative to observed effect) → no statistical power
- Over-tight null (too few permutations) → unreliable z-estimates; use ≥100 for z<3, ≥1000 for z<2

Full reference: `findings/methodology/03-shuffled-null-design.md`.

---

## 4. Agent significance corrections (z-scores are overstated 2-3×)

LLM agents systematically overstate z-scores. Treat agent-reported p-values as **upper bounds**.

**Three overstatement patterns**:

1. **Ignored multiple-testing burden**: agent tests 25 variants, reports z=2.43 vs nominal 1.96 threshold. True Bonferroni threshold is `sqrt(2 * ln(N))` — for N=25 that's z>2.8.
2. **Confused sample-mean z with binomial-proportion z**: 53.5% vs 50% on N=840 gives z≈2.0 not 4.2.
3. **Extremum-of-K treated as single test**: "top combo from 17,280" has expected null-max `null_mean + null_std × sqrt(2 ln K)` ≈ null_mean + 4.5σ. An observed tw that's below that expectation is not a finding.

**Always verify**:

- How many implicit tests did the agent run?
- Re-derive z yourself: `(real - null.mean) / null.std`
- Bonferroni threshold for K tests: `z > sqrt(2 * ln K)`

**Trust thresholds**:

- z > 5, N > 500: likely real, test further
- z in [3, 5]: promising, mandatory gate validation
- z in [2, 3]: suspect, require adjacent-cell gradient + null test
- z < 2: treat as null

Full reference: `findings/methodology/09-agent-significance-corrections.md`.

---

## 5. Record-keeping discipline (append-only ledger + audit folders)

Every investigation — positive or null — must produce a permanent, discoverable record.

**3-layer architecture**:

```
findings/
├── evolution/
│   ├── evolution.jsonl              # append-only ledger
│   └── audits/
│       └── YYYY-MM-DD-slug/
│           ├── CLAUDE.md            # navigator
│           ├── verdict.md           # plain-English conclusion
│           ├── CHRONICLE.md         # narrative (for major findings)
│           ├── <reproducer>.py      # script that regenerates headline numbers
│           └── <artifact>.json      # raw telemetry
└── methodology/                     # universal principles
```

**Ledger entry fields**: `id`, `date`, `status`, `supersedes`, `superseded_by`, `headline`, `key_numbers`, `evidence` (file paths), `sha256_results`.

**The supersedes pattern**: when a later finding replaces an earlier one, ADD a new entry with `supersedes: "OLD-ID"`; UPDATE the old entry with `superseded_by: "NEW-ID"`. **Do NOT delete** the older audit folder.

Full reference: `findings/methodology/07-record-keeping-discipline.md`.

---

## 6. Post-mortem-before-abandon

Before declaring a signal dead, enrich every trade with causal pre-entry features and hunt filters on individual losses. A "sometimes works" signal is often a filterable signal in disguise.

**Pipeline**:

1. Run the signal across full history; collect N trade outcomes
2. Compute ~20-30 causal features at each trigger bar
3. Emit per-trade parquet + CSV (one row per trade)
4. Ship to multi-lens agents (see Skill B)
5. Each agent hunts filters that separate winners from losers
6. Evaluate filters against shuffled-null (see §3)

**Kill-selectivity metric**: `losers_killed / max(1, winners_killed)`. < 1.0 = harmful; 1.0-1.2 = marginal; 1.2-1.5 = useful; > 1.5 = strong.

Session example: `+0.178 bps` baseline → `+0.514 bps` after Phase-L filter. 2.9× lift from enrichment-driven filter hunt.

Full reference: `findings/methodology/06-per-trade-enrichment-postmortem.md`.

---

## Confirmation counts (provisional, as of session ca9d7ffa)

| Principle                   | Confirmed         | Notes                                                                            |
| --------------------------- | ----------------- | -------------------------------------------------------------------------------- |
| 1. causal-feature-invariant | 18+ (every phase) | Fundamental; drop only with proof                                                |
| 2. label-leakage            | 2                 | Directly caught spurious "lower-rejection-at-bottom"                             |
| 3. shuffled-null-design     | 4                 | Phase F-B wrong-null, Phase C right-null, Phase L filter-null, Phase M mgmt-null |
| 4. agent-sig-corrections    | 5+                | Combinatorialist, transition-asymmetry, trade-mgmt agents all overstated         |
| 5. record-keeping           | 5 ledger entries  | Full chain for NGRAM3FU-STRADDLE                                                 |
| 6. post-mortem              | 1                 | Phase L delivered the filter; needs re-confirmation on other campaigns           |

Higher `confirmed` = more trustworthy. Principle 6 has only one confirmation and should be treated as provisional.

---

## Post-Execution Reflection

After invoking this skill:

1. Did applying a principle catch a bug or false positive? Increment its `confirmed` count in the table above; note the session where it fired in `references/evolution-log.md`.
2. Did a principle fail (bad guidance)? Demote it in the table; add a `superseded_by` pointer in `references/archive/` with `resurrect_if:` conditions.
3. New trap that isn't covered? Draft a new section here and append to the evolution log.
4. Never silently move on. This skill's value compounds only if reality-corrections flow back.