crucible-research-foundations
$
npx mdskill add terrylica/cc-skills/crucible-research-foundationsValidates research foundations to prevent look-ahead bias and label leakage
- Checks for look-ahead bias, label leakage, and causal feature validity
- Uses Read, Grep, and Glob tools to analyze code and data
- Applies causality tests and pattern validation to detect violations
- Reports findings and ensures results are epistemically sound
SKILL.md
.github/skills/crucible-research-foundationsView on GitHub ↗
---
name: crucible-research-foundations
description: Validate findings, design shuffled nulls, check label leakage, review causal features. TRIGGERS - shuffled null, label leakage
allowed-tools: Read, Grep, Glob
---
# Research Foundations — 6 epistemic disciplines
> **Self-Evolving Skill**: This skill improves through use. If a discipline's guidance fails in practice or a new trap emerges, update the relevant section AND append to `references/evolution-log.md`. Don't defer.
Read these in order. The first three (causal, labels, nulls) are the hardest prerequisites — violating any of them silently invalidates every downstream result.
---
## 1. Causal-feature invariant (bars[:i])
Every feature `f[i]` used at trigger/decision bar `i` must be computable using only `bars[0:i]` — never `bars[i]`, never `bars[i+1:]`. Violation produces look-ahead bias; findings silently become worthless.
**Canonical pattern**:
```python
for i in range(n):
lo = max(0, i - window)
wind = values[lo:i] # EXCLUSIVE upper bound — no peeking
f[i] = compute(wind)
```
Note `lo:i` (exclusive), not `lo:i+1`. This discipline "feels off by one" but is correct.
**Verification test** (add to every new feature function):
```python
def test_causality(fn, n=1000):
bars = generate_test_bars(n)
f_orig = fn(bars)
bars_mod = bars.copy()
bars_mod[500:] *= 2 # perturb the FUTURE
f_mod = fn(bars_mod)
assert np.array_equal(f_orig[:500], f_mod[:500]), "look-ahead detected"
```
**Silent-bug signature**: impossibly clean results (tw > 10 bps on FX, win rate > 70%, OOS matches IS perfectly).
Full reference: `findings/methodology/10-causal-feature-invariant.md`.
---
## 2. Label-leakage (bar-local scaling kills window leakage)
Forward labels must be scaled to the **triggering bar's own range**, NEVER to a window-wide scale. Window-relative labels are tautological.
**Trap**: If you label `fwd+H = UP when close[i+H] - close[i] > window.span/20`, then when `close[i]` is near `window.min` (loc=B), `fwd=UP` is near-automatic. Agents will report spurious "signals".
**Fix**: use bar-local triple-barrier labels:
```python
r = high[i] - low[i] # THIS bar's range, not window's
tp_level = close[i] + tp_mult * r
sl_level = close[i] - sl_mult * r
# walk forward, exit at first tp/sl/expiry
```
**Symptom that you fell into the trap**: apparent signal strengthens monotonically with `loc` quintile; collapses when you test adjacent cells.
Full reference: `findings/methodology/02-label-leakage-bar-local-scaling.md`.
---
## 3. Shuffled-null design (3 null types — get the right one)
Shuffled-null tests are mandatory before trust, but the **choice of what to shuffle** is a design decision.
| Hypothesis class | Shuffle WHAT | Session example |
| -------------------------------------------- | ------------------------------------------------------------- | --------------------------------------------------------- |
| "Feature X predicts outcomes" | Shuffle the feature values | Phase F-B (used wrong null, "falsified" a real signal) |
| "Trigger pattern fires at informative times" | Shuffle the trigger mask (preserve fire-rate, move locations) | Phase C (validated ngram_triple_fast_up at z=+5.74) |
| "Filter improves selection" | Shuffle which trades pass the filter | Phase L-C (evaluated filters against N-size random draws) |
**Rule**: ask "what is the alternative hypothesis, in one sentence?" If you can't state it, you don't know what you're testing.
**Common mistakes**:
- Using feature-shuffle when testing a trigger pattern → destroys temporal structure the pattern depends on → real signal looks worse than shuffled noise
- Under-tight null (null std huge relative to observed effect) → no statistical power
- Over-tight null (too few permutations) → unreliable z-estimates; use ≥100 for z<3, ≥1000 for z<2
Full reference: `findings/methodology/03-shuffled-null-design.md`.
---
## 4. Agent significance corrections (z-scores are overstated 2-3×)
LLM agents systematically overstate z-scores. Treat agent-reported p-values as **upper bounds**.
**Three overstatement patterns**:
1. **Ignored multiple-testing burden**: agent tests 25 variants, reports z=2.43 vs nominal 1.96 threshold. True Bonferroni threshold is `sqrt(2 * ln(N))` — for N=25 that's z>2.8.
2. **Confused sample-mean z with binomial-proportion z**: 53.5% vs 50% on N=840 gives z≈2.0 not 4.2.
3. **Extremum-of-K treated as single test**: "top combo from 17,280" has expected null-max `null_mean + null_std × sqrt(2 ln K)` ≈ null_mean + 4.5σ. An observed tw that's below that expectation is not a finding.
**Always verify**:
- How many implicit tests did the agent run?
- Re-derive z yourself: `(real - null.mean) / null.std`
- Bonferroni threshold for K tests: `z > sqrt(2 * ln K)`
**Trust thresholds**:
- z > 5, N > 500: likely real, test further
- z in [3, 5]: promising, mandatory gate validation
- z in [2, 3]: suspect, require adjacent-cell gradient + null test
- z < 2: treat as null
Full reference: `findings/methodology/09-agent-significance-corrections.md`.
---
## 5. Record-keeping discipline (append-only ledger + audit folders)
Every investigation — positive or null — must produce a permanent, discoverable record.
**3-layer architecture**:
```
findings/
├── evolution/
│ ├── evolution.jsonl # append-only ledger
│ └── audits/
│ └── YYYY-MM-DD-slug/
│ ├── CLAUDE.md # navigator
│ ├── verdict.md # plain-English conclusion
│ ├── CHRONICLE.md # narrative (for major findings)
│ ├── <reproducer>.py # script that regenerates headline numbers
│ └── <artifact>.json # raw telemetry
└── methodology/ # universal principles
```
**Ledger entry fields**: `id`, `date`, `status`, `supersedes`, `superseded_by`, `headline`, `key_numbers`, `evidence` (file paths), `sha256_results`.
**The supersedes pattern**: when a later finding replaces an earlier one, ADD a new entry with `supersedes: "OLD-ID"`; UPDATE the old entry with `superseded_by: "NEW-ID"`. **Do NOT delete** the older audit folder.
Full reference: `findings/methodology/07-record-keeping-discipline.md`.
---
## 6. Post-mortem-before-abandon
Before declaring a signal dead, enrich every trade with causal pre-entry features and hunt filters on individual losses. A "sometimes works" signal is often a filterable signal in disguise.
**Pipeline**:
1. Run the signal across full history; collect N trade outcomes
2. Compute ~20-30 causal features at each trigger bar
3. Emit per-trade parquet + CSV (one row per trade)
4. Ship to multi-lens agents (see Skill B)
5. Each agent hunts filters that separate winners from losers
6. Evaluate filters against shuffled-null (see §3)
**Kill-selectivity metric**: `losers_killed / max(1, winners_killed)`. < 1.0 = harmful; 1.0-1.2 = marginal; 1.2-1.5 = useful; > 1.5 = strong.
Session example: `+0.178 bps` baseline → `+0.514 bps` after Phase-L filter. 2.9× lift from enrichment-driven filter hunt.
Full reference: `findings/methodology/06-per-trade-enrichment-postmortem.md`.
---
## Confirmation counts (provisional, as of session ca9d7ffa)
| Principle | Confirmed | Notes |
| --------------------------- | ----------------- | -------------------------------------------------------------------------------- |
| 1. causal-feature-invariant | 18+ (every phase) | Fundamental; drop only with proof |
| 2. label-leakage | 2 | Directly caught spurious "lower-rejection-at-bottom" |
| 3. shuffled-null-design | 4 | Phase F-B wrong-null, Phase C right-null, Phase L filter-null, Phase M mgmt-null |
| 4. agent-sig-corrections | 5+ | Combinatorialist, transition-asymmetry, trade-mgmt agents all overstated |
| 5. record-keeping | 5 ledger entries | Full chain for NGRAM3FU-STRADDLE |
| 6. post-mortem | 1 | Phase L delivered the filter; needs re-confirmation on other campaigns |
Higher `confirmed` = more trustworthy. Principle 6 has only one confirmation and should be treated as provisional.
---
## Post-Execution Reflection
After invoking this skill:
1. Did applying a principle catch a bug or false positive? Increment its `confirmed` count in the table above; note the session where it fired in `references/evolution-log.md`.
2. Did a principle fail (bad guidance)? Demote it in the table; add a `superseded_by` pointer in `references/archive/` with `resurrect_if:` conditions.
3. New trap that isn't covered? Draft a new section here and append to the evolution log.
4. Never silently move on. This skill's value compounds only if reality-corrections flow back.
More from terrylica/cc-skills
- academic-pdf-to-gfmConvert academic PDF papers to GitHub-renderable GFM markdown with math equations. TRIGGERS - PDF, GitHub markdown, math
- adaptive-wfo-epochAdaptive epoch selection for Walk-Forward Optimization. TRIGGERS - WFO epoch, epoch selection, WFE optimization, overfitting epochs.
- adr-code-traceabilityAdd ADR references to code for traceability. TRIGGERS - ADR traceability, code reference, document decision in code.
- adr-graph-easy-architectASCII architecture diagrams for ADRs via graph-easy. TRIGGERS - ADR diagram, architecture diagram, ASCII diagram.
- agent-reach>
- agentic-process-monitorMonitor background processes from Claude Code using sentinel files, heartbeat liveness, and subagent polling. Best practices and.
- alpha-forge-preshipAlpha Forge quality gates for PR review - RNG determinism, URL validation, parameter validation, manifest sync.
- article-extractorExtract MQL5 articles and documentation. TRIGGERS - MQL5 articles, MetaTrader docs, mql5.com resources.
- ascii-diagram-validatorValidate ASCII diagram alignment in markdown. TRIGGERS - diagram alignment, ASCII art, box-drawing diagrams.
- asciinema-analyzerSemantic analysis of asciinema recordings. TRIGGERS - analyze cast, keyword extraction, find patterns in recordings.