backtesting-py-oracle
$
npx mdskill add terrylica/cc-skills/backtesting-py-oracleValidates SQL oracle results and backtests range bar patterns with backtesting.py
- Ensures SQL and Python trade evaluations produce identical results
- Uses backtesting.py, ClickHouse, and Python dataframes for validation
- Requires hedging and exclusive_orders settings for accurate signal replication
- Delivers bitwise-identical trade counts and performance metrics between systems
SKILL.md
.github/skills/backtesting-py-oracleView on GitHub ↗
---
name: backtesting-py-oracle
description: backtesting.py configuration for SQL oracle validation and range bar pattern backtesting. Use when running backtesting.py against.
allowed-tools: Read, Bash, Grep, Glob
---
# backtesting.py Oracle Validation for Range Bar Patterns
Configuration and anti-patterns for using backtesting.py to validate ClickHouse SQL sweep results. Ensures bit-atomic replicability between SQL and Python trade evaluation.
**Companion skills**: `clickhouse-antipatterns` (SQL correctness, AP-16) | `sweep-methodology` (sweep design) | `rangebar-eval-metrics` (evaluation metrics)
**Validated**: Gen600 oracle verification (2026-02-12) — 3 assets, 5 gates, ALL PASS.
---
> **Self-Evolving Skill**: This skill improves through use. If instructions are wrong, parameters drifted, or a workaround was needed — fix this file immediately, don't defer. Only update for real, reproducible issues.
## Critical Configuration (NEVER omit)
```python
from backtesting import Backtest
bt = Backtest(
df,
Strategy,
cash=100_000,
commission=0,
hedging=True, # REQUIRED: Multiple concurrent positions
exclusive_orders=False, # REQUIRED: Don't auto-close on new signal
)
```
**Why**: SQL evaluates each signal independently (overlapping trades allowed). Without `hedging=True`, backtesting.py skips signals while a position is open, producing fewer trades than SQL. This was discovered when SOLUSDT produced 105 Python trades vs 121 SQL trades — 16 signals were silently skipped.
---
## Anti-Patterns (Ordered by Severity)
### BP-01: Missing Multi-Position Mode (CRITICAL)
**Symptom**: Python produces fewer trades than SQL. Gate 1 (signal count) fails.
**Root Cause**: Default `exclusive_orders=True` prevents opening new positions while one is active.
**Fix**: Always use `hedging=True, exclusive_orders=False`.
### BP-02: ExitTime Sort Order (CRITICAL)
**Symptom**: Entry prices appear mismatched (Gate 3 fails) even though both SQL and Python use the same price source.
**Root Cause**: `stats._trades` is sorted by ExitTime, not EntryTime. When overlapping trades exit in a different order than they entered, trade[i] no longer maps to signal[i].
**Fix**:
```python
trades = stats._trades.sort_values("EntryTime").reset_index(drop=True)
```
### BP-03: NaN Poisoning in Rolling Quantile (CRITICAL)
**Symptom**: Cross-asset tests fail with far fewer Python trades. Feature quantile becomes NaN and propagates forward indefinitely.
**Root Cause**: `np.percentile` with NaN inputs returns NaN. If even one NaN feature value enters the rolling window, all subsequent quantiles become NaN, making all subsequent filter comparisons fail.
**Fix**: Skip NaN values when building the signal window:
```python
def _rolling_quantile_on_signals(feature_arr, is_signal_arr, quantile_pct, window=1000):
result = np.full(len(feature_arr), np.nan)
signal_values = []
for i in range(len(feature_arr)):
if is_signal_arr[i]:
if len(signal_values) > 0:
window_data = signal_values[-window:]
result[i] = np.percentile(window_data, quantile_pct * 100)
# Only append non-NaN values (matches SQL quantileExactExclusive NULL handling)
if not np.isnan(feature_arr[i]):
signal_values.append(feature_arr[i])
return result
```
### BP-04: Data Range Mismatch (MODERATE)
**Symptom**: Different signal counts between SQL and Python for assets with early data (BNB, XRP).
**Root Cause**: `load_range_bars()` defaults to `start='2020-01-01'` but SQL has no lower bound.
**Fix**: Always pass `start='2017-01-01'` to cover all available data.
### BP-05: Margin Exhaustion with Overlapping Positions (MODERATE)
**Symptom**: Orders canceled with insufficient margin. Fewer trades than expected.
**Root Cause**: With `hedging=True` and default full-equity sizing, overlapping positions exhaust available margin.
**Fix**: Use fixed fractional sizing:
```python
self.buy(size=0.01) # 1% equity per trade
```
### BP-06: Signal Timestamp vs Entry Timestamp (LOW)
**Symptom**: Gate 2 (timestamp match) fails because SQL uses signal bar timestamps while Python uses entry bar timestamps.
**Root Cause**: SQL outputs the signal detection bar's `timestamp_ms`. Python's `EntryTime` is the fill bar (next bar after signal). These differ by 1 bar.
**Fix**: Record signal bar timestamps in the strategy's `next()` method:
```python
# Before calling self.buy()
self._signal_timestamps.append(int(self.data.index[-1].timestamp() * 1000))
```
---
## 5-Gate Oracle Validation Framework
| Gate | Metric | Threshold | What it catches |
| ---- | --------------- | --------- | ------------------------------------ |
| 1 | Signal Count | <5% diff | Missing signals, filter misalignment |
| 2 | Timestamp Match | >95% | Timing offset, warmup differences |
| 3 | Entry Price | >95% | Price source mismatch, sort ordering |
| 4 | Exit Type | >90% | Barrier logic differences |
| 5 | Kelly Fraction | <0.02 | Aggregate outcome alignment |
**Expected residual**: 1-2 exit type mismatches per asset at TIME barrier boundary (bar 50). SQL uses `fwd_closes[max_bars]`, backtesting.py closes at current bar price. Impact on Kelly < 0.006.
---
## Strategy Architecture: Single vs Multi-Position
| Mode | Constructor | Use Case | Position Sizing |
| --------------- | -------------------------------------- | --------------------- | ------------------------------ |
| Single-position | `hedging=False` (default) | Champion 1-bar hold | Full equity |
| Multi-position | `hedging=True, exclusive_orders=False` | SQL oracle validation | Fixed fractional (`size=0.01`) |
### Multi-Position Strategy Template
```python
class Gen600Strategy(Strategy):
def next(self):
current_bar = len(self.data) - 1
# 1. Register newly filled trades and set barriers
for trade in self.trades:
tid = id(trade)
if tid not in self._known_trades:
self._known_trades.add(tid)
self._trade_entry_bar[tid] = current_bar
actual_entry = trade.entry_price
if self.tp_mult > 0:
trade.tp = actual_entry * (1.0 + self.tp_mult * self.threshold_pct)
if self.sl_mult > 0:
trade.sl = actual_entry * (1.0 - self.sl_mult * self.threshold_pct)
# 2. Check time barrier for each open trade
for trade in list(self.trades):
tid = id(trade)
entry_bar = self._trade_entry_bar.get(tid, current_bar)
if self.max_bars > 0 and (current_bar - entry_bar) >= self.max_bars:
trade.close()
self._trade_entry_bar.pop(tid, None)
# 3. Check for new signal (no position guard — overlapping allowed)
if self._is_signal[current_bar]:
self.buy(size=0.01)
```
---
## Data Loading
```python
from data_loader import load_range_bars
df = load_range_bars(
symbol="SOLUSDT",
threshold=1000,
start="2017-01-01", # Cover all available data
end="2025-02-05", # Match SQL cutoff
extra_columns=["volume_per_trade", "lookback_price_range"], # Gen600 features
)
```
---
## HTML Equity Plot: Y-Axis Auto-Fit on Zoom (BP-07 to BP-10)
backtesting.py generates Bokeh HTML plots via `bt.plot()`. By default, the y-axis is fixed — zooming on the x-axis does NOT rescale the y-axis to fit visible data. This makes it impossible to inspect zoomed-in regions of equity curves that span 4+ orders of magnitude.
**Reference implementation**: `scripts/gen800/plotting.py` in opendeviationbar-patterns.
### BP-07: NEVER Use LogScale for Equity (CRITICAL)
**Symptom**: Equity panel y-axis doesn't auto-fit on zoom. JS callbacks fail silently.
**Root Cause**: Bokeh's `LogScale` breaks `CustomJS` y-range callbacks in multi-panel linked x_range layouts. The `js_on_change` callback fires but `y_range.start`/`y_range.end` assignments are ignored by the LogScale renderer.
**Proven via POC**: LogScale + JS callback works in single-panel mode but fails when 5 panels share a linked x_range.
**Fix**: Transform equity data to `log10()` in Python, display on **linear** scale with a `CustomJSTickFormatter` that shows `10^tick` as readable values:
```python
import numpy as np
from bokeh.models import CustomJSTickFormatter, Range1d
# Transform data
raw = np.asarray(src.data["equity"], dtype=float)
raw = np.where(raw > 0, raw, 1e-10)
src.data["equity"] = np.log10(raw).tolist()
# Custom tick formatter (shows 1%, 100%, 10K%, 1M%)
child.yaxis[0].formatter = CustomJSTickFormatter(code="""
const v = Math.pow(10, tick);
if (v >= 1e6) return (v/1e6).toFixed(1) + 'M%';
if (v >= 1e3) return (v/1e3).toFixed(0) + 'K%';
if (v >= 1) return v.toFixed(0) + '%';
return v.toFixed(2) + '%';
""")
# MUST set initial y_range explicitly (backtesting.py leaves it as NaN)
valid = eq[np.isfinite(eq)]
pad = (valid.max() - valid.min()) * 0.05
child.y_range = Range1d(start=valid.min() - pad, end=valid.max() + pad)
```
### BP-08: Panel-Aware Column Matching (CRITICAL)
**Symptom**: Drawdown/P&L panels go blank when zooming. Shows millions of percent.
**Root Cause**: Multiple panels share the same `ColumnDataSource` (backtesting.py optimization). The Equity panel's Line renderer and the Drawdown panel's Line renderer both reference a source containing `equity`, `drawdown`, `High`, `Low`, etc. A naive "find first y column" approach picks `equity` for the Drawdown panel.
**Fix**: Match by `panel_label` FIRST, then by column name:
```python
panel_label = child.yaxis[0].axis_label or ""
# Label-first matching (panels share data sources!)
if "Drawdown" in panel_label and "drawdown" in d:
hi_col = lo_col = "drawdown"
elif "Equity" in panel_label and "equity" in d:
hi_col = lo_col = "equity"
elif "High" in d and "Low" in d: # OHLC
hi_col, lo_col = "High", "Low"
```
### BP-09: JS Callback Pattern for Y-Axis Auto-Fit (PROVEN)
Attach a `CustomJS` callback to `x_range.js_on_change` for each panel. The callback scans visible data and sets `y_range` directly:
```python
from bokeh.models import CustomJS
cb = CustomJS(
args={"source": src, "yr": child.y_range,
"x_col": "index", "y_col": "equity"},
code="""
const xs = source.data[x_col];
const ys = source.data[y_col];
const x0 = cb_obj.start, x1 = cb_obj.end;
let lo = Infinity, hi = -Infinity;
for (let i = 0; i < xs.length; i++) {
if (xs[i] >= x0 && xs[i] <= x1 && isFinite(ys[i])) {
if (ys[i] < lo) lo = ys[i];
if (ys[i] > hi) hi = ys[i];
}
}
if (isFinite(lo) && isFinite(hi) && lo < hi) {
const pad = (hi - lo) * 0.05 || 0.001;
yr.start = lo - pad;
yr.end = hi + pad;
}
""",
)
child.x_range.js_on_change("start", cb)
child.x_range.js_on_change("end", cb)
```
### BP-10: DataRange1d(only_visible=True) Is Unreliable
**Symptom**: `DataRange1d(only_visible=True)` works for simple cases but fails with:
- LogScale panels (y-range computed in wrong space)
- VBar renderers (candlesticks — bounds not tracked correctly)
- Shared data sources across panels
**Fix**: Always use the JS callback pattern (BP-09) instead of `DataRange1d(only_visible=True)`.
### Panel Data Source Reference (backtesting.py internals)
| Panel | Label | Renderer | Y Column | Source |
| -------- | ----------------- | ------------------------------------ | ---------------- | --------------------- |
| Equity | `"Equity"` | Patch (`equity_dd`), Line (`equity`) | `equity` | Shared OHLC source |
| Drawdown | `"Drawdown"` | Line (`drawdown`), Scatter (peak) | `drawdown` | Shared OHLC source |
| P/L | `"Profit / Loss"` | Scatter (`returns`), MultiLine | `y` or `returns` | Separate trade source |
| OHLC | `""` (no label) | Segment, VBar | `High`/`Low` | Shared OHLC source |
| Volume | `"Volume"` | VBar | `Volume` (top) | Shared OHLC source |
---
## Project Artifacts (rangebar-patterns repo)
| Artifact | Path |
| --------------------------- | ------------------------------------------------- |
| Oracle comparison script | `scripts/gen600_oracle_compare.py` |
| Gen600 strategy (reference) | `backtest/backtesting_py/gen600_strategy.py` |
| SQL oracle query template | `sql/gen600_oracle_trades.sql` |
| Oracle validation findings | `findings/2026-02-12-gen600-oracle-validation.md` |
| Backtest CLAUDE.md | `backtest/CLAUDE.md` |
| ClickHouse AP-16 | `.claude/skills/clickhouse-antipatterns/SKILL.md` |
| Fork source | `~/fork-tools/backtesting.py/` |
## Post-Execution Reflection
After this skill completes, check before closing:
1. **Did the command succeed?** — If not, fix the instruction or error table that caused the failure.
2. **Did parameters or output change?** — If the underlying tool's interface drifted, update Usage examples and Parameters table to match.
3. **Was a workaround needed?** — If you had to improvise (different flags, extra steps), update this SKILL.md so the next invocation doesn't need the same workaround.
Only update if the issue is real and reproducible — not speculative.
More from terrylica/cc-skills
- academic-pdf-to-gfmConvert academic PDF papers to GitHub-renderable GFM markdown with math equations. TRIGGERS - PDF, GitHub markdown, math
- adaptive-wfo-epochAdaptive epoch selection for Walk-Forward Optimization. TRIGGERS - WFO epoch, epoch selection, WFE optimization, overfitting epochs.
- adr-code-traceabilityAdd ADR references to code for traceability. TRIGGERS - ADR traceability, code reference, document decision in code.
- adr-graph-easy-architectASCII architecture diagrams for ADRs via graph-easy. TRIGGERS - ADR diagram, architecture diagram, ASCII diagram.
- agent-reach>
- agentic-process-monitorMonitor background processes from Claude Code using sentinel files, heartbeat liveness, and subagent polling. Best practices and.
- alpha-forge-preshipAlpha Forge quality gates for PR review - RNG determinism, URL validation, parameter validation, manifest sync.
- article-extractorExtract MQL5 articles and documentation. TRIGGERS - MQL5 articles, MetaTrader docs, mql5.com resources.
- ascii-diagram-validatorValidate ASCII diagram alignment in markdown. TRIGGERS - diagram alignment, ASCII art, box-drawing diagrams.
- asciinema-analyzerSemantic analysis of asciinema recordings. TRIGGERS - analyze cast, keyword extraction, find patterns in recordings.