Knowledge Compendium
Settled principles, dead ends, and open questions — distilled from 120 rounds.
Model Architecture
SETTLED — what we know
Use GRU 2x128, dropout=0.2, weight_decay=1e-5 — settled sweet spot.
evidence: R012, R055, R086, R087
Any deviation either underfits or overfits while burning compute.
GRU > LSTM > Transformer for 15min BTC sequence classification.
evidence: R012, R089
LSTM/Transformer add params and training time without lifting compound.
Train 15 epochs, lr=0.001, Adam, batch=16, patience=10 — early-stopping kicks in around epoch 2-4.
evidence: R055, R069, R081
Longer training memorizes; smaller batches/lr fail to converge.
Lookback=300 event-bars on 15min base is optimal across all 4 folds.
evidence: R047, R060, R061
Shorter lookbacks miss regime context; longer ones inject noise and run out of memory (OOM).
Horizon=120 bars (30h at 15min) maximizes profitable signal density.
evidence: R037, R055
Shorter horizons are noise; 240+ horizons lower trade count below compounding threshold.
DEAD ENDS — what didn't work
GRU 2x256 or 2x64 (R086, R087)
2x256 scores higher on a single seed-set but worse across seed-sets; 2x64 underperforms.
lesson: Capacity beyond 128 fits seed-specific noise; less capacity loses signal.
re-explore if: Input modality changes (on-chain, order-book).
Higher dropout/weight_decay (R086, R087)
Reduces compound without improving cross-seed consistency.
lesson: Bottleneck is labeling, not regularization.
re-explore if: We swap to noisier label regime.
Attention/TFT/CNN-GRU (Proposed only)
Not explored beyond proposal.
lesson: Architecture unlikely to be bottleneck; data/labeling are.
re-explore if: Multi-asset / cross-market inputs where attention helps.
OPEN — questions still unanswered
Would small transformer help with on-chain or order-book features?
Does CNN front-end catch micro-structure that engineered features miss?
Is GRU still optimal at 5min or 1H timeframes?
Labeling
SETTLED — what we know
Use 3-class tb3_vol (SL=-1, timeout=0, TP=+1) with CrossEntropy weights [2.0, 0.5, 2.0].
evidence: R055, R059, R073-R080
Every other labeling scheme tested degrades compound; this is the spine of the system.
Make TP/SL barriers volatility-adaptive (k*sigma, k_up=20, k_dn=10) — never fixed.
evidence: R004-R026, R071
Fixed barriers ignore regime shifts; the same 0.75% TP is trivial in 2021 and unreachable in the sideways market of 2023.
Clip sigma to [0.0005, 0.005] before computing barriers.
evidence: R065b
Removing clip costs ~2.5x compound by producing absurd barriers in ultra-low/high vol bars.
Keep timeouts as a class — they are 24% of labels and carry information.
evidence: R073-R080
Dropping timeout class collapses CE gradient quality; model needs a 'do nothing' target.
Use alpha (timeout horizon) = 24 bars and barriers k_up=20, k_dn=10 as validated default.
evidence: R055, R071
Tighter alpha starves TP examples; looser alpha pollutes labels with regime drift.
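The barrier rules above can be sketched in a few lines. This is a minimal illustration, not the project's labeler: it assumes sigma is a fractional per-bar return volatility, that barriers are multiplicative around the entry close, and that only closes are checked. The function names and the `timeout_bars` parameter are hypothetical.

```python
from typing import Sequence


def clip_sigma(sigma: float, lo: float = 0.0005, hi: float = 0.005) -> float:
    """Clamp volatility to the [0.0005, 0.005] range given in the notes."""
    return max(lo, min(hi, sigma))


def label_tb3_vol(
    closes: Sequence[float],
    sigma: float,
    start: int,
    k_up: float = 20.0,
    k_dn: float = 10.0,
    timeout_bars: int = 24,
) -> int:
    """Return +1 if TP is touched first, -1 if SL is touched first, 0 on timeout."""
    s = clip_sigma(sigma)
    entry = closes[start]
    tp_price = entry * (1.0 + k_up * s)  # upper barrier, assumed multiplicative
    sl_price = entry * (1.0 - k_dn * s)  # lower barrier, assumed multiplicative
    end = min(start + timeout_bars, len(closes) - 1)
    for price in closes[start + 1 : end + 1]:
        if price >= tp_price:
            return 1
        if price <= sl_price:
            return -1
    return 0  # neither barrier reached inside the timeout window
```

With sigma=0.001 the barriers sit at +2% and -1% of entry. A raw sigma of 0.05 is clipped to 0.005, which caps the TP distance at 10% instead of 100%.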
DEAD ENDS — what didn't work
Binary, 5-class, regression, DSR, filtered, next-event, RL-meta, SDIL, CBA, MRC, trend-scan, PQW, speed-weighted, efficiency (14 methods) (R000, R073-R080)
All 14 alternatives lose to 3-class tb3_vol.
lesson: CE on 3 well-balanced classes concentrates gradients better than any continuous/auxiliary target.
re-explore if: Switch to mean-reversion or longer-horizon paradigm where TP/SL framing no longer fits.
MFE/MAE dual-head regression with mse_ratio (R036-R038)
Best +97.9% sum — superseded by event-driven 3-class.
lesson: Regression labels are noisier than discretized barriers for GRU.
re-explore if: We add attention/TFT model that exploits continuous targets.
Confidence-weighted continuous labels, multi-horizon heads (Proposed)
Not validated.
lesson: Adding label dimensions without changing base discretization adds noise channels.
re-explore if: Architecture natively benefits from multi-task heads (TFT).
OPEN — questions still unanswered
Mean-reversion labeling scheme (fade-the-move) complementing tb3_vol momentum?
Regime-conditional labels (different barriers in bull vs bear)?
Features
SETTLED — what we know
Use exactly 36 clean features (47 base minus 11 harmful).
evidence: R056, R057
The 11 harmful features cost compound; keeping them adds variance with zero alpha.
The 11 harmful features to remove: volatility_60, volume_ratio, volatility_12, hour_sin, volatility_24, obv_sma_20, obv_diff, macd, atr_14_pct, ema_9_21_cross, ema_21_50_cross.
evidence: R056
Permutation importance + ablation identified these; reintroducing them silently degrades compound.
Normalize per-sequence MinMax (each 300-bar window normalized independently).
evidence: R054, R054c, R054d
Global normalization leaks; partial normalization underperforms by ~3x.
Derive feature column order dynamically from the DataFrame, never hardcode.
evidence: R085 production audit
Hardcoded order silently misaligned ema_9/21/50 in production.
Single 15min timeframe — no 1h/4h aux channels.
evidence: R062
lb=300 already encodes ~75h of context; multi-TF features add noise.
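A minimal sketch of per-sequence MinMax scaling with a column order taken from the data. It assumes each bar is a dict of feature name to value, standing in for a DataFrame row. The function names and the constant-column convention are illustrative assumptions.

```python
from typing import Dict, List


def feature_order(rows: List[Dict[str, float]]) -> List[str]:
    """Read the column order from the data itself instead of a hardcoded list."""
    return list(rows[0].keys())


def minmax_window(rows: List[Dict[str, float]]) -> List[List[float]]:
    """Scale each feature to [0, 1] using only this window's own min and max."""
    cols = feature_order(rows)
    lo = {c: min(r[c] for r in rows) for c in cols}
    hi = {c: max(r[c] for r in rows) for c in cols}
    scaled_rows = []
    for r in rows:
        scaled = []
        for c in cols:
            span = hi[c] - lo[c]
            # A constant column has no range; mapping it to 0.0 is an assumed convention.
            scaled.append((r[c] - lo[c]) / span if span > 0 else 0.0)
        scaled_rows.append(scaled)
    return scaled_rows
```

Because the statistics come from the window alone, nothing outside the lookback span can influence the scaled values.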
DEAD ENDS — what didn't work
Add sigma as feature (R085, R085b)
-24% to -45% compound.
lesson: GRU already captures volatility from price; explicit sigma overfits.
re-explore if: Architecture doesn't implicitly model volatility.
tp_dist_pct / sl_dist_pct features (R085)
-24% compound.
lesson: Engineered distance-to-barrier leaks the label.
re-explore if: Never with current labeling.
is_event binary feature (R085)
No improvement.
lesson: Event filtering already governs entry; flag is information-free.
re-explore if: Mix event and non-event bars.
Consecutive + is_event combined (R085)
Catastrophic -88%.
lesson: Mixing sampling schemes poisons normalization stats.
re-explore if: Never.
Multi-timeframe (1h, 4h) features (R062)
No improvement.
lesson: Long lookback already provides macro context.
re-explore if: Lookback reduced below 100 event-bars.
OPEN — questions still unanswered
On-chain features (active addresses, exchange flows, funding rates)?
Order-book imbalance from L2 data?
Learned feature selector?
Event Detection
SETTLED — what we know
Filter to event bars using vol_tight: ATR > 2x rolling mean AND |returns| > p90 — yields ~10.5% of bars.
evidence: R046, R060, R085
Without event filtering the model trains on 90% noise bars.
Event-driven sequences beat consecutive sequences decisively.
evidence: R046, R085
Consecutive sampling mixes regimes; event sampling keeps each sequence in one regime.
Lookback measured in event-bars (300), not calendar bars.
evidence: R047, R060, R061
Calendar lookback dilutes context with noise; event-bar lookback compresses ~75h of relevance.
vol_tight beats all alternatives in 4-fold WF.
evidence: R060
Relaxed filters get more trades but worse avg-edge; tighter ones cost too much trade count.
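The vol_tight rule can be sketched as a boolean mask. The notes give only the two conditions, so several details here are assumptions: the rolling window length (96 bars), the rolling mean excluding the current bar, a nearest-rank percentile, and the p90 threshold being computed over whatever sample is passed in (ideally train data only).

```python
from typing import List, Sequence


def percentile(values: Sequence[float], q: int) -> float:
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    rank = max(1, (q * len(ordered) + 99) // 100)  # integer ceil(q * n / 100)
    return ordered[rank - 1]


def vol_tight_mask(
    atr: Sequence[float],
    returns: Sequence[float],
    window: int = 96,       # assumed; the notes do not state the window
    atr_mult: float = 2.0,
    ret_q: int = 90,
) -> List[bool]:
    """True where ATR > atr_mult * rolling mean AND |return| > the q-th percentile."""
    threshold = percentile([abs(r) for r in returns], ret_q)
    mask = []
    for i in range(len(atr)):
        if i < window:
            mask.append(False)  # not enough history for a rolling mean
            continue
        rolling_mean = sum(atr[i - window : i]) / window
        is_event = atr[i] > atr_mult * rolling_mean and abs(returns[i]) > threshold
        mask.append(is_event)
    return mask
```

Both conditions must hold on the same bar, which is what keeps the selected share of bars small.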
DEAD ENDS — what didn't work
Non-volatility event filters (volume, MACD-cross) (R060)
All lose to vol_tight.
lesson: Volatility is the right gating signal because labeling (vol-adaptive) and filter must align.
re-explore if: Change labels to something other than vol-adaptive.
Looser filter (vol_medium ~15%) for more trades (R060)
More trades, lower avg-edge, net lower.
lesson: Trade frequency and per-trade edge are anti-correlated.
re-explore if: Per-trade cost reduction (maker fees, DEX).
create_sequences with exact lookback (R085 bug)
Returned 0 sequences silently.
lesson: Sequence builder needs lookback+1 events; document explicitly.
re-explore if: Sequence builder rewrite.
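One way to keep the lookback+1 requirement from failing silently is to raise instead of returning an empty list. This sketch assumes each window of `lookback` events is paired with the label of the event that follows it, which is why one extra event is needed. The actual builder's pairing may differ.

```python
from typing import List, Sequence, Tuple


def create_sequences(
    events: Sequence[float],
    labels: Sequence[int],
    lookback: int = 300,
) -> List[Tuple[List[float], int]]:
    """Window i covers events[i : i + lookback]; its target is labels[i + lookback]."""
    if len(events) != len(labels):
        raise ValueError("events and labels must have the same length")
    if len(events) < lookback + 1:
        # Fail loudly: exactly `lookback` events would otherwise yield zero sequences.
        raise ValueError(
            f"need at least {lookback + 1} events, got {len(events)}"
        )
    return [
        (list(events[i : i + lookback]), labels[i + lookback])
        for i in range(len(events) - lookback)
    ]
```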
OPEN — questions still unanswered
Learned event detector beating vol_tight?
Regime-conditional event thresholds?
Ensemble and Voting
SETTLED — what we know
Train 5 seeds per fold, 4 folds — 20 checkpoints.
evidence: R069, R070, R072
Single-seed compound varies +86% to +262% across seeds at same config.
Validate across 3 independent seed-sets (42-46, 100-104, 200-204).
evidence: R072, R084
Single-set 4/4 can be luck; 3-set passing is the real robustness signal.
Binary voting: exit when ≥3/5 (originally 4/5) models signal danger.
evidence: R070, R071
Confidence-weighted averaging is noisier than discrete voting.
Danger trigger: P(TP)<0.22 OR P(SL)>0.70. Re-enter: P(TP)>0.32.
evidence: R071, R084
Asymmetric thresholds prevent whipsaw and preserve compounding.
Same 20 GRU checkpoints power all 5 production servers (V2-V6).
evidence: R111 deployment
Strategy layer is the differentiator; retraining rarely warranted.
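The voting rule and its asymmetric thresholds can be sketched as below. Two things are assumptions: the class order of each probability triple, and the re-entry rule using the same 3-of-5 vote count (the notes give only the P(TP)>0.32 threshold, not how it is aggregated).

```python
from typing import Sequence, Tuple

# Assumed class order: (P(SL), P(timeout), P(TP)), matching labels -1, 0, +1.
Probs = Tuple[float, float, float]


def signals_danger(p: Probs, tp_floor: float = 0.22, sl_ceiling: float = 0.70) -> bool:
    """One model votes danger when P(TP) < 0.22 or P(SL) > 0.70."""
    p_sl, _, p_tp = p
    return p_tp < tp_floor or p_sl > sl_ceiling


def should_exit(ensemble: Sequence[Probs], min_votes: int = 3) -> bool:
    """Exit when at least min_votes models vote danger."""
    return sum(signals_danger(p) for p in ensemble) >= min_votes


def should_reenter(
    ensemble: Sequence[Probs], tp_reentry: float = 0.32, min_votes: int = 3
) -> bool:
    """Assumed rule: re-enter when at least min_votes models show P(TP) > 0.32."""
    return sum(p[2] > tp_reentry for p in ensemble) >= min_votes
```

A model with P(TP) between 0.22 and 0.32 votes for neither action, which is the gap that damps whipsaw.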
DEAD ENDS — what didn't work
Single-seed selective trading (R084)
Below B&H (+159% vs +1,005%).
lesson: Single seeds are pure noise.
re-explore if: Never — 5 seeds minimum is law.
9-seed ensembles (R092)
No improvement over 5.
lesson: Variance reduction is sublinear; 5 is the knee.
re-explore if: Noisier base learner.
Confidence-weighted averaging (R070, R114)
Slightly worse than binary voting.
lesson: Calibration errors propagate through soft averages.
re-explore if: Add explicit per-model calibration (temperature, isotonic).
Stacking GRU + LightGBM (Proposed)
Not validated.
lesson: Unknown.
re-explore if: Hit clear ceiling with pure GRU ensembling.
OPEN — questions still unanswered
Heterogeneous ensembling (GRU + GBM + SMA)?
Per-fold model selection (best-K-of-5)?
Exit Logic and Cooldown
SETTLED — what we know
V66 uniform cooldown (4 bars post-entry, 48 bars post-exit) is the min-alpha champion at +245%.
evidence: R110, R111
Min alpha across folds is the conservative robustness metric.
V115_cmp (ATR-conditional + peak_drop + ratchet + regime cooldown) is the compound king at +168,759%.
evidence: R110, R111
Max compound is the aggressive growth metric — pick V66 vs V115 by risk tolerance.
Daily SMA filter is the single most important component — +7,641% → +720M%.
evidence: R094
Without daily trend filter, GRU still works but caps far below ceiling.
Fixed exits beat dynamic and trailing as the base layer.
evidence: R061
Trailing/dynamic add path-dependence that breaks labeling alignment.
Always-invested with ensemble-voted danger exit beats selective trading 4-8x.
evidence: R068, R071, R084
B&H captures the bull move; selective trading sits in cash through the biggest 1% of days.
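A sketch of a uniform cooldown gate with the V66 numbers. It reads "4 bars post-entry" as a minimum hold before an exit vote is honored and "48 bars post-exit" as a minimum flat period before re-entry. That reading, the class name, and the boundary handling are assumptions, not the V66 implementation.

```python
from dataclasses import dataclass


@dataclass
class CooldownGate:
    post_entry_bars: int = 4    # exit votes ignored for this many bars after entry
    post_exit_bars: int = 48    # re-entry votes ignored for this many bars after exit
    in_position: bool = True    # always-invested strategies start long
    bars_since_change: int = 0

    def step(self, exit_vote: bool, reenter_vote: bool) -> bool:
        """Advance one bar and return whether a position is held afterwards."""
        self.bars_since_change += 1
        if self.in_position:
            if exit_vote and self.bars_since_change > self.post_entry_bars:
                self.in_position = False
                self.bars_since_change = 0
        elif reenter_vote and self.bars_since_change > self.post_exit_bars:
            self.in_position = True
            self.bars_since_change = 0
        return self.in_position
```

The votes would come from the ensemble layer; the gate only decides whether a vote is allowed to act on the current bar.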
DEAD ENDS — what didn't work
Trailing/dynamic exits as primary (R061)
Lose to fixed (+3,440%).
lesson: Simpler aligns with labeling (trained on fixed barriers).
re-explore if: Retrain with trailing-aware labels.
SL-signal exit (R063b)
Inferior to always-invested.
lesson: P(SL) not precise enough alone.
re-explore if: P(SL) calibration improved.
DD circuit breaker (R097)
Destroys compound.
lesson: DD is lagging; by trigger time, worst is over.
re-explore if: Leading regime indicator pre-empts.
Adaptive ATR exits uniformly (R095)
+53% V4, hurts V3/V6.
lesson: Adaptive exits are regime-dependent.
re-explore if: Per-strategy tuning.
RSI/MACD/volume/MTF filters on top of SMA (R096, R098, R119)
All hurt.
lesson: Once daily SMA is on, additional filters subtract from right tail.
re-explore if: Filter orthogonal to trend (funding rate).
OPEN — questions still unanswered
Exit policy between V66 and V115 capturing V115's compound with V66's robustness?
Learned exit policy (RL on frozen GRU)?
Position Sizing
SETTLED — what we know
Use 100% per trade — full notional always-invested maximizes compound.
evidence: R066, R066b, R084, R116
Every fractional sizing gave up more compound than it saved in DD.
Accept higher MaxDD as the cost of 100% sizing — compound math wins.
evidence: R066, R084
Reducing DD via sizing costs more compound than the DD itself.
Pyramiding marginal, not worth complexity.
evidence: R066b
Multi-entry adds bookkeeping for alpha that fails cross-set validation.
DEAD ENDS — what didn't work
Kelly Criterion (fixed 10-25%, half-Kelly) (R066, R116)
Reduces compound.
lesson: Kelly assumes accurate edge estimation; GRU probabilities not calibrated enough.
re-explore if: Explicit probability calibration.
Confidence-proportional sizing (R116)
Reduces compound.
lesson: Same as Kelly — calibration is blocker.
re-explore if: Post-calibration with isotonic regression.
Vol-scaled and DD-scaled sizing (R116)
Reduces compound.
lesson: Vol-scaling reduces position when vol is high — but that's exactly when event-bars carry largest alpha.
re-explore if: Sizing conditional on regime classification.
OPEN — questions still unanswered
Does sizing become relevant with shorts (combined long/short notional)?
Leverage >1x on high-confidence bars?
Validation
SETTLED — what we know
Use 4-fold walk-forward expanding-window over 2018-2026.
evidence: R025, R037, R064
Random splits leak; sliding window discards data; expanding mirrors live.
Promote a config only after it passes 3 seed-sets x 4 folds = 12/12.
evidence: R072, R084
Single-set 4/4 is coin-flip; 12/12 filters out seed-luck.
Compound = product(1 + fold_pnl/100) - 1, not arithmetic sum.
evidence: R058
Arithmetic sum hides geometric mean; actual deployed equity follows product.
Report min-alpha-across-folds AND compound — not either alone.
evidence: R110, R111
Compound can be dominated by one lucky fold; min-alpha captures the worst case.
Sigma clip and feature normalization must be fit on train only.
evidence: CLAUDE.md, R054
Per-sequence normalization is leak-safe; global stats must be train-only.
Commission 0.04% + slippage 0.01% in every backtest — no exceptions.
evidence: CLAUDE.md rule
Cost-free backtests double apparent edge.
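The compound formula and the worst-fold check are small enough to write out. This sketch takes per-fold PnL in percent and returns compound as a fraction, matching the formula in the list above. Whether the per-fold values are raw PnL or alpha against a benchmark is left to the caller, and the function names are illustrative.

```python
import math
from typing import Sequence


def compound(fold_pnls_pct: Sequence[float]) -> float:
    """product(1 + fold_pnl/100) - 1, returned as a fraction."""
    return math.prod(1.0 + p / 100.0 for p in fold_pnls_pct) - 1.0


def worst_fold(fold_values: Sequence[float]) -> float:
    """Minimum across folds, to be reported next to compound."""
    return min(fold_values)


# Illustrative numbers only, not results from any round.
crashy = [200.0, -80.0, 50.0, 50.0]   # arithmetic sum 220
steady = [40.0, 40.0, 40.0, 40.0]     # arithmetic sum 160
```

Here the arithmetic sum prefers `crashy` (220 vs 160), while compound prefers `steady` (about 2.84 vs 0.35) and the worst fold is +40 vs -80.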
DEAD ENDS — what didn't work
Sliding window vs expanding (R064)
Expanding slightly better.
lesson: Older bull-cycle data still helps generalization.
re-explore if: Clear non-stationarity in pre-2020 data.
Single-seed promotion gates (R069, before policy change)
Set B failed fold-3 by 6%.
lesson: Always multi-seed; a 6% miss is invisible in a single-seed run.
re-explore if: Never.
val_loss as primary selection (R000, R001)
Lowest val_loss lost money in sim.
lesson: val_loss is decorrelated from PnL.
re-explore if: Never.
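The expanding-window scheme preferred over the sliding window can be sketched as index ranges. The notes do not give the fold boundaries for 2018-2026, so this assumes equal-size blocks by sample index. The function name is hypothetical.

```python
from typing import List, Tuple


def expanding_folds(n_samples: int, n_folds: int = 4) -> List[Tuple[range, range]]:
    """Fold k trains on everything before its test block; the train start never moves."""
    block = n_samples // (n_folds + 1)
    if block == 0:
        raise ValueError("not enough samples for the requested number of folds")
    folds = []
    for k in range(n_folds):
        train_end = block * (k + 1)
        # The last fold absorbs any remainder so no sample is dropped.
        test_end = block * (k + 2) if k < n_folds - 1 else n_samples
        folds.append((range(0, train_end), range(train_end, test_end)))
    return folds
```

A sliding variant would move the train start forward with each fold, which discards the oldest data that the notes found still helps generalization.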
OPEN — questions still unanswered
5th fold spanning 2026-only as true OOS?
Non-overlapping seed-set strategy (5 sets of 5)?
Production Engineering
SETTLED — what we know
Five servers in production share one ensemble (20 checkpoints) and differ only in YAML strategy config.
evidence: R111
Decoupling model from strategy lets you A/B test policies without retraining.
Derive feature column order from live DataFrame at inference.
evidence: R085 audit
Hardcoded order silently misaligned ema columns.
After force-closing last trade: set position=0, position_cost=0, equity[-1]=capital.
evidence: R082 bug, R083 fix
Double-counting inflated reported compound ~2x for R068-R082.
Smoke-test trader class on real bars before deploying.
evidence: R111
Catches CSV misalignment, sequence edge cases, YAML drifts.
Pre-trained checkpoints from R069/R072/R081 reused — don't retrain unless arch/features/labels change.
evidence: R082, R111
Retraining burns Colab budget and reintroduces variance.
Paper-trade any new strategy on real-time data before live capital.
evidence: CLAUDE.md
Execution slippage and CSV-write races surface only in live data.
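The force-close bookkeeping rule can be sketched as follows. The `Account` layout is hypothetical, and the default cost rate assumes commission 0.04% plus slippage 0.01% is charged on the closing side. The canonical backtester may model both differently.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Account:
    capital: float                 # cash not tied up in a position
    position: float = 0.0          # units held
    position_cost: float = 0.0     # cash paid for the open position
    equity: List[float] = field(default_factory=list)


def force_close(acct: Account, last_price: float, cost_rate: float = 0.0005) -> None:
    """Liquidate at the final bar and make the last equity point equal cash."""
    if acct.position != 0.0:
        acct.capital += acct.position * last_price * (1.0 - cost_rate)
    # Reset both fields so the position is not counted again in equity.
    acct.position = 0.0
    acct.position_cost = 0.0
    if acct.equity:
        acct.equity[-1] = acct.capital
    else:
        acct.equity.append(acct.capital)
```

After the call, equity is derived from cash alone, so the closed position cannot be added a second time.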
DEAD ENDS — what didn't work
Always-invested backtester in notebook (pre-fix) (R068-R082)
Reported +89,000% was actually +4,000-7,600%.
lesson: Notebook backtesters drift from canonical class.
re-explore if: Never — only canonical is trustworthy.
CSV column logging with implicit positional alignment (Production audit)
Column misalignment in event logs.
lesson: Always log with explicit dict-to-row or DataFrame.to_csv.
re-explore if: Never.
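Explicit dict-to-row logging is available in the standard library through `csv.DictWriter`, which places each value under its named column regardless of key order. The field names below are hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, List, TextIO

FIELDS: List[str] = ["timestamp", "ema_9", "ema_21", "ema_50"]  # hypothetical columns


def write_events(rows: Iterable[Dict[str, object]], fieldnames: List[str], stream: TextIO) -> None:
    """Write rows by key; an unexpected key raises instead of shifting columns."""
    writer = csv.DictWriter(stream, fieldnames=fieldnames, extrasaction="raise")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Keys arrive in a different order than the header, yet values land in the right columns.
buffer = io.StringIO()
write_events([{"ema_21": 2, "timestamp": "t0", "ema_50": 3, "ema_9": 1}], FIELDS, buffer)
```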
OPEN — questions still unanswered
Live-vs-backtest slippage gap after 60 days paper trading?
Should V2-V6 be merged into single multi-config trader?
Paradigm Boundaries
SETTLED — what we know
Momentum / barrier-trigger paradigm is exhausted — 87 rounds converge to same family.
evidence: R000-R087
Further tuning yields ~5-10% deltas; new paradigms could yield 2-10x.
Shorts do not work with current 3-class GRU.
evidence: R063, R063b
P(SL) not precise enough; need dedicated short model.
5min and 1min base timeframes do not beat 15min.
evidence: R035, R065
More frequency doesn't pay for noise increase.
Daily SMA trend filter is highest-leverage discovery post-R083.
evidence: R094
Only single component that moved compound by 5+ orders of magnitude.
Portfolio regime-aware V66+MACD_div delivers 5× V66 compound at matching min α.
evidence: R120
First strategy treating V2-V6 as portfolio members rather than alternatives.
DEAD ENDS — what didn't work
Mean-reversion standalone (R117)
MACD_div +179K% compound BUT min α -42.
lesson: Mean-rev needs portfolio context, not standalone.
re-explore if: Combined with regime filter or trend confirmation.
Alternative model classes — XGBoost (R118)
GRU crushes trees 287× (GRU 70,576% vs XGBoost 245% compound).
lesson: Tree models can't capture sequence dependencies that GRU encodes.
re-explore if: Tabular features only, no sequence.
Multi-timeframe filter (4h on V66) (R119)
All 5 filter variants degrade V66.
lesson: V66 already encodes context; external filter is redundant.
re-explore if: Filter applied to a strategy WITHOUT internal regime detection.
On-chain features (Glassnode/mempool) (Not explored)
Untested.
lesson: All features price/volume-derived.
re-explore if: Always — most likely orthogonal alpha source.
Multi-asset (ETH, SOL, top-10) (Not explored)
Untested.
lesson: Cross-asset signals (BTC-ETH ratio) untouched.
re-explore if: Always — major unexplored direction.
Reinforcement learning end-to-end (R075 partial)
Meta-RL on top of GRU didn't help.
lesson: RL on frozen GRU has too little signal.
re-explore if: Stable simulation environment with realistic slippage.
OPEN — questions still unanswered
Does mean-reversion + regime detection deliver portfolio-level alpha?
Do on-chain or funding-rate features add anything?
Multi-asset coordination (BTC+ETH+SOL)?
Does end-to-end RL beat hand-tuned cascades?
META PRINCIPLES — How We Learn
SETTLED — what we know
Single-seed results are noise — minimum 5 seeds for any compound claim.
evidence: R069, R072
Fold-0 compound varies +86% to +262% across seeds at same config.
Pre-bug-fix numbers (R068-R082, R022) are not comparable to post-fix.
evidence: R027 gaps bug, R083 equity bug
Mixing pre/post-fix promotes broken configs.
val_loss is not PnL — always simulate compound on test fold.
evidence: R000, R001
Optimizer optimizes loss; user optimizes equity. They diverge.
Commission 0.04% + slippage 0.01% per CLAUDE.md — hard floor in every backtest.
evidence: CLAUDE.md
Cost-free backtests have falsely promoted multiple configs that died in paper.
More trades at lower per-trade edge can beat fewer high-edge trades — compounding wins.
evidence: R068, R084, R120
500 trades at +1% → +14,377%; 100 at +2% → +624%. Frequency matters.
Promote on min-alpha AND compound — not either alone.
evidence: R110, V66 vs V115
Compound rewards one lucky fold; min-alpha rewards worst-case.
Compound = product of (1 + fold_returns), not sum.
evidence: R058, R084
Arithmetic sums rank configs that crash in fold 2 above consistent ones.
Reuse pre-trained ensembles across strategy experiments — retrain only when arch/features/labels change.
evidence: R082, R111
Keeps strategy comparisons clean; saves Colab budget for paradigm shifts.
Smoke-test on real bars before any deploy — never rely on backtest only.
evidence: R111, R085
Three most expensive bugs (equity double-count, feature order, CSV) survived backtests.
When you've explored 87 rounds inside one paradigm, EXIT IT — incremental gains become noise.
evidence: R084-R098 plateau, R112-R116 V66 exhaustion
Mean-rev, on-chain, multi-asset, RL paradigms are unexplored; further momentum tuning is diminishing returns.
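The frequency-versus-edge principle above is plain compounding arithmetic and can be checked directly. This sketch assumes every trade earns the same net edge and that full capital is re-invested each time. The function name is illustrative.

```python
def compounded_return_pct(n_trades: int, edge_pct: float) -> float:
    """Total return in percent after n_trades, each gaining edge_pct on full capital."""
    return ((1.0 + edge_pct / 100.0) ** n_trades - 1.0) * 100.0


many_small = compounded_return_pct(500, 1.0)  # 500 trades at +1% each
few_large = compounded_return_pct(100, 2.0)   # 100 trades at +2% each
```

Under these assumptions the first case comes out near +14,377% and the second near +624%, because 1.01^500 ≈ 144.8 while 1.02^100 ≈ 7.24.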