Knowledge · BTC Trading AI

Model Architecture

SETTLED — what we know

Use GRU 2x128, dropout=0.2, weight_decay=1e-5 — settled sweet spot.

evidence: R012, R055, R086, R087

Any deviation either underfits or overfits while burning compute.

GRU > LSTM > Transformer for 15min BTC sequence classification.

evidence: R012, R089

LSTM/Transformer add params and training time without lifting compound.

Train up to 15 epochs, lr=0.001, Adam, batch=16, patience=10; the validation metric typically stops improving around epoch 2-4.

evidence: R055, R069, R081

Longer training memorizes; smaller batches/lr fail to converge.

Lookback=300 event-bars on 15min base is optimal across all 4 folds.

evidence: R047, R060, R061

Shorter lookbacks miss regime context; longer ones inject noise and OOM.

Horizon=120 bars (30h on the 15min base) maximizes profitable signal density.

evidence: R037, R055

Shorter horizons are noise; 240+ horizons lower trade count below compounding threshold.
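The settled values above can be pinned in one place so an experiment cannot drift from them unnoticed. A minimal sketch, assuming a plain Python dataclass; the class name, field names and the `diff_from_settled` helper are illustrative and not the project's actual config schema.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SettledModelConfig:
    """Settled values from this section; all field names are hypothetical."""
    cell: str = "GRU"
    num_layers: int = 2
    hidden_size: int = 128
    dropout: float = 0.2
    weight_decay: float = 1e-5
    max_epochs: int = 15
    lr: float = 1e-3
    optimizer: str = "Adam"
    batch_size: int = 16
    patience: int = 10
    lookback_event_bars: int = 300
    horizon_bars: int = 120
    base_timeframe_min: int = 15


def diff_from_settled(overrides: dict) -> dict:
    """Return {field: (settled, proposed)} for every override that deviates."""
    settled = asdict(SettledModelConfig())
    unknown = set(overrides) - set(settled)
    if unknown:
        raise KeyError(f"unknown config fields: {sorted(unknown)}")
    return {k: (settled[k], v) for k, v in overrides.items() if settled[k] != v}
```

Calling `diff_from_settled({"hidden_size": 256})` makes a deviation explicit before a run is launched, which is the point of treating these values as settled.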

DEAD ENDS — what didn't work

GRU 2x256 or 2x64 (R086, R087)

2x256 scores higher on a single seed-set but worse across seed-sets; 2x64 underperforms.

lesson: Capacity beyond 128 fits seed-specific noise; less capacity loses signal.

re-explore if: Input modality changes (on-chain, order-book).

Higher dropout/weight_decay (R086, R087)

Reduces compound without improving cross-seed consistency.

lesson: Bottleneck is labeling, not regularization.

re-explore if: We swap to noisier label regime.

Attention/TFT/CNN-GRU (Proposed only)

Not explored beyond proposal.

lesson: Architecture unlikely to be bottleneck; data/labeling are.

re-explore if: Multi-asset / cross-market inputs where attention helps.

OPEN — questions still unanswered

Would small transformer help with on-chain or order-book features?

Does CNN front-end catch micro-structure that engineered features miss?

Is GRU still optimal at 5min or 1H timeframes?

Labeling

SETTLED — what we know

Use 3-class tb3_vol (SL=-1, timeout=0, TP=+1) with CrossEntropy weights [2.0, 0.5, 2.0].

evidence: R055, R059, R073-R080

Every other labeling scheme tested degrades compound; this is the spine of the system.
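To make the effect of the class weights concrete, here is a per-sample weighted cross-entropy in plain Python. It is a sketch only: the index order (SL, timeout, TP) is an assumption taken from the order in which the classes are listed above, and framework losses may additionally normalise a batch mean by the summed weights.

```python
import math

# Assumed index order: 0 = SL (-1), 1 = timeout (0), 2 = TP (+1).
CLASS_WEIGHTS = [2.0, 0.5, 2.0]
LABEL_TO_INDEX = {-1: 0, 0: 1, 1: 2}


def weighted_ce(logits, label, weights=CLASS_WEIGHTS):
    """Per-sample weighted cross-entropy: -w[y] * log softmax(logits)[y]."""
    y = LABEL_TO_INDEX[label]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return -weights[y] * (logits[y] - log_z)
```

With equal logits, a missed TP or SL sample contributes four times the loss of a missed timeout sample (2.0 vs 0.5), so the gradient is concentrated on the barrier-hit classes.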

Make TP/SL barriers volatility-adaptive (k*sigma, k_up=20, k_dn=10) — never fixed.

evidence: R004-R026, R071

Fixed barriers ignore regime shifts; the same 0.75% TP is trivial in 2021 and unreachable in the sideways market of 2023.

Clip sigma to [0.0005, 0.005] before computing barriers.

evidence: R065b

Removing clip costs ~2.5x compound by producing absurd barriers in ultra-low/high vol bars.

Keep timeouts as a class — they are 24% of labels and carry information.

evidence: R073-R080

Dropping timeout class collapses CE gradient quality; model needs a 'do nothing' target.

Use alpha (timeout horizon) = 24 bars and barriers k_up=20, k_dn=10 as validated default.

evidence: R055, R071

Tighter alpha starves TP examples; looser alpha pollutes labels with regime drift.
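The settled labeling rules (volatility-adaptive barriers, sigma clip, timeout class) fit in a few lines. This is a sketch under stated assumptions, not the project's labeler: sigma is taken to be a fractional per-bar volatility, barriers are checked against closes only (a real implementation would likely use highs and lows), and the function name is hypothetical.

```python
def clip(x, lo, hi):
    """Clamp x into [lo, hi]."""
    return max(lo, min(hi, x))


def tb3_vol_label(closes, i, sigma, k_up=20.0, k_dn=10.0, timeout=24,
                  sigma_lo=0.0005, sigma_hi=0.005):
    """Label bar i as +1 (TP hit first), -1 (SL hit first) or 0 (timeout).

    Barriers sit at entry * (1 + k_up * sigma) and entry * (1 - k_dn * sigma),
    with sigma clipped before the barriers are computed.
    """
    s = clip(sigma, sigma_lo, sigma_hi)
    entry = closes[i]
    take_profit = entry * (1 + k_up * s)
    stop_loss = entry * (1 - k_dn * s)
    # Scan at most `timeout` future bars; the first barrier touched decides.
    for price in closes[i + 1:i + 1 + timeout]:
        if price >= take_profit:
            return 1
        if price <= stop_loss:
            return -1
    return 0
```

Under these assumptions the clip bounds the TP distance to between 1% and 10% of price and the SL distance to between 0.5% and 5%, which is what keeps ultra-low or ultra-high volatility bars from producing absurd barriers.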

DEAD ENDS — what didn't work

Binary, 5-class, regression, DSR, filtered, next-event, RL-meta, SDIL, CBA, MRC, trend-scan, PQW, speed-weighted, efficiency (14 methods) (R000, R073-R080)

All 14 listed alternatives lose to 3-class tb3_vol.

lesson: CE on 3 well-balanced classes concentrates gradients better than any continuous/auxiliary target.

re-explore if: Switch to mean-reversion or longer-horizon paradigm where TP/SL framing no longer fits.

MFE/MAE dual-head regression with mse_ratio (R036-R038)

Best +97.9% sum — superseded by event-driven 3-class.

lesson: Regression labels are noisier than discretized barriers for GRU.

re-explore if: We add attention/TFT model that exploits continuous targets.

Confidence-weighted continuous labels, multi-horizon heads (Proposed)

Not validated.

lesson: Adding label dimensions without changing base discretization adds noise channels.

re-explore if: Architecture natively benefits from multi-task heads (TFT).

OPEN — questions still unanswered

Mean-reversion labeling scheme (fade-the-move) complementing tb3_vol momentum?

Regime-conditional labels (different barriers in bull vs bear)?

Features

SETTLED — what we know

Use exactly 36 clean features (47 base minus 11 harmful).

evidence: R056, R057

The 11 harmful features cost compound; keeping them adds variance with zero alpha.

Harmful features to remove: volatility_60, volume_ratio, volatility_12, hour_sin, volatility_24, obv_sma_20, obv_diff, macd, atr_14_pct, ema_9_21_cross, ema_21_50_cross.

evidence: R056

Permutation importance + ablation identified these; reintroducing them silently degrades.

Normalize per-sequence MinMax (each 300-bar window normalized independently).

evidence: R054, R054c, R054d

Global normalization leaks; partial normalization underperforms by ~3x.
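Per-sequence MinMax means every window is scaled with its own minimum and maximum, so no statistic from outside the window can reach the model. A minimal sketch in plain Python; the handling of constant columns (mapped to 0.0) is an assumed convention.

```python
def minmax_per_sequence(window):
    """Scale each feature column of one window to [0, 1] using only that window.

    window: list of rows (bars), each row a list of floats (features).
    Constant columns map to 0.0 to avoid dividing by zero.
    """
    if not window:
        return []
    cols = list(zip(*window))
    lows = [min(c) for c in cols]
    spans = [max(c) - lo for c, lo in zip(cols, lows)]
    return [
        [(v - lo) / span if span > 0 else 0.0
         for v, lo, span in zip(row, lows, spans)]
        for row in window
    ]
```

Because the function only ever sees one window, it cannot leak future or test-fold statistics by construction.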

Derive feature column order dynamically from the DataFrame, never hardcode.

evidence: R085 production audit

Hardcoded order silently misaligned ema_9/21/50 in production.
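One way to avoid a hardcoded order is to derive the feature list from the column names that are actually present and to build inputs by name. The sketch below takes any iterable of column names (for example `df.columns`); `NON_FEATURES` and both function names are hypothetical and should be adapted to the real schema.

```python
HARMFUL = frozenset({
    "volatility_60", "volume_ratio", "volatility_12", "hour_sin",
    "volatility_24", "obv_sma_20", "obv_diff", "macd",
    "atr_14_pct", "ema_9_21_cross", "ema_21_50_cross",
})
# Hypothetical non-feature columns; adjust to the real schema.
NON_FEATURES = frozenset({"timestamp", "label"})


def feature_columns(columns):
    """Keep the frame's own column order, minus harmful and non-feature names."""
    return [c for c in columns if c not in HARMFUL and c not in NON_FEATURES]


def row_to_vector(row, cols):
    """Build a model input by name lookup, so a reordered source cannot misalign it."""
    missing = [c for c in cols if c not in row]
    if missing:
        raise KeyError(f"missing feature columns: {missing}")
    return [row[c] for c in cols]
```

Looking values up by name means a reordered or extended source changes nothing silently; a missing column fails loudly instead.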

Single 15min timeframe — no 1h/4h aux channels.

evidence: R062

lb=300 already encodes ~75h of context; multi-TF features add noise.

DEAD ENDS — what didn't work

Add sigma as feature (R085, R085b)

-24% to -45% compound.

lesson: GRU already captures volatility from price; explicit sigma overfits.

re-explore if: Architecture doesn't implicitly model volatility.

tp_dist_pct / sl_dist_pct features (R085)

-24% compound.

lesson: Engineered distance-to-barrier leaks the label.

re-explore if: Never with current labeling.

is_event binary feature (R085)

No improvement.

lesson: Event filtering already governs entry; flag is information-free.

re-explore if: Mix event and non-event bars.

Consecutive + is_event combined (R085)

Catastrophic -88%.

lesson: Mixing sampling schemes poisons normalization stats.

re-explore if: Never.

Multi-timeframe (1h, 4h) features (R062)

No improvement.

lesson: Long lookback already provides macro context.

re-explore if: Lookback reduced below 100 event-bars.

OPEN — questions still unanswered

On-chain features (active addresses, exchange flows, funding rates)?

Order-book imbalance from L2 data?

Learned feature selector?

Event Detection

SETTLED — what we know

Filter to event bars using vol_tight: ATR > 2x rolling mean AND |returns| > p90 — yields ~10.5% of bars.

evidence: R046, R060, R085

Without event filtering the model trains on 90% noise bars.
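The vol_tight rule can be sketched as a boolean mask over bars. The rolling window length, the nearest-rank quantile and the use of a trailing mean that excludes the current bar are all assumptions made for illustration; the source only fixes the two conditions.

```python
import math


def vol_tight_mask(atr, returns, window=96, atr_mult=2.0, ret_quantile=0.90):
    """Flag bars where ATR > atr_mult * trailing-mean ATR AND |return| > quantile.

    The quantile threshold is computed over the whole input here; in a real
    pipeline it should be estimated on training data only.
    """
    if len(atr) != len(returns):
        raise ValueError("atr and returns must have equal length")
    abs_ret = [abs(r) for r in returns]
    ranked = sorted(abs_ret)
    # Nearest-rank quantile of |returns|.
    k = max(0, math.ceil(ret_quantile * len(ranked)) - 1)
    threshold = ranked[k] if ranked else float("inf")
    mask = []
    for i in range(len(atr)):
        past = atr[max(0, i - window):i]  # trailing window, excludes bar i
        if not past:
            mask.append(False)
            continue
        mean_atr = sum(past) / len(past)
        mask.append(atr[i] > atr_mult * mean_atr and abs_ret[i] > threshold)
    return mask
```

Only bars where the mask is true would be kept as event bars for sequence building.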

Event-driven sequences beat consecutive sequences decisively.

evidence: R046, R085

Consecutive sampling mixes regimes; event sampling keeps each sequence in one regime.

Lookback measured in event-bars (300), not calendar bars.

evidence: R047, R060, R061

Calendar lookback dilutes context with noise; event-bar lookback compresses ~75h of relevance.

vol_tight beats all alternatives in 4-fold WF.

evidence: R060

Relaxed filters get more trades but worse avg-edge; tighter ones cost too much trade count.

DEAD ENDS — what didn't work

Non-volatility event filters (volume, MACD-cross) (R060)

All lose to vol_tight.

lesson: Volatility is the right gating signal because labeling (vol-adaptive) and filter must align.

re-explore if: Change labels to something other than vol-adaptive.

Looser filter (vol_medium ~15%) for more trades (R060)

More trades, lower avg-edge, net lower.

lesson: Trade frequency and per-trade edge are anti-correlated.

re-explore if: Per-trade cost reduction (maker fees, DEX).

create_sequences with exact lookback (R085 bug)

Returned 0 sequences silently.

lesson: Sequence builder needs lookback+1 events; document explicitly.

re-explore if: Sequence builder rewrite.
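The silent-empty failure mode can be closed off by making the builder raise when it cannot produce a single sequence. A sketch, assuming each sample is the `lookback` events before index `i` paired with the label at `i`; that windowing convention is a guess that matches the lookback+1 requirement, and the signature is not the project's real `create_sequences`.

```python
def create_sequences(events, labels, lookback=300):
    """Return (window, label) pairs; each window holds `lookback` prior events.

    Raises instead of returning an empty list when there are too few events.
    """
    if len(events) != len(labels):
        raise ValueError("events and labels must align")
    if len(events) < lookback + 1:
        raise ValueError(
            f"need at least lookback+1={lookback + 1} events, got {len(events)}"
        )
    return [
        (events[i - lookback:i], labels[i])
        for i in range(lookback, len(events))
    ]
```

With exactly `lookback` events the comprehension would yield nothing, which is why the guard asks for `lookback + 1`.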

OPEN — questions still unanswered

Learned event detector beating vol_tight?

Regime-conditional event thresholds?

Ensemble and Voting

SETTLED — what we know

Train 5 seeds per fold, 4 folds — 20 checkpoints.

evidence: R069, R070, R072

Single-seed compound varies +86% to +262% across seeds at same config.

Validate across 3 independent seed-sets (42-46, 100-104, 200-204).

evidence: R072, R084

Single-set 4/4 can be luck; 3-set passing is the real robustness signal.

Binary voting: exit when ≥3/5 (originally 4/5) models signal danger.

evidence: R070, R071

Confidence-weighted averaging is noisier than discrete voting.

Danger trigger: P(TP)<0.22 OR P(SL)>0.70. Re-enter: P(TP)>0.32.

evidence: R071, R084

Asymmetric thresholds prevent whipsaw and preserve compounding.
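The voting rule and the asymmetric thresholds can be written as a small state transition. This is a sketch: the exit side follows the text (danger vote per model, exit at 3 of 5), while applying the same vote count to re-entry on P(TP) > 0.32 is an assumption, as are the function names and the tuple layout.

```python
def is_danger(p_tp, p_sl, tp_floor=0.22, sl_ceiling=0.70):
    """One model's danger vote: P(TP) < 0.22 OR P(SL) > 0.70."""
    return p_tp < tp_floor or p_sl > sl_ceiling


def next_position(in_position, model_probs, min_votes=3, reenter_tp=0.32):
    """Return True if the strategy should be invested after this bar.

    model_probs: list of (p_sl, p_timeout, p_tp) tuples, one per ensemble member.
    Exit when at least min_votes members vote danger. Re-entry by the same
    vote count on P(TP) > reenter_tp is an assumption, not stated in the source.
    """
    if in_position:
        danger_votes = sum(is_danger(p_tp, p_sl) for p_sl, _, p_tp in model_probs)
        return danger_votes < min_votes
    reenter_votes = sum(p_tp > reenter_tp for _, _, p_tp in model_probs)
    return reenter_votes >= min_votes
```

A P(TP) of 0.30 illustrates the hysteresis: it is not low enough to trigger an exit, yet not high enough to trigger a re-entry, so the current state persists.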

Same 20 GRU checkpoints power all 5 production servers (V2-V6).

evidence: R111 deployment

Strategy layer is the differentiator; retraining rarely warranted.

DEAD ENDS — what didn't work

Single-seed selective trading (R084)

Below B&H (+159% vs +1,005%).

lesson: Single seeds are pure noise.

re-explore if: Never — 5 seeds minimum is law.

9-seed ensembles (R092)

No improvement over 5.

lesson: Variance reduction is sublinear; 5 is the knee.

re-explore if: Noisier base learner.

Confidence-weighted averaging (R070, R114)

Slightly worse than binary voting.

lesson: Calibration errors propagate through soft averages.

re-explore if: Add explicit per-model calibration (temperature, isotonic).

Stacking GRU + LightGBM (Proposed)

Not validated.

lesson: Unknown.

re-explore if: Hit clear ceiling with pure GRU ensembling.

OPEN — questions still unanswered

Heterogeneous ensembling (GRU + GBM + SMA)?

Per-fold model selection (best-K-of-5)?

Exit Logic and Cooldown

SETTLED — what we know

V66 uniform cooldown (4 bars post-entry, 48 bars post-exit) is the min-alpha champion at +245%.

evidence: R110, R111

Min alpha across folds is the conservative robustness metric.
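The cooldown can be read as a gate on state changes. The semantics below are an assumption, since the source gives only the two bar counts: no exit is allowed until 4 bars after an entry, and no re-entry until 48 bars after an exit (1h and 12h on the 15min base). The function name and arguments are hypothetical.

```python
def action_allowed(action, bars_since_entry, bars_since_exit,
                   post_entry_cooldown=4, post_exit_cooldown=48):
    """Gate 'exit' and 'enter' by bars elapsed since the last transition.

    Pass None for a counter when that transition has never happened.
    """
    if action == "exit":
        return bars_since_entry is None or bars_since_entry >= post_entry_cooldown
    if action == "enter":
        return bars_since_exit is None or bars_since_exit >= post_exit_cooldown
    raise ValueError(f"unknown action: {action!r}")
```

A trader loop would consult this gate after the ensemble vote and before changing the position.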

V115_cmp (ATR-conditional + peak_drop + ratchet + regime cooldown) is the compound king at +168,759%.

evidence: R110, R111

Max compound is the aggressive growth metric — pick V66 vs V115 by risk tolerance.

Daily SMA filter is the single most important component — +7,641% → +720M%.

evidence: R094

Without daily trend filter, GRU still works but caps far below ceiling.

Fixed exits beat dynamic and trailing as the base layer.

evidence: R061

Trailing/dynamic add path-dependence that breaks labeling alignment.

Always-invested with ensemble-voted danger exit beats selective trading 4-8x.

evidence: R068, R071, R084

B&H captures the bull move; selective trading sits in cash through the biggest 1% of days.

DEAD ENDS — what didn't work

Trailing/dynamic exits as primary (R061)

Lose to fixed (+3,440%).

lesson: Simpler aligns with labeling (trained on fixed barriers).

re-explore if: Retrain with trailing-aware labels.

SL-signal exit (R063b)

Inferior to always-invested.

lesson: P(SL) not precise enough alone.

re-explore if: P(SL) calibration improved.

DD circuit breaker (R097)

Destroys compound.

lesson: DD is lagging; by trigger time, worst is over.

re-explore if: Leading regime indicator pre-empts.

Adaptive ATR exits uniformly (R095)

+53% V4, hurts V3/V6.

lesson: Adaptive exits are regime-dependent.

re-explore if: Per-strategy tuning.

RSI/MACD/volume/MTF filters on top of SMA (R096, R098, R119)

All hurt.

lesson: Once daily SMA is on, additional filters subtract from right tail.

re-explore if: Filter orthogonal to trend (funding rate).

OPEN — questions still unanswered

Exit policy between V66 and V115 capturing V115's compound with V66's robustness?

Learned exit policy (RL on frozen GRU)?

Position Sizing

SETTLED — what we know

Use 100% per trade — full notional always-invested maximizes compound.

evidence: R066, R066b, R084, R116

Every fractional sizing gave up more compound than it saved in DD.

Accept higher MaxDD as the cost of 100% sizing — compound math wins.

evidence: R066, R084

Reducing DD via sizing costs more compound than the DD itself.

Pyramiding marginal, not worth complexity.

evidence: R066b

Multi-entry adds bookkeeping for alpha that fails cross-set validation.

DEAD ENDS — what didn't work

Kelly Criterion (fixed 10-25%, half-Kelly) (R066, R116)

Reduces compound.

lesson: Kelly assumes accurate edge estimation; GRU probabilities not calibrated enough.

re-explore if: Explicit probability calibration.

Confidence-proportional sizing (R116)

Reduces compound.

lesson: Same as Kelly — calibration is blocker.

re-explore if: Post-calibration with isotonic regression.

Vol-scaled and DD-scaled sizing (R116)

Reduces compound.

lesson: Vol-scaling reduces position when vol is high — but that's exactly when event-bars carry largest alpha.

re-explore if: Sizing conditional on regime classification.

OPEN — questions still unanswered

Does sizing become relevant with shorts (combined long/short notional)?

Leverage >1x on high-confidence bars?

Validation

SETTLED — what we know

Use 4-fold walk-forward expanding-window over 2018-2026.

evidence: R025, R037, R064

Random splits leak; sliding window discards data; expanding mirrors live.

Promote a config only after it passes 3 seed-sets x 4 folds = 12/12.

evidence: R072, R084

Single-set 4/4 is coin-flip; 12/12 filters out seed-luck.
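The 12/12 gate is easy to encode so that a missing cell cannot pass by omission. A sketch: the seed ranges come from the text, while the set names A/B/C, the function name and the shape of the `passed` mapping are choices made here; what counts as a "pass" for one cell is left to the caller.

```python
from itertools import product

# Seed ranges are from the text; the set names A/B/C are labels chosen here.
SEED_SETS = {"A": range(42, 47), "B": range(100, 105), "C": range(200, 205)}
FOLDS = range(4)


def promotion_gate(passed):
    """passed maps (seed_set, fold) -> bool. Returns (promote, failed_cells)."""
    cells = list(product(SEED_SETS, FOLDS))
    missing = [c for c in cells if c not in passed]
    if missing:
        raise KeyError(f"missing results for {missing}")
    failed = [c for c in cells if not passed[c]]
    return (not failed, failed)
```

Returning the failed cells alongside the verdict keeps a single-fold miss visible in the experiment log.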

Compound = product(1 + fold_pnl/100) - 1, not arithmetic sum.

evidence: R058

Arithmetic sum hides geometric mean; actual deployed equity follows product.
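A short worked example shows how the product and the sum can rank two configs differently. The fold returns below are invented for illustration, and the helper returns percent rather than a fraction for readability.

```python
import math


def compound_pct(fold_pnls_pct):
    """Compound return in percent from per-fold returns given in percent."""
    growth = math.prod(1 + p / 100 for p in fold_pnls_pct)
    return (growth - 1) * 100


# Invented fold returns, in percent.
spiky = [100, -50, 100, 100]   # arithmetic sum = 250
steady = [60, 60, 60, 60]      # arithmetic sum = 240
```

Here `compound_pct(spiky)` is about +300% and `compound_pct(steady)` about +555%: the arithmetic sum prefers the spiky config, while the product that equity actually follows prefers the steady one.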

Report min-alpha-across-folds AND compound — not either alone.

evidence: R110, R111

Compound dominated by one lucky fold; min-alpha is worst-case.

Sigma clip and feature normalization must be fit on train only.

evidence: CLAUDE.md, R054

Per-sequence normalization is leak-safe; global stats must be train-only.

Commission 0.04% + slippage 0.01% in every backtest — no exceptions.

evidence: CLAUDE.md rule

Cost-free backtests double apparent edge.
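A minimal way to apply the cost floor is to charge it on both legs of every trade. Charging per side is an assumption (the source gives only the two rates), and the function name is hypothetical.

```python
COMMISSION = 0.0004  # 0.04%
SLIPPAGE = 0.0001    # 0.01%


def net_round_trip_return(entry_price, exit_price,
                          cost_per_side=COMMISSION + SLIPPAGE):
    """Net return of one long round trip, charging cost on entry and on exit."""
    if entry_price <= 0:
        raise ValueError("entry_price must be positive")
    paid = entry_price * (1 + cost_per_side)      # buy: pay price plus cost
    received = exit_price * (1 - cost_per_side)   # sell: receive price minus cost
    return received / paid - 1
```

Under this assumption a flat trade loses about 0.1%, so a strategy with many trades needs a per-trade edge clearly above that before it shows any profit.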

DEAD ENDS — what didn't work

Sliding window vs expanding (R064)

Expanding slightly better.

lesson: Older bull-cycle data still helps generalization.

re-explore if: Clear non-stationarity in pre-2020 data.

Single-seed promotion gates (R069, before policy change)

Set B failed fold-3 by 6%.

lesson: Always multi-seed; 6% miss invisible in single.

re-explore if: Never.

val_loss as primary selection (R000, R001)

Lowest val_loss lost money in sim.

lesson: val_loss is decorrelated from PnL.

re-explore if: Never.

OPEN — questions still unanswered

5th fold spanning 2026-only as true OOS?

Non-overlapping seed-set strategy (5 sets of 5)?

Production Engineering

SETTLED — what we know

Five servers in production share one ensemble (20 checkpoints) and differ only in YAML strategy config.

evidence: R111

Decoupling model from strategy lets you A/B test policies without retraining.

Derive feature column order from live DataFrame at inference.

evidence: R085 audit

Hardcoded order silently misaligned ema columns.

After force-closing last trade: set position=0, position_cost=0, equity[-1]=capital.

evidence: R082 bug, R083 fix

Double-counting inflated reported compound ~2x for R068-R082.
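The bookkeeping rule can be illustrated with a toy long-only account. Only the three assignments in `force_close` come from the text; the class, its other methods and the absence of costs are simplifications assumed for the sketch.

```python
class Account:
    """Toy long-only account used to illustrate the force-close rule."""

    def __init__(self, capital):
        self.capital = capital      # cash
        self.position = 0.0         # units held
        self.position_cost = 0.0    # cash spent on the open position
        self.equity = [capital]     # equity curve, one value per mark

    def buy_all(self, price):
        """Put all cash into the asset at `price`."""
        self.position = self.capital / price
        self.position_cost = self.capital
        self.capital = 0.0

    def mark(self, price):
        """Append the current mark-to-market equity."""
        self.equity.append(self.capital + self.position * price)

    def force_close(self, price):
        """Liquidate at `price`, then zero the position so it is not counted twice."""
        self.capital += self.position * price
        self.position = 0.0
        self.position_cost = 0.0
        self.equity[-1] = self.capital
```

If `position` were left non-zero after the proceeds are added to `capital`, any later `capital + position * price` computation would count the same holding twice.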

Smoke-test trader class on real bars before deploying.

evidence: R111

Catches CSV misalignment, sequence edge cases, YAML drifts.

Pre-trained checkpoints from R069/R072/R081 reused — don't retrain unless arch/features/labels change.

evidence: R082, R111

Retraining burns Colab budget and reintroduces variance.

Paper-trade any new strategy on real-time data before live capital.

evidence: CLAUDE.md

Execution slippage and CSV-write races surface only in live data.

DEAD ENDS — what didn't work

Always-invested backtester in notebook (pre-fix) (R068-R082)

Reported +89,000% was actually +4,000-7,600%.

lesson: Notebook backtesters drift from canonical class.

re-explore if: Never — only canonical is trustworthy.

CSV column logging with implicit positional alignment (Production audit)

Column misalignment in event logs.

lesson: Always log with explicit dict-to-row or DataFrame.to_csv.

re-explore if: Never.
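Logging by key rather than by position can be done with the standard library alone. A sketch using `csv.DictWriter`; the field list is a hypothetical schema, not the production log format.

```python
import csv
import io

FIELDS = ["timestamp", "action", "price", "p_tp", "p_sl"]  # hypothetical schema


def write_events(stream, events, fields=FIELDS):
    """Write dict events by key, so column placement never depends on dict order."""
    writer = csv.DictWriter(stream, fieldnames=fields, extrasaction="raise")
    writer.writeheader()
    for event in events:
        writer.writerow(event)


# Usage with an in-memory buffer; the keys are deliberately out of order.
buf = io.StringIO()
write_events(buf, [{"price": 100.0, "timestamp": "t0", "action": "exit",
                    "p_sl": 0.8, "p_tp": 0.1}])
```

With `extrasaction="raise"` an unexpected key raises `ValueError` instead of being dropped silently. When writing to a real file, open it with `newline=""` as the `csv` module documentation recommends.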

OPEN — questions still unanswered

Live-vs-backtest slippage gap after 60 days paper trading?

Should V2-V6 be merged into single multi-config trader?

Paradigm Boundaries

SETTLED — what we know

Momentum / barrier-trigger paradigm is exhausted — 87 rounds converge to same family.

evidence: R000-R087

Further tuning yields ~5-10% deltas; new paradigms could yield 2-10x.

Shorts do not work with current 3-class GRU.

evidence: R063, R063b

P(SL) not precise enough; need dedicated short model.

5min and 1min base timeframes do not beat 15min.

evidence: R035, R065

More frequency doesn't pay for noise increase.

Daily SMA trend filter is highest-leverage discovery post-R083.

evidence: R094

The only single component that moved compound by roughly 5 orders of magnitude.

Portfolio regime-aware V66+MACD_div delivers 5× V66 compound at matching min α.

evidence: R120

First strategy treating V2-V6 as portfolio members rather than alternatives.

DEAD ENDS — what didn't work

Mean-reversion standalone (R117)

MACD_div +179K% compound BUT min α -42.

lesson: Mean-rev needs portfolio context, not standalone.

re-explore if: Combined with regime filter or trend confirmation.

Alternative model classes — XGBoost (R118)

GRU crushes trees roughly 288× on compound (70,576% vs 245%).

lesson: Tree models can't capture sequence dependencies that GRU encodes.

re-explore if: Tabular features only, no sequence.

Multi-timeframe filter (4h on V66) (R119)

All 5 filter variants degrade V66.

lesson: V66 already encodes context; external filter is redundant.

re-explore if: Filter applied to a strategy WITHOUT internal regime detection.

On-chain features (Glassnode/mempool) (Not explored)

Untested.

lesson: All features price/volume-derived.

re-explore if: Always — most likely orthogonal alpha source.

Multi-asset (ETH, SOL, top-10) (Not explored)

Untested.

lesson: Cross-asset signals (BTC-ETH ratio) untouched.

re-explore if: Always — major unexplored direction.

Reinforcement learning end-to-end (R075 partial)

Meta-RL on top of GRU didn't help.

lesson: RL on frozen GRU has too little signal.

re-explore if: Stable simulation environment with realistic slippage.

OPEN — questions still unanswered

Does mean-reversion + regime detection deliver portfolio-level alpha?

Do on-chain or funding-rate features add anything?

Multi-asset coordination (BTC+ETH+SOL)?

End-to-end RL beats hand-tuned cascades?

META PRINCIPLES — How We Learn

SETTLED — what we know

Single-seed results are noise — minimum 5 seeds for any compound claim.

evidence: R069, R072

Fold-0 compound varies +86% to +262% across seeds at same config.

Pre-bug-fix numbers (R068-R082, R022) are not comparable to post-fix.

evidence: R027 gaps bug, R083 equity bug

Mixing pre/post-fix promotes broken configs.

val_loss is not PnL — always simulate compound on test fold.

evidence: R000, R001

Optimizer optimizes loss; user optimizes equity. They diverge.

Commission 0.04% + slippage 0.01% per CLAUDE.md — hard floor in every backtest.

evidence: CLAUDE.md

Cost-free backtests have falsely promoted multiple configs that died in paper.

More trades at lower per-trade edge can beat fewer high-edge trades — compounding wins.

evidence: R068, R084, R120

500 trades at +1% → +14,377%; 100 at +2% → +624%. Frequency matters.

Promote on min-alpha AND compound — not either alone.

evidence: R110, V66 vs V115

Compound rewards one lucky fold; min-alpha rewards worst-case.

Compound = product of (1 + fold_returns) - 1, not sum.

evidence: R058, R084

Arithmetic sums rank configs that crash in fold 2 above consistent ones.

Reuse pre-trained ensembles across strategy experiments — retrain only when arch/features/labels change.

evidence: R082, R111

Keeps strategy comparisons clean; saves Colab budget for paradigm shifts.

Smoke-test on real bars before any deploy — never rely on backtest only.

evidence: R111, R085

Three most expensive bugs (equity double-count, feature order, CSV) survived backtests.

When you've explored 87 rounds inside one paradigm, EXIT IT — incremental gains become noise.

evidence: R084-R098 plateau, R112-R116 V66 exhaustion

Mean-rev, on-chain, multi-asset, RL paradigms are unexplored; further momentum tuning is diminishing returns.