The Journey
120 rounds across 56 days (2026-03-27 to 2026-05-22) — 10 epochs of trial, error, and accumulated knowledge.
Timeline: each block = one epoch, width ∝ number of rounds
Epochs
Foundations & First Failures
27 rounds chasing supervised learning on binary then volatility-adaptive triple-barrier labels — only to discover a sequence-gap bug that invalidated nearly every result.
The project opens with the most natural question in financial ML: can a sequence model predict the next move? R000 wires up the full pipeline (LSTM 2x128, lookback 60, binary_12 labels asking 'will price be higher in 1 hour?') and gets answered with a brutal -36.5% PnL and a 45.2% win rate. Binary labels are noise. R001 swaps to triple-barrier labeling (TP=0.75%, SL=0.5%) and uncovers a deeper set of issues: the model was shorting on label=0 (which doesn't mean 'down', only 'not up'), class weights were missing, and the Sharpe formula was double-counting flat bars. R002 patches those, and R003 simplifies to a 21K-parameter GRU to fight overfitting. None of it works at fixed thresholds.
The pivot happens at R004: thresholds should scale with volatility. The new formula TP = k_up · σ, SL = k_dn · σ becomes the labeling backbone that survives the entire project. Twenty-three rounds (R004-R026) sweep k_up/k_dn from 1.8/1.2 to 25/15, alpha from 6 to 96, lookbacks from 20 to 120. R012 establishes GRU > LSTM as a permanent fact. R020-R022 prove multi-timeframe features add +795%. R025 validates walk-forward 4-fold. By R026 the project thinks it has a +998% champion (R022).
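The labeling rule described above can be sketched as follows. This is a minimal illustration, not the project's code: it assumes σ is a fractional volatility estimate for the entry bar, that barriers are checked against close prices only, and that the label convention is +1 (TP first), -1 (SL first), 0 (timeout). The function name and the `horizon` parameter are hypothetical.

```python
from typing import Sequence


def triple_barrier_label(prices: Sequence[float], start: int, sigma: float,
                         k_up: float, k_dn: float, horizon: int) -> int:
    """Return +1 if TP is touched first, -1 if SL is touched first, 0 on timeout."""
    entry = prices[start]
    tp = entry * (1.0 + k_up * sigma)  # take-profit distance: TP = k_up * sigma
    sl = entry * (1.0 - k_dn * sigma)  # stop-loss distance:   SL = k_dn * sigma
    # Walk forward at most `horizon` bars after the entry bar.
    for price in prices[start + 1 : start + 1 + horizon]:
        if price >= tp:
            return 1
        if price <= sl:
            return -1
    return 0  # neither barrier touched within the horizon


# With sigma = 0.5% and k_up/k_dn = 1.8/1.2, the barriers sit at +0.9% / -0.6%.
label = triple_barrier_label([100.0, 100.3, 101.0], start=0, sigma=0.005,
                             k_up=1.8, k_dn=1.2, horizon=2)
```

Because both barriers scale with σ, the same k values produce wide barriers in turbulent markets and tight ones in quiet markets, which is the property a fixed TP/SL pair lacks.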
Then R027 happens. Investigating a discrepancy between training and production, the team discovers that filtering out 'timeout' bars during training created sequence gaps that did not exist in live data. Every result in R004-R026 was inflated, sometimes by 60x. The honest +998% becomes +15.9%. Roughly twenty rounds of work are reduced to a single durable insight (volatility-adaptive barriers) and a brutal lesson about pipeline parity.
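A toy reproduction of the sequence-gap problem may make the failure mode concrete. The sketch below uses bar indices instead of real features and assumes label 0 marks a timeout bar. The "fixed" ordering is only one plausible way to keep training windows identical to live windows; the source does not show the actual fix.

```python
def windows(indices, lookback):
    """Return every run of `lookback` adjacent entries from `indices`."""
    return [indices[i : i + lookback] for i in range(len(indices) - lookback + 1)]


def has_gap(window):
    """True if the bar indices in a window are not strictly consecutive."""
    return any(b - a != 1 for a, b in zip(window, window[1:]))


labels = [1, 0, 0, -1, 1, 0, 1, -1]  # toy labels; 0 marks a 'timeout' bar
bars = list(range(len(labels)))

# Buggy order: drop timeout bars first, then slice windows.
# The model sees bars stitched together that were never adjacent in time.
kept = [i for i in bars if labels[i] != 0]
buggy = windows(kept, lookback=3)

# Parity-preserving order: slice windows over ALL bars, then keep only the
# windows whose final (target) bar is not a timeout.
fixed = [w for w in windows(bars, lookback=3) if labels[w[-1]] != 0]
```

In this toy series every window in `buggy` contains a gap, while no window in `fixed` does. A live feed can only ever produce gap-free windows, so only the second ordering matches production.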
Key breakthroughs
- R004: Volatility-adaptive triple barrier (k·σ) replaces fixed thresholds — the foundational labeling scheme that survives the next 116 rounds
- R012: GRU consistently beats LSTM — settled architecturally for the rest of the project
- R020-R022: Multi-timeframe (15min aux features) gives +795% (honest +15.9% post-fix, but the directional signal was real)
- R025: Walk-forward 4-fold validation methodology established
Key disappointments
- R000-R003: Binary and fixed triple-barrier labels are noise — no amount of architecture tuning rescues them
- R027: The sequence-gap bug invalidates ~20 rounds of recorded 'wins' — a humbling reset
Post-Fix Recalibration & The MFE/MAE Pivot
After the bug fix flatlined returns, the project pivots from classification to regression on MFE/MAE, finds the right loss function (mse_ratio), and posts the first walk-forward-validated +97.9% across 4 folds.
With the sequence-gap bug exorcised, the team has to rebuild credibility against honest numbers. R027-R033 try swing labels and a dual-head architecture; R033 hits +17.6% with the dual head. R034 attempts pure regression and finds an unusable combination — WR 75% and PF 3.56, but only 16 trades across the whole walk-forward. R035 explores 1-minute base bars and immediately blows up GPU memory.
R036 introduces what becomes the second foundational idea: predict MFE (maximum favorable excursion) and MAE (maximum adverse excursion) as continuous values. DualHeadGRU + MSE on both heads gives +8.3%; R036b tunes TP/SL to +9.4%; R036c shows the model knows when NOT to enter, losing only -1.2% in a bear market where B&H loses -23%. The 'knows when to sit out' result is the project's first hint that the model has real, asymmetric edge.
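For readers unfamiliar with the two targets, here is a minimal sketch of how MFE and MAE can be computed for a long entry. It assumes per-bar high/low series and expresses both excursions as positive fractions of the entry price; the function name and the `horizon` parameter are illustrative, not taken from the project.

```python
def mfe_mae(highs, lows, entry_price, start, horizon):
    """Return (MFE, MAE) for a long position opened at bar `start`.

    Both values are positive fractions of the entry price, measured over the
    `horizon` bars that follow the entry bar (at least one must exist).
    """
    future_highs = highs[start + 1 : start + 1 + horizon]
    future_lows = lows[start + 1 : start + 1 + horizon]
    # Best and worst prices reached while the position would have been open.
    mfe = max(max(future_highs) - entry_price, 0.0) / entry_price
    mae = max(entry_price - min(future_lows), 0.0) / entry_price
    return mfe, mae


mfe, mae = mfe_mae(highs=[100.0, 102.0, 101.0, 104.0],
                   lows=[99.0, 98.0, 97.0, 100.0],
                   entry_price=100.0, start=0, horizon=3)
```

Here the price reaches 104 at best and 97 at worst, so MFE is 4% and MAE is 3%. A model that predicts both numbers can express "large upside, small downside" directly, which a single direction label cannot.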
R038 then performs the single most important loss-function experiment. Standard MSE produces uniform predictions (MFE/MAE ratio ≈ 1.0) — the model cheats by predicting the mean. Four losses are compared: standard MSE (-4.9%), mse_ratio (+16.2%), weighted (-13.6%, hacks loss), and asymmetric (-0.6%). mse_ratio is the ONLY viable loss. R037 then validates mse_ratio + MFE/MAE in proper walk-forward across 4 folds (2018-2026) and posts +97.9% total / +9.3% worst fold / WR 54% / PF 1.49.
Key breakthroughs
- R036: MFE/MAE dual-head regression introduced — the model learns asymmetric risk-reward, not just direction
- R038: mse_ratio loss function discovered — the ONLY loss that doesn't collapse to a constant prediction
- R036c: First demonstration that the model adds value by knowing when to stay flat (-1.2% vs B&H -23% in bear)
- R037: First walk-forward-validated +97.9% across 4 folds — credibility restored
Key disappointments
- R034: Pure regression gives WR 75% / PF 3.56 but only 16 trades — too sparse to compound
- R035: 1-minute base bars don't help and cause OOM
- R038 weighted/asymmetric losses: 'Smart' loss variants either hack the loss or over-conserve
The Event-Driven Revolution
Replacing consecutive-bar sequences with event-driven sequences (filtered by volatility) produced the biggest single jump in performance of the entire project: +685% summed across 4 walk-forward folds.
The lurking problem after R037 was sample efficiency: 95% of 5-minute BTC bars carry essentially no information, but the GRU was being asked to learn from all of them equally. R046 introduces the most important architectural reframe: train only on 'interesting' bars, defined as bars where ATR > 2x the rolling mean OR |returns| > 90th percentile. About 10.5% of bars survive the vol_tight filter.
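The filter rule quoted above (ATR above 2x its rolling mean, or absolute return above the 90th percentile) can be sketched in plain Python. Several details are assumptions: the rolling mean is taken over a short trailing window that excludes the current bar, the percentile uses the nearest-rank method, and the percentile is computed over the whole sample, which is a simplification that would leak future information in live use. All names are hypothetical.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def event_mask(atr, returns, window=3, atr_mult=2.0, ret_pct=90):
    """Flag bars with an ATR spike or an unusually large absolute return."""
    cutoff = percentile([abs(r) for r in returns], ret_pct)
    mask = []
    for i in range(len(atr)):
        past = atr[max(0, i - window) : i]  # trailing window, excludes bar i
        atr_spike = bool(past) and atr[i] > atr_mult * (sum(past) / len(past))
        mask.append(atr_spike or abs(returns[i]) > cutoff)
    return mask


atr = [1, 1, 1, 3, 1, 1, 1, 1, 1, 1]
returns = [0.001, -0.001, 0.001, 0.002, 0.001, -0.001, 0.05, 0.001, 0.001, -0.001]
mask = event_mask(atr, returns)
event_bars = [i for i, keep in enumerate(mask) if keep]
```

In this toy series only bar 3 (ATR spike) and bar 6 (large return) survive. The sequence model would then be fed windows built from `event_bars` rather than from every bar.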
The results are dramatic. The model now sees a sequence of regime transitions rather than a flood of mostly-flat candles. R046-R053 sweep event filters, lookbacks, and feature sets. By R053 the best config — binary labels + event-driven sequences + lookback=100 + vol_ultra filter — posts +685% summed across 4 walk-forward folds. That is roughly 7x the previous walk-forward record and ~1.7x B&H. 15min is confirmed as the optimal base timeframe.
This epoch is short (8 rounds, 3 days) but represents the single highest-leverage methodological change in the project. Every subsequent breakthrough is built on top of event-driven sequences.
Key breakthroughs
- R046: Event-driven sequences introduced — 10.5% of bars selected by volatility, massively improving signal density
- R047/R053: Lookback=100 event bars and vol_ultra filter give +685% summed across 4 walk-forward folds
- Walk-forward optimization methodology stabilized
- 15min base timeframe confirmed optimal vs 5min, 10min, 1h
Key disappointments
- Consecutive-bar sequences definitively retired
- Most 'event filter' variants underperform vol_tight
Feature & Label Consolidation
Seven rounds methodically lock in the building blocks: full-sequence MinMax normalization, 3-class labels with weighted CrossEntropy, 36 clean features, and the vol_tight + lb=300 + k25/15 sweet spot.
With event-driven sequences proven, R054-R060 settle every component decision that will define the V2-V6 production stack. R054/054c/054d run three independent normalization studies and unanimously confirm full per-sequence MinMax — R054c posts +791% compound at lb=300. R055 tests 3-class labels (SL=-1, timeout=0, TP=+1) and wins +824% vs +685% binary; the key is CrossEntropyLoss with ce_signal weights [2.0, 0.5, 2.0].
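The effect of the ce_signal weights can be illustrated with a small, dependency-free version of weighted cross-entropy. This sketch assumes the class order [SL, timeout, TP] for the weight vector and normalises by the summed target weights, which is how a weighted mean reduction is commonly defined; it is not the project's training code.

```python
import math

CE_SIGNAL_WEIGHTS = [2.0, 0.5, 2.0]  # assumed class order: [SL, timeout, TP]


def weighted_cross_entropy(logits, targets, weights=CE_SIGNAL_WEIGHTS):
    """Weighted mean cross-entropy, normalised by the summed target weights."""
    total, weight_sum = 0.0, 0.0
    for row, target in zip(logits, targets):
        peak = max(row)  # subtract the max before exponentiating, for stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += weights[target] * (log_norm - row[target])
        weight_sum += weights[target]
    return total / weight_sum


# One badly wrong TP sample and one correct timeout sample.
logits = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
targets = [2, 1]
weighted = weighted_cross_entropy(logits, targets)
unweighted = weighted_cross_entropy(logits, targets, weights=[1.0, 1.0, 1.0])
```

With the [2.0, 0.5, 2.0] weights the missed TP dominates the batch loss, so `weighted` comes out larger than `unweighted`. That is the intended bias: errors on the two signal classes cost four times as much as errors on timeouts.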
R056 runs permutation importance + correlation + ablation studies on the original 47 features, identifying 11 actively harmful features (volatility_60, volume_ratio, hour_sin, MACD, ema crosses, etc.) — the 'clean 36' feature set is born. R057 confirms clean 36 is more robust than positive_only 24. R058 introduces the compound metric (product of (1+fold)) and posts +4,783% compound. R059 combines 3-class + clean 36 features. R060 does the definitive event-filter grid and confirms vol_tight + k25/15 + thr=0.40 as the sweet spot at +4,021% compound.
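Since earlier epochs report summed fold returns and later ones report compound returns, a short sketch of the difference may help. It implements the "product of (1+fold)" definition given above, with fold returns expressed as fractions (1.0 = +100%); the fold values are made up for illustration.

```python
import math


def compound_return(fold_returns):
    """Compound per-fold returns: prod(1 + r) - 1."""
    return math.prod(1.0 + r for r in fold_returns) - 1.0


def summed_return(fold_returns):
    """Plain sum of per-fold returns, as used before the compound metric."""
    return sum(fold_returns)


folds = [1.0, 0.5, -0.2, 2.0]  # hypothetical folds: +100%, +50%, -20%, +200%
compound = compound_return(folds)
summed = summed_return(folds)
```

For these four folds the compound figure is about +620% while the sum is +330%. The two metrics are not comparable, so a summed figure from one round cannot be ranked directly against a compound figure from another.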
This epoch is the project's least glamorous and arguably its most important. None of these rounds 'discovered' anything; they consolidated and stress-tested the discoveries of the previous epoch.
Key breakthroughs
- R054c: Full per-sequence MinMax normalization confirmed (3 independent tests) — record +791% compound
- R055: 3-class labels with ce_signal [2.0, 0.5, 2.0] beat binary +824% vs +685%
- R056: 11 harmful features identified; clean 36 feature set established
- R058/R060: Compound metric introduced (+4,783% at R058) and vol_tight + lb=300 + k25/15 sweet spot locked in (+4,021% at R060)
Key disappointments
- Positive-only 24 features: high-quality trades but less robust than clean 36
- R058 single-seed variance: fold 0 varies +86% to +262% — foreshadowing ensemble work
Always-Invested Revolution & The R071 Record
A twelve-round sprint that rebuilt the trading layer from exit modes upward — culminating in R068's paradigm flip to 'always invested + ensemble danger detection' and R071's record +7,128% compound.
R061-R067 systematically refine the trading layer. R061 compares fixed/dynamic/trailing/hybrid/always exits and finds fixed simplest and best (+3,440%). R062 confirms multi-timeframe features don't help once lb=300 is in place. R063 retires shorts (P(SL) isn't precise enough) and the SL-signal exit. R064 settles on expanding-window over sliding. R065 confirms 15min over every alternative and discovers that the sigma clip [0.0005, 0.005] is essential: removing it cuts compound by 2.5x. R066 picks 100% position sizing. R067 finds that requiring 3-4 consecutive event bars in agreement improves trade quality by 47%.
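The sigma clip is easy to state in code. The sketch below assumes the clip is applied to σ before the barrier distances k·σ are computed, and uses the k25/15 setting mentioned earlier; the function name is hypothetical and the source does not show where in the pipeline the clip actually sits.

```python
SIGMA_MIN, SIGMA_MAX = 0.0005, 0.005  # clip range quoted in the text


def clipped_barriers(sigma, k_up=25.0, k_dn=15.0):
    """Return (tp_distance, sl_distance) as fractions after clipping sigma."""
    sigma = min(max(sigma, SIGMA_MIN), SIGMA_MAX)
    return k_up * sigma, k_dn * sigma


quiet = clipped_barriers(0.0001)  # sigma raised to the 0.0005 floor
wild = clipped_barriers(0.02)     # sigma capped at the 0.005 ceiling
```

Under these assumptions the barriers can never be tighter than 1.25% / 0.75% nor wider than 12.5% / 7.5%. That keeps a dead-quiet market from producing barriers that noise alone would hit, and a volatility spike from producing barriers that are never reached.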
Then R068 changes everything. Rather than asking 'when should the model enter?', it asks 'why is the model selectively trading at all?' Answer: stay invested like B&H by default, use the model only to detect dangerous regimes and exit. This single reframing captures all of B&H's upside while letting the model add asymmetric protection. It is the single biggest conceptual breakthrough in the project.
R069 introduces 5-seed-per-config validation (single-seed results revealed as unreliable). R070 confirms voting (≥4/5 agree) beats simple averaging. R071 then assembles the full stack and posts +7,128% compound, beating B&H in 4/4 folds. R072 validates across three seed sets: set A passes 4/4, set B 3/4, set C 4/4.
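A minimal sketch of the always-invested rule with ensemble voting follows. It assumes each of the five seeds emits a per-bar 'danger' probability, that a seed votes when its probability reaches a threshold, and that the position is flat only while at least 4 of 5 seeds vote. The 0.40 threshold and all names are illustrative, and real re-entry logic is likely more involved.

```python
def danger_vote(danger_probs, threshold=0.40, min_votes=4):
    """True when at least `min_votes` seeds rate the bar as dangerous."""
    votes = sum(p >= threshold for p in danger_probs)
    return votes >= min_votes


def always_invested_positions(per_bar_probs, threshold=0.40, min_votes=4):
    """1 = hold the asset (the default), 0 = flat while the ensemble flags danger."""
    return [0 if danger_vote(probs, threshold, min_votes) else 1
            for probs in per_bar_probs]


per_bar_probs = [
    [0.1, 0.2, 0.1, 0.3, 0.2],  # calm: no seed votes
    [0.9, 0.8, 0.7, 0.6, 0.1],  # 4 of 5 seeds vote danger
    [0.9, 0.9, 0.1, 0.1, 0.1],  # 2 loud seeds, 3 calm ones
]
positions = always_invested_positions(per_bar_probs)
```

The third bar shows why voting and averaging can disagree: its mean probability is 0.42, which would trigger an exit under a 0.40 averaging rule, yet only two seeds vote, so the position stays invested.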
Key breakthroughs
- R068: Always-invested + danger detection paradigm (4-8x improvement)
- R069/R070: 5-seed ensemble with ≥4/5 voting — single-seed retired permanently
- R071: Record +7,128% compound, 4/4 folds beat B&H, fold 2 +261% vs B&H +228%
- R065b: Sigma clip [0.0005, 0.005] — a tiny detail with 2.5x compound impact
Key disappointments
- R063: Shorts don't work — long-only confirmed
- R061: Trailing/dynamic/hybrid exits all lose to fixed — simpler wins
- R072: Set B fails fold 3 by ~6% — not yet bulletproof across seed inits
The Labeling Dead-End
Eight rounds, thirteen advanced labeling methods tested in a single day — every one failed to beat plain 3-class tb3_vol. The dead-end that proved the foundation was right.
After R071 set a record but R072 revealed instability, the natural hypothesis was 'the labels can be smarter.' R073-R080 is the project's most concentrated burst of experimentation: thirteen different labeling schemes tested in a single calendar day. They include speed-weighted labels, 5-class labels, efficiency-weighted labels, DSR (deflated Sharpe), filtered labels, next-event labels, an RL meta-decision, Self-Distillation Iterative (SDIL), Conditional Barrier Asymmetry (CBA), Multi-Resolution Consensus (MRC), Trend Scanning, and Path-Quality Weighted (PQW) labels.
The result: none of the thirteen methods beat plain 3-class tb3_vol. Two insights survive. First, regression labels fundamentally fail with the GRU — CrossEntropy concentrates gradients on the directional decision in a way no continuous loss can match. Second, timeouts (~24% of labels) are not noise but essential information; every 'cleaner' labeling scheme that suppresses or downweights timeouts loses more signal than it gains.
This epoch is the canonical example of a productive dead-end. The team spent eight rounds learning that the foundation was already as good as it could be, which freed the next epoch to focus on production deployment.
Key breakthroughs
- Negative result confirmed: 13 labeling methods tested, none beat tb3_vol
- Insight: Timeouts (24% of labels) carry essential information
- Insight: CrossEntropy concentrates gradients in a way no regression variant can replicate
- Project gains permission to stop searching for better labels
Key disappointments
- Regression labels (R073, R080): fundamentally fail with GRU
- RL meta-decision (R075), SDIL (R076), CBA (R077): high-engineering-effort, zero improvement
- Trend Scanning (R079), MRC (R078): theoretically motivated, empirically inferior
Production Validation & The Equity Bug Reckoning
R083 uncovers a critical equity double-counting bug that had inflated every R068-R082 result by 2x — and then R094 discovers the daily SMA filter that becomes the most important component in the entire stack.
R081 combines every breakthrough into a single 'ultimate production config'. R082 begins fine-grid threshold tuning. Then R083 — the second great bug-fix moment of the project.
In the always-invested notebooks (R068 onward), the equity computation at the end of each fold force-closed the open position and then computed equity[-1] = capital + position_value, double-counting the last trade. The reported +89,000% compound was actually +4,000-7,600%. Every 'record' since R068 had to be re-evaluated. The corrected R084 result is +7,641% best-set / +3,978% worst-set against a B&H baseline of +1,005%, still a 4-7.6x outperformance.
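The double-counting pattern can be reproduced in a few lines. This is a reconstruction from the description above, not the notebook code: it assumes the force-close moves the position's value into capital while a stale position_value is still added afterwards. Function and parameter names are hypothetical.

```python
def final_equity_buggy(capital, units, last_price):
    """Force-close the position, then add its value a second time (the bug)."""
    position_value = units * last_price
    capital += position_value          # force-close: proceeds move into capital
    return capital + position_value    # stale position_value counted again


def final_equity_fixed(capital, units, last_price):
    """Force-close the position and clear it before reporting equity."""
    capital += units * last_price
    units = 0.0                        # nothing is held after the close
    return capital + units * last_price


buggy_equity = final_equity_buggy(capital=0.0, units=2.0, last_price=150.0)
fixed_equity = final_equity_fixed(capital=0.0, units=2.0, last_price=150.0)
```

A fully invested account ends each fold with capital near zero and all value in the position, so the buggy version reports 600 where the true equity is 300. Because the error repeats in every fold, it multiplies through a compound metric.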
R085-R087 explore whether anything obvious can lift the corrected numbers. R085/R085b add sigma, tp_dist_pct, sl_dist_pct as features — all hurt (-24% to -45%). R086/R087 sweep regularization and model size — all hurt or fail to improve cross-set consistency. R089-R092 try LSTM swap, multi-lookback, 9-seed ensembles — none beat baseline.
Then R094 breaks the plateau in an unexpected direction. Trying SMA as a sanity baseline, the team discovers that simple SMA on its own crushes the GRU — and combining either with a DAILY SMA filter explodes performance: GRU+daily gives +720M%, SMA+daily gives +9.3B%. Every tested config with the daily filter beats B&H in 4/4 folds. The daily SMA filter is the single most important component the project has ever found.
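The text does not spell out the daily filter's rule, so the following is only a guess at its general shape: stay invested only while the latest completed daily close sits above its simple moving average. The 50-day window, the 'price above SMA' rule, and every name here are assumptions made for illustration.

```python
def sma(values, window):
    """Simple moving average of the last `window` values, or None if too few."""
    if len(values) < window:
        return None
    return sum(values[-window:]) / window


def daily_filter_allows(daily_closes, window=50):
    """True when the latest daily close is above its SMA (assumed rule)."""
    average = sma(daily_closes, window)
    return average is not None and daily_closes[-1] > average


def filtered_position(model_position, daily_closes, window=50):
    """Keep the intraday model's position only while the daily filter allows it."""
    return model_position if daily_filter_allows(daily_closes, window) else 0


uptrend = [1.0, 2.0, 3.0, 4.0, 5.0]
downtrend = [5.0, 4.0, 3.0, 2.0, 1.0]
in_uptrend = filtered_position(1, uptrend, window=3)
in_downtrend = filtered_position(1, downtrend, window=3)
```

Whatever the exact rule, the filter should use only completed daily bars; letting it read the current, unfinished day would leak future information into a 15-minute backtest.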
Key breakthroughs
- R083: Equity double-counting bug fixed — project finally has honest numbers (+7,641% best / +3,978% worst)
- R084: Always-invested + ensemble confirmed at 4-7.6x B&H
- R094: Daily SMA filter discovered — the single most important component (+720M% GRU+daily, +9.3B% SMA+daily)
- Negative confirmation R085-R092: more features, regularization, seeds, LSTM — nothing else helps
Key disappointments
- R068-R082 equity bug: every record from those rounds cut by ~2x
- R085-R087: Adding obvious features (sigma, distances) hurts
- R089-R092: LSTM, multi-lookback, 9-seed — no architecture improvement
Hybrid Strategies & Live Deployment (V2-V6)
Seventeen rounds turn validated backtests into five live production bots — including the V115_cmp and V66 strategies that crown the entire project.
With the GRU stack validated and the daily SMA filter discovered, R095-R111 are the project's deployment epoch. R095 tests ATR-adaptive exits — improves V4 by +53% but hurts V3/V6. R096 confirms multi-timeframe entry hurts. R097 confirms drawdown circuit breakers hurt. R098 confirms RSI/MACD/volume filters layered on top of SMA all hurt. The pattern is consistent: the daily SMA filter is essential, but additional filters add noise.
R099-R109 build out three production bots: V2 (pure GRU ensemble, +7,641% backtest), V3 (GRU + DailyRSI>80 + 10% trailing stop, +450B% backtest), V4 (3-regime adaptive V5.4 Robust-5, +30,174% backtest, +117% min α). All three are deployed to Hetzner.
R110-R111 are the climax. An external collaboration produces V115_cmp (combines GRU + peak_drop + ratchet + regime cooldown) and V66 (uniform thresholds + extreme cooldown 4,48 bars). R110 reproduces them to the dollar: V115_cmp at +168,759% compound / +238% min α; V66 at +70,576% compound / +245% min α (the all-time min-α record). R111 builds the HybridV115Trader class (~750 LOC) and V5+V6 join the live stack. Five servers, one shared GRU checkpoint, five strategy configurations.
Key breakthroughs
- Daily SMA filter integrated into V3/V4/V5/V6
- R110: V115_cmp validated at +168,759% compound (all-time compound record)
- R110: V66 validated at +245% min α (all-time robustness record)
- R111: 5 production servers live on Hetzner — V2/V3/V4/V5/V6
Key disappointments
- R095: ATR-adaptive exits improve V4 but hurt V3/V6 — no universal exit
- R096-R098: Multi-timeframe entry, DD breakers, RSI/MACD/volume filters all hurt
- Plateau confirmed: layering more filters on daily-SMA stack consistently reduces compound
V66 Refinement Exhaustion
Five rounds attacking the V66 hyperparameter surface — all confirmed that V66's specific configuration cannot be tuned further.
With V5/V6 deployed and posting numbers nobody dared to budget for, the natural next question was: can we extract more from the same GRU checkpoints? R112-R116 attack the V66 strategy layer from every angle. R114 implements confidence-weighted voting. R115 sweeps vote thresholds and min_votes parameters. R116 implements Kelly position sizing.
The result of five rounds is unanimous: V66's hyperparameters are at a local optimum that cannot be improved by any tweak inside the same parameter family. Confidence-weighted voting matches binary voting in compound but doesn't improve robustness. Vote-threshold sweeps confirm the existing config sits at the peak. Kelly sizing reduces drawdown but also reduces compound.
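The Kelly result is easier to interpret with the textbook formula in hand. The sketch below uses the standard binary-bet Kelly fraction f* = p - (1 - p) / b, where p is the win rate and b the average win divided by the average loss. How R116 estimated those inputs is not stated, so the inputs here are placeholders.

```python
def kelly_fraction(win_rate, payoff_ratio):
    """Kelly fraction f* = p - (1 - p) / b, floored at 0 (no position)."""
    if not 0.0 <= win_rate <= 1.0 or payoff_ratio <= 0.0:
        raise ValueError("win_rate must be in [0, 1] and payoff_ratio > 0")
    return max(0.0, win_rate - (1.0 - win_rate) / payoff_ratio)


modest_edge = kelly_fraction(win_rate=0.5, payoff_ratio=2.0)
no_edge = kelly_fraction(win_rate=0.4, payoff_ratio=1.0)
```

A coin-flip win rate with 2:1 payoffs gives a Kelly fraction of 0.25, and a negative edge gives 0. For any realistic edge the fraction is below 1, so moving from the 100% sizing chosen in R066 to Kelly sizing lowers exposure, which fits the observed pattern of smaller drawdowns and smaller compound returns.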
This is the second major dead-end of the project — and like R073-R080 before it, the value is in the negative result. The team formally closes the V66-refinement search and reframes the question.
Key breakthroughs
- Negative result confirmed: V66 hyperparameters cannot be improved within existing parameter family
- Confidence-weighted voting (R114): matches binary voting — no free lunch
- Vote threshold + min_votes sweep (R115): confirms current config is local optimum
- Kelly sizing (R116): trades compound for DD reduction with neutral min α
Key disappointments
- Five rounds of intricate engineering produced zero improvement to V66's headline numbers
- Kelly's promise of 'reduce DD without sacrificing compound' is empirically falsified for this strategy
Paradigm Shift — Exploring Orthogonal Alpha
After V66 refinement was exhausted, the project pivots to orthogonal strategies: mean reversion, XGBoost, multi-timeframe filtering, and portfolio composition.
R117-R120 mark the project's most recent strategic pivot. Having proven that the V66 stack cannot be improved within its own vocabulary, the team formalizes a new objective in ACTION_PLAN_2026.md: produce a candidate that beats V5/V6 by a meaningful margin AND is diversifying — uncorrelated enough with the existing stack to be deployed alongside it.
The four exploratory rounds attack four orthogonal hypotheses. R117 tests mean reversion (BB_revert, RSI_extreme, Z_score, MACD_div), the dual of the trend-following stack. R118 implements XGBoost as an alternative model class: REJECTED, GRU crushes trees 287×. R119 implements a multi-timeframe regime filter: REJECTED, V66 already encodes context. R120 is the portfolio-composition breakthrough: regime-aware allocation of V66 + MACD_div delivers 7 winners with 362,991% compound (5.1× V66) and matching 245% min α.
Alongside the experiments, the project hardens its validation infrastructure (Phase 0: 5-step gate). The era of 'a single notebook + a single seed' is formally over. The current strategic focus is concluding R120 analysis and building a real-time regime detector (R121) so the retroactive R120 alpha can be deployed.
Key breakthroughs
- R117: Mean reversion as orthogonal alpha source — MACD_div +179K% compound
- R120: Portfolio V66+MACD_div regime-aware — 7 configs PASS (362,991% compound, 245% min α, 16× more trades)
- Phase 0 validation infrastructure proposed (5-step gate)
- Project framing shifted from 'replace V6' to 'diversify alongside V6'
Key disappointments
- R118: XGBoost CRUSHED by GRU (245% vs 70,576% compound)
- R119: All 5 MTF filter variants degrade V66 — context already in features
- R120 winners are RETROACTIVE — need real-time regime detector (R121) for deploy