Journey · BTC Trading AI

Total rounds: 120
Span: 56 days (2026-03-27 to 2026-05-22)
Cadence: ~15 rounds/week (~2 rounds/day, sustained for 8 weeks)
Current champion: V6 V66 Cooldown(4,48): +70,576% compound / +245% min α (the robustness champion). V5 V115_cmp holds the absolute compound record at +168,759% / +238% min α. Both deployed live on Hetzner.
Biggest breakthrough: R068 Always-Invested Strategy (2026-04-10) — paradigm flip from 'selective trading' to 'stay invested like B&H, use model only for danger detection.' This 4-8x'd compound and made every subsequent record possible. Honorable mention: R094 Daily SMA filter, the single most important component (+720M% with vs +7,641% without).
Biggest dead-end: R073-R080 Labeling Innovation (8 rounds, 13 advanced methods). ALL failed to beat plain 3-class tb3_vol. Lesson: timeouts (24% of labels) carry essential information that 'cleaner' labels destroy, and CrossEntropy concentrates gradients better than any regression variant.
Current focus: Post-R120 paradigm shift — having exhausted V66 GRU refinement (R112-R116 confirmed no more juice), exploring orthogonal alpha: mean reversion (R117), XGBoost (R118), MTF (R119), portfolio composition (R120). Goal: diversify the live stack, not replace it.

Timeline: each block = one epoch, width ∝ number of rounds

Epochs

EPOCH 01

Foundations & First Failures

R000-R026 2026-03-27 to 2026-03-31

27 rounds chasing supervised learning on binary then volatility-adaptive triple-barrier labels — only to discover a sequence-gap bug that invalidated nearly every result.

The project opens with the most natural question in financial ML: can a sequence model predict the next move? R000 wires up the full pipeline — LSTM 2x128, lookback 60, binary_12 labels asking 'will price be higher in 1 hour?' — and gets answered with a brutal -36.5% PnL and a 45.2% win rate. Binary labels are noise. R001 swaps to triple-barrier labeling (TP=0.75%, SL=0.5%) and uncovers a deeper set of issues: the model was shorting on label=0 (which doesn't mean 'down', only 'not up'), class weights were missing, and the Sharpe formula was double-counting flat bars. R002 patches those, R003 simplifies to a 21K-parameter GRU to fight overfit. None of it works at fixed thresholds.

The pivot happens at R004: thresholds should scale with volatility. The new formula TP = k_up · σ, SL = k_dn · σ becomes the labeling backbone that survives the entire project. Twenty-three rounds (R004-R026) sweep k_up/k_dn from 1.8/1.2 to 25/15, alpha from 6 to 96, lookbacks from 20 to 120. R012 establishes GRU > LSTM as a permanent fact. R020-R022 prove multi-timeframe features add +795%. R025 validates walk-forward 4-fold. By R026 the project thinks it has a +998% champion (R022).
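The barrier rule TP = k_up · σ, SL = k_dn · σ can be sketched as a small labeling function. This is a minimal sketch under stated assumptions, not the project's code: σ is taken to be a fractional per-bar volatility, barriers are checked on closing prices only, and the 3-class encoding (SL=-1, timeout=0, TP=+1) is borrowed from the later R055 description. The function name and the `horizon` parameter are hypothetical.

```python
from typing import List


def triple_barrier_label(prices: List[float], i: int, sigma: float,
                         k_up: float, k_dn: float, horizon: int) -> int:
    """Return +1 if the TP barrier is hit first, -1 if SL is hit first, 0 on timeout.

    Assumption: sigma is a fractional volatility estimate at bar i, so both
    barriers widen and narrow with volatility instead of being fixed percentages.
    """
    entry = prices[i]
    tp = entry * (1.0 + k_up * sigma)  # take-profit barrier: k_up * sigma above entry
    sl = entry * (1.0 - k_dn * sigma)  # stop-loss barrier: k_dn * sigma below entry
    for p in prices[i + 1:i + 1 + horizon]:
        if p >= tp:
            return 1
        if p <= sl:
            return -1
    return 0  # neither barrier touched within the horizon


# Hypothetical prices: with sigma = 0.5%, k_up = 2 and k_dn = 1,
# TP sits at 101.0 and SL at 99.5.
example = triple_barrier_label([100.0, 100.5, 101.3, 99.0], 0, 0.005, 2.0, 1.0, 3)
```

With the same k values, a quiet market produces tight barriers and a volatile market produces wide ones, which is the property the fixed TP=0.75% / SL=0.5% labels lacked.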

Then R027 happens. Investigating a discrepancy between training and production, the team discovers that filtering out 'timeout' bars during training created sequence gaps that do not exist in live data. Every result in R004-R026 was inflated, sometimes by 60x. The recorded +998% becomes an honest +15.9%. Twenty-three rounds of work are reduced to a single durable insight (volatility-adaptive barriers) and a brutal lesson about pipeline parity.
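A toy illustration of the sequence-gap problem, assuming sequences are built as sliding windows over bar indices. The labels and helper names below are hypothetical; the point is only that dropping timeout bars before windowing produces sequences that skip bars, while live inference always sees consecutive bars.

```python
from typing import List


def build_windows(bar_ids: List[int], lookback: int) -> List[List[int]]:
    """Slide a fixed-length window over whatever bars are passed in."""
    return [bar_ids[end - lookback:end] for end in range(lookback, len(bar_ids) + 1)]


def has_gap(window: List[int]) -> bool:
    """True if two neighbouring entries of the window are not adjacent bars."""
    return any(b - a != 1 for a, b in zip(window, window[1:]))


labels = [1, 0, -1, 0, 0, 1, 1, -1]  # hypothetical labels; 0 marks a timeout bar

# Training path that reproduces the bug: timeouts are dropped BEFORE windowing.
kept = [i for i, y in enumerate(labels) if y != 0]
train_windows = build_windows(kept, lookback=3)

# Live path: every bar arrives, so windows are always consecutive.
live_windows = build_windows(list(range(len(labels))), lookback=3)

train_gaps = [has_gap(w) for w in train_windows]
live_gaps = [has_gap(w) for w in live_windows]
```

A parity check of this kind (assert that no training window contains a gap that live data could not produce) is one way to catch the mismatch before it reaches a results table.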

Key breakthroughs

R004: Volatility-adaptive triple barrier (k·σ) replaces fixed thresholds — the foundational labeling scheme that survives the next 116 rounds
R012: GRU consistently beats LSTM — settled architecturally for the rest of the project
R020-R022: Multi-timeframe (15min aux features) gives +795% (honest +15.9% post-fix, but the directional signal was real)
R025: Walk-forward 4-fold validation methodology established

Key disappointments

R000-R003: Binary and fixed triple-barrier labels are noise — no amount of architecture tuning rescues them
R027: The sequence-gap bug invalidates ~20 rounds of recorded 'wins' — a humbling reset

Exit state: Honest best result: tb_vol_10_7_a24 GRU lb=20 → +15.9% PnL, 87 trades, WR 46%, PF 1.53 over walk-forward. Volatility-adaptive labels established as foundation.

EPOCH 02

Post-Fix Recalibration & The MFE/MAE Pivot

R027-R045 2026-03-31 to 2026-04-03

After the bug fix flatlined returns, the project pivots from classification to regression on MFE/MAE, finds the right loss function (mse_ratio), and posts the first walk-forward-validated +97.9% across 4 folds.

With the sequence-gap bug exorcised, the team has to rebuild credibility against honest numbers. R027-R033 try swing labels and a dual-head architecture; R033 hits +17.6% with the dual head. R034 attempts pure regression and finds an unusable combination — WR 75% and PF 3.56, but only 16 trades across the whole walk-forward. R035 explores 1-minute base bars and immediately blows up GPU memory.

R036 introduces what becomes the second foundational idea: predict MFE (maximum favorable excursion) and MAE (maximum adverse excursion) as continuous values. DualHeadGRU + MSE on both heads gives +8.3%; R036b tunes TP/SL to +9.4%; R036c shows the model knows when NOT to enter, losing only -1.2% in a bear market where B&H loses -23%. The 'knows when to sit out' result is the project's first hint that the model has real, asymmetric edge.
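The two regression targets can be computed directly from a price path. The sketch below assumes a long position entered at the close of bar `i`, excursions measured on closes only, and both values expressed as non-negative fractions of the entry price; the function name and `horizon` argument are illustrative rather than taken from the project.

```python
from typing import List, Tuple


def mfe_mae(prices: List[float], i: int, horizon: int) -> Tuple[float, float]:
    """Return (MFE, MAE) for a long entry at bar i over the next `horizon` bars."""
    future = prices[i + 1:i + 1 + horizon]
    if not future:
        raise ValueError("no future bars available after index i")
    entry = prices[i]
    mfe = max(max(future) / entry - 1.0, 0.0)  # best unrealised gain
    mae = max(1.0 - min(future) / entry, 0.0)  # worst unrealised loss
    return mfe, mae


# Hypothetical path: price peaks 2% above entry and dips 3% below it.
example_mfe, example_mae = mfe_mae([100.0, 102.0, 97.0, 101.0], 0, 3)
```

Predicting both values gives a risk-reward picture per bar: a high predicted MFE with a low predicted MAE is an attractive entry, while the reverse is a reason to stay flat.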

R038 then performs the single most important loss-function experiment. Standard MSE produces uniform predictions (MFE/MAE ratio ≈ 1.0) — the model cheats by predicting the mean. Four losses are compared: standard MSE (-4.9%), mse_ratio (+16.2%), weighted (-13.6%, hacks loss), and asymmetric (-0.6%). mse_ratio is the ONLY viable loss. R037 then validates mse_ratio + MFE/MAE in proper walk-forward across 4 folds (2018-2026) and posts +97.9% total / +9.3% worst fold / WR 54% / PF 1.49.

Key breakthroughs

R036: MFE/MAE dual-head regression introduced — the model learns asymmetric risk-reward, not just direction
R038: mse_ratio loss function discovered — the ONLY loss that doesn't collapse to a constant prediction
R036c: First demonstration that the model adds value by knowing when to stay flat (-1.2% vs B&H -23% in bear)
R037: First walk-forward-validated +97.9% across 4 folds — credibility restored

Key disappointments

R034: Pure regression gives WR 75% / PF 3.56 but only 16 trades — too sparse to compound
R035: 1-minute base bars don't help and cause OOM
R038 weighted/asymmetric losses: 'Smart' loss variants either hack the loss or over-conserve

Exit state: DualHeadGRU + mse_ratio + 15min multiframe, H120, lb=20, 2 epochs → +97.9% total walk-forward, +9.3% worst fold. MaxDD 37% is too high for production, but the foundation is real.

EPOCH 03

The Event-Driven Revolution

R046-R053 2026-04-03 to 2026-04-06

Replacing consecutive-bar sequences with event-driven sequences (filtered by volatility) produced the biggest single jump in performance of the entire project: +685% summed across 4 folds.

The lurking problem after R037 was sample efficiency: 95% of 5-minute BTC bars carry essentially no information, but the GRU was being asked to learn from all of them equally. R046 introduces the most important architectural reframe: train only on 'interesting' bars, defined as bars where ATR > 2x the rolling mean OR |returns| > 90th percentile. About 10.5% of bars survive the vol_tight filter.
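The event rule described above (ATR above 2x its rolling mean, or an absolute return above the 90th percentile) can be sketched as a boolean mask. Several details are assumptions: the rolling mean uses only the preceding `window` bars, the window length is arbitrary, and the return threshold is passed in so it can be fitted on training data only. All names are hypothetical.

```python
import math
from typing import List


def nearest_rank_percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list, with pct in (0, 1]."""
    ordered = sorted(values)
    rank = max(math.ceil(pct * len(ordered)), 1)
    return ordered[rank - 1]


def event_mask(atr: List[float], returns: List[float], ret_threshold: float,
               window: int = 20, atr_mult: float = 2.0) -> List[bool]:
    """Flag bars where ATR spikes above its trailing mean OR the return is large."""
    mask = []
    for t in range(len(atr)):
        history = atr[max(0, t - window):t]  # trailing bars only, current bar excluded
        atr_spike = bool(history) and atr[t] > atr_mult * (sum(history) / len(history))
        big_move = abs(returns[t]) > ret_threshold
        mask.append(atr_spike or big_move)
    return mask


# Hypothetical series: one ATR spike at bar 4 and one large return at bar 5.
atr = [1.0, 1.0, 1.0, 1.0, 3.0, 1.0]
rets = [0.001, -0.002, 0.001, 0.0, 0.001, -0.05]
mask = event_mask(atr, rets, ret_threshold=0.01, window=3)
```

Only the bars where the mask is true would be stacked into the GRU's input sequence, so a lookback of 100 covers 100 events rather than 100 consecutive candles.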

The results are dramatic. The model now sees a sequence of regime transitions rather than a flood of mostly-flat candles. R046-R053 sweep event filters, lookbacks, and feature sets. By R053 the best config — binary labels + event-driven sequences + lookback=100 + vol_ultra filter — posts +685% summed across 4 walk-forward folds. That is roughly 7x the previous walk-forward record and ~1.7x B&H. 15min is confirmed as the optimal base timeframe.

This epoch is short (8 rounds, 3 days) but represents the single highest-leverage methodological change in the project. Every subsequent breakthrough is built on top of event-driven sequences.

Key breakthroughs

R046: Event-driven sequences introduced — 10.5% of bars selected by volatility, massively improving signal density
R047/R053: Lookback=100 event bars and vol_ultra filter give +685% summed across 4 walk-forward folds
Walk-forward optimization methodology stabilized
15min base timeframe confirmed optimal vs 5min, 10min, 1h

Key disappointments

Consecutive-bar sequences definitively retired
Most 'event filter' variants underperform vol_tight

Exit state: Event-driven sequences + vol_tight + lookback=100 + GRU 2x128 + binary labels → +685% summed across 4 folds. Sample efficiency problem solved.

EPOCH 04

Feature & Label Consolidation

R054-R060 2026-04-06 to 2026-04-08

Seven rounds methodically lock in the building blocks: full-sequence MinMax normalization, 3-class labels with weighted CrossEntropy, 36 clean features, and the vol_tight + lb=300 + k25/15 sweet spot.

With event-driven sequences proven, R054-R060 settle every component decision that will define the V2-V6 production stack. R054/054c/054d run three independent normalization studies and unanimously confirm full per-sequence MinMax — R054c posts +791% compound at lb=300. R055 tests 3-class labels (SL=-1, timeout=0, TP=+1) and wins +824% vs +685% binary; the key is CrossEntropyLoss with ce_signal weights [2.0, 0.5, 2.0].
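Both decisions in this paragraph are easy to show in miniature. The sketch below is plain Python, not the project's training code. It assumes "per-sequence MinMax" means each feature column is scaled using the min and max of that one sequence, and it assumes the class order (SL, timeout, TP) for the weights [2.0, 0.5, 2.0]. The loss is the per-sample weighted cross-entropy before any batch averaging.

```python
import math
from typing import List, Sequence


def minmax_per_sequence(seq: List[List[float]]) -> List[List[float]]:
    """Scale each feature column to [0, 1] using only this sequence's own range."""
    n_feat = len(seq[0])
    lo = [min(row[f] for row in seq) for f in range(n_feat)]
    hi = [max(row[f] for row in seq) for f in range(n_feat)]
    return [
        [(row[f] - lo[f]) / (hi[f] - lo[f]) if hi[f] > lo[f] else 0.0
         for f in range(n_feat)]
        for row in seq
    ]


def weighted_cross_entropy(logits: Sequence[float], target: int,
                           weights: Sequence[float] = (2.0, 0.5, 2.0)) -> float:
    """Class-weighted cross-entropy for one sample.

    Assumed class order: index 0 = SL, 1 = timeout, 2 = TP.
    """
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -weights[target] * (logits[target] - log_z)


scaled = minmax_per_sequence([[1.0, 10.0], [3.0, 10.0], [2.0, 10.0]])
loss_timeout = weighted_cross_entropy([0.0, 0.0, 0.0], target=1)
loss_tp = weighted_cross_entropy([0.0, 0.0, 0.0], target=2)
```

With these weights a mistake on an SL or TP bar costs four times as much as the same mistake on a timeout bar, so timeouts stay in the training set but do not dominate the gradient.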

R056 runs permutation importance + correlation + ablation studies on the original 47 features, identifying 11 actively harmful features (volatility_60, volume_ratio, hour_sin, MACD, ema crosses, etc.) — the 'clean 36' feature set is born. R057 confirms clean 36 is more robust than positive_only 24. R058 introduces the compound metric (product of (1+fold)) and posts +4,783% compound. R059 combines 3-class + clean 36 features. R060 does the definitive event-filter grid and confirms vol_tight + k25/15 + thr=0.40 as the sweet spot at +4,021% compound.
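The compound metric is the product of (1 + fold return) over the folds, minus one. The sketch below contrasts it with the summed metric used in earlier epochs; the fold returns are made-up numbers chosen only to show how the two diverge.

```python
from typing import List


def compound_return(fold_returns: List[float]) -> float:
    """Product of (1 + r) over folds, minus 1, with r as a fraction (1.0 = +100%)."""
    total = 1.0
    for r in fold_returns:
        total *= 1.0 + r
    return total - 1.0


def summed_return(fold_returns: List[float]) -> float:
    """The earlier metric: a plain sum of per-fold returns."""
    return sum(fold_returns)


# Hypothetical folds: +100%, +50%, -20%, +200%.
folds = [1.0, 0.5, -0.2, 2.0]
compound = compound_return(folds)  # 2.0 * 1.5 * 0.8 * 3.0 - 1 = 6.2, i.e. +620%
summed = summed_return(folds)      # 3.3, i.e. +330%
```

Because the compound figure multiplies across folds, a single losing fold drags down the whole result, which makes it a stricter measure of consistency than the sum.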

This epoch is the project's least glamorous and arguably its most important. None of these rounds 'discovered' anything; they consolidated and stress-tested the discoveries of the previous epoch.

Key breakthroughs

R054c: Full per-sequence MinMax normalization confirmed (3 independent tests) — record +791% compound
R055: 3-class labels with ce_signal [2.0, 0.5, 2.0] beat binary +824% vs +685%
R056: 11 harmful features identified; clean 36 feature set established
R058/R060: Compound metric introduced and vol_tight + lb=300 + k25/15 sweet spot locked in (+4,783%)

Key disappointments

Positive-only 24 features: high-quality trades but less robust than clean 36
R058 single-seed variance: fold 0 varies +86% to +262% — foreshadowing ensemble work

Exit state: Locked-in stack: clean 36 features + per-sequence MinMax + 3-class tb3_vol + ce_signal weights + vol_tight + lb=300 + GRU 2x128. Best compound +4,783% but with concerning seed variance.

EPOCH 05

Always-Invested Revolution & The R071 Record

R061-R072 2026-04-09 to 2026-04-11

A twelve-round sprint that rebuilt the trading layer from exit modes upward — culminating in R068's paradigm flip to 'always invested + ensemble danger detection' and R071's record +7,128% compound.

R061-R067 systematically refine the trading layer. R061 compares fixed/dynamic/trailing/hybrid/always exits and finds fixed simplest and best (+3,440%). R062 confirms multi-timeframe features don't help once lb=300 is in place. R063 retires shorts (P(SL) isn't precise enough) and SL-signal exit. R064 settles expanding-window over sliding. R065 confirms 15min over every alternative and discovers that sigma clip [0.0005, 0.005] is essential — removing it cuts compound by 2.5x. R066 picks 100% position sizing. R067 finds that requiring 3-4 consecutive event-bars in agreement improves trade quality +47%.

Then R068 changes everything. Rather than asking 'when should the model enter?', it asks 'why is the model selectively trading at all?' Answer: stay invested like B&H by default, use the model only to detect dangerous regimes and exit. This single reframing captures all of B&H's upside while letting the model add asymmetric protection. It is the single biggest conceptual breakthrough in the project.
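The reframing can be shown with a toy equity loop. This is a deliberately simplified sketch: it assumes the danger flag for each bar is known before that bar's return is realised, ignores fees and slippage, and uses invented returns. The function names are hypothetical.

```python
from typing import List


def buy_and_hold_equity(returns: List[float]) -> float:
    """Equity multiple from holding through every bar."""
    equity = 1.0
    for r in returns:
        equity *= 1.0 + r
    return equity


def always_invested_equity(returns: List[float], danger: List[bool]) -> float:
    """Hold by default; step aside only on bars flagged as dangerous."""
    equity = 1.0
    for r, flagged in zip(returns, danger):
        if not flagged:  # invested unless the model raises a danger flag
            equity *= 1.0 + r
    return equity


# Hypothetical bars: +10%, -20%, +5%, with the drawdown bar flagged.
rets = [0.10, -0.20, 0.05]
flags = [False, True, False]
bh = buy_and_hold_equity(rets)            # 1.10 * 0.80 * 1.05
ai = always_invested_equity(rets, flags)  # 1.10 * 1.05
```

The model's job shrinks from predicting every profitable entry to avoiding a small number of bad regimes, and any bar it leaves unflagged earns exactly the B&H return.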

R069 introduces 5-seed-per-config validation (single-seed results revealed as unreliable). R070 confirms voting (≥4/5 agree) beats simple averaging. R071 then assembles the full stack and posts +7,128% compound, beating B&H in 4/4 folds. R072 validates across three seed sets: set A passes 4/4, set B 3/4, set C 4/4.
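The difference between voting and averaging shows up when a minority of seeds is very confident. In the sketch below the probability threshold of 0.5 and the example probabilities are assumptions for illustration; only the "at least 4 of 5 agree" rule comes from the text.

```python
from typing import List


def vote_signal(seed_probs: List[float], threshold: float = 0.5,
                min_votes: int = 4) -> bool:
    """Fire only if at least `min_votes` seeds individually exceed the threshold."""
    votes = sum(1 for p in seed_probs if p >= threshold)
    return votes >= min_votes


def average_signal(seed_probs: List[float], threshold: float = 0.5) -> bool:
    """Fire if the mean probability across seeds exceeds the threshold."""
    return sum(seed_probs) / len(seed_probs) >= threshold


# Hypothetical outputs: three confident seeds pull the mean to 0.58,
# but only 3 of 5 seeds actually agree.
probs = [0.9, 0.9, 0.9, 0.1, 0.1]
by_average = average_signal(probs)
by_vote = vote_signal(probs)
```

Voting ignores how loud each seed is and counts only how many agree, which makes the ensemble less sensitive to one or two overconfident initialisations.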

Key breakthroughs

R068: Always-invested + danger detection paradigm (4-8x improvement)
R069/R070: 5-seed ensemble with ≥4/5 voting — single-seed retired permanently
R071: Record +7,128% compound, 4/4 folds beat B&H, fold 2 +261% vs B&H +228%
R065b: Sigma clip [0.0005, 0.005] — a tiny detail with 2.5x compound impact

Key disappointments

R063: Shorts don't work — long-only confirmed
R061: Trailing/dynamic/hybrid exits all lose to fixed — simpler wins
R072: Set B fails fold 3 by ~6% — not yet bulletproof across seed inits

Exit state: R071 always-invested + ensemble voting: +7,128% compound, 4/4 vs B&H; seed set B passes only 3/4. Strategy layer essentially solved; remaining question is whether labels can be improved further.

EPOCH 06

The Labeling Dead-End

R073-R080 2026-04-11

Eight rounds, thirteen advanced labeling methods tested in a single day — every one failed to beat plain 3-class tb3_vol. The dead-end that proved the foundation was right.

After R071 set a record but R072 revealed instability, the natural hypothesis was 'the labels can be smarter.' R073-R080 is the project's most concentrated burst of experimentation: thirteen different labeling schemes tested in a single calendar day. The schemes include speed-weighted labels, 5-class labels, efficiency-weighted labels, DSR (deflated Sharpe), filtered labels, next-event labels, RL meta-decision, Self-Distillation Iterative (SDIL), Conditional Barrier Asymmetry (CBA), Multi-Resolution Consensus (MRC), Trend Scanning, and Path-Quality Weighted (PQW).

The result: none of the thirteen methods beat plain 3-class tb3_vol. Two insights survive. First, regression labels fundamentally fail with the GRU — CrossEntropy concentrates gradients on the directional decision in a way no continuous loss can match. Second, timeouts (~24% of labels) are not noise but essential information; every 'cleaner' labeling scheme that suppresses or downweights timeouts loses more signal than it gains.

This epoch is the canonical example of a productive dead-end. The team spent eight rounds learning that the foundation was already as good as it could be, which freed the next epoch to focus on production deployment.

Key breakthroughs

Negative result confirmed: 13 labeling methods tested, none beat tb3_vol
Insight: Timeouts (24% of labels) carry essential information
Insight: CrossEntropy concentrates gradients in a way no regression variant can replicate
Project gains permission to stop searching for better labels

Key disappointments

Regression labels (R073, R080): fundamentally fail with GRU
RL meta-decision (R075), SDIL (R076), CBA (R077): high-engineering-effort, zero improvement
Trend Scanning (R079), MRC (R078): theoretically motivated, empirically inferior

Exit state: 3-class tb3_vol with ce_signal [2.0, 0.5, 2.0] is the definitive labeling scheme; search formally closed.

EPOCH 07

Production Validation & The Equity Bug Reckoning

R081-R094 2026-04-12 to 2026-04-15

R083 uncovers a critical equity double-counting bug that had inflated every R068-R082 result by 2x — and then R094 discovers the daily SMA filter that becomes the most important component in the entire stack.

R081 combines every breakthrough into a single 'ultimate production config'. R082 begins fine-grid threshold tuning. Then R083 — the second great bug-fix moment of the project.

In the always-invested notebooks (R068 onward), the equity computation at the end of each fold force-closed the open position and then computed equity[-1] = capital + position_value, double-counting the last trade. The reported +89,000% compound was actually +4,000-7,600%. Fifteen rounds of 'records' (R068-R082) had to be re-evaluated. The corrected R084 result is +7,641% best-set / +3,978% worst-set against a B&H baseline of +1,005%, still a 4-7.6x outperformance.
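The double-counting pattern is easy to reproduce in a few lines. The sketch below is a reconstruction from the description above, not the notebook code: the function names and the units/price representation are assumptions.

```python
def final_equity_buggy(capital: float, units: float, last_price: float) -> float:
    """Reproduces the described bug: close the position, then add its value again."""
    position_value = units * last_price
    capital += position_value         # force-close: sale proceeds move into cash
    return capital + position_value   # bug: the closed position is counted twice


def final_equity_fixed(capital: float, units: float, last_price: float) -> float:
    """Close the position and zero it before computing final equity."""
    capital += units * last_price     # force-close: sale proceeds move into cash
    units = 0.0                       # nothing is held after the close
    return capital + units * last_price


# Hypothetical end of fold: fully invested, no cash, 2 units at a price of 50.
buggy = final_equity_buggy(0.0, 2.0, 50.0)
fixed = final_equity_fixed(0.0, 2.0, 50.0)
```

In this fully invested example the buggy figure is exactly double the true one, and because the error occurs at the end of every fold it multiplies through the compound metric.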

R085-R087 explore whether anything obvious can lift the corrected numbers. R085/R085b add sigma, tp_dist_pct, sl_dist_pct as features — all hurt (-24% to -45%). R086/R087 sweep regularization and model size — all hurt or fail to improve cross-set consistency. R089-R092 try LSTM swap, multi-lookback, 9-seed ensembles — none beat baseline.

Then R094 breaks the plateau in an unexpected direction. Trying SMA as a sanity baseline, the team discovers that simple SMA on its own crushes the GRU — and combining either with a DAILY SMA filter explodes performance: GRU+daily gives +720M%, SMA+daily gives +9.3B%. Every tested config with the daily filter beats B&H in 4/4 folds. The daily SMA filter is the single most important component the project has ever found.
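The text does not spell out the daily filter's rule or window. A common reading, assumed here purely for illustration, is "allow long exposure only while the latest daily close is above its N-day simple moving average". The window length and function names below are hypothetical.

```python
from typing import List, Optional


def sma(values: List[float], window: int) -> Optional[float]:
    """Simple moving average of the last `window` values, or None if too few."""
    if len(values) < window:
        return None
    return sum(values[-window:]) / window


def daily_sma_allows_long(daily_closes: List[float], window: int = 50) -> bool:
    """Assumed rule: permit exposure only when the last daily close is above its SMA."""
    average = sma(daily_closes, window)
    if average is None:
        return False  # not enough history: stay out rather than guess
    return daily_closes[-1] > average


# Hypothetical daily closes with a 3-day window.
uptrend_ok = daily_sma_allows_long([10.0, 11.0, 12.0, 13.0], window=3)    # 13 > 12
downtrend_ok = daily_sma_allows_long([13.0, 12.0, 11.0, 10.0], window=3)  # 10 < 11
```

Under this reading the filter acts as a slow regime gate on top of the intraday signal, which would be consistent with it helping both the GRU and the plain SMA strategy.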

Key breakthroughs

R083: Equity double-counting bug fixed — project finally has honest numbers (+7,641% best / +3,978% worst)
R084: Always-invested + ensemble confirmed at 4-7.6x B&H
R094: Daily SMA filter discovered — the single most important component (+720M% GRU+daily, +9.3B% SMA+daily)
Negative confirmation R085-R092: more features, regularization, seeds, LSTM — nothing else helps

Key disappointments

R083 equity bug: fifteen rounds of records (R068-R082) cut by ~2x
R085-R087: Adding obvious features (sigma, distances) hurts
R089-R092: LSTM, multi-lookback, 9-seed — no architecture improvement

Exit state: Corrected stack: GRU 5-seed ensemble + always-invested + 36 features + 3-class tb3_vol + vol_tight + lb=300, validated at +7,641% best / +3,978% worst. Daily SMA filter newly discovered.

EPOCH 08

Hybrid Strategies & Live Deployment (V2-V6)

R095-R111 2026-04-15 to 2026-04-20

Seventeen rounds turn validated backtests into five live production bots — including the V115_cmp and V66 strategies that crown the entire project.

With the GRU stack validated and the daily SMA filter discovered, R095-R111 are the project's deployment epoch. R095 tests ATR-adaptive exits — improves V4 by +53% but hurts V3/V6. R096 confirms multi-timeframe entry hurts. R097 confirms drawdown circuit breakers hurt. R098 confirms RSI/MACD/volume filters layered on top of SMA all hurt. The pattern is consistent: the daily SMA filter is essential, but additional filters add noise.

R099-R109 build out three production bots: V2 (pure GRU ensemble, +7,641% backtest), V3 (GRU + DailyRSI>80 + 10% trailing stop, +450B% backtest), V4 (3-regime adaptive V5.4 Robust-5, +30,174% backtest, +117% min α). All three deployed to Hetzner.

R110-R111 are the climax. An external collaboration produces V115_cmp (combines GRU + peak_drop + ratchet + regime cooldown) and V66 (uniform thresholds + extreme cooldown 4,48 bars). R110 reproduces them to the dollar: V115_cmp at +168,759% compound / +238% min α; V66 at +70,576% compound / +245% min α (the all-time min-α record). R111 builds the HybridV115Trader class (~750 LOC) and V5+V6 join the live stack. Five servers, one shared GRU checkpoint, five strategy configurations.

Key breakthroughs

Daily SMA filter integrated into V3/V4/V5/V6
R110: V115_cmp validated at +168,759% compound (all-time compound record)
R110: V66 validated at +245% min α (all-time robustness record)
R111: 5 production servers live on Hetzner — V2/V3/V4/V5/V6

Key disappointments

R095: ATR-adaptive exits improve V4 but hurt V3/V6 — no universal exit
R096-R098: Multi-timeframe entry, DD breakers, RSI/MACD/volume filters all hurt
Plateau confirmed: layering more filters on daily-SMA stack consistently reduces compound

Exit state: 5 production servers live: V2 (+7,641%), V3 (+450B%), V4 (+30,174% / +117% min α), V5 (+168,759% / +238% min α), V6 (+70,576% / +245% min α). Real capital deployed (€125 across 5 sub-accounts).

EPOCH 09

V66 Refinement Exhaustion

R112-R116 2026-04-20 to 2026-05-05

Five rounds attacking the V66 hyperparameter surface — all confirmed that V66's specific configuration cannot be tuned further.

With V5/V6 deployed and posting numbers nobody dared to budget for, the natural next question was: can we extract more from the same GRU checkpoints? R112-R116 attack the V66 strategy layer from every angle. R114 implements confidence-weighted voting. R115 sweeps vote thresholds and min_votes parameters. R116 implements Kelly position sizing.

The result of five rounds is unanimous: V66's hyperparameters are at a local optimum that cannot be improved by any tweak inside the same parameter family. Confidence-weighted voting matches binary voting in compound but doesn't improve robustness. Vote-threshold sweeps confirm the existing config sits at the peak. Kelly sizing reduces drawdown but also reduces compound.
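For reference, the textbook Kelly fraction for a bet with win probability p and payoff ratio b (average win divided by average loss) is f* = p - (1 - p) / b. The sketch below implements that formula only; how R116 estimated p and b, and whether it used full or fractional Kelly, is not stated in the text, so the clamping to [0, 1] for a long-only, unlevered position is an assumption.

```python
def kelly_fraction(win_rate: float, payoff_ratio: float) -> float:
    """Textbook Kelly fraction f* = p - (1 - p) / b, clamped to [0, 1].

    win_rate: probability of a winning trade (p).
    payoff_ratio: average win divided by average loss (b), must be positive.
    """
    if payoff_ratio <= 0.0:
        raise ValueError("payoff_ratio must be positive")
    raw = win_rate - (1.0 - win_rate) / payoff_ratio
    return min(max(raw, 0.0), 1.0)  # no shorting and no leverage in this sketch


# Hypothetical edge: 50% win rate with wins twice the size of losses.
half_edge = kelly_fraction(0.5, 2.0)  # 0.5 - 0.5 / 2 = 0.25
no_edge = kelly_fraction(0.4, 1.0)    # negative raw value, clamped to 0.0
```

Any fraction below 1.0 leaves part of the capital uninvested, so relative to 100% position sizing it lowers drawdown and compound together, in line with the trade-off R116 reports.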

This is the second major dead-end of the project — and like R073-R080 before it, the value is in the negative result. The team formally closes the V66-refinement search and reframes the question.

Key breakthroughs

Negative result confirmed: V66 hyperparameters cannot be improved within existing parameter family
Confidence-weighted voting (R114): matches binary voting — no free lunch
Vote threshold + min_votes sweep (R115): confirms current config is local optimum
Kelly sizing (R116): trades compound for DD reduction with neutral min α

Key disappointments

Five rounds of intricate engineering produced zero improvement to V66's headline numbers
Kelly's promise of 'reduce DD without sacrificing compound' is empirically falsified for this strategy

Exit state: V66 hyperparameters formally confirmed at local optimum. The existing GRU+strategy vocabulary is exhausted.

EPOCH 10

Paradigm Shift — Exploring Orthogonal Alpha

R117-R120 2026-05-05 to 2026-05-22

After V66 refinement was exhausted, the project pivots to orthogonal strategies: mean reversion, XGBoost, multi-timeframe filtering, and portfolio composition.

R117-R120 mark the project's most recent strategic pivot. Having proven that the V66 stack cannot be improved within its own vocabulary, the team formalizes a new objective in ACTION_PLAN_2026.md: produce a candidate that beats V5/V6 by a meaningful margin AND is diversifying — uncorrelated enough with the existing stack to be deployed alongside it.

The four exploratory rounds attack four orthogonal hypotheses. R117 tests mean reversion (BB_revert, RSI_extreme, Z_score, MACD_div), the dual of the trend-following stack. R118 implements XGBoost as an alternative model class: REJECTED, GRU crushes trees 287×. R119 implements a multi-timeframe regime filter: REJECTED, V66 already encodes context. R120 is the portfolio-composition breakthrough: regime-aware allocation of V66 + MACD_div delivers 7 winners with 362,991% compound (5.1× V66) and matching 245% min α.

Alongside the experiments, the project hardens its validation infrastructure (Phase 0: 5-step gate). The era of 'a single notebook + a single seed' is formally over. The current strategic focus is concluding R120 analysis and building a real-time regime detector (R121) so the retroactive R120 alpha can be deployed.

Key breakthroughs

R117: Mean reversion as orthogonal alpha source — MACD_div +179K% compound
R120: Portfolio V66+MACD_div regime-aware — 7 configs PASS (362,991% compound, 245% min α, 16× more trades)
Phase 0 validation infrastructure proposed (5-step gate)
Project framing shifted from 'replace V6' to 'diversify alongside V6'

Key disappointments

R118: XGBoost CRUSHED by GRU (245% vs 70,576% compound)
R119: All 5 MTF filter variants degrade V66 — context already in features
R120 winners are RETROACTIVE — need real-time regime detector (R121) for deploy

Exit state: Five production bots live. V6 holds min-α crown (+245%); V5 holds compound crown (+168,759%). R120 proved portfolio diversification works; deployment is pending a real-time regime detector.