branch main commit 4f9d331 built last · R120 + R122: portfolio breakthrough + relaxed-filter notebook
Total rounds: 120
Span: 56 days (2026-03-27 to 2026-05-22)
Cadence: ~15 rounds/week (extraordinary cadence: ~2 rounds/day, sustained for 8 weeks)
Current champion: V6 V66 Cooldown(4,48): +70,576% compound / +245% min α (the robustness champion). V5 V115_cmp holds the absolute compound record at +168,759% / +238% min α. Both deployed live on Hetzner.
Biggest breakthrough: R068 Always-Invested Strategy (2026-04-10), a paradigm flip from 'selective trading' to 'stay invested like B&H, use model only for danger detection.' This multiplied compound returns 4-8x and made every subsequent record possible. Honorable mention: R094 Daily SMA filter, the single most important component (+720M% with vs +7,641% without).
Biggest dead-end: R073-R080 Labeling Innovation (8 rounds, 13 advanced methods). ALL failed to beat plain 3-class tb3_vol. Lesson: timeouts (24% of labels) carry essential information that 'cleaner' labels destroy, and CrossEntropy concentrates gradients better than any regression variant.
Current focus: Post-R120 paradigm shift. Having exhausted V66 GRU refinement (R112-R116 confirmed no more juice), the project is exploring orthogonal alpha: mean reversion (R117), XGBoost (R118), MTF (R119), portfolio composition (R120). Goal: diversify the live stack, not replace it.

Timeline (each block = one epoch, width ∝ number of rounds), 2026-03-27 (R000) to 2026-05-22 (R120):

  • R000-R026: Foundations & First Failures
  • R027-R045: Post-Fix Recalibration & The MFE/MAE Pivot
  • R046-R053: The Event-Driven Revolution
  • R054-R060: Feature & Label Consolidation
  • R061-R072: Always-Invested Revolution & The R071 Record
  • R073-R080: The Labeling Dead-End
  • R081-R094: Production Validation & The Equity Bug Reckoning
  • R095-R111: Hybrid Strategies & Live Deployment (V2-V6)
  • R112-R116: V66 Refinement Exhaustion
  • R117-R120: Paradigm Shift, Exploring Orthogonal Alpha

Epochs

EPOCH 01

Foundations & First Failures

R000-R026 2026-03-27 to 2026-03-31

27 rounds chasing supervised learning on binary then volatility-adaptive triple-barrier labels — only to discover a sequence-gap bug that invalidated nearly every result.

The project opens with the most natural question in financial ML: can a sequence model predict the next move? R000 wires up the full pipeline — LSTM 2x128, lookback 60, binary_12 labels asking 'will price be higher in 1 hour?' — and gets answered with a brutal -36.5% PnL and a 45.2% win rate. Binary labels are noise. R001 swaps to triple-barrier labeling (TP=0.75%, SL=0.5%) and uncovers a deeper set of issues: the model was shorting on label=0 (which doesn't mean 'down', only 'not up'), class weights were missing, and the Sharpe formula was double-counting flat bars. R002 patches those, R003 simplifies to a 21K-parameter GRU to fight overfit. None of it works at fixed thresholds.

The pivot happens at R004: thresholds should scale with volatility. The new formula TP = k_up · σ, SL = k_dn · σ becomes the labeling backbone that survives the entire project. Twenty-three rounds (R004-R026) sweep k_up/k_dn from 1.8/1.2 to 25/15, alpha from 6 to 96, lookbacks from 20 to 120. R012 establishes GRU > LSTM as a permanent fact. R020-R022 prove multi-timeframe features add +795%. R025 validates walk-forward 4-fold. By R026 the project thinks it has a +998% champion (R022).
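
The barrier rule TP = k_up · σ, SL = k_dn · σ can be sketched as a small labeling function. This is a minimal illustration, not the project's code: it assumes σ is a per-bar return volatility expressed as a fraction of price, that barriers are checked on closing prices only, and that outcomes are encoded as +1 (TP first), -1 (SL first) and 0 (timeout). The function name and the `horizon` parameter are hypothetical.

```python
def triple_barrier_label(prices, start, sigma, k_up, k_dn, horizon):
    """Label one entry bar with volatility-scaled barriers.

    Returns +1 if the take-profit barrier is touched first,
    -1 if the stop-loss barrier is touched first, 0 on timeout.
    Assumption: sigma is a fractional volatility (0.005 means 0.5%).
    """
    entry = prices[start]
    upper = entry * (1.0 + k_up * sigma)  # TP distance = k_up * sigma
    lower = entry * (1.0 - k_dn * sigma)  # SL distance = k_dn * sigma
    # Walk forward at most `horizon` bars after the entry bar.
    for price in prices[start + 1 : start + 1 + horizon]:
        if price >= upper:
            return 1
        if price <= lower:
            return -1
    return 0  # neither barrier touched within the horizon


if __name__ == "__main__":
    closes = [100.0, 100.5, 101.2, 99.0]
    print(triple_barrier_label(closes, 0, sigma=0.005, k_up=2.0, k_dn=1.0, horizon=3))
```

Because both barriers scale with σ, the same k_up/k_dn pair produces wide barriers in volatile regimes and tight ones in quiet regimes, which is the property fixed thresholds lack.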

Then R027 happens. Investigating a discrepancy between training and production, the team discovers that filtering out 'timeout' bars during training created sequence gaps that do not exist in live data. Every result in R004-R026 was inflated, sometimes by 60x. The reported +998% becomes an honest +15.9%. Twenty-three rounds of work are reduced to a single durable insight (volatility-adaptive barriers) and a brutal lesson about pipeline parity.

Key breakthroughs

  • R004: Volatility-adaptive triple barrier (k·σ) replaces fixed thresholds — the foundational labeling scheme that survives the next 116 rounds
  • R012: GRU consistently beats LSTM — settled architecturally for the rest of the project
  • R020-R022: Multi-timeframe (15min aux features) gives +795% (honest +15.9% post-fix, but the directional signal was real)
  • R025: Walk-forward 4-fold validation methodology established

Key disappointments

  • R000-R003: Binary and fixed triple-barrier labels are noise — no amount of architecture tuning rescues them
  • R027: The sequence-gap bug invalidates ~20 rounds of recorded 'wins' — a humbling reset
Exit state: Honest best result: tb_vol_10_7_a24 GRU lb=20 → +15.9% PnL, 87 trades, WR 46%, PF 1.53 over walk-forward. Volatility-adaptive labels established as foundation.
EPOCH 02

Post-Fix Recalibration & The MFE/MAE Pivot

R027-R045 2026-03-31 to 2026-04-03

After the bug fix flatlined returns, the project pivots from classification to regression on MFE/MAE, finds the right loss function (mse_ratio), and posts the first walk-forward-validated +97.9% across 4 folds.

With the sequence-gap bug exorcised, the team has to rebuild credibility against honest numbers. R027-R033 try swing labels and a dual-head architecture; R033 hits +17.6% with the dual head. R034 attempts pure regression and finds an unusable combination — WR 75% and PF 3.56, but only 16 trades across the whole walk-forward. R035 explores 1-minute base bars and immediately blows up GPU memory.

R036 introduces what becomes the second foundational idea: predict MFE (maximum favorable excursion) and MAE (maximum adverse excursion) as continuous values. DualHeadGRU + MSE on both heads gives +8.3%; R036b tunes TP/SL to +9.4%; R036c shows the model knows when NOT to enter, losing only -1.2% in a bear market where B&H loses -23%. The 'knows when to sit out' result is the project's first hint that the model has real, asymmetric edge.
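
The two regression targets can be computed as follows. This is a sketch under stated assumptions: a long position entered at the close of bar `start`, excursions measured on closing prices as fractions of the entry price, and both values floored at zero. The function name and the `horizon` argument are illustrative, not taken from the project.

```python
def mfe_mae(prices, start, horizon):
    """Maximum favorable / adverse excursion for a long entry at prices[start].

    Returns (mfe, mae) as non-negative fractions of the entry price,
    measured over the next `horizon` bars.
    """
    entry = prices[start]
    window = prices[start + 1 : start + 1 + horizon]
    if not window:
        raise ValueError("no future bars available after the entry bar")
    mfe = max(max(window) / entry - 1.0, 0.0)  # best unrealized gain
    mae = max(1.0 - min(window) / entry, 0.0)  # worst unrealized loss
    return mfe, mae


if __name__ == "__main__":
    mfe, mae = mfe_mae([100.0, 102.0, 97.0, 101.0], start=0, horizon=3)
    print(f"MFE={mfe:.3f} MAE={mae:.3f}")
```

A model that predicts both values can decline to enter when the predicted MFE is small relative to the predicted MAE, which is one way to read the 'knows when to sit out' behavior described above.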

R038 then performs the single most important loss-function experiment. Standard MSE produces uniform predictions (MFE/MAE ratio ≈ 1.0) — the model cheats by predicting the mean. Four losses are compared: standard MSE (-4.9%), mse_ratio (+16.2%), weighted (-13.6%, hacks loss), and asymmetric (-0.6%). mse_ratio is the ONLY viable loss. R037 then validates mse_ratio + MFE/MAE in proper walk-forward across 4 folds (2018-2026) and posts +97.9% total / +9.3% worst fold / WR 54% / PF 1.49.

Key breakthroughs

  • R036: MFE/MAE dual-head regression introduced — the model learns asymmetric risk-reward, not just direction
  • R038: mse_ratio loss function discovered — the ONLY loss that doesn't collapse to a constant prediction
  • R036c: First demonstration that the model adds value by knowing when to stay flat (-1.2% vs B&H -23% in bear)
  • R037: First walk-forward-validated +97.9% across 4 folds — credibility restored

Key disappointments

  • R034: Pure regression gives WR 75% / PF 3.56 but only 16 trades — too sparse to compound
  • R035: 1-minute base bars don't help and cause OOM
  • R038 weighted/asymmetric losses: 'Smart' loss variants either hack the loss or over-conserve
Exit state: DualHeadGRU + mse_ratio + 15min multiframe, H120, lb=20, 2 epochs → +97.9% total walk-forward, +9.3% worst fold. MaxDD 37% too high for production but foundation is real.
EPOCH 03

The Event-Driven Revolution

R046-R053 2026-04-03 to 2026-04-06

Replacing consecutive-bar sequences with event-driven sequences (filtered by volatility) produced the biggest single performance jump of the entire project: +685% summed across 4 folds.

The lurking problem after R037 was sample efficiency: 95% of 5-minute BTC bars carry essentially no information, but the GRU was being asked to learn from all of them equally. R046 introduces the most important architectural reframe: train only on 'interesting' bars, defined as bars where ATR > 2x the rolling mean OR |returns| > 90th percentile. About 10.5% of bars survive the vol_tight filter.
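
The selection rule (ATR above 2x its rolling mean, OR absolute return above the 90th percentile) can be sketched like this. Several details are assumptions made for illustration: the ATR series is precomputed, the rolling mean uses only the `window` bars before the current one, and the percentile is a nearest-rank percentile over the supplied sample. In a walk-forward setting the percentile cut would presumably be derived from training data only. The name `event_mask` and the default `window=20` are hypothetical.

```python
import math


def event_mask(atr, returns, window=20, atr_mult=2.0, ret_pct=0.90):
    """Flag 'interesting' bars: ATR spike OR unusually large absolute return."""
    # Nearest-rank percentile of |returns| over the supplied sample.
    abs_sorted = sorted(abs(r) for r in returns)
    rank = max(1, math.ceil(ret_pct * len(abs_sorted)))
    ret_cut = abs_sorted[rank - 1]

    mask = []
    for i in range(len(atr)):
        prior = atr[max(0, i - window) : i]  # trailing window, excludes bar i
        atr_hit = bool(prior) and atr[i] > atr_mult * (sum(prior) / len(prior))
        ret_hit = abs(returns[i]) > ret_cut
        mask.append(atr_hit or ret_hit)
    return mask


if __name__ == "__main__":
    atr = [1.0, 1.0, 1.0, 1.0, 5.0, 1.0]
    rets = [0.0] * 6
    print(event_mask(atr, rets))
```

Sequences are then built from the flagged bars only, so a lookback of N covers N events rather than N consecutive candles.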

The results are dramatic. The model now sees a sequence of regime transitions rather than a flood of mostly-flat candles. R046-R053 sweep event filters, lookbacks, and feature sets. By R053 the best config — binary labels + event-driven sequences + lookback=100 + vol_ultra filter — posts +685% summed across 4 walk-forward folds. That is roughly 7x the previous walk-forward record and ~1.7x B&H. 15min is confirmed as the optimal base timeframe.

This epoch is short (8 rounds, 3 days) but represents the single highest-leverage methodological change in the project. Every subsequent breakthrough is built on top of event-driven sequences.

Key breakthroughs

  • R046: Event-driven sequences introduced — 10.5% of bars selected by volatility, massively improving signal density
  • R047/R053: Lookback=100 event bars and vol_ultra filter give +685% summed across 4 walk-forward folds
  • Walk-forward optimization methodology stabilized
  • 15min base timeframe confirmed optimal vs 5min, 10min, 1h

Key disappointments

  • Consecutive-bar sequences definitively retired
  • Most 'event filter' variants underperform vol_tight
Exit state: Event-driven sequences + vol_tight + lookback=100 + GRU 2x128 + binary labels → +685% summed across 4 folds. Sample efficiency problem solved.
EPOCH 04

Feature & Label Consolidation

R054-R060 2026-04-06 to 2026-04-08

Seven rounds methodically lock in the building blocks: full-sequence MinMax normalization, 3-class labels with weighted CrossEntropy, 36 clean features, and the vol_tight + lb=300 + k25/15 sweet spot.

With event-driven sequences proven, R054-R060 settle every component decision that will define the V2-V6 production stack. R054/054c/054d run three independent normalization studies and unanimously confirm full per-sequence MinMax — R054c posts +791% compound at lb=300. R055 tests 3-class labels (SL=-1, timeout=0, TP=+1) and wins +824% vs +685% binary; the key is CrossEntropyLoss with ce_signal weights [2.0, 0.5, 2.0].
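
Both ideas in this paragraph are small enough to sketch in plain Python. The assumptions: 'full per-sequence MinMax' is read as scaling each feature column by the min and max of that one lookback window, and the class order for the ce_signal weights is taken to be [SL, timeout, TP]. Neither detail is confirmed by the text, and the helper names are hypothetical.

```python
import math


def minmax_per_sequence(seq):
    """Scale each feature column of ONE sequence to [0, 1] using its own range."""
    n_feat = len(seq[0])
    lows = [min(row[j] for row in seq) for j in range(n_feat)]
    highs = [max(row[j] for row in seq) for j in range(n_feat)]
    scaled = []
    for row in seq:
        scaled.append([
            (row[j] - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j] else 0.0
            for j in range(n_feat)  # constant columns map to 0.0
        ])
    return scaled


def weighted_cross_entropy(logits, target, weights=(2.0, 0.5, 2.0)):
    """Per-sample class-weighted cross-entropy for 3 classes.

    Assumed class order: index 0 = SL, 1 = timeout, 2 = TP.
    """
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -weights[target] * (logits[target] - log_z)


if __name__ == "__main__":
    print(minmax_per_sequence([[1.0, 10.0], [3.0, 10.0], [2.0, 10.0]]))
    print(weighted_cross_entropy([0.0, 0.0, 0.0], target=1))
```

With these weights a misclassified SL or TP sample costs four times as much as a misclassified timeout, so gradients concentrate on the directional classes while timeouts stay in the training set. Batch reduction is left out of the sketch.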

R056 runs permutation importance + correlation + ablation studies on the original 47 features, identifying 11 actively harmful features (volatility_60, volume_ratio, hour_sin, MACD, ema crosses, etc.) — the 'clean 36' feature set is born. R057 confirms clean 36 is more robust than positive_only 24. R058 introduces the compound metric (product of (1+fold)) and posts +4,783% compound. R059 combines 3-class + clean 36 features. R060 does the definitive event-filter grid and confirms vol_tight + k25/15 + thr=0.40 as the sweet spot at +4,021% compound.
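
The difference between the compound metric and the earlier summed figures is easy to see in code. The sketch below assumes fold returns are given as fractions (1.0 means +100%) and that the reported compound percentage is the product of (1 + fold) minus 1. The fold values in the example are made up for illustration.

```python
def compound_return(fold_returns):
    """Product of (1 + r) over folds, minus 1."""
    growth = 1.0
    for r in fold_returns:
        growth *= 1.0 + r
    return growth - 1.0


def summed_return(fold_returns):
    """Plain sum of fold returns, as used for the pre-R058 figures."""
    return sum(fold_returns)


if __name__ == "__main__":
    folds = [1.0, 0.5, -0.2, 2.0]  # hypothetical fold returns
    print(f"summed:   {summed_return(folds):+.0%}")
    print(f"compound: {compound_return(folds):+.0%}")
```

Compounding penalizes a losing fold multiplicatively and rewards consistency, which is why compound figures and summed figures from different rounds are not directly comparable.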

This epoch is the project's least glamorous and arguably its most important. None of these rounds 'discovered' anything; they consolidated and stress-tested the discoveries of the previous epoch.

Key breakthroughs

  • R054c: Full per-sequence MinMax normalization confirmed (3 independent tests) — record +791% compound
  • R055: 3-class labels with ce_signal [2.0, 0.5, 2.0] beat binary +824% vs +685%
  • R056: 11 harmful features identified; clean 36 feature set established
  • R058/R060: Compound metric introduced and vol_tight + lb=300 + k25/15 sweet spot locked in (+4,783%)

Key disappointments

  • Positive-only 24 features: high-quality trades but less robust than clean 36
  • R058 single-seed variance: fold 0 varies +86% to +262% — foreshadowing ensemble work
Exit state: Locked-in stack: clean 36 features + per-sequence MinMax + 3-class tb3_vol + ce_signal weights + vol_tight + lb=300 + GRU 2x128. Best compound +4,783% but with concerning seed variance.
EPOCH 05

Always-Invested Revolution & The R071 Record

R061-R072 2026-04-09 to 2026-04-11

A twelve-round sprint that rebuilt the trading layer from exit modes upward — culminating in R068's paradigm flip to 'always invested + ensemble danger detection' and R071's record +7,128% compound.

R061-R067 systematically refine the trading layer. R061 compares fixed/dynamic/trailing/hybrid/always exits and finds fixed simplest and best (+3,440%). R062 confirms multi-timeframe features don't help once lb=300 is in place. R063 retires shorts (P(SL) isn't precise enough) and SL-signal exit. R064 settles expanding-window over sliding. R065 confirms 15min over every alternative and discovers that sigma clip [0.0005, 0.005] is essential — removing it cuts compound by 2.5x. R066 picks 100% position sizing. R067 finds that requiring 3-4 consecutive event-bars in agreement improves trade quality +47%.

Then R068 changes everything. Rather than asking 'when should the model enter?', it asks 'why is the model selectively trading at all?' Answer: stay invested like B&H by default, use the model only to detect dangerous regimes and exit. This single reframing captures all of B&H's upside while letting the model add asymmetric protection. It is the single biggest conceptual breakthrough in the project.

R069 introduces 5-seed-per-config validation (single-seed results revealed as unreliable). R070 confirms voting (≥4/5 agree) beats simple averaging. R071 then assembles the full stack and posts +7,128% compound, beating B&H in 4/4 folds. R072 validates across three seed sets: set A passes 4/4, set B 3/4, set C 4/4.
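
The always-invested logic with ensemble voting can be sketched as below. This is an illustration under simplifying assumptions, not the project's trading layer: each of the 5 seeds emits a boolean danger flag per bar, the position is flat for a bar whenever at least 4 flags agree, re-entry is immediate once the vote clears, and fees and slippage are ignored. All names are hypothetical.

```python
def danger_vote(seed_flags, min_votes=4):
    """True when at least `min_votes` seed models flag danger."""
    return sum(1 for flag in seed_flags if flag) >= min_votes


def always_invested_equity(returns, flags_per_bar, min_votes=4):
    """Equity curve that holds the asset by default and steps aside on a danger vote.

    returns[i] is the asset return over bar i; flags_per_bar[i] holds the
    seed flags known before bar i starts.
    """
    equity = 1.0
    curve = []
    for r, flags in zip(returns, flags_per_bar):
        if not danger_vote(flags, min_votes):
            equity *= 1.0 + r  # invested: take the bar's return
        curve.append(equity)   # flat bars leave equity unchanged
    return curve


if __name__ == "__main__":
    rets = [0.10, -0.20, 0.05]
    flags = [
        [False] * 5,                       # no danger: stay invested
        [True, True, True, True, False],   # 4/5 agree: step aside
        [True, True, True, False, False],  # 3/5 only: stay invested
    ]
    print(always_invested_equity(rets, flags))
```

The default state is exposure, so the model only has to be right about the minority of bars where stepping aside helps.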

Key breakthroughs

  • R068: Always-invested + danger detection paradigm (4-8x improvement)
  • R069/R070: 5-seed ensemble with ≥4/5 voting — single-seed retired permanently
  • R071: Record +7,128% compound, 4/4 folds beat B&H, fold 2 +261% vs B&H +228%
  • R065b: Sigma clip [0.0005, 0.005] — a tiny detail with 2.5x compound impact

Key disappointments

  • R063: Shorts don't work — long-only confirmed
  • R061: Trailing/dynamic/hybrid exits all lose to fixed — simpler wins
  • R072: Set B fails fold 3 by ~6% — not yet bulletproof across seed inits
Exit state: R071 always-invested + ensemble voting: +7,128% compound, 4/4 vs B&H; seed set B passes only 3/4. Strategy layer essentially solved; remaining question is whether labels can be improved further.
EPOCH 06

The Labeling Dead-End

R073-R080 2026-04-11

Eight rounds, thirteen advanced labeling methods tested in a single day — every one failed to beat plain 3-class tb3_vol. The dead-end that proved the foundation was right.

After R071 set a record but R072 revealed instability, the natural hypothesis was 'the labels can be smarter.' R073-R080 is the project's most concentrated burst of experimentation: thirteen different labeling schemes tested in a single calendar day. The schemes include speed-weighted labels, 5-class labels, efficiency-weighted labels, DSR (deflated Sharpe), filtered labels, next-event labels, RL meta-decision, Self-Distillation Iterative (SDIL), Conditional Barrier Asymmetry (CBA), Multi-Resolution Consensus (MRC), Trend Scanning, and Path-Quality Weighted (PQW).

The result: none of the thirteen methods beat plain 3-class tb3_vol. Two insights survive. First, regression labels fundamentally fail with the GRU — CrossEntropy concentrates gradients on the directional decision in a way no continuous loss can match. Second, timeouts (~24% of labels) are not noise but essential information; every 'cleaner' labeling scheme that suppresses or downweights timeouts loses more signal than it gains.

This epoch is the canonical example of a productive dead-end. The team spent eight rounds learning that the foundation was already as good as it could be, which freed the next epoch to focus on production deployment.

Key breakthroughs

  • Negative result confirmed: 13 labeling methods tested, none beat tb3_vol
  • Insight: Timeouts (24% of labels) carry essential information
  • Insight: CrossEntropy concentrates gradients in a way no regression variant can replicate
  • Project gains permission to stop searching for better labels

Key disappointments

  • Regression labels (R073, R080): fundamentally fail with GRU
  • RL meta-decision (R075), SDIL (R076), CBA (R077): high-engineering-effort, zero improvement
  • Trend Scanning (R079), MRC (R078): theoretically motivated, empirically inferior
Exit state: 3-class tb3_vol with ce_signal [2.0, 0.5, 2.0] is the definitive labeling scheme; the search is formally closed.
EPOCH 07

Production Validation & The Equity Bug Reckoning

R081-R094 2026-04-12 to 2026-04-15

R083 uncovers a critical equity double-counting bug that had inflated every R068-R082 result by 2x — and then R094 discovers the daily SMA filter that becomes the most important component in the entire stack.

R081 combines every breakthrough into a single 'ultimate production config'. R082 begins fine-grid threshold tuning. Then R083 — the second great bug-fix moment of the project.

In the always-invested notebooks (R068 onward), the equity computation at the end of each fold force-closed the open position and then computed equity[-1] = capital + position_value, double-counting the last trade. The reported +89,000% compound was actually +4,000-7,600%. Every 'record' from R068 to R082 had to be re-evaluated. The corrected R084 result is +7,641% best-set / +3,978% worst-set against a B&H baseline of +1,005%, still a 4-7.6x outperformance.
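
The bug pattern described here can be reproduced in a few lines. The sketch is a reconstruction from the description only, with hypothetical function and variable names: the buggy version moves the liquidation proceeds into cash and then adds the position value a second time.

```python
def final_equity_buggy(cash, units, last_price):
    """End-of-fold equity with the double-counting bug."""
    position_value = units * last_price
    cash += position_value        # force-close: proceeds moved into cash
    return cash + position_value  # BUG: position value is added again


def final_equity_fixed(cash, units, last_price):
    """End-of-fold equity after the fix: the position is zeroed once closed."""
    cash += units * last_price    # force-close: proceeds moved into cash
    units = 0.0                   # nothing is held after the close
    return cash + units * last_price


if __name__ == "__main__":
    # Fully invested at fold end: no cash, 2 units priced at 50.
    print(final_equity_buggy(0.0, 2.0, 50.0))
    print(final_equity_fixed(0.0, 2.0, 50.0))
```

For a strategy that is fully invested when the fold ends, the buggy figure is exactly double the true one for that fold. When fold results are then multiplied together, the per-fold inflation compounds.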

R085-R087 explore whether anything obvious can lift the corrected numbers. R085/R085b add sigma, tp_dist_pct, sl_dist_pct as features — all hurt (-24% to -45%). R086/R087 sweep regularization and model size — all hurt or fail to improve cross-set consistency. R089-R092 try LSTM swap, multi-lookback, 9-seed ensembles — none beat baseline.

Then R094 breaks the plateau in an unexpected direction. Trying SMA as a sanity baseline, the team discovers that simple SMA on its own crushes the GRU — and combining either with a DAILY SMA filter explodes performance: GRU+daily gives +720M%, SMA+daily gives +9.3B%. Every tested config with the daily filter beats B&H in 4/4 folds. The daily SMA filter is the single most important component the project has ever found.
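
The text does not spell out the exact rule behind the daily SMA filter, so the following is only one plausible reading: allow exposure on days when the daily close sits above its simple moving average. The `period=50` default, the direction of the comparison, and the function name are all assumptions made for illustration.

```python
def daily_sma_gate(daily_closes, period=50):
    """Per-day boolean gate: True when the close is above its trailing SMA.

    Days without a full `period` of history are gated off (False).
    """
    gate = []
    for i in range(len(daily_closes)):
        if i + 1 < period:
            gate.append(False)  # not enough history for an SMA yet
            continue
        window = daily_closes[i + 1 - period : i + 1]  # last `period` closes
        gate.append(daily_closes[i] > sum(window) / period)
    return gate


if __name__ == "__main__":
    print(daily_sma_gate([1.0, 2.0, 3.0, 4.0, 5.0], period=3))
    print(daily_sma_gate([5.0, 4.0, 3.0, 2.0, 1.0], period=3))
```

In a live setting the gate for a given day would normally be computed from the previous day's completed close, so that intraday decisions never see the current day's final price.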

Key breakthroughs

  • R083: Equity double-counting bug fixed — project finally has honest numbers (+7,641% best / +3,978% worst)
  • R084: Always-invested + ensemble confirmed at 4-7.6x B&H
  • R094: Daily SMA filter discovered — the single most important component (+720M% GRU+daily, +9.3B% SMA+daily)
  • Negative confirmation R085-R092: more features, regularization, seeds, LSTM — nothing else helps

Key disappointments

  • Equity bug (found in R083): every R068-R082 record cut by ~2x
  • R085-R087: Adding obvious features (sigma, distances) hurts
  • R089-R092: LSTM, multi-lookback, 9-seed — no architecture improvement
Exit state: Corrected stack: GRU 5-seed ensemble + always-invested + 36 features + 3-class tb3_vol + vol_tight + lb=300, validated at +7,641% best / +3,978% worst. Daily SMA filter newly discovered.
EPOCH 08

Hybrid Strategies & Live Deployment (V2-V6)

R095-R111 2026-04-15 to 2026-04-20

Seventeen rounds turn validated backtests into five live production bots — including the V115_cmp and V66 strategies that crown the entire project.

With the GRU stack validated and the daily SMA filter discovered, R095-R111 are the project's deployment epoch. R095 tests ATR-adaptive exits — improves V4 by +53% but hurts V3/V6. R096 confirms multi-timeframe entry hurts. R097 confirms drawdown circuit breakers hurt. R098 confirms RSI/MACD/volume filters layered on top of SMA all hurt. The pattern is consistent: the daily SMA filter is essential, but additional filters add noise.

R099-R109 build out three production bots: V2 (pure GRU ensemble, +7,641% backtest), V3 (GRU + DailyRSI>80 + 10% trailing stop, +450B% backtest), V4 (3-regime adaptive V5.4 Robust-5, +30,174% backtest, +117% min α). All three deployed to Hetzner.

R110-R111 are the climax. An external collaboration produces V115_cmp (combines GRU + peak_drop + ratchet + regime cooldown) and V66 (uniform thresholds + extreme cooldown 4,48 bars). R110 reproduces them to the dollar: V115_cmp at +168,759% compound / +238% min α; V66 at +70,576% compound / +245% min α (the all-time min-α record). R111 builds the HybridV115Trader class (~750 LOC) and V5+V6 join the live stack. Five servers, one shared GRU checkpoint, five strategy configurations.

Key breakthroughs

  • Daily SMA filter integrated into V3/V4/V5/V6
  • R110: V115_cmp validated at +168,759% compound (all-time compound record)
  • R110: V66 validated at +245% min α (all-time robustness record)
  • R111: 5 production servers live on Hetzner — V2/V3/V4/V5/V6

Key disappointments

  • R095: ATR-adaptive exits improve V4 but hurt V3/V6 — no universal exit
  • R096-R098: Multi-timeframe entry, DD breakers, RSI/MACD/volume filters all hurt
  • Plateau confirmed: layering more filters on daily-SMA stack consistently reduces compound
Exit state: 5 production servers live: V2 (+7,641%), V3 (+450B%), V4 (+30,174% / +117% min α), V5 (+168,759% / +238% min α), V6 (+70,576% / +245% min α). Real capital deployed (€125 across 5 sub-accounts).
EPOCH 09

V66 Refinement Exhaustion

R112-R116 2026-04-20 to 2026-05-05

Five rounds attacking the V66 hyperparameter surface — all confirmed that V66's specific configuration cannot be tuned further.

With V5/V6 deployed and posting numbers nobody dared to budget for, the natural next question was: can we extract more from the same GRU checkpoints? R112-R116 attack the V66 strategy layer from every angle. R114 implements confidence-weighted voting. R115 sweeps vote thresholds and min_votes parameters. R116 implements Kelly position sizing.

The result of five rounds is unanimous: V66's hyperparameters are at a local optimum that cannot be improved by any tweak inside the same parameter family. Confidence-weighted voting matches binary voting in compound but doesn't improve robustness. Vote-threshold sweeps confirm the existing config sits at the peak. Kelly sizing reduces drawdown but also reduces compound.
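
For reference, the classic binary-outcome Kelly fraction is f* = p - (1 - p) / b, where p is the win probability and b is the ratio of average win to average loss. The sketch below implements that textbook formula with clipping to [0, 1] for a long-only, unleveraged account. Which Kelly variant R116 actually used is not stated, so this should not be read as the project's implementation.

```python
def kelly_fraction(win_prob, payoff_ratio):
    """Textbook Kelly fraction, clipped to [0, 1].

    win_prob: probability of a winning trade (0..1).
    payoff_ratio: average win divided by average loss (must be > 0).
    """
    if not 0.0 <= win_prob <= 1.0:
        raise ValueError("win_prob must be between 0 and 1")
    if payoff_ratio <= 0.0:
        raise ValueError("payoff_ratio must be positive")
    raw = win_prob - (1.0 - win_prob) / payoff_ratio
    return min(max(raw, 0.0), 1.0)  # no shorting, no leverage


if __name__ == "__main__":
    print(kelly_fraction(0.6, 1.0))  # positive edge: partial position
    print(kelly_fraction(0.4, 1.0))  # negative edge: stay out
```

Any fraction below 1 reduces exposure relative to the 100% position sizing chosen in R066, which fits the observed trade-off: lower drawdown, but lower compound as well.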

This is the second major dead-end of the project — and like R073-R080 before it, the value is in the negative result. The team formally closes the V66-refinement search and reframes the question.

Key breakthroughs

  • Negative result confirmed: V66 hyperparameters cannot be improved within existing parameter family
  • Confidence-weighted voting (R114): matches binary voting — no free lunch
  • Vote threshold + min_votes sweep (R115): confirms current config is local optimum
  • Kelly sizing (R116): trades compound for DD reduction with neutral min α

Key disappointments

  • Five rounds of intricate engineering produced zero improvement to V66's headline numbers
  • Kelly's promise of 'reduce DD without sacrificing compound' is empirically falsified for this strategy
Exit state: V66 hyperparameters formally confirmed at local optimum. The existing GRU+strategy vocabulary is exhausted.
EPOCH 10

Paradigm Shift — Exploring Orthogonal Alpha

R117-R120 2026-05-05 to 2026-05-22

After V66 refinement was exhausted, the project pivots to orthogonal strategies: mean reversion, XGBoost, multi-timeframe filtering, and portfolio composition.

R117-R120 mark the project's most recent strategic pivot. Having proven that the V66 stack cannot be improved within its own vocabulary, the team formalizes a new objective in ACTION_PLAN_2026.md: produce a candidate that beats V5/V6 by a meaningful margin AND is diversifying — uncorrelated enough with the existing stack to be deployed alongside it.

The four exploratory rounds attack four orthogonal hypotheses. R117 tests mean reversion (BB_revert, RSI_extreme, Z_score, MACD_div), the dual of the trend-following stack. R118 implements XGBoost as an alternative model class: REJECTED, GRU crushes trees 287×. R119 implements a multi-timeframe regime filter: REJECTED, V66 already encodes context. R120 is the portfolio-composition breakthrough: regime-aware allocation of V66 + MACD_div delivers 7 passing configurations with 362,991% compound (5.1× V66) and a matching 245% min α.

Alongside the experiments, the project hardens its validation infrastructure (Phase 0: 5-step gate). The era of 'a single notebook + a single seed' is formally over. The current strategic focus is concluding R120 analysis and building a real-time regime detector (R121) so the retroactive R120 alpha can be deployed.

Key breakthroughs

  • R117: Mean reversion as orthogonal alpha source — MACD_div +179K% compound
  • R120: Portfolio V66+MACD_div regime-aware — 7 configs PASS (362,991% compound, 245% min α, 16× more trades)
  • Phase 0 validation infrastructure proposed (5-step gate)
  • Project framing shifted from 'replace V6' to 'diversify alongside V6'

Key disappointments

  • R118: XGBoost CRUSHED by GRU (245% vs 70,576% compound)
  • R119: All 5 MTF filter variants degrade V66 — context already in features
  • R120 winners are RETROACTIVE — need real-time regime detector (R121) for deploy
Exit state: Five production bots live. V6 holds min-α crown (+245%); V5 holds compound crown (+168,759%). R120 proved portfolio diversification works, pending a regime detector for deployment.