Discarded Paths · BTC Trading AI

How to read this table: each row documents the hypothesis, what failed, the LESSON LEARNED (the principle now believed), and the FUTURE conditions under which a retry would be worthwhile. Before designing a new experiment, grep this page first: if the idea is already here, save the Colab hours.
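
The "grep this page first" step can also be scripted. The following is a minimal sketch, assuming the table has been exported as tab-separated text with the column names shown in the header row. The function name `search_discarded` and the two abbreviated sample rows are hypothetical and exist only for illustration.

```python
import csv
import io

# Abbreviated, hypothetical two-row export of this table (cells shortened for the example).
SAMPLE_TSV = (
    "Round\tDate\tParadigm\tHypothesis\tWhat happened\tLessons learned\tRe-explore if\tSalvage\n"
    "R150\t2026-05-26\trisk-management\tPause trading after X% drawdown.\t"
    "All variants worse.\tNaive DD circuit breaker is too blunt.\t"
    "Re-test only after smarter regime detection added.\tBaseline for a future kill-switch.\n"
    "R146\t2026-05-26\texit-logic\tATR-adaptive SL.\tAll ATR variants worse.\t"
    "Fixed 10% SL is optimal within the V66 cascade.\t"
    "If full exit-cascade grid search is run.\tNeeds joint tuning of all exits.\n"
)


def search_discarded(tsv_text, keyword):
    """Return (round, paradigm, retry condition) for every row that mentions keyword."""
    # QUOTE_NONE keeps apostrophes and quotes in free-text cells from being parsed as CSV quoting.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    needle = keyword.lower()
    hits = []
    for row in reader:
        # Search all cells of the row, case-insensitively.
        haystack = " ".join(str(value) for value in row.values() if value).lower()
        if needle in haystack:
            hits.append((row["Round"], row["Paradigm"], row["Re-explore if"]))
    return hits


for hit in search_discarded(SAMPLE_TSV, "circuit breaker"):
    print(hit)
```

Returning the "Re-explore if" cell alongside the round ID puts the retry condition in front of the reader immediately, which is the decision this page is meant to support.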

Round	Date	Paradigm	Hypothesis	What happened	★ Lessons learned	Re-explore if	Salvage
`R202`	2026-05-29	meta-labeling	V5 ~45% losing trades may have predictable features. Train RandomForest secondary on (entry context → was profitable). Filter pre-trade.	Best variant (keep 81%): +72,623% / +254. Worse than V5 baseline at ALL veto levels. RF can't identify losing trades reliably enough.	V5's 45% loss rate is IRREDUCIBLE NOISE at entry-time features. The losing trades don't have predictable signatures in (regime, vol, RSI, slope, GRU probs). Meta-labeling direction closed.	Only with richer feature set (e.g., order book, macro, on-chain) — out of scope for this sprint.	Closes orthogonal-filter approach as path to beat V5.
`R201`	2026-05-29	labeling	Stricter TP threshold (+10bps friction) yields cleaner training signal → cascade fires only on net-profitable patterns.	Compound +17% / Min α -229 / 0/4 BH / 6 trades total. Val_loss 0.56 (vs canonical 0.43). Stricter labels made P(TP) too low → cascade barely fires.	Same distribution mismatch pattern as TCN/CNN-GRU. Cleaner labels reduce positive class → model underconfident → V5 thresholds don't fire. Confirms cascade rigidity at the LABEL level too.	Only if combined with re-tuned V5 thresholds (out of scope).	Closes labeling direction as easy way to beat V5.
`R198`	2026-05-28	ensemble	Combine V5 and V66 cascades: defensive (exit if either) or aggressive (exit if both) or regime-conditional (V66 in bear only). Mixed cascade might improve risk-adjusted returns.	V5 baseline wins all. defensive +63K/+235 (worse min α). aggressive +72K/+252 (close but loses). regime_v66_bear = aggressive (identical pattern). V66 cascade is strictly inferior, blending it with V5 only adds drag.	Cascade ensemble doesn't help when one cascade is strictly inferior. V66 was the predecessor; V5 cascade is its improvement. Mixing adds noise. After 8 R&D directions, V5 V115_cmp is uncontested CPU local optimum.	After GPU experiments (R193 TCN retry, R194 alternative labelings, multi-task GRU) — different GRU base may unlock new cascade options.	Confirms V5 standalone is the deployment champion. Combined with R195 leverage finding, V5 + 1.25× margin leverage (V11 candidate) is the only meaningful improvement found.
`R197`	2026-05-28	robustness	Higher reenter threshold = fewer trades = more robust to slippage. R197 sweeps reenter +0.05/+0.10/+0.15/+0.20 × slip 0/0.1/0.5%.	V5 baseline WINS at all slip levels. Higher reenter destroys alpha: +0.05→+9K compound (was +76K), +0.10→+414 compound. Slip effect on V5 baseline: 0.01%→+76K/+266, 0.1%→+38K/+203, 0.5%→+3.5K/+15 (still positive!).	V5 baseline already optimal — increasing selectivity removes more GOOD trades than BAD ones. The 'high-freq hurts slip' hypothesis was WRONG. V5 IS robust to plausible slip (0.1% slip still gives +38K/+203). R196 worst-case (2% slip) was unrealistic.	Different paradigm entirely (e.g., new architecture via RunPod, cross-bot ensemble, multi-timeframe).	Confirms V5 V115_cmp is the deployment champion (no direction beats it after 7 attempts). At realistic live slip 0.01-0.1%, V5 backtest is achievable.
`R196`	2026-05-28	robustness	R195 showed V5 + leverage 1.25× yields +297K with 0 liquidations. Real-world has slippage, API delays, gaps. Will V5 leveraged survive realistic adversities?	Flash crashes alone don't kill V5 (init_sl handles them, 0 liquidations). BUT slip 2% destroys compound from +76,725% to +79% at 1.0× baseline. Slip 5% = -100% total loss at ALL leverages. Gap 2% same. WORST CASE (slip 5% + delay 3 + gap 2% + flash): -100% all leverages.	CRITICAL: V5 baseline (+76,725% honest) ASSUMES 0.01% slip per trade. With realistic slip 0.5-2%, compound collapses 99%+ even at 1.0× (no leverage). R195 leverage finding was a MIRAGE that depends on PERFECT execution. Leverage AMPLIFIES the fragility but isn't the root cause — V5's high-frequency cascade is hyper-sensitive to slippage.	After live V5 slippage audit shows <0.5% avg. Then re-test 1.10× leverage with corrected slip model.	V11 (V5+leverage) PAUSED until V5 live slippage measured. If avg live slip < 0.5%, R195 leverage might still be viable. If > 1%, V5 itself is over-optimistic. URGENT: audit V5 bot's actual fill prices vs close.
`R192`	2026-05-28	sizing	V5 might be improved by sizing proportionally to prediction confidence: high confidence → 100%, low confidence → 40%. Reducing low-conf exposure should improve min α.	ALL FIVE variants underperform V5 baseline (+76,725% / +266 min α). Best: anti_confidence +71,858% / +265.3. Worst: combined +27,453% / +25. Reducing position size kills compound, doesn't protect min α (V5 cascade already exits well).	V5's '100% always' sizing is structurally optimal. The cascade catches bad trades via exits, not by reducing entry size. Three consecutive negatives (R190 pyramid, R191 exits, R192 sizing) confirm V5 is local optimum for current architecture+labeling+features.	Only if V5's prediction confidence calibration is REPLACED (e.g., new architecture with better-calibrated confidence outputs).	Direction CPU-tweaks definitively closed. Pivot to RunPod GPU step-change: new architecture (R193 TCN), new labelings (R194 tb_vol variants), or new features (multi-timeframe after audit #74).
`R191`	2026-05-28	cascade-tuning	V5 +266 min α might be tightened with extra exit conditions (atr_z spike, drsi tighter, rsi confluence, slope reversal, etc).	NONE improve V5 baseline. atr_z_spike OVER-exits (compound +1,479%, min α -90). drsi_oversold ties min α +266 but loses compound. All others underperform. V5 cascade well-calibrated already.	V5 V115_cmp's existing 5-exit cascade is structurally optimal for its labeling+architecture. Adding more exits causes over-exiting → loses compound without improving min α. Direction CLOSED for marginal tweaks.	Only if a fundamentally new feature is introduced (e.g., MTF after lookahead audit, cross-asset).	Two consecutive negatives (R190 pyramid, R191 exits) confirm V5 is at local optimum. Need step-change: different sizing logic, new architecture (R193 TCN), or new labeling (R194).
`R190`	2026-05-28	feature-eng	Pyramid was phantom (R163/R170/R184) due to cap=0 bug. With 50% cap initial + 50% reserve for REAL adds, can pyramid mechanic add alpha to V5 V115_cmp baseline (+76,725% / +266 min α)?	ALL 6 variants DESTROY V5. Best: 70/30 trig=10% add=70% gives +12,975% / +34 (vs baseline +76,725% / +266). Worst: 50/50 trig=15% gives +3,495% / -84.	Pyramid direction PERMANENTLY closed even under honest accounting. The phantom alpha that R163/R170 showed was 100% from the bug. Initial cap reduction (50%) destroys compound faster than pyramid adds can recover. V5 cascade already captures the upside that pyramid would have harvested.	Only with fundamentally different paradigm — e.g., true margin trading on futures (not available for Spain account) or fractional Kelly scaling based on prediction confidence.	Negative result valuable — saves time exploring pyramid variants in future. R&D effort redirected to extended exits (R191), cascade tweaks (R192), new architectures (R193 TCN).
`R181`	2026-05-28	robustness	Does skipping shorts in recovery regimes (N-day return > threshold) fix R180's F2 tail risk?	F2 PnL IDENTICAL (+428%) across ALL 7 variants. Filter never triggers in F2 — only F0. Regime filter does NOT address F2 fragility.	F2 tail risk is path-dependent within F2's own short trades, not from misclassified recoveries. Simple N-day filter useless.	HMM-based regime detector (complex, untested).	Closes the regime-filter direction definitively.
`R179`	2026-05-27	alt-model	Map Transformer P(TP) distribution onto GRU canonical shape so V66 thresholds work as-is?	Script died silently twice (shell snapshot reload). Skipped — Transformer direction closed after R176/R178 anyway.	Calibration salvage of failed architecture is high-cost, low-probability. Better to accept arch failure.	If a new architecture shows shape compatibility from the start.	None — Transformer direction closed.
`R178`	2026-05-27	alt-model	Re-tune V66 thresholds for Transformer shape?	0/300 configs passed. Transformer P(TP) p50=0.18 vs GRU 0.30, fundamental calibration mismatch.	Architecture via threshold tuning CLOSED. Issue is calibration semantics, not just thresholds.	Joint model+threshold optimization.	Drives R179 calibration approach.
`R176`	2026-05-27	alt-model	Better val_loss → better backtest.	Compound +91.9%, Min α -224, 0/4 BH. Catastrophic.	V66 thresholds calibrated for canonical GRU softmax. Transformer P(TP) p50=0.18 vs GRU 0.30.	—	Drives R178/R179.
`R175`	2026-05-27	alt-model	Transformer might learn richer features than GRU.	val_loss 0.43 (better than GRU 0.50). 155K params.	Best val_loss in project. But R176 showed it doesn't translate.	—	Checkpoints kept for R178/R179.
`R172`	2026-05-27	audit	If R170-B is real, per-fold instrumentation should match the aggregate.	Per-fold cap reconstruction did NOT match aggregate. Trigger for R173 forensic audit.	Instrumentation revealed 4 engine bugs ([[r173-pyramid-was-phantom]]). R170-B was 100% phantom.	Never. Pyramid direction closed.	Saved deployment by triggering R173.
`R171`	2026-05-27	validation	R170 candidates need OOT validation to crown the true winner.	Originally +53% OOT / +71.9% α. R173 audit: same phantom root cause. HONEST = V66 alone (+12.4% α).	Most aggressive variant (higher leverage + bigger add) wins both backtest AND OOT. Pyramid is robust — selectivity (filter on +12% unreal + GRU re-vote) prevents over-trading even with aggressive size.	—	R170-B is the FINAL deployment leader. Backtest +257K / +415 / 4/4. OOT 2026 +53% / +71.9% α. Spot-tradeable, no perp, no funding cost.
`R170`	2026-05-27	additive-sleeve	R163's (0.10/0.50/1.5x/24h) may not be optimal. Dense sweep around it finds true optimum.	Original R170-B +257,754%. R173 audit: 6.05× inflation phantom leverage. HONEST: identical to V66 alone (517/517 attempts blocked).	R163 was sub-optimal. Higher leverage + bigger add + slightly higher trigger improves both compound AND min α. Sweep was worth doing.	Annual re-sweep as market regime evolves.	R170-B promoted to FINAL deployment candidate after R171 OOT validation.
`R169`	2026-05-27	validation	Validate R168 via V66 backtest.	FAIL: compound -56%, min α -222, 0/4 BH.	Multi-timeframe direction CLOSED as standalone retraining.	—	—
`R168`	2026-05-27	features	Inject macro context (1H/4H rolling stats) into V66 inputs. Should help model distinguish pullback-in-uptrend from drop-in-downtrend.	val_loss 0.44 (better than canonical 0.50!) BUT V66 backtest collapses: compound -56%, min α -222, 0/4 BH.	13th confirmation: ANY model retraining breaks V66's threshold calibration. val_loss improvements don't translate to backtest. Strong evidence that V66's threshold is the BOTTLENECK, not the underlying GRU representation.	Only if V66 thresholds re-tuned alongside.	—
`R166`	2026-05-26	additive-sleeve	The two breakthroughs (pyramid + shorts) cover orthogonal regimes — should STACK.	Original +454,068% compound. R173 audit: pyramid portion 100% phantom. Effective COMBINED = R151-A alone (+113K honest).	IN-SAMPLE super-additive because each sleeve fires at different regimes. F0 1322 > max(R163 676, R151-A 880). BUT R167 shows this is path-dependent (see R167).	If position sizing per sleeve is added to prevent overlap.	Mechanism understood; not deployed.
`R165`	2026-05-26	validation	Validate R163 pyramid in 2026 OOT.	Originally +41.3% OOT / +60.3% α. R173 audit shows phantom unfunded leverage. HONEST = identical to V66 alone.	Pyramid CAPTURES BEAR-MARKET INTRADAY RALLIES. 2026 had a sharp rally within the broader bear; pyramid leveraged into it perfectly. Validates the additive paradigm with REAL data.	—	R163 promoted to primary deployment candidate.
`R163`	2026-05-26	additive-sleeve	CTA classic: add to position at +X% unrealized when GRU re-confirms. Captures bull tails V66 alone misses.	ORIGINAL claim: +198,893% compound, +373 min α. R173 audit found 4 engine bugs (UNFUNDED LEVERAGE). HONEST: pyramid never fires (cap=0 blocks all 583 attempts). R174 confirmed: bug-fixed R163 = +42,636% IDENTICAL to V66. Phantom 100%.	Pyramid is the FIRST mechanism to improve min α above V66's +245. F1 (bull) unchanged because gru_safe only fires at event bars (rare in steady bull). The selectivity (3 adds total in 8 years) is FEATURE not bug. Spot-only — no perp needed.	Sensitivity sweep in R170 found better config R170-B.	Major deployment candidate. Cleaner than R151 (no perp).
`R162`	2026-05-26	ensemble	Use R134's P(v66_zone) as size multiplier (25/50/75/100) per V66 entry.	FAIL: all variants WORSE than canonical. R134 recall 15-49% → forces V66 to size DOWN at wrong times. Best: +6,278% vs canonical +70,576%.	R134 recall too low for sizing application. R148 (filter) and R162 (sizing) both fail for same reason.	Only if higher-recall classifier emerges.	—
`R158`	2026-05-26	alt-model	R157's better val_loss → better short signal.	CATASTROPHIC: min α -290 to -311 across all 5 thresholds. R157's P(SL) is NOISY → wrong-time shorts.	'Train short-specific GRU' direction CLOSED. Canonical R069 P(SL) remains the BEST short signal. Asymmetric labels make most outcomes Timeout (model can't distinguish TP vs SL).	—	—
`R157`	2026-05-26	alt-model	Correct asymmetry: k_up=15 (TP hard), k_dn=5 (SL easy) → more down-move labels.	Training complete, val_loss 0.84 (better than R155's 1.03 but still worse than canonical 0.50).	Improved over R155 in val_loss. Eval in R158.	—	—
`R156`	2026-05-26	alt-model	R155 produces cleaner short signal than canonical R069 P(SL).	FAIL: R155 P(SL) WORSE than canonical for shorts. Best variant +52K (vs R151-A +212K).	Wrong asymmetry direction. Drives R157 correction.	—	—
`R155`	2026-05-26	alt-model	Train a separate GRU with asymmetric short-favoring labels (k_up=7, k_dn=10).	Training complete, val_loss 1.03 (vs canonical 0.50). Used by R156 eval.	Asymmetry chosen WRONG: k_dn=10 means SL barrier is FARTHER → FEWER SL labels → harder to learn drops. Corrected in R157.	—	—
`R150`	2026-05-26	risk-management	Pause trading after X% drawdown for Y days. Should reduce tail risk found in R149.	ALL variants WORSE compound for marginal MDD reduction. Best: DD 20% / pause 30d gives +20K compound (vs canonical +70K), MDD 20% (vs 27.7%). Compound loss > MDD gain.	Naive DD circuit breaker is too blunt. V66's existing trail+peak_drop+GRU danger cascade already handles risk. A circuit breaker that triggers on DD that V66 itself caused = double-exit.	Re-test only after smarter regime detection added.	If volatility/regime-aware kill-switch designed in future, this gives baseline.
`R148`	2026-05-26	ensemble	R134 honest classifier as a take/skip filter on every V66 entry should improve precision (López de Prado meta-labeling).	ALL thresholds reduce compound (best τ=0.1 → +3,681%). Skip rate 92-97% even at low τ. R134 predicts 'not v66 zone' too often → V66 misses most entries.	R134 recall is 15-49% (per classifier-label-leak memory). At useful thresholds it filters out too many legitimate V66 trades. Meta-labeling needs a HIGH-RECALL primary classifier; R134 doesn't qualify.	If a classifier with recall > 70% emerges, retest meta-labeling.	Confirms R134 limitations from earlier rounds.
`R146`	2026-05-26	exit-logic	Fixed 10% SL is naive for BTC's variable volatility. ATR-adaptive should improve risk/reward.	ALL ATR variants WORSE than canonical fixed 10%. Best: ATR 2.5x floor 7% gives +49K (vs canonical +70K). All have HIGHER MDD too.	The 'naive' fixed 10% SL is actually optimal given V66's full exit cascade. ATR adaptation interacts badly with the other exits (trail, DRSI, peak_drop). Confirms V66 components are co-optimized.	If full exit-cascade grid search is run, include ATR variants.	Don't replace SL component alone; it needs joint tuning of all exits.
`R144`	2026-05-26	architecture	More capacity (2x256 vs canonical 2x128) → richer representations → better predictions.	CATASTROPHIC: compound -37.6%, min α -203, 0/4 BH. Worst of R141-R144.	More capacity → overfit. The canonical 2x128 sits at the right point on the bias-variance curve for this data. Confirms V66 is at a tight architecture optimum.	If sequence length doubles or feature count grows, capacity might need to grow proportionally.	Don't propose bigger architectures without explicit overfit mitigation (heavy regularization, larger train data).
`R143`	2026-05-26	labeling	Different k_up/k_dn ratios (15/10 vs canonical 10/7) capture different price moves and might give cleaner GRU signal.	FAIL but cleaner than R141/R142: compound +6,363%, min α +15, 4/4 BH. The strategy stays coherent but 11× lower compound, 16× lower min α than canonical.	Labeling change doesn't completely break V66 (kept 4/4 BH) but heavily reduces alpha. tb_vol_10_7 (current canonical) is well-calibrated. Similar lesson as filter: V66 thresholds are tightly coupled to label distribution.	Re-test labels only if combined with V66 threshold grid search.	Confirms canonical labeling stays.
`R142`	2026-05-26	features	If loose filter destroys (R141), tight filter (~5% bars) should produce more confident predictions and more alpha.	ALSO FAIL: compound +38.3%, min α -216, 1/4 BH. Better than R141 (loose) but still catastrophic.	Event filter direction CLOSED BOTH WAYS. Canonical 90-percentile is a global optimum. V66 thresholds are TIGHTLY calibrated. Mechanism: tight filter shifts model output distribution to extremes; V66's hardcoded thresholds over/under-trigger.	Only re-explore filter + threshold joint grid search if a new paradigm motivates it.	Definitively close 'event filter tuning' as standalone direction.
`R141`	2026-05-26	features	More events (15-20% bars vs canonical 10.5%) → more training data → better predictions → more alpha.	CATASTROPHIC: compound +4.6% (vs canonical +70,576%), min α -220 (vs +245), 0/4 BH. Trade counts F0-F3: 2/72/33/66 (vs canonical 63/101/60/76).	val_loss of R141 (~0.50) was SAME as canonical. Only the BACKTEST revealed the disaster. Train metrics ≠ deployment performance. Mechanism: V66 thresholds calibrated for canonical-90 distribution; any input filter shift breaks V66 calibration.	If V66 thresholds are ever fully re-grid-searched, re-test relaxed filter under new thresholds.	Event filter direction CLOSED loose side. Don't loosen filter without re-tuning V66 thresholds (eb/el/er/nb/nl/nr).
`R135`	2026-05-25	infra	A bar-by-bar simulator with V66 cascade + portfolio routing matches paper-trade reality.	4 stacked bugs in initial version (selective trading, peak_drop on price, no trail ratchet, wrong params). Fixed in R136.	Bar-by-bar simulators are bug magnets. R136 fixed all 4. Foundation for R173 audit.	Already in use as R173 bug-fixed engine.	Bug-fixed simulator became R173 honest engine baseline.
`R132`	2026-05-24	feature-eng	Volatility-shock features approximate sentiment regime changes.	Marginal — small lift in val_loss but no consistent backtest improvement.	val_loss != PnL. New features without backtest gain are noise.	If MTF audit unlocks legitimate timeframe mixing.	Features kept available for future MTF exploration.
`R130`	2026-05-23	ta-only	Combine 2-3 TA indicators (RSI/MACD/BBands/etc) for standalone strategies.	None beat V66. Closes [[ta-paradigm-closed]] — ~25 TA combos tested across R117/R130 series.	TA-only strategies don't beat GRU+V66. Need model component.	Never. Only TA + model hybrids.	Definitively closes TA-only direction.
`R128`	2026-05-23	classifier	Sweep allocation weights guided by R121 predictions yields Pareto-better portfolios.	Many configs appeared to beat V66 — INVALIDATED by R134 (same circular label).	Inflated by R121 leak. R139 confirmed R128 family ~18% conservative (real winner: R134 τ=0.3).	Forward-labeled classifier with purged WF.	Sweep mechanics later applied to honest R134 sweep.
`R121`	2026-05-22	classifier	A 4-class classifier (bull/bear/sideways/chop) can route between regime-specific strategies.	Backtest looked promising but R134 forward-label audit later showed circular label leak.	Any classifier using future-aware labels inflates metrics. Need purged WF + forward labels.	Only with strict forward-label + purged WF protocol.	Drove R134 anti-leak protocol.
`R121b`	2026-05-22	classifier	Predict directly whether V66 will be in-position next bar.	Recall looked great in backtest — later proven phantom by R134 (circular label).	[[classifier-label-leak]] — circular labels invalidate any classifier metric without forward labels.	After re-training with forward labels.	Built tooling for R134 forward-label refactor.
`R119`	2026-05-21	mtf-filter	Adding a 4h context filter (EMA bullish, no extreme RSI, slope supportive, vol regime, or all combined) on top of V66 improves consistency.	—	Third confirmation (after R062 and R096) that multi-timeframe filters do NOT help once the V66 stack already includes the daily SMA filter. Lesson upgraded to a hard rule: do NOT add more TF filters to V66-class strategies. Future MTF ideas should be tested as the SOLE higher-TF signal, replacing the daily SMA rather than stacking with it.	Only revisit if removing the daily SMA filter, e.g., to test a pure 4h-based strategy.	Hard rule against TF-stacking; freed Colab budget for portfolio work (R120).
`R118`	2026-05-21	alt-model	A gradient-boosted-tree classifier (XGBoost) gives alpha independent from the GRU and can be stacked.	XGB+V66 compound 245.8% (vs GRU+V66 +70,576%); min alpha -128%; beats B&H in 0/4 folds.	Tree-based models can't compete with GRU on sequential 15min crypto data — they lack memory across the lookback window. XGBoost trained on summary-stat features could potentially work as a META-model on top of GRU outputs (stacking), but as a primary model it is decisively worse.	Re-test XGBoost as a META-stacker (input = GRU ensemble outputs, output = entry decision), not as a primary.	XGB pipeline; possible future use as meta-learner for ensemble gating.
`R098`	2026-04-19	features	Classical technical filters (RSI, MACD, volume) layered on top of the SMA hybrid will improve quality.	ALL hurt performance.	Classical TA indicators are highly correlated with what the GRU is already learning — adding them as gates over-constrains the system and removes profitable trades. The daily-SMA filter is special because it operates on a different timeframe; same-TF TA filters do not add orthogonal information.	Re-test ONLY when applied at a different timeframe than the model's input (e.g., daily RSI, weekly MACD).	Negative result — saves time on future RSI/MACD layering attempts.
`R097`	2026-04-18	exit-logic	An emergency exit when DD exceeds a threshold should preserve capital.	Hurts performance.	DD breakers exit at the WORST possible time — at peak DD, which is often just before recovery. In trending markets like BTC, the cost of missing the rebound vastly exceeds the cost of riding out a DD that the model would have managed via its normal exit logic. Same lesson as R082 DD protection.	Only worth revisiting under regulatory or psychological DD limits (e.g., $X max drawdown for fund mandates).	Negative result — locks the recommendation that DD limits should be position-sizing, not strategy-level circuit breakers.
`R096`	2026-04-18	mtf-filter	Confirming entries with multi-timeframe agreement (e.g., 15min + 1h direction match) reduces false signals.	Hurts performance.	Adding more confirmation gates on top of an already-filtered system removes good trades faster than bad ones (the filter chain becomes too strict). Already validated by R062. Multi-TF confirmation only helps when the base strategy is too liberal; once you have the daily SMA filter, additional MTF gates are subtractive.	Re-test only if the base strategy becomes more trade-frequent (e.g., after MACD_div integration).	Negative result repeats; locks in 'single-TF + daily filter' as canonical.
`R089-R092`	2026-04-16	architecture	LSTM, varying lookbacks, or 9-seed ensembles may outperform the 5-seed GRU baseline.	None beat the R084 baseline.	Diminishing returns past 5-seed GRU 128x2 lb=300. The architecture and ensemble size are at their local optimum given the data and labels. Spending more on architecture is wasted effort; gains have to come from STRATEGY (filters, exits, regimes), which is exactly the pivot to R094+.	Re-test if a new model family (Transformer, XGBoost stack from R118) joins the ensemble.	Negative results that justify locking architecture and refocusing search.
`R087`	2026-04-15	architecture	The 256x2 d=0.5 wd=1e-4 winner from R086 will hold up across seed sets.	256x2 d=0.5 wd=1e-4: worst +3,757%, only 1/3 sets 4/4. 128x2 d=0.5 wd=1e-4: worst +3,241%, 0/3 sets 4/4. Baseline 128x2 d=0.2 wd=1e-5 still best (+3,978% worst).	Higher regularization does NOT improve generalization in this setup — the model is already small enough and the data is large enough that mild reg is sufficient. The R086->R087 reversal is a permanent reminder that single-seed sweeps are unreliable. Baseline d=0.2 wd=1e-5 is the optimal config.	Re-test reg when dataset size doubles or when introducing new feature families.	Locks 128x2 d=0.2 wd=1e-5 as canonical production architecture.
`R085`	2026-04-14	features	Adding raw sigma and barrier distance as features should let the model adapt to volatility regimes.	ALL new features reduce compound by -24% to -95%. Consecutive + is_event catastrophic (-88%). Fixed scaling for sigma slightly better than MinMax but both hurt.	The model already captures volatility implicitly via ATR, returns and per-sequence normalization. Adding explicit sigma features creates redundancy that triggers overfit (early stopping at epoch 1). Principle: when normalization already encodes a quantity, do not add the raw value as a feature.	Sigma-like features worth re-testing only if removing the per-sequence MinMax normalization or moving to a stateless model.	Negative result protects future you from adding 'obvious' features that are already implicit.
`R082`	2026-04-12	production	Fine-tune thresholds (step 0.02) and add drawdown protection to get 3/3 seed sets passing 4/4.	Notebook prepared, but during R083 review the equity double-counting bug was found — invalidated all R068-R082 always-invested results (inflated ~2x).	Always recompute equity by simulating bar-by-bar including the closing transaction explicitly — do not add 'position_value' at the end. The R082 bug taught the team to never trust a backtest until the close-out logic has been audited line by line. Every later 'record' (R084, R110, R120) is recomputed with the fixed equity logic.	DD protection conclusively rejected later (R097); only revisit if a regulatory DD limit is imposed.	Grid search harness; lesson on equity accounting.
`R073-R080`	2026-04-11	labeling	There is a better labeling than tb3_vol — speed-weighted, 5-class, efficiency, DSR, filtered, next-event, RL meta-layer, self-distillation, conditional barriers, multi-resolution consensus, trend scanning, path-quality weighted.	ALL 13 methods FAIL to beat tb3_vol 3-class.	tb3_vol with 3 classes is a local optimum that is extremely hard to beat — and we now know why: (1) regression labels fail with GRU because CE concentrates gradients better; (2) timeouts (24% of labels) are essential information, not noise — methods that drop timeouts always underperform. The labeling search is closed; future gains have to come from features, exits, ensemble, or regime.	Only worth re-running with a fundamentally different model family (Transformer with hierarchical attention, diffusion, etc.) that may exploit a different label structure.	Permanent prior: don't search for new labels unless a new model family is also introduced.
`R063`	2026-04-09	exit-logic	Adding short trades using P(SL) and exiting on SL-signal should add alpha.	Shorts don't work (P(SL) not precise enough). SL-signal exit inferior to always-invested.	P(SL) is not symmetric with P(TP) — the model can predict 'this is bad' but not 'this will fall by X%'. Shorts have been tested multiple times and consistently fail; the recommendation is to treat BTC strategies as long-only until a dedicated short-only model is trained. Signal-based early exits also lose to staying invested.	Shorts worth revisiting only with a separately trained model whose labels are short-specific (mirror tb3_vol for the short side).	Confirmation that long-only is the right default.
`R062`	2026-04-09	mtf-filter	Adding 1h and 4h aggregated features should provide higher-timeframe context and improve decisions.	1h+4h features do NOT help. lb=300 already captures enough context.	When the lookback is already 300 event bars (~weeks of context), additional higher-timeframe features are redundant. The model is already seeing the trend. This negative result repeats in R119 (multi-timeframe FILTER) and is now a project-wide prior: don't stack timeframes on a model that already has a long lookback.	Multi-timeframe features worth revisiting only with much shorter lookbacks (<50 bars) or when the higher TF carries clearly orthogonal info (e.g., daily SMA trend filter, which DID work in R094).	Justification for single 15min timeframe in production.
`R037`	2026-04-03	validation	MFE/MAE with mse_ratio loss generalizes across 4 chronological folds spanning 2018-2026.	H120 r>1.5 ep2: +97.9% total, min fold +9.3%, 282 trades/fold, WR 54%, PF 1.49. Profitable in 4/4 folds. MaxDD 37%. Underperforms B&H in strong bulls.	First proof that walk-forward 4-fold robustness is achievable, and it also shows something important: a strategy that is profitable in every regime can still underperform B&H during mega-bulls because it is selectively long. From here on, 4/4 walk-forward becomes the minimum acceptance bar, and 'beats B&H compound' becomes the second.	MFE/MAE walk-forward worth re-running once we have a higher-trade-frequency mechanism to capture bull markets.	Walk-forward 4-fold harness; multi-horizon labeling pattern.
`R036`	2026-04-02	labeling	Predicting MFE (max favorable excursion) and MAE (max adverse excursion) as continuous values gives richer information than a binary win/lose label.	R036: +8.3% (H240, fold 5%), WR 69%, PF 1.70. R036b: +9.4% with TP=1.2 SL=0.7. R036c: -1.2% in bear vs B&H -23% — model learns when NOT to enter.	MFE/MAE is the most informative continuous target tried in the project, and the principle 'model that knows when not to trade' (drawdown protection through abstention) becomes a core later thesis. Decomposing P&L into upside and downside potential is a transferable idea worth keeping in any future labeling system.	MFE/MAE worth revisiting as an auxiliary head alongside the 3-class classifier, or as an exit-quality signal.	MFE/MAE label generator, DualHeadGRU class, abstain-in-bear behavior.
`R035`	2026-04-01	features	Going down to 1min should expose more microstructure signal and let the model trade more often.	Similar to 5min, no improvement; OOM issues from sequence size.	More bars is not more signal — 1min adds noise and infrastructure cost without alpha. The signal-to-noise ratio of crypto improves as you go up the timeframe (15min was eventually chosen). Memory becomes a real constraint when lookback*features explodes.	1min only worth revisiting for execution / micro-slippage modeling, never as a learning timeframe.	Confirms 5min as the lower bound; pushes the search toward 15min.
`R027-R033`	2026-03-31	labeling	After fixing the gap bug, rebuild credible results via swing labels, dual-head outputs, and continuous regression.	Honest results recovered. Regression has very high precision but is too selective for compounding.	Two principles emerged that stayed true: (1) very low trade count + very high WR is not a strategy — compounding needs frequency; (2) dual-head outputs that decompose 'how much up' vs 'how much down' carry more information than a single scalar and would later resurface as MFE/MAE. Always report trade count alongside WR/PF.	Continuous regression labels worth revisiting if combined with event filtering to get more trades without sacrificing precision.	Swing label code, dual-head architecture template, regression baseline for sanity checks.
`R004-R026`	2026-03-28	labeling	Scaling TP/SL/timeout by EWMA volatility (sigma) should produce labels that are meaningful across market regimes and unlock real alpha.	GRU beats LSTM consistently (R012). Multi-timeframe 15min aux gave +795% (R020-R022). Walk-forward 4-fold validated (R025). BUT all results pre-R027 were invalidated by the sequence-gap bug — true best honest config (tb_vol 10/7 alpha=24, GRU lb=20) was only +15.9%, 87 trades, WR 46%, PF 1.53.	Volatility-adaptive barriers are the right idea and stayed in every later paradigm — fixed barriers were never revived. GRU > LSTM was confirmed here as a permanent finding. The most important lesson is methodological: dropping timeout samples from training created gaps that the model couldn't reproduce live, inflating backtests ~60x. Always reconstruct the exact bar sequence the model will see in production.	Wider k_up/k_dn ratios worth re-sweeping only when paired with a new labeling paradigm; the parameterization is otherwise locked.	tb_vol label generator (k*sigma), GRU-as-default, 15min as best base timeframe, sigma clip discovery.
`R002`	2026-03-28	labeling	Disabling shorts, adding pos_weight in BCEWithLogitsLoss, lowering LR and raising dropout should fix the R001 overfit and produce real discrimination.	PnL -27.4%, 256 trades, WR 52.3%, PF 0.70 (worse than R001). Predictions collapsed into [0.35, 0.53] range, mean 0.467.	Class weights alone cannot rescue a model that has no discriminative signal — they just push the output distribution toward 0.5 without adding edge. WR going up while PF goes down is a classic 'more trades, smaller wins, bigger losses' trap. Regularization must be combined with a richer signal, not used as a substitute for one.	Class weights worth re-considering only when label imbalance is extreme (>80/20) and the base model already shows non-trivial AUC.	Long-only flag, BCEWithLogitsLoss + pos_weight scaffolding.
`R000`	2026-03-27	labeling	A simple binary up/down label over 12 candles (1h) is enough signal for an LSTM to learn directional moves.	PnL -36.5%, 31 trades, WR 45.2%, immediate overfit (val_loss 0.694 epoch 2).	Binary 'up/down in N bars' labels are essentially noise in 5min crypto — the signal-to-noise ratio is too low for sequence models to learn anything but the prior. Pure binary classification on raw direction is a dead end without a profitability-aware label (triple barrier). Always sanity-check label class balance and overfit speed in the first 2 epochs.	Only worth revisiting binary direction labels if combined with very strong event filtering or as an auxiliary head, not as the sole target.	Pipeline scaffolding, position sizing bug found and fixed in backtester.
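
Several rows above (R004-R026, R143, R155, R157) turn on how the `k_up`/`k_dn` multipliers of the volatility-scaled triple-barrier labels shape the class balance. The following is a minimal sketch of that labeling idea, not the project's actual `tb_vol` generator. It assumes `sigma` is a precomputed per-bar volatility expressed as a fractional return, that barriers are checked on closing prices only, and that `'TO'` marks a timeout. The function name, the toy price path, `sigma = 0.002`, and `horizon = 3` are all hypothetical.

```python
def triple_barrier_labels(prices, sigma, k_up, k_dn, horizon):
    """Label each bar 'TP', 'SL', or 'TO' (timeout) using volatility-scaled barriers."""
    labels = []
    # Bars without a full forward window of `horizon` bars are left unlabeled.
    for i in range(len(prices) - horizon):
        upper = prices[i] * (1.0 + k_up * sigma[i])  # take-profit barrier
        lower = prices[i] * (1.0 - k_dn * sigma[i])  # stop-loss barrier
        label = "TO"  # default: neither barrier is touched within the horizon
        for price in prices[i + 1 : i + 1 + horizon]:
            if price >= upper:
                label = "TP"
                break
            if price <= lower:
                label = "SL"
                break
        labels.append(label)
    return labels


# Toy inputs (assumed values, chosen only to make the effect visible).
prices = [100, 101, 99, 97, 102, 104, 100, 98]
sigma = [0.002] * len(prices)

canonical = triple_barrier_labels(prices, sigma, k_up=10, k_dn=7, horizon=3)
r155_style = triple_barrier_labels(prices, sigma, k_up=7, k_dn=10, horizon=3)

print(canonical)   # ['SL', 'SL', 'SL', 'TP', 'SL']
print(r155_style)  # ['SL', 'SL', 'SL', 'TP', 'TP']
```

On this toy path, moving the SL barrier farther away (`k_dn=10`) yields fewer SL labels than the 10/7 setting. That is the mechanism the R155 row names when it calls the asymmetry direction wrong.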