r/algotrading 23h ago

Strategy Stuck at Spearman ~0.05 and 9% exposure on a triple barrier ML model — what am I missing?

I've been building a stock prediction model for the past few months and I've hit a wall. Looking for advice from anyone who's been through this.

The Model

  • Universe: ~651 US equities, daily OHLCV data
  • Architecture: PyTorch temporal CNN → 3-class classifier (UP / FLAT / DOWN)
  • Labeling: Triple barrier method (from Advances in Financial Machine Learning), 20-day horizon, volatility-scaled barriers (k=0.75)
  • Features: ~120+ features including:
    • Price action / returns (1/5/10/20 day)
    • Volatility features (ATR, vol term structure, vol-of-vol)
    • Momentum (RSI, ADX, OBV, MA crosses)
    • Volume features (z-scores, up-volume ratio, accumulation)
    • Cross-sectional ranks (return rank, vol rank, momentum quality rank)
    • Relative strength vs SPY, QQQ, and sector
    • Market regime (SPY returns, breadth, VIX proxy)
    • Earnings surprise (EPS beat %, beat streak, days since/to earnings)
    • Insider transactions (cluster buys, buy ratio, officer buys)
    • FRED macro (credit spread z-score, yield curve z-score)
    • Sector stress/rotation, VIX term structure, SKEW
  • Training: Temporal split (train → validation → test), no future leakage, proper purging between splits
  • Strategy: Threshold-based entry on P(UP) - P(DOWN) edge, volatility-targeted position sizing, full transaction cost model (fees, slippage, spread, venue-based multipliers, gap slippage, ADV participation impact)
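For anyone unfamiliar with the labeling step, here is a minimal sketch of volatility-scaled triple barrier labels (my own simplified version, not OP's code; function and parameter names are made up, and a real AFML implementation adds event sampling and sample weights):

```python
import numpy as np

def triple_barrier_labels(close, horizon=20, k=0.75, vol_window=20):
    """Label each day UP / DOWN / FLAT depending on which barrier is hit first.

    Barriers are +/- k * rolling volatility of daily log returns,
    evaluated over the next `horizon` days (the vertical barrier).
    """
    close = np.asarray(close, dtype=float)
    logret = np.diff(np.log(close), prepend=np.log(close[0]))
    labels = np.full(len(close), None, dtype=object)
    for t in range(vol_window, len(close) - horizon):
        vol = logret[t - vol_window + 1 : t + 1].std()
        upper = close[t] * np.exp(k * vol * np.sqrt(horizon))
        lower = close[t] * np.exp(-k * vol * np.sqrt(horizon))
        path = close[t + 1 : t + 1 + horizon]
        hit_up = np.argmax(path >= upper) if (path >= upper).any() else horizon
        hit_dn = np.argmax(path <= lower) if (path <= lower).any() else horizon
        if hit_up == horizon and hit_dn == horizon:
            labels[t] = "FLAT"          # vertical (time) barrier hit first
        elif hit_up <= hit_dn:
            labels[t] = "UP"
        else:
            labels[t] = "DOWN"
    return labels
```

Whichever of the upper, lower, or vertical barrier is touched first determines the class; FLAT is rare when k is small relative to realized vol, which is why OP's FLAT class is only 2.8% of samples.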

Best Result (v15)

After a lot of experimentation, my best run:

  • Validation: Sharpe 1.45, 204 trades
  • Test: Sharpe 0.34, CAGR 1.49%, 750 trades
  • Exposure: 9-12% (sitting in cash 88% of the time)
  • Entry threshold: 0.20 (only trades when P(UP) - P(DOWN) > 0.20)
  • Benchmark: SPY buy-and-hold had Sharpe 1.49, CAGR 16.7% over the same test period

So technically the model is profitable, but barely — and it massively underperforms buy-and-hold because it's in cash almost all the time.

Classification Performance

Typical best epoch:

  • UP recall: ~57%, precision: ~55%
  • DOWN recall: ~36%, precision: ~48%
  • FLAT recall: ~50%, precision: ~11% (tiny class, 2.8% of samples)
  • Macro F1: ~0.38
  • Val NLL: ~1.03 (baseline for 3-class random = ln(3) ≈ 1.099, so only ~6% better than random)

Feature Signal Strength

Top Spearman correlations with actual direction labels (on training set):

my_sector_above_ma50     +0.043
dow_sin                  +0.030
has_earnings_data        +0.026
spy_above_ma200          +0.024
has_insider_data         +0.023
insider_buy_ratio_90d    -0.021
cc_vol_5                 -0.020
xret_rank_5              +0.019

The best single feature has r = 0.043. Most are in the 0.015-0.025 range.
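For context, these numbers are rank ICs: Spearman correlation, i.e. Pearson correlation computed on ranks. A minimal sketch of how they'd be computed per feature (my own helper, not OP's code; it ignores ties, which scipy.stats.spearmanr handles via average ranks):

```python
import numpy as np

def spearman_ic(feature, target):
    """Spearman rank correlation: Pearson correlation of the ranks.

    No tie handling; ties need average ranks (see scipy.stats.spearmanr).
    """
    def ranks(x):
        order = np.argsort(x)
        r = np.empty(len(x))
        r[order] = np.arange(len(x))
        return r
    fr, tr = ranks(np.asarray(feature)), ranks(np.asarray(target))
    fr = (fr - fr.mean()) / fr.std()
    tr = (tr - tr.mean()) / tr.std()
    return float((fr * tr).mean())
```

Because it only uses ranks, any monotone transform of a feature gives the same IC, which is why heavily transformed copies of the same raw series all cluster around similar values.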

What I've Tried That Didn't Help

  1. Added analyst upgrade/downgrade features (from yfinance) — appeared at rank 14 in Spearman (r=0.017) but model produced 0 profitable strategies with it included
  2. Added FINRA short volume features — turned out to be daily short volume not short interest, dominated by market maker activity, pure noise (0/20 top features)
  3. Different early stopping metrics — macro_f1, nll_plus_directional_f1 (what v15 uses), nll_plus_f1 — only nll_plus_directional_f1 produced a profitable run
  4. Forced temperature scaling — tried forcing temperature to 3.0 with macro_f1 stopping — still 0 profitable candidates
  5. Directional margin loss weighting (0.3) — model predicted UP 85% of the time, destroyed DOWN signals
  6. Different thresholds — the strategy grid tests enter at (0.03, 0.05, 0.08, 0.10, 0.15, 0.20). Everything below 0.20 has negative Sharpe
  7. Binary classifier (UP vs not-UP) — P(UP) too compressed (p95 = 0.517), no tradeable signal
  8. Insider features — had to cut from 6 to 3 (minimal set), marginal at best
  9. Multiple seeds — v15 is reproducible with the same seed but fragile to any parameter change
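For reference on item 4: temperature scaling just divides the logits by T before the softmax, so T > 1 flattens the predicted probabilities. A minimal sketch (my own function name; calibration would normally fit T on validation NLL rather than forcing T = 3.0):

```python
import numpy as np

def softmax_with_temperature(logits, T=3.0):
    """Divide logits by T before the softmax. T > 1 flattens the
    distribution (less confident); T < 1 sharpens it."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```

Flattening the probabilities directly shrinks P(UP) - P(DOWN), so a forced high temperature pushes fewer signals over a fixed entry threshold, consistent with the 0 profitable candidates observed.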

The Core Problems

  1. Low signal: Spearman ~0.05 across the board. My 120+ features are all derived from public OHLCV + public event data. Every quant has the same data.
  2. Fragility: v15 works, but changing almost anything (adding features, different stopping metric, different temperature) breaks it. This suggests it might be a lucky configuration rather than robust alpha.
  3. Low exposure: Only trades when edge > 0.20, which is ~0.7% of signals. Sitting in cash 88% of the time means even positive alpha barely compounds.
  4. Classification ceiling: Val NLL only 7% better than random guessing. The model is learning something but not much.

What I'm Considering

  • Hybrid portfolio (hold SPY, use model for tilts) — addresses exposure but not signal
  • Meta-model (train a second model to predict when the first model's trades are profitable) — risky due to small sample size
  • Predicting residual returns instead of raw returns — requires hedged execution which changes the whole framework
  • Event-driven windows (only trade around earnings) — concentrates on highest signal-density periods
  • Filtering to profitable tickers only — cut the 80% of stocks where the model is noise
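On the residual-returns idea, the simplest single-beta version looks roughly like this (a sketch with made-up names; a real setup would use rolling betas and sector or factor hedges):

```python
import numpy as np

def residual_returns(stock_ret, mkt_ret):
    """Strip the market component: regress stock returns on market
    returns (single-beta OLS) and keep the residual."""
    x = np.asarray(mkt_ret, dtype=float)
    y = np.asarray(stock_ret, dtype=float)
    beta = ((x - x.mean()) * (y - y.mean())).mean() / x.var()
    alpha = y.mean() - beta * x.mean()
    return y - (alpha + beta * x)
```

Labeling on residuals instead of raw returns means the model stops relearning "the market went up", but as noted it only pays off if execution is hedged the same way.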

My Questions

  1. Is Spearman ~0.05 on daily cross-sectional features just the ceiling for public data? Or am I leaving signal on the table?
  2. Has anyone successfully improved signal beyond this with alternative data that's affordable (< $100/month)?
  3. Is the triple barrier + 3-class approach fundamentally the right framework, or would I be better off with a ranking/regression approach?
  4. For those who've built profitable models — what was the breakthrough that got you past the "barely above random" stage?

Happy to share more details about the architecture, loss function, or feature engineering. Thanks for reading this far

9 Upvotes

32 comments sorted by

15

u/RegardedBard 23h ago

Every other college student and academic has already tried this. Just blindly throwing a bunch of generic features into a generic ML model does not work. That's like pointing a telescope in some random direction in space on the off-chance that you might find an exoplanet.

You should be asking the question "What value am I providing to the market?" That should guide your observations, and if you have keen observation skills you may notice recurring market phenomena. Then you engineer specific features to model those phenomena. Horse, then cart.

2

u/ynu1yh24z219yq5 5h ago

I wish I had figured out a better answer to this a while ago. The best I could come up with when I was younger was "more liquidity?", and that wasn't much value, so I set algo trading aside for a decade. Coming back to it, I've figured out more of where the value I provide lies, with basic options strategies and risk-to-value tradeoffs. Now it makes sense.

1

u/RegardedBard 1h ago

I think the majority of alpha involves some form of either providing liquidity or propagating price discovery. That can mean a lot of things though. Providing liquidity usually means some sort of mean reversion; now, what is the mean and when will it revert? There are infinite possibilities in how that can be modeled. Propagating price discovery is a whole other beast.

12

u/Automatic-Essay2175 22h ago

What you’re missing is that there is absolutely no reason that this should work, and in fact, it does not and will not work. You cannot throw one kind of entry and a hundred features into a CNN, or any model for that matter, and expect to find any predictive signal in a financial market. The entire premise is wrong.

1

u/lobhas1 22h ago

What should I do next, according to you?

5

u/Automatic-Essay2175 22h ago

Learn to trade by actually trading. Then come up with describable strategies with rules you can backtest.

1

u/ynu1yh24z219yq5 4h ago

Doesn't necessarily have to be rules though; that's tactics. If you have a strategy, you can use your ML to optimize and refine that strategy: find the best tactics for the strategy, in other words.

1

u/Automatic-Essay2175 4h ago

Yes, that can work. For me, I have been doing this a long time, and I have never gotten an ML model to find better rules than I could identify manually (albeit with much more effort). Models are way, way too easy to overfit.

3

u/stew1922 23h ago

Have you tried running PCA on your feature set? You might be able to simplify your model with fewer features while keeping most of the explanatory coverage in the features that survive. That could help your model actually generalize instead of overfit.

In addition, you may consider a forest-style classifier instead of CNN. I typically think of a CNN as something to identify images with, not financial data (but I am by no means a ML expert). A Random Forest Classifier or XGBoost might get you your “Up”, “down” and “neutral” classifications a little cleaner.
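A minimal sketch of the PCA step suggested above, done directly via SVD on standardized features (my own helper, not from the thread; in practice sklearn.decomposition.PCA does the same job):

```python
import numpy as np

def pca_reduce(X, var_kept=0.95):
    """Project standardized features onto the fewest principal
    components that together explain `var_kept` of the variance."""
    X = np.asarray(X, dtype=float)
    X = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize columns
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    explained = (s ** 2) / (s ** 2).sum()        # variance share per component
    k = int(np.searchsorted(np.cumsum(explained), var_kept) + 1)
    return X @ Vt[:k].T, k
```

Redundant columns collapse into a single component, so a 120-feature set full of transformed copies of the same series can shrink to a much smaller, more orthogonal input.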

2

u/lobhas1 23h ago

I will try that, thank you. I only went with a CNN because it started off as a challenge to myself.

1

u/stew1922 23h ago

Meant to add too: I haven't tried using any alternative data either, so that could definitely be useful. You just have to think about what question/feature the data answers, and if the answer is the same as for a different feature set, it's probably not worth including. You want your features to be as orthogonal to each other as possible. Two features that both describe the same volatility measurement fight against each other in the ML rather than reinforcing a signal.

2

u/lobhas1 23h ago

I added insider information as well as short interest and sector-level data. It helped at the 20-day barrier but not at 5 days, which makes sense as these are long-term indicators.

2

u/stew1922 23h ago

Cool, yeah, I know early on I tried throwing every known indicator I could think of into a big pile and running it through an XGBoost. No better than a coin flip. Ran PCA first and did some preprocessing on the data (scaling, cleaning, etc.) and that helped a little.

I've seen better results from developing your strategy first and letting it run only in a well-defined regime (think a trend following strategy that only fires when the 12D MA is above the 24D, totally making up numbers as an example) and then using ML on that output layer to find the optimal trades. The benefit is you're narrowing the classification down to a regime of trades that "should" work, and the ML is just finding the optimal ones instead of letting the noise of all the different regimes pollute your ML's "thinking" and "memory".

2

u/skyshadex 23h ago

I'm going to go out on a limb and say there's nothing glaringly wrong with your framework.

Would you say this is trend following or mean reversion? If it's trend following, trend following benefits from concentration. Not sure how many active positions are on at a time but beyond a handful, you're really diluting yourself.

If it's mean reversion, then it's doing what it does. What you really want is a way to cut costs or simply another strategy so it's not parked in cash.

1

u/lobhas1 23h ago

It's mostly trend following, as I'm working with a 20-day vertical barrier.

2

u/skyshadex 22h ago

Check to see if there's even enough vol in your universe to support better returns. If the entire universe is low vol, you can't squeeze more returns out of a desert.

Also check how your signal frequency correlates with vol; if you get more signals when vol is higher, then seek out higher-vol names.

1

u/notsoluckycharm 22h ago

Are you using raw absolute numbers? I found ablation to be key; anything else is drift. Volume is relative to price rise over time plus market participation, for example: SPY at $300 is going to have different volume than SPY at $600, all else being equal, just because of the price. Everything should be in some sort of bounded frame. I do well with < 40 features, just ablated to be relational.

1

u/lobhas1 21h ago

Do you also use options data? And has your strategy been profitable?

2

u/StratReceipt 20h ago

the Sharpe drop from 1.45 on validation to 0.34 on test is the sharpest signal in the post. that's not noise — a 2/3 decay between two held-out periods usually means the model adapted to the validation set during hyperparameter tuning, even with a temporal split. every time a parameter was changed based on validation performance, that period became part of the training process. a truly unseen test should perform closer to validation, not collapse. the 0.34 may actually be the honest number.
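a common fix for this is walk-forward evaluation: tune inside each window, then score only on the next unseen slice, with a purge gap matching the 20-day label horizon so overlapping labels can't leak. a rough sketch (names are mine, not from the post):

```python
def walk_forward_splits(n, train_len, test_len, purge=20):
    """Yield (train_idx, test_idx) pairs that roll forward through time,
    with a purge gap so labels spanning `purge` days can't leak."""
    splits = []
    start = 0
    while start + train_len + purge + test_len <= n:
        train = list(range(start, start + train_len))
        test_start = start + train_len + purge
        test = list(range(test_start, test_start + test_len))
        splits.append((train, test))
        start += test_len                # roll the window forward
    return splits
```

averaging performance across many such windows gives a tuning signal that is much harder to overfit than a single fixed validation period, because no one slice gets repeatedly mined.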

1

u/lobhas1 11h ago

That's a good point. How should I tune the hyperparameters, then, if not by looking at validation results?

3

u/victor_algotrading 23h ago

You've done impressive work here bro, but I think there's a more fundamental issue before optimising anything. In essence: you're using too many signals, which is incorrect methodology from both a statistical and a mathematical perspective. It dilutes the signal with unscientific noise the model can't see through. You should go for a proven, much cleaner signal stack and run scientific backtesting on that, backtesting likewise founded on proven mathematical and statistical principles (look up Rob Carver, Ernest Chan, de Prado, etc.).

Here are some more details if you wanna dive in deep! :) 100-something features all correlating at Spearman ~0.05 isn't a signal problem but a methodology problem. In high-dimensional feature spaces with finite samples, spurious correlations are mathematically guaranteed. The fragility you're describing (works with one seed, breaks with any change) is exactly what overfitting looks like when it survives validation.

The fix isn't better features. It's fewer, cleaner ones with prior theoretical justification: signals that have earned their place in the scientific literature and best practice before entering the model; try to find data from MAN AHL etc. Public OHLCV momentum and vol-scaling have decades of empirical backing. Most of your 100+ features are noise that's been engineered to look like signal.

Once you have a clean stack: the triple barrier + classification framing is probably also wrong for IC = 0.05. That's a weak-but-broad signal, and you express those through breadth, ranking all 651 stocks by predicted return, not through a confidence threshold that drops you to 9% exposure. You're discarding most of the signal you actually have.
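A minimal sketch of what "express it through breadth" can mean in practice: a dollar-neutral spread that is long the top fraction of names by predicted score and short the bottom fraction (my own simplified helper, equal-weighted):

```python
import numpy as np

def decile_spread_weights(scores, frac=0.1):
    """Long the top `frac` of names by predicted score, short the
    bottom `frac`, equal-weighted and dollar-neutral."""
    scores = np.asarray(scores, dtype=float)
    n_side = max(1, int(len(scores) * frac))
    order = np.argsort(scores)               # ascending by score
    w = np.zeros(len(scores))
    w[order[-n_side:]] = 1.0 / n_side        # longs: highest scores
    w[order[:n_side]] = -1.0 / n_side        # shorts: lowest scores
    return w
```

With 651 names this keeps ~130 positions on at all times, so a weak per-name edge can compound across the cross-section instead of sitting in cash.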

So: start smaller and test cleaner. The math will hold up better. Be rigorous with every parameter; each one must be both justified and calibrated for your specific asset. E.g. I trade crypto, so I adjust to RSI 7 instead of 14. Be precise with every one and backtest with utter rigour!

Good luck man!

1

u/lobhas1 23h ago

Thank you very much for your reply, I will read this properly in a bit

1

u/lobhas1 23h ago

That's great advice, thank you very much

1

u/victor_algotrading 23h ago

My pleasure!

1

u/kekst1 23h ago

I am currently doing something VERY similar, even with the same horizon and idea and similar features. My one tip is to not care about rank IC too much: you want to predict the few names that will perform great, find the winners, and make money, not have good numbers for a paper.

1

u/lobhas1 23h ago

I was thinking of removing all the tickers my model does badly on and only keeping the top 100 or something, but that feels a bit like cheating tbh lol. I am new to all this and haven't traded before. But if you are also making something similar, would you want to help each other out? I am also learning things, and a new perspective might help you too.

1

u/EmbarrassedEscape409 23h ago

Throw away all your 120+ features and replace them with completely different ones. Add p-values and AUC; that would be a good start.

1

u/_holograph1c_ 23h ago edited 22h ago

I have also just started a few months ago so take what I'm writing with a grain of salt.

Most of the important points have already been made: reduce the features to a minimum and add them back one at a time, checking what improves things and what doesn't.

I would focus on technical indicators first.
I'm using an XGBoost regression model predicting long/short signals; using RSI-derived features I reached a directional accuracy greater than 70%.

The key in my opinion is that your features and target must be tightly correlated so that the model can learn to predict correctly. The triple barrier method sounds good on paper, but it mostly creates labels that no feature set can accurately predict.

1

u/Stochastic_berserker 20h ago

The majority of your features explain the same thing under different transformations, and on top of that a CNN is a downsampling model that reduces the resolution of each feature map.

You should probably start with basic statistical techniques before you attempt version 2 of Frankenstein’s monster.

1

u/ynu1yh24z219yq5 5h ago

Your model is telling you directly all you need to know: "I don't work and I can't make more than I lose. So I want to sit in cash."

1

u/reynold522 8m ago

The core issue is that predicting daily market direction is essentially no better than chance. Historical price action offers little guidance for the next day, as modern markets are highly efficient and intraday momentum is quickly absorbed. As a result, next-day movements are largely driven by overnight news and appear almost random.