Is My Backtest Overfit? We Ran the Gauntlet on Our Own Strategies

STS Research | Published June 13, 2026 | Updated June 17, 2026

A backtest is overfit when it describes the past instead of an edge. The clean way to find out is to run a few standard tests, and the honest way is to run them on your own system and publish what they say. We did that on our five NQ strategies. On their own, four of the five come back statistically soft: their Harvey-Liu t-stats sit below 3.0, the rough line for "this looks real." We trade all five anyway, because together they clear that line with room to spare, and every one of the five still made money on data the backtest never saw.

Below are the tests that matter, the t-stat for each strategy, and why four strategies that look soft alone still earn their seat. The book as a whole scores a t-stat of 4.9 against a hurdle of 3.0, so the softness is at the strategy level, not the portfolio level. We measure every t-stat on the same per-trade ruler, so the strategy numbers and the book's 4.9 are directly comparable.

Book t-stat (hurdle is 3.0): 4.9

Strategies that clear t over 3 alone: 1 of 5

Profitable on unseen 2026 data: 5 of 5

Book Deflated Sharpe (need 95% to pass): 96.2%

Whose trades are these (read this before using our numbers)

Everything below is measured on our own book: five systematic NQ strategies that run together, one position at a time (we call the combined five "the book"). The numbers come from TradingView backtests from June 2011 to June 2026, trading one to three contracts scaled by volatility, with commissions and slippage included, for $1,120,402 net. The entries are momentum and trend-continuation, intraday plus one overnight model. They are not mean reversion and not scalping.

That style shapes the test results. A momentum book lives or dies on a handful of big trending moves, which makes its statistics noisier and its overfit tests harder to pass than a high-win-rate system would be. Our exact t-stats describe our system. The method, running these tests on your own strategy before you trust it, transfers to any trader, including a discretionary one who wants to know whether a rule is real or just a story the chart told. The numbers themselves do not transfer.

Overfitting, in one plain idea

A strategy has thousands of dials: entry time, stop width, which indicator, what threshold. Turn enough dials and you can make almost any rule look great on past data, because you are fitting the noise, not the signal. That is overfitting, also called curve fitting. The tell is simple. An overfit strategy looks brilliant in the backtest and falls apart the moment it meets new data.

So the question "is my backtest overfit?" really means: how much of this result is edge, and how much is me having tried a lot of things until one looked good? You cannot answer that by staring at the equity curve. A curve-fit strategy and a real one can have the same beautiful curve. You answer it with tests built to punish the trying.

The four tests that actually matter

There are dozens of overfit checks. Four carry most of the weight, and each attacks the problem from a different side.

The Harvey-Liu t-stat. This is the headline test. The t-stat measures how far your average result sits from zero, scaled by how noisy your results are and how many of them you have. A high t-stat means the profit is unlikely to be luck. Campbell Harvey and Yan Liu, in their 2014 paper on the flood of published trading "factors," argued that the usual bar of 2.0 is far too soft once you account for how many strategies people test before publishing one. Their tougher line is a t-stat above 3.0. We use 3.0 as the pass mark, as they recommend.
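A minimal sketch of a per-trade t-stat, assuming the input is a plain list of per-trade percent returns. The function name and the sample trades are hypothetical, and this is an illustration of the standard one-sample formula (mean divided by standard error), not the script behind the numbers in this article.

```python
import math
import statistics

def per_trade_t_stat(returns):
    """t-stat of the mean per-trade return against zero."""
    n = len(returns)
    if n < 2:
        raise ValueError("need at least two trades")
    mean = statistics.fmean(returns)
    stdev = statistics.stdev(returns)      # sample standard deviation (n - 1)
    return mean / (stdev / math.sqrt(n))   # mean divided by its standard error

# Hypothetical per-trade percent returns, for illustration only.
trades = [1.2, -0.5, 0.8, -0.3, 2.1, -0.7, 0.4, -0.2, 1.5, -0.6]
t = per_trade_t_stat(trades)
print(f"t = {t:.2f}, clears 3.0: {t > 3.0}")
```

Because the standard error shrinks with the square root of the trade count, the same average edge scores a higher t-stat on 3,000 trades than on 30.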

The Deflated Sharpe Ratio (DSR). A Sharpe ratio (return divided by how bumpy that return is) rewards smooth returns, but it is easy to inflate by testing many variations and reporting the best one. The DSR, from Marcos Lopez de Prado, takes your Sharpe and deflates it by how many configurations you tried and how skewed and fat-tailed your returns are. It answers: given all the trying, what is the chance this Sharpe is genuinely above zero? Higher is better. A whole diversified book can push it toward 100%, but a single optimization-heavy strategy rarely gets there.
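The deflation can be sketched as follows, using the closed-form approximation published by Bailey and Lopez de Prado. Everything passed in is an assumption you have to supply: `sharpe` and `sharpe_std` are per-observation (not annualized), `sharpe_std` is the spread of Sharpe ratios across the configurations you tried, and `kurt` is raw kurtosis (3 for a normal distribution). The function names and example inputs are hypothetical.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_NORMAL = NormalDist()

def expected_max_sharpe(n_trials, sharpe_std):
    """Approximate Sharpe the best of n_trials zero-edge configurations would show."""
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2")
    a = _NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    b = _NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return sharpe_std * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)

def deflated_sharpe(sharpe, n_obs, skew, kurt, n_trials, sharpe_std):
    """Probability that the true Sharpe beats the luck benchmark from n_trials."""
    benchmark = expected_max_sharpe(n_trials, sharpe_std)
    # Skew and fat tails widen the uncertainty around the observed Sharpe.
    spread = math.sqrt(1.0 - skew * sharpe + (kurt - 1.0) / 4.0 * sharpe ** 2)
    z = (sharpe - benchmark) * math.sqrt(n_obs - 1) / spread
    return _NORMAL.cdf(z)

# Hypothetical inputs, for illustration only.
few = deflated_sharpe(0.10, 3000, 0.5, 6.0, n_trials=10, sharpe_std=0.02)
many = deflated_sharpe(0.10, 3000, 0.5, 6.0, n_trials=3000, sharpe_std=0.02)
print(f"10 trials: {few:.1%}, 3000 trials: {many:.1%}")
```

The same observed Sharpe scores lower as the trial count rises, which is why an honest trial count matters more than any other input.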

Walk-forward, also called out-of-sample testing. Split history into a part the strategy was built on (in-sample) and a part it never touched (out-of-sample). If the edge only shows up in-sample, it was fit to that period. A real edge keeps working out-of-sample. This is the most intuitive test and the hardest to fake. If you only ever run one of these, run this one.
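A minimal sketch of the split, assuming each trade is an `(exit_date, pnl)` pair and using profit factor (gross wins divided by gross losses) as the yardstick. The helper names, the cutoff, and the trades are hypothetical.

```python
from datetime import date

def profit_factor(pnls):
    """Gross wins divided by gross losses; above 1.0 means net profitable."""
    wins = sum(p for p in pnls if p > 0)
    losses = -sum(p for p in pnls if p < 0)
    if losses == 0:
        return float("inf") if wins > 0 else float("nan")
    return wins / losses

def split_by_date(trades, cutoff):
    """Split (exit_date, pnl) pairs into in-sample (before cutoff) and out-of-sample."""
    in_sample = [pnl for d, pnl in trades if d < cutoff]
    out_sample = [pnl for d, pnl in trades if d >= cutoff]
    return in_sample, out_sample

# Hypothetical trades, for illustration only.
trades = [
    (date(2025, 10, 3), 500.0),
    (date(2025, 11, 14), -200.0),
    (date(2025, 12, 9), 300.0),
    (date(2026, 1, 12), -100.0),
    (date(2026, 2, 20), 400.0),
    (date(2026, 3, 5), -150.0),
]
ins, oos = split_by_date(trades, date(2026, 1, 1))
print(f"in-sample PF {profit_factor(ins):.2f}, out-of-sample PF {profit_factor(oos):.2f}")
```

The split only means something if the cutoff is fixed before any tuning and the out-of-sample trades are never used to pick settings.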

Probability of Backtest Overfitting (PBO). From the same Lopez de Prado line of work, this one is clever. It chops your history into many blocks, and across thousands of combinations it asks how often the configuration that looked best in-sample landed in the bottom half out-of-sample. If your "best" settings are really just luck, they will drop into the bottom half a lot. PBO is the share of times they do. Lower is better; under 20% is the usual pass.
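The block-shuffling idea can be sketched in a few lines. This is a simplified version that assumes you already have one mean return per configuration per time block, and it ranks configurations by plain average return where the published method uses Sharpe ratio. The function name and the example grid are hypothetical.

```python
from itertools import combinations
from statistics import fmean

def pbo(block_returns):
    """block_returns[c][b] = mean return of configuration c in time block b.

    Returns the share of in-sample/out-of-sample splits where the in-sample
    winner lands in the bottom half out-of-sample.
    """
    n_configs = len(block_returns)
    n_blocks = len(block_returns[0])
    if n_blocks % 2 or n_configs < 2:
        raise ValueError("need an even number of blocks and at least two configurations")
    flips = total = 0
    for in_blocks in combinations(range(n_blocks), n_blocks // 2):
        out_blocks = [b for b in range(n_blocks) if b not in in_blocks]
        in_score = [fmean(r[b] for b in in_blocks) for r in block_returns]
        out_score = [fmean(r[b] for b in out_blocks) for r in block_returns]
        best = max(range(n_configs), key=in_score.__getitem__)
        # Relative out-of-sample rank of the in-sample winner, strictly inside (0, 1).
        rank = sum(score <= out_score[best] for score in out_score)
        omega = rank / (n_configs + 1)
        flips += omega <= 0.5
        total += 1
    return flips / total

# Hypothetical grid: three configurations across four time blocks.
grid = [
    [1.0, 1.0, -1.0, -1.0],   # great early, poor late
    [-1.0, -1.0, 1.0, 1.0],   # the reverse
    [0.0, 0.0, 0.0, 0.0],     # flat throughout
]
print(f"PBO = {pbo(grid):.2f}")
```

With real data you would use more blocks (the number of splits grows quickly) and every configuration from the sweep, including the ones that looked bad.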

Passing all four does not promise a strategy will make money. It only lowers the odds you are fooling yourself.

The takeaway

No single test proves an edge. The t-stat asks if profit beats luck, the Deflated Sharpe punishes you for trying many versions, walk-forward demands it work on unseen data, and PBO checks whether your best settings were a fluke. We ran them on our own strategies and the book passed, even though, as the next section shows, four of the five strategies did not clear the t-stat line on their own.

Four of our five strategies fall short of the t-stat hurdle on their own

Here is the part most signal sellers would bury. When we run the gauntlet on each strategy by itself, on the same per-trade ruler the book uses, only one of the five clears the t-stat-above-3 line. The other four fall short. The last column shows the profit factor (gross wins divided by gross losses; above 1.0 means the strategy made money) on data the backtest never saw.

Strategy	Direction	t-stat	Out-of-sample profit factor
Trend	Long	3.28	1.93
Long ORB	Long	2.92	1.46
Short	Short	2.61	6.69
Overnight	Long	2.93	2.30
Intraday	Long & Short	2.54	1.58

Figure: bar chart of the Harvey-Liu t-stat for each of our five NQ strategies (standalone backtests, 2011 to 2026, per-trade basis) against the real-edge hurdle of 3.0, shown as a dashed line. Only Trend at 3.28 clears the hurdle; Overnight 2.93, Long ORB 2.92, Short 2.61, and Intraday 2.54 fall below it.

By the strict standalone test, four of the five are not solo edges, which is exactly why we trade them as one book.

Notice the short. Earlier versions of this book leaned on a fragile short sleeve that only paid off in crashes. We removed it. The short that remains here is a convex, below-VWAP momentum short. On its own its t-stat is a soft 2.61, yet it posts the strongest out-of-sample result by far, a 6.69 profit factor. That gap is the whole lesson: a low solo t-stat does not mean no edge; it means a noisy one that leans on the rest of the book. Only Trend, at 3.28, clears the line alone. Read plainly: four of these five are optimization-sensitive on their own. None is fabricated, but none except Trend is a bulletproof, bet-the-house standalone edge.

Why a statistically soft strategy can still belong in the book

If four of the five fall short of the t-stat line alone, why are they still running? Because the test they fail is the standalone test, and we do not trade these strategies standalone. They run together, one position at a time, and the value of a strategy inside a portfolio is not the same as its value alone.

Take Intraday, the softest of the five at 2.54. It is the only sleeve that trades both directions, long and short, from the same open-anchored trend logic. That is the job it does that the others cannot. The other four sleeves each trade one direction only; Intraday fills in the days the others sit out, and it barely moves with them. Across the five strategies the average pairwise correlation of daily results is about 0.11, near zero, and Intraday is part of why. A sleeve that adds low-correlated return in both directions earns its seat even when its solo t-stat is soft.
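A minimal sketch of an average pairwise correlation, assuming each strategy's daily results are aligned on the same dates and that days without a trade count as zero. Both are assumptions of this sketch, and the function names and sample data are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def average_pairwise_correlation(daily_pnl):
    """daily_pnl maps strategy name -> list of daily results on shared dates."""
    pairs = list(combinations(daily_pnl, 2))
    return sum(pearson(daily_pnl[a], daily_pnl[b]) for a, b in pairs) / len(pairs)

# Hypothetical daily results for three sleeves; 0.0 marks a day with no trade.
sample = {
    "A": [120.0, 0.0, -80.0, 200.0, 0.0, -40.0],
    "B": [0.0, 90.0, 0.0, -60.0, 150.0, 0.0],
    "C": [30.0, -20.0, 0.0, 0.0, 60.0, -10.0],
}
print(f"average pairwise correlation: {average_pairwise_correlation(sample):.2f}")
```

How you treat no-trade days changes the answer, so state that choice alongside any correlation you report.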

That is the portfolio lesson under all of this. A strategy can look statistically weak by itself and still steady the whole book, because the test rewards solo profit, not diversification. Drop the four soft sleeves and you do not have a safer book, you have Trend alone, with none of the diversification that lifts the combined t-stat to 4.9. We keep all five.

On data the backtest never saw, all five held up

The single most convincing overfit test is also the simplest: does it work on data you did not build it on? We carved out January 2026 onward as a true out-of-sample window, market data that did not exist when these strategies were designed. It is the only stretch of history none of these strategies could have been fit to. We measured each one cold.

All five made money, including the four that look soft on the standalone t-stat.

Strategy	Out-of-sample profit factor	Made money?
Trend	1.93	Yes
Long ORB	1.46	Yes
Short	6.69	Yes
Overnight	2.30	Yes
Intraday	1.58	Yes

Figure: bar chart of out-of-sample profit factor for each of the five NQ strategies, January 2026 to June 2026, with a dashed break-even line at 1.0. All five bars sit above break-even: Trend 1.93, Long ORB 1.46, Short 6.69, Overnight 2.30, Intraday 1.58.

Profit factor above 1.0 means a strategy made money in the period. On data none of them had seen, all five cleared the line, including the four that look soft on the standalone t-stat.

These are small samples, 8 to 53 trades each in this short window, so read them as a check, not a verdict. The high 6.69 sits on the fewest trades, so lean on it least. The soft strategies are soft, not broken. They held their edge on genuinely unseen data, which is the test an overfit strategy fails.

The book passes every test the lone strategies struggle with

Put the five together and the picture flips. Because the strategies are nearly uncorrelated (their daily results barely move together, average pairwise correlation about 0.11) and because the longs and the short make their money in different conditions, the combined book is far steadier than any single piece.
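To see why low correlation matters so much, here is the textbook idealized case: n return streams with the same edge, the same volatility, and one shared pairwise correlation, equally weighted. That is a strong simplification. Our book runs one position at a time and the sleeves are not equal, so this sketch does not reproduce the 4.9; it only shows the direction and rough size of the effect. The function name is hypothetical.

```python
import math

def diversification_multiplier(n, rho):
    """Sharpe (or t-stat) lift from equal-weighting n equally good, equally
    volatile return streams that share a pairwise correlation of rho."""
    return math.sqrt(n / (1.0 + (n - 1) * rho))

print(round(diversification_multiplier(5, 0.11), 2))  # → 1.86
print(round(diversification_multiplier(5, 0.80), 2))  # → 1.09
```

At a correlation of 0.11, five idealized sleeves nearly double the solo figure; at 0.80 the same five add almost nothing.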

The portfolio t-stat is 4.9, comfortably past the 3.0 hurdle. The book Deflated Sharpe is 96.2%, past the 95% pass line, after deflating for a high trial count. The edge has also strengthened over time, not faded: across non-overlapping four-year eras, the book's profit factor rises from 1.06 in the 2011-2014 block through 1.21 and 1.46 to 1.92 in 2023-2026, the opposite of what a curve fit does as it ages. And the book's beta to buy-and-hold NQ is just 0.20, so most of the return is not the index in disguise. No single sleeve matches those numbers on its own. The diversification is doing real work, and these tests are how we measure it.
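Beta is the slope you get from regressing the book's returns on the index's returns over the same periods. A minimal sketch, assuming two aligned lists of period returns; the function name and sample numbers are hypothetical.

```python
def beta(strategy_returns, index_returns):
    """Covariance of strategy and index returns divided by index variance."""
    if len(strategy_returns) != len(index_returns):
        raise ValueError("series must be the same length")
    n = len(index_returns)
    mean_s = sum(strategy_returns) / n
    mean_i = sum(index_returns) / n
    cov = sum((s - mean_s) * (i - mean_i)
              for s, i in zip(strategy_returns, index_returns))
    var = sum((i - mean_i) ** 2 for i in index_returns)
    return cov / var

# Hypothetical daily percent returns, for illustration only.
index = [1.0, -2.0, 3.0, 0.5, -1.5]
book = [0.4, -0.1, 0.5, 0.3, -0.2]
print(f"beta = {beta(book, index):.2f}")
```

A beta near zero says the strategy's good and bad days are mostly not the index's good and bad days; a beta near 1.0 says you could have bought the index instead.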

This is also why we do not sell individual strategies. A single sleeve is an ingredient. The product is the diversified book, because that is the thing the statistics actually support.

Two proof points we keep on the shelf

Two stories from our own logs make the point harder than any test does.

We once ran roughly 1,100 backtests trying to improve our short strategy: different stops, exit times, entry triggers, volatility filters, sizing schemes. Nothing beat the production version inside the book. Several variations looked healthier on their own and every one of them hurt the portfolio. That is overfitting caught in the act. The "improvements" were fitting the past, and the discipline of testing them inside the full book, not alone, is what exposed them. All that trying is also why we deflate the Sharpe by a high trial count. Every one of those attempts is a reason to trust a single pretty result less.

The second is a bug. In June 2026 we found our live book was firing two trades on a single bar when it should hold one position at a time. We fixed it. The fix cost about $110,000 of backtested profit, because honoring one position at a time means turning down trades the buggy version took. We could have quietly kept the bigger number. We changed the number instead. A backtest you are willing to make worse for the sake of accuracy is a backtest you can trust a little more.
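For readers checking their own combined trade lists, a one-position-at-a-time rule can be sketched as a simple filter. This is an illustration, not the fix we shipped: it assumes each trade is an `(entry_bar, exit_bar, pnl)` tuple with integer bar indexes, and it breaks same-bar ties by keeping whichever trade is listed first, which is a hypothetical choice.

```python
def enforce_one_position(trades):
    """Keep a trade only if the previously kept trade has already exited."""
    kept = []
    flat_at = None  # bar at which the book is next flat
    for entry, exit_bar, pnl in sorted(trades, key=lambda t: t[0]):
        if flat_at is None or entry >= flat_at:
            kept.append((entry, exit_bar, pnl))
            flat_at = exit_bar
    return kept

# Hypothetical merged trade list: two entries fire on bar 1.
merged = [(1, 5, 100.0), (1, 3, 50.0), (5, 8, -20.0), (6, 9, 70.0)]
filtered = enforce_one_position(merged)
print(sum(p for _, _, p in merged), sum(p for _, _, p in filtered))
```

The filtered total is lower than the raw total here, which is the same trade-off described above: the accurate number is the smaller one.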

What to do with this

To check your own strategy, do not trust the equity curve and do not trust a single test. Run the four together. Hold the t-stat to 3.0, not 2.0. Deflate your Sharpe by the number of versions you actually tried, and be honest about that count. Carve off the most recent stretch of history, never let the strategy see it, and check that the edge survives. And if you ran a parameter sweep, run PBO to see how often your best settings would have flopped out-of-sample.

Then judge each piece at the level it lives at. A strategy can fall short of the solo test and still earn its seat in a portfolio, the way our both-directions Intraday sleeve does, but only if you can show what job it does that the others cannot. If you cannot name that job, the weak strategy is probably just weak, and the honest move is to simplify it or set it aside. See how the five fit together in our NQ trading strategies, or read the stop-width study for another case where the obvious backtest answer was the overfit one.

How we measured this

Instrument: CME Nasdaq-100 E-mini (NQ), $100,000 initial capital, no compounding, one to three contracts scaled by volatility. Data: TradingView list-of-trades exports from our live five-strategy book, June 2011 through June 2026, 3,505 combined trades. The per-strategy gauntlet runs on each strategy's standalone export; the t-stats (per-trade percent-return basis, the same ruler as the book's 4.9, so the two are directly comparable) and out-of-sample profit factors are produced by our own scripts and reproduce on any TradingView export of the same form.

The tests have honest limits. The Deflated Sharpe depends on an assumed number of configurations tested; we report the t-stat (which does not) as the primary verdict, and the book Deflated Sharpe of 96.2% uses a high trial count (3,000) so the deflation is conservative, not flattering. The out-of-sample window (January 2026 onward) is short by design; a five-month window is suggestive, not proof, and the per-strategy out-of-sample samples are small (8 to 53 trades each), so we will extend them as time passes. Strategy backtests before about 2011 are small-sample and we do not lean on them. These are hypothetical backtest results, not live fills. The underlying exports are our proprietary trade history, so we cannot publish the raw files, but the methods reproduce on any export, and our book-level numbers reconcile to the full tear sheet and the strategy page. The same gauntlet results sit alongside the live record on our NQ futures signals page. Plans are on the pricing page.

We trade this book live and sell access to the signals, so judge the data accordingly. This article is educational and is not investment advice. Futures trading involves substantial risk of loss and is not suitable for every investor.

Hypothetical performance disclaimer (CFTC Rule 4.41): hypothetical or simulated performance results have certain limitations. Unlike an actual performance record, simulated results do not represent actual trading. Also, since the trades have not been executed, the results may have under- or over-compensated for the impact, if any, of certain market factors, such as lack of liquidity. Simulated trading programs in general are also subject to the fact that they are designed with the benefit of hindsight. No representation is being made that any account will or is likely to achieve profit or losses similar to those shown. Past performance does not indicate future results.