Why Live Trading Underperforms Backtests: Detecting Backtest Overfitting from Multiple Testing
Backtesting is the core step in quantitative strategy research and development, and it is also the key difference between quantitative investing and traditional active investing. Backtesting means simulating the trades of a precisely specified investment strategy in a historical market environment and using the strategy's historical performance to infer its future performance, so that a final investment decision can be made by choosing among multiple candidate strategies. Starting from a brief introduction to backtest overfitting, this article discusses quantitative metrics for the historical performance of trading strategies, focusing on the false positives caused by multiple testing during backtesting. Quantitative trading uses computer programs to automate trading; its key difference from traditional active investing is its reliance on backtesting to validate a proposed strategy and estimate its expected performance. As computing power and algorithms have advanced, researchers have been able to run ever more backtests, and this multiple testing has led to frequent false positives: the seemingly best strategy over the backtest period fails to generalize to out-of-sample data. This phenomenon is called backtest overfitting. Estimating the probability of backtest overfitting, and adjusting expected performance metrics such as the Sharpe ratio so that they correctly reflect a strategy's true performance under multiple testing, has become an emerging research direction.
Introduction
Stock trading exploits market volatility, optimizing the return on invested capital by buying and selling company shares. Quantitative stock trading strategies are trading agents that automatically adjust positions in an attempt to beat the market, i.e., computer programs that decide which stocks to trade, at what price, and in what quantity. With the development of computer technology and continuing advances in machine learning algorithms and data availability, quantitative trading strategies have proliferated and the market continues to expand.
Backtesting is the core step in quantitative strategy research and development, and it is also the key difference between quantitative investing and traditional active investing. Backtesting means simulating the trades of a precisely specified investment strategy in a historical market environment and using the strategy's historical performance to infer its future performance, so that a final investment decision can be made by choosing among multiple candidate strategies.
However, the methodology of validating strategies through backtesting is being challenged. Computing power has grown rapidly, allowing researchers to run thousands of experiments on the same historical data, which raises the probability of finding false positives [@rupert2012simultaneous]. Specifically, if a researcher writes a strategy grounded in economic principles and it performs well in backtesting, the strategy is likely genuine. But if they write a set of trading rules at random and then search hundreds of thousands of hyperparameter configurations for the best backtest performance, they will eventually obtain a strategy that looks excellent in backtesting, can claim it is significantly effective, and may even deploy it in live trading. Such a strategy arises from randomness under multiple testing and captures no genuine profitable pattern in the market, so its future performance will inevitably decay sharply.
The phenomenon in which the future performance of an investment strategy is weaker than its historical backtest performance is called "backtest overfitting". When we consider backtest overfitting, the core concerns are: What is the probability that the best-performing candidate strategy is a spurious finding generated by multiple testing? And if the performance degradation of a strategy on out-of-sample data is inevitable, how should its backtest performance metrics be adjusted to reflect actual future performance more accurately? As backtest overfitting has received growing attention, a series of strategy evaluation algorithms have been proposed in recent years, but each has shortcomings. This article aims to propose and implement a new backtest overfitting evaluation algorithm that yields a more reasonable estimate of a strategy's future performance.
Quantitative trading and backtest overfitting
Quantitative trading is an automated software trading system that issues trading instructions via algorithms. Its goal is to use statistics, computer algorithms, and computing resources to minimize portfolio risk and maximize returns. With the development of computer technology and continuing advances in machine learning algorithms and data availability, quantitative trading is booming and its market size keeps expanding.
The workflow of quantitative trading consists of six stages: data collection, data preprocessing, transaction analysis, portfolio construction, backtesting and execution [@ta2018prediction]. The strategy backtesting stage is the key difference between quantitative trading and traditional investment management methods. After determining the specific trading rules and obtaining available historical data, strategy developers calculate the profit and loss of the strategy in the historical market environment through simulation [@chan2021quantitative]. The statistical values of the strategy during the backtest period, such as annualized return, Sharpe Ratio and Information Ratio, will become the basis for strategy evaluation and selection.
Generally speaking, the actual future performance of a quantitative strategy does not match its historical backtest performance; live performance is usually weaker. For example, [@suhonen2017quantifying] reproduced 215 alternative beta strategies that had been published at the time. The median Sharpe ratio across all strategies during the backtest period (before publication) was 1.20; on data after the backtest period, the median Sharpe ratio fell to 0.31, a decay of approximately 73%.
The literature [@bailey_pseudo-mathematics_2014] first discussed the phenomenon that this strategy performs well on in-sample data but performs poorly out-of-sample, and called it backtest overfitting. The in-sample (IS) data refers to the historical data used to develop this strategy. For data-driven strategies, it includes the training set, the validation set and all data used for backtesting. The out-of-sample (OOS) data refers to the data not used in strategy development, usually future data. “Overfitting” is a concept borrowed from the field of machine learning to describe a model that is optimized for specific samples without improving generalization performance.
The paper [@bailey_probability_2016] provides the first formal definition of backtest overfitting in the context of trading strategy selection. Consider the process of selecting the optimal strategy among strategies \(S_1,\dots,S_K\) through backtesting. For a given performance metric (such as the Sharpe ratio), the in-sample and out-of-sample performance of the strategies are represented by the vectors \(\mathbf{R} = (R_1, R_2, \dots, R_K)\) and \(\mathbf{\overline{R}} = (\overline{R}_1, \overline{R}_2, \dots, \overline{R}_K)\) respectively, and the vectors \(r, \overline{r}\) denote the ranks of the elements of \(\mathbf{R}, \mathbf{\overline{R}}\). For example, suppose 3 strategies take part in the backtest with the Sharpe ratio as the performance metric, with \(\mathbf{R}^c = (0.5, 1.1, 0.7)\) and \(\mathbf{\overline{R}}^c = (0.6, 0.7, 1.3)\); then \(r = (1,3,2)\) and \(\overline{r}=(1,2,3)\). In the context of this selection process, we can define backtest overfitting.
In a strategy set, if the out-of-sample performance of the strategy with the best in-sample performance is below the median, the strategy selection process is said to be overfit; formally, the expected out-of-sample rank of the in-sample optimum satisfies
$$ \sum_{i=1}^{K}{E[\overline{r_i}|r_i=K]Prob[r_i=K]} < K/2. $$
Quantitative strategy developers need not only to judge qualitatively whether a strategy is overfit, but also to quantify the probability of overfitting and the adjustment factor for the strategy's in-sample performance metric, so as to select models better. In the Bayesian sense, backtest overfitting is not a deterministic fact. We define the probability of backtest overfitting (PBO) as $$ PBO = \sum_{i=1}^{K}{Prob[\overline{r}_i < K/2 \mid r_i=K]\,Prob[r_i=K]}. $$
The fraction that must be subtracted from the in-sample Sharpe ratio to obtain a realistic estimate of future performance is defined as the haircut,
$$HC = 1 - \frac{E[SR_{OOS}]}{SR_{IS}}.$$
For the decay of a strategy's Sharpe ratio in live trading, the rule of thumb \(HC=0.5\) is commonly used in the industry [@harvey_backtesting_2015].
An important cause of backtest overfitting is multiple testing. When researchers perform multiple tests on the same data set, the probability of finding a false positive increases with the number of tests [@rupert2012simultaneous]. Under the influence of randomness, the in-sample excellence of an investment strategy then stems from fitting noise in the data rather than capturing genuine patterns, and out-of-sample performance decays. With the rapid growth of computing power and data availability, researchers can test a large number of trading strategies on limited financial time series, which greatly increases the probability of backtest overfitting. The literature [@bailey_pseudo-mathematics_2014] derives the relationship between the expected best result and the number of trials \(N\) when the Sharpe ratios of the trials are independent with \(\hat{SR} \sim \mathcal{N}(0,1)\),
$$ E\left[\max _{N}\right] \approx(1-\gamma) Z^{-1}\left[1-\frac{1}{N}\right] +\gamma Z^{-1}\left[1-\frac{1}{N} e^{-1}\right], $$
where \(Z\) is the cumulative distribution function of the standard normal distribution (so \(Z^{-1}\) is its quantile function) and \(\gamma \approx 0.5772156649\dots\) is the Euler–Mascheroni constant. When a researcher runs ten independent strategy backtests, even though the expected Sharpe ratio of every trial is zero, the expected in-sample Sharpe ratio of the best strategy reaches 1.57. Backtest overfitting caused by multiple testing is an urgent problem in quantitative trading; in recent years it has drawn increasing attention, and a series of algorithms have been proposed to assess its severity.
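The growth of the expected maximum Sharpe ratio with the number of trials can be checked numerically. The sketch below (standard-library Python; the function name is ours) implements the approximation above:

```python
import math
from statistics import NormalDist

def expected_max_sharpe(n: int) -> float:
    """Approximate E[max_N]: the expected maximum of n independent
    backtest Sharpe ratios, each ~ N(0, 1), per the formula above."""
    gamma = 0.5772156649015329   # Euler-Mascheroni constant
    z = NormalDist().inv_cdf     # quantile function Z^{-1}
    return (1 - gamma) * z(1 - 1 / n) + gamma * z(1 - math.exp(-1) / n)

print(f"{expected_max_sharpe(10):.2f}")  # ten trials of pure noise: 1.57
```

With ten skill-less trials the best backtest already shows a Sharpe ratio of about 1.57, matching the figure quoted in the text.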
Strategy performance and backtest overfitting evaluation algorithm
Traditional method
Sharpe Ratio (SR)
The Sharpe ratio (SR) [@Sharpe1994The; @1966Mutual] is a classic metric for evaluating portfolio performance. The ex-ante Sharpe ratio is defined as the ratio of the expected excess return (return in excess of a risk-free asset, such as government bonds) to its standard deviation,
$$SR=\frac{E\left[R-R_{b}\right]}{\sqrt{\operatorname{var}\left[R-R_{b}\right]}},$$
where \(R\) is the asset return and \(R_b\) the risk-free return. To compare strategies with different trading frequencies, the Sharpe ratio is usually annualized. Under the assumption that daily excess returns \(r_t\) are independently and identically distributed with \(r_t \sim \mathcal{N}(\mu, \sigma^2)\), the annualized Sharpe ratio [@lo2002statistics] is
$$SR^a = \frac{\mu}{\sigma}\sqrt{q},$$
where \(q\) is the number of observation periods per year, for example 243 trading days per year.
In quantitative research, the ex-ante Sharpe ratio of a strategy cannot be observed in advance, so the strategy's realized performance during the backtest period is used to compute the ex-post Sharpe ratio as its performance measure. The ex-post Sharpe ratio has the same formula as the ex-ante Sharpe ratio but is computed from realized returns,
$$\widehat{SR}^a = \frac{\hat{\mu}}{\hat{\sigma}}\sqrt{q},$$
where \(\hat{\mu}, \hat{\sigma}\) are the sample mean and sample standard deviation respectively. It is generally believed that \(\widehat{SR}>1\) indicates good strategy performance.
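As a minimal sketch (standard-library Python; the function and parameter names are ours), the ex-post annualized Sharpe ratio can be computed directly from a series of excess returns:

```python
import math
from statistics import fmean, stdev

def annualized_sharpe(excess_returns, q=243):
    """Ex-post annualized Sharpe ratio: sample mean over sample
    standard deviation, scaled by sqrt(q) periods per year."""
    mu_hat = fmean(excess_returns)
    sigma_hat = stdev(excess_returns)  # (n - 1)-denominator sample std
    return mu_hat / sigma_hat * math.sqrt(q)
```

Under the i.i.d. assumption, quadrupling the sampling frequency \(q\) doubles the annualized ratio, which the \(\sqrt{q}\) factor makes explicit.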
Method Based on the t-Test
The traditional way to test the significance of a single strategy's performance is the t-test. For example, suppose the return of strategy \(S_c\) over \(T\) observation periods has mean \(\mu_c\) and standard deviation \(\sigma_c\). The t-value is
$$ TR = \frac{\mu_c}{\sigma_c}\sqrt{T},$$
and the two-sided p-value is $$ p^S = Prob[|X|>TR], $$
where \(X\) is a random variable following the t distribution with \(T-1\) degrees of freedom. The implicit assumption here is that returns are normally distributed. If the computed p-value \(p^S\) is small enough, say 0.01, the t-test concludes that the strategy's positive return is statistically significant.
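A sketch of the test in standard-library Python (names are ours); for the large \(T\) typical of daily return series the t distribution is close to normal, so we substitute the normal CDF for the exact t CDF, an approximation on our part:

```python
import math
from statistics import NormalDist, fmean, stdev

def t_stat(returns):
    """t-value TR = (mean / std) * sqrt(T) of a strategy's returns,
    consistent with the annualized-Sharpe conversion below."""
    T = len(returns)
    return fmean(returns) / stdev(returns) * math.sqrt(T)

def two_sided_p(tr):
    """Two-sided p-value Prob[|X| > TR], using the standard normal
    distribution as a large-T approximation to the t distribution."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(tr)))
```

For an exact small-sample t CDF one would reach for a statistics library instead of this approximation.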
For a given t value, the required minimum annualized Sharpe ratio can be calculated,
$$\widehat{SR}_k^a = \frac{\mu_k}{\sigma_k}\sqrt{q} = TR_k \sqrt{\frac{q}{T}},$$
Through this formula, the annualized Sharpe ratio and the t-value can be converted into each other.
Probabilistic Sharpe Ratio (PSR)
The paper [@lo2002statistics] proves that, under the assumption of normally distributed returns, as the number of return observations \(T \rightarrow \infty\) during the backtest period, the ex-post Sharpe ratio \(\widehat{SR}\) converges asymptotically to a normal distribution,
$$\widehat{S R} \stackrel{a}{\longrightarrow} \mathcal{N}\left[S R, \frac{1+\frac{S R^{2}}{2 q}}{T}\right].$$
[@mertens2002comments] extended the above conclusion, showing that when returns are independently and identically distributed but not necessarily normal, the ex-post Sharpe ratio still asymptotically follows a normal distribution: $$ \widehat{S R} \stackrel{a}{\rightarrow} \mathcal{N}\left(SR, \frac{1+\frac{1}{2} S R^{2}-\gamma_{3} S R+\frac{\gamma_{4}-3}{4} S R^{2}}{T}\right), $$
where \(\gamma_3\) is the skewness and \(\gamma_4\) the kurtosis of returns.
Based on this conclusion, [@bailey_sharpe_2012] proposed the Probabilistic Sharpe Ratio (PSR), defined as the probability that the strategy's true future Sharpe ratio \(SR\) exceeds a given benchmark \(SR^*\) (which can be set to 0),
$$ \begin{aligned} \widehat{P S R}\left(S R^{*}\right) & = \operatorname{Prob}[S R \leq \widehat{S R}] \\ & = \int_{-\infty}^{\widehat{S R}} z\left(S R \mid S R^{*}, \hat{\sigma}_{\widehat{S R}}\right) \cdot d S R \\ & = Z\left[\frac{\left(\widehat{S R}-S R^{*}\right) \sqrt{T-1}}{\sqrt{1-\widehat{\gamma}_{3} \widehat{S R}+\frac{\hat{\gamma}_{4}-1}{4} \widehat{S R}^{2}}}\right], \end{aligned} $$
where \(Z\) represents the cumulative distribution function of the standard normal distribution.
PSR is a strategy performance metric framed in probabilistic terms. Compared with using \(\widehat{SR}\) directly as an estimate of the future SR, it considers not only the size of \(\widehat{SR}\) but also the impact of the return structure (skewness, kurtosis) and the sample size on the distribution of \(SR\). Specifically, for a given benchmark \(SR^*\): the higher the strategy's backtest \(\widehat{SR}\), the larger the number of observations \(T\), the larger the return skewness, and the smaller the kurtosis, the greater the probability that the strategy's true SR exceeds the benchmark, i.e., the higher the PSR.
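The PSR formula transcribes directly into code (standard-library Python; argument names are ours, and skewness and kurtosis are passed in rather than estimated from data):

```python
import math
from statistics import NormalDist

def psr(sr_hat, sr_star, T, skew, kurt):
    """Probabilistic Sharpe Ratio: estimated probability that the true
    SR exceeds the benchmark sr_star, given the observed (per-period,
    non-annualized) Sharpe ratio sr_hat over T observations."""
    num = (sr_hat - sr_star) * math.sqrt(T - 1)
    den = math.sqrt(1 - skew * sr_hat + (kurt - 1) / 4 * sr_hat ** 2)
    return NormalDist().cdf(num / den)
```

With \(\widehat{SR}=SR^*\) the PSR is exactly 0.5, and for a positive \(\widehat{SR}\) it rises with the number of observations, matching the discussion above.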
Improved traditional methods for the backtest overfitting problem
Improved method based on the t-test
[@harvey_backtesting_2015] pointed out that the p-value formula for a single strategy's Sharpe ratio cannot reflect the process by which researchers backtest K candidate strategies until a significant result is found, and proposed a p-value formula that accounts for multiple testing,
$$p^M = Prob[\max\{|X_i|, i=1,\dots,K\}>TR],$$
where \(X_i, i=1,\dots,K\) are K random variables following the t distribution with \(T-1\) degrees of freedom. Given the single-test p-values \(p_i^S, i=1,\dots,K\) computed from the performance of the K strategies, several correction methods can convert them into p-values \(p^M\) that account for multiple testing. The Bonferroni correction scales \(p^S\) by K:
$$p^{Bonferroni}_i = min\{Kp^S_i,1 \} ,i=1,...,K,$$
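The Bonferroni adjustment above is a one-liner (sketch in Python; the function name is ours):

```python
def bonferroni(p_values):
    """Scale each single-test p-value by the number of tests K,
    capping at 1, per the formula above."""
    K = len(p_values)
    return [min(K * p, 1.0) for p in p_values]

print(bonferroni([0.25, 0.5]))  # [0.5, 1.0]
```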
Similar p-value adjustment methods include the Holm correction and the BHY procedure. These methods essentially impose stricter thresholds on each single p-value as the number of tests increases.
Deflated Sharpe Ratio (DSR)
The PSR formula also fails to account for the fact that, under multiple testing, the expectation of the best \(\widehat{SR}\) grows with the number of trials, biasing the estimate of the strategy's true SR. In [@bailey_deflated_2014], the authors therefore combined the relationship between \(E[\max_i \widehat{SR}_i]\) and the number of trials to propose the Deflated Sharpe Ratio (DSR),
$$ \widehat{D S R} \equiv \widehat{P S R}\left(\widehat{S R}_{0}\right)=Z\left[\frac{\left(\widehat{S R}-\widehat{S R}_{0}\right) \sqrt{T-1}}{\sqrt{1-\widehat{\gamma}_{3} \widehat{S R}+\frac{\hat{\gamma}_{4}-1}{4} \widehat{S R}^{2}}}\right], $$
where
$$\widehat{S R}_{0}=\sqrt{Var\left[\left\{\widehat{S R}_{n}\right\}\right]}\left((1-\gamma) Z^{-1}\left[1-\frac{1}{N}\right]+\gamma Z^{-1}\left[1-\frac{1}{N} e^{-1}\right]\right),$$
and N is the number of independent trials. Essentially, DSR replaces the benchmark \(SR^*=0\) in PSR with a threshold that depends on the number of trials N and the variance of \(\widehat{SR}\) across trials. The DSR value represents the probability that the true SR exceeds 0 after accounting for multiple testing.
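The benchmark \(\widehat{SR}_0\) can be sketched as follows (standard-library Python; the function name and arguments are ours — it takes the cross-trial variance of the observed Sharpe ratios and the number of independent trials N):

```python
import math
from statistics import NormalDist

def dsr_benchmark(sr_variance, n_trials):
    """SR0: the expected maximum Sharpe ratio of n_trials skill-less
    trials, scaled by the cross-trial standard deviation of SR-hat."""
    gamma = 0.5772156649015329   # Euler-Mascheroni constant
    z = NormalDist().inv_cdf
    return math.sqrt(sr_variance) * (
        (1 - gamma) * z(1 - 1 / n_trials)
        + gamma * z(1 - math.exp(-1) / n_trials)
    )
```

Feeding this \(\widehat{SR}_0\) into the PSR formula in place of \(SR^*\) yields the DSR.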
Combined Symmetric Cross Validation (CSCV)
Combinatorially symmetric cross-validation (CSCV) [@bailey_probability_2016] is a PBO estimation method proposed under the PBO framework. Its overall idea is to repeat the model selection process many times, count how often backtest overfitting occurs, use that frequency as the expectation of PBO, and at the same time fit the relationship between \(SR_{IS}\) and \(SR_{OOS}\). Concretely, the data set is split many times to simulate the process of selecting strategies by in-sample performance.
CSCV divides the return matrix \(R^{T \times K}\) into S blocks; half of the blocks form the simulated in-sample data \(R^{T/2 \times K}\) and the other half the simulated out-of-sample data \(R^{T/2 \times K}\), giving \(\binom{S}{S/2}\) in-sample/out-of-sample combinations. The strategy selection process is simulated once for each combination: by the definition of backtest overfitting, if the best in-sample strategy performs below the median out-of-sample, that trial is counted as backtest overfitting. After all combinations have been tested, the frequency of backtest overfitting is used as the estimate of PBO, and the relationship between \(SR_{IS}\) and \(SR_{OOS}\) is tallied as an estimate of the future performance degradation of the best strategy.
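The procedure can be sketched in a few dozen lines of standard-library Python (names are ours, and ties in out-of-sample Sharpe ratios are broken arbitrarily — a simplification):

```python
import itertools
from statistics import fmean, stdev

def cscv_pbo(R, S=8):
    """CSCV estimate of PBO for a T x K return matrix R (list of rows).
    Rows are split into S blocks; for each half/half block combination,
    the strategy with the best simulated in-sample Sharpe ratio is
    checked against its simulated out-of-sample rank."""
    T, K = len(R), len(R[0])
    blocks = [R[i * T // S:(i + 1) * T // S] for i in range(S)]

    def sharpe(rows, k):
        xs = [row[k] for row in rows]
        s = stdev(xs)
        return fmean(xs) / s if s > 0 else 0.0

    overfit = trials = 0
    for in_set in itertools.combinations(range(S), S // 2):
        is_rows = [r for i in in_set for r in blocks[i]]
        oos_rows = [r for i in range(S) if i not in in_set for r in blocks[i]]
        best = max(range(K), key=lambda k: sharpe(is_rows, k))
        oos = [sharpe(oos_rows, k) for k in range(K)]
        rank = sorted(oos).index(oos[best]) + 1   # 1 = worst OOS
        overfit += rank <= K / 2                  # below the median
        trials += 1
    return overfit / trials
```

For a strategy set in which one strategy genuinely dominates, the in-sample winner also wins out-of-sample in every split and the estimated PBO is 0; for indistinguishable noise strategies it approaches 0.5.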
The drawback of CSCV is that it yields only a PBO estimate for a set of candidate strategies and cannot evaluate a single strategy. Moreover, its prediction of the relationship between \(SR_{IS}\) and \(SR_{OOS}\) is clearly biased: for a group of strategies with the same mean return, the strategy that looks best in-sample in each split is likely to perform poorly out-of-sample, so \(SR_{OOS}\) is negatively correlated with \(SR_{IS}\).
In China, the study [@Xu Rui 2020 based on combined symmetric cross-validation] reproduced the CSCV method and examined its application in the A-share market.
Method based on Bayesian inference
[@witzany_bayesian_2021] proposes an algorithm that uses Bayesian inference to evaluate backtest overfitting. The paper models the strategies' return-generating process with random variables and uses Markov chain Monte Carlo (MCMC) to sample from the posterior distribution of the parameters. Based on the samples, a new return matrix is generated by Monte Carlo simulation, the backtesting and live-trading process is simulated, and metrics such as PBO and the haircut are computed from the simulated data.
The paper proposes two models. The first assumes that each row of the return matrix \(R^{T \times K}\) follows a multivariate normal distribution,
$$R_t=[r_{t,1},r_{t,2}...r_{t,K}] \sim \mathcal{N}(\mu,\Sigma).$$
The second model improves on the first by introducing latent variables \(\gamma_i \sim Ber(p)\), each following a Bernoulli distribution, to indicate whether strategy \(i\) is a true discovery:
$$R_t=[r_{t,1},r_{t,2}...r_{t,K}] \sim \mathcal{N}(\mu^*,\Sigma),$$
where$$\mu^*_i=\gamma_i \mu_i.$$
A major improvement of this Bayesian method over the methods above is the introduction of the covariance matrix \(\Sigma\), which accounts for correlation among strategies. In the actual strategy development process, because parameter searches are run on the same model, strategy backtest results are often highly correlated. However, the assumption that each period's multi-strategy returns follow a multivariate normal distribution departs considerably from reality: a quantitative trading strategy dynamically adjusts positions based on market information and is not a static portfolio.
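Since the full MCMC machinery is beyond a sketch, the generative side of the second model can be illustrated with a diagonal covariance (a simplification of \(\Sigma\); function and argument names are ours):

```python
import random

def simulate_return_matrix(mu, sigma, p, T, seed=42):
    """Draw a T x K return matrix from the second model, simplified to
    a diagonal covariance: gamma_i ~ Bernoulli(p) flags strategy i as
    a true discovery, and its returns are N(gamma_i * mu_i, sigma_i^2)."""
    rng = random.Random(seed)
    K = len(mu)
    gamma = [1 if rng.random() < p else 0 for _ in range(K)]
    R = [[rng.gauss(gamma[k] * mu[k], sigma[k]) for k in range(K)]
         for _ in range(T)]
    return gamma, R
```

Matrices simulated this way can be fed to a PBO estimator to gauge how often pure-noise strategies (those with \(\gamma_i=0\)) win the in-sample selection.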