Evaluation Metrics for Quantitative Models and Portfolios

Quantitative trading relies on backtesting to iterate on investment strategies and improve out-of-sample portfolio performance. This article uses formulas together with code to introduce a selection of risk-return metrics for portfolios, and also discusses how to evaluate return-prediction models and risk-prediction models.

Portfolio Performance

Sharpe Ratio and Its Variants

Under the assumption that leverage can be adjusted freely, the absolute level of portfolio return or volatility has limited meaning. The classic metric for assessing both risk and return together is the Sharpe ratio:

$$ SR=\frac{E\left[R-R_{b}\right]}{\sqrt{\operatorname{var}\left[R-R_{b}\right]}}, $$

where \(R\) is the asset return and \(R_b\) is the risk-free return. To compare strategies with different trading frequencies, one usually computes the annualized Sharpe ratio. Under the assumption that daily excess returns \(r_t\) are i.i.d. and follow \(r_t \sim \mathcal{N}(\mu, \sigma^2)\), the annualized Sharpe ratio is

$$SR^a = \frac{\mu}{\sigma}\sqrt{q},$$

where \(q\) is the number of observation periods per year, such as 243 trading days.
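Under these assumptions the annualization is a one-liner. A minimal sketch (the function name is illustrative; it expects a sequence of per-period excess returns, and 243 follows the trading-day convention above):

```python
import numpy as np

def annualized_sharpe(excess_returns, periods_per_year=243):
    """Annualized Sharpe ratio of per-period excess returns, assuming the
    returns are i.i.d. so that mean and variance scale linearly with time."""
    r = np.asarray(excess_returns, dtype=float)
    return r.mean() / r.std(ddof=1) * np.sqrt(periods_per_year)
```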

Because long-only strategies are often highly correlated with the overall market, replacing the risk-free rate with market performance \(R_m\) gives a better measure of the strategy’s ability to earn excess return. This is called the information ratio:

$$ IR=\frac{E\left[R-R_{m}\right]}{\sqrt{\operatorname{var}\left[R-R_{m}\right]}}. $$

Volatility can also be split into upside and downside volatility. Investors care more about downside risk, so if only downside volatility is considered, we obtain the Sortino ratio:

$$ Sortino=\frac{E\left[R-R_{b}\right]}{\sqrt{E\left[\min\left(R-R_{b},0\right)^{2}\right]}}, $$

where the denominator, the downside deviation, accumulates only the periods in which the return falls short of the benchmark \(R_b\).
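A hedged sketch of the Sortino ratio (illustrative function name; the downside deviation is the root mean square of shortfalls below the target, with above-target periods counting as zero):

```python
import numpy as np

def sortino_ratio(returns, target=0.0, periods_per_year=243):
    """Sortino ratio: mean return above the target divided by the downside
    deviation (root mean square of the shortfalls below the target)."""
    r = np.asarray(returns, dtype=float)
    shortfall = np.minimum(r - target, 0.0)    # zero when above the target
    downside_dev = np.sqrt((shortfall ** 2).mean())
    return (r.mean() - target) / downside_dev * np.sqrt(periods_per_year)
```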

All of the above rely on the assumption that returns are independent over time. If returns are positively autocorrelated, true annualized volatility is higher than the \(\sqrt{q}\) scaling implies, so the Sharpe ratio should be deflated by a penalty factor \(\gamma \ge 1\) estimated from the sample autocorrelations:

$$ SR'=\frac{E\left[R-R_{b}\right]}{\gamma \sqrt{\operatorname{var}\left[R-R_{b}\right]}} $$
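One concrete choice of the autocorrelation penalty is Lo's (2002) correction, which replaces the \(\sqrt{q}\) scaling with \(q/\sqrt{q + 2\sum_{k}(q-k)\rho_k}\) using estimated autocorrelations \(\rho_k\). A hedged sketch (function name and `max_lag` cutoff are illustrative choices):

```python
import numpy as np

def lo_adjusted_sharpe(excess_returns, periods_per_year=243, max_lag=10):
    """Annualized Sharpe with Lo's (2002) autocorrelation correction: the
    sqrt(q) scaling becomes q / sqrt(q + 2 * sum_k (q - k) * rho_k), which
    reduces to sqrt(q) when all autocorrelations rho_k are zero."""
    r = np.asarray(excess_returns, dtype=float)
    q = periods_per_year
    sr = r.mean() / r.std(ddof=1)              # per-period Sharpe ratio
    d = r - r.mean()
    rho = np.array([d[:-k] @ d[k:] / (d @ d) for k in range(1, max_lag + 1)])
    lags = np.arange(1, max_lag + 1)
    return sr * q / np.sqrt(q + 2.0 * np.sum((q - lags) * rho))
```

For positively autocorrelated returns the adjusted value is smaller in magnitude than the naive \(\sqrt{q}\) annualization.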

Another approach is to skip explicit autocorrelation estimation and instead use the historical drawdown series as the risk input, which gives the ulcer performance index:

import numpy as np

def to_drawdown_series(returns):
    """Convert a return series into a drawdown series (zero at the peaks)."""
    wealth = np.cumprod(1.0 + np.asarray(returns, dtype=float))
    peak = np.maximum.accumulate(wealth)
    return wealth / peak - 1.0

def comp(returns):
    """Total compounded return of the series."""
    return np.prod(1.0 + np.asarray(returns, dtype=float)) - 1.0

def ulcer_index(returns):
    """Ulcer index (a downside risk measurement): root mean square
    of the drawdown series."""
    dd = to_drawdown_series(returns)
    return np.sqrt((dd ** 2).sum() / (len(dd) - 1))

def ulcer_performance_index(returns, rf=0):
    """Ulcer performance index: excess return divided by the ulcer index."""
    return (comp(returns) - rf) / ulcer_index(returns)

Another useful metric is the probabilistic Sharpe ratio (PSR), which expresses strategy quality as a probability; see the earlier article Probabilistic Sharpe Ratio PSR.
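For completeness, a hedged sketch of the PSR computation following Bailey and López de Prado (2012): the probability that the true Sharpe ratio exceeds a benchmark, given the observed per-period Sharpe and an adjustment for sample skewness and kurtosis (function name is illustrative):

```python
import math
import numpy as np

def probabilistic_sharpe_ratio(returns, sr_benchmark=0.0):
    """PSR: probability that the true Sharpe ratio exceeds sr_benchmark,
    based on the observed per-period Sharpe, adjusted for the sample
    skewness and kurtosis of the return series."""
    r = np.asarray(returns, dtype=float)
    n = r.shape[0]
    sr = r.mean() / r.std(ddof=1)
    z = (r - r.mean()) / r.std()
    skew = (z ** 3).mean()
    kurt = (z ** 4).mean()                     # equals 3 for a normal
    stat = (sr - sr_benchmark) * math.sqrt(n - 1) / math.sqrt(
        1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    return 0.5 * (1.0 + math.erf(stat / math.sqrt(2.0)))  # standard normal CDF
```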

Other Risk-Return Metrics

Gain to Pain Ratio is simple and intuitive: the sum of all periodic returns divided by the absolute value of the sum of the losing returns. A monthly GPR above 2.0 is generally considered excellent.
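A minimal sketch of the GPR (illustrative function name, Schwager's definition: total gains net of losses over total losses):

```python
import numpy as np

def gain_to_pain_ratio(returns):
    """Gain-to-Pain ratio: sum of all periodic returns divided by the
    absolute value of the sum of the losing returns."""
    r = np.asarray(returns, dtype=float)
    return r.sum() / abs(r[r < 0].sum())
```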

Calmar ratio is annualized return divided by maximum drawdown. Because it depends on a single worst historical interval, a long backtest window can leave the metric dominated by one episode and invite overfitting to it.
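A hedged sketch of the Calmar ratio (illustrative names; the annualized return here is the compounded return geometrically scaled to one year):

```python
import numpy as np

def max_drawdown(returns):
    """Maximum peak-to-trough drawdown of the compounded wealth curve,
    returned as a negative number."""
    wealth = np.cumprod(1.0 + np.asarray(returns, dtype=float))
    peak = np.maximum.accumulate(wealth)
    return (wealth / peak - 1.0).min()

def calmar_ratio(returns, periods_per_year=243):
    """Calmar ratio: annualized compounded return over |max drawdown|."""
    r = np.asarray(returns, dtype=float)
    ann = (1.0 + r).prod() ** (periods_per_year / r.shape[0]) - 1.0
    return ann / abs(max_drawdown(r))
```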

CPC index combines payoff ratio, win rate, and profit factor from a trading perspective.

def profit_factor(returns):
    """Profit factor: gross profits divided by gross losses."""
    return abs(returns[returns >= 0].sum() / returns[returns < 0].sum())

def win_rate(returns):
    """Share of nonzero periods with a positive return."""
    return (returns > 0).sum() / (returns != 0).sum()

def win_loss_ratio(returns):
    """Payoff ratio: average win divided by average loss."""
    return abs(returns[returns > 0].mean() / returns[returns < 0].mean())

def cpc_index(returns):
    """CPC index: profit factor * win rate * payoff (win/loss) ratio."""
    return profit_factor(returns) * win_rate(returns) * win_loss_ratio(returns)

Model Evaluation

Evaluation metrics for quantitative models can be grouped as follows:

  • error-based metrics: these assess how well the algorithm predicts the realized return \(y_t\) with forecast \(\hat y_t\) through measures of prediction error. Examples include mean squared error, mean absolute error, and their variants. Botchkarev (2018) provides a classification and analysis of error-based metrics for machine-learning regression.
  • accuracy-based metrics: these measure how accurately the algorithm assigns predicted returns to categories when compared with the realized return category ex post. The classification can be binary, such as positive expected return versus negative expected return or invest versus do not invest, or more complex. Hossin and Sulaiman (2015) survey evaluation metrics for data classification. These metrics are built on confusion matrices, correlation coefficients, and related tools, and include R, \(R^2\), accuracy, F1, precision, recall, and Matthews correlation coefficient.
  • portfolio-based metrics: these measure the outcome of investment strategies built from buy, hold, and sell signals generated by the algorithm. They can be subdivided into:
    • outcome-based metrics: measures such as annualized return, volatility, or maximum drawdown that do not adjust return and risk against each other
    • risk-adjusted return metrics: the risk-return benchmark metrics discussed above, which account for both return and risk and measure how efficiently the algorithm generates return under risk constraints and how well it optimizes the risk-return tradeoff. Different metrics mainly differ in how risk is measured. Examples include the Sharpe ratio, Sortino ratio, and Calmar ratio.
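The first two groups can be illustrated in one place. A minimal NumPy sketch (illustrative function name) computing two error-based metrics (MSE, MAE) and two accuracy-based metrics (sign accuracy and the Matthews correlation coefficient) after binarizing forecasts into positive versus negative expected return:

```python
import numpy as np

def error_and_accuracy_metrics(y_true, y_pred):
    """Error-based (MSE, MAE) and accuracy-based (sign accuracy, MCC)
    metrics for one cross-section of return forecasts."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = ((y_true - y_pred) ** 2).mean()
    mae = np.abs(y_true - y_pred).mean()
    # binarize into "positive expected return" vs "negative"
    t, p = y_true > 0, y_pred > 0
    tp = np.sum(t & p); tn = np.sum(~t & ~p)
    fp = np.sum(~t & p); fn = np.sum(t & ~p)
    acc = (tp + tn) / t.size
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return {"mse": mse, "mae": mae, "accuracy": acc, "mcc": mcc}
```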

Specifically, risk-model evaluation is relatively straightforward. Since the true covariance matrix is unobservable, one mainly uses the risk model to construct pure long-only or long-short minimum-variance portfolios and examines their realized volatility. For factor risk models, one can also compute the goodness of fit \(R^2\) of stock returns against the risk factors on each cross section. In addition, for absolute risk forecasts one can compute the bias statistic \(B\).
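The bias statistic is simple to compute: standardize each realized return by its ex-ante volatility forecast and take the standard deviation of the result. A minimal sketch (illustrative name; a well-calibrated risk model gives \(B \approx 1\)):

```python
import numpy as np

def bias_statistic(returns, predicted_vols):
    """Bias statistic B: std of returns standardized by the ex-ante
    volatility forecasts. B > 1 means risk was underforecast,
    B < 1 means it was overforecast."""
    z = (np.asarray(returns, dtype=float)
         / np.asarray(predicted_vols, dtype=float))
    return z.std(ddof=1)
```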

The output format of a return model is simpler than that of a risk model: for each asset and each period it produces one prediction, which is structurally similar to a factor value. But evaluating a return model is much harder. During portfolio construction, only the head of the ranked list materially affects realized return, so traditional model metrics based on global error, such as \(R^2\) and MSE, do not necessarily map to final portfolio performance. The practical route is to construct top-ranked portfolios and evaluate the model indirectly. In practice, one even has to consider the interaction between the return model and the risk model.
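The indirect evaluation can be sketched as a quantile-portfolio test: sort assets by predicted return, split them into groups, and compare each group's mean realized return, with the top-minus-bottom spread summarizing ranking power at the head and tail (function name and equal-weight grouping are illustrative assumptions):

```python
import numpy as np

def top_quantile_spread(pred, realized, n_groups=5):
    """Sort assets by predicted return (best first), split into quantile
    groups, and return each group's mean realized return together with
    the top-minus-bottom spread."""
    order = np.argsort(np.asarray(pred, dtype=float))[::-1]   # best first
    groups = np.array_split(np.asarray(realized, dtype=float)[order],
                            n_groups)
    means = np.array([g.mean() for g in groups])
    return means, means[0] - means[-1]
```

A model that ranks well yields monotonically decreasing group means and a clearly positive spread.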

https://en.heth.ink/QuantStat/

Author: YK
Posted on 2024-04-12 · Updated on 2024-04-12