The Role of Advanced Models in Performance Boosting
Published in Automated Trader Magazine Issue 08 Q1 2008
In the second part of a two-part article, David Aronson, President of Hood River Research, examines the modelling techniques used in arriving at a valuable predictor set for boosting ‘raw’ trading model performance.
The development of a boosting model is a two-stage process. The first is discovering which, if any, of the large list of candidate indicators proposed for consideration are helpful in predicting signal returns. The second is establishing the shape of the surface that best describes the relationship between the selected indicators and signal returns. One widely used modelling technique is multiple linear regression, which assumes that the shape of the surface is linear (flat, with no hills and valleys). Only the slope of the surface with respect to each axis (i.e. the weight of each indicator) is left open to discovery. However, modern data modelling techniques have allowed the constraining assumption of a flat surface to be eliminated. This allows the modelling procedure to discover the most appropriate shape for the model's hyper-surface.
Advanced modelling vs. linear regression
As powerful as multiple linear regression is relative to the intuitive judgment of human experts, greater predictive power can be attained with more advanced modelling methods that are not constrained by the simplifying assumption of linearity. This creates the opportunity for more accurate predictions of signal outcomes. Advanced methods such as kernel regression can detect complex non-linear relationships. Figures 1 to 5 illustrate how more sophisticated non-linear modelling differs from traditional linear regression. For simplicity, the illustrations depict a single indicator (Xi) on the horizontal axis and the return earned by the signal on the vertical axis. Figure 1 shows the true functional relationship between signal returns and Xi. This relationship is unknown in the sense that the true shape of the function sought in any predictive modelling problem is, by definition, unknown and must be inferred from an observed sample of data. Note that the relationship is not linear.
In Figure 2, a sample of trading signals is shown, with each point representing a single signal. The position on the vertical axis is the return earned by the signal, and the position on the horizontal axis is the value of indicator Xi at the time the signal was given. For purposes of illustration the relationship between Xi and signal return is obvious; a relationship this clear is unlikely to exist in practice.
Figure 3 shows the model surface that results from modelling this data with linear regression. This linear model is too simple: it wrongly assumes that the model surface is flat throughout the range of the predictor variable Xi, causing the model to make systematic errors (i.e. the model is biased). In some ranges of Xi the model's predictions of signal return are systematically too low, while in other ranges the predictions are systematically too high. Systematic errors are symptomatic of a model surface that does not accurately represent the underlying functional relationship between the indicators and signal return.
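To make this bias concrete, here is a minimal Python sketch (on entirely synthetic data, not the article's sample) that fits a straight line to a parabolic relationship and shows the residuals changing sign systematically across regions of Xi:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic sample: signal returns depend on indicator Xi through a
# non-linear (parabolic) relationship plus noise, as in Figures 1 and 2.
x = rng.uniform(-1.0, 1.0, 200)
y = 1.0 - x**2 + rng.normal(0.0, 0.1, 200)

# Ordinary least squares: the best flat (linear) surface.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Bias shows up as residuals whose sign depends on the region of Xi:
# the line under-predicts in the middle and over-predicts at the edges.
for lo, hi in [(-1.0, -0.5), (-0.5, 0.5), (0.5, 1.0)]:
    mask = (x >= lo) & (x < hi)
    print(f"Xi in [{lo:+.1f}, {hi:+.1f}): mean residual = {residuals[mask].mean():+.4f}")
```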
A reasonable solution would be to propose a more complex model by visual inspection, such as a parabola. However, whether the model is linear, quadratic (parabolic), cubic (Xi raised to the third power) or any other form in which the functional relationship is assumed prior to analysis, it is constrained to adopt the assumed form. This is perfectly legitimate when there is well-established theory suggesting what the correct functional form should be.
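When theory does justify a particular form, imposing it is straightforward. A brief sketch, again on synthetic data, assuming a quadratic form a priori:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 200)
y = 1.0 - x**2 + rng.normal(0.0, 0.1, 200)   # synthetic parabolic data

# The quadratic form is assumed prior to analysis; only its three
# coefficients are left open to discovery.
coeffs = np.polyfit(x, y, deg=2)
predicted = np.polyval(coeffs, x)
print("fitted coefficients (a, b, c):", np.round(coeffs, 3))
print("in-sample MSE:", round(float(np.mean((y - predicted) ** 2)), 4))
```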
However, for many phenomena characterised by high complexity and high randomness, such as financial markets, there is no well-established theory to support the choice of a particular functional form. In these situations, constraining the analysis to an assumed functional form is too limiting. Instead, an approach is needed that does not require the shape of the model's surface to be assumed in advance: in other words, non-parametric modelling.
Adapting non-parametric models
One example of non-parametric modelling is kernel regression. Figure 4 shows the model surface that would be obtained by applying kernel regression to the sample data. Note that the discovered model surface conforms closely to the true relationship depicted in Figure 1. Kernel regression discovers the correct shape of the surface by estimating the value of the dependent variable Y (signal return) within small local ranges along the Xi axis. The simplest approach to kernel regression takes an average of the Y values in each local Xi neighbourhood. This becomes the altitude of the model surface in that region of Xi. A more sophisticated kernel method fits linear models to each small Xi neighbourhood.
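The locally averaged variant described above is often called the Nadaraya-Watson estimator. A minimal sketch, assuming a Gaussian kernel and an arbitrary bandwidth (both illustrative choices, not prescribed by the article):

```python
import numpy as np

def kernel_regression(x_train, y_train, x_query, bandwidth=0.1):
    """Nadaraya-Watson estimator: predict Y at each query point as a
    weighted average of training Y values, with weights that fall off
    with distance along the Xi axis (Gaussian kernel here)."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    preds = np.empty(len(x_query))
    for i, xq in enumerate(np.asarray(x_query, dtype=float)):
        weights = np.exp(-0.5 * ((x_train - xq) / bandwidth) ** 2)
        preds[i] = np.sum(weights * y_train) / np.sum(weights)
    return preds

# Synthetic data with a non-linear shape, as in Figure 2.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-1.0, 1.0, 200))
y = 1.0 - x**2 + rng.normal(0.0, 0.1, 200)

grid = np.linspace(-0.9, 0.9, 7)
print(np.round(kernel_regression(x, y, grid), 3))  # traces the curved surface
```

The bandwidth plays the same role as the 'bend' of the surface discussed below: too small and the surface chases noise, too large and it flattens toward the linear case.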
Kernel regression is one of a family of advanced modelling methods that do not impose the constraint of a particular functional form. The flexibility of these techniques allows the model surface to conform to the true underlying relationships in the data, but if not properly constrained it can cause the model to 'overfit' the data. The model's surface then becomes contaminated with random effects, describing not only the authentic relationship between signal returns and indicator readings but also the random variation in the particular sample of data used to develop the performance-boosting model (see Figure 5).
Overfitting can be mitigated by using cross-validation, which involves breaking the historical sample into several subsets: training, testing and evaluation. Starting with the simplest possible model, i.e. a single indicator and a flat model surface, the modelling procedure tests a progression of increasingly complex models utilising more indicators and surfaces that bend ever more closely to the data. At the same time, the modelling procedure cross-validates between the training and testing sets to discover the degree of complexity that yields the most accurate predictions.
The training and testing subsets are used to discover which of the candidate indicators warrant a place in the model and how much the model's surface should be allowed to bend in order to fit the data without becoming overfitted. Overfitting is detected when an increase in model complexity improves fit on the training set but degrades fit on the testing set. When the model of optimal complexity has finally been discovered, then and only then is it applied to the evaluation data. This third subset did not participate in the modelling process, so it provides an unbiased estimate of the model's true efficacy. An illustration of these concepts is shown in Figure 6.
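A compact sketch of this search, using polynomial degree as a stand-in for model complexity (the article's procedure also searches over candidate indicators; the data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, 600)
y = 1.0 - x**2 + rng.normal(0.0, 0.2, 600)   # synthetic signal returns

# Three disjoint subsets, as described above.
x_train, y_train = x[:300], y[:300]
x_test,  y_test  = x[300:450], y[300:450]
x_eval,  y_eval  = x[450:],   y[450:]

def mse(coeffs, xs, ys):
    return np.mean((ys - np.polyval(coeffs, xs)) ** 2)

# Step up model complexity (polynomial degree stands in for how much
# the surface may bend); keep the degree that predicts the TEST set best.
best_deg, best_test_mse, best_fit = None, np.inf, None
for degree in range(0, 9):
    fit = np.polyfit(x_train, y_train, degree)
    train_err, test_err = mse(fit, x_train, y_train), mse(fit, x_test, y_test)
    print(f"degree {degree}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")
    if test_err < best_test_mse:
        best_deg, best_test_mse, best_fit = degree, test_err, fit

# Only the chosen model touches the evaluation set: an unbiased estimate.
print(f"chosen degree {best_deg}, evaluation MSE {mse(best_fit, x_eval, y_eval):.4f}")
```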
For clarity, the foregoing illustrations were confined to a
single predictor variable. In practice, the number of predictors
will be larger. Because financial markets are complex non-linear
systems, it is likely that the governing predictive relationships
will also be non-linear. Nevertheless, linear regression should not be discarded out of hand: it can prove valuable when combined with non-linear methods.
Prediction model ensembles
There are numerous advanced modelling techniques available, each utilising a distinct mathematical paradigm and search procedure. If various modelling methods are applied to the same set of signals, the boosting models produced will differ and will tend to make uncorrelated prediction errors. This leads naturally to the concept of ensembles of prediction models. Just as in investing, where it makes sense to combine securities whose returns are uncorrelated into a portfolio, in performance boosting it makes sense to combine the predictions of an ensemble of models whose prediction errors are uncorrelated. The volatility of a portfolio's returns is analogous to the size of the prediction errors made by a combined forecast: when the predictions are combined, the errors tend to negate one another, resulting in a combined prediction with a smaller average error.
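A toy demonstration of the principle: five hypothetical models whose errors are independent, combined by simple averaging (the equal-weight combining scheme is an illustrative assumption; weighted schemes are equally possible):

```python
import numpy as np

rng = np.random.default_rng(3)
true_returns = rng.normal(0.5, 1.0, 10_000)   # hypothetical signal outcomes

# Five models whose predictions carry independent (uncorrelated) errors.
predictions = [true_returns + rng.normal(0.0, 1.0, true_returns.size)
               for _ in range(5)]
combined = np.mean(predictions, axis=0)

for i, p in enumerate(predictions, 1):
    print(f"model {i} mean abs error: {np.mean(np.abs(p - true_returns)):.3f}")
print(f"ensemble mean abs error:   {np.mean(np.abs(combined - true_returns)):.3f}")
```

Because the five error streams are uncorrelated, averaging shrinks their combined standard deviation by roughly the square root of the number of models, just as diversification shrinks portfolio volatility.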
A performance boosting case study
The following is an example of applying the boosting process to a two-factor technical strategy that takes long positions in stocks. The strategy is based on two factors: a stock's recent price change normalised by the stock's volatility, and its recent change in trading volume. Called the C-K rule, it buys stocks that have recently fallen on declining trading volume.
Figure 7 shows the performance of the boosting model on data that was not used in the model's development (i.e. out-of-sample validation data). This is the acid test of a boosting model's efficacy. The boosting model was developed on an in-sample data set (1984-1999) and was then used to predict returns for signals in an out-of-sample data set (2000-2004). The out-of-sample signals were then grouped into ten groups, or deciles, based on the boosting model's predicted return. Note in Figure 7 that the signals predicted to do best (decile 10) did indeed have the highest average return (+1.76 per cent), well above the average of all signals (+0.37 per cent), while the signals predicted to have the worst returns (decile 1) earned an average return of -0.64 per cent. By avoiding or taking smaller positions on signals predicted to do the worst, and taking larger than normal positions on signals predicted to do the best, the overall returns produced by the C-K rule would have been increased relative to a strategy of taking the same position size on every signal.
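For readers wishing to replicate the decile analysis on their own signals, a hypothetical helper along these lines would do the grouping (decile_report is an illustrative name, not from the study):

```python
import numpy as np

def decile_report(predicted, realised):
    """Group out-of-sample signals into deciles of the boosting model's
    predicted return and report each decile's average realised return."""
    predicted = np.asarray(predicted, dtype=float)
    realised = np.asarray(realised, dtype=float)
    order = np.argsort(predicted)            # worst prediction first
    buckets = np.array_split(order, 10)      # decile 1 .. decile 10
    for d, idx in enumerate(buckets, start=1):
        print(f"decile {d:2d}: mean realised return {realised[idx].mean():+.2%}")
    print(f"all signals: mean realised return {realised.mean():+.2%}")
```

Passing the model's predicted returns and the subsequently realised returns (as fractions) reproduces the layout of Figure 7: a monotonic rise from decile 1 to decile 10 is the signature of a boosting model with genuine predictive power.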