The development of a boosting model is a two-stage process. The first stage is discovering which, if any, of the large list of candidate indicators proposed for consideration are helpful in predicting signal returns. The second is establishing the shape of the surface that best describes the relationship between the selected candidate indicators and signal returns. One widely used modelling technique is multiple linear regression, which assumes that the shape of the surface is linear (flat, with no hills and valleys). Only the slope of the surface with respect to each axis (i.e. the weight of each indicator) is left open to discovery. However, modern data modelling techniques have allowed the constraining assumption of a flat surface to be eliminated. This allows the modelling procedure to discover the most appropriate shape for the model's hyper-surface.
Advanced modelling vs. linear regression
As powerful as multiple linear regression is, relative to the intuitive judgment of human experts, greater predictive power can be attained with more advanced modelling methods that are not constrained by the simplifying assumption of linearity. This creates the opportunity for more accurate predictions of signal outcomes. Advanced methods such as kernel regression can detect complex non-linear relationships. Figures 1 to 5 illustrate how more sophisticated non-linear modelling differs from traditional linear regression. For simplicity, the illustrations depict a single indicator (Xi) on the horizontal axis and the return earned by the signal on the vertical axis. Figure 1 shows the true functional relationship between signal returns and Xi, which is unknown in the sense that the true shape of the function sought in any predictive modelling problem is by definition unknown and remains to be inferred from an observed sample of data. Note that the relationship is not linear.
In Figure 2, a sample of trading signals is shown, with each point representing a single signal. The position with respect to the vertical axis is the return earned by the signal and the position with respect to the horizontal axis is the value of indicator Xi at the time the signal was given. For purposes of illustration, there is an obvious relationship between Xi and signal return, which is unlikely to exist in practice.
Figure 3 shows the model surface that results from modelling this data with linear regression. This linear model is too simple and wrong in a systematic sense in that it assumes that the model surface is flat throughout the range of the predictor variable Xi, thus causing the model to make systematic errors (i.e. the model is biased). In some ranges of Xi the model's predictions of signal return are systematically too low, while in other ranges the predictions are systematically too high. Systematic errors are symptomatic of a model surface that does not accurately represent the underlying functional relationship between the indicators and signal return.
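The systematic error described above can be reproduced with a small simulation. The sketch below is illustrative only: the curved "true" relationship, the noise level and the three ranges of Xi are assumptions chosen for the example, not the data behind the figures. It fits a straight line by ordinary least squares and then averages the prediction errors within each range of the indicator.

```python
import random

def true_return(x):
    # Hypothetical non-linear relationship (an assumption for illustration only)
    return 1.0 - 8.0 * (x - 0.5) ** 2

def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mean_error_by_range(xs, ys, intercept, slope, edges):
    """Average of (actual - predicted) within each range of the indicator."""
    result = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        errors = [y - (intercept + slope * x)
                  for x, y in zip(xs, ys) if lo <= x < hi]
        result.append(sum(errors) / len(errors))
    return result

rng = random.Random(0)                      # fixed seed so the run is repeatable
xs = [rng.random() for _ in range(2000)]    # simulated indicator readings
ys = [true_return(x) + rng.gauss(0, 0.3) for x in xs]   # simulated signal returns

intercept, slope = fit_line(xs, ys)
bias = mean_error_by_range(xs, ys, intercept, slope, [0.0, 0.25, 0.75, 1.0])
print(bias)   # negative, positive, negative: the line is too high, too low, too high
```

If the linear model were adequate, the average error in every range would be close to zero. Here the sign of the average error changes from one range to the next, which is the signature of a biased model surface.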
A reasonable solution would be to propose a more complex model by visual inspection, such as a parabola. However, in the case of a linear model, a quadratic (parabolic) model, a cubic model where Xi is raised to the third power, or any model where the functional relationship is assumed prior to analysis, the model is constrained to adopt the assumed form. This is perfectly legitimate when there is well established theory that suggests what the correct functional form should be.
However, for many phenomena characterised by high complexity and high randomness, such as financial markets, there is no well-established theory to support the choice of a particular functional form. In these situations, constraining the analysis to an assumed functional form is too limiting. Instead an approach is needed that does not require the assumption of the shape of the model's surface - in other words, non-parametric modelling.
Adapting non-parametric models
One example of non-parametric modelling is kernel regression. Figure 4 shows the model surface that would be obtained by applying kernel regression to the sample data. Note that the discovered model surface conforms closely to the true relationship depicted in Figure 1. Kernel regression discovers the correct shape of the surface by estimating the value of the dependent variable Y (signal return) within small local ranges along the Xi axis. The simplest approach to kernel regression takes an average of the Y values in each local Xi neighbourhood. This becomes the altitude of the model surface in that region of Xi. A more sophisticated kernel method fits linear models to each small Xi neighbourhood.
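The simplest form of kernel regression described above, a plain average of Y within each local Xi neighbourhood, can be sketched in a few lines. The sine-shaped relationship, the noise level and the neighbourhood half-width of 0.05 are assumptions made for this illustration; the function name `local_average` is likewise hypothetical.

```python
import math
import random

def local_average(x0, xs, ys, half_width):
    """Altitude of the model surface at x0: the mean return of all signals
    whose indicator value lies within half_width of x0."""
    neighbours = [y for x, y in zip(xs, ys) if abs(x - x0) <= half_width]
    if not neighbours:
        raise ValueError("no signals in this neighbourhood; widen half_width")
    return sum(neighbours) / len(neighbours)

def true_return(x):
    # Hypothetical non-linear relationship (an assumption for illustration only)
    return math.sin(2 * math.pi * x)

rng = random.Random(1)                      # fixed seed so the run is repeatable
xs = [rng.random() for _ in range(1500)]    # simulated indicator readings
ys = [true_return(x) + rng.gauss(0, 0.3) for x in xs]   # simulated signal returns

grid = [0.25, 0.5, 0.75]                    # points at which to read the surface
surface = [local_average(g, xs, ys, half_width=0.05) for g in grid]
for g, s in zip(grid, surface):
    print(f"Xi={g:.2f}  estimated={s:+.2f}  true={true_return(g):+.2f}")
```

No functional form is assumed anywhere in the code. The shape of the surface comes entirely from the data, and the half-width controls how far the surface is allowed to bend.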
Kernel regression is one of a family of advanced modelling methods that do not impose the constraint of a particular functional form. However, while the flexibility of these techniques allows the model surface to conform to the true underlying relationships in the data, if not properly constrained, this flexibility can cause the model to 'overfit' the data. This results in the model's surface becoming contaminated with random effects in the data, describing not only the authentic relationship between signal returns and indicator readings but also the random variation in the particular sample of data used to develop the performance-boosting model (see Figure
Overfitting can be mitigated by using cross-validation, which involves breaking up the historical sample into three subsets: training, testing and evaluation. Starting with the simplest possible model, i.e. a single indicator and a flat model surface, the modelling procedure tests a progression of increasingly complex models utilising more indicators and surfaces that bend ever more closely to the data. At the same time, the modelling procedure is cross-validating between the training and testing sets to discover the degree of complexity that yields the most accurate predictions.
The training and testing subsets are used to discover which of the candidate indicators warrant a place in the model space and how much the model's surface should be allowed to bend in order to fit the data without the surface becoming overfitted. Overfitting is detected when an increase in model complexity improves fit on the training set, but degrades fit on the testing set. When the model of optimal complexity has finally been discovered, then and only then is it applied to the evaluation data. This third subset of data did not participate in the modelling process, so it provides an unbiased estimate of the model's true efficacy. An illustration of these concepts is shown in Figure 6.
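The procedure can be sketched as follows, assuming a single simulated indicator and using the half-width of a local-average smoother as the measure of complexity (a narrower neighbourhood lets the surface bend more). The data, the list of half-widths and the helper names are assumptions for the example, not the procedure used to build any particular boosting model.

```python
import math
import random

def local_average(x0, xs, ys, half_width):
    near = [y for x, y in zip(xs, ys) if abs(x - x0) <= half_width]
    if not near:                       # empty neighbourhood: fall back to the flat surface
        return sum(ys) / len(ys)
    return sum(near) / len(near)

def mse(train, data, half_width):
    """Mean squared error on `data` of a surface built from `train` only."""
    train_x, train_y = train
    data_x, data_y = data
    errors = [(y - local_average(x, train_x, train_y, half_width)) ** 2
              for x, y in zip(data_x, data_y)]
    return sum(errors) / len(errors)

rng = random.Random(2)                 # fixed seed so the run is repeatable

def sample(n):
    # Simulated signals: hypothetical sine-shaped relationship plus noise
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.5) for x in xs]
    return xs, ys

train, test, evaluation = sample(300), sample(300), sample(300)

# From the flat surface (half-width 1.0 covers every signal) to very flexible
half_widths = [1.0, 0.3, 0.1, 0.03, 0.01, 0.003]
train_err = [mse(train, train, h) for h in half_widths]
test_err = [mse(train, test, h) for h in half_widths]

# Choose the complexity that predicts the testing set best ...
best = min(range(len(half_widths)), key=lambda i: test_err[i])
# ... and only then touch the evaluation set, once
eval_err = mse(train, evaluation, half_widths[best])

for h, tr, te in zip(half_widths, train_err, test_err):
    print(f"half-width {h:<6} train MSE {tr:.3f}  test MSE {te:.3f}")
print("chosen half-width:", half_widths[best], " evaluation MSE:", round(eval_err, 3))
```

In a run of this kind the training error keeps falling as the half-width shrinks, while the testing error falls and then rises again. The turning point marks the onset of overfitting, and the evaluation error is reported only for the complexity chosen at that point.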
For clarity, the foregoing illustrations were confined to a single predictor variable. In practice, the number of predictors will be larger. Because financial markets are complex non-linear systems, it is likely that the governing predictive relationships will also be non-linear. Nevertheless, linear regression should not be discarded out of hand as it can prove to be a valuable method when combined with non-linear methods.
Prediction model ensembles
There are numerous advanced modelling techniques available, each utilising a distinct mathematical paradigm and search procedure. Therefore, if various modelling methods are applied to the same set of signals, the boosting models produced will differ and will tend to make different, uncorrelated prediction errors. This leads naturally to the concept of ensembles of prediction models. Just as in investing, where it makes sense to combine securities whose returns are uncorrelated into a portfolio, in performance boosting it makes sense to combine the predictions of an ensemble of models whose prediction errors are uncorrelated. In other words, the degree of volatility of returns experienced by a portfolio of securities is analogous to the size of the prediction errors made by a combined forecast. When the predictions are combined, the errors tend to negate one another, resulting in a combined prediction that has a smaller average error.
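The error-cancelling effect can be checked with a short simulation. It assumes an idealised case: five hypothetical models whose prediction errors are fully independent and equally large. Real models built on the same signals will usually have partly correlated errors, so the improvement in practice is likely to be smaller than shown here.

```python
import random

rng = random.Random(3)                 # fixed seed so the run is repeatable
n_signals, n_models = 5000, 5

# Simulated "true" signal returns (hypothetical, in arbitrary units)
true_returns = [rng.gauss(0, 1) for _ in range(n_signals)]

# Each hypothetical model predicts the truth plus its own independent error
predictions = [[r + rng.gauss(0, 1) for r in true_returns]
               for _ in range(n_models)]

# Combined forecast: the simple average of the models' predictions per signal
combined = [sum(column) / n_models for column in zip(*predictions)]

def rmse(predicted):
    """Root mean squared prediction error against the true returns."""
    squared = [(p - r) ** 2 for p, r in zip(predicted, true_returns)]
    return (sum(squared) / len(squared)) ** 0.5

single_rmse = [rmse(p) for p in predictions]
ensemble_rmse = rmse(combined)

print("individual models:", [round(e, 3) for e in single_rmse])
print("combined forecast:", round(ensemble_rmse, 3))
```

With independent errors of equal size, averaging K models reduces the typical error by a factor of about the square root of K, the same arithmetic that underlies diversification in a portfolio of uncorrelated securities.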
A performance boosting case study
The following is an example of applying the boosting process to a two-factor technical strategy that takes long positions in stocks. The strategy is based on two factors: a stock's recent price change normalised by the stock's volatility; and its recent change in trading volume. Called the C-K rule, it buys stocks that have recently fallen on declining trading volume.
Figure 7 shows the performance of the boosting model on data that was not used in the model's development (i.e. out-of-sample validation data). This is the acid test of a boosting model's efficacy. The boosting model was developed on an in-sample data set (1984-1999) and was then used to predict returns for signals in an out-of-sample data set (2000-2004). The out-of-sample signals were then grouped into ten groups or deciles based on the boosting model's predicted return. In Figure 7 note that signals that were predicted to do the best (decile 10) did indeed have the highest average returns (+1.76 per cent) and performed better than the average of all signals (+0.37 per cent). The signals predicted to have the worst returns (decile 1) earned an average return of -0.64 per cent. By taking smaller positions on signals predicted to do the worst or avoiding them entirely and taking larger than normal positions on signals predicted to do the best, the overall strategy returns produced by the C-K rule would have been increased relative to a strategy of taking the same position size on all signals.
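The decile analysis and the position-sizing idea can be sketched as follows. The numbers are simulated and are not the C-K rule results shown in Figure 7; the distributions, the function names and the sizing rule (skip the lowest decile, double the highest) are assumptions made purely for illustration.

```python
import random

def decile_average_returns(predicted, realised, n_groups=10):
    """Rank signals by predicted return, split them into equal-sized groups
    and return the average realised return of each group (lowest first)."""
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    size = len(order) / n_groups
    averages = []
    for g in range(n_groups):
        members = order[int(g * size): int((g + 1) * size)]
        averages.append(sum(realised[i] for i in members) / len(members))
    return averages

def sized_average_return(predicted, realised):
    """Average return per unit of position when the lowest predicted decile
    is skipped and the highest is given a double position (hypothetical rule)."""
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    cut = len(order) // 10
    weights = [1.0] * len(order)
    for i in order[:cut]:
        weights[i] = 0.0               # avoid the signals predicted to do worst
    for i in order[-cut:]:
        weights[i] = 2.0               # larger position on the predicted best
    return sum(w * r for w, r in zip(weights, realised)) / sum(weights)

rng = random.Random(4)                 # fixed seed so the run is repeatable
# Simulated out-of-sample signals: a weak predictive signal buried in noise
predicted = [rng.gauss(0.4, 0.8) for _ in range(10000)]
realised = [p + rng.gauss(0, 5.0) for p in predicted]

by_decile = decile_average_returns(predicted, realised)
equal_sized = sum(realised) / len(realised)
boosted = sized_average_return(predicted, realised)

print("average return by decile:", [round(a, 2) for a in by_decile])
print("equal position sizes:", round(equal_sized, 2), " model-based sizes:", round(boosted, 2))
```

Under this sizing rule the total position is unchanged, so any gain over equal sizing comes entirely from the gap between the average returns of the top and bottom deciles.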