Pretests for genetic-programming evolved trading programs: 'zero-intelligence' strategies and lottery trading - Part 1
First Published Tuesday, 10th June 2008 07:59 am from Automated Trader
In this paper, we discuss a series of pretests, based on several variants of random search, aimed at giving more clear-cut answers as to whether a GP scheme, or any other machine-learning technique, can be effective with the training data at hand. Precisely, pretesting allows us to distinguish between a failure due to the market being efficient and a failure due to GP being ineffective. The analysis is illustrated with GP-evolved strategies for three stock exchanges exhibiting different trends.
Abstract - Over the last decade, numerous papers have investigated the use of GP for creating financial trading strategies. Typically, the results reported in the literature are inconclusive, but the investigators always suggest the possibility of further improvements, leaving the conclusion regarding the effectiveness of GP undecided.
1 Motivation and Introduction
Computational intelligence (CI) techniques such as genetic programming3, with their continuous advancement, persistently bring us something positive to expect, and incessantly push the application domain toward more challenging issues. However, the costs and benefits of using these advanced CI techniques are sometimes uncertain. Usually the benefits are not assured, while the costs are immediate. On the one hand, CI techniques are frequently used as intensive search algorithms, which are inevitably computationally demanding and take up a great amount of computational resources. On the other hand, whether there is a needle in the haystack remains dubious.4 Certainly, if such a needle does not exist at all, all efforts are made to no avail. Given this asymmetry between costs and benefits, it would be economical, at the first stage, to test for the existence of such a needle before a fully-fledged version of the search is applied. We refer to this procedure as a pretest.
The pretest procedure proposed here is in a sense similar to the pretests used in econometrics, where the estimator of an unknown parameter is chosen on the basis of the outcome of a pretest. Pretesting, also known as "data-snooping" in finance, classically serves to select the right model that will be used later on for forecasting purposes ([2,3]). More broadly, pretesting can be considered a practice of sequential decision making, used when the decision involves a great deal of uncertainty and the costs of making a wrong decision are huge.5 In this case, at the first stage, we expend some limited resources on probing, so as to gain some initial information, e.g. the distribution of a very uncertain environment, while in the later stages we make our decision based on the gauged distribution.
The reasoning behind pretesting is very intuitive, and  is the first to apply this idea to the financial application of genetic programming (GP), proposing a measure known as the i statistic. The i statistic is a measure of predictability. Basically, using a simple (vanilla) version of GP, one can first gauge the predictability based on i. When i is low or close to zero, it indicates that there is nothing to forecast, so the use of fully-fledged GP is not advised. The virtue of doing this is to distinguish between two kinds of possibilities when an initial attempt based on simple GP fails: first, the series itself has nothing to forecast; second, GP has not been used appropriately. Understanding this distinction can result in big differences at the second stage of the decision. In the former case, we may simply give up any further search to avoid wasting resources. In the latter case, we should keep on exploring different configurations of GP to search for potential gains before a final conclusion can be made. In either case, we have a clear-cut situation. However, when a pretest is absent, we become less conclusive: we are no longer sure whether the problem is due to the non-existence of the needle, or to the improper use of GP.
Unfortunately, in most financial trading applications of GP, pretests have been largely neglected.6 We think that this negligence may give rise to many inconclusive results. Typically, the results from using GP are not very convincing, but the investigators always suggest directions for further improvement, leaving the actual conclusion regarding the effectiveness of GP undecided. This study therefore attempts to provide a practical pretesting procedure aimed at reducing the number of cases where results remain inconclusive.
Needless to say, there are various ways of implementing different types of pretests. For example, the i statistic mentioned above can be used as a pretest, as  did, but it mainly applies to forecasting time series. That a series is to a certain extent predictable does not necessarily imply that we can develop profitable trading strategies. For example, the predictability horizon might be too short, the fluctuations might not be volatile enough to cover the round-trip transaction costs, the right trading instrument might not be available (e.g., no short selling in a downward-oriented market), or there may be regulations and rules in the way (e.g., the "uptick rule" makes intraday trading with short selling more difficult). Consequently, the literature on forecasting with GP and the literature on trading with GP are usually separate. Therefore, in this paper, we attempt to develop pretest procedures that are more suitable for trading purposes.
More precisely, we will propose several different styles of pretests which, when put together, can help us decide whether there are hidden patterns to be discovered and whether GP is properly designed to do the job. The essential idea underlying all the proposed pretests is to compare the performance of GP with random trading strategies or behavior. However, as we shall see in Section 2, just making trading strategies or trading behavior arbitrarily random is not sufficient to provide a fair and informative comparison. To do so, some constraints must be imposed, and the intriguing point is how to impose these constraints properly.
The rest of the paper is organized as follows. Section 2 provides a detailed formulation of four pretests. The first three are concerned with trading strategies, whereas the last one is concerned with trading behavior. Normally, trading behavior comes from trading strategies, and the two cannot be separated; however, when randomness is introduced, differences between them may arise. In particular, in the vein of algorithmic complexity, random trading strategies can imply trading behavior that actually uses knowledge, while random trading behavior presumably excludes such a possibility. We therefore intentionally distinguish between the two by calling the former zero-intelligence strategies and the latter lottery trading. Section 3 discusses how to use the proposed tests together to make a better judgement given the initial results we have. Section 4 illustrates the proposed pretests on real data, with the experimental designs detailed in the appendix. Section 5 gives the concluding remarks.
2 Pretests: description and rationale
In this section, we describe a series of four pretests and discuss their purpose and implementation. Of the four pretests, we highlight that two are of particular interest and, as shown in Section 3, enable us to gain complementary knowledge on the data under study and on the efficiency of the GP implementation. In the following, we consider GP with a validation stage before the actual testing on the out-of-sample data. Validation means that the best rules induced on the training interval are further selected on unseen data, the validation period, before being applied out-of-sample. The validation step is a device to fight overfitting7 that has been widely used in earlier GP work (see for instance [7,8]). Note that our pretest proposals remain valid for GP without the validation step, except that pretest 2 replaces pretest 1, which requires validation.
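The train/validate/test selection just described can be summarized in a short sketch. The function below is purely illustrative (the names `select_with_validation`, `fitness` and `top_n` are our own, not the paper's): rules are first ranked by training fitness, then the survivor is the one that does best on the unseen validation period.

```python
# Hypothetical sketch of selection with a validation stage: rules induced
# on the training interval are re-ranked on unseen validation data before
# the winner is applied out-of-sample.

def select_with_validation(rules, fitness, train, valid, top_n=10):
    """Return the rule to trade out-of-sample.

    `rules` is any collection of candidate strategies and
    `fitness(rule, data)` scores a rule on a data interval; both are
    assumed interfaces for illustration only.
    """
    # Keep the top_n rules by training fitness ...
    best_on_train = sorted(rules, key=lambda r: fitness(r, train),
                           reverse=True)[:top_n]
    # ... then pick the one that generalizes best to the validation period.
    return max(best_on_train, key=lambda r: fitness(r, valid))
```

The point of the second step is to discard rules whose training fitness comes from overfitting rather than from a pattern that persists on unseen data.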
2.1 GP versus equivalent intensity random search
The basic idea here is to compare the outcome of GP with an equivalent intensity random search. We say that two search algorithms are equivalent in terms of search intensity if their execution leads to the evaluation of the same number of different trading strategies on the training data. For instance, let us consider GP with the parameters chosen for this study: a population of 500 individuals evolved over 100 generations. To a first approximation, the equivalent random search (ERS) would consist of evaluating 50,000 randomly created solutions. In practice, search algorithms sometimes rediscover identical solutions over the course of their execution. This can be detected by keeping track of all individuals created since the beginning of the execution; in doing so, useless fitness evaluations can be skipped, which actually saves computing time when the fitness function is rather time-consuming, as it is in our context. Since, computationally speaking, the fitness evaluation is what dominates, and since the extent to which GP rediscovers the same individuals is very dependent upon the implementation, we require that our definition of equivalent search intensity only accounts for unique individuals, i.e. individuals which require evaluation. We consider two solutions to be different if their expressions are syntactically different8, in our context, if the trees representing the programs are different.
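The bookkeeping described above can be sketched as a small caching wrapper around the fitness function; the class name and the use of `repr()` as a syntactic key are our assumptions for illustration, not the paper's implementation.

```python
# A minimal sketch of counting search intensity by unique individuals:
# fitness is computed only for syntactically new trees, so two runs are
# "equivalent in intensity" when they trigger the same number of
# genuine fitness evaluations.

class IntensityCounter:
    def __init__(self, fitness_fn):
        self.fitness_fn = fitness_fn
        self.cache = {}          # syntactic key -> cached fitness

    def evaluate(self, tree):
        key = repr(tree)         # assumes repr() reflects the tree's syntax
        if key not in self.cache:            # new individual: real evaluation
            self.cache[key] = self.fitness_fn(tree)
        return self.cache[key]

    @property
    def intensity(self):
        # Number of distinct strategies actually evaluated so far.
        return len(self.cache)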
The following three pretests compare GP with a random search, with and without the training and validation stages. In the latter search technique, the biologically inspired evolution process of GP is simply replaced by the creation of solutions at random. Since with random search the strategies do not benefit from the "intelligence" resulting from the evolution or learning process, we dub randomly created solutions zero-intelligence trading strategies.
For each pretest i, we formulate the null hypothesis H_i,0 that GP does not outperform the technique it is compared with at pretest i; the alternative hypothesis is denoted by H_i,1. The experiments will tell us whether H_i,0 should be rejected in favour of H_i,1 or not.
Pretest 1: GP versus equal search intensity random search with training and validation stage.
The implementation of the random search strategy is straightforward: the parameters of GP are set in such a way that only the initial generation, where individuals are created at random, is used. The size of the initial population is adjusted so that the resulting search intensity is identical to that of regular GP.
- Hypothesis H_1,0 cannot be rejected: the first explanation that can be envisaged is that, GP or not, there is nothing essential to be learned from the past. In that case GP would strongly "overfit" the training data, possibly explaining why its out-of-sample performance is worse than that of a random search. This can be due to the market being efficient or to the training interval being very dissimilar to the out-of-sample data9. Another explanation is that the GP machinery is not working properly, for instance because of a wrong choice of the function/terminal sets, because the parameters are inappropriate (e.g. too low a search intensity), or because the genetic operators are unable to create better-than-random individuals.
- Hypothesis H_1,0 is rejected in favour of H_1,1: there may be something to learn from the past, and GP, with the chosen parameters, may be effective at that task.
Rejecting H_1,0 is of course a first indication of the effectiveness of GP, but we cannot rule out the case where there is nothing useful to learn from the data at hand and GP beats random search by mere luck. We will see in Section 3 that further investigation may provide additional evidence to answer that question.
Pretest 2: GP versus equal search intensity random search with training but without a validation stage.
Here, the best random solutions on the training interval are applied directly to the out-of-sample period. With regard to pretest 1, pretest 2 could give us some insight into how effective validation is as a device to fight against overfitting. However, since overfitting is unlikely to occur with random solutions, the rationale for using pretest 2 is unclear, and it will not be further considered in this study. A more direct and effective way to evaluate the effect of the validation stage is simply to compare regular GP with and without validation10.
Pretest 3: GP versus equal search intensity random search without training and without a validation stage.
In pretest 3, the selection of the strategies on the training set is removed: a large number of random strategies are created and applied directly out-of-sample. The performance is evaluated as the average performance (e.g. average total return) over the set of random strategies. Comparing the outcome of pretest 3 with pretest 1 and regular GP tells us something about how effective the selection process is, i.e. the extent to which a top-performing rule on the training and validation sets will keep on performing well out-of-sample. If strategies selected by GP or random search on the training and validation intervals have some predictive ability out-of-sample, it provides us with some evidence that there is something to learn from the past. It is worth pointing out that the randomness of the strategies is here constrained by the GP language: rules can only be made with GP functions/terminals organized according to the typing scheme. For instance, it is possible that the GP language is not sufficiently expressive to define a rule consisting in buying and selling every other period11. In the remainder of this study, we will consider pretest 4, presented in Section 2.2, which is similar in spirit to pretest 3 but is more random in the sense that it does not possess the bias in randomness induced by the GP language.
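The evaluation step of pretest 3 can be sketched as follows. Note the simplifications: in the paper the random strategies are trees drawn from the GP language, whereas here plain random in/out signal sequences stand in for them, and the return model (compounding per-period returns, charging a proportional cost at every position change) as well as the names `total_return` and `pretest3` are our illustrative assumptions.

```python
import random

# Sketch of pretest 3: generate many random strategies, apply each one
# directly out-of-sample, and report the average total return.

def total_return(signals, returns, cost=0.001):
    """Compound the per-period returns earned while 'in the market' (signal
    1), charging `cost` at every position change (a transaction)."""
    wealth, prev = 1.0, 0
    for s, r in zip(signals, returns):
        if s:
            wealth *= 1.0 + r          # earn the market return this period
        if s != prev:
            wealth *= 1.0 - cost       # pay the transaction cost
        prev = s
    return wealth - 1.0

def pretest3(returns, n_strategies=1000, rng=None):
    rng = rng or random.Random(0)
    perfs = []
    for _ in range(n_strategies):
        # Random in/out decisions stand in for random GP-language rules.
        signals = [rng.randint(0, 1) for _ in returns]
        perfs.append(total_return(signals, returns))
    return sum(perfs) / len(perfs)     # average out-of-sample performance
```

In the actual pretest, this average is then compared with the performance of the rules selected by GP or by random search with training and validation.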
2.2 GP versus lottery trading
We refer to lottery trading as a strategy that would consist in making the investment decision at random on the basis of the outcome of a random variable. In its simplest form, the random variable would follow a Bernoulli distribution where the parameter p expresses the probability to take a long position and 1-p the probability to be out of the market.
In our context, this requires refinement, since we are interested in profitability and profitability takes into account transaction costs. So, to allow a fair comparison with GP, we should make sure that the expected number of transactions for lottery trading is the same as for GP. We refer to the expected number of transactions per unit of time as the frequency of a trading strategy. Another important characteristic of a trading strategy is what we term its intensity, i.e. the number of periods where a position12 "in the market" is held, over the length of the trading interval. We should also enforce lottery trading to have the same expected intensity as GP to avoid misleading results, for instance the case where, given its frequency, the intensity of lottery trading is not sufficient to cover the transaction costs with the volatility of the market under study.
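For a position sequence of 0s and 1s (1 meaning "in the market"), the two statistics just defined can be computed directly; the function names below are ours, and we assume, for illustration, that an open position is liquidated at the end of the interval (so it counts as one final transaction).

```python
# frequency: transactions per period; intensity: fraction of periods
# spent in the market. `positions` is a list of 0/1 values.

def frequency(positions):
    # Pad with a flat position on both sides so that entering at the first
    # period and liquidating at the last both count as transactions.
    transitions = sum(1 for prev, cur in
                      zip([0] + positions, positions + [0]) if prev != cur)
    return transitions / len(positions)

def intensity(positions):
    return sum(positions) / len(positions)
```

For example, the sequence 0,1,1,0 holds a position for half of the periods (intensity 0.5) and makes one round trip, i.e. two transactions over four periods (frequency 0.5).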
We denote by F_GP and I_GP respectively the average frequency and average intensity observed for the set of GP-evolved rules applied on the testing interval over all GP runs; N_GP is the number of transactions leading to F_GP. For the experiments made in the following, a sequence of investment decisions S_LT resulting from lottery trading is generated at random according to the following procedure:
- the intensity for lottery trading, I_LT, is uniformly chosen in [I_GP(1 - c), min(1, I_GP(1 + c))] with 0 < c < 1. In a first step, S_LT is made of the block of '0' positions (i.e. out of the market) followed by the block of '1' positions (i.e. in the market) corresponding to I_LT,
- the number of transactions N_LT is uniformly chosen in the set of integer values that are even13 in the interval [N_GP(1 - c), N_GP(1 + c)]. The block of '1' is subdivided at random into N_LT/2 sub-sequences and each sub-sequence is inserted at random inside the block of '0'. This design avoids the problem of overlapping among the '1' sub-sequences that may occur with other schemes.
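The two-step procedure above can be sketched in code. This is our illustrative reading of the scheme, not the paper's implementation: parameter names are ours, non-empty sub-sequences are assumed, and inserting the '1' sub-sequences at distinct slots of the '0' block keeps them separated by at least one '0', which is what prevents overlapping.

```python
import random

# Sketch of the lottery-trading generator: T is the length of the trading
# interval, i_gp and n_gp the average intensity and number of transactions
# of the GP rules, c the allowed relative deviation. Assumes inputs for
# which an even transaction count exists in the allowed interval.

def lottery_sequence(T, i_gp, n_gp, c=0.1, rng=None):
    rng = rng or random.Random()

    # Step 1: draw the intensity around the GP average, capped at 1.
    i_lt = rng.uniform(i_gp * (1 - c), min(1.0, i_gp * (1 + c)))
    ones = max(1, round(i_lt * T))        # periods spent in the market

    # Step 2: draw an even number of transactions near n_gp; each trading
    # block costs one entry and one exit, hence N_LT/2 blocks of '1'.
    lo, hi = int(n_gp * (1 - c)), int(n_gp * (1 + c))
    evens = [n for n in range(lo, hi + 1) if n % 2 == 0 and 0 < n // 2 <= ones]
    blocks = rng.choice(evens) // 2

    # Split the block of '1's into `blocks` non-empty sub-sequences ...
    cuts = sorted(rng.sample(range(1, ones), blocks - 1)) if blocks > 1 else []
    sizes = [b - a for a, b in zip([0] + cuts, cuts + [ones])]

    # ... and insert each at a distinct random slot inside the block of '0's,
    # from the rightmost slot down so earlier insertions do not shift slots.
    zeros = T - ones
    seq = [0] * zeros
    slots = sorted(rng.sample(range(zeros + 1), blocks), reverse=True)
    for size, slot in zip(sizes, slots):
        seq[slot:slot] = [1] * size
    return seq
```

By construction, the generated sequence has length T, an intensity within a factor (1 ± c) of I_GP (up to rounding), and exactly N_LT transactions.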
We formulate the pretest comparing GP and lottery trading and denote by H_4,0 the null hypothesis that GP does not outperform lottery trading.
Pretest 4: GP versus lottery trading.
Obviously, if GP is not able to outperform lottery trading, it gives strong evidence that GP will not be good at evolving effective trading strategies with the data at hand. In Section 3, we shall discuss this point in more detail.
⋆ An extended version of this paper will appear as chapter 8 of the book Computational Intelligence in Economics and Finance - Volume 2, Springer Verlag, to be published in 2007.
3 Although, in this paper, we focus only on genetic programming, the general ideas and some specific implementations may also be applicable to other computational intelligence techniques used to induce trading strategies.
4 For example, in the financial application domain, the lack of such a needle may be due to the efficient market hypothesis or the no-arbitrage condition.
5 The problem of sequential decision making under incomplete knowledge has been studied by researchers in various fields, such as optimal control, psychology, economics, and game theory.
6 This may not be completely so. In fact, most earlier studies selected buy-and-hold strategies or a risk-free investment (e.g., treasury bills) as the benchmark. The conclusion that "GP performs better than buy-and-hold in a bearish market and worse in a bullish market" is often found in the literature. However, nothing different can be expected, since buy-and-hold is the worst possible strategy in a steadily decreasing market and the best possible strategy in a steadily increasing market. This shows the limits of choosing buy-and-hold as a benchmark. See, for example, .
7 The actual effectiveness of validation in this context is, however, still an open question, see  and .
8 Two individuals can be syntactically different while being equivalent in the sense that they lead to equivalent trading decisions; the equivalence could thus also be defined in terms of semantics. With symbolic simplification using rewriting rules and interval arithmetic on the function arguments, we could detect that some syntactically different individuals are in fact semantically identical. However, there is no way of making sure that all duplicates will be detected, and the implementation of this procedure would be so complex and time-consuming at run time that, in our opinion, a definition based on semantics would be of little practical interest. Alternatively, equivalence in search intensity could be defined in terms of equivalent computing time. However, there is such a difference in complexity between a fully-fledged GP implementation and random search that it is hard to ensure that the two implementations have been optimized in a similar manner, and a better implementation of GP, for instance, might lead us to the opposite conclusion.
9 In , numerous experiments have highlighted that when training and out-of-sample data sets are very "dissimilar", for instance if the market exhibits an opposite trend, then there is little chance that GP will perform well out-of-sample.
10 For instance, as it is done in .