
Statistical Analysis

The following excerpt is from Chapter 4 of David Aronson's recently published book "Evidence-Based Technical Analysis". Together with Chapters 5 and 6 of the book (which will be available as excerpts later) it addresses aspects of statistics that are particularly relevant to evidence-based (as opposed to subjective) technical analysis.

Statistics is the science of data. In the late nineteenth century, renowned British scientist and author H.G. Wells (1866-1946) said that an intelligent citizen in a twentieth-century free society would need to understand statistical methods. It can be said that an intelligent twenty-first-century practitioner or consumer of TA has the same need.

Statistical methods are not needed when a body of data conveys a message loudly and clearly. If all people drinking from a certain well die of cholera but all those drinking from a different well remain healthy, there is no uncertainty about which well is infected and no need for statistical analysis. However, when the implications of data are uncertain, statistical analysis is the best, perhaps the only, way to draw reasonable conclusions.

Identifying which TA methods have genuine predictive power is highly uncertain. Even the most potent rules display highly variable performance from one data set to the next. Therefore, statistical analysis is the only practical way to distinguish methods that are useful from those that are not.

Whether or not its practitioners acknowledge it, the essence of TA is statistical inference. It attempts to discover generalizations from historical data in the form of patterns, rules, and so forth and then extrapolate them to the future. Extrapolation is inherently uncertain. Uncertainty is uncomfortable.

The discomfort can be dealt with in two ways. One way is to pretend it does not exist. The other is the way of statistics, which meets uncertainty head on by acknowledging it, quantifying it, and then making the best decision possible in the face of it. Bertrand Russell, the renowned British mathematician and philosopher, said, "Uncertainty, in the presence of vivid hopes and fears, is painful, but must be endured if we wish to live without the support of comforting fairy tales."

Many people are distrustful or disdainful of statistical analysis, and statisticians are often portrayed as nerdy number-crunching geeks divorced from reality. This shows up in jokes; we deride what we do not understand. There is the story about the six-foot-tall man who drowns in a pond with an average depth of only two feet. There's the tale about three statisticians who go duck hunting. They spot a bird flying overhead. The first shoots a foot too far to the left. The second shoots a foot too far to the right. The third jumps up and exclaims, "We got it!" Even though the average error was zero, there was no duck for dinner.

Powerful tools can be put to bad purpose. Critics often charge that statistics are used to distort and deceive. Of course, similar ends can be achieved with words, although language is not held liable. A more rational stance is needed. Rather than viewing all claims based on statistics with suspicion or taking them all at face value, "a more mature response would be to learn enough about statistics to distinguish honest, useful conclusions from skullduggery or foolishness." "He who accepts statistics indiscriminately will often be duped unnecessarily. But, he who distrusts statistics indiscriminately will often be ignorant unnecessarily. The middle ground we seek between blind distrust and blind gullibility is an open-minded skepticism. That takes an ability to interpret data skillfully."


Statistical reasoning is new terrain for many practitioners and consumers of TA. Trips to strange places are easier when you know what to expect. The following is a preview of the next three chapters.

For reasons discussed in Chapter 3, it is wise to start with the assumption that all TA rules are without predictive power and that a profitable back test was due to luck. This assumption is called the null hypothesis. Luck, in this case, means a favorable but accidental correspondence between the rule's signals and subsequent market trends in the historical data sample in which the rule was tested. Although this hypothesis is a reasonable starting point, it is open to refutation with empirical evidence. In other words, if observations contradict predictions made by the null hypothesis, it is abandoned and the alternative hypothesis, that the rule has predictive power, would be adopted. In the context of rule testing, evidence that would refute the null hypothesis is a back-tested rate of return that is too high to be reasonably attributed to mere luck.

If a TA rule has no predictive power, its expected rate of return will be zero on detrended data. However, over any small sample of data, the profitability of a rule with no predictive power can deviate considerably from zero. These deviations are manifestations of chance: good or bad luck. This phenomenon can be seen in a coin-toss experiment. Over a small number of tosses, the proportion of heads can deviate considerably from 0.50, which is the expected proportion of heads in a very large number of tosses.
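The coin-toss analogy is easy to reproduce. The following minimal Python sketch (not from the book; the sample sizes and random seed are arbitrary choices made for illustration) tosses a fair coin in batches of various sizes and shows how far the proportion of heads can stray from the expected 0.50 when the sample is small.

```python
import random

random.seed(1)  # fixed seed so the illustration is repeatable

def proportion_heads(n_tosses: int) -> float:
    """Toss a fair coin n_tosses times and return the proportion of heads."""
    heads = sum(random.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# Small samples wander far from 0.50; large samples settle close to it.
for n in (10, 100, 10_000):
    proportions = [round(proportion_heads(n), 3) for _ in range(5)]
    print(f"{n:>6} tosses: {proportions}")
```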

Generally, the chance deviations of a useless rule from a zero return are small. Sometimes, however, a useless rule will generate significant profits by sheer luck. These rare instances can fool us into believing a useless rule has predictive power. The best protection against being fooled is to understand the degree to which profits can result from luck. This is best accomplished with a mathematical function that specifies the deviations from zero profits that can occur by chance. That is what statistics can do for us. This function, called a probability density function, gives the probability of every possible positive or negative deviation from zero. In other words, it shows the degree to which chance can cause a useless rule to generate profits. Figure 4.1 shows a probability density function. The fact that the density curve is centered at a value of zero reflects the null hypothesis assertion that the rule has an expected return of zero.
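Figure 4.1's density curve can be approximated by simulation. The sketch below is an illustrative Monte Carlo, not a procedure from the book; the daily volatility, back-test length, and number of trials are invented assumptions. It assigns random long/short positions to detrended returns, so the rule is useless by construction, and tabulates the back-tested returns that luck alone produces.

```python
import numpy as np

rng = np.random.default_rng(0)

N_DAYS = 250       # length of one hypothetical back test (about a year)
N_TRIALS = 10_000  # number of simulated useless-rule back tests
DAILY_VOL = 0.01   # assumed daily volatility of the detrended returns

# Detrended market returns: zero mean by construction.
market = rng.normal(0.0, DAILY_VOL, size=(N_TRIALS, N_DAYS))

# A useless rule: random +1 (long) / -1 (short) positions each day.
positions = rng.choice([-1.0, 1.0], size=(N_TRIALS, N_DAYS))

# Annualized return of each simulated back test.
chance_returns = (positions * market).mean(axis=1) * N_DAYS

# Centered on zero (the null hypothesis), yet individual back tests
# deviate from zero purely by chance.
print("mean:", round(float(chance_returns.mean()), 4))
print("std :", round(float(chance_returns.std()), 4))
print("95th percentile:", round(float(np.percentile(chance_returns, 95)), 4))
```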

Figure 4.1: Probability density of chance performance (the range of possible performance for a useless TA rule)

In Figure 4.2, the arrow indicates the positive rate of return earned by a rule when it was back tested. This raises the question: Is the observed rate of return sufficiently positive to warrant a rejection of the null hypothesis that the rule's true rate of return is zero? If the observed performance falls well within the range of the deviations that are probably attributable to chance, the evidence is considered to be insufficient to reject the null hypothesis. In such a case, the null hypothesis has withstood the empirical challenge of the back-test evidence, and a conservative interpretation of the evidence would suggest that the rule has no predictive power.

The strength of the back-test evidence is quantified by the fractional area of the probability density function that lies at values equal to or greater than the rule's observed performance. This portion of the density function is depicted by the darkened area to the right of the vertical arrow in Figure 4.2. The size of this area can be interpreted as the probability that a rate of return this high or higher could have occurred by chance under the condition that the rule has no predictive power (expected return = 0, or the null hypothesis is true). When this area occupies a relatively large fraction of the density curve, it means that there is an equivalently large probability that the positive performance was due to chance. When this is the case, there is no justification for concluding that the null hypothesis is false. In other words, there is no justification for concluding that the rule does have predictive power.
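The darkened tail area can be estimated from the same kind of simulation. In the sketch below (again with invented parameters and a hypothetical observed return, not data from the book), the fraction of simulated useless-rule back tests that do at least as well as the observed result approximates the probability that performance this good could arise by chance alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Null distribution: back-tested annualized returns of useless rules
# (random daily positions applied to zero-mean, detrended returns).
N_DAYS, N_TRIALS, DAILY_VOL = 250, 10_000, 0.01
market = rng.normal(0.0, DAILY_VOL, size=(N_TRIALS, N_DAYS))
positions = rng.choice([-1.0, 1.0], size=(N_TRIALS, N_DAYS))
null_returns = (positions * market).mean(axis=1) * N_DAYS

observed_return = 0.25  # hypothetical back-tested return of the rule under study

# Fraction of the null density at or above the observed return:
# the probability of doing this well if the rule has no predictive power.
tail_area = float(np.mean(null_returns >= observed_return))
print(f"probability under the null: {tail_area:.4f}")
```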

Figure 4.2: Probability of chance performance for a useless rule

However, if the observed performance is far above zero, the portion of the probability density function lying at even more extreme values is small. Performance this positive would be inconsistent with the assertion that the rule has no predictive power. In other words, the evidence would be sufficient to refute the null hypothesis. Another way to think of this is as follows: If the null hypothesis were true, a level of performance this positive would have a low probability of occurrence. This probability is quantified by the proportion of the density function that lies at values equal to or greater than the observed performance. This is illustrated in Figure 4.3. Note that the observed performance lies in the extreme right tail of the density curve that would pertain if the rule were devoid of predictive power.

It is important to understand what this evidence does not tell us. It tells us nothing about the probability that either the null hypothesis or the alternative hypothesis is true. It only speaks to the probability that the evidence could have occurred under the assumption that the null hypothesis is, in fact, true. Thus, the probability speaks to the likelihood of the evidence, not the likelihood of the truth of the hypothesis. Observed evidence that would be highly improbable, under the condition that the null hypothesis is true, permits an inference that the null hypothesis is false.

Figure 4.3: Probability of chance performance for a good rule

Recall that, in Chapter 3, it was shown that evidence that a creature has four legs cannot conclusively establish the truth of the hypothesis: The creature is a dog. Although evidence of four legs would be consistent with the hypothesis that the creature is a dog, it is not sufficient to prove, deductively, that the creature is a dog. Similarly, while the observation of positive performance would be consistent with the hypothesis that a rule has predictive power, it is not sufficient to prove that it does. An argument that attempts to prove the truth of a hypothesis with observed evidence that is consistent with the hypothesis commits the logical fallacy of affirming the consequent.

If the creature is a dog, then it has four legs.

The creature has four legs.

Invalid Conclusion: Therefore, the creature is a dog.

If a rule has predictive power, then it will have a profitable back test.

The back test was profitable.

Invalid Conclusion: Therefore, the rule has predictive power.

However, the absence of four legs is sufficient to prove that the hypothesis, the creature is a dog, is false. In other words, observed evidence can be used to conclusively prove that a hypothesis is false. Such an argument uses the valid deductive form, denial of the consequent. The general form of an argument in which the consequent is denied is as follows:

If P is true, then Q is true.

Q is not true.

Valid Conclusion: Therefore, P is not true (i.e., P is false).

If the creature is a dog, then it has four legs.

Creature does not have four legs.

Valid Conclusion: Therefore, it is false that the creature is a dog.

The argument just given uses the evidence, the absence of four legs, to conclusively falsify the notion that the creature is a dog. However, this level of certitude is not possible in matters of science and statistics. One can never conclusively falsify a hypothesis. Nevertheless, a similar logic can be used to show that certain evidence is highly unlikely if the hypothesis were true. In other words, the evidence gives us grounds to challenge the hypothesis. Thus, a highly profitable back test can be used to challenge the hypothesis that the rule has no predictive power (i.e., that it has an expected return of zero).

If a rule's expected return is equal to zero or less, then a back test should generate profits that are reasonably close to zero.

The back-tested performance was not reasonably close to zero; in fact, it was significantly above zero.

Valid Conclusion: Therefore, the contention that the rule's expected return is equal to zero or less is likely to be false.

How unlikely or rare must the positive performance be to reject the notion that the rule is devoid of predictive power? There is no hard and fast rule. By convention, most scientists would not be willing to reject a hypothesis unless the observed performance has a probability of 0.05 or less of occurring under the assumption that the null hypothesis is true. This value is called the statistical significance of the observation.
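As a sketch of how that convention might be applied to a single rule (the summary statistics below are invented for illustration, and a real test would also have to address issues such as dependence and the data mining discussed next), one can compute how many standard errors the observed mean return lies above zero and convert that to a one-sided probability.

```python
import math

# Hypothetical back-test summary statistics (invented for illustration).
mean_daily_return = 0.0008  # average daily return on detrended data
daily_std = 0.01            # standard deviation of the rule's daily returns
n_days = 250                # number of days in the back test

# Under the null hypothesis the expected mean return is zero, so the z-score
# counts how many standard errors the observed mean lies above zero.
z = mean_daily_return / (daily_std / math.sqrt(n_days))

# One-sided probability from the standard normal survival function.
p_value = 0.5 * math.erfc(z / math.sqrt(2))

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
if p_value <= 0.05:
    print("performance is unlikely under the null; reject it")
else:
    print("performance is not rare enough to reject the null")
```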

The discussion so far pertains to the case where only one rule is back tested. In practice, however, TA rule research is typically not restricted to testing a single rule. Economical computing power, versatile back-testing software, and plentiful historical data make it easy, almost inviting, to test many rules with the aim of selecting the one with the best performance. This practice is known as data mining.

Although data mining is an effective research method, testing many rules increases the chance of a lucky performance. Therefore, the threshold of performance needed to reject the null hypothesis must be set higher, perhaps much higher. This higher threshold compensates for the greater likelihood of stumbling upon a useless rule that got lucky in a back test. This topic, the data mining bias, is discussed in Chapter 6.

Figure 4.4: Great performance in a single-rule back test is only mediocre when 1,000 rules are tested.

Figure 4.4 compares two probability density functions. The top one would be appropriate for evaluating the significance of a single-rule back test. The lower density curve would be appropriate for evaluating the significance of the best-performing rule out of 1,000 back-tested rules. This density curve takes into account the increased likelihood of luck that results from data mining. Notice that if this best rule's observed performance were to be evaluated with the density curve appropriate for a single rule, it would appear significant because the performance is far out in the right tail of the distribution. However, when the best rule's performance is evaluated with the appropriate probability density function, it does not appear statistically significant. That is to say, the rule's rather high performance would not warrant the conclusion that it has predictive power or an expected return that is greater than zero.
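The comparison in Figure 4.4 can be reproduced with a small simulation. In the sketch below (the number of rules, trials, and the assumed spread of chance returns are illustrative assumptions, not figures from the book), the same hypothetical return looks rare against the single-rule density but commonplace against the density of the best rule out of 1,000 useless ones.

```python
import numpy as np

rng = np.random.default_rng(42)

N_RULES = 1_000   # rules tested in the data-mining case
N_TRIALS = 5_000  # simulated experiments
NULL_STD = 0.10   # assumed spread of a useless rule's back-tested return

# Chance returns of useless rules: mean zero by assumption.
single_rule = rng.normal(0.0, NULL_STD, size=N_TRIALS)
best_of_1000 = rng.normal(0.0, NULL_STD, size=(N_TRIALS, N_RULES)).max(axis=1)

observed = 0.25  # hypothetical return of the best rule found by data mining

# Judged against the single-rule density, the result looks significant ...
print("P(single useless rule >= observed):        ",
      float(np.mean(single_rule >= observed)))
# ... but against the best-of-1,000 density it is unremarkable.
print("P(best of 1,000 useless rules >= observed):",
      float(np.mean(best_of_1000 >= observed)))
```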


The tools and methods of a discipline limit what it can discover. Improvements in them pave the way to greater knowledge. Astronomy took a great leap forward with the invention of the telescope. Though crude by today's standards, the earliest instruments had 10 times the resolving power of the unaided eye. Technical analysis has a similar opportunity, but it must replace informal data analysis with rigorous statistical methods.

Informal data analysis is simply not up to the task of extracting valid knowledge from financial markets. The data blossoms with illusory patterns whereas valid patterns are veiled by noise and complexity. Rigorous statistical analysis is far better suited to this difficult task.

Statistical analysis is a set of well defined procedures for the collection, analysis, and interpretation of data. This chapter and the next two will introduce the way statistical tools and reasoning can be used to identify TA rules that work. This overview is necessarily condensed, and in many instances I have sacrificed mathematical rigor for the sake of clarity. However, these departures do not dilute the essential message: If TA is to deliver on its claims, it must be grounded in a scientific approach that uses formal statistical analysis.


Statistical reasoning is abstract and often runs against the grain of common sense. This is good and bad. Logic that runs counter to informal inference is good because it can help us where ordinary thinking lets us down. However, this is exactly what makes it difficult to understand. So we should start with a concrete example.

The central concept of statistical inference is extrapolating from samples. A sample of observations is studied, a pattern is discerned, and this pattern is expected to hold for (extrapolated to) cases outside the observed sample. For example, a rule observed to be profitable in a sample of history is projected to be profitable in the future.

Let's begin to think about the concept in the context of a problem that has nothing to do with technical analysis. It comes from an excellent book, Statistics: A New Approach, by Wallis and Roberts. The problem concerns a box filled with a mixture of white and grey beads. The total number of beads and the numbers of grey and white beads are unknown. The task is to determine the fraction of beads that are grey in the entire box. For purposes of brevity, this value will be designated as F-G (fraction grey in the box).

To make this situation similar to statistical problems faced in the real world, there is a wrinkle. We are not allowed to view the entire contents of the box at one time, thus preventing a direct observation of F-G. This constraint makes the problem realistic because, in actual problems, observing all the items of interest, such as all beads in the box, is either impossible or impractical. In fact, it is this constraint that creates the need for statistical inference.

Although we are not allowed to examine the box's contents in its entirety, we are permitted to take samples of 20 beads at a time from the box and observe them. So our strategy for acquiring knowledge about F-G will be to observe the fraction of grey beads in a multitude of samples. In this example 50 samples will be taken. The lowercase f-g stands for the fraction of grey beads in a sample.

A sample is obtained as follows: The bottom of the box contains a sliding panel with 20 small depressions, sized so that a single bead is captured in each depression. The panel can be slid out of the box by pushing it with a similar panel that takes its place. This keeps the remaining beads from dropping out of the bottom. Consequently, each time the panel is removed, we obtain a sample of 20 beads and have the opportunity to observe the fraction grey (f-g) in that sample. This is illustrated in Figure 4.5.

After the fraction of grey beads (f-g) in a given sample has been determined, the sample beads are placed back into the box and it is given a thorough shaking before taking another sample of 20. This gives each bead an equal chance of winding up in the next sample of 20. In the parlance of statistics, we are making sure that each sample is random. The entire process of taking a sample, noting the value f-g, putting the sample back in the box, and shaking the box is repeated 50 times. At the end of the whole procedure we end up with 50 different values for f-g, one value for each sample examined.

Figure 4.5: Determining f-g for each sample
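A toy version of the experiment can be run in code. In the sketch below (the true fraction of grey beads and the random seed are arbitrary choices made so the example runs; in the actual problem F-G is unknown to the experimenter), each sample of 20 beads is drawn at random, its f-g is recorded, and the beads are effectively returned to the box before the next draw.

```python
import random

random.seed(7)

# A hypothetical box: 300 grey and 700 white beads, so the true F-G is 0.30.
BOX = ["grey"] * 300 + ["white"] * 700
SAMPLE_SIZE = 20
N_SAMPLES = 50

fg_values = []
for _ in range(N_SAMPLES):
    # Shake the box and capture 20 beads in the sliding panel.
    sample = random.sample(BOX, SAMPLE_SIZE)
    fg_values.append(sample.count("grey") / SAMPLE_SIZE)
    # random.sample never removes beads from BOX, so each sample is
    # effectively replaced and the box's composition never changes.

print("first ten values of f-g:", fg_values[:10])
print("average f-g over 50 samples:", round(sum(fg_values) / N_SAMPLES, 3))
print("true F-G in the box:", 0.30)
```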

By placing each sample back in the box before taking another sample, we are maintaining a stable concentration for the fraction of grey beads in the box. That is to say, the value for F-G is kept constant over the course of the 50 samplings. Problems in which the statistical characteristics remain stable over time are said to be stationary. If the beads were not replaced after each sample the value F-G would change over the course of the 50 samplings, as groups of beads were permanently removed from the box. A problem in which the statistical characteristics change over time is said to be nonstationary. Financial markets may indeed be nonstationary, but for pedagogical purposes, the box-of-beads problem is designed to be stationary.
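To see why replacement matters, the sketch below (again a toy illustration with made-up numbers) removes each sample of 20 beads permanently instead of returning it. The fraction grey among the beads still in the box then changes as sampling proceeds, which is the nonstationary situation described above.

```python
import random

random.seed(7)

box = ["grey"] * 300 + ["white"] * 700  # F-G starts at 0.30

# Sampling WITHOUT replacement: each panel of 20 beads is discarded.
for i in range(1, 41):
    for _ in range(20):
        box.pop(random.randrange(len(box)))
    if i % 10 == 0:
        fg_in_box = box.count("grey") / len(box)
        print(f"after {i:2d} samples, F-G among remaining beads = {fg_in_box:.3f}")
```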

It is important to keep in mind the distinction between F-G and f-g. F-G refers to the fraction of grey beads in the entire box. In the language of statistics, all the observations in which we are interested are called a population. In this example, the term population refers to the color of all the beads in the box. The term sample refers to a subset of the population. Thus, F-G refers to the population while f-g refers to the sample. Our assigned task is to gain as much knowledge as possible about the value of F-G by observing the value f-g over 50 separate samples. It is also important to keep clear the distinction between two numbers: the number of observations comprising a sample (in this case, 20 beads) and the number of samples taken (in this case, 50).


Probability is the mathematics of chance. A probability experiment is an observation of, or a manipulation of, our environment that has an uncertain outcome. This would include actions such as noting the precise time of arrival after a long trip, observing the number of inches of snow that falls in a storm, or noting the face of a coin that appears after a toss.

The quantity or quality observed in a probability experiment is called a random variable, such as the face of a coin after a flip, the number of inches of snow that fell, or a value that summarizes a sample of observations (e.g., a sample average). This quantity or quality is said to be random because it is affected by chance. Whereas an individual observation of a random variable is, by definition, unpredictable, a large number of observations of a random variable may have highly predictable features. For example, on a given coin toss, it is impossible to predict heads or tails. However, over one thousand tosses, it is highly predictable that the number of heads will lie within a specified range around 500.
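The claim about one thousand tosses can be checked with an exact binomial calculation. The sketch below (the bands of plus or minus 25 and plus or minus 50 heads around 500 are example choices, not figures from the text) computes the probability that the number of heads falls inside a given range.

```python
from math import comb

def prob_heads_between(n: int, lo: int, hi: int, p: float = 0.5) -> float:
    """Exact binomial probability that the number of heads in n tosses
    of a coin with heads probability p lies between lo and hi inclusive."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(lo, hi + 1))

# A single toss is unpredictable, but over 1,000 tosses the head count
# is very likely to land close to 500.
print(round(prob_heads_between(1000, 475, 525), 3))  # roughly 0.89
print(round(prob_heads_between(1000, 450, 550), 3))  # roughly 0.999
```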

Depending on how it is defined, a random variable assumes at least two different values, though it can assume more, perhaps even an infinite number. The random variable in a coin toss, the face visible after the toss, can assume two possible values (heads or tails). The random variable defined as the temperature at noon taken at the base of the Statue of Liberty can assume a very large number of values, with the number limited only by the precision of the thermometer.



Excerpted with permission of the publisher John Wiley & Sons, Inc. from "Evidence-Based Technical Analysis". Copyright (c) 2007 by David Aronson. This book is available at all bookstores, online booksellers, and from the Wiley web site, or by calling 1-800-225-5945.