In rule data mining, many rules are back tested and the rule with the best observed performance is selected. That is to say, data mining involves a performance competition that leads to a winning rule being picked. The problem is that the observed performance that allowed the winning rule to be picked over all other rules systematically overstates how well that rule is likely to perform in the future. This systematic error is the data-mining bias.
Despite this problem, data mining is a useful research approach. It can be proven mathematically that, out of all the rules tested, the rule with the highest observed performance is the rule most likely to do the best in the future, provided a sufficient number of observations are used to compute performance statistics.1 In other words, it pays to data mine even though the best rule's observed performance is positively biased. This chapter explains why the bias occurs, why it must be taken into account when making inferences about the future performance of the best rule, and how such inferences can be made.
I begin by introducing this somewhat abstract topic with several anecdotes, only one of which is related to rule data mining. They are appetizers intended to make later material more digestible. Readers who want to start on the main course immediately may choose to skip to the section titled "Data Mining."
The following definitions will be used throughout this chapter and are placed here for the convenience of the reader.
- Expected performance: the expected return of a rule in the immediate practical future. This can also be called the true performance of the rule, which is attributable to its legitimate predictive power.
- Observed performance: the rate of return earned by a rule in a back test.
- Data-mining bias: the expected difference between the observed performance of the best rule and its expected performance. Expected difference refers to a long-run average difference that would be obtained by numerous experiments that measure the difference between the observed return of the best rule and the expected return of the best rule.
- Data mining: the process of looking for patterns, models, predictive rules, and so forth in large statistical databases.
- Best rule: the rule with the best observed performance when many rules are back tested and their performances are compared.
- In-sample data: the data used for data mining (i.e., rule back testing).
- Out-of-sample data: data not used in the data mining or back-testing process.
- Rule universe: the full set of rules back tested in a data mining venture.
- Universe size: the number of rules comprising the rule universe.
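The definition of the data-mining bias can be made concrete with a small simulation. The sketch below is illustrative only and rests on assumptions that are not part of the text: every rule has an expected performance of exactly zero, each back test is a series of independent normally distributed returns, and the function names and parameter values are hypothetical.

```python
import random


def best_rule_observed_mean(n_rules, n_obs, rng):
    """Back test n_rules rules that have zero expected return and
    return the observed mean return of the best rule."""
    best = float("-inf")
    for _ in range(n_rules):
        # Assumption: returns are pure noise (mean 0, standard deviation 1).
        returns = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        best = max(best, sum(returns) / n_obs)
    return best


def estimate_data_mining_bias(n_rules, n_obs=50, n_experiments=100, seed=0):
    """Average, over many experiments, of (observed performance of the
    best rule) minus (its expected performance, which is 0 here)."""
    rng = random.Random(seed)
    expected_performance = 0.0
    differences = [
        best_rule_observed_mean(n_rules, n_obs, rng) - expected_performance
        for _ in range(n_experiments)
    ]
    return sum(differences) / n_experiments


if __name__ == "__main__":
    for universe_size in (1, 10, 100):
        bias = estimate_data_mining_bias(universe_size)
        print(f"universe size {universe_size:>3}: estimated bias {bias:.3f}")
```

With a universe size of one there is no competition, so the estimated bias should hover near zero. As the universe size grows, the best rule's observed performance should drift further above its expected performance, even though no rule in this sketch has any predictive power.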
FALLING INTO THE PIT: TALES OF THE DATA-MINING BIAS
The following account is apocryphal. Some years ago, before I took up the study of statistics, I was approached by an impresario seeking backers for a show business venture that he claimed would be enormously profitable. The show was to feature a monkey that could write Shakespearian prose by dancing on the keyboard of a word processor.
At each show, the literate monkey, whom the promoter had named the Bard, would be placed at a keyboard and a large screen would display to an audience what the monkey wrote, as it was being written. Surely, people would rush to see the Bard, and my share of ticket sales would yield a handsome return on the required investment of $50,000. At least, that was the promoter's claim. "It can't miss!" was his refrain.
I was intrigued, but wanted some proof that the monkey could actually produce Shakespearian prose. Any investor would. I was given proof . . . of a sort. It was what accountants call a cold-comfort letter. The letter said, "We have examined the Bard's previous works and he has in fact written the words 'To be or not to be, that is the question.' We are, however, unfamiliar with the circumstances under which these words were written."
What I really wanted was a live demonstration. Regrettably, my request could not be accommodated. The impatient promoter explained the monkey was temperamental and besides, there were many other anxious investors clamoring to buy the limited number of shares being offered. So I seized the opportunity and plunked down $50,000. I was confident it was just a matter of time before the profits would start flowing in.
The night of the first show arrived. Carnegie Hall was packed to capacity with a crowd that anxiously awaited the first words. With everyone's eyes glued to the big screen, the Bard's first line of text appeared.
lkas1dlk5jf wo44iuldjs sk0ek 123pwkdzsdidip'adipjasdopiksd
Things went downhill quickly from there. The audience began screaming for refunds, I threw up, and the Bard defecated on the keyboard before scampering off the stage. My investment went up in a cloud of smoke.
What happened? Thinking it unimportant, the promoter failed to disclose a key fact. The Bard had been chosen from 1,125,385 other monkeys, all of whom had been given the opportunity to dance on a keyboard every day for the past 11 years, 4 months, and 5 days. A computer monitored all their gibberish to flag any sequence of letters that matched anything ever written by Shakespeare. The Bard was the first monkey to ever do so.
Even in my state of statistical illiteracy, I doubt I would have invested had I known this. Mere common sense would have told me that chance alone favored the occurrence of some Shakespearian quote in such a large mass of nonsense. The freedom to data mine the trillions of letters generated by an army of monkeys raised the probability of a lucky sequence of letters to a virtual certainty. The Bard was not literate, he was just lucky.
The fault, Dear Brutus, lay with the sincere but statistically naïve promoter. He was deluded by the data-mining bias and attributed too much significance to a result obtained by data mining. Despite my loss, I've tried not to judge the promoter too harshly. He sincerely believed that he had found a truly remarkable monkey. He was simply misled by intuition, a faculty inadequate for evaluating matters statistical and probabilistic.
By the way, the promoter has kept the Bard as a pet and still allows him to dance on that once-magical keyboard in hopes of new evidence of literacy. In the meanwhile, to keep body and soul together, he is now selling technical trading systems developed along similar lines. He has a throng of dancing monkeys developing rules, some of which seem to work quite well, in the historical data.
Proving the Existence of God with Baseball Stats
Collectors of sports statistics have also been seduced by the data-mining bias. For example, there is Norman Bloom, who concluded that interesting and unusual patterns found in baseball statistics prove the existence of God. After thousands of searches through his database, the dedicated data miner found patterns he believed to be so amazing they could only be explained by a universe made orderly by God.
One of Bloom's patterns was as follows: George Brett, the third baseman for Kansas City, hit his third home run in the third game of the playoffs, to tie the score 3-3. Bloom reasoned that the number three being connected in so many ways compelled the conclusion that it was the handiwork of God. Another interesting pattern discovered by Bloom had to do with the stock market: The Dow Jones Industrial Average crossed the 1,000 level 13 times in 1976, miraculously similar to the fact that there were 13 original colonies that united in 1776 to form the United States.
As pointed out by Ronald Kahn,2 Bloom committed several errors on the way to his unjustified conclusions. First, he did not understand the role of randomness and that seemingly rare coincidences are in fact quite probable if one searches enough. Bloom found his mystical patterns by evaluating thousands of possible attribute combinations. Second, Bloom did not specify what constituted an important pattern before he began his searches. Instead, he took the liberty of using an arbitrary criterion of importance defined after the fact. Whatever struck his fancy as interesting and unusual was deemed to be important. Kahn points out that one is guaranteed to discover "interesting" patterns when they are searched for in such an undisciplined manner.
Discovering Hidden Predictions in the Old Testament
Even Bible scholars have fallen into the data-mining pit. In this instance, the well-intentioned but statistically unsophisticated researchers found predictions of major world events encoded in the text of the Old Testament. Knowledge of the future would certainly imply that the words had been inspired by an omniscient Creator. However, there was one small problem with these predictions, known as Bible Codes. They were always discovered after the predicted event had taken place. In other words, the codes predict with 20/20 hindsight.3
The Bible Codes are clusters of words imbedded in the text that are discovered by linking together letters separated by a specific number of intervening spaces or other letters. These constructed words are referred to as equal letter sequences or ELS. Code researchers grant themselves the freedom to try any spacing interval they wish and allow the words comprising the cluster to be arranged in virtually any configuration so long as the cluster occurs in what the researcher deems4 to be a compact region of the original text. What constitutes a compact region and what words constitute a prediction are always defined after a code has been discovered. Note the use of an evaluation criterion defined after the fact. This is not scientifically kosher.
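The mechanics of an equal letter sequence search can be sketched in a few lines. This is a simplified illustration, not the code researchers' actual procedure: it works on any alphabetic text, searches forward only, and looks for a single word rather than a cluster. The function names and the sample text are hypothetical.

```python
def normalize(text):
    """Keep letters only, lowercased, so that spaces and punctuation
    do not count as intervening characters."""
    return "".join(ch.lower() for ch in text if ch.isalpha())


def find_els(text, word, max_skip):
    """Return (start, skip) pairs where `word` is spelled out by taking
    every skip-th letter of `text`, for skips from 1 to max_skip."""
    letters = normalize(text)
    target = normalize(word)
    hits = []
    if not target:
        return hits
    for skip in range(1, max_skip + 1):
        # Distance from the first letter of the word to its last letter.
        span = skip * (len(target) - 1)
        for start in range(len(letters) - span):
            if letters[start:start + span + 1:skip] == target:
                hits.append((start, skip))
    return hits


if __name__ == "__main__":
    sample = "Call me Ishmael. Some years ago, never mind how long precisely"
    print(find_els(sample, "ram", max_skip=20))
```

Each additional skip value is another pass through the whole text, so raising `max_skip` multiplies the number of candidate letter strings examined. That growth in the search space is what makes chance matches likely.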
The Bible Code scholars contend that the occurrence of a code is so statistically unlikely that it can only be explained by its having been put there by God. Their fundamental error, the error made by all naive data miners, is the failure to understand that given enough searching (data mining), the occurrence of such patterns is actually highly probable. Thus, it is likely that researchers will find codes that correspond to names, places, and events of historical importance. For example, finding the word 1990 in the same region of text as Saddam Hussein and war is not a rare event requiring a metaphysical explanation. However, when found in 1992, after the first Iraq war had taken place, the word pattern seemingly predicted the 1990 Iraq war. Bear in mind that the words Bush, Iraq, invasion, and desert storm would serve just as nicely as a code that also appears to predict the 1990 Iraq war. Indeed, there are a huge number of word combinations that would correspond to the 1990 war, after the particulars of that historical event are known.
In his 1997 book, The Bible Code, author Michael Drosnin, a journalist with no formal training in statistics, describes the research of Dr. Eliyahu Rips. Dr. Rips is an expert in the mathematics of group theory, a branch of mathematics that is not particularly relevant to the problem of data-mining bias. Though Drosnin claims that the Bible Codes have been endorsed by a roster of famous mathematicians, 45 statisticians who reviewed Rips's work found it to be totally unconvincing.5 Data-mining bias is, at its heart, a problem of faulty statistical inference.
Statisticians take a dim view of the unconstrained searching practiced by Bible Code researchers. It commits a mathematical sin called the excessive burning of degrees of freedom. To the statistical sophisticate, the stench produced by this incineration is most foul. As Dr. Barry Simon points out in "A Skeptical Look at the Torah Codes,"6 approximately 3 billion possible words can be produced from the existing text of Chumash, just one of the 14 books comprising the Torah, when the ELS spacing interval is allowed to vary from 1 to 5,000. Searching this set of manufactured words for interesting configurations is no different from searching through tons of gibberish written by an army of monkeys dancing on keyboards.
The foolishness of the Bible Code scholars' search algorithms becomes apparent when they are applied to non-Biblical texts of similar length, such as the Sears catalogue, Moby Dick, Tolstoy's War and Peace, or the Chicago telephone directory. When these texts are searched, coded after-the-fact predictions of historical events are also found. This suggests the codes are a by-product of the search method and not of the text being searched.
In a more recent book by Drosnin, The Bible Code II: The Countdown, he shows how the codes "predicted" the terrible events of 9/11/01. Why, you ask, didn't he warn us before the events happened? He did not because he could not. He discovered the predictions after the tragedy occurred. Drosnin is another example of a well-intentioned but naïve researcher fooled by the data-mining bias.
Data-Mining Newspaper Reporters
Newspaper reporters have also been duped by the data-mining bias. In the mid-1980s, they reported the story of Evelyn Adams, who had won the New Jersey state lottery twice in four months.7 Newspaper accounts put the probability of such an occurrence at 1 in 17 trillion. In fact, the probability of finding a double winner was far higher and the story far less newsworthy than the reporters had thought.
The before-the-fact (a priori) probability that Ms. Adams or any other individual will win the lottery twice is indeed 1 in 17 trillion. However, the after-the-fact probability of finding someone who has already won twice by searching the entire universe of all lottery players is far higher. Harvard statisticians Persi Diaconis and Frederick Mosteller estimated the probability to be about 1 in 30.
The qualifier after-the-fact is the key. It refers to a perusal of data after outcomes are known. Just as the probability that any individual monkey will in the future produce a Shakespearian quote is extremely small, the probability that there exists some monkey, among millions of monkeys, that has already produced some literate prose, is substantially higher. Given enough opportunity, randomness produces some extraordinary outcomes. The seemingly rare is actually quite likely.
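The gap between the two kinds of probability can be illustrated with a short calculation. The sketch below assumes independent opportunities that each have the same small chance of success, which is a simplification, and the numbers used are hypothetical. It is not a reconstruction of the Diaconis and Mosteller estimate.

```python
import math


def prob_at_least_one(p_single, n_opportunities):
    """Probability that at least one of n independent opportunities
    succeeds, each with probability p_single: 1 - (1 - p)^n.

    log1p and expm1 keep the result accurate when p_single is tiny."""
    return -math.expm1(n_opportunities * math.log1p(-p_single))


if __name__ == "__main__":
    # Hypothetical values: a one-in-a-million event, examined after the
    # fact across one million independent opportunities.
    p_single = 1e-6
    n_opportunities = 1_000_000
    print(f"a priori, one named individual: {p_single:.6%}")
    print(f"after the fact, anyone at all:  "
          f"{prob_at_least_one(p_single, n_opportunities):.1%}")
```

Under these assumed numbers, an event that is a one-in-a-million long shot for any named individual has roughly a 63 percent chance of having happened to someone.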
Mining the UN Database for Gold and Finding Butter
David J. Leinweber, on the faculty of the California Institute of Technology and formerly a managing partner at First Quadrant, a quantitative pension management company, has warned financial market researchers about the data-mining bias. To illustrate the pitfalls of excessive searching, he tested several hundred economic time series in a UN database to find the one with the highest predictive correlation to the S&P 500. It turned out to be the level of butter production in Bangladesh, with a correlation of about 0.70, an unusually high correlation in the domain of economic forecasting.
Intuition alone would tell us a high correlation between Bangladesh butter and the S&P 500 is specious, but now imagine if the time series with the highest correlation had a plausible connection to the S&P 500. Intuition would not warn us. As Leinweber points out, when the total number of time series examined is taken into account, the correlation between Bangladesh butter and the S&P 500 Index is not statistically significant.
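A simulation shows how easily a search of this kind turns up a high correlation by chance. The sketch below does not use Leinweber's data. It assumes the target and all candidate series are independent random noise, and the number of series (300) and observations (10) are hypothetical values chosen for illustration.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def best_chance_correlation(n_series, n_obs, seed=0):
    """Correlate n_series unrelated random series with a random target.

    Returns (largest absolute correlation found, absolute correlation
    of the first series examined)."""
    rng = random.Random(seed)
    target = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
    correlations = []
    for _ in range(n_series):
        # Assumption: each candidate is noise, unrelated to the target.
        candidate = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        correlations.append(abs(pearson(candidate, target)))
    return max(correlations), correlations[0]


if __name__ == "__main__":
    best, first = best_chance_correlation(n_series=300, n_obs=10)
    print(f"first series examined: |r| = {first:.2f}")
    print(f"best of 300 series:    |r| = {best:.2f}")
```

The winning series is unrelated to the target by construction, so its correlation has to be judged against the distribution of the best result from a search of the same size, not against the distribution for a single series chosen in advance.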
The bottom line: whether one searches sports statistics, the Bible, the random writings of monkeys, the universe of lottery players, or financial market history, data mining can lead to false conclusions if the data-mining bias is not taken into account.