The Gateway to Algorithmic and Automated Trading

Fiddling with figures

Published in Automated Trader Magazine Issue 35 Winter 2015

When an academic paper equated backtest overfitting with financial charlatanism, there was little surprise from the trading community. Investment managers want to prove profitable trading strategies, but do they ignore the evidence? Priyanjana Bengani reports on the discord as well as industry leaders who are playing the right tune.

Jim O'Shaughnessy, CEO, O'Shaughnessy Asset Management

A group of researchers have taken issue with the misuse of quantitative techniques, claiming there is a lack of stringent mathematical diligence in testing investment strategies. The paper, "Pseudo-mathematics and financial charlatanism", goes so far as to equate the effects of backtest overfitting on out of sample performance to fraud, though with the caveat that such actions could be inadvertent.

Irrespective of its economic and financial viability, a strategy in its first stages is unproven. To help determine real-world performance, it undergoes backtesting: multiple simulations of the strategy are run against historical financial data to derive a host of performance metrics. In the absence of scrupulous due diligence, the metrics may not indicate an optimal strategy at all, but may instead be biased by statistical flukes or false positives.

When a strategy or its parameters are tweaked multiple times against the same initial data set, or the in sample set, the problem of overfitting, or confusing noise with signal, arises. David Bailey, a research fellow at the Department of Computer Science at the University of California and one of the four authors of the paper, explained that "when a statistical test is applied multiple times, the probability of obtaining false positives increases as a function of the number of trials".
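Bailey's point can be illustrated with a few lines of arithmetic. The sketch below (a minimal illustration, not drawn from the paper) computes the probability of at least one false positive when the same 5%-significance test is repeated across independent trials:

```python
# Probability of at least one false positive ("discovery") when the same
# 5%-significance test is applied independently across n_trials trials.
def family_wise_error(alpha: float, n_trials: int) -> float:
    return 1.0 - (1.0 - alpha) ** n_trials

for n in (1, 10, 20, 100):
    print(n, round(family_wise_error(0.05, n), 3))
# 1 trial:    0.05
# 100 trials: 0.994 -- a false discovery is all but guaranteed
```

Even twenty reruns of the same nominally rigorous test push the odds of a spurious "discovery" past 60%.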


Similarly, model complexity - when multiple parameters are added to the strategy in order to achieve the most favourable result - is also a concern. That's because by using enough input variables, any result can be obtained; a fact described by renowned mathematician John von Neumann, who said: "With four parameters I can fit an elephant, and with five I can make him wiggle his trunk."
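Von Neumann's quip can be demonstrated in miniature. The following sketch (illustrative only; the function names are ours) fits a polynomial with as many coefficients as observations through pure noise, producing a perfect in sample fit of nothing:

```python
import random

# With as many parameters as observations, any data can be fitted exactly:
# a degree-(n-1) polynomial (n coefficients) interpolates n arbitrary points.
def lagrange_fit(xs, ys):
    """Return a function passing exactly through every (x, y) pair."""
    def poly(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return poly

random.seed(42)
xs = [float(i) for i in range(5)]
ys = [random.gauss(0, 1) for _ in xs]  # pure noise "returns"
model = lagrange_fit(xs, ys)
in_sample_error = max(abs(model(x) - y) for x, y in zip(xs, ys))
print(in_sample_error)  # effectively zero: a flawless fit of noise
```

The fit is flawless in sample, yet the model has learned nothing; its out of sample predictions are worthless.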

Even if the basis of the strategy is false, there exists a configuration for that strategy which yields an outstanding performance with in sample data. Any model that is specifically tailored for such a data set is unlikely to perform similarly against different out of sample data or in live markets.

And so, many strategies fail when they go live.

It is imperative, once a model and its parameters are extrapolated from the in sample set, to run through an out of sample test to ensure that the observations pertaining to the model can be generalised. But even out of sample testing does not prevent overfitting. The reason is that one can repeat those tests as many times as needed to obtain a desirable outcome.

The key is to control for the number of trials involved in a particular finding to determine whether the result is a fluke.
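A quick Monte Carlo experiment makes the point concrete. This sketch (parameters and names are illustrative) generates zero-edge random strategies and reports the best in sample Sharpe ratio found, which looks better the more trials are run:

```python
import random
import statistics

# Selection bias under repeated trials: pick the best of n_trials random
# (zero-edge) strategies by in sample Sharpe ratio. The winner looks
# impressive purely by luck -- the more trials, the better it looks.
def best_in_sample_sharpe(n_trials: int, n_days: int = 252, seed: int = 7) -> float:
    rng = random.Random(seed)
    best = float("-inf")
    for _ in range(n_trials):
        rets = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        sharpe = statistics.mean(rets) / statistics.stdev(rets) * (252 ** 0.5)
        best = max(best, sharpe)
    return best

print(best_in_sample_sharpe(1))     # one zero-edge strategy: unremarkable
print(best_in_sample_sharpe(1000))  # best of 1,000: looks stellar, by luck alone
```

Unless the number of trials behind that headline Sharpe ratio is disclosed, there is no way to tell the lucky winner from a genuine edge.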

Marcos Lopez de Prado, a research affiliate at Lawrence Berkeley National Laboratory, also author of the paper, adds another dimension to the problem of obtaining trustworthy results.

Marcos Lopez de Prado, Lawrence Berkeley National Laboratory

"A backtested strategy is doubly questionable if the time period covered by the backtest is not truly representative of what can be expected in the future," he said. "For a given confidence level, the minimum track record is smaller when many independent bets take place per year, and returns exhibit positive skewness and negative excess kurtosis."

In other words, a shorter track record suffices when the strategy places many independent bets each year and its returns lean towards gains (positive skewness) without extreme outliers (negative excess kurtosis, or thinner-than-normal tails).
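For readers who want to check these distributional properties on their own return series, the following sketch (a hypothetical diagnostic, not code from the paper) computes sample skewness and excess kurtosis:

```python
# Hypothetical diagnostic: sample skewness and excess kurtosis of a return
# series. Per Lopez de Prado's comment, positive skew and negative excess
# kurtosis shorten the track record needed to trust a given Sharpe ratio.
def skew_and_excess_kurtosis(returns):
    n = len(returns)
    mean = sum(returns) / n
    m2 = sum((r - mean) ** 2 for r in returns) / n  # variance
    m3 = sum((r - mean) ** 3 for r in returns) / n  # third central moment
    m4 = sum((r - mean) ** 4 for r in returns) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return skew, excess_kurtosis
```

A return series with negative skew and fat tails (positive excess kurtosis) demands a correspondingly longer track record before its Sharpe ratio deserves any confidence.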

The paper's authors also suggest frameworks and tools to help address these issues, in addition to encouraging discussion on how to reduce the prevalence of misleading test results.

University of California's David Bailey pointed out that perhaps investment funds, and even academics publishing research papers, should disclose the extent and nature of testing, and the number of strategy variations explored.

"Something along this line may be workable, although at first glance it seems hard to design and enforce any specific procedure that would truly be effective in curbing abuse," he said. "In any event, we are not optimistic that there is an easy way to completely prevent these problems."

His advice, along with the other authors, to an investment firm or researcher is to learn as much as possible about potential problems and then "let your competitor make the bigger mistake".


The problem with the academic approach is that it's a chimera, said Jim O'Shaughnessy CEO of O'Shaughnessy Asset Management and author of "What works on Wall Street". "If you don't account for float, liquidity and market capitalisation, it's not real," he said.

The firm manages some $7.3 billion across a broad range of equity portfolios, with all of its strategies historically tested.

"The rule of thumb is that each strategy must make intuitive economic sense and have no exotic correlations," he said. "You can't simply take the best-performing stocks. Start with the entire universe and run the selection strategy on it."

After strategy parameters are defined, they are bootstrapped, where each parameter is run against random sub-periods of historical data independent of the others. None of the individual runs should be vastly different from the overall run, and directionality should always be consistent. Otherwise, an investigation is needed to check that selection bias hasn't crept in.
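A stripped-down version of this sub-period bootstrap might look as follows (the function and its thresholds are illustrative, not O'Shaughnessy Asset Management's actual code):

```python
import random
import statistics

# Illustrative sub-period bootstrap: run the strategy's returns over random
# sub-periods of the history and check that each run points the same way
# as the full-period result. Low agreement flags possible selection bias.
def subperiod_consistency(returns, n_runs=100, window=60, seed=1):
    rng = random.Random(seed)
    overall = statistics.mean(returns)
    agreements = []
    for _ in range(n_runs):
        start = rng.randrange(len(returns) - window)
        sub = returns[start:start + window]
        # Does this sub-period point the same direction as the full run?
        agreements.append(statistics.mean(sub) * overall > 0)
    return sum(agreements) / n_runs  # fraction of runs agreeing
```

A strategy whose sub-period runs frequently flip direction against the full-period result warrants exactly the investigation O'Shaughnessy describes.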

David Bailey, Department of Computer Science, University of California

"It is also important to determine how liquid the names are. If illiquid names make up the majority, it is a red flag," said O'Shaughnessy. "You also have to go through the results and see if there's over-concentration on any one of the ten economic sectors, with the caveat that certain strategies are designed to work on the more attractive sectors."

Once the strategy is backtested successfully, it runs in a "seed account", for up to three years. During this time-frame, the strategy is closely watched to ensure that real-time performance is not diverging from the behaviour observed during backtesting. Only then is the strategy promoted to production and made available to clients.

O'Shaughnessy highlights the fact that backtesting numbers are not made available to the client. Instead, live performance numbers and a metric derived by an independent third party - the GIPS-compliant track record - are used to solicit funds. This ensures that clients are not misled by overfit backtesting numbers.

O'Shaughnessy stressed the importance of testing over lengthy periods and ensuring that the dataset used is reliable and comprehensive. Survivorship bias - ignoring stocks that are not in the historical data sets due to valuation, corporate action or bankruptcy - can skew results.

As can recency bias - favouring products with the best price momentum in the recent past.


A paper published by Vanguard in 2012 explored the consequences of recency bias when it came to launching ETFs based on newly created, highly customised indices. Out of 370 indices with at least six months of backfilled data and six months of live data between 2000 and 2011, 87% outperformed the broad US stock market on an annualised basis, whereas only 51% outperformed the market after the index launch.

This would seem to indicate that sponsors are inclined to create new products based on recent performance. In the paper, Vanguard puts it this way: "The adage that past performance may not be an indicator of future results is especially true when the past performance is hypothetical."

Joel Dickson, global head of investment research and development at Vanguard, explained that some investment managers go down this route, despite the uncertainty of long-term benefits, in part because ETFs are most successful in the first two years.

"The performance is unknown but investors may validate the strategy with backfilled data to determine its viability. With backtested data, the average cash flow in the first six months of the ETF inception can be almost twice as much as the average cash flow generated without backtested data," said Dickson. "This can help jumpstart a new fund."

Vanguard ETFs track vanilla indices that target broad exposure, eliminating the issues that emerge from using highly specialised indices, he added.

When new indices are adopted, they undergo a standard verification process. Relying on index providers for the methodology and compositions, Vanguard's quantitative group thoroughly backtests new models by looking at the historical data, underlying financial models and economies of scale.

Among other due diligence checks, the firm establishes the similarities and differences between indices through time, compares index methodologies, conducts transaction cost analyses, and ensures that material differences across the benchmarks are accounted for. This is used as the basis to determine the index a given ETF should track.

Dickson cautioned investors against being swayed by the attractive hypothetical past performance of narrow segment indices. "As new indices are constructed, data is available on the provider's website or through third-party data providers," he said. So, even though ETF providers cannot market the backfilled performance for regulatory reasons, investors can often access the performance metrics for the underlying indices easily and consequently make ill-informed decisions.

"The onus is on the investors. Buyers beware," Dickson added.


Rodney Ngone, chief investment officer, SAGAT Capital

Rodney Ngone, chief investment officer at SAGAT Capital, agreed that clients have to do their homework and look at live track records, daily returns, as well as understand operational due diligence to ascertain a manager's future performance. The NFA (National Futures Association) does not allow backtesting numbers to be marketed to clients if a manager has a sufficient live trading track record. He added that most investors want a proven track record and not a simulation number.

Currently, SAGAT Capital trades short-term systematic futures across commodities, equity indices and foreign exchange in ten different US markets, using automated momentum-based algorithms - one per market - taking advantage of patterns distinct to each market. They take a top-down approach, looking to emulate the intuitiveness possessed by a macro trader. Ngone described this as evolutionary learning, where models are constructed by hand.

"The future isn't people trading," he said. "Automation and artificial intelligence are the way forward. (Systematising) processes is the best way to avoid letting emotions in the market get the better of you," he said.

This is hardly straightforward.

"When you train a model, it can be highly optimistic. Noise can potentially be considered a legitimate pattern," said Ngone. To ensure that each model is free of noise and unbiased, a robust technique known as the "walk forward" process is used to test against three different data sets: trading set (in sample), validation set (in sample) and test set (out of sample). The model runs through the data sets in chronological order, with the optimisation windows tuned per market.
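A minimal sketch of such chronological splitting (window sizes are illustrative; this is not SAGAT's implementation) could look like this:

```python
# Walk-forward splitting: each window yields a training set (fit), a
# validation set (tune) and a test set (out of sample evaluation),
# in strict chronological order, then rolls forward by `step`.
def walk_forward_splits(n_obs, train=500, valid=100, test=100, step=100):
    splits = []
    start = 0
    while start + train + valid + test <= n_obs:
        splits.append((
            range(start, start + train),                                  # fit
            range(start + train, start + train + valid),                  # tune
            range(start + train + valid, start + train + valid + test),   # evaluate
        ))
        start += step
    return splits

for tr, va, te in walk_forward_splits(1000):
    print(tr.start, tr.stop, va.start, va.stop, te.start, te.stop)
```

Because every test window lies strictly after the data that shaped the model, the evaluation never peeks at the future.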


Backtesting execution algorithms has its own set of challenges. For new strategies, it is necessary to run basic tests, check risk controls, validate transitions and run against the simulator, said Takis Christias, Citi's EMEA head of Equities Algorithmic Products.

"Every strategy has a different benchmark," he said. "However, the big weakness is market participants might behave differently day-to-day."

Christias' concern is that there is no real way to predict how the market will react, and backtesting falls short in this respect.

One might remember that the great physicist Isaac Newton lost £20,000 during the South Sea bubble, stating: "I can calculate the movement of the stars, but not the madness of men."

Christias recommends the use of a market simulator rich enough to simulate a wide range of market scenarios at both the macro level (mean reversion or trending of market prices) and the micro level (sizes at different levels of the book, spread changes, trading frequency changes, quote updates, among others).

"There is a need for a sophisticated market simulator that incorporates the trading costs into the evolution of prices and order book dynamics," he said, adding that running a strategy multiple times against such a simulator and observing the results would be more beneficial.
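A toy version of such a simulator (all parameters are illustrative; Citi's actual simulator is far richer) might combine macro-level mean reversion with micro-level spread fluctuations:

```python
import random

# Toy market simulator along the lines Christias describes: a mean-reverting
# mid price at the macro level, plus a fluctuating bid-ask spread at the
# micro level. Running a strategy across many seeds probes many scenarios.
def simulate_market(n_steps=1000, mid0=100.0, kappa=0.05, sigma=0.1,
                    base_spread=0.02, seed=3):
    rng = random.Random(seed)
    mid, path = mid0, []
    for _ in range(n_steps):
        # macro: Ornstein-Uhlenbeck-style pull back toward the long-run level
        mid += kappa * (mid0 - mid) + rng.gauss(0.0, sigma)
        # micro: spread widens and narrows randomly around a base level
        spread = base_spread * (0.5 + rng.random())
        path.append((mid - spread / 2, mid + spread / 2))  # (bid, ask)
    return path
```

Running the same execution strategy against many such simulated paths - varying the seed, the reversion strength and the spread regime - gives exactly the repeated-scenario evidence Christias favours over a single historical replay.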

The backtesting protocol is relatively straightforward.

"Come up with a model to approximate market variables, go through the calibration phase to find the optimal parameters, which yields the statistical approximation of the model," Christias said. "You don't need years of market data for testing. Old market data is not representative of the current market."

Presently, his team uses five to seven years of market data to backtest strategies and run separate simulations against data from days where major price moves occurred.

New strategies are first tested internally prior to being rolled out to clients, as Citi's internal trading desks are early adopters of new developments. This approach also ensures that new functionality has been tested thoroughly, and any necessary modifications are implemented well before it is offered for use by clients.

"Clients are given performance statistics based on real trading, and not simulation results to evaluate," he pointed out.


Financial charlatanism is strong language and an eye-catching title for an academic paper, but the problem is neither new nor restricted to the financial industry. Decade-old concerns raised over medical research resulted in a major overhaul across the pharmaceuticals industry, pushing companies to become more accountable and transparent.

Marcos Lopez de Prado, of Lawrence Berkeley National Laboratory, considers overfitting, whether intentional or not, to be a form of false advertising, similar in spirit to the SEC's recent finding that certain hedge funds are guilty of "cherry picking" historical results in their promotional material. He also references comments by Campbell Harvey, professor at Duke University's Fuqua School of Business, who said that most of the empirical research in finance, whether published in academic journals or put into production as an active trading strategy by an investment manager, is likely false.

The good news is that academics, regulators and practitioners are aware of the issues surrounding backtesting and particularly of the consequences of misleading clients.

Jim O'Shaughnessy pointed out that the asset management industry is held to a fiduciary standard by regulators, meaning that interests of clients trump all. "It is the height of extreme foolishness to mislead investors," he added.