Cointegration: Assume Nothing, Check Everything

Published in Automated Trader Magazine Issue 28 Q1 2013

This issue, in a departure from our usual review format, we take a look at something that the Wrecking Crew frequently encounters when reviewing stat arb software - cointegration testing. Used appropriately, cointegration tests can add value, but misapply them and the only reversion you'll see is your trading equity in the direction of zero. Crew members Aly Kassam and Michael Weidman of Quantitative Support Services explain how really understanding what you are doing and appreciating the limitations of cointegration tests is critical to their successful deployment.

Familiarity breeds contempt, and the extensive dissemination of information on cointegration tests is a case in point: their ready availability in both econometric and trading software has led to their frequent misapplication. It's easy to assume that, since pairs trading has grown rapidly in popularity in the retail space, this is where most of the misapplication takes place. Some of it almost certainly does, but many professional trading operations have also tended to make easy assumptions about the tests, without really taking the trouble to understand what is going on under the hood. This understanding doesn't need to be at a microscopic level, but should at least extend to knowing exactly which flavour of the test is appropriate for a particular purpose and understanding what any results really mean in practical terms.

Built for something else

The first thing to say is that the cointegration tests commonly used in trading today were never originally intended for that purpose. Instead, they have their origins in the world of economics where the concept of cointegration was created to model long run relationships between economic factors. In many cases there was a credible theoretical underpinning for these relationships, which is very seldom the case in the way they are applied in most trading scenarios.

Even within economics there has been considerable debate about the efficacy of cointegration tests for modelling econometric time series. One issue is that tests for stationarity and cointegration are not particularly 'strong', meaning that it is not too difficult to fool them. There is also no shortage of academic papers with titles including phrases such as 'fractional integration' or 'near integration'. In particular, it has been pointed out that the tests have low power, so that so-called 'near integrated' series are often mistakenly identified as fully integrated, thereby leading to unreliable results [1].

Whose Numbers Do You Trust?

When using the Engle-Granger test for cointegration the standard Dickey-Fuller tables are not applicable for gauging the stationarity of the residual spread because it is estimated, not observed. If you are confident that your software provider is not using the default values, how confident are you that the values they are using are valid?

The tables do not need to be specific to the time series being evaluated, but they do need to be sensitive to things such as the number of data points in the time series and the number of coefficients in the cointegrating equation. There is no closed form solution [2] available for them, so they have to be generated by Monte Carlo simulation. An alternative for those with the hardware and inclination is therefore to crunch them for themselves.
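As a rough illustration of what "crunching them for yourself" involves, the sketch below simulates pairs of unrelated random walks, regresses one on the other, and collects the Dickey-Fuller statistic of the residuals. It is a simplified sketch under stated assumptions: 100 observations, one predictor plus a constant, no lag augmentation, and only 2,000 simulations (a serious table would use far more). All function names are hypothetical and it uses only the Python standard library.

```python
import random

def ols_fit(x, y):
    """Return (intercept, slope) from regressing y on x with a constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def df_tstat(series):
    """Dickey-Fuller t-statistic (no constant, no lags):
    regress the change in the series on its lagged level."""
    lagged = series[:-1]
    change = [b - a for a, b in zip(series[:-1], series[1:])]
    sll = sum(v * v for v in lagged)
    rho = sum(l * d for l, d in zip(lagged, change)) / sll
    sse = sum((d - rho * l) ** 2 for l, d in zip(lagged, change))
    std_err = (sse / (len(change) - 1) / sll) ** 0.5
    return rho / std_err

def random_walk(n, rng):
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)
        path.append(level)
    return path

def residual_stat(n, rng):
    """DF statistic on the estimated residuals of two unrelated random walks."""
    x, y = random_walk(n, rng), random_walk(n, rng)
    intercept, slope = ols_fit(x, y)
    return df_tstat([b - intercept - slope * a for a, b in zip(x, y)])

def lower_quantile(stats, level):
    return sorted(stats)[int(level * len(stats))]

rng = random.Random(42)
n_obs, n_sims = 100, 2000  # assumed sizes, chosen only to keep the run short

# Null distribution when the series is directly observed
cv_observed = lower_quantile(
    [df_tstat(random_walk(n_obs, rng)) for _ in range(n_sims)], 0.05)
# Null distribution when the series is an estimated regression residual
cv_residual = lower_quantile(
    [residual_stat(n_obs, rng) for _ in range(n_sims)], 0.05)

print(f"5% critical value, observed series:     {cv_observed:.2f}")
print(f"5% critical value, estimated residuals: {cv_residual:.2f}")
```

The second figure should come out clearly more negative than the first, which is the point of the sidebar: judging estimated residuals against the table for observed series makes it too easy to declare cointegration.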

Caution is the watchword

If cointegration tests can be problematic in the area for which they were originally intended it goes without saying, hopefully, that caution is essential when applying them in another area, such as trading. It is essential to avoid the classic data mining trap of assembling a pile of data, cranking the metaphorical handle and believing what comes out of the other end.

Cointegration tests are just statistical hypothesis tests, so even in optimum conditions they are wrong by design five or one percent of the time (depending on which significance level you choose) in one direction (type I error) and potentially wrong a much higher percentage of the time in the opposite direction (type II error).

There is also a tendency to forget what terms such as 95% confidence mean, namely that when the null hypothesis is actually true the test will, by design, reject it and give the wrong answer one time out of 20. In a pairs trading context, this is how you can easily end up with a pair that has passed a cointegration test but that still has a spread that is trending all over the place and is obviously not remotely stationary.

One might naturally respond to this by upping the confidence threshold to 99%, so false identifications of cointegration will only happen once in 100 times, but unfortunately there is no free lunch on offer here. While this would reduce the risk of type I errors (false positives) it would also make type II errors (false negatives) more common. In other words, in order to reduce the number of pairs wrongly identified as cointegrated you would be increasing the number of pairs wrongly identified as not cointegrated and thereby passing up potentially profitable trading opportunities.

If one reconsiders the previous statement in the context of two time series run through one of the oldest and still very commonly used cointegration tests - Engle-Granger (see below) - the following emerges. Engle-Granger has the null hypothesis that the time series are not cointegrated and the alternative hypothesis that they are. Rather like the proverbial ball of mud, squeezing harder in one place to minimise type I errors allows the mud (profit) to escape elsewhere as type II errors. The essential trade-off here is between significance (the chance of making a type I error) and power (the chance of avoiding a type II error).
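The trade-off can be seen in a small simulation. The sketch below is illustrative only and makes several assumptions: it uses a plain Dickey-Fuller statistic (no constant, no lags) on a single series rather than a full cointegration test, 100 observations, and a 'near integrated' alternative with an autoregressive coefficient of 0.95. Critical values are taken from a simulated null distribution rather than from published tables, and all names are hypothetical.

```python
import random

def df_tstat(series):
    """Dickey-Fuller t-statistic (no constant, no lags)."""
    lagged = series[:-1]
    change = [b - a for a, b in zip(series[:-1], series[1:])]
    sll = sum(v * v for v in lagged)
    rho = sum(l * d for l, d in zip(lagged, change)) / sll
    sse = sum((d - rho * l) ** 2 for l, d in zip(lagged, change))
    std_err = (sse / (len(change) - 1) / sll) ** 0.5
    return rho / std_err

def ar1_path(phi, n, rng):
    """Simulate y_t = phi * y_(t-1) + noise; phi = 1.0 gives a random walk."""
    level, path = 0.0, []
    for _ in range(n):
        level = phi * level + rng.gauss(0.0, 1.0)
        path.append(level)
    return path

rng = random.Random(7)
n_obs, n_sims = 100, 2000

# Null hypothesis: unit root (not stationary)
null_stats = sorted(df_tstat(ar1_path(1.0, n_obs, rng)) for _ in range(n_sims))
# Alternative: stationary, but only just (near integrated)
alt_stats = [df_tstat(ar1_path(0.95, n_obs, rng)) for _ in range(n_sims)]

results = {}
for level in (0.05, 0.01):
    critical = null_stats[int(level * n_sims)]
    # Power: share of truly stationary series for which the null is rejected
    power = sum(stat < critical for stat in alt_stats) / n_sims
    results[level] = (critical, power)
    print(f"significance {level:.0%}: critical value {critical:.2f}, "
          f"power {power:.0%}, type II rate {1 - power:.0%}")
```

Tightening the significance level from 5% to 1% pushes the critical value further out, so fewer of the genuinely stationary series are detected: fewer false positives are bought with more false negatives.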

This may all seem relatively manageable when one thinks in terms of just 100 pairs, but what happens if you blindly extrapolate a pairs strategy across the entire EURO STOXX 500 index? Even at 99% confidence there will be approaching 2500 pairs falsely identified as cointegrated, to say nothing of the cointegrated pairs wrongly excluded.
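The 'approaching 2500' figure can be reproduced by simple counting, under assumptions that the text does not spell out: an index of 500 constituents, every pair tested in both Y/X orientations (so ordered pairs are counted), and no pair being genuinely cointegrated. The function name below is hypothetical.

```python
def expected_false_positives(n_assets, significance, both_orientations=True):
    """Return (number of tests, expected false positives) when no pair
    is truly cointegrated and each test has the stated type I error rate."""
    unordered_pairs = n_assets * (n_assets - 1) // 2
    n_tests = 2 * unordered_pairs if both_orientations else unordered_pairs
    return n_tests, n_tests * significance

for both in (True, False):
    n_tests, false_hits = expected_false_positives(500, 0.01, both)
    print(f"{n_tests} tests at 1% significance: "
          f"about {false_hits:.1f} expected false positives")
```

Counting each pair only once halves both numbers, but the conclusion is unchanged: at this scale, the false positives alone can easily outnumber the pairs anyone could sensibly trade.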

Aly Kassam

Popular cointegration tests I: Engle-Granger

There are two major cointegration testing frameworks commonly used in trading today - Engle-Granger and Johansen. Engle-Granger is almost certainly the more commonly used (especially among pairs traders). The Engle-Granger test is itself a two-step process, but it also requires an additional preliminary step for the results to be valid. For the property of cointegration to hold, none of the inputs (typically no more than two time series in the case of Engle-Granger) to the test should themselves be individually stationary. Using an already stationary time series as an input will generate misleading results, such as incorrectly showing that a trading opportunity exists. (If one of the inputs is already stationary it could be traded individually anyway on a mean reverting basis, though this would of course lose the benefit of market risk hedging implicit in pairs trading.) The preliminary step, therefore, is to test them to ensure that they do not exhibit this property, using something like an augmented Dickey-Fuller test.

Assuming this gives satisfactory results, the first step of Engle-Granger is to regress one time series onto the other and calculate the residuals (the spread). The second step is to perform another stationarity test (typically an augmented Dickey-Fuller test, which Engle and Granger themselves originally used) on the residuals. If the residuals are themselves stationary, then the time series are considered to be cointegrated. You then hope that the estimated coefficients for the spread are sufficiently accurate for profitable trading. Although Engle-Granger is straightforward to understand, it does have a number of significant limitations:

Michael Weidman

• Preliminary stationarity test: as mentioned above, for the test results to have any meaning, none of the inputs can themselves already be stationary, which requires a separate preliminary stationarity test. However, using such a test only increases the likelihood of introducing additional type I and II errors, which can have a cumulative effect.

• Y/X designation: during the first step of the Engle-Granger test, one of the series must be designated the 'response' (or, Y) series and the others designated the 'predictor' (or, X) series. This designation is usually arbitrary, and switching the roles of predictors and responses will at the very least lead to different estimates of the cointegrating parameters and will often additionally lead to contradictory test results. (Though there are techniques available for determining the most appropriate designation of Y and X, this of course involves an additional separate test, plus the potential risk of introducing further type I and II errors.)

• Wrong critical values: an alarmingly common problem is the tendency of many implementations of the Engle-Granger test in software packages to use the wrong set of critical values when determining whether or not cointegration is present, with obvious consequences for accuracy and profitability. The vital point to understand is that the augmented Dickey-Fuller test commonly used in the second stage of Engle-Granger is being applied to estimated residuals (which by definition contain errors) and not to a directly observed time series. As a result, it is incorrect to use the default Dickey-Fuller tables, which are intended for use on time series that can be directly observed. Instead, a different set of tables of critical values intended for estimated time series should be used. But this raises a further question of who is providing these values and how reliable they are (see sidebar 'Whose Numbers Do You Trust?'). Bottom line: be certain your software is using the appropriate critical values and be certain that they come from a credible source.
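The two steps, and the Y/X designation problem, can be sketched in a few lines. This is a minimal illustration on simulated data, not a production test: the second step uses a plain (non-augmented) Dickey-Fuller statistic with no lag terms, the preliminary stationarity check on the inputs is omitted, and the resulting statistic would still need to be judged against critical values appropriate for estimated residuals. All names and the simulated pair are hypothetical.

```python
import random

def ols_fit(x, y):
    """Return (intercept, slope) from regressing y on x with a constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def df_tstat(series):
    """Dickey-Fuller t-statistic (no constant, no lags)."""
    lagged = series[:-1]
    change = [b - a for a, b in zip(series[:-1], series[1:])]
    sll = sum(v * v for v in lagged)
    rho = sum(l * d for l, d in zip(lagged, change)) / sll
    sse = sum((d - rho * l) ** 2 for l, d in zip(lagged, change))
    std_err = (sse / (len(change) - 1) / sll) ** 0.5
    return rho / std_err

def engle_granger(response, predictor):
    """Step 1: regress response on predictor. Step 2: DF statistic on the spread."""
    intercept, slope = ols_fit(predictor, response)
    spread = [r - intercept - slope * p for r, p in zip(response, predictor)]
    return slope, df_tstat(spread)

# Simulated pair: x is a random walk, y = 1 + 2x + mean-reverting noise
rng = random.Random(3)
x, y, level, noise = [], [], 0.0, 0.0
for _ in range(500):
    level += rng.gauss(0.0, 1.0)
    noise = 0.5 * noise + rng.gauss(0.0, 2.0)
    x.append(level)
    y.append(1.0 + 2.0 * level + noise)

slope_y_on_x, stat_y_on_x = engle_granger(y, x)
slope_x_on_y, stat_x_on_y = engle_granger(x, y)

print(f"Y on X: hedge ratio {slope_y_on_x:.4f}, statistic {stat_y_on_x:.2f}")
print(f"X on Y: implied hedge ratio {1.0 / slope_x_on_y:.4f}, "
      f"statistic {stat_x_on_y:.2f}")
```

The two orientations give different hedge ratios because the product of the two regression slopes equals the squared correlation, which is below one unless the fit is perfect. Here the pair is strongly cointegrated by construction, so both orientations agree on the conclusion; with weaker relationships the two statistics can fall on opposite sides of the critical value.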

Stationarity and Integration


A time series is deemed stationary if the statistical distribution of its data points does not depend upon time. In other words, shifting them with a lag - such as shifting $y_1, y_2, \ldots, y_n$ by lag $j$ to become $y_{1+j}, y_{2+j}, \ldots, y_{n+j}$ - does not change the distribution.

A weaker definition often used is that a time series is stationary if there is no systematic change in either its mean or variance. This is a weaker definition because very different process dynamics could still generate a constant mean but a very different distribution.


The order of integration of a time series denotes the number of times that it needs to be differenced (adjacent values subtracted from each other, e.g. $y_t - y_{t-1}$, $y_{t-1} - y_{t-2}$, $y_{t-2} - y_{t-3}$, ...) in order for a stationary series to result. So a time series that is integrated of order 0 is already stationary, while one that is integrated of order 1 (like many financial time series) needs to be differenced once to become stationary.
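Differencing is simple enough to show directly. The sketch below builds a random walk (integrated of order 1) as the running sum of stationary shocks and confirms that differencing it once recovers those shocks. The helper name `difference` is hypothetical and the data are simulated.

```python
import random

def difference(series, order=1):
    """Difference a series `order` times: each pass replaces y_t with y_t - y_(t-1)."""
    for _ in range(order):
        series = [b - a for a, b in zip(series[:-1], series[1:])]
    return series

rng = random.Random(1)
shocks = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # stationary, order 0

walk, level = [], 0.0
for shock in shocks:          # running sum of the shocks, order 1
    level += shock
    walk.append(level)

recovered = difference(walk)  # differencing once undoes the running sum
print(len(walk), len(recovered))
```

Each pass of differencing shortens the series by one observation, which is worth remembering when aligning a differenced series with other data.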


Popular cointegration tests II: Johansen

While Engle-Granger is still a popular cointegration test, and if used appropriately is still worthwhile for activities such as pairs trading, it isn't really ideal where more than two time series are involved. Therefore, for activities such as basket trading there is a general preference for the cointegration test framework developed by Johansen, which in addition to handling multiple time series as a matter of course, has a number of other important advantages:

• Y/X designation: there is no need to designate X 'predictor' or Y 'response' series for a regression as in Engle-Granger (because Johansen treats all of the series symmetrically rather than regressing one on the others) and the results are always consistent regardless of the order in which the series are organised. This makes the Johansen test particularly useful and unfussy when examining more than two time series.

• Integrated stationarity test: unlike Engle-Granger, there is no need to run a separate stationarity test on the individual variables. Johansen accomplishes this by outputting not only the cointegrating relationship but also the probability of there being a trivial relationship between the variables. For example, if the coefficients are close to one or zero, that would suggest that one of the variables was stationary in its own right anyway, and that the other variable was therefore not being used.

• Timescale: if desired, the Johansen framework can also output a reversion timescale indicating the length of time any spread will take to revert to its mean. If using Engle-Granger, this has to be estimated separately, for example by fitting an Ornstein-Uhlenbeck model to the spread.
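For readers using Engle-Granger, one common way to obtain a reversion timescale is sketched below. It assumes the spread behaves like a first-order autoregression (the discrete-time counterpart of an Ornstein-Uhlenbeck process) and reports the half-life, the number of periods for a deviation from the mean to halve on average. The function names and the simulated spread are hypothetical.

```python
import math
import random

def ar1_coefficient(spread):
    """Estimate phi in (s_t - mean) = phi * (s_(t-1) - mean) + noise."""
    mean = sum(spread) / len(spread)
    dev = [s - mean for s in spread]
    numerator = sum(a * b for a, b in zip(dev[:-1], dev[1:]))
    denominator = sum(a * a for a in dev[:-1])
    return numerator / denominator

def half_life(phi):
    """Periods needed for a deviation to halve: solves phi ** h == 0.5."""
    if not 0.0 < phi < 1.0:
        raise ValueError("half-life is only defined for 0 < phi < 1")
    return -math.log(2.0) / math.log(phi)

# Simulated mean-reverting spread with a true coefficient of 0.9
rng = random.Random(5)
spread, value = [], 0.0
for _ in range(5000):
    value = 0.9 * value + rng.gauss(0.0, 1.0)
    spread.append(value)

phi_hat = ar1_coefficient(spread)
estimated_half_life = half_life(phi_hat)
print(f"estimated phi {phi_hat:.3f}, half-life {estimated_half_life:.1f} periods")
```

The estimate is sensitive to small errors in the coefficient when it is close to one, which is exactly the region where most tradable spreads live, so the resulting timescale is best treated as a rough guide.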

However, Johansen does lack the intuition of Engle-Granger and it can be challenging to really understand what the test results actually mean. This can be particularly demanding when figuring out how many independent cointegrating relationships exist when considering more than two time series.

The key with the Johansen framework is to figure out the rank of the cointegrating relation space. In the case of pairs trading, this isn't too tricky as you are only dealing with two time series, which means there are only three possible ranks:

• Rank is zero (no cointegrating space): no cointegrating relationships are possible as no combination of the time series results in a residual that is stationary, so the pair is not cointegrated

• Rank is one: this implies that there are some combinations of the time series that yield stationarity, so the pair is cointegrated (subject to confidence interval and hypothesis test limitations) and a possible candidate for trading

• Rank is two: this means that every single linear combination of the two time series is itself a stationary time series, which implies that both series were already stationary in their own right.

In practice, Johansen doesn't just spit out an 'it's rank zero/one/two/x' answer, but generates probability output that indicates whether the cointegrating relation space is more likely to be rank one than rank zero and more likely to be rank one than rank two (and so on if more than two time series are involved).
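To make the idea of rank more concrete, the sketch below computes the eigenvalues and trace statistics at the heart of the Johansen procedure for the two-series case. It is a stripped-down illustration, not a substitute for a tested library: it assumes no lagged differences and no constant or trend terms, works only for two series, and leaves out the tabulated critical values against which the trace statistics must be compared. The function names and simulated data are hypothetical.

```python
import math
import random

def outer_mean(a, b):
    """Average outer product of two lists of 2-vectors: (1/T) * sum of a_t b_t'."""
    t = len(a)
    return [[sum(u[i] * v[j] for u, v in zip(a, b)) / t for j in range(2)]
            for i in range(2)]

def inv2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def mul2(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose2(m):
    return [[m[0][0], m[1][0]], [m[0][1], m[1][1]]]

def johansen_trace(y, x):
    """Eigenvalues and trace statistics for two series (no lags, no constant)."""
    levels = list(zip(y, x))
    lagged = levels[:-1]
    diffs = [(b[0] - a[0], b[1] - a[1]) for a, b in zip(levels[:-1], levels[1:])]
    t = len(diffs)
    s00 = outer_mean(diffs, diffs)    # moments of the changes
    s11 = outer_mean(lagged, lagged)  # moments of the lagged levels
    s01 = outer_mean(diffs, lagged)   # cross moments
    # Eigenvalues of S11^-1 S10 S00^-1 S01 (squared canonical correlations)
    m = mul2(mul2(inv2(s11), transpose2(s01)), mul2(inv2(s00), s01))
    tr = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    root = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
    lam1, lam2 = (tr + root) / 2.0, (tr - root) / 2.0
    trace_r0 = -t * (math.log(1.0 - lam1) + math.log(1.0 - lam2))  # H0: rank 0
    trace_r1 = -t * math.log(1.0 - lam2)                           # H0: rank <= 1
    return (lam1, lam2), (trace_r0, trace_r1)

# Simulated cointegrated pair: x is a random walk, y = 2x + stationary noise
rng = random.Random(9)
x, y, level = [], [], 0.0
for _ in range(500):
    level += rng.gauss(0.0, 1.0)
    x.append(level)
    y.append(2.0 * level + rng.gauss(0.0, 1.0))

(lam1, lam2), (trace_r0, trace_r1) = johansen_trace(y, x)
print(f"eigenvalues: {lam1:.3f}, {lam2:.4f}")
print(f"trace statistic for rank 0: {trace_r0:.1f}, for rank <= 1: {trace_r1:.2f}")
```

For a pair like this one, built with exactly one cointegrating relationship, the first eigenvalue should be large and the second close to zero. The rank decision then comes from working upwards: reject rank zero if the first trace statistic exceeds its critical value, then check whether the second statistic also does.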

Which purpose, which flavour?

It's also important to bear in mind that both Engle-Granger and Johansen offer an extensive range of options when it comes to specifying the many possible flavours of time series and the nature of their cointegrating relationship. Even more importantly, the majority of these options are completely irrelevant/wrong for most types of trading activity - many of them reflect the origins of cointegration testing in economics, and so are principally relevant to that instead. Therefore using the right option(s) and being clear as to exactly what is being tested are essential.

The test options range from the highly restrictive (such as assuming that one time series is simply a multiple of the other) to the highly unrestricted (such as also allowing constant, linear, and quadratic deterministic functions of time as offsets). In addition there are also the lags that can be specified between time series.

Apart from specifying options that are completely irrelevant to trading, there is also the risk of specifying those that might be marginally relevant but pointless, such as allowing the spread of a pair to follow a deterministic linear function of time. If the spread is decaying over time then so may any potential profitability during the trade lifespan.

This isn't the same as situations where there is an offset between time series. For example, Brent Crude and West Texas Intermediate have an offset, which one could argue is the cost of transport across the Atlantic Ocean. The distinction is that this offset is a constant over time, not a linear trend, so it is effectively cancelled out when a trade is closed. More generally, while it is definitely important to specify the right options when cointegration testing, the bigger picture is what really counts. A pair may be technically cointegrated, but if you can't make money out of trading it, why would you bother?

Conclusion: where the buck stops - testing

This question is also a timely reminder of something that should underpin all statistical arbitrage - rigorous simulation and testing. Cointegration alone guarantees nothing, so the ultimate backstop is how realistically any statistical arbitrage trading strategy is evaluated. The first step is backtesting and optimisation (such as of lookback windows), while ensuring that no data that would not be available in real time is used in evaluation. The second step is out of sample testing that uses the same parameters derived during backtesting and any optimisation, and applies them to completely unseen data. The bottom line is both literally and metaphorically what ultimately matters when testing for cointegration. There's no question that using the right cointegration test in the right way and really understanding the results is vital, but it is still only one cog in a far bigger trading machine.
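The discipline of keeping in-sample and out-of-sample data apart can be sketched as follows. This is a minimal illustration on simulated data, assuming a 70/30 chronological split and a simple hedge-ratio regression; the point is only that every parameter applied to the unseen data (hedge ratio, spread mean, spread standard deviation) is estimated from the in-sample portion alone. All names are hypothetical.

```python
import random

def ols_fit(x, y):
    """Return (intercept, slope) from regressing y on x with a constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def chronological_split(series, in_sample_fraction=0.7):
    """Split in time order; never shuffle, or future data leaks into the past."""
    cut = int(len(series) * in_sample_fraction)
    return series[:cut], series[cut:]

# Simulated pair: x is a random walk, y = 2x + stationary noise
rng = random.Random(11)
x, y, level = [], [], 0.0
for _ in range(1000):
    level += rng.gauss(0.0, 1.0)
    x.append(level)
    y.append(2.0 * level + rng.gauss(0.0, 1.0))

x_in, x_out = chronological_split(x)
y_in, y_out = chronological_split(y)

# Everything below is estimated on in-sample data only
intercept, hedge_ratio = ols_fit(x_in, y_in)
spread_in = [b - intercept - hedge_ratio * a for a, b in zip(x_in, y_in)]
mean_in = sum(spread_in) / len(spread_in)
std_in = (sum((s - mean_in) ** 2 for s in spread_in) / (len(spread_in) - 1)) ** 0.5

# Out-of-sample: apply the frozen parameters to unseen data
spread_out = [b - intercept - hedge_ratio * a for a, b in zip(x_out, y_out)]
zscore_out = [(s - mean_in) / std_in for s in spread_out]

print(f"in-sample hedge ratio {hedge_ratio:.3f}; "
      f"{len(zscore_out)} out-of-sample z-scores computed")
```

If the out-of-sample z-scores drift steadily away from zero, the relationship estimated in-sample has not held up, whatever the cointegration test said.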


1. See and

2. "An equation is said to be a closed-form solution if it solves a given problem in terms of functions and mathematical operations from a given generally accepted set. For example, an infinite sum would generally not be considered closed-form. However, the choice of what to call closed-form and what not is rather arbitrary since a new "closed-form" function could simply be defined in terms of the infinite sum." (