This article is split into three parts. Firstly, we discuss the relative merits of various programming languages for analysing financial markets. This part is especially relevant for readers less familiar with Python or coding in general. There are short explanations of how some of the more common languages operate and what is of particular importance when it comes to performance and usability.
Secondly, we go into detail about the libraries available in Python to analyse data. Our discussion covers some libraries which might be less well-known within the Python data community. We suggest that developers familiar with Python should jump to this part.
Finally, we introduce Cuemacro's open-source financial market libraries written in Python: Chartpy (visualisation), Findatapy (market data) and Finmarketpy (backtesting trading strategies). We conclude by presenting some examples of market analysis written in Python using these libraries.
Part I: Which programming language should you choose?
The most important aspects you need to consider when choosing a programming language are related to time.
One determining factor which you need to pay attention to is execution time, or the time it takes to run your analysis. Another equally important factor is development time, or the time it takes to write the actual code.
The relative importance of execution time versus development time is a key consideration when it comes to choosing an appropriate programming language. When running a high frequency trading (HFT) strategy in production, execution time is likely to be crucial. This contrasts with longer term trading strategies or prototyping, where execution time is less of a consideration.
We expand upon this idea of balancing execution and development time in the following sections, in which we discuss the relative merits of different types of programming languages for financial market applications.
Statically typed languages
In instances where a short execution time is paramount, such as in HFT, you are most likely to want to use a lower level language which compiles to machine code, such as C++. Lower level languages tend to use static typing. Static typing involves specifying the type of data we want to store in variables at compilation, before runtime, which reduces the amount of processing needed at execution.
Historically, C++ has been the language of choice in quantitative finance, in particular for option pricing. However, coding in C++ is time-consuming and requires programmers to have a clear understanding of lower level concepts such as memory allocation and pointers.
One alternative to C++ is Java. Like C++, Java is a statically typed language. Unlike C++, Java does not require users to manage lower-level memory allocation and offers features such as automatic garbage collection. This means they do not need to worry about freeing up memory space once they have finished using a variable. (This, of course, does not totally eliminate the chances of a memory leak in code, which can crash the program).
Unlike C++, Java does not compile directly to machine code and is instead compiled to Java bytecode, which is executed by the Java Virtual Machine (JVM). While the bytecode is more portable than the machine code, it still needs to be translated to machine code at time of execution by the virtual machine - known as just-in-time (JIT) compilation. This introduces a startup time delay to your program.
Historically, JVMs have been slow at executing Java bytecode. In recent years, however, they have become faster. Indeed, NumFOCUS (2017) shows that for basic mathematical operations Java's execution time is now comparable with that of C++. Furthermore, because each platform's JVM performs its own bytecode-to-machine-code JIT compilation, you can execute the same Java bytecode on a number of different platforms without having to recompile the source code. This adds to the convenience of using Java: it is possible in principle to compile your code on a Mac and run it on Linux or Windows, reducing development time when working across multiple operating systems.
Java is not unique in being compiled to bytecode. C#, which bears many similarities to Java in its syntax, and other languages from the .NET Framework are also compiled into intermediate code (similar to Java's bytecode), which is subsequently JIT-compiled into native instructions by the Common Language Runtime (CLR).
Interpreted languages

When the primary goal is to reduce development time, rather than execution time, we can turn to interpreted languages, which are very useful for scripting.
Common interpreted languages used in finance include Python, Matlab and R. They are chosen since they reduce development time when prototyping trading strategies. Execution happens through an interpreter, without the need for pre-compilation into a machine (or byte-) code executable - unlike compiled languages such as C++.
Interpreted languages are generally dynamically typed (as opposed to statically typed). This means that the types of variables are associated with their assigned values at runtime, and not specified by the programmer (or inferred by a compiler). This is one feature that makes scripting languages less verbose, making it quicker to write code. On the flip side, execution can take longer.
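In Python, for instance, the same name can be rebound to values of different types with no declarations at all, since the type travels with the value rather than with the variable. A minimal illustration:

```python
# Dynamic typing: the type lives with the value, not the variable
x = 42          # x currently refers to an int
print(type(x))  # <class 'int'>

x = "hello"     # the same name can now refer to a str
print(type(x))  # <class 'str'>

x = [1.0, 2.5]  # ...or to a list of floats
print(type(x))  # <class 'list'>
```

In a statically typed language such as Java, each of these reassignments would require a separate, explicitly declared variable.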
Whilst Matlab is primarily known for its matrix algebra capabilities, it also has many libraries known as toolboxes, which offer additional functionality ranging from signal processing to computational finance to image analysis. Matlab remains popular partially because so much legacy code in financial firms is written in it. It can also interface well with many other languages with minimal effort, including Python and Java.
In recent years Matlab has faced competition from R and Python. Both R and Python offer similar functionality to Matlab, but have the added benefit of being open-source languages. However, there is an implicit cost from transitioning from Matlab to either Python or R, notably in terms of time spent learning a new language. It also takes time to rewrite legacy Matlab code in Python and R.
R is an open-source version of the statistical package S. Historically, cutting edge statistical techniques have tended to be implemented in R before other languages. This has attracted a large following among the data science community. However, if your application is not purely based around statistics, R might not be the best choice. It is relatively slow compared to most other languages (see Aruoba & Fernández-Villaverde, 2014, and NumFOCUS, 2017) and the syntax is more suited to those with a mathematical rather than a programming background.
Julia is a more recent scripting language, which has been designed to address many of the issues associated with R and Python. (For an introduction to Julia, see this issue's "Julia - A new language for technical computing", page 37.) In particular, when Julia code is first run, it generates native machine code for execution. This contrasts with R and Python code which is executed by an interpreter. Theoretically, native machine code should be quicker than interpreted code. NumFOCUS (2017) gives a set of benchmarks that indicate the language has comparable performance with C for a number of functions such as matrix multiplication and sorting lists.
Functional and query languages
So far we have focused on imperative languages. But what about using other types of languages?
Haskell is a functional language. For programmers used to imperative programming and the idea of mainly using loops, it can be challenging to adopt a functional approach to programming. However, certain mathematical problems can be more naturally expressed in a functional framework.
Lisp is another common functional language and is often used in natural language processing. Indeed, one of the biggest companies in this area, RavenPack, actively uses Lisp. F#, Microsoft's functional language, also has the benefits of being part of the .NET Framework, so it can be called easily by other .NET framework languages such as C#. The JVM also has functional languages, such as Clojure. Scala combines object-oriented development with functional elements and also compiles into Java bytecode.
Q is a query-based language. It is primarily designed to be used with kdb+, a high performance database which can deal with large amounts of columnar style data, such as time series data. kdb+ is often used to store tick data from financial markets.
It might seem odd to consider using a database language for financial analysis. The idea is that we can do a lot of analysis within the database and then output a summary (see Bilokon & Novotny, 2018). This avoids the overhead associated with retrieving the data from a database.
Whilst the 32-bit version of kdb+ is available for free, the 64-bit version is subject to a licence fee. Another downside of Q is that it tends to be relatively complicated to get to grips with (although there is the simpler q-SQL language which, as the name suggests, has a similar syntax to SQL).
Why Python for financial analysis?
So far we have discussed the relative merits of several languages when analysing financial data. As we have noted, the language chosen largely depends on the aims of your analysis. If you want to conduct real-time analysis of tick data, you likely need to choose a high performance language like C++. However, for most other purposes, where short execution time is not the primary consideration, such as when analysing lower frequency data, there are many other choices.
Python can be viewed as a compromise language for market analysis. It has a lot of libraries, just as R and Matlab do. It is easier to learn than lower level languages like C++. An important part of any larger programming project is the ability to reuse code. This is facilitated by object-oriented coding, which tends to be easier in Python than in either R or Matlab.
Whilst Python is certainly not the fastest language for execution, it is quicker than R and by most standards comparable with Matlab (see NumFOCUS, 2017). Parallelising code, or splitting a computation into chunks which can be solved at the same time, can cut execution time. Modern processors usually have multiple cores, so a single processor can run several calculations simultaneously. Notably, one drawback of Python is its global interpreter lock (GIL), which allows only one native thread to execute Python bytecode at any one time. As a result, the GIL can make it more challenging to parallelise code. Later in the article we discuss techniques for reducing the execution time of Python code.
Python and the financial community
A number of large financial organisations use Python and have adopted it in their core processes. On the sell side, JPMorgan's Quartz system uses Python extensively. Quartz is used for pricing trades, managing exposure and computing risk metrics across all asset classes. Athena, a similar system at BAML, also uses Python extensively. Of course, this is not to say that the sell side has suddenly dumped technologies like the .NET Framework and Java, but it is a sign that Python has come of age. Many large quant hedge funds, such as AHL, have also adopted Python.
In recent years, financial firms have begun to open source some of their code. This is likely to be helpful for the adoption of Python within the financial community. AHL has even open sourced its Arctic Python project, which is a high-performance time series data storage wrapper for MongoDB. Another library, Pandas, very popular for data analysis, originally started as a project at the investment management firm AQR.
How can we speed up Python?
Just as with R and Matlab, it is beneficial to vectorise Python code. For example, rather than using a for-loop code structure to multiply matrices, which can be slow, we can use highly optimised matrix multiplication functions instead. Admittedly, in more complicated cases it is not always trivial to vectorise code in this way.
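As a sketch of what vectorisation looks like in practice (assuming NumPy, discussed later in the article, is available), compare an explicit Python-level loop with a vectorised dot product; both compute the same number, but the second delegates the work to optimised compiled routines:

```python
import numpy as np

a = np.arange(10_000, dtype=np.float64)
b = np.arange(10_000, dtype=np.float64)

# Slow: an explicit Python-level loop over every element
total_loop = 0.0
for i in range(len(a)):
    total_loop += a[i] * b[i]

# Fast: the same dot product delegated to optimised C/BLAS routines
total_vec = a @ b
```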
As we discussed earlier, given that Python has the GIL, it can be more challenging to do true parallel computation within a single process. You need to use a work-around, such as the multiprocessing library, which creates separate Python processes in memory. This approach allows you to do computation on multiple cores, although it also makes it more challenging to share memory between the processes. If the bottlenecks in your code are related to input/output (IO), such as downloading web pages, then other approaches could be better. These include the threading library or the asyncio library, which handle IO requests asynchronously without blocking.
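As a minimal sketch of the IO-bound case, using only the standard library's threading module and with the downloads simulated by time.sleep (rather than real network calls), five "downloads" can overlap rather than run back to back, because the GIL is released while each thread waits:

```python
import threading
import time

results = []
lock = threading.Lock()

def fake_download(url):
    # Simulate a blocking IO call; the GIL is released while sleeping
    time.sleep(0.2)
    with lock:
        results.append(url)

urls = ["url-%d" % i for i in range(5)]

start = time.time()
threads = [threading.Thread(target=fake_download, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start

# The five waits overlap, so total time is close to one sleep, not five
```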
Cython presents another way to speed up Python code. Cython is a static compiler for Python which also lets you call C functions and declare C types. Python is dynamically typed, unlike, for example, Java, which is statically typed. Declaring C types in Cython allows the compiler to convert slow Python for-loops into fast C loops.
You can use Cython to wrap C++/C libraries. Cython also makes it possible to release the GIL in order to use multithreading directly in functions without the need to create separate processes. Many libraries in Python extensively use Cython. Admittedly, Cython is not a magic bullet to reduce the execution time of Python code. In some cases, it can be time consuming to rewrite Python code for Cython's compiler. This can be the case when your Python code contains more complicated syntax which cannot be converted easily into low level C code by Cython.
An alternative to Cython is Numba. Numba is a just-in-time (JIT) compiler which uses the LLVM compiler infrastructure to generate machine code from Python at runtime (ahead-of-time compilation is also possible). The code generated by Numba can be compiled to run either on the CPU or on graphics processing units (GPUs). GPUs are typically useful for large scale computations with repeatable operations, such as matrix multiplication.
Part II: The Python data libraries
Python now boasts a lot more data libraries than it did several years ago. This has encouraged quants to use Python. In this section we discuss some of the most popular Python data libraries.
The SciPy stack
The SciPy stack comprises several popular libraries for scientific and technical computing. It includes NumPy, Pandas, IPython, Matplotlib and the SciPy library, which we discuss below in some detail.
The first step of learning Python is developing a basic understanding of the syntax. For those wishing to analyse financial markets, it is important to have an understanding of the SciPy stack. In particular, we would recommend focusing on NumPy and Pandas, given that financial market data often consists of time series data.
NumPy is at the core of the stack and offers a large number of functions to deal with matrix manipulation of 'ndarray' objects, which are n-dimensional arrays. NumPy is written in a mix of Python and C and uses the underlying BLAS and LAPACK libraries to do much of its computation quickly. These types of functions are at the source of much of the computation in financial analysis. NumPy can be viewed as the Python equivalent of Matlab's matrix functionality.
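A brief sketch of the kind of ndarray operations involved (the values here are purely illustrative); elementwise statistics and linear algebra all run in compiled code:

```python
import numpy as np

# A 2x2 ndarray; operations apply elementwise or via linear algebra routines
m = np.array([[1.0, 2.0],
              [3.0, 4.0]])

col_means = m.mean(axis=0)   # mean of each column
product = m @ m              # matrix multiplication
inverse = np.linalg.inv(m)   # matrix inverse, computed via LAPACK
```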
Pandas is a Python data analysis library which deals with time series. It offers functions to perform common manipulations of time series, such as aligning or sorting them. At its core are several data structures: the Series (single time series), the DataFrame (multi-column time series) and the Panel (three-dimensional time series). These data structures can be seen as Python's equivalent of R's data frames. The underlying dates and data within these data structures are stored as NumPy arrays.
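A minimal sketch of how Pandas aligns time series (the ticker names and values are purely illustrative): two Series with partially overlapping dates are combined into a DataFrame, with the dates matched up automatically and gaps filled with NaN:

```python
import pandas as pd

# Two single-column time series with partially overlapping dates
dates_a = pd.date_range("2017-01-02", periods=3)  # Mon to Wed
dates_b = pd.date_range("2017-01-03", periods=3)  # Tue to Thu

spot = pd.Series([1.05, 1.06, 1.07], index=dates_a, name="EURUSD")
vol = pd.Series([10.0, 10.5, 11.0], index=dates_b, name="EURUSD_vol")

# Concatenating into a DataFrame aligns the two series on their date indices
df = pd.concat([spot, vol], axis=1)
```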
IPython is an interactive notebook-based environment for Python code. We can combine Python code, text and results in a single file with IPython. It enables us to create interactive research documents, where the code and results of our output are in a single place. This contrasts with the typical alternative, such as a static PDF file.
One of the reasons for R's popularity is its ggplot2 library, which produces high quality visualisations. Matplotlib is the most popular visualisation library for Python; its pyplot interface was originally designed to be familiar to Matlab users. Matplotlib can generate a multitude of plots, ranging from simple 2D plots to more complicated 3D plots and animations. However, some of its functionality can be challenging to use, which has led to the development of wrappers to simplify its interface. These include the libraries Seaborn and Chartpy.
The SciPy library - not to be confused with the SciPy stack - provides methods for a number of different computations used in financial analysis, including numerical integration, optimisation, interpolation, linear algebra, statistics and image processing.
Machine learning and statistics
As computing power has become cheaper and more datasets have become available, the interest in machine learning has grown significantly. In a nutshell, the idea of machine learning is to make inferences between different variables within a dataset where we do not know the underlying function or a process beforehand. Python has many libraries for machine learning; we describe a few popular ones.
Scikit-learn is perhaps the best known of the machine learning libraries for Python. It can be used for a number of tasks including classification, regression, clustering, dimensionality reduction, model selection and pre-processing. The algorithms range from linear regressions to techniques which can handle non-linear relationships, like support vector machines and k-nearest neighbours.
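As a hedged sketch of the scikit-learn workflow, the example below fits a linear regression; the toy dataset is constructed so that the linear relationship can be recovered exactly, which real data of course never allows:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy dataset: y depends linearly on two features, with no noise for clarity
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5

model = LinearRegression()
model.fit(X, y)                    # estimate coefficients from the data

pred = model.predict([[6.0, 2.0]]) # predict for an unseen observation
```

The same fit/predict pattern applies across scikit-learn's estimators, which is what makes it straightforward to swap a linear model for, say, a support vector machine.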
The deep learning library TensorFlow was released by Google in 2015. TFLearn provides a simplified interface for using TensorFlow, similar to scikit-learn. Other deep learning frameworks include Theano and PyTorch.
PyMC3 is a package for Bayesian statistical modelling and probabilistic machine learning for Python. The underlying matrix computation is done by Theano on either CPU or GPU whilst the higher level functions accessed by users are in pure Python.
QuantEcon is an econometrics library for Python and Julia, which is maintained and used in a number of academic institutions including New York University. Its functionality includes agent-based modelling.
Text and natural language processing
The ever-growing amount of content on the web has resulted in a huge amount of unstructured data. A lot of this data is text. In order to make text data usable for traders, it needs to be cleaned and structured. Furthermore, you might want to create metrics to describe text, such as sentiment scores. These can then be used to trigger trading signals. Python has many features to deal with text data and there are also a number of open-source libraries for natural language processing (NLP) and cleaning text.
The Natural Language Toolkit is the most well-known Python library for NLP and began life in 2001. It has many features to deal with and understand text information. It allows, for example, tagging text, creating parsing trees for sentences and identifying entities in text. It also comes with many existing word corpora.
spaCy is a much newer library for natural language processing. Benchmarks quoted by Explosion AI (2015) show that the library is much faster than other similar frameworks and offers very high precision. It is used by a number of large companies such as Quora.
BeautifulSoup can be used to extract usable text from webpages in Python for offline processing. Text extraction can be a lengthy process; this library can strip away parts of webpages which are not relevant to the meaning, such as HTML tags and menus.
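To illustrate the underlying idea without third-party dependencies, the sketch below uses only the standard library's html.parser; BeautifulSoup provides a much richer interface for the same task (for example, its get_text method and tag navigation). The HTML snippet is purely illustrative:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text content, skipping script and style blocks."""
    def __init__(self):
        super().__init__()
        self._skip = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip and data.strip():
            self._chunks.append(data.strip())

    def get_text(self):
        return " ".join(self._chunks)

page = ("<html><head><script>var x=1;</script></head>"
        "<body><h1>FX news</h1><p>EURUSD rallies.</p></body></html>")

parser = TextExtractor()
parser.feed(page)
text = parser.get_text()  # "FX news EURUSD rallies."
```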
Market data and databases
At the heart of any financial markets analysis is market data. There are many Python libraries that help access and store market data. There are also a number of libraries which simplify the process of storing this data in databases.
MongoDB is a one of the most well-known non-SQL databases. While SQL databases store data in tables in a relational way, NoSQL databases use different data structures for storage. For example, they might store data as documents or in key-value stores. Arctic is AHL's open-sourced Python library which acts as a wrapper for MongoDB when storing time series data. It transparently compresses and decompresses Pandas' data structures locally, reducing the impact on the network.
Python has many other wrappers for accessing external databases: PyMongo for MongoDB; qPython for kdb+; and SQLAlchemy for SQL databases. Redis is an in-memory database which uses a key-value data store, similar to a dictionary-style object in Python. An in-memory data store can be accessed much faster than a disk-based one; the obvious limitation is the RAM available.
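As a minimal illustration of the SQL route, using only the standard library's sqlite3 module (SQLAlchemy provides a higher-level interface over databases like this; the prices below are purely illustrative), we can aggregate inside the database and return just a summary, echoing the in-database analysis idea discussed earlier:

```python
import sqlite3

# In-memory SQLite database: nothing is written to disk
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (date TEXT, ticker TEXT, close REAL)")

rows = [
    ("2017-01-02", "EURUSD", 1.0465),
    ("2017-01-03", "EURUSD", 1.0405),
    ("2017-01-04", "EURUSD", 1.0490),
]
conn.executemany("INSERT INTO prices VALUES (?, ?, ?)", rows)

# Aggregate inside the database, returning only the summary
(avg_close,) = conn.execute(
    "SELECT AVG(close) FROM prices WHERE ticker = ?", ("EURUSD",)
).fetchone()
```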
Many vendors also offer Python APIs for accessing market data. Bloomberg has an open-source Python API called Blpapi, which can be used both with the desktop and server Bloomberg products. Quandl, a popular online data provider, also offers its own Python API.
Visualisation

Once you have completed your market analysis, you probably need to present your results. Typically, this involves creating charts. Aside from Matplotlib, part of the SciPy stack as described above, there are numerous other libraries for generating charts.
VisPy is a more specialised GPU-accelerated library for visualisation in Python. Whilst it is less mature than the other visualisation libraries we have discussed, its big advantage is its ability to plot complicated charts very quickly (for example those with millions of points).
Part III: Cuemacro's open-source financial libraries
Building upon a large number of open-source libraries, over the past few years at Cuemacro we developed our own Python framework for analysing market data. We had originally developed a library called PyThalesians. We later rewrote this and split it into several smaller, more specialised libraries. They were designed to provide a relatively easy-to-use, high-level interface for analysing financial markets and to allow users to focus on developing trading strategies.
For example, Chartpy is a visualisation library. It does not render charts directly, but instead allows users to render charts with a number of Python chart libraries like Matplotlib, Bokeh, Plotly and VisPy, using a consistent and simple interface. This means that users do not have to worry about the low level details of Matplotlib, Plotly and others, which are all very different. To switch between the various plotting libraries, only a single word needs to be changed in the source.
Having given an overview of Python and its data libraries, we now move to some practical code examples.
Loading FX tick data from a retail broker
In Listing 01 we show how we can load market data using the library Findatapy. Our first step is to import various dependencies. We instantiate a Market object, which can be used to fetch market data according to the parameters set in MarketDataRequest. The principle for downloading data from other data providers is the same as from Quandl: All we need to do is to change the data_source parameter. Hence, we do not need to learn the underlying APIs for each data provider, just the simple API provided by Findatapy. We then use the fetch_market method to return a Pandas DataFrame, which is later printed.