Quantitative investment managers such as WorldQuant deploy large arrays of servers to construct their portfolios. For the thousands of data scientists - 'quants' in industry parlance - located around the world using the WorldQuant platform, computing resources are like oxygen: Our quants rely on them to do their work and they can consume just as much compute as they are given access to. The firm's ability to provide quants with the computing resources they need, when they need them, is of paramount importance.
Although its basic resource-sharing concepts hark back to the days of mainframes, cloud computing has arguably been one of the largest disrupters in the world of information technology management in recent years. The premise of cloud providers is that they can make computing resources available in minutes, compared with a more typical multimonth IT provisioning cycle. With WorldQuant's insatiable need for ever-more computing resources, effective leveraging of the cloud is a potent weapon in our armory.
Amazon began offering its Elastic Compute Cloud more than a decade ago and other major players, such as Google and Microsoft, rapidly followed. Although many businesses, particularly start-ups, embraced the cloud relatively quickly, others, especially those in regulated industries such as finance, have been much slower to adopt the technology. Foremost among their concerns are the security and privacy of data and the risk to intellectual property. The multitenant nature of the public cloud, wherein multiple customers share the same physical plant, and the cloud's exposure to the public Internet exacerbate these concerns in the minds of IT professionals, business leaders and investors.
A no-brainer for smaller firms
As cloud technology has matured over the past decade, adoption has steadily grown beyond the start-up world and beyond token deployments for quality assurance and development environments by Fortune 500 companies. Security is still important, but the conversation about the use of cloud technologies for production systems has moved beyond fear, uncertainty and doubt to the practical question of how best to integrate the cloud's capabilities. A cloud-only option is now a no-brainer for smaller and start-up buy-side firms. The investment required to build internal IT infrastructure and recruit talented staff is significant and in many cases unnecessary if the organisation has the luxury of starting with a blank slate. The situation for larger and more established firms is far more complex, especially for those that have already made significant investments in on-premise data centres.
The cloud is enormously enticing and the prospect of never having to consider building and operating a data centre or attempting to hire competent systems administration staff is all too alluring for many CTOs and CIOs. Wouldn't it be nice to simply move everything into the cloud and just be done with it?
Once the euphoria of not having to think about infrastructure starts to settle down and real-world business practicalities set in for IT executives with a significant on-premise install base, they find themselves examining a number of important areas to validate the feasibility of adding the cloud to what is likely already a very complex estate. The need to consider the effect on economics, security, operations, staffing, vendor lock-in and applications is critical. Decisions in each of these areas have serious implications for the business case around the use of cloud, the speed of adoption and the investment required.
Security has always been among the biggest concerns for organisations considering cloud deployments. However, as cloud offerings have been maturing and more and more enterprises have had positive experiences with deployments, a less hysterical way of looking at risks has emerged. The most immediate example is the realisation for enterprises like WorldQuant that major cloud providers are able to make far more significant investments in information security resources and safeguards than a medium-size enterprise could likely ever do. Cloud providers such as Amazon and Google are better positioned and have access to more comprehensive information security intelligence to manage infrastructure amid a backdrop of hostile actors. Assurance programs, bug bounties, coordinated disclosure of vulnerabilities with software vendors and in-house resources to quickly address zero-day exploits all serve to bolster the security of their infrastructure platforms.
While many IT executives are still comforted by the fact that the security of their networks is protected by state-of-the-art firewalls, there is a growing realisation that relying solely on a secure perimeter is no longer appropriate in the age of mobile and cloud technologies. The traditional approach is akin to an M&M - crunchy on the outside, soft in the middle. Once an attacker pierces the shell, he has access to all resources on the internal network. In a cloud environment a firm no longer has control over all the layers of the technology stack that make up that shell. A misconfiguration or flaw in the cloud provider's security controls, a zero-day exploit in the cloud's hypervisor or an as-yet-undetected vulnerability, such as a cache side-channel attack, all negate the legacy approach to network security.
In 2003 the Jericho Forum came together to discuss the challenge of 'de-perimeterisation' - the erosion of the network perimeter. The group later developed a series of commandments that seek to address the security concerns of a de-perimeterised future. Similarly, Forrester Research has put forward its Zero Trust Model (see Kindervag et al., 2016) of information security, in which the consulting firm eliminates entirely the notion of trusted and untrusted networks. Simply put, in Zero Trust all network traffic is untrusted. A 2016 report by the US House of Representatives' Committee on Oversight and Government Reform detailed events leading up to the hacking of the US Office of Personnel Management, in which the records of 4.2 million federal employees were compromised. The report recommends that "agencies should move toward a 'zero trust' model of information security and IT architecture". In response to the Operation Aurora cyberattacks of 2009, which targeted Google itself along with Adobe, Morgan Stanley and others, Google developed its own Zero Trust security framework, called BeyondCorp.
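To make the Zero Trust idea concrete, here is a minimal sketch in Python. The function names, the shared key and the request format are hypothetical and are not drawn from Forrester's model or from BeyondCorp; the only point is that the decision rests on a verifiable credential and ignores the network a request arrived from.

```python
import hashlib
import hmac

# Hypothetical shared secret for illustration; a real system would use
# per-service credentials issued and rotated by an identity provider.
SECRET_KEY = b"example-key-for-illustration-only"


def sign(payload: bytes, key: bytes = SECRET_KEY) -> str:
    """Return a hex HMAC-SHA256 signature for the payload."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def is_authorised(payload: bytes, signature: str, source_ip: str) -> bool:
    """Decide on the credential alone; source_ip is deliberately ignored."""
    expected = sign(payload)
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, signature)
```

A request from an internal address with a bad signature is rejected, and a correctly signed request is accepted regardless of where it originated.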
Cloud governance is critical
At WorldQuant we are building applications with security as a first-class consideration. Security is built into the software architecture from the start, rather than added as an afterthought, and development and deployment are integrated and fully automated. The downside of increased automation is that a trivial error in deployment or configuration software can have far-reaching security and availability implications. Thus, solid cloud governance is key, with the ability to introspect the infrastructure on a regular, automated basis and immediately resolve unexpected configuration changes. In other words, there is a need for automation to watch the automation.
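One way to picture "automation watching the automation" is a periodic comparison of the declared configuration against what is actually deployed. The sketch below is an assumption-laden illustration, not a description of WorldQuant's tooling: the setting names are invented and the observed state is a plain dictionary rather than the result of a real provider query.

```python
def find_drift(desired: dict, observed: dict) -> dict:
    """Return {setting: (desired_value, observed_value)} for every mismatch.

    A setting missing on either side is reported with None in its place.
    """
    drift = {}
    for key in desired.keys() | observed.keys():
        want, have = desired.get(key), observed.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical settings for illustration only.
desired = {"ssh_open_to_internet": False, "disk_encrypted": True}
observed = {"ssh_open_to_internet": True, "disk_encrypted": True}
```

Here `find_drift(desired, observed)` reports only the `ssh_open_to_internet` setting, which a governance job could then revert or escalate.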
Cloud economics seem simple but can quickly get quite complex, with a broad set of instance choices and prepayment options, billing time slices and regional pricing differences. The promise is reduced cost and time to market, but can every organisation reap those benefits? We consider the impact of server utilisation, measured in duty cycles, on the cost landscape. By way of example, High Performance Computing (HPC) workloads can be inelastic, consuming large numbers of machines for hours, days and weeks at a time, with little to no unused capacity. In such cases cloud options may be considerably more expensive. In the model presented in Figure 01, using data from Dell and Amazon Web Services (AWS) online calculators, it is easy to see that as the percentage of hardware utilisation goes up, the costs of cloud deployments can begin to exceed the costs of on-premise deployments. This inflection point can occur as early as 25% of server utilisation. The graph also highlights the economic advantages of application architectures that are able to utilise special instance types, such as AWS Spot or Google Cloud Platform (GCP) Preemptible, as the cost equation shifts dramatically. The specific equilibrium points will vary by organisation and will be affected by internal drivers, such as the relative value of instantly available compute resources. Financing and cash flow considerations are also important factors, as well as each organisation's operational efficiency and its maturity and technical acumen for on-premise installations. At small scale, time-to-market may trump all of the cons, but at the high degree of WorldQuant's consumption of computing resources, the economics really matter.
Figure 01: Three-year Total Cost of Ownership (TCO) for bare metal and various public cloud configurations by percent utilisation (duty cycle). Data and assumptions drawn from dell.com and the AWS TCO calculator; pricing as of April 2017. Results will vary by organisation based on multiple factors, including operational efficiency and maturity of on-premise implementations.
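The shape of this trade-off can be sketched with a deliberately simplified model: on-premise cost is treated as fixed over three years regardless of utilisation, while cloud cost scales with the hours actually used. The prices below are placeholders chosen for illustration and are not the Dell or AWS figures behind Figure 01.

```python
HOURS_IN_THREE_YEARS = 3 * 365 * 24  # 26,280 hours


def cloud_cost(hourly_rate: float, utilisation: float) -> float:
    """Three-year cloud cost when paying only for the hours actually used."""
    return hourly_rate * HOURS_IN_THREE_YEARS * utilisation


def break_even_utilisation(on_prem_cost: float, hourly_rate: float) -> float:
    """Utilisation at which cloud cost equals the fixed on-premise cost."""
    return on_prem_cost / (hourly_rate * HOURS_IN_THREE_YEARS)


# Placeholder inputs (assumptions, not real quotes).
ON_PREM_COST = 6570.0   # hypothetical all-in three-year cost of one server
ON_DEMAND_RATE = 1.00   # hypothetical on-demand price per instance-hour
SPOT_RATE = 0.30        # hypothetical spot-style price per instance-hour
```

With these placeholder inputs, `break_even_utilisation(ON_PREM_COST, ON_DEMAND_RATE)` gives 0.25, while the cheaper spot-style rate pushes the break-even point to roughly 0.83. Real figures would also need to account for prepayment discounts, staffing and financing.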
Once a decision has been made to incorporate cloud into the overall IT strategy, the next step is to determine applications that can effectively run in the cloud. Considerations around data access, data placement, latency and the performance of each cloud component are critical to the success of the endeavor.
Over time, two infrastructure patterns for established firms to leverage the cloud have emerged. One common approach is hybrid IT, whereby new applications are developed as cloud-native while legacy applications continue to run on traditional enterprise infrastructure. Another approach is simply to try to force the cloud to behave like a traditional data centre - often with fairly poor results. Neither strategy is particularly attractive because both typically result in a protracted migration cycle, during which time the firm is required to contend with the limitations of both cloud and legacy infrastructures.
Building flexible platforms
Cloud migrations often are further complicated by the reality that most firms have siloed application development and systems administration teams. By contrast, modern cloud-native applications rely on tightly coupled, fully integrated teams to be able not only to take full advantage of cloud capabilities, but also - perhaps even more important - to deal with the inherent limitations that cloud platforms impose. Technology multidisciplinarians are required who are able to think of applications as end-to-end systems that encompass both software and infrastructure management components.
For organisations that already have significant on-premise capabilities, particularly those with HPC-style deployments, there is an opportunity to create a fully integrated infrastructure that combines on-premise and cloud, minimises overall costs and maximises infrastructure flexibility and availability. This calls for the creation of a platform that allows for workload orchestration and seamless placement of applications based on their specific resource demands and current availability. By leveraging tools such as Consul, Kubernetes and Mesos, organisations can build flexible platforms efficiently and without significant software development investments.
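As a rough illustration of placement based on resource demand and current availability, the sketch below picks the cheapest pool that has room for a job. It is not how Consul, Kubernetes or Mesos schedule work; the `Pool` and `Job` types, the pool names and the prices are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pool:
    name: str                 # e.g. "on-prem" or "cloud"
    free_cpus: int            # capacity currently available
    cost_per_cpu_hour: float  # hypothetical marginal cost


@dataclass
class Job:
    name: str
    cpus: int


def place(job: Job, pools: list[Pool]) -> Optional[str]:
    """Pick the cheapest pool with enough free CPUs and reserve the capacity."""
    candidates = [p for p in pools if p.free_cpus >= job.cpus]
    if not candidates:
        return None  # nothing fits; a real scheduler would queue or scale out
    chosen = min(candidates, key=lambda p: p.cost_per_cpu_hour)
    chosen.free_cpus -= job.cpus
    return chosen.name
```

Given an on-premise pool with 64 free CPUs and a larger, pricier cloud pool, a 48-CPU job lands on-premise and a following 32-CPU job overflows to the cloud because only 16 on-premise CPUs remain.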
There are also a number of application-architecture issues to consider. Should applications make cloud's native and often easy-to-use features part of their design? Although leveraging services offered by cloud providers to reduce development time lends significant value, for many CTOs doing so conjures up memories of being locked into a vendor. Years down the road they are at the mercy of that same vendor's whims to change pricing, features or even the availability of a product. There is no doubt that this is a concern - vendors are able to bring this about simply by making firms code against their proprietary application programming interfaces (APIs). A platform-independent approach therefore may be attractive.
Unique challenges of the cloud
Deploying a third-party abstraction layer does not solve the problem. It merely creates its own form of lock-in, but in this instance to a software provider that possibly poses greater counterparty risk than the cloud providers themselves. Another option - and one that we are investing in at WorldQuant - is to develop a cloud-provider abstraction layer, leveraging open source tools as much as possible. The advantages of being in control of your own destiny seem clear. However, doing so is not without risk: An organisation must have access to the necessary skills and talent to pull off something like this, as well as the appetite to commit to maintaining such a layer on an ongoing basis. For WorldQuant we aim to mitigate these downsides by leveraging open source software and tools as much as we can, making enhancements as necessary.
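In outline, a provider abstraction layer means application code targets a neutral interface while thin adapters translate to each environment. The sketch below is a hypothetical illustration of that pattern, not WorldQuant's actual layer; the class names are invented and the adapters return made-up identifiers instead of calling any real SDK.

```python
from abc import ABC, abstractmethod


class ComputeProvider(ABC):
    """Hypothetical provider-neutral interface that application code targets."""

    @abstractmethod
    def launch(self, image: str, count: int) -> list[str]:
        """Start `count` instances of `image` and return their identifiers."""


class FakeCloudA(ComputeProvider):
    """Stand-in adapter; a real one would wrap a cloud provider's SDK."""

    def launch(self, image: str, count: int) -> list[str]:
        return [f"a-{image}-{i}" for i in range(count)]


class FakeOnPrem(ComputeProvider):
    """Stand-in adapter for an on-premise cluster."""

    def launch(self, image: str, count: int) -> list[str]:
        return [f"rack-{image}-{i}" for i in range(count)]


def run_research_job(provider: ComputeProvider) -> list[str]:
    """Application code depends only on the interface, never on a vendor API."""
    return provider.launch("quant-worker", 2)
```

Swapping `FakeCloudA()` for `FakeOnPrem()` changes where the work runs without touching `run_research_job`. The cost of this design is that every adapter has to be maintained as the underlying services evolve.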
Operating a large application estate in the cloud comes with unique challenges not present in traditional data centre deployments. Capacity and scale limitations may not be well understood or even known. Upgrades and feature changes in cloud services may be harder to stay on top of than if an organisation manages the infrastructure itself and is responsible for all upgrades and reconfigurations. No longer do application teams have the luxury of never upgrading some key component or library simply because it is inconvenient or because they lack the time or have other priorities.
To avoid surprises, it is important to instrument applications and collect performance metrics, so that a firm knows when it is running up against cloud instance, storage and networking limits, which may be explicit but in some cases are entirely opaque. At WorldQuant we aim to collect metrics from every application and its components, as well as from the underlying infrastructure. We use machine learning techniques to mine this data to identify trouble spots and to understand how various concurrent incidents relate to one another, so they can be managed effectively and without the need to chase multiple issues created by the same root cause. This approach improves the operational efficiency of the staff, decreases the overall downtime of infrastructure and application components and allows us to develop tools that predict outages based simply on changes in the telemetry of our systems.
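The idea of relating concurrent incidents to one another can be illustrated with something far simpler than machine learning: grouping alerts that fire close together in time, on the assumption that they may share a root cause. The time-window heuristic, the alert names and the 60-second default below are illustrative stand-ins, not the techniques used at WorldQuant.

```python
def group_alerts(alerts: list[tuple[float, str]], window: float = 60.0) -> list[list[str]]:
    """Group alerts that fire within `window` seconds of the previous alert.

    Each alert is a (timestamp_in_seconds, name) pair. A gap longer than
    `window` starts a new group, treated here as a separate incident.
    """
    groups: list[list[str]] = []
    last_time = None
    for timestamp, name in sorted(alerts):
        if last_time is None or timestamp - last_time > window:
            groups.append([])
        groups[-1].append(name)
        last_time = timestamp
    return groups
```

For example, alerts for disk latency, database timeouts and API errors arriving within a minute of one another collapse into a single group, so operators investigate one incident rather than three.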
Changing role for systems administrators
Contrary to popular belief, the need for systems administrators does not entirely go away. Rather, the roles of those individuals are changing with the adoption of cloudlike approaches to both cloud and on-premise infrastructure. Custom-configured systems and bespoke scripts used to manage the environment are giving way to infrastructure-as-code approaches. The emerging role of a systems administrator is one of total automation and enabling of the DevOps and NoOps paradigms. These individuals must still possess the skills necessary to debug operating system and network issues and understand how applications interact with operating systems and underlying infrastructure. Furthermore, they are responsible for defining infrastructure deployment and software upgrade mechanisms. This paradigm shift, at least for now, presents challenges for many organisations whose existing systems administration staff need to be retrained with respect to both their skills and their attitude to systems management. The potential for organisational disruption is significant and needs to be managed thoughtfully and delicately. After all, most organisations cannot go from a traditional mode of operation to the new, cloud-style operating environment overnight, so a need for both skill sets will continue to exist during a transition.
For a quantitative financial firm like WorldQuant, deploying large swaths of computing resources has been the name of the game for a number of years. High Performance Computing is the mainstay of quant shops and an array of applications has been developed to take advantage of typical HPC architectures. While HPC technologies have evolved over the years, becoming less specialised and less reliant upon esoteric hardware, cloud capabilities themselves have become more sophisticated. At WorldQuant we are leveraging this convergence of capabilities to optimise past and future investments into on-premise and cloud infrastructure.
We construct application architectures that utilise both HPC and cloud technologies almost interchangeably. Such a blending of the two architectural approaches requires applications to be able to take advantage of modern data management techniques, such as distributed data stores that allow seamless transitions among traditional enterprise, HPC and cloud infrastructures. Rather than simply treating the cloud as a legacy data centre, we are making our on-premise infrastructure more cloudlike. This means that all servers become anonymous and interchangeable, and the expectation at all levels of the organisation is that components can and will fail.
Developers must build applications that are able to deal with such failures of the underlying infrastructure, rather than assuming that complex high-availability constructs and similar techniques will always keep it available. We endeavor to treat components in our infrastructure as cattle rather than as irreplaceable pets. Each application must be able to deal in a seamless fashion with component failure, self-healing as components automatically get redeployed to other pieces of the infrastructure, whether in the data centre or in the cloud. This also creates a requirement that testing of each application must provide coverage for such inevitable component failure - this may, for example, include approaches such as using Chaos Monkey or similar tools.
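A small in-process analogy shows what it means for an application, rather than the infrastructure, to absorb component failure. The sketch below is not how Chaos Monkey works, since that tool terminates real instances; here the replicas are hypothetical Python callables that fail at a configurable rate, and the caller simply moves on to the next one.

```python
import random


def make_replica(name: str, failure_rate: float, rng: random.Random):
    """Return a callable that fails randomly, mimicking injected faults."""
    def replica(request: str) -> str:
        if rng.random() < failure_rate:
            raise ConnectionError(f"{name} unavailable")
        return f"{name} handled {request}"
    return replica


def call_with_failover(replicas: list, request: str) -> str:
    """Try each replica in turn; fail only if every replica fails."""
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError:
            continue  # treat the replica as lost and move on
    raise RuntimeError("all replicas failed")
```

A test can set one replica's failure rate to 1.0 and confirm that requests are still served by another, which is the kind of coverage for component failure described above.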
The converged approach described herein enables the private data centre and the public cloud to appear virtually the same from an application developer's perspective. From the top-down view of IT and business management, it allows the firm to leverage the best tool for the job: public cloud when time-to-market is imperative or application architecture allows for effective use of spot-style instances, and on-premise deployment when server utilisation is consistently high or special constraints exist that make on-premise deployment preferred. Security concerns are mitigated by adopting a Zero Trust mindset in which all network traffic is considered untrusted, no matter where it originated. Cloud provider lock-in is acknowledged and addressed through internal software development coupled with mature open source solutions. The inevitable failure of infrastructure components is embraced and handled within each application and not by the infrastructure itself or the heroics of its operators. In combination, these approaches to building and deploying infrastructure should serve to supply our quants with the oxygen they need to fuel the future of trading.
Kindervag, J. et al. (2016). No more chewy centers: the zero trust model of information security.