We will be hosting a talk about our work on, “A Platform Agnostic Observability System for AI Accelerators” during our virtual Systems @Scale event at 10:20 a.m. PT on Wednesday, June 30, followed by a live Q&A session. Please submit any questions to [email protected] before the event. Accelerators are special-purpose hardware devices optimized for specific […]
We will be hosting a talk about our work on, “A Platform Agnostic Observability System for AI Accelerators” during our virtual Systems @Scale event at 10:20 a.m. PT on Wednesday, June 30, followed by a live Q&A session. Please submit any questions to [email protected] before the event.
Accelerators are special-purpose hardware devices optimized for specific applications, like AI prediction and video encoding. And Application-specific hardware platforms play an important role in meeting the growing latency and compute demands of workloads like deep learning, content understanding, and video encoding.
At Facebook, the inevitable rise in use of accelerators in our data centers has led to better performance and energy efficiency. However, it is challenging to operate these heterogeneous platforms efficiently at scale. To ensure that these complex accelerators operate smoothly, we need an excellent observability system with monitoring and tracing capabilities so we can understand the performance and interactions between CPUs and accelerators.
To meet these challenges, we’ve introduced three new tools:
Facebook’s cloud infrastructure handles about 150 trillion AI predictions per day for tasks ranging from feed recommendations to combating harmful content. Running these AI models comes with heavy infrastructure demands. And as these models improve, so do their computational requirements.
The graph below of AI model adoption at Facebook illustrates this unmistakable pattern.
Good old general-purpose processors (CPUs) offer versatility and have grown exponentially faster over the decades. However, CPUs fail to meet the rising computational demands of AI applications today. They also tend to exhibit inefficiency in terms of energy used per AI prediction. As investigated by the OpenAI community, we’ve seen two distinct eras of compute in AI models. In recent times, model complexity and compute requirements for AI have grown by roughly a factor of 10 each year. This far outpaces improvements in CPU performance.
How do we remedy this? By designing hardware that is customized to accelerate AI operations via application-specific integrated circuits (ASICs).
Since 2019, Facebook has invested heavily in deploying accelerator-based servers to provide higher performance and energy efficiency. Today, our first-generation systems are 10-30x more performant on our largest AI models. They also delivered a 3-10x performance-per-watt improvement over a CPU.
We also invested in specialized hardware for video encoding and decoding. This enables Facebook to process the nearly 250 million videos uploaded to our app each day. These videos are viewable on any device and with varying internet bandwidth. Our first-generation video accelerators delivered a 10x performance-per-watt improvement in processing 4K videos.
The figure below illustrates the design of our AI inference server. As you can see, it consists of two Twin Lake CPUs and multiple accelerators (M.2 modules) connected to them using a PCIE switch.
In your typical cloud server, the CPU represents the most complex component. We focus a lot on building software to efficiently operate the CPU and monitor its performance and availability. However, with an accelerator system, we can imagine the CPU now has a complicated and brawnier sibling! The accelerator, or ASIC, represents a complex hardware and software system in its own right.
To deliver an excellent user experience, the cloud infrastructure needs to keep hundreds of thousands of accelerators running reliably and efficiently. This is where observability systems come to our rescue. Observability allows us to understand what happens in the accelerator hardware and software when any issue arises. It is useful in multiple ways:
Specialization is both a boon and bane for accelerators. As a result, we end up running multiple types of accelerators in our data centers at any given point.
In 2020 we started deploying the first generation of these accelerators. In the near future, we will be developing two to three new accelerators for the second generation. Each accelerator will have unique driver interfaces, making the task of operating them harder. But duplicating the observability software for each accelerator would not be feasible in the timeline we have set out. The observability framework must be easy to prototype and adapt to multiple types of accelerators in a short time. It also needs to be efficient to avoid interfering with the original application.
Our first challenge involved finding a way to effectively monitor different types of accelerators without duplicating code (and developer time). As you may have guessed, we can leverage abstraction to achieve this.
For example, consider an abstract metric: device_utilization — the measure of how busy an accelerator is — which becomes useful for balancing load across accelerators. To compute this metric, we may need to understand the internal architecture of the accelerator. With an abstract counter, however, engineers working on load balancing can more easily use the metric without being aware of finer details.
device_utilization = max(compute_core_active_i) / total_time
With the above in mind, we designed Asicmon with these design objectives:
The diagram below illustrates the overall software stack for monitoring accelerators. Asicmon acts as a bridge between individual accelerator drivers and the rest of the internal monitoring software. The left top illustrates automated health check tools that spot bad health signals and automatically fix faulty ASICs. On the right, a telemetry daemon periodically publishes performance metrics for engineers to inspect the accelerators. Furthermore, automated load balancing and auto-scaling systems like Shard Manager utilize these counters.
Under the hood, Asicmon creates an instance of a monitoring module per accelerator device. It maintains a cache of statistics that it updates periodically by probing the accelerator driver and computing-derived metrics. Queries to Asicmon’s standard interface for counters get implemented as a lookup into this cache. This shields the system against accidental overload of counter requests.
All great so far! We used abstraction to address the scalability aspect of observability software layers above Asicmon. However, the problem of building the glue code between the accelerator driver and these standard metrics still eluded us. This has to be done separately for each of the accelerators that have aggressive and overlapping timelines. So, we needed a method to develop on Asicmon that was quick to iterate and easy to ramp up on, while also being efficient. That’s where Asimov comes in.
Asimov is an expressive Python-like custom language to instrument the accelerator driver. It essentially allows developers to focus on how to probe the accelerator interfaces and express derived metrics using them. The Asimov compiler generates an efficient C++ implementation of the monitoring module. It also handles details like caching the metrics, periodically reading them, and providing thread safety.
The code snippets below show examples of Asimov being used to read system metrics using interfaces ranging from Linux sysfs files (a) to custom library C functions (b).
Asimov incorporates the same standard interface as Asicmon in its internal representation (the stats data structure, left hand side in the code). We can also invoke C-library functions provided by the device driver and express equations/conditions for derived metrics like any regular language.
Asimov is built with the ANTLR compiler framework under the hood to provide the lexer/parser logic for the language. We then emit C++ code using templates that manage all the essential parts, like initialization, thread safety, etc., so someone using Asimov doesn’t need to worry about it.
Let’s look at a few illustrative examples of how Asimov and Asicmon are beneficial for operating accelerators at scale.
For AI inference applications, we use a system called Shard Manager to automatically scale the inference service instances. A shard is essentially a copy of the AI model that can serve inferences. Asicmon measures the load on the device using an abstract metric — accelerator device utilization. This helps Shard Manager effectively balance the load among servers and automatically scale up or down the number of shards. The diagram below explains how the number of shards gets scaled automatically during model update rollouts and increases in traffic.
The figure below illustrates the advantages of building observability early on in a project’s development cycle. In our test deployment for video accelerators, we detected a memory leak using an Asicmon counter for available device memory. It took multiple fixes to the driver to finally resolve the issue, well in time before its debut in production.
Finally, let’s take a look at the ease of prototyping with Asimov. While we certainly took longer to build the first version of Asimov alongside the first video accelerator, supporting the second one (the AI inference accelerator) went incredibly fast. Bootstrapping basic metrics for the AI inference accelerator took less than a week. Since implementing Asicmon we’ve been able to increase our AI accelerator metrics support from ~30 percent to ~75 percent
Now that we can monitor the performance of accelerators in our data centers, the next step involves addressing why performance metrics like the latency and throughput change over time. The tried-and-tested method for CPUs involves leveraging a stack-based profiler to sample the running function call stack at periodic intervals. However, for inference accelerators, tracing is the best form of profiling. Why? Because accelerators use special hardware units and thus do not have an equivalent notion of a function stack on a core.
As shown in the figure below, a trace essentially consists of a time series of events occurring on different parts in a system. Events in a trace can represent, among many things, functions, execution of AI operators, or data transfers. Traces offer deeper insights into the operation of the system, including understanding the latency and scheduling of operators and how the CPU and accelerator interact with each other.
While AI inference accelerator vendors do provide tools and APIs to collect traces from the device. These tools are designed to work on a single server and are often hard to use. In order to profile production systems better, we set out building a layer on top of this native capability. This better scales out the collection, processing, and analysis of traces themselves.
We kept two target use cases in mind while developing Atrace:
To develop a scalable and ubiquitous tracing solution, we built a set of components that remotely trigger and collect traces. We save each trace to a shared storage and post process and summarize it. The diagram below outlines this, starting on the left with the trace being triggered, to the trace collection and post processing on the right.
Traces themselves can be enormous and overwhelming to dive into directly. However, we can learn a great deal about an AI program by summarizing the trace at a high level. To achieve this, we built a summary of trace statistics grouped by various AI operator types, as shown below.
This operator breakdown shows our engineers which operators consume the most execution time and merit optimization. It also allows for comparisons and debugging of performance regressions between two software versions.
For advanced users, who might want to delve deeper into the traces, we added visualization support for both the open source Chrome trace viewer and an internal trace visualization tool from Facebook. It works all from a single click. We can also run automated analysis on the trace to infer the critical path of operators. This uses the dependency graph of the AI model and trace statistics.
This analysis lets us optimize the latency of the AI prediction. It can also highlight issues like an imbalance in operators. Doing so closed a 10 percent latency gap between the Caffe2 and PyTorch versions of one of our AI models.
Lastly, it is also noteworthy that several software layers exist to handle the processing of an inference request. These include the application layer, PyTorch framework, and Glow, an open source graph lowering compiler for accelerators.
For more complex models involving video understanding or natural language processing, we learned that the model may be run partially on a CPU and partially on an accelerator. Thus, tracing the operations across multiple layers on the CPU and correlating them with the accelerator becomes a necessity.
We developed a prototype of trace correlation into Glow and PyTorch. This allowed us to connect operations on the CPU in the Glow runtime, to the accelerator. Trace correlation is important for examining the complex software stack used for AI inference.
In addition to continuing to support next-generation AI and video accelerators using Asimov and the Asicmon we are also exploring:
We would like to thank our many collaborators at Facebook, including Jerry Liu, Thiara Ortiz, Jeremy Yang, Ashwin Poojary, Deng Pan, Craig Ross, Ashwin Narasimha, Gisle Dankel, Michael Anderson, Allan Di Wu, Yinghai Lu, Satish Nadathur, Garret Catron, and Jack Montgomery for supporting us in creating this framework.
The post Asicmon: A platform agnostic observability system for AI accelerators appeared first on Facebook Engineering.