Peripheral component interconnect express (PCIe) hardware continues to push the boundaries of computing thanks to advances in transfer speeds, the number of available lanes for simultaneous data delivery, and a comparatively small footprint on motherboards. Today, PCIe connectivity-based hardware delivers faster data transfers and is one of the de facto methods to connect components to […]
Peripheral component interconnect express (PCIe) hardware continues to push the boundaries of computing thanks to advances in transfer speeds, the number of available lanes for simultaneous data delivery, and a comparatively small footprint on motherboards. Today, PCIe connectivity-based hardware delivers faster data transfers and is one of the de facto methods to connect components to servers.
Our data centers contain millions of PCIe-based hardware components — including ASIC-based accelerators for video and inference, GPUs, NICs, and SSDs — connected either directly into a PCI slot on a server’s motherboard or through a PCIe switch like a carrier card.
As with any hardware, PCIe-based components are susceptible to different types of hardware-, firmware-, or software-related failures and performance degradation. The variety of components and vendors, array of failures, and the challenges of scale make monitoring, collecting data, and performing fault isolation for PCIe-based components challenging.
We’ve developed a solution to detect, diagnose, remediate, and repair these issues. Since we’ve implemented it, this methodology has helped make our hardware fleet more reliable, resilient, and performant. And we believe the wider industry can benefit from the same information, strategies, and help build industry standards around this common problem.
First, let’s outline the tools we use:
The sheer variety of PCIe hardware components (ASICs, NICs, SSDs, etc.) makes studying PCIe issues a daunting task. These components can have different vendors, firmware versions, and different applications running on them. On top of this, the applications themselves might have different compute and storage needs, usage profiles, and tolerances.
By leveraging the tools listed above, we’ve been conducting studies to ameliorate these challenges and ascertain the root cause of PCIe hardware failures and performance degradation.
Some of the issues were obvious. PCIe fatal uncorrected errors, for example, are definitely bad, even if there is only one instance on a particular server. MachineChecker can detect this and mark the faulty hardware (ultimately leading to it being replaced).
Depending on the error conditions, uncorrectable errors are further classified into nonfatal errors and fatal errors. Nonfatal errors are ones that cause a particular transaction to be unreliable, but the PCIe link itself is fully functional. Fatal errors, on the other hand, cause the link to be unreliable. Based on our experience, we’ve found that for any uncorrected PCIe error, swapping the hardware component (and sometimes the motherboard) is the most effective action.
Other issues can seem innocuous at first. PCIe-corrected errors, for example, are correctable by definition and are mostly corrected well in practice. Correctable errors are supposed to pose no impact on the functionality of the interface. However, the rate at which correctable errors occur matters. And if the rate is beyond a particular threshold, it leads to a degradation in performance that is not acceptable for certain applications.
We conducted an in-depth study to correlate the performance degradation and system stalls to PCIe-corrected error rates. Determining the threshold is another challenge, since different platforms and different applications have different profiles and needs. We rolled out the PCIe Error Logging Service, observed the failures in the Scuba tables, and correlated events, system stalls, and PCIe faults to determine the thresholds for each platform. We’ve found that swapping hardware is the most effective solution when PCIe-corrected error rates cross a particular threshold.
PCIe defines two error-reporting paradigms: The baseline capability and the AER capability. The baseline capability is required of all PCIe components and provides a minimum defined set of error reporting requirements. The AER capability is implemented with a PCIe AER extended capability structure and provides more robust error reporting. The PCIe AER driver provides the infrastructure to support PCIe AER capability and we leveraged PCIcrawler to take advantage of this.
We recommend that every vendor adopt the PCIe AER functionality and PCIcrawler rather than relying on custom vendor tools, which lack generality. Custom tools are hard to parse and even harder to maintain. Moreover, integrating new vendors, new kernel versions, or new types of hardware requires a lot of time and effort.
Bad (down-negotiated) link speed (usually running at 1/2 or 1/4 of the expected speed) and bad (down-negotiated) link width (running at 1/2, 1/4, or even 1/8 of the expected link width) were other concerning PCIe faults. These faults can be difficult to detect without some sort of automated tool because the hardware is working, just not as optimally as it could.
Based on our study at scale, we found that most of these faults could be corrected by reseating hardware components. This is why we try this first before marking the hardware as faulty.
Since a small minority of these faults can be fixed by a reboot, we also record historical repair actions. We have special rules to identify repeat offenders. For example, if the same hardware component on the same server fails a predefined number of times in a predetermined time interval, after a predefined number of reseats, we automatically mark it as faulty and swap it out. In cases where the component swap does not fix it, we will have to resort to a motherboard swap.
We also keep an eye on the repair trend to identify nontypical failure rates. For example, in one case, by using data from custom Scuba tables and their illustrative graphs and timelines, we root-caused a down-negotiation issue to a specific firmware release from a specific vendor. We then worked with the vendor to roll out new firmware that fixed the issue.
It’s also important to rate-limit remediations and repairs as a safety net to prevent bugs in the code from mass draining and unprovisioning, which can result in service outages if not handled properly.
Using this overall methodology, we’ve been able to add hardware health coverage and fix several thousand servers and server components. Every week, we’ve been able to detect, diagnose, remediate, and repair various PCIe faults on hundreds of servers.
Here’s a step-by-step breakdown of our process for identifying and fixing PCIe faults:
We’ve added hardware health coverage and fixed several thousand servers and server components with this methodology. We continue to detect, diagnose, remediate, and repair hundreds of servers every week with various PCIe faults. This has made our hardware fleet more reliable, resilient, and performant.
We’d like to thank Aaron Miller, Aleksander Książek, Chris Chen, Deomid Ryabkov, Wren Turkal and many others who contributed to this work in different aspects.
The post How Facebook deals with PCIe faults to keep our data centers running reliably appeared first on Facebook Engineering.