Two Weeks Later: Finding and Eliminating Long Tail Latencies

A little over two weeks ago, we shared extensive benchmarking results of edge networks all around the world. They showed that across a range of tests (TCP connection time, time to first byte, time to last byte) and a range of measurements (p95, mean), Cloudflare had some impressive network performance. But we weren’t the fastest everywhere. So we made a commitment: we would improve in at least 10% of networks where we were not #1.

Today, we’re happy to tell you that we’ve delivered as promised. Of the networks where our average latency was more than 100ms behind the leading provider during Speed Week, we’ve dramatically improved our performance. There were 61 such networks; we’re now the fastest in 29 of them. Of course, we’re not done yet, but we wanted to share the latest results with you and explain how we did it.

Measuring What Matters

In the process of quantifying network performance, it became clear where we were not the fastest. There were 61 country/network pairs where we were more than 100ms behind the leading provider.
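To make that criterion concrete, here is a minimal Python sketch. It is an illustration only, not the tooling used for these measurements: the function names, provider names, network names, and latency values are all made up.

```python
from typing import Dict, List, Tuple

# Hypothetical data shape: mean latency in ms per provider,
# keyed by a (country, network) pair.
Pair = Tuple[str, str]
Measurements = Dict[Pair, Dict[str, float]]

GAP_THRESHOLD_MS = 100.0  # the "more than 100ms behind the leader" cutoff


def gap_to_leader(latencies_ms: Dict[str, float], provider: str) -> float:
    """Return how many ms `provider` trails the fastest provider (0.0 if it leads)."""
    return latencies_ms[provider] - min(latencies_ms.values())


def rank(latencies_ms: Dict[str, float], provider: str) -> int:
    """Return the 1-based rank of `provider`, where 1 means lowest latency."""
    ordered = sorted(latencies_ms, key=lambda name: latencies_ms[name])
    return ordered.index(provider) + 1


def lagging_pairs(
    measurements: Measurements,
    provider: str,
    threshold_ms: float = GAP_THRESHOLD_MS,
) -> List[Pair]:
    """Return the pairs where `provider` is more than `threshold_ms` behind the leader."""
    return [
        pair
        for pair, latencies in measurements.items()
        if gap_to_leader(latencies, provider) > threshold_ms
    ]


# Made-up sample values, for illustration only.
sample: Measurements = {
    ("BR", "network-x"): {"us": 310.0, "provider-b": 180.0, "provider-c": 240.0},
    ("ES", "network-y"): {"us": 120.0, "provider-b": 95.0, "provider-c": 150.0},
}

print(lagging_pairs(sample, "us"))              # [('BR', 'network-x')]
print(rank(sample[("BR", "network-x")], "us"))  # 3
```

Note that a pair can be flagged even when the provider ranks #2: what matters for this list is the size of the gap to the leader, not the rank.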

Once that was done, the fun began: we needed to figure out why we were slow, and then improve. The challenges we faced were unique to each network and highlighted a variety of issues that are prevalent on the Internet. We’re going to dive deep into three of these networks and show how we diagnosed and then improved performance.

But before we do, here are the results of our efforts in the past two weeks: of the 61 networks where our performance was over 100ms behind the leader, we are now the #1 network in 29 of them.

And it’s not that we just focused on those 29 networks, either. We’ve dramatically improved our performance in almost all the networks where we were over 100ms behind the leader.

With the results out of the way, let’s share the story of chasing peak performance in three very different geographies, each with its own set of challenges. Before we begin: a lot of Cloudflare’s internal network performance is automatically tuned. However, by its very nature, the Internet is a network of networks, so maximizing performance inherently relies on us talking to other network operators. That’s often what we had to do here.

Rectifying Route Advertisement in Brazil

One particular network that was flagged for improvement during Speed Week stood out: we’ll refer to it as Network-A. This network was well known to our edge team (the team that looks after our network connectivity in Cloudflare data centers) for frequently congesting the dedicated interconnection we have with the network in São Paulo. This type of dedicated connection is called a Private Network Interconnect (PNI), or private peering, and it lets Cloudflare exchange traffic with Network-A directly, without any intermediaries, using BGP.

At first glance, we noticed that a significant chunk of traffic to Network-A was not using the PNI, but instead was being sent through one of our transit providers. A transit provider is an intermediary network that provides connectivity to the rest of the Internet.

This is not uncommon. The most likely reason for this behavior is that, at some point in the past, traffic was shifted away from the PNI because of the congestion mentioned earlier.

We then started to take a more in-depth look at the path from this particular transit provider and identified that traffic was routed all the way to the USA before coming back to Brazil. The transit provider was exhibiting behavior known as tromboning: traffic from one location destined to a network in the same location travels vast distances only to be exchanged and then returns again. Tromboning typically occurs as a result of networks preferring paths that are farther away from the best possible path. This can happen due to peering preferences, BGP configurations, or the presence of direct interconnection farther away from end users. This explained the higher latency on this network we saw during Speed Week.
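A back-of-the-envelope calculation shows why tromboning is so costly. The sketch below estimates the minimum extra delay added when traffic takes an out-and-back detour. Everything specific in it is an assumption: the text only says the traffic went to the USA, so Miami is a hypothetical turnaround point, the coordinates are approximate, and the figure of roughly 200 km per millisecond for light in fiber is a common rule of thumb. Real fiber paths are longer than the great-circle distance, so this is a lower bound.

```python
import math

EARTH_RADIUS_KM = 6371.0
FIBER_KM_PER_MS = 200.0  # rule of thumb: light in fiber covers about 200 km per ms


def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def detour_delay_ms(one_way_km: float) -> float:
    """Minimum propagation delay for travelling out to the turnaround point and back."""
    return 2 * one_way_km / FIBER_KM_PER_MS


# Approximate coordinates. Miami is an assumed turnaround point, not from the post.
SAO_PAULO = (-23.55, -46.63)
MIAMI = (25.76, -80.19)

distance_km = great_circle_km(*SAO_PAULO, *MIAMI)
extra_ms = detour_delay_ms(distance_km)
print(f"one-way distance: {distance_km:.0f} km")
print(f"extra delay if one direction trombones: at least {extra_ms:.0f} ms")
```

If both directions of a connection take the detour, the penalty on the round-trip time doubles.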

The next step we took was to look into alternatives to this transit connection. We have a nearby data center in Rio de Janeiro — where we also had a PNI with Network-A. Moreover, São Paulo and Rio de Janeiro are connected via our backbone network. After making the necessary checks to ensure we had room to carry traffic towards Network-A through our backbone and out through the PNI in Rio, we proceeded to prepare the network configuration changes.

We started by announcing Network-A’s IP prefixes from Rio across our backbone and accepting them in São Paulo. We then ensured we preferred the path via our backbone over the PNI by changing the BGP behavior through the LOCAL_PREF path attribute. Finally, we removed all the configuration specifying that transit provider as the preferred route for the previously identified Network-A traffic.

The result was as expected. Traffic moved away from the transit provider onto our backbone network. We confirmed we achieved a decrease in latency by monitoring our p95 TCP RTTs, which went from 175ms to 90ms.
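For readers less familiar with percentile metrics, here is a minimal sketch of how a p95 can be computed from RTT samples, and why it exposes tail latency that a mean hides. It uses the nearest-rank method, which is one of several common definitions and an assumption on our part; the sample values are made up and are not our measurement data.

```python
import math
import statistics
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up RTTs in ms: most connections are fast, a few are very slow.
rtts_ms = [80.0] * 18 + [400.0, 900.0]

print(statistics.mean(rtts_ms))  # 137.0
print(percentile(rtts_ms, 95))   # 400.0
```

In this made-up sample, the p95 is far above the mean because it is driven by the slowest connections, which is exactly the long tail this post is about.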

We currently rank #1 with Network-A, moving up from #5 during Speed Week, as seen in the chart below.

[Chart: performance ranking of providers on Network-A]

Immaculate Ingress in Spain

Another network that stood out was a European ISP with a global presence. We’ll refer to it as Network-B. Our Real User Monitoring (RUM) measurements showed that we were experiencing high latencies in several parts of the world, including Spain.

The first thing we did was to check how we handled traffic from Spain for Network-B. Our data showed that data centers outside the country were serving users from Network-B: Milan in Italy and Marseille in France. This obviously raised a question: why was traffic not staying local in Spain?

The traffic was not staying local because Network-B had not peered with us in Madrid. Whereas private peering describes a connection between exactly two networks using a dedicated circuit, public peering allows multiple networks to interconnect, if they so wish, at an Internet Exchange Point (IXP) location using shared infrastructure. We looked at our Peering Portal to identify any potential peering opportunities with Network-B and established peering sessions in various locations where we saw high latency, including Madrid.

We looked at the traffic breakdown for these locations and identified the top destinations not being advertised in-country. We then checked whether our Spanish data centers were advertising these destinations and found that the corresponding anycast IP addresses were not enabled in Barcelona. We enabled the additional anycast IP addresses in Barcelona, and this change resulted in traffic for Network-B being handled locally, which helped reduce latency.

Since we were looking into our public peering status with Network-B, we also noticed that they had turned off their public peering session with Cloudflare in Milan. Our logs showed that the session was down because Network-B’s side considered Cloudflare to be sending more IP prefixes than its configured limit allowed. We contacted Network-B and advised them to update their configuration according to the data we publish in PeeringDB. While it is a public peering session, which comes with its own pros and cons, it still represents a more direct path than using a transit provider.
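Many BGP implementations let an operator cap the number of prefixes accepted from a peer and tear the session down when the cap is exceeded. The sketch below is a toy model of that safeguard, not any router’s actual implementation: the class, its fields, and the limit are hypothetical, and the prefixes come from the IP ranges reserved for documentation.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class PeeringSession:
    """Toy model of a BGP session guarded by a maximum-prefix limit."""

    peer_name: str
    max_prefixes: int
    received: Set[str] = field(default_factory=set)
    established: bool = True

    def receive(self, prefix: str) -> bool:
        """Record an announced prefix; drop the session if the limit is exceeded."""
        if not self.established:
            return False
        self.received.add(prefix)
        if len(self.received) > self.max_prefixes:
            # A stale limit trips here even though the peer is behaving normally.
            self.established = False
            self.received.clear()
        return self.established


# Hypothetical: the limit was sized when the peer announced fewer prefixes.
session = PeeringSession(peer_name="example-peer", max_prefixes=2)
for prefix in ["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"]:
    session.receive(prefix)

print(session.established)  # False
```

The fix in a case like this is on the side that enforces the limit: raising it to match what the peer actually announces lets the session stay up.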

These changes pushed us up from ranking #2 during Speed Week to #1, as shown by the graph below:

[Chart: performance ranking of providers on Network-B in Spain]

You may notice that going from #2 to #1 still means we have a latency of about 300ms. We want to ensure that every network has amazing performance, but we can’t control all network providers and how they connect with the rest of the Internet. We’re constantly working to ensure that end users see the best experience possible.

Upstream Selection in Africa

We’ve previously discussed private peering and transit providers, and how a direct connection is better than a transit connection, which usually routes through intermediary networks. However, this is not always true. This was the case for a network in Africa, which we will call Network-C.

As before, we started by looking at the locations from where we serve traffic to Network-C. This was mostly from our data centers in Western Europe. Looking at the parent ASNs for Network-C, we expected this outcome since we don’t peer with either of them anywhere in Africa.

Let’s take our data center in London. There, we had a private peering connection with Parent-1 and a transit connection with Parent-2. We were receiving IP prefixes belonging to Network-C from both parents; however, we were only sending traffic to Parent-1, since that was our private peer.

As Parent-2 also provided a direct path to Network-C and, moreover, they belonged to the same organization, we decided to test the latency via Parent-2. It is generally tricky to identify potential upstream bottlenecks, especially for transit providers, as each network has its own internal mechanisms for routing. However, in this case we were directly connected.

Once again, we modified the BGP behavior. Let’s go into more detail this time. Our routing policies are configured differently depending on the type of network we peer with and the type of connection we use. Our policies set the BGP LOCAL_PREF path attribute, which is the first decisive step in the selection of a path: the route with the highest value wins. In our case, Network-C prefixes from Parent-1 had a higher associated value than the same prefixes learned from Parent-2 and were thus chosen for routing. To steer traffic away from the private peer and towards the transit provider, we needed to adjust our transit policy to set a higher LOCAL_PREF value only for Network-C prefixes. We also had to use a regular expression that matched the desired prefixes by Network-C’s ASN in the AS path, in a way that would not affect traffic to the other networks reached through the transit provider.
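The policy change described above can be sketched in a few lines of Python. This is a simplified model, not our router configuration: the ASNs are taken from the ranges reserved for documentation and private use, the LOCAL_PREF values are hypothetical, and the path selection only considers LOCAL_PREF followed by AS path length, whereas a real BGP decision process has more tie-breakers.

```python
import re
from dataclasses import dataclass, replace
from typing import Iterable


@dataclass(frozen=True)
class Route:
    prefix: str
    neighbor: str
    as_path: str      # space-separated ASNs, origin last
    local_pref: int


# Hypothetical values for illustration only.
NETWORK_C_ASN = 64512
PEER_LOCAL_PREF = 200       # default for private peers
TRANSIT_LOCAL_PREF = 100    # default for transit providers
BOOSTED_LOCAL_PREF = 300    # applied only to Network-C routes from transit

# Match Network-C's ASN as a whole token anywhere in the AS path.
NETWORK_C_IN_PATH = re.compile(rf"\b{NETWORK_C_ASN}\b")


def transit_import_policy(route: Route) -> Route:
    """Raise LOCAL_PREF for Network-C routes; leave all other transit routes alone."""
    if NETWORK_C_IN_PATH.search(route.as_path):
        return replace(route, local_pref=BOOSTED_LOCAL_PREF)
    return route


def best_path(candidates: Iterable[Route]) -> Route:
    """Simplified selection: highest LOCAL_PREF first, then shortest AS path."""
    return max(candidates, key=lambda r: (r.local_pref, -len(r.as_path.split())))


via_peer = Route("203.0.113.0/24", "parent-1", "64496 64512", PEER_LOCAL_PREF)
via_transit = Route("203.0.113.0/24", "parent-2", "64497 64512", TRANSIT_LOCAL_PREF)
other = Route("198.51.100.0/24", "parent-2", "64497 64500", TRANSIT_LOCAL_PREF)

print(best_path([via_peer, via_transit]).neighbor)                         # parent-1
print(best_path([via_peer, transit_import_policy(via_transit)]).neighbor)  # parent-2
print(transit_import_policy(other).local_pref)                             # 100
```

Anchoring the match to whole ASN tokens matters: without the word boundaries, a pattern for 64512 would also match a longer ASN such as 645120 and shift traffic for an unrelated network.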

This change produced better results in terms of latency. We were #2 during Speed Week. We are now #1, as seen in this chart:

[Chart: performance ranking of providers on Network-C]

Update on Speed Week

Two weeks ago, when we first reported our measurements, two charts stood out because Cloudflare was not #1 in the number of networks where we had the lowest connection time or time to last byte (TTLB).

The first was the mean TCP connection time in the top 1,000 networks by number of IP addresses. Since then, we’ve been optimizing and have measured our performance again, and we’ve now moved into the #1 spot.

The other measurement where we were #2 was mean TTLB in the top 1,000 networks by IP count. We’ve moved into the #1 spot, but there’s still work to do. That makes sense, because the work we’ve done over the last two weeks optimized network performance rather than our software platform. Hence, connection times got a lot better while TTLB improved less dramatically.

Getting Ever Faster

Improving performance on the Internet is a long tail problem: each issue requires a different solution because every network is unique and covers different end users. As we continue to grow our network and interconnect with more of the world, it’s important that we constantly examine our performance to ensure that we’re the fastest.

The efforts of our team have yielded great improvements for our customers, but we’re not stopping just because Speed Week and Birthday Week are over. We’re automating the discovery of poor performance on networks like these, and we’re working hard to automate the remediation as well, in order to deliver more incredible performance for our customers.

And we have two more innovation weeks coming in 2021. We’ll be back each week to report on further progress on optimizing our performance globally.

Source: Cloudflare