Networking: Using Linux Traffic Control for Fun and Profit Loss Prevention

Here at bitly, we are big fans of data, tubes and especially tubes that carry
data.

This is a story about asking tubes to carry too much data.

A physical hadoop cluster has been a significant part of bitly’s infrastructure
and a core tool of the bitly Data Science and Ops/Infra teams for some time.
Long enough that we needed to cycle in a new cluster, copy data, and fold the
old into the new.

Branches were opened, work was done, servers provisioned and the new cluster
was stood up. Time to take a step back from the story and flesh out some
technical details:

  • bitly operates at a consequential scale of data: at the time of this
    migration the hadoop cluster held just over 150TB of compressed stream data
    on disk; that is, the valuable output of our various applications after
    having been manipulated and expanded on by other applications.

  • bitly’s physical presence is colocated with our data center partner. There
    are three physical chassis classes (application, storage and database)
    racked together in contiguous cabinets in rows. At the time of this story,
    each chassis had three physical 1Gb Ethernet connections, each logically
    isolated by VLANs: frontlink, backlink and lights-out (for out-of-band
    management of the server chassis). Each connection, after a series of
    cabinet-specific patch panels and switches, connects to our core switches
    over 10Gb glass in a hub and spoke topology.

  • While bitly also operates at a consequential physical scale (hundreds of
    physical server chassis), we depend on our
    data center partner for network infrastructure and topology. This means that
    within most levels of the physical
    networking stack, we have severely limited control and visibility.

Back to the story:

The distcp tool bundled with hadoop allowed us to quickly copy data from one
cluster to the other. Put simply, distcp creates a mapreduce job to shuffle
data from one hdfs cluster to another in a many-to-many node copy. Distcp was
fast, which was good.
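
As a rough illustration (the namenode hostnames and paths below are
placeholders, not our actual hosts), a basic cluster-to-cluster copy is a
single command:

# old-nn / new-nn and the path are placeholders
hadoop distcp hdfs://old-nn:8020/user/data hdfs://new-nn:8020/user/data

Behind that one command, hadoop schedules map tasks across the cluster, each
responsible for copying a slice of the file list.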

bitly broke, and the Ops/Infra team was very sad.

Errors and high-latency responses were being returned to website users and api
clients. We discovered that services were getting errors from other services,
database calls were timing out, and even DNS queries internal to our network
were failing. We determined that the copy had caused unforeseen duress on the
network tubes, particularly the tubes that carried traffic across physical
cabinets. However, the information given to us by our data center partner only
added to the confusion: no connection, not even cabinet to core, showed signs
of saturation, congestion or errors.

We now had two competing needs: to continue making headway on the hadoop
migration, and to troubleshoot and understand our network issues.

We limited the number of mapreduce mappers that the distcp tool used to copy
data between clusters, which artificially
throttled throughput for the copies, allowing us to resume moving forward
with the migration. Eventually the copies
completed, and we were able to swap the new cluster in for the old.
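
In distcp terms this throttle is just the -m flag, which caps the number of
simultaneous map tasks doing the copying; something along the lines of (the
value and paths here are illustrative):

# -m caps the number of simultaneous map tasks; 10 and the paths are illustrative
hadoop distcp -m 10 hdfs://old-nn:8020/user/data hdfs://new-nn:8020/user/data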

The new cluster had more nodes, which meant that hadoop was faster.

The Data Science team was happy.

Unfortunately, with hadoop being larger and faster, more data was getting
shipped around to more nodes during mapreduce
workloads, which had the unintended effect of:

bitly broke, a lot. The Ops/Infra team was very sad.

The first response action was to turn hadoop off.

The Data Science team was sad.

A turned off hadoop cluster is bad (just not as bad as breaking bitly), so we
time warped the cluster back to 1995 by forcing all NICs to re-negotiate at
100Mbps (as opposed to 1Gbps) using ethtool -s eth1 speed 100 duplex full
autoneg on. Now we could safely turn hadoop on, but it was painfully slow.

The Data Science team was still sad.

In fact it was so slow and congested that data ingestion and scheduled
ETL/reporting jobs began to fail frequently, triggering alarms that woke
Ops/Infra team members up in the middle of the night.

The Ops/Infra team was still sad.

Because of our lack of sufficient visibility into the state of the network,
working through triage and
troubleshooting with our data center partner was going to be an involved and
lengthy task. Something had to be done to
get hadoop into a usable state, while protecting bitly from breaking.

Time to take another step back:

Some tools we have at bitly:

  • roles.json : A list of servers (app01, app02, userdb01, hadoop01 etc),
    roles (userdb, app, web, monitoring,
    hadoop_node etc), and the mapping of servers into roles (app01,02 -> app,
    hadoop01,02 -> hadoop_node etc).

  • $datacenter/jsons/* : A directory containing a json file per logical
    server, with attributes describing the server such as ip address, names,
    provisioning information and, most importantly for this story, cabinet
    location.

  • Linux : Linux.

Since we could easily identify which servers do what, where each server is
racked, and could leverage all the benefits of Linux, this was a solvable
problem, and we got to work.
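
For example, answering “what cabinet is this host in?” is a quick lookup
against those files (a hypothetical sketch; the field name and file layout are
assumptions, not our exact schema):

# hypothetical: assumes each per-host json carries a "cabinet" attribute
jq -r '.cabinet' $datacenter/jsons/hadoop27.json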

And the Ops/Infra team was sad.

Because Linux’s networking Traffic Control (tc) syntax was clunky and awkward
and its documentation intimidating. After much swearing and keyboard
smashing, perseverance paid off and working examples of tc magic surfaced.
Branches were opened, scripts were written, deploys done, benchmarks run and
finally some test nodes were left with the following:

$ tc class show dev eth1
class htb 1:100 root prio 0 rate 204800Kbit ceil 204800Kbit burst 1561b
    cburst 1561b
class htb 1:10 root prio 0 rate 819200Kbit ceil 819200Kbit burst 1433b 
    cburst 1433b
class htb 1:20 root prio 0 rate 204800Kbit ceil 204800Kbit burst 1561b 
    cburst 1561b

$ tc filter show dev eth1
filter parent 1: protocol ip pref 49128 u32 
filter parent 1: protocol ip pref 49128 u32 fh 818: ht divisor 1 
filter parent 1: protocol ip pref 49128 u32 fh 818::800 order 2048 key 
    ht 818 bkt 0 flowid 1:20 
    match 7f000001/ffffffff at 16
filter parent 1: protocol ip pref 49129 u32 
filter parent 1: protocol ip pref 49129 u32 fh 817: ht divisor 1 
filter parent 1: protocol ip pref 49129 u32 fh 817::800 order 2048 key 
    ht 817 bkt 0 flowid 1:10 
    match 7f000002/ffffffff at 16
filter parent 1: protocol ip pref 49130 u32 
filter parent 1: protocol ip pref 49130 u32 fh 816: ht divisor 1 
filter parent 1: protocol ip pref 49130 u32 fh 816::800 order 2048 key 
    ht 816 bkt 0 flowid 1:20 
    match 7f000003/ffffffff at 16
<snipped>

$ tc qdisc show
qdisc mq 0: dev eth2 root 
qdisc mq 0: dev eth0 root 
qdisc htb 1: dev eth1 root refcnt 9 r2q 10 default 100 
    direct_packets_stat 24

In plain English, there are three traffic control classes. Each class
represents a logical group, to which a filter
can be subscribed, such as:

class htb 1:100 root prio 0 rate 204800Kbit ceil 204800Kbit burst 1561b cburst 1561b

Each class enforces a ceiling, or throughput limit, on outgoing traffic
aggregated across all filters subscribed to that class.

Each filter is a specific rule for a specific IP (unfortunately, each IP is
printed in hex), so the filter:

filter parent 1: protocol ip pref 49128 u32 
filter parent 1: protocol ip pref 49128 u32 fh 818: ht divisor 1 
filter parent 1: protocol ip pref 49128 u32 fh 818::800 order 2048 key 
    ht 818 bkt 0 flowid 1:20 
    match 7f000001/ffffffff at 16

can be read as “subscribe hadoop14 to the class 1:20” where “7f000001” can be
read as the IP for hadoop14 and “flowid 1:20” is the class being subscribed
to.
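
Each pair of hex digits in the match is simply one octet of the destination
address, so a quick printf is enough to translate back to dotted-quad form,
for example:

$ printf '%d.%d.%d.%d\n' 0x7f 0x00 0x00 0x01
127.0.0.1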

Finally there is a qdisc, which is more or less the active queue for the eth1
device. That queue defaults to placing any host that is not otherwise matched
by a filter into the 1:100 class.

qdisc htb 1: dev eth1 root refcnt 9 r2q 10 default 100 direct_packets_stat 24

With this configuration, any host, hadoop or not, that is in the same cabinet
as the host being configured gets a filter that is assigned to the “1:10”
class, which allows up to ~800Mbps for the class as a whole. Similarly, there
is a predefined list of roles that are deemed “roles of priority hosts”,
which get a filter created on the same “1:100” rule. These are hosts that do
uniquely important things, like running the hadoop namenode or jobtracker
services, and also our monitoring hosts.

Any other hadoop host that is not in the same cabinet is attached to the
“1:20” class, which is limited to a more conservative ~200Mbps.

As mentioned before, any host not specified by a filter gets caught by the
default class for the eth1 qdisc, which is
“1:100”.
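
Pulling that together, a configuration along these lines can be built with a
handful of tc commands. This is a minimal sketch assuming eth1 and placeholder
destination IPs, not our actual generated scripts:

# root htb qdisc; traffic not matched by any filter falls into class 1:100
tc qdisc add dev eth1 root handle 1: htb default 100

# ~800Mbps for same-cabinet peers, ~200Mbps for cross-cabinet hadoop nodes
# and for the catch-all class
tc class add dev eth1 parent 1: classid 1:10 htb rate 800mbit ceil 800mbit
tc class add dev eth1 parent 1: classid 1:20 htb rate 200mbit ceil 200mbit
tc class add dev eth1 parent 1: classid 1:100 htb rate 200mbit ceil 200mbit

# subscribe specific destination IPs (placeholders) to classes
tc filter add dev eth1 parent 1: protocol ip u32 match ip dst 10.0.1.20/32 flowid 1:10
tc filter add dev eth1 parent 1: protocol ip u32 match ip dst 10.0.2.30/32 flowid 1:20

The “match ip dst” filters compile down to u32 matches on byte offset 16 of
the IP header, which is exactly the “match …/ffffffff at 16” seen in the tc
filter show output above.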

What does this actually look like? Here is a host that is caught by the
catch-all “1:100” rule:

[root@hadoop27 ~]# iperf -t 30 -c NONHADOOPHOST
------------------------------------------------------------
Client connecting to NONHADOOPHOST, TCP port 5001
TCP window size: 23.2 KByte (default)
------------------------------------------------------------
[  3] local hadoop27 port 35897 connected with NONHADOOPHOST port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-30.1 sec   735 MBytes   205 Mbits/sec

Now when connecting to another host in the same cabinet, which falls under the
“1:10” rule:

[root@hadoop27 ~]# iperf -t 30 -c CABINETPEER
------------------------------------------------------------
Client connecting to CABINETPEER, TCP port 5001
TCP window size: 23.2 KByte (default)
------------------------------------------------------------
[  3] local hadoop27 port 39016 connected with CABINETPEER port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-30.0 sec  2.86 GBytes   820 Mbits/sec

Now what happens when connecting to two servers that match the “1:10” rule?

[root@hadoop27 ~]# iperf -t 30 -c CABINETPEER1
------------------------------------------------------------
Client connecting to CABINETPEER1, TCP port 5001
TCP window size: 23.2 KByte (default)
------------------------------------------------------------
[  3] local hadoop27 port 39648 connected with CABINETPEER1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-30.0 sec  1.47 GBytes   421 Mbits/sec

[root@hadoop27 ~]# iperf -t 30 -c CABINETPEER2
------------------------------------------------------------
Client connecting to CABINETPEER2, TCP port 5001
TCP window size: 23.2 KByte (default)
------------------------------------------------------------
[  3] local hadoop27 port 38218 connected with CABINETPEER2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-30.0 sec  1.43 GBytes   408 Mbits/sec

So the traffic got halved? Sounds about right.

Even better, trending the data was relatively easy: the stats output just
needed a little mangling before being shipped to our trending services:

$ /sbin/tc -s class show dev eth1 classid 1:100
class htb 1:100 root prio 0 rate 204800Kbit ceil 204800Kbit 
    burst 1561b cburst 1561b 
Sent 5876292240 bytes 41184081 pkt (dropped 0, overlimits 0 requeues 0) 
rate 3456bit 2pps backlog 0b 0p requeues 0 
lended: 40130273 borrowed: 0 giants: 0
tokens: 906 ctokens: 906
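
A rough sketch of that kind of shipping, assuming a graphite-style listener
(the metric name and endpoint are placeholders, not our actual trending
pipeline):

# pull the byte counter for class 1:100 and push it with a timestamp
BYTES=$(/sbin/tc -s class show dev eth1 classid 1:100 | awk '/Sent/ {print $2}')
echo "tc.eth1.class_1_100.bytes_sent ${BYTES} $(date +%s)" | nc -w 1 graphite.example.com 2003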

After testing, we cycled through the hadoop hosts, re-enabling their links to
1Gb after applying the traffic control rules. With the deploys done, hadoop was
usably performant.
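
Bringing a link back up is just the inverse of the earlier downgrade, roughly
(eth1 assumed, as before):

# inverse of the 100Mbps downgrade; eth1 assumed
ethtool -s eth1 speed 1000 duplex full autoneg on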

The Data Science team was happy.

The Ops/Infra team could begin tackling longer term troubleshooting and
solutions while being able to sleep at night,
knowing that bitly was not being broken.

The Ops/Infra team was happy.

Takeaways:

  • In dire moments: your toolset for managing your environment is as
    important as the environment itself. Because we
    already had the toolset available to holistically control the environment, we
    were able to dig ourselves out of the
    hole almost as quickly as we had fallen into it.

  • Don’t get into dire moments: Understand the environment that you live in.
    In this case, we should have had a better
    understanding and appreciation for the scope of the hadoop migration and its
    possible impacts.

  • Linux TC is a high-cost, high-reward tool. It was almost certainly written
    by people with the very longest of beards, and requires time and patience
    to implement. However, we found it to be an incredibly powerful tool that
    helped save us from ourselves.

  • Linux: Linux

EOL

This story is a good reminder of the “Law of Murphy for devops”. Temporary
solutions like those in this story afforded us the time to complete
troubleshooting of our network and implement permanent fixes. We have since
unthrottled hadoop and moved it to its own dedicated network, worked around
shoddy network hardware to harden our primary network, and much more. Stay
tuned.

Source: Bitly