You’re doing data replication wrong. There. I said it. Suppose you’ve got a six node database cluster (ex. HDFS, Cassandra, Riak), and you’d like to store three copies of some piece of data in the cluster. Which three nodes should you pick? One possibility is that you pick a random three nodes. This is the […]

You’re doing data replication wrong. There. I said it.

Suppose you’ve got a six node database cluster (ex. HDFS, Cassandra, Riak), and you’d like to store three copies of some piece of data in the cluster. Which three nodes should you pick?

One possibility is that you pick a random three nodes. This is the strategy that HDFS takes by default.

Another possibility is that you order the nodes and then randomly pick three consecutive nodes. This is the strategy Facebook’s modified HDFS takes, as well as Riak.

A third possibility is you set up your nodes in two sets of three: [A, B, C] might be one set, and [D, E, F] might be another. The data then gets randomly assigned to one of those sets, so all data that lives on A is also on B and C, and all data that lives on D is also on E and F. This is how you might set up data replication and sharding with MySQL or MongoDB.

Question: what happens in the case that three random nodes fail? Cidon et al explored this in their paper introducing copysets. In it, they introduce both the notion of a copyset (places you copy data), and also a random algorithm called Copysets that computes the copysets. Forget about the Copysets algorithm, because Cidon et al followed up with a paper a few years later introducing Tiered Replication, which solves the copyset generation problem and then some. This post is about Tiered Replication and our Python implementation, trepl.

So what does happen when a random three nodes fail? Let’s go back and look at the three different strategies and do a little bit of math.

**Random Replication**

If you have #bigdata and you’re randomly picking your replication sets, chances are you’ll have data that resides on every single possible set. To see why this is the case, we can easily compute the number of possible three node configurations from a six node cluster. This is a combinatorics 101 function: nCr or “6 choose 3” which is equal to 20:

A B C A B D A B E A B F A C D ... C E F D E F

Note that we’re counting *combinations*, not *permutations*, so we don’t care about the ordering of the sets.

Since there only 20 different sets, but we have #bigdata, then chances are lots of data gets assigned to each one of those sets. Now if a random three nodes fail, *we’ve guaranteed that we’ve lost data!* We’ve lost roughly 1/20 or 5% of our data actually.

**Ring Placement**

Now suppose instead we go the Riak route and place our data on sequential nodes in a ring. Our sets then consist of the following:

A B C B C D C D E D E F E F A F A B

So all data gets copied to one of those six sets. Now what happens when three random nodes fail? Well we know that there are 20 possible combinations of three nodes, so the probability that one of the six sets is taken out is 6/20, or 30%, and when it happens, we lose 1/6 of our data. The expected data loss then, is the probability of loss times the magnitude of loss, or 1/20 = 5%.

**Pair Placement**

Here we’ve got exactly two sets.

A B C D E F

The probability that one of those two sets gets wiped out when three nodes fail is 2/20, or 10%. That’s nice and low, but when it happens, we lose half our data! So the expected amount of loss is 2/20 * 1/2 = 5%.

Starting to see a pattern? The expected amount of data loss in each of these scenarios is the same, however we can control the *probability* of failure by varying the configuration of the sets. This is super interesting in my opinion, because not all failures are the same. If we have to go to our customers and fess up that we’ve lost 5% of data, that might from a PR perspective be just as bad as admitting that we lost 17% of our data. In that case, we should prefer ring placement over random replication, because ring placement (though it’s *expected* to be just as bad as random) has a lower failure rate.

**Restoring Data**

It turns out that the replication strategy also dictates the speed at which you can restore from a failed node. This is easiest to see when comparing pair versus random replication. With pair placement, suppose that A fails and we bring it back up from a blank slate. All of the data that belongs on A is on B and C, so A can copy half of it from B, and the other half from C. The number of nodes A can restore from is called the scatter width, in this case two.

With random replication, A participates in a set with every other node. When we bring it back up, it can restore its data from every other node, so it has a scatter width of five. Not only does this mean that each node undergoes less strain, but the restoring process is also quite a bit faster since it can be transferring data from the replicas in parallel.

**Tiered Replication**

The three parameters of note in the example I gave are: six nodes, three copies, and some scatter width. We can parameterize these as N, R, and S. The Tiered Replication algorithm computes a set of copysets each of which is size R, the sets are drawn from the N nodes, and each node has a minimum scatter width of S. This gives the programmer the ability to pick a comfortable failure and restore scenario: do we want to minimize probability of failure but suffer from slow restores? Or minimize damage done and have fast restores?

TR is a straightforward greedy algorithm. I won’t go into detail here (read the paper or the code), but the basic intuition is this:

- All nodes start out with a scatter width of zero.
- To create a copyset, we pick the R nodes with the lowest scatter width.
- Repeat this process until every node has a scatter width of at least S.

The “tiered” part of Tiered Replication allows you to throw in an arbitrary constraint when creating a copyset. This is useful for implementing strategies such as rack or availability zone awareness, or “tiering” your storage (R-1 copies of data sit on fast expensive nodes, and then one copy sits on a slow but cheap node only used for restoring from failures).

**Trepl**

Trepl is our Python implementation of TR, available on pypi.

>>> trepl.build_copysets(['node1', 'node2', 'node3'], R=2, S=1) [['node1', 'node2'], ['node1', 'node3']] >>> trepl.build_copysets(['node1', 'node2', 'node3'], R=2, S=2) [['node1', 'node2'], ['node1', 'node3'], ['node2', 'node3']]

You can supply a constraint function as well, and in fact we ship with one useful for rack awareness, as well as tiering.

# not rack aware >>> trepl.build_copysets(['node1', 'node2', 'node3'], R=2, S=1) [['node1', 'node2'], ['node1', 'node3']] # rack aware, node1 and node2 can not share a copyset since they're in # the same rack >>> rack_map = { 'node1': 'rack1', 'node2': 'rack1', 'node3': 'rack3' } >>> trepl.build_copysets( rack_map.keys(), R=2, S=1, checker=trepl.checkers.rack(rack_map), ) [['node1', 'node3'], ['node2', 'node3']] # scatter width must be 2, and data must exist on at least one node in # the backup tier >>> primary = ['A', 'B', 'C'] >>> backup = ['d', 'e'] >>> trepl.build_copysets( primary + backup, R=2, S=2, checker=trepl.checkers.tiered(backup, 2), ) [['A', 'd'], ['A', 'e'], ['B', 'd'], ['B', 'e'], ['C', 'd'], ['C', 'e']]

We currently use Trepl to determine partition placements with Kafka (sorry, that tool isn’t open source!), and we’ll also be extending the tool to set up smarter layer 7 load balancing with haproxy. We also use it in our beloved WADE, and can also envision possible applications for Mesos. Let us know if you come up with another novel use!

Source: Chartbeat