What is HyperLogLog (HLL)?
We moved from HLLs stored in HBase to HLLs stored in Postgres with great results.
5-10 minute read
HyperLogLog (HLL) is a useful and interesting probabilistic data structure used to count unique values in a given data set with good accuracy and speed. Normally, counting unique values accurately requires memory proportional to the number of unique values, which becomes a problem when working with large data sets. HyperLogLog solves this by letting us trade memory consumption for (tunable) precision, making it possible to estimate cardinalities beyond 1 billion with a standard error of 2% using only 1.5 kilobytes of memory. For more on how this works, you can refer to our past blog post here.
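For intuition, here is a minimal, illustrative Python sketch of the core HyperLogLog idea – a toy, not our production implementation: hash each value, use the first few bits of the hash to pick a register, and keep the maximum count of leading zero bits seen per register. A few long runs of leading zeros imply many distinct values.

```python
import hashlib
import math

class SimpleHLL:
    """Toy HyperLogLog: m = 2**p registers, standard error ~ 1.04 / sqrt(m)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # 64-bit hash of the value (first 16 hex chars of SHA-1)
        h = int(hashlib.sha1(str(value).encode()).hexdigest()[:16], 16)
        idx = h >> (64 - self.p)                       # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def union(self, other):
        # Union is a lossless register-wise max, which is what makes
        # HLLs so convenient to aggregate across rows
        out = SimpleHLL(self.p)
        out.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return out

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:              # small-range correction
            return self.m * math.log(self.m / zeros)
        return raw
```

With p=12 this uses only 4,096 small registers, yet it estimates 100,000 unique values to within a few percent.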
At Adroll Group, we use this data structure in our reports to measure metrics like unique visitors, engaged visitors, etc. We had implemented HLL in-house and stored it in HBase. But HBase was causing us some headaches.
While working with HBase we encountered two major challenges:
HBase, being a key-value data store, requires a key that uniquely identifies each record. This key is used to extract metrics and drive aggregations in reporting. It was always a challenge to define the key so that record lookups remain efficient across different query parameters, especially compared to a relational database.
As an example, let’s say we are storing unique visitors for a customer with details about visitors every day.
For that we can create a table called Daily Unique Visitors with the following columns: date, customer ID, browser, location, and unique visitors (the HLL value).
Since the value we care about is only the last column and the others form its key, in an HBase table we'll need to define a key structure using the first 4 columns concatenated by underscores, i.e. date_customer_browser_location.
A key will then look like:
2019-07-01_A_Chrome_US, i.e. 1st July, for customer A, Chrome browser, US location (Key) => Unique visitors (Value)
This works well when we need to scan it in the same order.
For example, to get the number of unique visitors in the month of July 2019, for a given customer, for the Chrome browser, for all locations.
But as soon as we have different search criteria – like the number of unique visitors in July, for customer A, for all browsers, in the US – the key structure is no longer effective, and extra processing is needed to discard irrelevant keys from the scan results.
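The key scheme, and why the second query pattern defeats it, can be sketched as follows (the helper name and sample values are illustrative, not our actual code):

```python
import itertools

def row_key(ts, customer, browser, location):
    # The four key columns concatenated by underscores, as described above
    return f"{ts}_{customer}_{browser}_{location}"

# A tiny sample key space, sorted the way HBase stores row keys
keys = sorted(
    row_key(ts, c, b, l)
    for ts, c, b, l in itertools.product(
        ["2019-07-01", "2019-07-02"],     # date
        ["A", "B"],                       # customer
        ["Chrome", "Firefox", "Safari"],  # browser
        ["IN", "UK", "US"],               # location
    )
)

# Query 1: 1st July, customer A, Chrome, all locations.
# All matches share one key prefix, so a single range scan finds them.
q1 = [k for k in keys if k.startswith("2019-07-01_A_Chrome_")]

# Query 2: 1st July, customer A, all browsers, in the US.
# Browser comes before location in the key, so matching rows are scattered:
# the scan must cover the wider prefix and then discard irrelevant keys.
scanned = [k for k in keys if k.startswith("2019-07-01_A_")]
q2 = [k for k in scanned if k.endswith("_US")]
```

In this toy example the second query touches 9 rows to return 3; with real cardinalities (every browser times every location) the wasted scan work grows quickly.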
To interact with HBase from a non-Java platform, people usually use Thrift, since it provides a language-independent interface. To support our reporting needs we had to implement operations like aggregation, filtering, and ordering of HLL records efficiently in our Thrift layer. This added development, maintenance, and resource costs. We also encountered stability issues with Thrift which required frequent restarts.
In the earlier HBase table example: scanning, filtering, ordering, etc. is done in the Thrift layer.
Sometimes the date ranges are so big that Thrift spawns many parallel HBase scan threads, causing resource contention and frequent timeouts.
We started looking for alternatives and, fortunately, Amazon Aurora RDS started supporting the postgresql-hll extension, which provides a new HLL data type. Switching to a relational database system solved a lot of our problems. With RDS as a data store, we can define our schema efficiently using indexing, etc., to fulfill all our reporting needs (no more single keys). Also, we don't have to maintain anything for our aggregation, filtering, and ordering needs, as RDS has that built in (no more Thrift).
Going back to our daily unique visitors table example:
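A schema for this table might look like the following (the table and column names are assumed from the example above; the exact types are illustrative):

```sql
-- Requires the postgresql-hll extension
CREATE EXTENSION IF NOT EXISTS hll;

CREATE TABLE daily_unique_visitors (
    ts              date NOT NULL,
    customer_id     text NOT NULL,
    browser         text NOT NULL,
    location        text NOT NULL,
    unique_visitors hll  NOT NULL DEFAULT hll_empty(),
    PRIMARY KEY (ts, customer_id, browser, location)
);

-- Visitors are folded into the sketch by hashing their IDs:
INSERT INTO daily_unique_visitors VALUES
    ('2019-07-01', 'A', 'Chrome', 'US',
     hll_add(hll_empty(), hll_hash_text('visitor-42')));
```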
We can get results easily and efficiently using SQL:
SELECT hll_cardinality(hll_union_agg(unique_visitors))
FROM daily_unique_visitors
WHERE ts >= '2019-07-01'
  AND ts < '2019-08-01'
  AND customer_id = 'A'
  AND location = 'US';
Using standard relational database optimization techniques like indexing and table partitioning, you can get results fast.
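For instance, an index matching the reporting query's filters might look like this (the index name and schema are assumptions based on the earlier example):

```sql
-- Composite index covering the query's equality filters plus the date range
CREATE INDEX idx_duv_customer_location_ts
    ON daily_unique_visitors (customer_id, location, ts);
```

Partitioning the table by month on ts would similarly keep long date-range scans confined to the relevant partitions.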
To measure the improvements we did a proof-of-concept. We replicated 3 months of data in RDS and compared the HBase and RDS results.
Long date-range queries spanning 3 months used to take more than 60 seconds in HBase; RDS returned results in less than 10 seconds.
HBase and the Thrift layer require maintenance – installation, monitoring, troubleshooting, development, etc. – all of which is already taken care of with Amazon Aurora, since it's a managed service.
After evaluation, we decided to move to RDS, but it had its own challenges and limitations as explained below.
We created a Python library to read, write, count, and perform operations like unions on PostgreSQL-compatible HLLs. This is similar to the Java library. We plan to open-source it soon – stay tuned.
Migrating from HBase to PostgreSQL HLL on Amazon Aurora was a big win for us. We found the PostgreSQL HLL extension a much better fit for storing our HLL reporting metrics, and the benefits of a relational database service with HLL support, like Amazon Aurora, serve our reporting needs to retrieve and present this information efficiently. We highly recommend that you check it out.