HyperLogLog in Python

We open-sourced a Python library for HLLs compatible with postgresql-hll. 5 minute read What is python-hll? We recently open-sourced python-hll, which is an implementation of HyperLogLog whose goal is to be storage compatible with java-hll, js-hll, and postgresql-hll. At NextRoll, we use these libraries to do fast counting of unique values in PostgreSQL and Java, […]

We open-sourced a Python library for HLLs compatible with postgresql-hll.

5 minute read


What is python-hll?

We recently open-sourced python-hll, which is an implementation of
HyperLogLog whose goal is to be
storage compatible
with java-hll, js-hll,
and postgresql-hll. At NextRoll, we use these libraries to do fast
counting of unique values in PostgreSQL and Java, but we were dismayed that there was no Python library for it.
So we decided to make one.

What are HLLs?

In a previous blog post, we talked about
how we use HLLs in PostgreSQL at NextRoll. As a quick recap, HLLs let you quickly count unique values. You
basically throw a bunch of unique values into a blob called an HLL. Then you can ask the HLL how many unique
values it has:

Adding values to an HLL

Not only that, but you can also union HLL blobs together then ask it how many unique values it has:

Unioning two HLLs together

Here’s how it works using our Python library:

from python_hll.hll import HLL
import mmh3

hll1 = HLL(13, 5)
hll2 = HLL(13, 5)

hll1.add_raw(mmh3.hash('17'))
hll1.add_raw(mmh3.hash('17'))
hll1.add_raw(mmh3.hash('31'))
hll1.add_raw(mmh3.hash('5'))

cardinality = hll1.cardinality()

# Output: 3

hll2.add_raw(mmh3.hash('6'))
hll2.add_raw(mmh3.hash('6'))
hll2.add_raw(mmh3.hash('31'))
hll2.add_raw(mmh3.hash('5'))

cardinality = hll2.cardinality()

# Output: 3

hll1.union(hll2)

cardinality = hll1.cardinality()

# Output: 4

And then you can export the HLL to a string that works with postgresql-hll:

from python_hll.util import NumberUtil
bytes = hll1.to_bytes()
output = "x" + NumberUtil.to_hex(bytes, 0, len(bytes))

# x128D7F00000000201E4B390000000027FA7CC000000000531A35E4000000006E6F0B04

How We Built and Tested It

We built python-hll as part of NextRoll’s June 2019 Hack Week, a time when engineers take some time
to build things that interest them personally. As we only had a week, we thought it would be
expedient to do a fairly literal translation/port of java-hll
to Python.

To minimize changes, we decided to internally represent
bytes as Java-style bytes (-128 to 127) rather than true Python bytes (0 to 255). Issues we ran into
included: how to translate Java’s >>>,
into Python, how to handle integer overflows from <<
identically to Java so the tests pass (actually we were quite surprised that so many integer overflows were happening in
the Java unit tests), and how to make sure the tests pass in both Python 2.7 and Python 3.

Fortunately, postgresql-hll has an extensive set of test data
with 22,614 data points that we were able to use to test our library.
In addition, we also ported the
unit tests
from java-hll to Python.

Unfortunately one drawback to our python-hll library is that it is quite slow compared to java-hll.
For example, in Java, HLLSerializationTest
takes 12 seconds to run, while in Python, the equivalent test_hll_serialization
takes 1.5 hours to run – it’s about 400x slower. It should be okay if a few HLL operations are needed,
but for many HLL operations you may find that it could slow your program down. Test it and see.

Acknowledgements

Thanks to the 7 engineers who translated code and tests from Java to Python for this project at NextRoll’s June 2019 Hack Week:

Source: AdRoll