13.3Data Engineering

HyperLogLog

Estimate the cardinality (count of distinct elements) of massive datasets using only a few bytes of memory. Watch how leading zeros in hash values enable probabilistic counting.

Presets
Registers:
Add Element
Recent Elements0 / 100
No elements processed yet. Press Play or Step to begin.
Register Array16 registers x 6 bits = 12 bytes
0
0
0
1
0
2
0
3
0
4
0
5
0
6
0
7
0
8
0
9
0
10
0
11
0
12
0
13
0
14
0
15
Estimation vs Actual
HyperLogLog Estimate0
Actual Distinct Count0
Error0.0%
Error Over Time
100%50%0%
Error will appear here as elements are processed
Memory Comparison
HyperLogLog
12 B
16 registers x 6 bits
Exact (HashSet)
0 B
~20 bytes x 0 elements
1.0x
Elements Processed0
Estimated Cardinality0
Actual Cardinality0
Error %0.0%
HLL Memory12 B
Exact Memory0 B
Understanding HyperLogLog

How It Works

HyperLogLog estimates the number of distinct elements in a stream by exploiting a statistical property of hash functions: in a random binary string, the probability of seeing k leading zeros is 1/2^k. If the longest run of leading zeros observed is k, the estimated number of distinct elements is approximately 2^k.

Algorithm Steps

1. HashEach input element is hashed to a 32-bit binary string.
2. BucketThe first 4 bits determine which of the 16 registers (buckets) to use.
3. CountCount the position of the first 1-bit in the remaining 28 bits (leading zeros + 1).
4. UpdateStore the maximum value seen for each register. Only update if the new count exceeds the stored value.
5. EstimateCombine all registers using harmonic mean to produce the cardinality estimate with bias correction.

Estimation Formula

E = alpha_m * m^2 * (SUM 2^(-M[j]))^(-1)

Where alpha_m is a bias correction constant, m is the number of registers, and M[j] is the value stored in register j.

Key InsightHyperLogLog can count billions of distinct elements using only ~1.5 KB of memory (with 16,384 registers), achieving a standard error of about 0.81%. Redis, BigQuery, Presto, and many databases use HyperLogLog internally for COUNT(DISTINCT) operations. The algorithm was published by Philippe Flajolet et al. in 2007 and has become a cornerstone of big data analytics.

Register Count vs Accuracy

Registers (m)MemoryStd ErrorUse Case
16
12 B26%Demo / Learning
64
48 B13%Low precision
256
192 B6.5%Moderate precision
1,024
768 B3.25%Good precision
16,384
12 KB0.81%Production (Redis)

Real-World Applications

Redis provides the PFADD and PFCOUNT commands for HyperLogLog. Google BigQuery uses HyperLogLog++ (an improved variant) for APPROX_COUNT_DISTINCT. Apache Spark, Presto, and Druid all implement HyperLogLog for fast approximate distinct counts on massive datasets. Common use cases include counting unique visitors, unique search queries, unique IP addresses, and distinct event types in streaming analytics.