13.3Data Engineering

HyperLogLog

Estimate the cardinality (count of distinct elements) of massive datasets using only a few bytes of memory. Watch how leading zeros in hash values enable probabilistic counting.

Presets

Registers:

Add Element

Recent Elements0 / 100

No elements processed yet. Press Play or Step to begin.

Estimation vs Actual

HyperLogLog Estimate0

Actual Distinct Count0

Error0.0%

Error Over Time

100%50%0%

Error will appear here as elements are processed

Memory Comparison

HyperLogLog

12 B

16 registers x 6 bits

Exact (HashSet)

0 B

~20 bytes x 0 elements

1.0x

Elements Processed0

Estimated Cardinality0

Actual Cardinality0

Error %0.0%

HLL Memory12 B

Exact Memory0 B

Understanding HyperLogLog

How It Works

HyperLogLog estimates the number of distinct elements in a stream by exploiting a statistical property of hash functions: in a random binary string, the probability of seeing k leading zeros is 1/2^k. If the longest run of leading zeros observed is k, the estimated number of distinct elements is approximately 2^k.

Algorithm Steps

1. HashEach input element is hashed to a 32-bit binary string.

2. BucketThe first 4 bits determine which of the 16 registers (buckets) to use.

3. CountCount the position of the first 1-bit in the remaining 28 bits (leading zeros + 1).

4. UpdateStore the maximum value seen for each register. Only update if the new count exceeds the stored value.

5. EstimateCombine all registers using harmonic mean to produce the cardinality estimate with bias correction.

Estimation Formula

E = alpha_m * m^2 * (SUM 2^(-M[j]))^(-1)

Where alpha_m is a bias correction constant, m is the number of registers, and M[j] is the value stored in register j.

Key InsightHyperLogLog can count billions of distinct elements using only ~1.5 KB of memory (with 16,384 registers), achieving a standard error of about 0.81%. Redis, BigQuery, Presto, and many databases use HyperLogLog internally for COUNT(DISTINCT) operations. The algorithm was published by Philippe Flajolet et al. in 2007 and has become a cornerstone of big data analytics.

Register Count vs Accuracy

Registers (m)	Memory	Std Error	Use Case
16	12 B	26%	Demo / Learning
64	48 B	13%	Low precision
256	192 B	6.5%	Moderate precision
1,024	768 B	3.25%	Good precision
16,384	12 KB	0.81%	Production (Redis)

Real-World Applications

Redis provides the PFADD and PFCOUNT commands for HyperLogLog. Google BigQuery uses HyperLogLog++ (an improved variant) for APPROX_COUNT_DISTINCT. Apache Spark, Presto, and Druid all implement HyperLogLog for fast approximate distinct counts on massive datasets. Common use cases include counting unique visitors, unique search queries, unique IP addresses, and distinct event types in streaming analytics.