Estimate the cardinality (count of distinct elements) of massive datasets using only a few bytes of memory. Watch how leading zeros in hash values enable probabilistic counting.
HyperLogLog estimates the number of distinct elements in a stream by exploiting a statistical property of hash functions: in a random binary string, the probability of seeing k leading zeros is 1/2^k. If the longest run of leading zeros observed is k, the estimated number of distinct elements is approximately 2^k.
Where alpha_m is a bias correction constant, m is the number of registers, and M[j] is the value stored in register j.
| Registers (m) | Memory | Std Error | Use Case |
|---|---|---|---|
16 | 12 B | 26% | Demo / Learning |
64 | 48 B | 13% | Low precision |
256 | 192 B | 6.5% | Moderate precision |
1,024 | 768 B | 3.25% | Good precision |
16,384 | 12 KB | 0.81% | Production (Redis) |
Redis provides the PFADD and PFCOUNT commands for HyperLogLog. Google BigQuery uses HyperLogLog++ (an improved variant) for APPROX_COUNT_DISTINCT. Apache Spark, Presto, and Druid all implement HyperLogLog for fast approximate distinct counts on massive datasets. Common use cases include counting unique visitors, unique search queries, unique IP addresses, and distinct event types in streaming analytics.