Sketch of the Day: HyperLogLog — Cornerstone of a Big Data Infrastructure

October 25, 2012 by mattcurcio 41 Comments

Intro

In the Zipfian world of AK, the HyperLogLog distinct value (DV) sketch reigns supreme. This DV sketch is the workhorse behind the majority of our DV counters (and we’re not alone) and enables us to have a real time, in memory data store with incredibly high throughput. HLL was conceived of by Flajolet et. al. in the phenomenal paper HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. This sketch extends upon the earlier Loglog Counting of Large Cardinalities (Durand et. al.) which in turn is based on ~~the seminal AMS work~~ FM-85, Flajolet and Martin’s original work on probabilistic counting. (Many thanks to Jérémie Lumbroso for the correction of the history here. I am very much looking forward to his upcoming introduction to probabilistic counting in Flajolet’s complete works.) UPDATE – Rob has recently published a blog about PCSA, a direct precursor to LogLog counting which is filled with interesting thoughts. There have been a few posts on HLL recently so I thought I would dive into the intuition behind the sketch and into some of the details.

Just like all the other DV sketches, HyperLogLog looks for interesting things in the hashed values of your incoming data. However, unlike other DV sketches HLL is based on bit pattern observables as opposed to KMV (and others) which are based on order statistics of a stream. As Flajolet himself states:

Bit-pattern observables: these are based on certain patterns of bits occurring at the beginning of the (binary) S-values. For instance, observing in the stream S at the beginning of a string a bit- pattern Ben Gimpert (@someben) says:

May 22, 2014 at 8:48 pm

This post was from ages ago, but just in case you’re still keeping an eye on things: There is a little bug in your Jenkin’s hash implementation from “http://findingscience.com/javascript/hashing/memcache/2010/12/28/javascript-implementation-of-jenkins-hash.html”. The right bit-shifts should all be unsigned (>>>).

but i don’t know how the Register Value that is highlighted in red is computed,can you give me some explanation about the computing of the register value.thanks!

Reply
- first:how to computer the theoretical register values
  second:the purpose of the The distribution ?
  best regards!

Thanks for the reference. Interesting idea you have over there. See my comment on your blog for my thoughts.
Thanks!

Union.Bucket[i] = Max(A.Bucket[i], B.Bucket[i])

Why doesn’t taking the min result in an intersection? Logically it seems like it should but I’m sure I’m missing something.

Thanks (:

HyperLogLog is phenomenal Algorithm at the foundation of many BigData applications: Cardinality Count is a very common problem in patterns analysis.