Jay Taylor's notes

back to listing index

axiomhq/hyperloglog

[web search]
Original source (github.com)
Tags: golang go hyperloglog github.com
Clipped on: 2017-08-21

Skip to content
  • You have unread notifications
  • Create new…
  • Image (Asset 1/4) alt=

    An improved version of HyperLogLog for the count-distinct problem, approximating the number of distinct elements in a multiset using 20-50% less space than other usual HyperLogLog implementations.

    This work is based on "Better with fewer bits: Improving the performance of cardinality estimation of large data streams - Qingjun Xiao, You Zhou, Shigang Chen".

    Implementation

    The core differences between this and other implementations are:

    • use metro hash instead of xxhash
    • sparse representation for lower cardinalities (like HyperLogLog++)
    • loglog-beta for dynamic bias correction medium and high cardinalities.
    • 4-bit register instead of 5 (HLL) and 6 (HLL++), but most implementations use 1-byte registers out of convenience

    In general it borrows a lot from InfluxData's fork of Clark Duvall's HyperLogLog++ implementation, but uses 50% less space.

    Results

    A direct comparison with the HyperLogLog++ implementation used by InfluxDB yielded the following results:

    Exact Axiom (8.2 KB) Influx (16.39 KB)
    10 10 (0.0% off) 10 (0.0% off)
    50 50 (0.0% off) 50 (0.0% off)
    250 250 (0.0% off) 250 (0.0% off)
    1250 1249 (0.08% off) 1249 (0.08% off)
    6250 6250 (0.0% off) 6250 (0.0% off)
    31250 31008 (0.7744% off) 31565 (1.0080% off)
    156250 156013 (0.1517% off) 156652 (0.2573% off)
    781250 782364 (0.1426% off) 775988 (0.6735% off)
    3906250 3869332 (0.9451% off) 3889909 (0.4183% off)
    10000000 9952682 (0.4732% off) 9889556 (1.1044% off)

    Note

    A big thank you to Prof. Shigang Chen and his team at the University of Florida who are actively conducting research around "Big Network Data".