Sketch of the Day: K-Minimum Values

Jay Taylor's notes

back to listing index

Sketch of the Day: K-Minimum Values

[web search]

Original source (research.neustar.biz)

Tags: math algorithms hyperloglog k-minimum-values data-science research.neustar.biz

Clipped on: 2017-04-06

One experiment shows,

The estimated intersection cardinality is: 1325462, true cardinality: 1176471, register size: 46
The estimated intersection error is: 12.664230567519303%
The estimated union cardinality is: 82281685, true cardinality :82352944, register size: 2048
The estimated union error is: 0.08652878274758459%
The jaccard cardinality is: 1848124, true cardinality :1176471
The estimated jaccard error is: 57.09048501833024%

Matt, have you ever evidenced the such as well?

Very insightful comment. I believe the answer lies in the error bounds of your estimator. For instance, suppose you had two sets of data that only had 1 intersecting value, and further suppose the hash of this value is in your KMV sketches for both. In your approach you would end up with a sketch of k = 1 to approximate the intersection size and you have no way of improving this estimate. In the approach of the paper you can continually increase k for your original sketches and tighten the error bound of the intersection estimate. So, there are controllable error bounds in the algorithm of the paper while I believe your estimator is not controllable.

We haven’t tried your approach for the above reason. It is curious that you have such a large discrepancy in your example. I suppose I would suggest double checking your code and running the same experiments with increasing k to confirm the error decreases.

Reply
- Have double checked the code, the code is fine; Actually, I think that both the Ke = cardinality (A ∩ B) and K cardinality (A union B) will give individual error bounds for the two estimators and Ke = E ( Jaccard * K ); So basically, I would argue that your example is not correct, since increasing K will also indicate the increase in Ke; and in extreme case where (Ke = 1), both estimators will give quite poor performance;
  
  For low Jaccard value, in my case ( J = (1176471/82352944) ~= 0.014 ), to achieve a good estimate (low ARE for A ∩ B); it means that the Jaccard error should be very small as it will be amplified by (A union B); However, Ke will be in such cases a reasonable value to achieve reasonable ARE; my point is that it might exist scenarios to balance off the estimator error by choosing one of the estimator based on the Ke / K value and their bounds respectively.
- I’m having a very difficult time understanding your notation so I’ll try and make my point again. In the extreme example above (one overlapping element out of two very large sets) you will have an intersection vector of length 1 (i.e. a capital K of 1) regardless of the size of the registers (k’s) you start with (assuming this overlapping value is in both KMV’s to begin with). If you interpret this intersection vector as a KMV itself, you can never improve the error by changing the original register size. You are effectively using a new KMV with a register size of 1 and there is nothing you can do with the original registers to change this.
  
  In the paper, the estimate of the Jaccard (K/k) is constantly improving with larger k. i.e. 1/1024, 1/2048, 1/4096, etc. so you can hone in on the correct answer by increasing k. Your statement that “both estimators will have poor performance” is not quite true, you can tune the paper’s approach to have increasingly smaller error.
  
  However, it doesn’t seem out of the question that for certain values of k, |A U B |, and |A ∩ B | that your estimator would be more accurate. There are 2 issues I have with this approach though, I don’t think you have nice probabilistic error bounds and I’m not sure you can know when to use which estimator. We’ll have to look into it a bit more. Have you plotted any comparisons between the two?

In the paper, the estimate of the Jaccard (K/k) is constantly improving with larger k. i.e. 1/1024, 1/2048, 1/4096, etc. so you can hone in on the correct answer by increasing k. Your statement that “both estimators will have poor performance” is not quite true, you can tune the paper’s approach to have increasingly smaller error.
[Ryan] In this case, increase in k will also lead to increase in K ( K = E(Jaccard * k), since Jaccard is fixed in such cases); so for extreme case, if K=1 with fixed Jaccard index, increase k will make K increasing as well to have ( K > 1);

Quote:
However, it doesn’t seem out of the question that for certain values of k, |A U B |, and |A ∩ B | that your estimator would be more accurate. There are 2 issues I have with this approach though, I don’t think you have nice probabilistic error bounds and I’m not sure you can know when to use which estimator. We’ll have to look into it a bit more. Have you plotted any comparisons between the two?
[Ryan] for the probabilistic error bound for K to estimate |A ∩ B|, we can borrow the error bound in original paper where the ARE is computed as sqrt (2 / pi * (K – 2));

Will get chance to collect data and plot one such comparison then when get chance.

When one set is over 10 times bigger than the other, the error rate can be anywhere between 30% and 200%. Even for two sets with similar size, the error rate is greater than 15%. Is this normal? I am using Jaccard way to estimate the intersection.

It is not surprising to me that your intersection errors could be so high. The accuracy of your estimates will obviously very much depend upon the size of your sketch (K), the relative sizes of the sets your are comparing as well as the amount of overlap the two sets have. Take a look at some of our posts on HLL intersections and the ways we explored those. I would guess you could follow a similar route of exploring intersections as we did for HLL. I would love to see some output!

Reply