Jay Taylor's notes

back to listing index

Finding the top K items in a list efficiently

[web search]
Original source (stevehanov.ca)
Tags: algorithms data-structures heap stevehanov.ca
Clipped on: 2018-07-23
Image (Asset 1/2) alt= I know how to make and sell software online, and I can share my tips with you.
Email | Twitter | LinkedIn | Comics | All articles
< >

Finding the top K items in a list efficiently

Posted seven years ago

Algorithms will always matter. Sure, processor speeds are still increasing. But the problems that we want to solve using those processors are increasing in size faster. People who are dealing with social network graphs, or analyzing twitter posts, or searching images, or solving any of the hundreds of problems in vogue would be wasting time without the fastest possible hardware. But they would sitting around forever if they weren't using the right tools.

That's why I get sad when I see code like this:

# find the top 10 results
results = sorted(results, reverse=True)[:10]

Anything involving a sort will usually take O(nlogn) time, which, when dealing with lots of items, will keep you waiting around for several seconds or even minutes. An O(nlogn) algorithm, for large N, simply cannot be run in realtime when users are waiting.

The Heap

Finding the top K items can be done in O(nlogk) time, which is much, much faster than O(nlogn), using a heap (wikipedia). Or, since I usually end up rewriting everything in C++ eventually, a priority queue.

The strategy is to go through the list once, and as you go, keep a list of the top k elements that you found so far. To do this efficiently, you have to always know the smallest element in this top-k, so you can possibly replace it with one that is larger. The heap structure makes it easy to maintain this list without wasting any effort. It is like a lazy family member who always does the absolute minimum amount of work. It only does enough of the sort to find the smallest element, and that is why it is fast.

Here's some code to demonstrate the difference between a linear search, and a heap search to find the top K elements in a large array. The heap search is 4 times faster, despite the test being biased in favour of the linear search. The linear search ends up executing in compiled C inside python itself, while the heap search is completely in interpreted python. If they were both in C, the difference in performance would be more pronounced.

#!/usr/bin/python
import heapq
import random
import time

def createArray():
    array = range( 10 * 1000 * 1000 )
    random.shuffle( array )
    return array

def linearSearch( bigArray, k ):
    return sorted(bigArray, reverse=True)[:k]

def heapSearch( bigArray, k ):
    heap = []
    # Note: below is for illustration. It can be replaced by 
    # heapq.nlargest( bigArray, k )
    for item in bigArray:
        # If we have not yet found k items, or the current item is larger than
        # the smallest item on the heap,
        if len(heap) < k or item > heap[0]:
            # If the heap is full, remove the smallest element on the heap.
            if len(heap) == k: heapq.heappop( heap )
            # add the current element as the new smallest.
            heapq.heappush( heap, item )
    return heap

start = time.time()
bigArray = createArray()
print "Creating array took %g s" % (time.time() - start)

start = time.time()
print linearSearch( bigArray, 10 )    
print "Linear search took %g s" % (time.time() - start)

start = time.time()
print heapSearch( bigArray, 10 )    
print "Heap search took %g s" % (time.time() - start)
Creating array took 7.15145 s
[9999999, 9999998, 9999997, 9999996, 9999995, 9999994, 9999993, 9999992, 9999991, 9999990]
Linear search took 10.9981 s
[9999990, 9999992, 9999991, 9999994, 9999993, 9999998, 9999997, 9999996, 9999999, 9999995]
Heap search took 2.66371 s

Also, if you see stuff like this, you should go directly to the wikipedia page on the Selection Algorithm

# find the median
median = sorted(results)[len(results)/2]

My Heap in Javascript: Github

Steve Hanov makes a living working on Rhymebrain.com, PriceMonkey.ca, www.websequencediagrams.com, and Zwibbler.com. He lives in Waterloo, Canada.
Post comment
edit
Yesudeep Mangalapilly
four years ago
Corrections:

In [7]: n = 2**20

In [8]: n

Out[8]: 1048576

In [9]: k = 10

In [10]: n * math.log(k, 2)

Out[10]: 3483294.074024611

In [11]: n + k * math.log(n, 2)

Out[11]: 1048776.0

In [12]: n + k * math.log(k, 2)

Out[12]: 1048609.2192809489

That's far fewer comparisons for O(n) + O(k lg n).

edit
Yesudeep Mangalapilly
four years ago
Namaste,

Heapify (bottom-up) in place [O(n)] + pop k items [O(k lg n)]

(what heapq.nLargest does) should grow slower than

what has been implemented here. Logarithms dampen

the effect of a number, so it's always better to have the

larger value as an argument to logarithm and smaller coefficients.

Therefore,

O(k lg n) is likely to beat O(n lg k) most of the time

because k is generally very small and n is large.

In [2]: 10 * math.log(2**20, 2)

Out[2]: 200.0

In [3]: 2**20 * math.log(10, 2)

Out[3]: 3483294.074024611

That's far fewer comparisons for O(k lg n).

For this reason, I wouldn't prefer the approach implemented

here if I have a bottom up heap construction function like heapify

available.

edit
Joshua Grigonis
five years ago
Coincidentally, I just wrote some code that needed to do this, on the same day this popped on reddit/r/programming. When I wrote my algorithm, I didn't even have to step back and go "EWWWW! that smells", I knew I was doing it inefficiently.

Luckily, the longest list I'm dealing with is less than 500 items, and should generally be less than 100, but I may implement this algorithm instead.

edit
Oleg
six years ago
I think it would be better to use heapq.heapreplace instead of heappop followed by heappush.
edit
Krishna
six years ago
No because priority queues size is k.
edit
Iftakharul Islam
seven years ago
Isn't it O(klogn). k times rebuild the heap (logn)
edit
BlueRaja
seven years ago
Of course, the first code sample takes less time and thought to write.

Unless the list contains millions of elements or is called 100's of times a second, I would use the first method - we can always change it to use the second later when and if profiling determines this to be a bottleneck.

edit
Brendan Miller
seven years ago
In C++, rather than std::priority_queue it's easier to use either:

1. std::nth_element: selection algorithm in O(n)

2. std::partial_sort: partial sort algorithm in O(n * log k)

priority queues are good if you have a stream of incoming values, say from disk or the network, and you don't want to store all n in memory at once.

edit
Mariano Chouza
seven years ago
You can do better by using the k-th element selection algorithm, getting an O(n + k log k) algorithm if you care about the order of the top items and O(n) if you don't.

en.wikipedia.org/wiki/Selection_algorithm#Application_of_simple_selection_algorithms

edit
Satoru
seven years ago
Thanks for introducing the heapq module :)

BTW, I just discovered that this module also includes two functions that are intended for the job as described here - finding the largest/smallest n items in a list.

Then I try the heapq.nlargest function on `bigArray`, and it turns out about 1 second slower than your code.

edit
Ok, next time, i'll set an editing password, sorry ;)
seven years ago
pastebin.com/10eL2d4V
edit
I don't like if-statements
seven years ago
if-statements are expensive!

try this:

def faster_heap_search(bigArray, k):

heap = bigArray[:k]

for item in bigArray:

if item > heap[0]:

heapq.heappop(heap)

heapq.heappush(heap, item)

return sorted(heap)

otherwise, nice post.

but always remember: premature optimization is the root of all evil!

edit
Dan
seven years ago
The heap results aren't sorted like the linear results. Just to be complete, I think the function should return sorted(heap). After all, it might take a millisecond to sort 10 numbers. ;-)

As a casual pythoner, I appreciate a post like this, where a cool feature I didn't know about is explored.

How I run my business selling software to Americans

Here's what you can do to get the most out of your business in Canada if all of your revenue comes in US dollars.

Type-checked CoffeeScript with jzbuild

If your project is a combination of javascript and coffeescript, and you want static type checking, this tool will make it easy.

The Curious Complexity of Being Turned On

In software, the simplest things can turn into a nightmare, especially at a large company.

Why don't web browsers do this?

Why don't web pages start as fast as this computer from 1984?

Yes, You Absolutely Might Possibly Need an EIN to Sell Software to the US

After many months, your software sale is complete! You've got a purchase order, sent the invoice, delivered the software. You're already handling some support issues from users at BigCorp. Then BANG! Martha from Procurement emails back, as a favour, just to let you know that BigCorp has not received your W8 form with a valid tax id, and therefore will be withholding 30% of the purchase price of your multi-thousand dollar product for taxes.

How QBASIC almost got me killed

The day arrived when my project was ready to be unleashed upon the world. I waited until the teacher was hovering nearby and then I started my application, running the FORMAT command on the network drive. Some classmates were watching the screen and she hurried over to see what all the fuss was about.

Finding Bieber: On removing duplicates from a set of documents

Using a locality sensitive hash, you can mark duplicates in millions of items in no time.

Comment spam defeated at last

For years when running this blog, I would have to log in each day and delete a dozen comments due to spam. This was a chore, and I tried many ways to stem the tide.

Rules for Effective C++

The rules for safe C++ code are surprisingly controversial.