Preshing on Programming

Original source (preshing.com)

Feb 01, 2016

New Concurrent Hash Maps for C++

A map is a data structure that maps a collection of keys to a collection of values. It’s a common concept in computer programming. You typically manipulate maps using functions such as find, insert and erase.

A concurrent map is one that lets you call some of those functions concurrently – even in combinations where the map is modified. If it lets you call insert from multiple threads, with no mutual exclusion, it’s a concurrent map. If it lets you call insert while another thread is calling find, with no mutual exclusion, it’s a concurrent map. Other combinations might be allowed, too. Traditional maps, such as std::map and std::unordered_map, don’t allow that.
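For contrast, the conventional way to share a traditional map between threads is to guard every call with a mutex. A minimal sketch (the class and method names here are illustrative, not from any library):

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

// A traditional map made thread-safe the conventional way: every call
// takes the same lock, so operations are serialized, not concurrent.
class LockedMap {
    std::unordered_map<uint64_t, uint64_t> map_;
    mutable std::mutex mutex_;
public:
    void insert(uint64_t key, uint64_t value) {
        std::lock_guard<std::mutex> lock(mutex_);
        map_[key] = value;
    }
    bool find(uint64_t key, uint64_t& valueOut) const {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = map_.find(key);
        if (it == map_.end())
            return false;
        valueOut = it->second;
        return true;
    }
};
```

Every thread contends for the same lock even when all of them are only reading, which is exactly the bottleneck a concurrent map avoids.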

Today I’m releasing Junction, a C++ library that contains several new concurrent maps. It’s BSD-licensed, so you can use the source code freely in any project, for any purpose.

On my Core i7-5930K, Junction’s two fastest maps outperform all other concurrent maps.

They come in three flavors:

  1. Junction’s Linear map is similar to the simple lock-free hash table I published a while ago, except that it also supports resizing, deleting entries, and templated key/value types. It was inspired by Cliff Click’s non-blocking hash map in Java, but has a few differences.

  2. Junction’s LeapFrog map is similar to Linear, except that it uses a probing strategy loosely based on hopscotch hashing. This strategy improves lookup efficiency when the table is densely populated. LeapFrog scales better than Linear because it modifies shared state far less frequently.

  3. Junction’s Grampa map is similar to LeapFrog, except that at high populations, the map gets split into a set of smaller, fixed-size LeapFrog tables. Whenever one of those tables overflows, it gets split into two new tables instead of resizing the entire map.

Junction aims to support as many platforms as possible. So far, it’s been tested on Windows, Ubuntu, OS X and iOS. Its main dependencies are CMake and a companion library called Turf. Turf is an abstraction layer over POSIX, Win32, Mach, Linux, Boost, C++11, and possibly other platform APIs. You configure Turf to use the API you want.

Using Junction Maps

Instantiate one of the class templates using integer or raw pointer types.

typedef junction::ConcurrentMap_Grampa<turf::u64, Foo*> ConcurrentMap;
ConcurrentMap myMap;

Each map exposes functions such as get, set, exchange and erase. These functions are all atomic with respect to each other, so you can call them from any thread at any time. They also provide release and consume semantics implicitly, so you can safely pass non-atomic information between threads.

myMap.set(14, new Foo);             // insert or overwrite the value at key 14
Foo* foo = myMap.get(14);           // look it up
foo = myMap.exchange(14, new Foo);  // swap in a new value, returning the old one
delete foo;
foo = myMap.erase(14);              // remove the entry, returning the removed value
delete foo;

Out of all possible keys, a null key must be reserved, and out of all possible values, null and redirect values must be reserved. The defaults are 0 and 1. You can override those defaults by passing custom KeyTraits and ValueTraits parameters to the template.

Safe Memory Reclamation

All Junction maps rely on a form of safe memory reclamation known as QSBR, or quiescent state-based memory reclamation. QSBR could be described as a primitive garbage collector.

If it seems odd to perform garbage collection in C++, keep in mind that scalable concurrent maps are already prevalent in Java, an entirely garbage-collected language. That’s no coincidence. Garbage collection allows you to sidestep locks, especially during read operations, which greatly improves scalability. Not even a read-write lock is necessary. You can certainly write a concurrent map in C++ without garbage collection, but I doubt it will scale as well as a Junction map.

To make QSBR work, each thread must periodically call junction::DefaultQSBR.update at a moment when that thread is quiescent – that is, not in the middle of an operation that uses the map. In a game engine, you could call it between iterations of the main loop.
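The mechanics can be sketched with a toy, single-threaded version of the idea (illustrative only — Junction's real implementation tracks multiple threads): destruction requests are queued, and each one runs only after a later quiescent point, when no operation can still hold a pointer to the old object.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Toy QSBR sketch for a single registered thread. Actions enqueued
// during one interval run at the end of the *next* interval, i.e. only
// after the thread has passed a quiescent point since enqueueing.
class ToyQSBR {
    std::vector<std::function<void()>> current_, previous_;
public:
    void enqueue(std::function<void()> action) {
        current_.push_back(std::move(action));
    }
    // Call at a quiescent point (e.g. between main-loop iterations).
    void update() {
        for (auto& action : previous_)
            action();                     // now safe to run
        previous_ = std::move(current_);  // age the newer batch
        current_.clear();
    }
};
```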

Dynamically Allocated Values

Junction maps use QSBR internally, but you must still manage object lifetimes yourself. The maps don’t currently support smart pointers.

If you’re storing dynamically allocated objects, you’ll often want to check for existing entries in the table before inserting a new one. There are a couple of ways to do that. One way is to create objects optimistically, then detect racing inserts using exchangeValue.

ConcurrentMap::Mutator mutator = myMap.insertOrFind(14);
Foo* value = mutator.getValue();
if (!value) {
    value = new Foo;
    Foo* oldValue = mutator.exchangeValue(value);
    if (oldValue)
        junction::DefaultQSBR.enqueue(&Foo::destroy, oldValue);
}

The above code uses a Mutator, which is like a pointer to a single entry in the map. First, insertOrFind creates an entry if one doesn’t already exist. Then, if two threads race to insert the same key, the second thread will garbage-collect the object created by the first.

Another way is to prevent such collisions entirely using double-checked locking. This approach guarantees that only one object will ever be created for a given key.

Foo* value = myMap.get(14);
if (!value) {
    turf::LockGuard<turf::Mutex> lock(someMutex);
    ConcurrentMap::Mutator mutator = myMap.insertOrFind(14);
    value = mutator.getValue();
    if (!value) {
        value = new Foo;
        mutator.setValue(value);
    }
}

Development Status

You should consider this alpha software. All of the code is experimental. I spent a lot of effort to get it right, and it passes all the tests I’ve thrown at it, but you never know – bugs might still lurk under the surface. That’s part of the reason why I’m releasing the code for free. Readers of this blog have proven quite good at finding obscure bugs. I hope you’ll subject Junction to your harshest scrutiny.

If you’d like to contribute to the project in other ways, here are a few suggestions:

  • Porting Turf to additional platforms
  • Further optimization
  • Searching the repository for FIXME comments
  • Identifying missing functionality that would be useful

To leave feedback, simply post a comment below, open an issue on GitHub, or use my direct contact form.

Comments (15)

Rob · 2 weeks ago

This is very cool! I'm always on the lookout for better hash table implementations in C++. I'm currently using the concurrent cuckoo hash map ( https://github.com/efficient/libcuckoo ) in a few of our projects. I was wondering if you have any idea where that implementation falls in terms of the concurrent maps you've tested above.
2 replies · active 2 weeks ago
Never heard of it until yesterday. I've added it to the test suite. It seems to be pretty darn good.

Rob · 2 weeks ago

Great; thanks for adding this, Jeff! Yes, I've been very happy with libcuckoo so far. It scales well, is easy to use, and has some nice functionality (e.g. their "upsert()" function that makes concurrent insertion and updating nice and easy). However, it still seems like 2 of the 3 Junction maps are coming out on top. I look forward to testing Junction out when I have a chance!

petke · 2 weeks ago

That scaling diagram is interesting. The sceptic in me thinks it looks like the tbb map will keep scaling linearly with more cores, while your fastest map looks like it's hitting a peak with a couple more cores. Any comments?
1 reply · active 2 weeks ago
I noticed that too. As throughput increases, Junction hits some codepaths that modify shared state more often. It's an area of possible optimization for Junction.

[Edit: I updated the post so that it doesn't claim to be "more scalable" anymore. 6 cores isn't enough to make that claim.]

alexf2000 · 2 weeks ago

It would be much more useful if the graph you produced used some "absolute" units instead of operations per second. Judging by the roughly 10M operations you got from regular std::map, your map is only 4 times faster. I am also working on a map-trie-like structure which in my tests is about 25-30 times faster than std::map, but my structure works with strings instead of ints. So now, suppose I want to know if my code is actually faster than yours or vice versa. From your results I cannot really tell; I have to adapt your code to strings and run it on my test machine to find the answer.

AlexM · 2 weeks ago

Nice work. Can you show scaling beyond a few cores? For example, up to 64 cores, or more?
2 replies · active 2 weeks ago
I'd like to, but don't have access to such a machine. I suspect I would end up tweaking some of the constants at higher core counts, among other things.
You might want to try Amazon AWS / DigitalOcean / whatever. You can rent an instance with many cores for just a few hours, so it should be very cheap.

Mateusz · 2 weeks ago

What kind of concurrency guarantees does your data structure provide, e.g. is it linearizable ( https://en.wikipedia.org/wiki/Linearizability )? Some specific questions that come to mind around that: if thread A performs insert("?", 42) and communicates with thread B, can we assume that if thread B then calls find("?"), it will get the answer 42?
3 replies · active 20 hours ago
These maps are not linearizable because different threads can observe modifications in a different order, particularly on a weakly-ordered machine like ARM.

For the second question, it's not clear what you mean by "communicates with", but if Thread A calls insert(7, 42), and Thread B calls get(7), then Thread B will either see 42 or 0 (the default null value), assuming no other threads modify the value at 7. But if there is already a happens-before relation between the insert() and the get() somewhere else, for example by checking a guard variable or synchronizing on a semaphore or event, then get() will only return 42.

In short, you get the same behavior as if you had a giant array of atomic values in memory, and the keys are indices into the array, with store-release and load-consume semantics only.
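That analogy might be sketched like this (illustrative only — a fixed-size array, no real hashing or collision handling):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// The "giant array of atomic values" analogy: keys index into an array,
// writes use store-release, reads use load-consume.
constexpr size_t kSlots = 1024;
std::atomic<uint64_t> g_slots[kSlots];  // zero-initialized: 0 means "empty"

void toySet(uint64_t key, uint64_t value) {
    g_slots[key % kSlots].store(value, std::memory_order_release);
}

uint64_t toyGet(uint64_t key) {
    return g_slots[key % kSlots].load(std::memory_order_consume);
}
```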

Mateusz · 2 weeks ago

Hi Jeff, thanks! I think the second property is very important and, from what you describe, it means your map may be linearizable. Specifically, I would like insert / find to be linearization points. That is different from being sequentially consistent, i.e. all threads observing the same order of operations, which I don't care that much about since in practice it means putting a lock there.
Hi Mateusz. The Wikipedia article you linked to is flawed. It currently states that linearizability is the same thing as atomicity, but that's not true. For example, the x86 MOV instruction is atomic (when naturally aligned), but not linearizable. Linearizability and atomicity are only equivalent under specific conditions (e.g. when your class methods use sequentially consistent atomics and have no data races).

For your needs, do you think linearizability is required? Or is atomicity sufficient? In my own experience, the most I've ever needed was atomicity.

Francesco · 2 weeks ago

If it could help, I've posted a link at https://groups.google.com/forum/#!topic/mechanica...



Copyright © 2016 Jeff Preshing - Powered by Octopress