Preshing on Programming

Twitter
RSS

Feb 01, 2016

New Concurrent Hash Maps for C++

A map is a data structure that maps a collection of keys to a collection of values. It’s a common concept in computer programming. You typically manipulate maps using functions such as find, insert and erase.

A concurrent map is one that lets you call some of those functions concurrently – even in combinations where the map is modified. If it lets you call insert from multiple threads, with no mutual exclusion, it’s a concurrent map. If it lets you call insert while another thread is calling find, with no mutual exclusion, it’s a concurrent map. Other combinations might be allowed, too. Traditional maps, such as std::map and std::unordered_map, don’t allow that.

Today I’m releasing Junction, a C++ library that contains several new concurrent maps. It’s BSD-licensed, so you can use the source code freely in any project, for any purpose.

On my Core i7-5930K, Junction’s two fastest maps outperform all other concurrent maps.

They come in three flavors:

Junction’s Linear map is similar to the simple lock-free hash table I published a while ago, except that it also supports resizing, deleting entries, and templated key/value types. It was inspired by Cliff Click’s non-blocking hash map in Java, but has a few differences.
Junction’s LeapFrog map is similar to Linear, except that it uses a probing strategy loosely based on hopscotch hashing. This strategy improves lookup efficiency when the table is densely populated. LeapFrog scales better than Linear because it modifies shared state far less frequently.
Junction’s Grampa map is similar to LeapFrog, except that at high populations, the map gets split into a set of smaller, fixed-size LeapFrog tables. Whenever one of those tables overflows, it gets split into two new tables instead of resizing the entire map.

Junction aims to support as many platforms as possible. So far, it’s been tested on Windows, Ubuntu, OS X and iOS. Its main dependencies are CMake and a companion library called Turf. Turf is an abstraction layer over POSIX, Win32, Mach, Linux, Boost, C++11, and possibly other platform APIs. You configure Turf to use the API you want.

Using Junction Maps

Instantiate one of the class templates using integer or raw pointer types.

typedef junction::ConcurrentMap_Grampa<turf::u64, Foo*> ConcurrentMap;
ConcurrentMap myMap;

Each map exposes functions such as get, set, exchange and erase. These functions are all atomic with respect to each other, so you can call them from any thread at any time. They also provide release and consume semantics implicitly, so you can safely pass non-atomic information between threads.

myMap.set(14, new Foo);
Foo* foo = myMap.get(14);
foo = myMap.exchange(14, new Foo);
delete foo;
foo = myMap.erase(14);
delete foo;

Out of all possible keys, a null key must be reserved, and out of all possible values, null and redirect values must be reserved. The defaults are 0 and 1. You can override those defaults by passing custom KeyTraits and ValueTraits parameters to the template.

Safe Memory Reclamation

All Junction maps rely on a form of safe memory reclamation known as QSBR, or quiescent state-based memory reclamation. QSBR could be described as a primitive garbage collector.

If it seems odd to perform garbage collection in C++, keep in mind that scalable concurrent maps are already prevalent in Java, an entirely garbage-collected language. That’s no coincidence. Garbage collection allows you to sidestep locks, especially during read operations, which greatly improves scalability. Not even a read-write lock is necessary. You can certainly write a concurrent map in C++ without garbage collection, but I doubt it will scale as well as a Junction map.

To make QSBR work, each thread must periodically call junction::DefaultQSBR.update at a moment when that thread is quiescent – that is, not in the middle of an operation that uses the map. In a game engine, you could call it between iterations of the main loop.

Dynamically Allocated Values

Junction maps use QSBR internally, but you must still manage object lifetimes yourself. The maps don’t currently support smart pointers.

If you’re storing dynamically allocated objects, you’ll often want to check for existing entries in the table before inserting a new one. There are a couple of ways to do that. One way is to create objects optimistically, then detect racing inserts using exchangeValue.

ConcurrentMap::Mutator mutator = myMap.insertOrFind(14);
Foo* value = mutator.getValue();
if (!value) {
    value = new Foo;
    Foo* oldValue = mutator.exchangeValue(value);
    if (oldValue)
        junction::DefaultQSBR.enqueue(&Foo::destroy, oldValue);
}

The above code uses a Mutator, which is like a pointer to a single entry in the map. First, insertOrFind creates an entry if one doesn’t already exist. Then, if two threads race to insert the same key, the second thread will garbage-collect the object created by the first.

Another way is to prevent such collisions entirely using double-checked locking. This approach guarantees that only one object will ever be created for a given key.

Foo* value = myMap.get(14);
if (!value) {
    turf::LockGuard<turf::Mutex> lock(someMutex);
    ConcurrentMap::Mutator mutator = myMap.insertOrFind(14);
    value = mutator.getValue();
    if (!value) {
        value = new Foo;
        mutator.setValue(value);
    }
}

Development Status

You should consider this alpha software. All of the code is experimental. I spent a lot of effort to get it right, and it passes all the tests I’ve thrown at it, but you never know – bugs might still lurk under the surface. That’s part of the reason why I’m releasing the code for free. Readers of this blog have proven quite good at finding obscure bugs. I hope you’ll subject Junction to your harshest scrutiny.

If you’d like to contribute to the project in other ways, here are a few suggestions:

Porting Turf to additional platforms
Further optimization
Searching the repository for FIXME comments
Identifying missing functionality that would be useful

To leave feedback, simply post a comment below, open an issue on GitHub, or use my direct contact form.

Tweet

38

« You Can Do Any Kind of Atomic Read-Modify-Write Operation A Resizable Concurrent Map »

Comments (15)

Rob · 2 weeks ago

This is very cool! I'm always on the look out for better hash table implementations in C++. I'm currently using the concurrent cuckoo hash map ( https://github.com/efficient/libcuckoo ) in a few of our projects. I was wondering if you have any idea where that implementation falls in terms of the concurrent maps you've tested above.

Reply
2 replies · active 2 weeks ago

Jeff Preshing · 2 weeks ago

Never heard of it until yesterday. I've added it to the test suite. It seems to be pretty darn good.

Reply

Rob · 2 weeks ago

Great; thanks for adding this, Jeff! Yes, I've been very happy with libcuckoo so far. It scales well, is easy to use, and has some nice functionality (e.g. their "upsert()" function that makes concurrent insertion and updating nice and easy). However, it still seems like 2 of the 3 Junction maps are coming out on top. I look forward to testing Junction out when I have a chance!

Reply

petke · 2 weeks ago

That scaling diagram is interesting. The sceptic in me think it looks like the tbb map will keep scaling linearly with more cores while your fastest map looks like its hitting a peak with a couple of more cores. Any comments?

Reply
1 reply · active 2 weeks ago

Jeff Preshing · 2 weeks ago

I noticed that too. As throughput increases, Junction hits some codepaths that modify shared state more often. It's an area of possible optimization for Junction.

[Edit: I updated the post so that it doesn't claim to be "more scalable" anymore. 6 cores isn't enough to make that claim.]

Reply

Konstantin Gontarenko · 2 weeks ago

Here's one more Windows adapter: http://const.me/tmp/MapAdapter_ATL.zip
Results on i5-4460: http://const.me/tmp/versus-atl-map.png

Reply

alexf2000 · 2 weeks ago

It would be much more useful if graph that you produced could use some "absolute" units instead of operations per second. Judging that you got about 10M operations from regular std:map, your map is only 4 times faster. I am also working on map-trie-like structure which in my tests is about 25-30 times faster than std:map but my structure works with strings instead of ints. So now, suppose I want to know if my code is actually faster than yours or vise versa. From your results I cannot really tell it, I have to adapt your code to strings and run it on my test machine to find the answer.

Reply

AlexM · 2 weeks ago

Nice work. Can you show scale to beyond a few cores? For example, up to 64 cores, or more?

Reply
2 replies · active 2 weeks ago

Jeff Preshing · 2 weeks ago

I'd like to, but don't have access to such a machine. I suspect I would end up tweaking some of the constants at higher core counts, among other things.

Reply

Adam Błaszkiewicz · 2 weeks ago

You might want to try Amazon AWS / DigitalOcean / whatever. You can rent an instance with many cores for just few hours, so it should be very cheap.

Reply

Mateusz · 2 weeks ago

What kind of concurrency guarantees does your data structure provide, e. g. is it linearizable ( https://en.wikipedia.org/wiki/Linearizability )? Some specific questions that come to mind around that - if thread A performs insert("?", 42) and communicates with thread B, can we assume that if thread B then calls find("?"), it will get answer 42?

Reply
3 replies · active 20 hours ago

Jeff Preshing · 2 weeks ago

These maps are not linearizable because different threads can observe modifications in a different order, particularly on a weakly-ordered machine like ARM.

For the second question, it's not clear what you mean by "communicates with", but if Thread A calls insert(7, 42), and Thread B calls get(7), then Thread B will either see 42 or 0 (the default null value), assuming no other threads modify the value at 7. But if there is already a happens-before relation between the insert() and the get() somewhere else, for example by checking a guard variable or synchronizing on a semaphore or event, then get() will only return 42.

In short, you get the same behavior as if you had a giant array of atomic values in memory, and the keys are indices into the array, with store-release and load-consume semantics only.

Reply

Mateusz · 2 weeks ago

Hi Jeff, thanks! I think the second property is very important and from what you describe means your map may be linearizable. Specifically I would like insert / find to be linearization points. That is different from being sequentially consistent, i.e. all threads observing order of operations to be the same, which I don't care that much for since in practice it means putting a lock there.

Reply

Jeff Preshing · 20 hours ago

Hi Mateusz. The Wikipedia article you linked to is flawed. It currently states that linearizability is the same thing as atomicity, but that's not true. For example, the x86 MOV instruction is atomic (when naturally-aligned), but not linearizable. Linearizability and atomicity are only equivalent under specific conditions (eg. when your class methods use sequentially consistent atomics and have no data races).

For your needs, do you think linearizability is required? Or is atomicity sufficient? In my own experience, the most I've ever needed was atomicity.

Reply

Francesco · 2 weeks ago

If it coul help, i've posted a link at https://groups.google.com/forum/#!topic/mechanica...

Reply

Post a new comment

Comment as a Guest, or login:

Login to IntenseDebate

Login to WordPress.com

Login to Twitter

Name

Email

Website (optional)

Displayed next to your comments.

Not displayed publicly.

If you have a website, link to it here.

Submit Comment

Subscribe to

Recent Posts

A Resizable Concurrent Map

New Concurrent Hash Maps for C++

You Can Do Any Kind of Atomic Read-Modify-Write Operation

Safe Bitfields in C++

Semaphores are Surprisingly Versatile

C++ Has Become More Pythonic

Fixing GCC's Implementation of memory_order_consume

How to Build a GCC Cross-Compiler

How to Install the Latest GCC on Windows

My Multicore Talk at CppCon 2014

The Purpose of memory_order_consume in C++11

What Is a Bitcoin, Really?

Bitcoin Address Generator in Obfuscated Python

Acquire and Release Fences Don't Work the Way You'd Expect

Double-Checked Locking is Fixed In C++11

Acquire and Release Fences

The Synchronizes-With Relation

The Happens-Before Relation

Atomic vs. Non-Atomic Operations

The World's Simplest Lock-Free Hash Table

Tip Jar

If you like this blog, leave a tip!

Copyright © 2016 Jeff Preshing - Powered by Octopress

Powered by Evernote Publisher