How much overhead does std::mutex add in the single-threaded case for std::map? Because its kinda curious that this new map is more than twice as fast as std::map even on single thread.
Overhead of std::mutex depends a lot on contention patterns. Assuming one x86 CPU socket, usually you can take + release an ideally optimized contended mutex about 10M-25M times per second, if that's all you do, 100% CPU usage on all cores. That's simply limited by how many times you can run contested LOCK XADD (x86 atomic fetch-and-add) / LOCK CMPXCHG* (x86 atomic compare-and-swap) instructions per second.
Mutex generates a lot of cache coherency traffic and saturates the CPU internal ring bus. NUMA case it'll of course quickly saturate lower bandwidth CPU external QPI link(s).
Contested mutexes perform a lot better on a typical laptop than on a server.
good analysis, but you probably meant "ideally optimized uncontended mutex".
To be exact, I meant code that just repeatedly locks and releases same mutex on all CPU cores.
(Counting is a bit tricky, atomic counter on each lock would invalidate the results. Best to only update the counter, say, on every 1000th result or to use some parallel counting library.)
OK, now I see what you are saying. It would certainly be true for a spin lock, but it seems to me that, for std::mutex, the kernel transition required to arbitrate contention would dominate even on the coherency traffic, right?
Yeah, I was talking about spinlocks.
You're right that context switches would dominate std::mutex. For some reason I was thinking std::mutex uses spinlocks, even though I knew better. Too much kernel side coding lately - context switches are often not possible there, so mutexes are often not possible as potential synchronization mechanisms.
You might still be right. The implementation of std::mutex is unspecified, and real world implementations will probably spin a bit before going into the kernel, so for your test case with a tiny critical section it might behave like a spinlock.
ANSI C++, section 23.4.2 and 23.4.4, doesn't specify the implementation only the complexity requirements.
Any C++ implementation is free to choose their std::map implementation as long as it meets the requirements, it doesn't say anywhere that a RB Tree is required.
libstdc++, libc++, and MSVC's STL all implement it as a red-black tree.
Even if it's not required to be a red-black tree, it is de facto a red-black tree on every major compiler.
Moreover, this doesn't really change the parent's point. It must be some sort of ordered associative container, meaning that it's not going to have the performance characteristics of a hash table.
Clang, gcc and MSVC are hardly "every major compiler", as there are many others to choose from specially in the embedded systems, real time OSes, classical commercial UNIX and mainframes.
We don't have always the luxury of choosing which compiler to use.
Relying on implementation details of the compiler or provided library is the first trap to writing portable code across OSes and compiler vendors.
I recall hearing/reading that there are sections of the standard which mostly suggest that a tree is the only valid map data structure. I think something to do with iterators and was mentioned in a CppCon talk or blog article. I will try to dig this up.
Sure, it might be that only a tree based structure is able to fulfill the required complexity, but it doesn't need to be a RB one necessarly.
On programming languages with ISO/ANSI specifications, relying on implementation details is a trap for writing portable code.
Uhm... The absolute Y-position of the curve on the graph is secondary to its shape. Also, the scalability discussion is largely pointless if the sample set covers only 6 CPUs.
Intel TBB scales proportionally, which is precisely what you'd want here. Junction starts to flatten out on 6th CPU, which implies that it has fundamental design issues that crop up at higher CPU counts. Chances are that its performance not only won't scale further, but will actually drop further down the graph.
On other hand TBB code can be put throught some routine code optimization (hand-coded assembly and such) to increase its performance without affecting its linear graph shape.
Based on the information available, you might be right and you might be wrong.
Before one can say anything either way, one needs to profile. Assumptions rarely work when it comes to extracting high runtime performance.
So unless you did profile, your comment didn't really add anything.
> So unless you did profile, your comment didn't really add anything.
The previous comment explained why and how the author's analysis is flawed. Then it goes over top by speculating exactly into the opposite direction. But that doesn't reduce the quality of the first part of the comment. That part was still valuable to me.
Ok, you have a point.
Original author's analysis is flawed, it doesn't extend to multi-socket systems.
What we don't know is how thing scales beyond a single socket. The graph is not going to tell us anything about that. Profiling will.
Experience has shown to me assumptions are bad.
So now I don't assume a certain call will succeed or even work the way I think it does. Instead I check return value of every call and test it against my assumptions.
I never assume much about performance either. I might occasionally use microbenchmarks as a hint. But the main mode of operation is measuring as big of a piece of functionality as possible. Many different size, but realistic, workloads. Preferably on multiple different systems as well.
If performance is the goal, I'd advocate one trying out different concurrent hash map implementations in the system one is building.
> The graph is not going to tell us anything about that. Profiling will.
What kind of profiling would you employ for a concurrent data structure like the hash map here? Instrumented code or sampling profilers?
I'm afraid both kinds of profiling would yield quite meaningless information, because the individual operations are quite fast and the runtime may vary depending on e.g. cache utilization and contention. Profiling is good for bigger applications but it's not super useful for "primitive" operations like hash map inserts.
If I were to optimize something like this, I'd first reach for the CPU performance counters trying to understand what aspect is the bottle neck.
> ... Instrumented code or sampling profilers?
A sampling profiler would be helpful. Although most hash map related samples would likely fall on atomic ops it presumably uses for synchronization. On the other hand, you'd know whether you need to optimize this in the first place.
Instrumented profiler would yield garbage data for a lot of reasons, I wouldn't use that, except maybe over a large group of hash map operations.
> ... I'd first reach for the CPU performance counters
I count CPU performance counters as profiling.
I think you might be right. Maybe I should edit the post and not call it "more scalable", since I only have six cores to test on. But if it does top out at a higher core count, there are several ways it could be optimized.
Either way, Junction is BSD-licensed while TBB is GPL/$, so I hope it'll find a use somewhere.
Interesting and impressive performance. The requirement to periodically halt every thread seemed like a bit of a downer though. Am I reading that wrong?
That seems to be the case, I do wish they'd made this requirement clear up front. There is no getting around sychronised quiescence being a blocking event, but in this case they essentially hope that either 1) you're already using a kind of scatter gather thread model like the mentioned game example - which implies an iterative discrete world, or 2) the set of threads (or tasks) interacting with the collection is simply bounded and sychronised, and/or 3) the performance is still better in aggregate even with infrequent world blocking events.
Typically QSBR algorithms don't require blocking the world, or even blocking any single thread. They just require each thread to periodically check in and run a bounded amount of code which amounts to "hey, I'm not currently looking at the map".
Some other background collector thread (which is going to actually delete removed objects) just has to wait until it sees every mutator thread cross a safepoint, at which point it knows that none of those threads could be hanging onto references that have been unlinked from the data structure.
I'd recommend reading some surveys of RCU and SMR algorithms if this stuff is interesting to you.
Threads are not required to call QSBR::update at the same time. They can call it anytime, but they do need to call it.
OK. That's not really clear from the article. It makes the strange statement that the update function must be called "at a moment when each thread is quiescent – that is, not doing anything else." I don't believe there is any way in C++ to have a thread be doing more than one thing at a time, so if a C++ thread calls update it must by inspection be not doing anything else. Maybe you could restate this. I assumed since this had been stated this way it must really mean that all threads must have reached this state, and one thread must call update.
I hope that the usage will be easier than for Intel TBB. If you are not using a mainstream build tool, then Intel TBB is a headache.
Those results look slightly suspicious to me, in that I've seen TBB concurrent_hash_map scale much better than that...
I guess it depends on the workload ratio...
Would have been nice to see a binned/sharded hashmap in the results as well, I've seen pretty good scalability on those as well.