Wednesday, July 23, 2025

Comparing a red-black tree to a B-tree

In an earlier blog post we found that optimizing the memory layout of a red-black tree does not seem to work. A different way of implementing an ordered container is to use a B-tree. It was originally designed to be used for on-disk data. The design principle was that memory access is "instant" while disk access is slow. Nowadays this applies to memory access as well, as cache hits are "instant" and uncached memory is slow.

I implemented a B-tree in Pystd. Here is how all the various containers compare. For test data we used numbers from zero to one million in a random order.

As we can see, an unordered map is massively faster than any ordered container. If your data does not need to be ordered, that is the one you should use. For ordered data, the B-tree is clearly faster than either red-black tree implementation.

Tuning the B-tree

B-trees have one main tunable parameter, namely the spread factor of the nodes. In the test above it was five, but for on disk purposes the recommended value is "in the thousands". Here's how altering the value affects performance.

The sweet spot seems to be in the 256-512 range, where the operations are 60% faster than standard set. As the spread factor grows towards infinity, the B-tree reduces to just storing all data in a single sorted array. Insertion into that is an O(N^2) algorithm as can be seen here.

Getting weird

The B-tree implementation has many assert calls to verify the internal structures. We can compile the code with -DNDEBUG to make all those asserts disappear. Removing redundant code should make things faster, so let's try it.

There are 13 measurements in total and disabling asserts (i.e. enabling NDEBUG) makes the code run slower in 8 of those cases. Let's state that again, since it is very unexpected: in this particular measurement code with assertions enabled runs faster than the same code without them. This should not be happening. What could be causing it?

I don't know for sure, so here is some speculation instead.

First of all a result of 8/13 is probably not statistically significant to say that enabling assertions makes things faster. OTOH it does mean that enabling them does not make the code run noticeably slower. So I guess we can say that both ways of building the code are approximately as fast.

As to why that is, things get trickier. Maybe GCC's optimizer is just really good at removing unnecessary checks. It might even be that the assertions give the compiler more information so it can skip generating code for things that can never happen. I'm not a compiler engineer, so I'll refrain from speculating further, it would probably be wrong in any case.

Saturday, July 5, 2025

Deoptimizing a red-black tree

An ordered map is typically slower than a hash map, but it is needed every now and then. Thus I implemented one in Pystd. This implementation does not use individually allocated nodes, but instead stores all data in a single contiguous array.

Implementing the basics was not particularly difficult. Debugging it to actually work took ages of staring at the debugger, drawing trees by hand on paper, printfing things out in Graphviz format and copypasting the output to a visualiser. But eventually I got it working. Performance measurements showed that my implementation is faster than std::map but slower than std::unordered_map.

So far so good.

The test application creates a tree with a million random integers. This means that the nodes are most likely in a random order in the backing store and searching through them causes a lot of cache misses. Having all nodes in an array means we can rearrange them for better memory access patterns.

I wrote some code to reorganize the nodes so that the root is at the first spot and the remaining nodes are stored layer by layer. In this way the query always processes memory "left to right" making things easier for the branch predictor.

Or that's the theory anyway. In practice this made things slower. And not even a bit slower, the "optimized version" was more than ten percent slower. Why? I don't know. Back to the drawing board.

Maybe interleaving both the left and right child node next to each other is the problem? That places two mutually exclusive pieces of data on the same cache line. An alternative would be to place the entire left subtree in one memory area and the right one in a separate one. Thinking about this for a while, this can be accomplished by storing the nodes in tree traversal order, i.e. in numerical order.

I did that. It was also slower than a random layout. Why? Again: no idea.

Time to focus on something else, I guess.

Nibble Stew