Saturday, July 5, 2025

Deoptimizing a red-black tree

An ordered map is typically slower than a hash map, but it is needed every now and then. Thus I implemented one in Pystd. This implementation does not use individually allocated nodes, but instead stores all data in a single contiguous array.

Implementing the basics was not particularly difficult. Debugging it to actually work took ages of staring at the debugger, drawing trees by hand on paper, printfing things out in Graphviz format and copypasting the output to a visualiser. But eventually I got it working. Performance measurements showed that my implementation is faster than std::map but slower than std::unordered_map.

So far so good.

The test application creates a tree with a million random integers. This means that the nodes are most likely in a random order in the backing store and searching through them causes a lot of cache misses. Having all nodes in an array means we can rearrange them for better memory access patterns.

I wrote some code to reorganize the nodes so that the root is at the first spot and the remaining nodes are stored layer by layer. In this way the query always processes memory "left to right" making things easier for the branch predictor.

Or that's the theory anyway. In practice this made things slower. And not even a bit slower, the "optimized version" was more than ten percent slower. Why? I don't know. Back to the drawing board.

Maybe interleaving both the left and right child node next to each other is the problem? That places two mutually exclusive pieces of data on the same cache line. An alternative would be to place the entire left subtree in one memory area and the right one in a separate one. Thinking about this for a while, this can be accomplished by storing the nodes in tree traversal order, i.e. in numerical order.

I did that. It was also slower than a random layout. Why? Again: no idea.

Time to focus on something else, I guess.

Friday, June 20, 2025

Book creation using almost entirely open source tools

Some years ago I wrote a book. Now I have written a second one, but because no publisher wanted to publish it I chose to self-publish a small print run and hand it out to friends (and whoever actually wants one) as a a very-much-not-for-profit art project.

This meant that I had to create every PDF used in the printing myself. I received the shipment from the printing house yesterday. The end result turned out very nice indeed.

The red parts in this dust jacket are not ink but instead are done with foil stamping (it has a metallic reflective surface). This was done with Scribus. The cover image was painted by hand using watercolor paints. The illustrator used a proprietary image processing app to create the final TIFF version used here. She has since told me that she wants to eventually shift to an open source app due to ongoing actions of the company making the proprietary app.

The cover itself is cloth with a debossed emblem. The figure was drawn in Inkscape and then copypasted to Scribus.

Evert fantasy book needs to have a map. This has two and they are printed in the end papers. The original picture was drawn with a nib pen and black ink and processed with Gimp. The printed version is brownish to give it that "old timey" look. Despite its apparent simplicity this PDF was the most problematic. The image itself is monochrome and printed with a Pantone spot ink. Trying to print this with CMYK inks would just not have worked. Because the PDF drawing model for spot inks in images behaves, let's say, in an unexpected way, I had to write a custom script to create the PDF with CapyPDF. As far as I know no other open source tool can do this correctly, not even Scribus. The relevant bug can be found here. It was somewhat nerve wrecking to send this out to the print shop with zero practical experience and a theoretical basis of "according to my interpretation of the PDF spec, this should be correct". As this is the first ever commercial print job using CapyPDF, it's quite fortunate that it succeeded pretty much perfectly.

The inner pages were created with the same Chapterizer tool as the previous book. It uses Pango and Cairo to generate PDFs. Illustrations in the text were drawn with Krita. As Cairo only produces RGB PDFs, as the last step it had to be converted to grayscale using Ghostscript.

Monday, June 16, 2025

A custom C++ standard library part 4: using it for real

Writing your own standard library is all fun and games until someone (which is to say yourself) asks the important question: could this be actually used for real? Theories and opinions can be thrown about the issue pretty much forever, but the only way to actually know for sure is to do it.

Thus I converted CapyPDF, which is a fairly compact 15k LoC codebase from the C++ standard library to Pystd, which is about 4k lines. All functionality is still the same, which is to say that the test suite passes, there are most likely new bugs that the tests do not catch. For those wanting to replicate the results themselves, clone the CapyPDF repo, switch to the pystdport branch and start building. Meson will automatically download and set up Pystd as a subproject. The code is fairly bleeding edge and only works on Linux with GCC 15.1.

Build times

One of the original reasons for starting Pystd was being annoyed at STL compile times. Let's see if we succeeded in improving on them. Build times when using only one core in debug look like this.

When optimizations are enabled the results look like this:

In both cases the Pystd version compiles in about a quarter of the time.

Binary size

C++ gets a lot of valid criticism for creating bloated code. How much of that is due to the language as opposed to the library?

That's quite unexpected. The debug info for STL types seems to take an extra 20 megabytes. But how about the executable code itself?

STL is still 200 kB bigger. Based on observations most of this seems to come from stdlibc++'s implementation of variant. Note that if you try this yourself the Pystd version is probably 100 kB bigger, because by default the build setup links against libsubc++, which adds 100+ kB to binary sizes whereas linking against the main C++ runtime library does not.

Performance

Ok, fine, so we can implement basic code to build faster and take less space. Fine. But what about performance? That is the main thing that matters after all, right? CapyPDF ships with a simple benchmark program. Let's look at its memory usage first.

Apologies for the Y-axis does not starting at zero. I tried my best to make it happen, but LibreOffice Calc said no. In any case the outcome itself is expected. Pystd has not seen any performance optimization work so it requiring 10% more memory is tolerable. But what about the actual runtime itself?

This is unexpected to say the least. A reasonable result would have been to be only 2x slower than the standard library, but the code ended up being almost 25% faster. This is even stranger considering that Pystd's containers do bounds checks on all accesses, the UTF-8 parsing code sometimes validates its input twice, the hashing algorithm is a simple multiply-and-xor and so on. Pystd should be slower, and yet, in this case at least, it is not.

I have no explanation for this. It is expected that Pystd will start performing (much) worse as the data set size grows but that has not been tested.

Friday, June 6, 2025

Custom C++ stdlib part 3: The bleedingest edge variant

Implementing a variant type in C++ is challenging to say the least. I tried looking into the libstd++ implementation and could not even decipher where the actual data is stored. There is a lot of inheritance going on and helper classes that seem to be doing custom vtable construction and other metaprogramming stuff. The only thing I could truly grasp was a comment saying // "These go to eleven". Sadly there was not a comment // Smell my glove! which would seem more suitable for this occasion.

A modern stdlib does need a variant, though, so I had to implement one. To make it feasible I made the following simplifying assumptions.

All types handled must be noexcept default constructible, move constructible and movable. (i.e. the WellBehaved concept)
If you have a properly allocated and aligned piece of memory, placement new'ing into it works (there may be UB-shenanigans here due to the memory model)
The number of different types that a variant can hold has a predefined static maximum value.
You don't need to support any C++ version older than c++26.

The last one of these is the biggest hurdle, as C++26 will not be released for at least a year. GCC 15 does have support for it, though, so all code below only works with that.

The implementation

At its core, a Pystd variant is nothing more than a byte buffer and an index specifying which type it holds:

template<typename T...>

class Variant {

char buf[compute_size<0, T...>()] alignas(compute_alignment<0, T...>());

int8_t type_id;

};

The functions to compute max size and alignment requirements for types are simple to implement. The main problem lies elsewhere, specifically: going from a type to the corresponding index, going from a compile time index value to a type and going from a runtime index to the corresponding type.

The middle one is the simplest of these. As of C++26 you can directly index the argument pack like so:

using nth_type = T...[compile_time_constant];

Going from type to an index is only slightly more difficult:

Going from a runtime value to a type is the difficult one. I don't know how to do it "correctly", i.e. the way a proper stdlib implementation does it. However, since the number of possible types is limited at compile time, we can cheat (currently only 5 types are supported):

Unrolling (type) loops like its 1988! This means you can't have variants with hundreds of different types, but in that case you probably need an architectural redesign rather than a more capable variant.

With these primitives implementing public class methods is fairly simple.

The end result

The variant implementation in Pystd and its helper code took approximately 200 lines of code. It handles all the basic stuff in addition to being exception safe for copy operations (implemented as copy to a local variable + move). Compile times remain in fractions of a second per file even though Pystd only has a single public header.

It works in the sense that you can put different types in to it, switch between them and so on without any compiler warnings, sanitizer issues or Valgrind complaints. So be careful with the code, I have only tested it, not proven it correct.

No performance optimization or even measurements have been made.

Sunday, May 25, 2025

Iterators and adaptors

Previously it was mentioned that Python and C++ do iteration quite differently. Python has "statefull" objects that have a .next() method that returns a new object or throws a StopIteration exception. Incidentally Rust does exactly the same thing, except that it uses an optional type rather than an exception. C++ does not work like this, instead it has a concept of a "start" and "end" of a sequence and the machinery keeps incrementing the start until it reaches the end.

At the end of last post it was said that it is difficult to integrate an object of the former type with C++'s native language facilities, at least without macros.

Now we'll look how to integrate an object of the former type with C++'s native language facilities without any macros.

In fact, it only took fewer than 30 lines of code to do. The caveat being that it is probably fairly easy to break it. But it works for me.

Here it is being used.

There is also a second, similar helper class that takes ownership of the object being iterated over.

Monday, May 19, 2025

Optimizing page splits in books

Earlier in this blog we looked at how to generate a justified block of text, which is nowadays usually done with the Knuth-Plass algorithm. Sadly this is not, by itself, enough to create a finished product. Processing all input text with the algorithm produces one very long column of text. This works fine if you end result is a scroll (of the papyrus variety), but people tend to prefer their text it book format. Thus you need a way to split that text column into pages.

The simplest solution is to decide that a page has some N number of lines and page breaks happen at exact these places. This works (and is commonly used) but it has certain drawbacks. From a typographical point of view, there are at least three things that should be avoided:

Orphan lines, a paragraph at the bottom of a page that has only one line of text.
Widow lines, a paragraph that starts a new page and has only one line of text.
Spread imbalance, where the two pages on a spread have different number of lines on them (when both are "full pages")

Determining "globally optimal" page splits for a given text is not a simple problem, because every pagination choice affects every pagination choice that comes after it. If you stare at the problem long and hard enough, eventually you realize something.

The two problems are exactly the same and can be solved in the same way. I have implemented this in the Chapterizer book generator program. The dynamic programming algorithm for both is so similar that they could, in theory, use shared code. I chose not to do it because sometimes it is just so much simpler to not be DRY. They are so similar, in fact, that a search space optimization that I had to do in the paragraph justification case was also needed for pagination even though the data set size is typically much smaller in pagination. The optimization implementation turned out to be exactly the same for both cases.

The program now creates optimal page splits and in addition prints a text log of all pages with widows, orphans and imbalances it could not optimize away.

Has this been done before?

Probably. Everything has already been invented multiple times. I don't know for sure, so instead I'm going to post my potentially incorrect answer here on the Internet for other people to fact check.

I am not aware of any book generation program doing global page optimization. LibreOffice and Word definitely do not do it. As far as I know LaTeX processes pages one by one and once a page is generated it never returns to it, probably because computers in the early 80s did not have enough memory to hold the entire document in RAM at the same time. I have never used Indesign so I don't know what it does.

Books created earlier than about the mid eighties were typeset by hand and they seem to try to avoid orphans and widows at the cost of some page spreads having more lines than other spreads. Modern books seem to optimize for filling the page fully rather than avoiding widows and orphans. Since most books nowadays are created with Indesign (I think), it would seem to imply that Indesign does not do optimal page splitting or at least it is not enabled by default.

When implementing the page splitter the reasons for not doing global page splitting became clear. Computing both paragraph and page optimization is too slow for a "real time" GUI app. This sort of an optimization really only makes sense for a classical "LaTeX-style" book where there is one main text flow. Layout programs like Scribus and Indesign support richer magazine style layouts. This sort of a page splitter does not work in the general case, so in order to use it they would need to have a separate simpler document format.

But perhaps the biggest issue is that if different pages may have different number of lines on each page, it leads to problems in the page's visual appearance. Anything written at the bottom of the page (like page numbers) need to be re-placed based on how many lines of text actually ended up on the page. Doing this in code in a batch system is easy, doing the same with GUI widgets in "real time" is hard.

Saturday, May 3, 2025

Writing your own C++ standard library part 2

This blog post talked about the "self written C++ standard library" I wrote for the fun of it (code here).

The post got linked by Hackernews and Reddit. As is usual the majority of comments did not talk about the actual content but instead were focused on two tangential things. The first one being "this is not a full implementation of the C++ standard library as specified by the ISO standard, therefore the author is an idiot". I am, in actual fact, an idiot, but not due to project scope but because I assumed people on the Internet to have elementary reading comprehension skills. To make things clear: no, this is not an implementation of the ISO standard library. At no point was such a thing claimed. There is little point in writing one of those, there are several high quality implementations available. "Standard library" in this context means "a collection of low level functions and types that would be needed by most applications".

The second discussion was around the fact that calling the C++ standard library "STL" was both wrong and proof that the person saying that does not understand the C++ language. This was followed by several "I am a C++ standard library implementer and everyone I know calls it the STL". Things deteriorated from there.

The complexity question

Existing container implementations are complex by necessity. They need to handle things like types that can not be noexcept moved or copied. The amount of rollback code needed explodes very quickly and needs to be processed every time the header is included (because of templates). A reasonable point against writing your own containers is that eventually you need to support all the same complexity as existing ones because people will insert "bad" types into your container and complain when things fail. Thus you need to have all the same nasty, complex and brittle code as an STL implementation, only lacking decades of hardening to shake out all the bugs.

That is true, but the beauty of having zero existing users is that you can do something like this:

This is basically a concept that requires the type given to be noexcept-movable (and some other things that could arguably be removed). Now, instead of allowing any type under the sun, all containers require all types they get instantiated with to be WellBehaved. This means that the library does not have to care at all about types behaving badly because trying to use those results in a compiler error. A massive amount of complexity is thus eliminated in a single stroke.

There are of course cases when you need to deal with types that are not well behaved. If you can tolerate a memory allocation per object, the simple solution is to use unique_ptr. OTOH if you have types that can't be reliably moved and which are so performance critical that you can't afford memory allocations, then you are probably going to write your own container code anyway. If you are in the position that you have badly behaved types that you can't update, can't tolerate an allocation and can't write your own containers, then that is your problem.

Iterating things the native way

One of the most common things I do with strings in Python is to split them on whitespace. In Python this is simple as there is only one string type, but in native code things are more complicated. What should the return type of mystring.split() be?

vector<string>
vector<string_view>
A coroutine handle
The method should be a full template so you can customize it to output any custom string type (as every C++ code base has at least three of them)
Something ranges something something

There is probably not a single correct answer. All of the options have different tradeoffs. Thus I implemented this in two ways. The first returns a vector<string> as you would get in Python. The second one was a fully general, lazy and allocation-free method that uses the most unloved of language features: the void pointer.

So given that you have a callback function like this:

the split function becomes:

This is as fully general as you can get without needing to muck about with templates, coroutines or monads (I don't actually know what monads are, I'm just counting on the fact that mentioning them makes you look cool). The cost is that the end user needs to write a small adapter lambda to use it.

Iterating things the Python way

The Python iteration protocol is surprisingly simple. The object being iterated needs to provide a .next() method that either returns the next value or throws a StopIteration exception. Obviously exceptions can't be used in native code because they are too slow, but the same effect can be achieved by returning an optional<T> instead. Here's a unit test for an implementation of Python's range function.

Sadly there is not a simple way to integrate this to native loop constructs without macros and even then it is a bit ugly.

Current status

The repo has basic functionality for strings (regular and enforced UTF-8), regexes and basic containers.

Building the entire project has 8 individual compilation and linking steps and takes 0.8 seconds when using a single core on this laptop computer. Thus compiling a single source file with optimizations enabled takes ~ 0.1 seconds. Which is nice.

The only cheat is that pystd uses ctre for regexes and it is precompiled.