Wednesday, August 6, 2025

Let's properly analyze an AI article for once

Recently the CEO of Github wrote a blog post called Developers reinvented. It was reposted with various clickbait headings like GitHub CEO Thomas Dohmke Warns Developers: "Either Embrace AI or Get Out of This Career" (that one feels like an LLM generated summary of the actual post, which would be ironic if it wasn't awful). To my great misfortune I read both of these. Even if we ignore whether AI is useful or not, the writings contain some of the absolute worst reasoning and stretched logical leaps I have seen in years, maybe decades. If you are ever in the need of finding out how not to write a "scientific" text on any given subject, this is the disaster area for you.

But before we begin, a detour to the east.

Statistics and the Soviet Union

One of the great wonders of statistical science of the previous century was without a doubt the Soviet Union. They managed to invent and perfect dozens of ways to turn data to your liking, no matter the reality. Almost every official statistic issued by USSR was a lie. Most people know this. But even most of those do not grasp just how much the stats differed from reality. I sure didn't until I read this book. Let's look at some examples.

Only ever report percentages

The USSR's glorious statistics tended to be of the type "manufacturing of shoes grew over 600% this five year period". That certainly sounds a lot better than "In the last five years our factory made 700 pairs of shoes as opposed to 100" or even "7 shoes instead of 1". If you are really forward thinking, you can even cut down shoe production on those five year periods when you are not being measured. It makes the stats even more impressive, even though in reality many people have no shoes at all.

The USSR classified the real numbers as state secrets because the truth would have made them look bad. If a corporation only gives you percentages, they may be doing the same thing. Apply skepticism as needed.

Creative comparisons

The previous section said the manufacturing of shoes has grown. Can you tell what it is not saying? That's right, growth over what? It is implied that the comparison is to the previous five year plan. But it is not. Apparently a common comparison in these cases was the production amounts of the year 1913. This "best practice" was not only used in the early part of the 1900s, it was used far into the 1980s.

Some of you might wonder why 1913 and not 1916, which was the last year before the bolsheviks took over? Simply because that was the century's worst year for Russia as a whole. So if you encounter a claim that "car manufacturing was up 3700%" some year in 1980s Soviet Union, now you know what that actually meant.

"Better" measurements

According to official propaganda, the USSR was the world's leading country in wheat production. In this case they even listed out the production in absolute tonnes. In reality it was all fake. The established way of measuring wheat yields is to measure the "dry weight", that is, the mass of final processed grains. When it became apparent that the USSR could not compete with imperial scum, they changed their measurements to "wet weight". This included the mass of everything that came out from the nozzle of a harvester, such as stalks, rats, mud, rain water, dissidents and so on.

Some people outside the iron curtain even believed those numbers. Add your own analogy between those people and modern VC investors here.

To business then

The actual blog post starts with this thing that can be considered a picture.

What message would this choice of image tell about the person using it in their blog post?

Said person does not have sufficient technical understanding to grasp the fact that children's toy blocks should, in fact, be affected by gravity (or that perspective is a thing, but we'll let that pass).
Said person does not give a shit about whether things are correct or could even work, as long as they look "somewhat plausible".

Are these the sort of traits a person in charge of the largest software development platform on Earth should have? No, they are not.

To add insult to injury, the image seems to have been created with the Studio Ghibli image generator, which Hayao Miyazaki described as an abomination on art itself. Cultural misappropriation is high on the list of core values at Github HQ it seems.

With that let's move on to the actual content, which is this post from Twitter (to quote Matthew Garrett, I will respect their name change once Elon Musk starts respecting his child's).

Oh, wow! A field study. That makes things clear. With evidence and all! How can we possibly argue against that?

Easily. As with a child.

Let's look at this "study" (and I'm using the word in its loosest possible term here) and its details with an actual critical eye. The first thing is statistical representativeness. The sample size is 22. According to this sample size calculator I found, a required sample size for just one thousand people would be 278, but, you know, one order of magnitude one way or another, who cares about those? Certainly not business big shot movers and shakers. Like Stockton Rush for example.

The math above assumes an unbiased sampling. The post does not even attempt to answer whether that is the case. It would mean getting answers to questions like:

How were the 22 people chosen?
How many different companies, skill levels, nationalities, genders, age groups etc were represented?
Did they have any personal financial incentive on making their new AI tools look good?
Were they under any sort of duress to produce the "correct" answers?
What was/were the exact phrase(s) that was asked?
Were they the same for all participants?
Was the test run multiple times until it produced the desired result?

The latter is an age old trick where you run a test with random results over and over on small groups. Eventually you will get a run that points the way you want. Then you drop the earlier measurements and publish the last one. In "the circles" this is known as data set selection.

Just to be sure, I'm not saying that is what they did. But if someone drove a dump truck full of money to my house and asked me to create a "study" that produced these results, that is exactly how I would do it. (I would not actually do it because I have a spine.)

Moving on. The main headline grabber is "Either you embrace AI or get out of this career". If you actually read the post (I know), what you find is that this is actually a quote from one of the participants. It's a bit difficult to decipher from the phrasing but my reading is that this is not a grandstanding hurrah of all things AI, but more of a "I guess this is something I'll have to get used to" kind of submission. That is not evidence, certainly not of the clear type. It is an opinion.

The post then goes on a buzzwordsalad tour of statements that range from the incomprehensible to the puzzling. Perhaps the weirdest is this nugget on education:

Teaching [programming] in a way that evaluates rote syntax or memorization of APIs is becoming obsolete.

It is not "becoming obsolete". It has been considered the wrong thing to do for as long as computer science has existed. Learning the syntax of most programming languages takes a few lessons, the rest of the semester is spent on actually using the language to solve problems. Any curriculum not doing that is just plain bad. Even worse than CS education in Russia in 1913.

You might also ponder that if the author is so out of touch with reality in this simple issue, how completely off base the rest of his statements might be. In fact the statement is so wrong at such a fundamental level that it has probably been generated with an LLM.

A magician's shuffle

As nonsensical as the Twitter post is, we have not yet even mentioned the biggest misdirection in it. You might not even have noticed it yet. I certainly did not until I read the actual post. Try if you can spot it.

Ready? Let's go.

The actual fruit of this "study" boils down to this snippet.

Developers rarely mentioned “time saved” as the core benefit of working in this new way with agents. They were all about increasing ambition.

Let that sink in. For the last several years the main supposed advantage of AI tools has been the fact that they save massive amounts of developer time. This has lead to the "fire all your developers and replace them with AI bots" trend sweeping the nation. Now even this AI advertisement of a "study" can not find any such advantages and starts backpedaling into something completely different. Just like we have always been at war with Eastasia, AI has never been about "productivity". No. No. It is all about "increased ambition", whatever that is. The post then carries on with this even more baffling statement.

When you move from thinking about reducing effort to expanding scope, only the most advanced agentic capabilities will do.

Really? Only the most advanced agentics you say? That is a bold statement to make given that the leading reason for software project failure is scope creep. This is the one area where human beings have decades long track record for beating any artificial system. Even if machines were able to do it better, "Make your project failures more probable! Faster! Spectacularer!" is a tough rallying cry to sell.

To conclude, the actual findings of this "study" seem to be that:

AI does not improve developer productivity or skills
AI does increase developer ambition

This is strictly worse than the current state of affairs.

Wednesday, July 23, 2025

Comparing a red-black tree to a B-tree

In an earlier blog post we found that optimizing the memory layout of a red-black tree does not seem to work. A different way of implementing an ordered container is to use a B-tree. It was originally designed to be used for on-disk data. The design principle was that memory access is "instant" while disk access is slow. Nowadays this applies to memory access as well, as cache hits are "instant" and uncached memory is slow.

I implemented a B-tree in Pystd. Here is how all the various containers compare. For test data we used numbers from zero to one million in a random order.

As we can see, an unordered map is massively faster than any ordered container. If your data does not need to be ordered, that is the one you should use. For ordered data, the B-tree is clearly faster than either red-black tree implementation.

Tuning the B-tree

B-trees have one main tunable parameter, namely the spread factor of the nodes. In the test above it was five, but for on disk purposes the recommended value is "in the thousands". Here's how altering the value affects performance.

The sweet spot seems to be in the 256-512 range, where the operations are 60% faster than standard set. As the spread factor grows towards infinity, the B-tree reduces to just storing all data in a single sorted array. Insertion into that is an O(N^2) algorithm as can be seen here.

Getting weird

The B-tree implementation has many assert calls to verify the internal structures. We can compile the code with -DNDEBUG to make all those asserts disappear. Removing redundant code should make things faster, so let's try it.

There are 13 measurements in total and disabling asserts (i.e. enabling NDEBUG) makes the code run slower in 8 of those cases. Let's state that again, since it is very unexpected: in this particular measurement code with assertions enabled runs faster than the same code without them. This should not be happening. What could be causing it?

I don't know for sure, so here is some speculation instead.

First of all a result of 8/13 is probably not statistically significant to say that enabling assertions makes things faster. OTOH it does mean that enabling them does not make the code run noticeably slower. So I guess we can say that both ways of building the code are approximately as fast.

As to why that is, things get trickier. Maybe GCC's optimizer is just really good at removing unnecessary checks. It might even be that the assertions give the compiler more information so it can skip generating code for things that can never happen. I'm not a compiler engineer, so I'll refrain from speculating further, it would probably be wrong in any case.

Friday, July 11, 2025

AI slop is the new smoking

Saturday, July 5, 2025

Deoptimizing a red-black tree

An ordered map is typically slower than a hash map, but it is needed every now and then. Thus I implemented one in Pystd. This implementation does not use individually allocated nodes, but instead stores all data in a single contiguous array.

Implementing the basics was not particularly difficult. Debugging it to actually work took ages of staring at the debugger, drawing trees by hand on paper, printfing things out in Graphviz format and copypasting the output to a visualiser. But eventually I got it working. Performance measurements showed that my implementation is faster than std::map but slower than std::unordered_map.

So far so good.

The test application creates a tree with a million random integers. This means that the nodes are most likely in a random order in the backing store and searching through them causes a lot of cache misses. Having all nodes in an array means we can rearrange them for better memory access patterns.

I wrote some code to reorganize the nodes so that the root is at the first spot and the remaining nodes are stored layer by layer. In this way the query always processes memory "left to right" making things easier for the branch predictor.

Or that's the theory anyway. In practice this made things slower. And not even a bit slower, the "optimized version" was more than ten percent slower. Why? I don't know. Back to the drawing board.

Maybe interleaving both the left and right child node next to each other is the problem? That places two mutually exclusive pieces of data on the same cache line. An alternative would be to place the entire left subtree in one memory area and the right one in a separate one. Thinking about this for a while, this can be accomplished by storing the nodes in tree traversal order, i.e. in numerical order.

I did that. It was also slower than a random layout. Why? Again: no idea.

Time to focus on something else, I guess.

Friday, June 20, 2025

Book creation using almost entirely open source tools

Some years ago I wrote a book. Now I have written a second one, but because no publisher wanted to publish it I chose to self-publish a small print run and hand it out to friends (and whoever actually wants one) as a a very-much-not-for-profit art project.

This meant that I had to create every PDF used in the printing myself. I received the shipment from the printing house yesterday. The end result turned out very nice indeed.

The red parts in this dust jacket are not ink but instead are done with foil stamping (it has a metallic reflective surface). This was done with Scribus. The cover image was painted by hand using watercolor paints. The illustrator used a proprietary image processing app to create the final TIFF version used here. She has since told me that she wants to eventually shift to an open source app due to ongoing actions of the company making the proprietary app.

The cover itself is cloth with a debossed emblem. The figure was drawn in Inkscape and then copypasted to Scribus.

Evert fantasy book needs to have a map. This has two and they are printed in the end papers. The original picture was drawn with a nib pen and black ink and processed with Gimp. The printed version is brownish to give it that "old timey" look. Despite its apparent simplicity this PDF was the most problematic. The image itself is monochrome and printed with a Pantone spot ink. Trying to print this with CMYK inks would just not have worked. Because the PDF drawing model for spot inks in images behaves, let's say, in an unexpected way, I had to write a custom script to create the PDF with CapyPDF. As far as I know no other open source tool can do this correctly, not even Scribus. The relevant bug can be found here. It was somewhat nerve wrecking to send this out to the print shop with zero practical experience and a theoretical basis of "according to my interpretation of the PDF spec, this should be correct". As this is the first ever commercial print job using CapyPDF, it's quite fortunate that it succeeded pretty much perfectly.

The inner pages were created with the same Chapterizer tool as the previous book. It uses Pango and Cairo to generate PDFs. Illustrations in the text were drawn with Krita. As Cairo only produces RGB PDFs, as the last step it had to be converted to grayscale using Ghostscript.

Monday, June 16, 2025

A custom C++ standard library part 4: using it for real

Writing your own standard library is all fun and games until someone (which is to say yourself) asks the important question: could this be actually used for real? Theories and opinions can be thrown about the issue pretty much forever, but the only way to actually know for sure is to do it.

Thus I converted CapyPDF, which is a fairly compact 15k LoC codebase from the C++ standard library to Pystd, which is about 4k lines. All functionality is still the same, which is to say that the test suite passes, there are most likely new bugs that the tests do not catch. For those wanting to replicate the results themselves, clone the CapyPDF repo, switch to the pystdport branch and start building. Meson will automatically download and set up Pystd as a subproject. The code is fairly bleeding edge and only works on Linux with GCC 15.1.

Build times

One of the original reasons for starting Pystd was being annoyed at STL compile times. Let's see if we succeeded in improving on them. Build times when using only one core in debug look like this.

When optimizations are enabled the results look like this:

In both cases the Pystd version compiles in about a quarter of the time.

Binary size

C++ gets a lot of valid criticism for creating bloated code. How much of that is due to the language as opposed to the library?

That's quite unexpected. The debug info for STL types seems to take an extra 20 megabytes. But how about the executable code itself?

STL is still 200 kB bigger. Based on observations most of this seems to come from stdlibc++'s implementation of variant. Note that if you try this yourself the Pystd version is probably 100 kB bigger, because by default the build setup links against libsubc++, which adds 100+ kB to binary sizes whereas linking against the main C++ runtime library does not.

Performance

Ok, fine, so we can implement basic code to build faster and take less space. Fine. But what about performance? That is the main thing that matters after all, right? CapyPDF ships with a simple benchmark program. Let's look at its memory usage first.

Apologies for the Y-axis does not starting at zero. I tried my best to make it happen, but LibreOffice Calc said no. In any case the outcome itself is expected. Pystd has not seen any performance optimization work so it requiring 10% more memory is tolerable. But what about the actual runtime itself?

This is unexpected to say the least. A reasonable result would have been to be only 2x slower than the standard library, but the code ended up being almost 25% faster. This is even stranger considering that Pystd's containers do bounds checks on all accesses, the UTF-8 parsing code sometimes validates its input twice, the hashing algorithm is a simple multiply-and-xor and so on. Pystd should be slower, and yet, in this case at least, it is not.

I have no explanation for this. It is expected that Pystd will start performing (much) worse as the data set size grows but that has not been tested.

Friday, June 6, 2025

Custom C++ stdlib part 3: The bleedingest edge variant

Implementing a variant type in C++ is challenging to say the least. I tried looking into the libstd++ implementation and could not even decipher where the actual data is stored. There is a lot of inheritance going on and helper classes that seem to be doing custom vtable construction and other metaprogramming stuff. The only thing I could truly grasp was a comment saying // "These go to eleven". Sadly there was not a comment // Smell my glove! which would seem more suitable for this occasion.

A modern stdlib does need a variant, though, so I had to implement one. To make it feasible I made the following simplifying assumptions.

All types handled must be noexcept default constructible, move constructible and movable. (i.e. the WellBehaved concept)
If you have a properly allocated and aligned piece of memory, placement new'ing into it works (there may be UB-shenanigans here due to the memory model)
The number of different types that a variant can hold has a predefined static maximum value.
You don't need to support any C++ version older than c++26.

The last one of these is the biggest hurdle, as C++26 will not be released for at least a year. GCC 15 does have support for it, though, so all code below only works with that.

The implementation

At its core, a Pystd variant is nothing more than a byte buffer and an index specifying which type it holds:

template<typename T...>

class Variant {

char buf[compute_size<0, T...>()] alignas(compute_alignment<0, T...>());

int8_t type_id;

};

The functions to compute max size and alignment requirements for types are simple to implement. The main problem lies elsewhere, specifically: going from a type to the corresponding index, going from a compile time index value to a type and going from a runtime index to the corresponding type.

The middle one is the simplest of these. As of C++26 you can directly index the argument pack like so:

using nth_type = T...[compile_time_constant];

Going from type to an index is only slightly more difficult:

Going from a runtime value to a type is the difficult one. I don't know how to do it "correctly", i.e. the way a proper stdlib implementation does it. However, since the number of possible types is limited at compile time, we can cheat (currently only 5 types are supported):

Unrolling (type) loops like its 1988! This means you can't have variants with hundreds of different types, but in that case you probably need an architectural redesign rather than a more capable variant.

With these primitives implementing public class methods is fairly simple.

The end result

The variant implementation in Pystd and its helper code took approximately 200 lines of code. It handles all the basic stuff in addition to being exception safe for copy operations (implemented as copy to a local variable + move). Compile times remain in fractions of a second per file even though Pystd only has a single public header.

It works in the sense that you can put different types in to it, switch between them and so on without any compiler warnings, sanitizer issues or Valgrind complaints. So be careful with the code, I have only tested it, not proven it correct.

No performance optimization or even measurements have been made.

Sunday, May 25, 2025

Iterators and adaptors

Previously it was mentioned that Python and C++ do iteration quite differently. Python has "statefull" objects that have a .next() method that returns a new object or throws a StopIteration exception. Incidentally Rust does exactly the same thing, except that it uses an optional type rather than an exception. C++ does not work like this, instead it has a concept of a "start" and "end" of a sequence and the machinery keeps incrementing the start until it reaches the end.

At the end of last post it was said that it is difficult to integrate an object of the former type with C++'s native language facilities, at least without macros.

Now we'll look how to integrate an object of the former type with C++'s native language facilities without any macros.

In fact, it only took fewer than 30 lines of code to do. The caveat being that it is probably fairly easy to break it. But it works for me.

Here it is being used.

There is also a second, similar helper class that takes ownership of the object being iterated over.

Monday, May 19, 2025

Optimizing page splits in books

Earlier in this blog we looked at how to generate a justified block of text, which is nowadays usually done with the Knuth-Plass algorithm. Sadly this is not, by itself, enough to create a finished product. Processing all input text with the algorithm produces one very long column of text. This works fine if you end result is a scroll (of the papyrus variety), but people tend to prefer their text it book format. Thus you need a way to split that text column into pages.

The simplest solution is to decide that a page has some N number of lines and page breaks happen at exact these places. This works (and is commonly used) but it has certain drawbacks. From a typographical point of view, there are at least three things that should be avoided:

Orphan lines, a paragraph at the bottom of a page that has only one line of text.
Widow lines, a paragraph that starts a new page and has only one line of text.
Spread imbalance, where the two pages on a spread have different number of lines on them (when both are "full pages")

Determining "globally optimal" page splits for a given text is not a simple problem, because every pagination choice affects every pagination choice that comes after it. If you stare at the problem long and hard enough, eventually you realize something.

The two problems are exactly the same and can be solved in the same way. I have implemented this in the Chapterizer book generator program. The dynamic programming algorithm for both is so similar that they could, in theory, use shared code. I chose not to do it because sometimes it is just so much simpler to not be DRY. They are so similar, in fact, that a search space optimization that I had to do in the paragraph justification case was also needed for pagination even though the data set size is typically much smaller in pagination. The optimization implementation turned out to be exactly the same for both cases.

The program now creates optimal page splits and in addition prints a text log of all pages with widows, orphans and imbalances it could not optimize away.

Has this been done before?

Probably. Everything has already been invented multiple times. I don't know for sure, so instead I'm going to post my potentially incorrect answer here on the Internet for other people to fact check.

I am not aware of any book generation program doing global page optimization. LibreOffice and Word definitely do not do it. As far as I know LaTeX processes pages one by one and once a page is generated it never returns to it, probably because computers in the early 80s did not have enough memory to hold the entire document in RAM at the same time. I have never used Indesign so I don't know what it does.

Books created earlier than about the mid eighties were typeset by hand and they seem to try to avoid orphans and widows at the cost of some page spreads having more lines than other spreads. Modern books seem to optimize for filling the page fully rather than avoiding widows and orphans. Since most books nowadays are created with Indesign (I think), it would seem to imply that Indesign does not do optimal page splitting or at least it is not enabled by default.

When implementing the page splitter the reasons for not doing global page splitting became clear. Computing both paragraph and page optimization is too slow for a "real time" GUI app. This sort of an optimization really only makes sense for a classical "LaTeX-style" book where there is one main text flow. Layout programs like Scribus and Indesign support richer magazine style layouts. This sort of a page splitter does not work in the general case, so in order to use it they would need to have a separate simpler document format.

But perhaps the biggest issue is that if different pages may have different number of lines on each page, it leads to problems in the page's visual appearance. Anything written at the bottom of the page (like page numbers) need to be re-placed based on how many lines of text actually ended up on the page. Doing this in code in a batch system is easy, doing the same with GUI widgets in "real time" is hard.

Saturday, May 3, 2025

Writing your own C++ standard library part 2

This blog post talked about the "self written C++ standard library" I wrote for the fun of it (code here).

The post got linked by Hackernews and Reddit. As is usual the majority of comments did not talk about the actual content but instead were focused on two tangential things. The first one being "this is not a full implementation of the C++ standard library as specified by the ISO standard, therefore the author is an idiot". I am, in actual fact, an idiot, but not due to project scope but because I assumed people on the Internet to have elementary reading comprehension skills. To make things clear: no, this is not an implementation of the ISO standard library. At no point was such a thing claimed. There is little point in writing one of those, there are several high quality implementations available. "Standard library" in this context means "a collection of low level functions and types that would be needed by most applications".

The second discussion was around the fact that calling the C++ standard library "STL" was both wrong and proof that the person saying that does not understand the C++ language. This was followed by several "I am a C++ standard library implementer and everyone I know calls it the STL". Things deteriorated from there.

The complexity question

Existing container implementations are complex by necessity. They need to handle things like types that can not be noexcept moved or copied. The amount of rollback code needed explodes very quickly and needs to be processed every time the header is included (because of templates). A reasonable point against writing your own containers is that eventually you need to support all the same complexity as existing ones because people will insert "bad" types into your container and complain when things fail. Thus you need to have all the same nasty, complex and brittle code as an STL implementation, only lacking decades of hardening to shake out all the bugs.

That is true, but the beauty of having zero existing users is that you can do something like this:

This is basically a concept that requires the type given to be noexcept-movable (and some other things that could arguably be removed). Now, instead of allowing any type under the sun, all containers require all types they get instantiated with to be WellBehaved. This means that the library does not have to care at all about types behaving badly because trying to use those results in a compiler error. A massive amount of complexity is thus eliminated in a single stroke.

There are of course cases when you need to deal with types that are not well behaved. If you can tolerate a memory allocation per object, the simple solution is to use unique_ptr. OTOH if you have types that can't be reliably moved and which are so performance critical that you can't afford memory allocations, then you are probably going to write your own container code anyway. If you are in the position that you have badly behaved types that you can't update, can't tolerate an allocation and can't write your own containers, then that is your problem.

Iterating things the native way

One of the most common things I do with strings in Python is to split them on whitespace. In Python this is simple as there is only one string type, but in native code things are more complicated. What should the return type of mystring.split() be?

vector<string>
vector<string_view>
A coroutine handle
The method should be a full template so you can customize it to output any custom string type (as every C++ code base has at least three of them)
Something ranges something something

There is probably not a single correct answer. All of the options have different tradeoffs. Thus I implemented this in two ways. The first returns a vector<string> as you would get in Python. The second one was a fully general, lazy and allocation-free method that uses the most unloved of language features: the void pointer.

So given that you have a callback function like this:

the split function becomes:

This is as fully general as you can get without needing to muck about with templates, coroutines or monads (I don't actually know what monads are, I'm just counting on the fact that mentioning them makes you look cool). The cost is that the end user needs to write a small adapter lambda to use it.

Iterating things the Python way

The Python iteration protocol is surprisingly simple. The object being iterated needs to provide a .next() method that either returns the next value or throws a StopIteration exception. Obviously exceptions can't be used in native code because they are too slow, but the same effect can be achieved by returning an optional<T> instead. Here's a unit test for an implementation of Python's range function.

Sadly there is not a simple way to integrate this to native loop constructs without macros and even then it is a bit ugly.

Current status

The repo has basic functionality for strings (regular and enforced UTF-8), regexes and basic containers.

Building the entire project has 8 individual compilation and linking steps and takes 0.8 seconds when using a single core on this laptop computer. Thus compiling a single source file with optimizations enabled takes ~ 0.1 seconds. Which is nice.

The only cheat is that pystd uses ctre for regexes and it is precompiled.

Monday, March 24, 2025

Writing your own C++ standard library from scratch

The C++ standard library (also know as the STL) is, without a doubt, an astounding piece of work. Its scope, performance and incredible backwards compatibility have taken decades of work by many of the world's best programmers. My hat's off to all those people who have contributed to it.

All of that is not to say that it is not without its problems. The biggest one being the absolutely abysmal compile times but unreadability, and certain unoptimalities caused by strict backwards compatibility are also at the top of the list. In fact, it could be argued that most of the things people really dislike about C++ are features of the STL rather than the language itself. Fortunately, using the STL is not mandatory. If you are crazy enough, you can disable it completely and build your own standard library in the best Bender style.

One of the main advantages of being an unemployed-by-choice open source developer is that you can do all of that if you wish. There are no incompetent middle damagers hovering over your shoulder to ensure you are "producing immediate customer value" rather than "wasting time on useless polishing that does not produce immediate customer value".

It's my time, and I'll waste it if I want to!

What's in it?

The biggest design questions of a standard library are scope and the "feel" of the API. Rather than spending time on design, we steal it. Thus, when in doubt, read the Python stdlib documentation and replicate it. Thus the name of the library is pystd.

The test app

To keep the scope meaningful, we start by writing only enough of stdlib to build an app that reads a text file, validates it as UTF-8, splits the contents into words, counts how many time each word appears in the file and prints all words and how many times it appears sorted by decreasing count.

This requires, at least:

File handling
Strings
UTF8 validation
A hash map
A vector
Sorting

The training wheels come off

The code is available in this Github repo for those who want to follow along at home.

Disabling the STL is fairly easy (with Linux+GCC at least) and requires only these two Meson statements:

add_global_arguments('-nostdinc++', language: 'cpp')
add_global_link_arguments('-nostdlib++', '-lsupc++', language: 'cpp')

The supc++ library is (according to stackoverflow) a support library GCC needs to implement core language features. Now the stdlib is off and it is time to implement everything with sticks, stones and duct tape.

The outcome

Once you have implemented everything discussed above and auxiliary stuff like a hashing framework the main application looks like this.

The end result is both Valgrind and Asan clean. There is one chunk of unreleased memory, but that comes from supc++. There is probably UB in the implementation. But it should be the good kind of UB that, if it would actually not work, would break the entire Linux userspace because everything depends on it working "as expected".

All of this took fewer than 1000 lines of code in the library itself (including a regex implementation that is not actually used). For comparison merely including vector from the STL brings in 27 thousand lines of code.

Comparison to an STL version

Converting this code to use the STL is fairly simple and only requires changing some types and fine tuning the API. The main difference is that the STL version does not validate that the input is UTF-8 as there is no builtin function for that. Now we can compare the two.

Runtime for both is 0.001 to 0.002 seconds on the small test file I used. Pystd is not noticeably slower than the STL version, which is enough for our purposes. It almost certainly scales worse because there has been zero performance work on it.

Compiling the pystd version with -O2 takes 0.3 seconds whereas the STL version takes 1.2 seconds. The measurements were done on a Ryzen 7 3700X processor.

The executable's unstripped size is 349k for STL and 309k for pystd. The stripped sizes are 23k and 135k. Approximately 100 k of the pystd executable comes from supc++. In the STL version that probably comes dynamically from libstdc++ (which, on this machine, takes 2.5 MB).

Perfect ABI stability

Designing a standard library is exceedingly difficult because you can't ever really change it. Someone, somewhere, is depending on every misfeature in it so they can never be changed.

Pystd has been designed to both support perfect ABI stability and make it possible to change it in arbitrary ways in the future. If you start from scratch this turned out to be fairly simple.

The sample code above used the pystd namespace. It does not actually exist. Instead it is defined like this in the cpp file:

#include <pystd2025.hpp>

namespace pystd = pystd2025;

In pystd all code is in a namespace with a year and is stored in a header file with the same year. The idea is, then, that every year you create a new release. This involves copying all stdlib header files to a file with the new year and regexping the namespace declarations to match. The old code is now frozen forever (except for bug fixes) whereas the new code can be changed at will because there are zero existing lines of code that depend on it.

End users now have the choice of when to update their code to use newer pystd versions. Even better, if there is an old library that can not be updated, any of the old versions can be used in parallel. For example:

pystd2030::SomeType foo;
pystd2025::SomeType bar(foo.something(), foo.something_else());

Thus if no code is ever updated, everything keeps working. If all code is updated at once, everything works. If only parts of the code are updated, things can still be made to work with some glue code. This puts the maintenance burden on the people whose projects can not be updated as opposed to every other developer in the world. This is as it should be, and also would motivate people with broken deps to spend some more effort to get them fixed.

Thursday, February 27, 2025

The price of statelessness is eternal waiting

Most CI systems I have seen have been stateless. That is, they start by getting a fresh Docker container (or building one from scratch), doing a Git checkout, building the thing and then throwing everything away. This is simple and matematically pure, but really slow. This approach is further driven by the way how in cloud computing CPU time and network transfers are cheap but storage is expensive (or at least it is possible to get almost infinite CI build time for open source projects but not persistent storage). Probably because the cloud vendor needs to take care of things like backups, they can't dispatch the task on any machine on the planet but instead on the one that already has the required state and so on.

How much could you reduce resource usage (or, if you prefer, improve CI build speed) by giving up on statelessness? Let's find out by running some tests. To get a reasonably large code base I used LLVM. I did not actually use any cloud or Docker in the tests, but I simulated them on a local media PC. I used 16 cores to compile and 4 to link (any more would saturate the disk). Tests were not run.

Baseline

Creating a Docker container with all the build deps takes a few minutes. Alternatively you can prebuild it, but then you need to download a 1 GB image.

Doing a full Git checkout would be wasteful. There are basically three different ways of doing a partial checkout: shallow clone, blobless and treeless. They take the following amount of time and space:

shallow: 1m, 259 MB
blobless: 2m 20s, 961 MB
treeless: 1m 46s, 473 MB

Doing a full build from scratch takes 42 minutes.

With CCache

Using CCache in Docker is mostly a question of bind mounting a persistent directory in the container's cache directory. A from-scratch build with an up to date CCache takes 9m 30s.

With stashed Git repo

Just like the CCache dir, the Git checkout can also be persisted outside the container. Doing a git pull on an existing full checkout takes only a few seconds. You can even mount the repo dir read only to ensure that no state leaks from one build invocation to another.

With Danger Zone

One main thing a CI build ensures is that the code keeps on building when compiled from scratch. It is quite possible to have a bug in your build setup that manifests itself so that the build succeeds if a build directory has already been set up, but fails if you try to set it up from scratch. This was especially common back in ye olden times when people used to both write Makefiles by hand and to think that doing so was a good idea.

Nowadays build systems are much more reliable and this is not such a common issue (though it can definitely still occur). So what if you would be willing to give up full from-scratch checks on merge requests? You could, for example, still have a daily build that validates that use case. For some organizations this would not be acceptable, but for others it might be reasonable tradeoff. After all, why should a CI build take noticeably longer than an incremental build on the developer's own machine. If anything it should be faster, since servers are a lot beefier than developer laptops. So let's try it.

The implementation for this is the same as for CCache, you just persist the build directory as well. To run the build you do a Git update, mount the repo, build dir and optionally CCache dirs to the container and go.

I tested this by doing a git pull on the repo and then doing a rebuild. There were a couple of new commits, so this should be representative of the real world workloads. An incremental build took 8m 30s whereas a from scratch rebuild using a fully up to date cache took 10m 30s.

Conclusions

The amount of wall clock time used for the three main approaches were:

Fully stateless

Image building: 2m
Git checkout: 1m
Build: 42m
Total: 45m

Cached from-scratch

Image building: 0m (assuming it is not "apt-get update"d for every build)
Git checkout: 0m
Build: 10m 30s
Total: 10m 30s

Fully cached

Image building: 0m
Git checkout: 0m
Build: 8m 30s
Total: 8m 30s

Similarly the amount of data transferred was:

Fully stateless

Image: 1G
Checkout: 300 MB

Cached from-scratch:

Image: 0
Checkout: O(changes since last pull), typically a few kB

Fully cached

Image: 0
Checkout: O(changes since last pull)

The differences are quite clear. Just by using CCache the build time drops by almost 80%. Persisting the build dir reduces the time by a further 15%. It turns out that having machines dedicated to a specific task can be a lot more efficient than rebuilding the universe from atoms every time. Fancy that.

The final 2 minute improvement might not seem like that much, but on the other hand do you really want your developers to spend 2 minutes twiddling their thumbs for every merge request they create or update? I sure don't. Waiting for CI to finish is one of the most annoying things in software development.

Monday, February 10, 2025

C++ compiler daemon testing tool

In an earlier blog post I wrote about a potential way of speeding up C++ compilations (or any language that has a big up-front cost). The basic idea is to have a process that reads in all stdlib header code that is suspended. Compilations are done by sending the actual source file + flags to this process, which then forks and resumes compilation. Basically this is a way to persist the state of the compiler without writing (or executing) a single line of serialization code.

The obvious follow up question is what is the speedup of this scheme. That is difficult to say without actually implementing the system. There are way too many variables and uncertainties to make any sort of reasonable estimate.

So I implemented it.

Not in an actual compiler, heavens no, I don't have the chops for that. Instead I implemented a completely fake compiler that does the same steps a real compiler would need to take. It spawns the daemon process. It creates a Unix domain socket. It communicates with the running daemon. It produces output files. The main difference is that it does not actually do any compilation, instead it just sleeps to simulate work. Since all sleep durations are parameters, it is easy to test the "real" effect of various schemes.

The code is in this GH repo.

The default durations were handwavy estimates based on past experience. In past measurements, stdlib includes take by far the largest chunk of the total compilation time. Thus I estimated that compilation without this scheme would take 5 seconds per file whereas compilations with it would take 1 second. If you disagree with these assumptions, feel free to run the test yourself with your own time estimates.

The end result was that on this laptop that has 22 cores a project with 100 source files took 26 seconds to compile without the daemon and 7 seconds with it. This means the daemon version finished in just a hair over 25% of a "regular" build.

Wouldn't you want your compilations to finish in a quarter of the time with zero code changes? I sure would.

(In reality the speedup is probably less than that. How much? No idea. Someone's got to implement that to find out.)

Tuesday, February 4, 2025

The trials and tribulations of supporting CJK text in PDF

In the past I may have spoken critically on Truetype fonts and their usage in PDF files. Recently I have come to the conclusion that it may have been too harsh and that Truetype fonts are actually somewhat nice. Why? Because I have had to add support for CFF fonts to CapyPDF. This is a font format that comes from Adobe. It encodes textual PostScript drawing operations into binary bytecode. Wikipedia does not give dates, but it seems to have been developed in the late 80s - early 90s. The name CFF is an abbeviation for "complicated font format".

Double-checks notes.

Compact font format. Yes, that is what I meant to write. Most people reading this have probably not ever even seen a CFF file so you might be asking why is supporting CFF fonts even a thing nowadays? It's all quite simple. Many of the Truetype (and especially OpenType) fonts you see are not actually Truetype fonts. Instead they are Transfontners, glyphs in disguise. It is entirely valid to have a Truetype font that is merely an envelope holding a CFF font. As an example the Noto CJK fonts are like this. Aggregation of different formats is common in font files, and the main reason OpenType fonts have like four different and mutually incompatible ways of specifying color emoji. None of the participating entities were willing to accept anyone else's format so the end result was to add all of them. If you want Asian language support, you have to dive into the bowels of the CFF rabid hole.

As most people probably do not have sufficient historical perspective, let's start by listing out some major computer science achievements that definitely existed when CFF was being designed.

File format magic numbers
Archive formats that specify both the offset and size of the elements within
Archive formats that afford access to their data in O(number of items in the archive) rather than O(number of bytes in the file)
Data compression

CFF chooses to not do any of this nonsense. It also does not believe in consistent offset types. Sometimes the offsets within data objects refer to other objects by their order in the index they are in. Sometimes they refer to number of bytes from the beginning of the file. Sometimes they refer to number of bytes from the beginning of the object the offset data is written in. Sometimes it refers to something else. One of the downsides of this is that while some of the data is neatly organized into index structures with specified offsets, a lot of it is just free floating in the file and needs the equivalent of three pointer dereferences to access.

Said offsets are stored with a variable width encoding like so:

This makes writing subset CFF font files a pain. In order to write an offset value at some location X, you first must serialize everything up to that point to know where the value would be written. To know the value to write you have to serialize the the entire font up to the point where that data is stored. Typically the data comes later in the file than its offset location. You know what that means? Yes, storing all these index locations and hotpatching them afterwards once you find out where the actual data pointed to ended up in. Be sure to compute your patching locations correctly lest you end up in lengthy debugging sessions where your subset font files do not render correctly. In fairness all of the incorrect writes were within the data array and thus 100% memory safe, and, really, isn't that the only thing that actually matters?

One of the main data structures in a CFF file is a font dictionary stored in, as the docs say, "key-value pairs". This is not true. The "key-value dictionary" is neither key-value nor is it a dictionary. The entries must come in a specific order (sometimes) so it is not a dictionary. The entries are not stored as key-value pairs but as value-key pairs. The more accurate description of "value-key somewhat ordered array" does lack some punch so it is understandable that they went with common terminology. The backwards ordering of elements to some people confusion bring might, but it perfect sense makes, as the designers of the format a long history with PostScript had. Unknown is whether some of them German were.

Anyhow, after staring directly into the morass of madness for a sufficient amount of time the following picture emerges.

Final words

The CFF specification document contains data needed to decipher CFF data streams in nice tabular format, which is easy to convert to an enum. Trying it fails with an error message saying that the file has prohibited copypasting. This is a bit rich coming from Adobe, whose current stance seems to be that they can take any document opened with their apps and use it for AI training. I'd like to conclude this blog post by sending the following message to the (assumed) middle manager who made the decision that publicly available specification documents should prohibit copypasting:

YOU GO IN THE CORNER AND THINK ABOUT WHAT YOU HAVE DONE! AND DON'T EVEN THINK ABOUT COMING BACK UNTIL YOU ARE READY TO APOLOGIZE TO EVERYONE FOR YOU ACTIONS!

Tuesday, January 14, 2025

Measuring code size and performance

Are exceptions faster and/or bloatier than using error codes? Well...

The traditional wisdom is that exceptions are faster when not taken, slower when taken and lead to more bloated code. On the other hand there are cases where using exceptions makes code a lot smaller. In embedded development, even, where code size is often the limiting factor.

Artificial benchmarks aside, measuring the effect on real world code is fairly difficult. Basically you'd need to implement the exact same, nontrivial piece of code twice. One implementation would use exceptions, the other would use error codes but they should be otherwise identical. No-one is going to do that for fun or even idle curiosity.

CapyPDF has been written exclusively using C++ 23's new expected object for error handling. As every Go programmer knows, typing error checks over and over again is super annoying. Very early on I wrote macros to autopropagate errors. That props up an interesting question, namely could you commit horrible macro crimes to make the error handling either use error objects or exceptions?

It tuns out that yes you can. After a thorough scrubbing of the ensuring shame from your body and soul you can start doing measurements. To get started I built and ran CapyPDF's benchmark application with the following option combinations:

Optimization -O1, -O2, -O3, -Os
LTO enabled and disabled
Exceptions enabled and disabled
RTTI enabled and disabled
NDEBUG enabled and disabled

The measurements are the stripped size of the resulting shared library and runtime of the test executable. The code and full measurement data can be found in this repo. The code size breakdown looks like this:

Performance goes like this:

Some interesting things to note:

The fastest runtime of 0.92 seconds with O3-lto-rtti-noexc-ndbg
The slowest is 1.2s with Os-noltto-rtti-noexc-ndbg
If we ignore Os the slowes is 1.07s O1-noltto-rtti-noexc-ndbg
The largest code is 724 kb with O3-nolto-nortt-exc-nondbg
The smallest is 335 kB with Os-lto-nortti-noexc-ndbg
Ignoring Os the smallest is 470 kB with O1-lto-nortti-noexc-ndbg

Things noticed via eyeballing

Os leads to noticeably smaller binaries at the cost of performance
O3 makes binaries a lot bigger in exchange for a fairly modest performance gain
NDEBUG makes programs both smaller and faster, as one would expect
LTO typically improves both speed and code size
The fastest times for O1, O2 and O3 are within a few percent points of each other with 0.95, 094 and 0.92 seconds, respectively

Caveats

At the time of writing the upstream code uses error objects even when exceptions are enabled. To replicate these results you need to edit the source code.

The benchmark does not actually raise any errors. This test only measures the golden path.

The tests were run on GCC 14.2 on x86_64 Ubuntu 10/24.