Friday, June 26, 2026

Pystd standard library, similar-ish functionality with a fraction of the compile time

I submitted talk proposals about Pystd, the from-scratch written standard library for C++ (custom design, not a implementation of the ISO specification) to a bunch of conferences. Unfortunately all of them were rejected, so it's blog posting time.

A controversial opinion

Pretty much everybody agrees that compiling C++ is slow, but I personally feel that it is intolerably slow. Other people might disagree, and that is fine, but let's look at some numbers.

Compiling a helloworld executable in C takes approximately 0.02 seconds. Compiling a C++ exe that does the same thing using #include<print> takes a second if optimizations are disabled and up to 2.3 seconds with them enabled. This is approximately a 100x slowdown. I'm using a Ryzen 7 3700X processor, so not the latest and greatest but not too shabby either. I have talked about this slowdown with some people in person and weirdly often their answers have been "that's not a problem, two seconds is insignificant". Even if you accepted this (which personally I don't) the big problem comes from scaling because the slowdown factor is approximately linear. Let's assume that in a less extreme case the slowdown was only 20x instead of 100x. In this case a program that should be done in 0.1 seconds takes 2 seconds, and therefore a program that should compile in one minute would take on the order of 20 minutes to compile.

Why is compilation so slow?

C++ is actually very fast to compile, the slowdowns come mostly from the way the standard library is implemented. This is actually fairly easy to test yourself by running the following shell snippet.

echo '#include<vector>' | g++ -x c++ -E - -std=c++23 | wc

The -E flag tells the compiler to stop after preprocessing. The output is the source code that is fed to the compiler proper. Instead we pass it to wc and find out that merely including vector expands out to 29 000 lines of code. The number is not directly comparable to "human written code", but still, almost 30k lines of code just to get a growable array of elements seems a bit much. And vector is actually one of the lighter headers. Memory is 55 thousand lines (especially bad, since 99% of the time people just want unique_ptr), print is 65 thousand lines and filesystem is 80 thousand lines.

The unfortunate reality is that if you include even a single standard library header, your compile times tank and there's nothing you can do about it.

Just say no

Pystd was originally just a project for me to implement low level primitives (hash maps etc) for scratch for the fun of it. Pretty quickly I came to the three design priorities:

  1. Compilation time
  2. Simplicity of implementation
  3. Performance

I'm not aware of an existing standard library where build time minimization would have been a design priority. Those that are fast, like the standard libraries of C and Go, seem to mostly follow from the simplicity of their respective languages.

At the time of writing building Pystd and all tests from scratch using a single core takes 4 seconds. This consists of 45 individual processes (mostly compiles, a few links). Enabling optimizations balloons the build time to 9 seconds. Using all 16 cores brings it down to 1.9 seconds.

What we have thus far

If we ignore test code, Pystd has 6500 lines of headers and 5600 lines of source in total. Even adding these two together yields a line count of roughly one third of std::vector's (preprocessed) implementation. Functionality provided by Pystd includes:

  • vector, string, validating u8string, string views, spans
  • Hash map, ordered map (using a B-tree)
  • sort (approximately as fast as stdlibc++), stable_sort (faster than stdlibc++)
  • Random selection of things in the ISO algorithm header
  • Optional, expected, variant, unique_ptr
  • Functionality roughly equivalent to Python modules argparse, pathlib (including the ** operator), regular expressions (using pcre) and tempfile

Note that all of these are "complete" as it were. Typically they only contain the most commonly used subset of functionality. That might be a fairly small.

Performance

There is an earlier blog post about the performance. The numbers for converting the CapyPDF library are as follows:

  • Compile times dropped ~80%
  • Unstripped binary size reduced by ~75%
  • Stripped binary size reduced by ~30%
  • Runtime performance became ~25% faster (yes, faster, not slower)

Regression can be prevented

Two typical issues people raise when hearing something needs to be "fast to compile" are the following:

  1. What even is "fast"? It is highly subjective thing that depends on each person and the computer they are using.
  2. Even if something is fast now, it can not remain fast. New functionality gets added all the time, so the code is destined to become ever slower and eventually it is just as slow as the default standard library.

The boring solution to both of these issues is the same: a predefined time budget. Pystd has a requirement that compiling a source file that includes any single public header must take at most 0.15 seconds. This limit was originally 0.1 seconds and it worked perfectly with GCC, but Clang's process startup time is longer than that. The test script that validates the performance is here. It must pass even on a Raspberry Pi.

Interestingly the tester script was not originally single-threaded. I parallelised it only because it was the single slowest part of Pystd's compile-test cycle taking over a second.

This is the requirement that all new functionality in Pystd must meet. If the code you want to add compiles too slow then either you rewrite the whole package to compile faster or you submit a patch to the upstream compiler to make it run faster.

Try it yourself

The code for Pystd is available in the usual places. Beginners might want to try using the sample project instead.

The code works on Linux and macOS. It does not support MSVC, because the implementation uses pack indexing, which MSVC has not implemented yet.

Monday, June 15, 2026

Beware of Star Trek managers, especially when bearing MBAs

Almost exactly three years ago the Oceangate submarine implosion happened. The disaster came about when a billionaire called Stockton Rush created his own unclassified submarine to go sightseeing on the Titanic. Ignoring all advice from experts he created a "macgyveresque death trap" that eventually killed him and sadly also 4 innocent people. The whole thing was a massive display of stupidity and arrogance with unfortunate outcomes. We are not going to go into the actual event any deeper, but those interested can find lots of material online.

Instead we are going to look more deeply into one often overlooked points of Stockton Rush's character. Apparently he felt like he was something of a "new James T. Kirk" (link1 paywalled, link2). Liking Star Trek is not that unusual. I'm guessing that more than 99% of the readers of this blog are fellow Star Trek fans. The problem lies elsewhere, but to understand it we first have travel back in time.

A brief overview of the British navy during the Napoleonic wars (by a non-historian, so probably inaccurate)

The original concept for Star Trek was, approximately, The Adventures of Horatio Hornblower in Space! The Enterprise is basically a British warship sailing through the vast ocean of outer space. The command structure mirrors this, where you have a captain, navigator, ship's doctor and so on. The Next Generation leaned into this even further by having a first officer and so on. The original Star Trek never went into detail on how the main cast got to their current positions, just that there was an Starfleet Academy they went to.

In the Napoleonic era of Hornblower things were quite different. Anyone who wanted to become a captain pretty much had to be from the upper classes. They had to obtain a letter of recommendation so that they could join a vessel as a midshipman at the age of 13 or so. They were expected to be able seamen by this time and then spent the next six to seven years working on the ship rigging sails and doing all manner of random jobs. This went on for six to nine years depending on circumstances, after which the person could take a formal examination to become a lieutenant. The test was not trivial, many people could not pass even after trying multiple times.

A lieutenant then had to work successfully for several years before obtaining the rank of captain. Even that did not guarantee a commission. Some captains never commanded a ship simply because there were not enough of them to go around. All in all becoming a ship's captain was a long and difficult journey. In a surprisingly non-British turn of events it was not possible for aristocrats to sneak past the gates. Getting a midshipman position was obviously easier with connections, but the lieutenant's test was something they had to pass on their own.

All of this is to say that every captain of the time was an expert with decades of working experience on many different positions aboard the ship.

What does a captain actually do?

[Note: I have not fact checked this portion at all. Feel free to consider it fanfiction.]

The year is 1808 and we are aboard a British warship about to leave for a mission of great importance. The captain gives the order to set sail. Whistles are blown, bells are rung and sailors springs into action. Every single man, with one exception, is either doing manual labour or directly supervising their underlings. That exception is the captain, who seemingly stands around doing nothing (at least if you ask the crew). This is not so.

What he is doing is crucial. He is observing the state of the ship and her crew. This includes things like overall crew morale, any aberrations from normal operations that could cause problems, thinking of workflow improvements and so on. In a sense he has to sense the ship itself. This only works because of two things. First of all he has personal experience doing the exact work he is observing. If you have not personally "been there", you can't really know if a crew is working well or not. You need a "gut feeling" to be able to sense this. Secondly the captain does not have any manual labour so he can focus all of his mental energy on observing the ship's state. He is preparing for all the unexpected things that may occur in the future. This can only happen if your brain is free from menial tasks.

This is exactly what most books on business and project management advocate. It is a time tested way of improving your chances of success. A highly skilled commander can take an average team of people and lead them to victory. It is the basic plot of most military and sports movies.

Getting back to the present

Now take a typical modern day billionaire-via-inheritance and show them Star Trek at an impressionable age. Do they see the advantages of education, hard work and ethics? The foundation upon which Gene Roddenberry carefully built the show? Hell no! What they see is this (TNG screenshot used because TOS did not have a suitable maritime episode).

And then they think: "Wow! I want to be exactly like that! Parading around in a funny hat while everyone obeys my orders without question is my life's mission from now on. And I get to have sexy space sex with hot sexy space ladies of sex whenever I want. This appeals to me even more profoundly than Atlas Shrugged." Some of them might go on to watch Master and Commander and shed tears upon realizing that they cant publicly flog employees for failing to salute their superiors. Yet. Expect this to be made legal in Silicon Valley any day now.

Liking Kirk is not in any way a bad thing. Wanting to "be just like Kirk" is, because in the real world running a business like Kirk runs the Enterprise is a terrible way to do things. An example will illustrate this nicely. Let's imagine a random episode where the Enterprise has gotten into trouble. Eventually Kirk will call for Scotty and tell him: "You need to <babble> <babble> <babble>." Scotty will then reply with a varying level of scottish accent something like: "I cannae change the laws of physics, captain". Kirk will then say the same thing again, just more aggressively and in a close up shot. Scotty replies with "Well in that case I can get it done in sixty minutes." Kirk counters with: "You have five." And thus the problem is solved. Kirk gets a commendation for incredible valour under stress while Scotty, who did all of the actual work, is never mentioned.

In "Kirk style" management the Big Boss tells his underlings what to do. If they try to give any sort of feedback, the Boss ignores everything and just repeats his original orders again and again until the other party yields. The only reason underlings ever resist The Vision is that they are lazy and it is the job of the manager to put them in their place. This seems like a bit from a comedy show, but I have unfortunately worked under bosses like this. Either you try to talk at least some sense into them, fail, get labeled as a "not team player", watch the project crash and get blamed for the failure, or you try to do the impossible task given to you, fail watch the project crash and get blamed for the failure.

Another major problem with Kirk is that due to the way tv shows and movies need to be structured, he is actually an obsessive gambler. The stakes must always get higher and the ways to get out of trouble must become ever crazier. Kirk will break any and all laws and regulations he sees fit and then, once he has succeeded, no disciplinary action is taken. The ends justify the means. Idolising this sort of behaviour leads to thinking that wild one-in-a-million gambits will succeed at least 99 times out of a hundred. And even if it fails, you can get out of it by betting everything on an even bigger gamble. The real world does not work like that. Reality is not a story and you are not its hero. It does not owe you eternal, or even eventual, success. Had Stockton Rush survived his death trap, he would most likely have faced criminal charges and, if convicted, gone to jail.

The myth of the existence of the professional manager

Let's make one last detour in the 1800s and assume that the 7th Earl of Sidcup or some such really wants to get his idiot son instated as a captain. He and contacts the appropriate naval officers.

"My offspring needs to become a captain of a ship post haste!"

"Well first he has to become a midshipman and ..."

"Phah! None of that nonsense. It's way too slow and not becoming of my statute. Also my son is 35 years old so the post of a midshipman would be beneath him."

"I see. Well what sort of prior naval experience does he have?"

"None."

"Has he even ever been out to sea?"

"Not to my knowledge. But that does not matter. He is highly skilled in using the abacus excelius to compute annual budgets."

"By himself?"

"Of course not. That is what secretaries are for. He just gives them orders. That he can do. And that is all that matters. Same as in sailing."

This person is unlikely to get his wish with this line of reasoning. On the other hand in modern business life this is common. For example when startups get VC money, a common requirement is that they need to get a "proper manager" as a CEO. Typically this means the investor's friend, and more often than not an MBA.

Contrary to common belief, having an MBA does not make you incompetent at managing a technical company (though there is a strong positive correlation). It is entirely possible to be a good manager on a field you have no personal experience in. You just have to have a lot of humility, listen (actually, properly listen) to your employees and let the people with hands-on experience make the most technical, product and development decisions. In other words you have to be the enabler, not the maverick decision maker. People with these sorts of personality traits are rare and typically their career choices steer as far away from getting an MBA as possible.

The absolute worst thing happens if the CEO in question combines the (lack of) skills of an MBA with the attitude of Kirk. That leads to incompetent decisions based on willful ignorance, executed with the fury of an egomaniac who refuses to even entertain the notion that they might be wrong. Further, any person inside the organisation who dares to point out potential flaws in the plan will soon find themselves outside said organisation. Disagreement is treason. Treason shall not go unpunished.

In the 1800s the British navy could be said to be the best in the world. It seems plausible that one component of this success was the requirement that the officers running their ships had to have actual experience operating the ship. Not looking at other people operating it. Not pretending to read about operating it for a test. Actually doing it. If we look around how MBA wielding sociopath CEOs are enshittifying absolutely everything about the tech industry, bringing this requirement back into active use starts to feel awfully tempting.

Epilogue: Why doesn't everything immediately explode?

A reasonable counterpoint to everything written above would be that if managers truly are that bad, shouldn't all those companies be bankrupt by now? In an ideal world they would be, but there are opposing forces that keep them going.

The first one is that all corporations have inertia. If you took an established major company and actively started to mismanage it to death, it would still take years for things to eventually collapse. 

The second one is a dirty little secret. Many employees care more about the product they work on than "corporate visions" that seem to stem from overuse of peyote. They don't blindly obey idiotic commands but instead try to make things silently work within the system. Basically this means that corporations thrive despite their mad kings, not because of them. I know several people who have worked in these kinds of organizations and this is not as rare of an occurrence as one might imagine.

I have also experienced it personally. Years ago I was at a company, whose CEO (who, to the best of my knowledge, did not have an MBA) wanted to change the company's product so that it would do a specific thing X. Everybody thought this was a horrible idea and tried to reason with him using solid business and technical reasons (which turned out to be 100% correct). That failed. Spectacularly. This lead to an eternal series of secret meetings. The participants were main developers and all managers except the CEO. The only item on the agenda was "How can we make the CEO think that we did what he ordered while doing the exact opposite".

Eventually we did succeed, but boy was that a surreal couple of weeks.

Monday, June 8, 2026

Faking keyword arguments to functions in C++

One of the many nice language features in Python are keyword arguments. They make some types of APIs  concise and readable. Like so:

Unfortunately C does not have keyword arguments and, by extension, neither does C++. Adding them as a language feature would take 15-20 years of effort, most of which would consist of trying to convince people via email that such a feature is important and should be added.

There have been attempts to implement this via macros and template magic (link), but they have not seen widespread usage probably because they are using macros and template magic. However it turns out that with modern language features you can fake keyword arguments fairly convincingly. Like so:

The add_argument method takes a single argument which is a struct. The extra curly braces inside the parentheses boil down to "whatever the underlying argument is, construct it in place with these parameters". The dotted names are designated initializers, so those fields get the specified value whereas other fields get their default values.

And there you go, keyword arguments in C++. You just have to squint a bit and pretend not to see the extra curly braces.

Tuesday, April 21, 2026

CapyPDF is approaching feature sufficiency

In the past I have written many blog posts on implementing various PDF features in CapyPDF. Typically they explain the feature being implemented, how confusing the documentation is, what perverse undocumented quirks one has to work around to get things working and so on. To save the effort of me writing and you reading yet another post of the same type, let me just say that you can now use CapyPDF to generate PDF forms that have widgets like text fields and radio buttons.

What makes this post special is that forms and widget annotations were pretty much the last major missing PDF feature Does that mean that it supports everything? No. Of course not. There is a whole bunch of subtlety to consider. Let's start with the fact that the PDF spec is massive, close to 1000 pages. Among its pages are features that are either not used or have been replaced by other features and deprecated.

The implementation principle of CapyPDF thus far has been "implement everything that needs special tracking, but only to the minimal level needed". This seems complicated but is in fact quite simple. As an example the PDF spec defines over 20 different kinds of annotations. Specifying them requires tracking each one and writing out appropriate entries in the document metadata structures. However once you have implemented that for one annotation type, the same code will work for all annotation types. Thus CapyPDF has only implemented a few of the most common annotations and the rest can be added later when someone actually needs them.

Many objects have lots of configuration options which are defined by adding keys and values to existing dictionaries. Again, only the most common ones are implemented, the rest are mostly a matter of adding functions to set those keys. There is no cross-referencing code that needs to be updated or so on. If nobody ever needs to specify the color with which a trim box should be drawn in a prepress preview application, there's no point in spending effort to make it happen.

The API should be mostly done, especially for drawing operations. The API for widgets probably needs to change. Especially since form submission actions are not done. I don't know if anything actually uses those, though. That work can be done based on user feedback.

Friday, April 17, 2026

Multi merge sort, or when optimizations aren't

In our previous episode we wrote a merge sort implementation that runs a bit faster than the one in stdlibc++. The question then becomes, could it be made even faster. If you go through the relevant literature one potential improvement is to do a multiway merge. That is, instead of merging two arrays into one, you merge four into one using, for example, a priority queue.

This seems like a slam dunk for performance.

  • Doubling the number of arrays to merge at a time halves the number of total passes needed
  • The priority queue has a known static maximum size, so it can be put on the stack, which is guaranteed to be in the cache all the time
  • Processing an element takes only log(#lists) comparisons
Implementing multimerge was conceptually straightforward but getting all the gritty details right took a fair bit of time. Once I got it working the end result was slower. And not by a little, either, but more than 30% slower. Trying some optimizations made it a bit faster but not noticeably so.

Why is this so? Maybe there are bugs that cause it to do extra work? Assuming that is not the case, what actually is? Measuring seems to indicate that a notable fraction of the runtime is spent in the priority queue code. Beyond that I got very little to nothing.

The best hypotheses I could come up with has to with the number of comparisons made. A classical merge sort does two if statements per output elements. One to determine which of the two lists has a smaller element at the front and one to see whether removing the element exhausted the list. The former is basically random and the latter is always false except when the last element is processed. This amounts to 0.5 mispredicted branches per element per round.

A priority queue has to do a bunch more work to preserve the heap property. The first iteration needs to check the root and its two children. That's three comparisons for value and two checks whether the children actually exist. Those are much less predictable than the comparisons in merge sort. Computers are really efficient at doing simple things, so it may be that the additional bookkeeping is so expensive that it negates the advantage of fewer rounds.

Or maybe it's something else. Who's to say? Certainly not me. If someone wants to play with the code, the implementation is here. I'll probably delete it at some point as it does not have really any advantage over the regular merge sort.

Monday, April 6, 2026

Sorting performance rabbit hole

In an earlier blog post we found out that Pystd's simple sorting algorithm implementations were 5-10% slower than their stdlibc++ counterparts. The obvious follow up nerd snipe is to ask "can we make the Pystd implementation faster than stdlibc++?"

For all tests below the data set used was 10 million consecutive 64 bit integers shuffled in a random order. The order was the same for all algorithms.

Stable sort

It turns out that the answer for stable sorting is "yes, surprisingly easily". I made a few obvious tweaks (whose details I don't even remember any more) and got the runtime down to 0.86 seconds. This is approximately 5% faster than std::stable_sort. Done. Onwards to unstable sort.

Unstable sort

This one was not, as they say, a picnic. I suspect that stdlib developers have spent more time optimizing std::sort than std::stable_sort simply because it is used a lot more.

After all the improvements I could think of were done, Pystd's implementation was consistently 5-10% percent slower. At this point I started cheating and examined how stdlibc++'s implementation worked to see if there are any optimization ideas to steal. Indeed there were, but they did not help.

Pystd's insertion sort moves elements by pairwise swaps. Stdlibc++ does it by moving the last item to a temporary, shifting the array elements onwards and then moving the stored item to its final location. I implemented that. It made things slower.

Stdlibc++'s moves use memmove instead of copying (at least according to code comments).  I implemented that. It made things slower.

Then I implemented shell sort to see if it made things faster. It didn't. It made them a lot slower. So did radix sort.

Then I reworked the way pivot selection is done and realized that if you do it in a specific way, some elements move to their correct partitions as a side effect of median selection. I implemented that and it did not make things faster. It did not make them slower, either, but the end result should be more resistant against bad pivot selection so I left it in.

At some point the implementation grew a bug which only appeared with very large data sets. For debugging purposes I reduce the limit where introsort switches from qsort to insertion sort from 16 to 8. I got the bug fixed but the change made sorting a lot slower. As it should.

But this raises a question, namely would increasing the limit from 16 to 32 make things faster? It turns out that it did. A lot. Out of all perf improvements I implemented, this was the one that yielded the biggest improvement by a fairly wide margin. Going to 64 elements made it even faster, but that made other algorithms using insertion sort slower, so 32 it is. For now at least.

After a few final tweaks I managed to finally beat stdlibc++. By how much you ask? Pystd's best observed time was 0.754 seconds while stdlibc++'s was 0.755 seconds. And it happened only once. But that's enough for me.

Monday, March 23, 2026

Everything old is new again: memory optimization

At this point in history, AI sociopaths have purchased all the world's RAM in order to run their copyright infringement factories at full blast. Thus the amount of memory in consumer computers and phones seems to be going down. After decades of not having to care about memory usage, reducing it has very much become a thing.

Relevant questions to this state of things include a) is it really worth it and b) what sort of improvements are even possible. The answers to these depend on the task and data set at hand. Let's examine one such case. It might be a bit contrived, unrepresentative and unfair, but on the other hand it's the one I already had available.

Suppose you have to write script that opens a text file, parses it as UTF-8, splits it into words according to white space, counts the number of time each word appears and prints the words and counts in decreasing order (most common first).

The Python baseline

This sounds like a job for Python. Indeed, an implementation takes fewer than 30 lines of code. Its memory consumption on a small text file [update: repo's readme, which is 1.3k] looks like this.

Peak memory consumption is 1.3 MB. At this point you might want to stop reading and make a guess on how much memory a native code version of the same functionality would use.

The native version

A fully native C++ version using Pystd requires 60 lines of code to implement the same thing. If you ignore the boilerplate, the core functionality fits in 20 lines. The steps needed are straightforward:

  1. Mmap the input file to memory.
  2. Validate that it is utf-8
  3. Convert raw data into a utf-8 view
  4. Split the view into words lazily
  5. Compute the result into a hash table whose keys are string views, not strings

The main advantage of this is that there are no string objects. The only dynamic memory allocations are for the hash table and the final vector used for sorting and printing. All text operations use string views , which are basically just a pointer + size.

In code this looks like the following:

Its memory usage looks like this.

Peak consumption is ~100 kB in this implementation. It uses only 7.7% of the amount of memory required by the Python version.

Isn't this a bit unfair towards Python?

In a way it is. The Python runtime has a hefty startup cost but in return you get a lot of functionality for free. But if you don't need said functionality, things start looking very different.

But we can make this comparison even more unfair towards Python. If you look at the memory consumption graph you'll quite easily see that 70 kB is used by the C++ runtime. It reserves a bunch of memory up front so that it can do stack unwinding and exception handling even when the process is out of memory. It should be possible to build this code without exception support in which case the total memory usage would be a mere 21 kB. Such version would yield a 98.4% reduction in memory usage.