Tuesday, February 4, 2025

The trials and tribulations of supporting CJK text in PDF

In the past I may have spoken critically of TrueType fonts and their usage in PDF files. Recently I have come to the conclusion that I may have been too harsh and that TrueType fonts are actually somewhat nice. Why? Because I have had to add support for CFF fonts to CapyPDF. This is a font format that comes from Adobe. It encodes textual PostScript drawing operations into binary bytecode. Wikipedia does not give dates, but it seems to have been developed in the late 80s to early 90s. The CFF part of the name is an abbreviation for "complicated font format".

Double-checks notes.

Compact font format. Yes, that is what I meant to write. Most people reading this have probably never even seen a CFF file, so you might be asking why supporting CFF fonts is even a thing nowadays. It's all quite simple. Many of the TrueType (and especially OpenType) fonts you see are not actually TrueType fonts. Instead they are Transfontners, glyphs in disguise. It is entirely valid to have a TrueType font that is merely an envelope holding a CFF font. As an example, the Noto CJK fonts are like this. Aggregation of different formats is common in font files, and it is the main reason OpenType fonts have something like four different and mutually incompatible ways of specifying color emoji. None of the participating entities were willing to accept anyone else's format, so the end result was to add all of them. If you want Asian language support, you have to dive into the bowels of the CFF rabid hole.

As most people probably do not have sufficient historical perspective, let's start by listing out some major computer science achievements that definitely existed when CFF was being designed.

  • File format magic numbers
  • Archive formats that specify both the offset and size of the elements within
  • Archive formats that afford access to their data in O(number of items in the archive) rather than O(number of bytes in the file)
  • Data compression
CFF chooses not to do any of this nonsense. It also does not believe in consistent offset types. Sometimes the offsets within data objects refer to other objects by their order in the index they are in. Sometimes they refer to the number of bytes from the beginning of the file. Sometimes they refer to the number of bytes from the beginning of the object the offset data is written in. Sometimes they refer to something else. One of the downsides of this is that while some of the data is neatly organized into index structures with specified offsets, a lot of it is just free floating in the file and needs the equivalent of three pointer dereferences to access.

Said offsets are stored with a variable-width encoding.
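To make that concrete, here is a rough sketch (not actual CapyPDF code), assuming the common convention of an offSize field that declares how many bytes, from one to four, each offset occupies, with the offsets themselves stored big-endian:

    #include <cstdint>
    #include <stdexcept>
    #include <vector>

    // Sketch only: append one offset using a variable-width scheme where
    // off_size (1-4) says how many big-endian bytes the offset occupies.
    void append_offset(std::vector<uint8_t> &out, uint32_t offset, int off_size) {
        if (off_size < 1 || off_size > 4) {
            throw std::runtime_error("offSize must be between 1 and 4");
        }
        for (int i = off_size - 1; i >= 0; --i) {
            out.push_back(uint8_t((offset >> (8 * i)) & 0xffu));
        }
    }

The same value can thus occupy anywhere from one to four bytes, so nothing that comes after it knows its final position until everything before it has been serialized.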

This makes writing subset CFF font files a pain. In order to write an offset value at some location X, you first must serialize everything up to that point to know where the value would be written. To know the value to write, you have to serialize the entire font up to the point where that data is stored. Typically the data comes later in the file than its offset location. You know what that means? Yes, storing all these index locations and hotpatching them afterwards once you find out where the actual data pointed to ended up. Be sure to compute your patching locations correctly lest you end up in lengthy debugging sessions where your subset font files do not render correctly. In fairness, all of the incorrect writes were within the data array and thus 100% memory safe, and, really, isn't that the only thing that actually matters?
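As a sketch of the hotpatching approach (hypothetical helper names, not CapyPDF's actual implementation): write placeholder bytes when the offset location is reached, remember where they went, and overwrite them once the real target position is known.

    #include <cstdint>
    #include <vector>

    struct OffsetPatch {
        size_t where;  // position of the placeholder bytes in the output
        int off_size;  // how many bytes the offset occupies
    };

    struct Serializer {
        std::vector<uint8_t> out;

        // Write zero bytes for now and remember where they went.
        OffsetPatch reserve_offset(int off_size) {
            OffsetPatch p{out.size(), off_size};
            out.insert(out.end(), size_t(off_size), uint8_t(0));
            return p;
        }

        // Later, once the pointed-to data has been written, fill in the value.
        void patch_offset(const OffsetPatch &p, uint32_t value) {
            for (int i = 0; i < p.off_size; ++i) {
                out[p.where + i] =
                    uint8_t((value >> (8 * (p.off_size - 1 - i))) & 0xffu);
            }
        }
    };

Get the where field wrong and you silently clobber some other part of the buffer, which is exactly the kind of lengthy debugging session mentioned above.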

One of the main data structures in a CFF file is a font dictionary stored in, as the docs say, "key-value pairs". This is not true. The "key-value dictionary" is neither key-value nor is it a dictionary. The entries must come in a specific order (sometimes) so it is not a dictionary. The entries are not stored as key-value pairs but as value-key pairs. The more accurate description of "value-key somewhat ordered array" does lack some punch so it is understandable that they went with common terminology. The backwards ordering of elements to some people confusion bring might, but it perfect sense makes, as the designers of the format a long history with PostScript had. Unknown is whether some of them German were.
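For illustration, a toy encoder for one such entry might look like the following (a sketch assuming operands are written first and the operator byte last, and using only the five-byte integer operand form; not real CapyPDF code):

    #include <cstdint>
    #include <vector>

    // Sketch only: a dictionary entry is serialized as value(s) first,
    // followed by the operator ("key") byte at the very end.
    void append_int_operand(std::vector<uint8_t> &out, int32_t v) {
        out.push_back(29); // prefix marking a full 32-bit big-endian integer
        for (int i = 3; i >= 0; --i) {
            out.push_back(uint8_t((uint32_t(v) >> (8 * i)) & 0xffu));
        }
    }

    void append_entry(std::vector<uint8_t> &out, int32_t value, uint8_t op) {
        append_int_operand(out, value); // the value goes first...
        out.push_back(op);              // ...and the "key" comes last.
    }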

Anyhow, after staring directly into the morass of madness for a sufficient amount of time, the following picture emerges.

Final words

The CFF specification document contains the data needed to decipher CFF data streams in a nice tabular format, which would be easy to convert to an enum. Trying to copy it fails with an error message saying that the file has prohibited copypasting. This is a bit rich coming from Adobe, whose current stance seems to be that they can take any document opened with their apps and use it for AI training. I'd like to conclude this blog post by sending the following message to the (assumed) middle manager who made the decision that publicly available specification documents should prohibit copypasting:

YOU GO IN THE CORNER AND THINK ABOUT WHAT YOU HAVE DONE! AND DON'T EVEN THINK ABOUT COMING BACK UNTIL YOU ARE READY TO APOLOGIZE TO EVERYONE FOR YOUR ACTIONS!

Tuesday, January 14, 2025

Measuring code size and performance

Are exceptions faster and/or bloatier than using error codes? Well...

The traditional wisdom is that exceptions are faster when not taken, slower when taken, and lead to more bloated code. On the other hand, there are cases where using exceptions makes code a lot smaller, even in embedded development, where code size is often the limiting factor.

Artificial benchmarks aside, measuring the effect on real world code is fairly difficult. Basically you'd need to implement the exact same, nontrivial piece of code twice. One implementation would use exceptions, the other would use error codes, but they should be otherwise identical. No one is going to do that for fun or even out of idle curiosity.

CapyPDF has been written exclusively using C++23's new std::expected for error handling. As every Go programmer knows, typing error checks over and over again is super annoying. Very early on I wrote macros to autopropagate errors. That brings up an interesting question: could you commit horrible macro crimes to make the error handling use either error objects or exceptions?
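As a sketch of the idea (hypothetical macro and type names, not CapyPDF's actual ones), the propagation macro can expand either to a std::expected check or to essentially nothing, with failures reported by throwing instead:

    #include <expected>
    #include <stdexcept>
    #include <string>

    // Sketch only: with CAPY_USE_EXCEPTIONS defined, functions return plain
    // values and throw on failure; without it they return std::expected and
    // propagate the error object manually.
    #ifdef CAPY_USE_EXCEPTIONS
      template <typename T> using Result = T;
      #define PROPAGATE(var, expr) auto var = (expr); // a failure already threw
      #define FAIL(msg) throw std::runtime_error(msg)
    #else
      template <typename T> using Result = std::expected<T, std::string>;
      #define PROPAGATE(var, expr)                                   \
          auto var##_r = (expr);                                     \
          if (!var##_r) return std::unexpected(var##_r.error());     \
          auto &var = *var##_r;
      #define FAIL(msg) return std::unexpected(std::string(msg))
    #endif

    Result<int> parse_glyph_count(bool data_is_valid) {
        if (!data_is_valid) FAIL("bad font data");
        return 42;
    }

    Result<int> load_font(bool data_is_valid) {
        PROPAGATE(count, parse_glyph_count(data_is_valid)); // the crime happens here
        return count + 1;
    }

Switching between the two styles then requires nothing more than toggling a single compile definition.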

It turns out that yes, you can. After a thorough scrubbing of the ensuing shame from your body and soul, you can start doing measurements. To get started I built and ran CapyPDF's benchmark application with the following option combinations:

  • Optimization level: -O1, -O2, -O3, -Os
  • LTO enabled and disabled
  • Exceptions enabled and disabled
  • RTTI enabled and disabled
  • NDEBUG enabled and disabled
The measurements are the stripped size of the resulting shared library and the runtime of the test executable. The code and full measurement data can be found in this repo. The code size breakdown looks like this:

Performance goes like this:

Some interesting things to note:

  • The fastest runtime is 0.92 seconds, with O3-lto-rtti-noexc-ndbg
  • The slowest is 1.2 seconds, with Os-nolto-rtti-noexc-ndbg
  • If we ignore Os, the slowest is 1.07 seconds, with O1-nolto-rtti-noexc-ndbg
  • The largest code size is 724 kB, with O3-nolto-nortti-exc-nondbg
  • The smallest is 335 kB, with Os-lto-nortti-noexc-ndbg
  • Ignoring Os, the smallest is 470 kB, with O1-lto-nortti-noexc-ndbg
Things noticed via eyeballing

  • Os leads to noticeably smaller binaries at the cost of performance
  • O3 makes binaries a lot bigger in exchange for a fairly modest performance gain
  • NDEBUG makes programs both smaller and faster, as one would expect
  • LTO typically improves both speed and code size
  • The fastest times for O1, O2 and O3 are within a few percentage points of each other, at 0.95, 0.94 and 0.92 seconds, respectively

Caveats

At the time of writing the upstream code uses error objects even when exceptions are enabled. To replicate these results you need to edit the source code.

The benchmark does not actually raise any errors. This test only measures the golden path.

The tests were run with GCC 14.2 on x86_64 Ubuntu 24.10.