Thursday, February 27, 2025

The price of statelessness is eternal waiting

Most CI systems I have seen have been stateless. That is, they start by getting a fresh Docker container (or building one from scratch), doing a Git checkout, building the thing and then throwing everything away. This is simple and mathematically pure, but really slow. This approach is further encouraged by the fact that in cloud computing CPU time and network transfers are cheap but storage is expensive (or at least it is possible to get almost infinite CI build time for open source projects but not persistent storage). This is probably because the cloud vendor needs to take care of things like backups, can no longer dispatch the task to just any machine on the planet but only to the one that already has the required state, and so on.

How much could you reduce resource usage (or, if you prefer, improve CI build speed) by giving up on statelessness? Let's find out by running some tests. To get a reasonably large code base I used LLVM. I did not actually use any cloud or Docker in the tests, but I simulated them on a local media PC. I used 16 cores to compile and 4 to link (any more would saturate the disk). Tests were not run.

Baseline

Building a Docker image with all the build deps takes a few minutes. Alternatively you can prebuild it, but then you need to download a 1 GB image.

Doing a full Git checkout would be wasteful. There are basically three different ways of doing a partial checkout: shallow clone, blobless and treeless. They take the following amount of time and space:

  • shallow: 1m, 259 MB
  • blobless: 2m 20s, 961 MB
  • treeless: 1m 46s, 473 MB
Doing a full build from scratch takes 42 minutes.

With CCache

Using CCache in Docker is mostly a question of bind mounting a persistent directory at the container's cache directory location. A from-scratch build with an up-to-date CCache takes 9m 30s.

With stashed Git repo

Just like the CCache dir, the Git checkout can also be persisted outside the container. Doing a git pull on an existing full checkout takes only a few seconds. You can even mount the repo dir read only to ensure that no state leaks from one build invocation to another.

With Danger Zone

One main thing a CI build ensures is that the code keeps on building when compiled from scratch. It is quite possible to have a bug in your build setup that manifests itself so that the build succeeds if a build directory has already been set up, but fails if you try to set it up from scratch. This was especially common back in ye olden times when people used to both write Makefiles by hand and to think that doing so was a good idea.

Nowadays build systems are much more reliable and this is not such a common issue (though it can definitely still occur). So what if you were willing to give up full from-scratch checks on merge requests? You could, for example, still have a daily build that validates that use case. For some organizations this would not be acceptable, but for others it might be a reasonable tradeoff. After all, why should a CI build take noticeably longer than an incremental build on the developer's own machine? If anything it should be faster, since servers are a lot beefier than developer laptops. So let's try it.

The implementation for this is the same as for CCache: you just persist the build directory as well. To run the build you do a Git update, mount the repo, build dir and optionally the CCache dir into the container, and go.

I tested this by doing a git pull on the repo and then doing a rebuild. There were a couple of new commits, so this should be representative of real-world workloads. An incremental build took 8m 30s whereas a from-scratch rebuild using a fully up-to-date cache took 10m 30s.

Conclusions

The amount of wall clock time used by each of the three main approaches was:

  • Fully stateless
    • Image building: 2m
    • Git checkout: 1m
    • Build: 42m
    • Total: 45m
  • Cached from-scratch
    • Image building: 0m (assuming it is not "apt-get update"d for every build)
    • Git checkout: 0m
    • Build: 10m 30s
    • Total: 10m 30s
  • Fully cached
    • Image building: 0m
    • Git checkout: 0m
    • Build: 8m 30s
    • Total: 8m 30s
Similarly the amount of data transferred was:

  • Fully stateless
    • Image: 1G
    • Checkout: 300 MB
  • Cached from-scratch:
    • Image: 0
    • Checkout: O(changes since last pull), typically a few kB
  • Fully cached
    • Image: 0
    • Checkout: O(changes since last pull)
The differences are quite clear. Just by using CCache the build time drops by almost 80%. Persisting the build dir as well shaves off a further two minutes, or roughly 20%. It turns out that having machines dedicated to a specific task can be a lot more efficient than rebuilding the universe from atoms every time. Fancy that.

The final 2 minute improvement might not seem like that much, but on the other hand do you really want your developers to spend 2 minutes twiddling their thumbs for every merge request they create or update? I sure don't. Waiting for CI to finish is one of the most annoying things in software development.

Monday, February 10, 2025

C++ compiler daemon testing tool

In an earlier blog post I wrote about a potential way of speeding up C++ compilations (or those of any language that has a big up-front cost). The basic idea is to have a suspended daemon process that has already read in all the stdlib header code. Compilations are done by sending the actual source file + flags to this process, which then forks and resumes compilation. Basically this is a way to persist the state of the compiler without writing (or executing) a single line of serialization code.

The obvious follow-up question is what the speedup of this scheme would be. That is difficult to say without actually implementing the system. There are way too many variables and uncertainties to make any sort of reasonable estimate.

So I implemented it. 

Not in an actual compiler, heavens no, I don't have the chops for that. Instead I implemented a completely fake compiler that does the same steps a real compiler would need to take. It spawns the daemon process. It creates a Unix domain socket. It communicates with the running daemon. It produces output files. The main difference is that it does not actually do any compilation, instead it just sleeps to simulate work. Since all sleep durations are parameters, it is easy to test the "real" effect of various schemes.
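
To give a feel for the overall shape of such a daemon, here is a hand-rolled sketch of the idea. It is not the code in the repo linked below; the socket path, message format and sleep durations are all made up for illustration.

#include <sys/socket.h>
#include <sys/un.h>
#include <sys/wait.h>
#include <unistd.h>
#include <chrono>
#include <cstring>
#include <string>
#include <thread>

int main() {
    using namespace std::chrono_literals;
    // Simulate the expensive one-time work, i.e. parsing all the stdlib headers.
    std::this_thread::sleep_for(4s);

    // Listen for compilation requests on a Unix domain socket.
    int listener = socket(AF_UNIX, SOCK_STREAM, 0);
    sockaddr_un addr{};
    addr.sun_family = AF_UNIX;
    std::strcpy(addr.sun_path, "/tmp/fakecompiler.sock"); // made-up path
    unlink(addr.sun_path);
    bind(listener, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    listen(listener, 16);

    for(;;) {
        int conn = accept(listener, nullptr, nullptr);
        if(conn < 0)
            continue;
        if(fork() == 0) {
            // The child inherits the "already parsed headers" state for free
            // and pretends to compile the file named in the request.
            char request[4096];
            ssize_t n = read(conn, request, sizeof(request) - 1);
            if(n > 0) {
                request[n] = '\0';
                std::this_thread::sleep_for(1s); // the per-file work that remains
                std::string reply = std::string("compiled ") + request + "\n";
                write(conn, reply.c_str(), reply.size());
            }
            close(conn);
            _exit(0);
        }
        close(conn);
        // Reap finished children so they do not accumulate as zombies.
        while(waitpid(-1, nullptr, WNOHANG) > 0) {
        }
    }
}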

The code is in this GH repo.

The default durations were handwavy estimates based on past experience. In past measurements, stdlib includes take by far the largest chunk of the total compilation time. Thus I estimated that compilation without this scheme would take 5 seconds per file whereas compilations with it would take 1 second. If you disagree with these assumptions, feel free to run the test yourself with your own time estimates.

The end result was that on this laptop that has 22 cores a project with 100 source files took 26 seconds to compile without the daemon and 7 seconds with it. This means the daemon version finished in just a hair over 25% of a "regular" build.
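
As a back-of-the-envelope sanity check: 100 files spread over 22 cores means five waves of compile jobs, so roughly 5 × 5 s = 25 s without the daemon and 5 × 1 s = 5 s of compilation plus the daemon's one-time startup cost with it. That lines up reasonably well with the measured 26 and 7 seconds.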

Wouldn't you want your compilations to finish in a quarter of the time with zero code changes? I sure would.

(In reality the speedup is probably less than that. How much? No idea. Someone's got to implement that to find out.)

Tuesday, February 4, 2025

The trials and tribulations of supporting CJK text in PDF

In the past I may have spoken critically of Truetype fonts and their usage in PDF files. Recently I have come to the conclusion that the criticism may have been too harsh and that Truetype fonts are actually somewhat nice. Why? Because I have had to add support for CFF fonts to CapyPDF. This is a font format that comes from Adobe. It encodes textual PostScript drawing operations into binary bytecode. Wikipedia does not give dates, but it seems to have been developed in the late 80s - early 90s. The name CFF is an abbreviation for "complicated font format".

Double-checks notes.

Compact font format. Yes, that is what I meant to write. Most people reading this have probably never even seen a CFF file, so you might be asking why supporting CFF fonts is even a thing nowadays. It's all quite simple. Many of the Truetype (and especially OpenType) fonts you see are not actually Truetype fonts. Instead they are Transfontners, glyphs in disguise. It is entirely valid to have a Truetype font that is merely an envelope holding a CFF font. As an example, the Noto CJK fonts are like this. Aggregation of different formats is common in font files, and it is the main reason OpenType fonts have like four different and mutually incompatible ways of specifying color emoji. None of the participating entities were willing to accept anyone else's format, so the end result was to add all of them. If you want Asian language support, you have to dive into the bowels of the CFF rabid hole.

As most people probably do not have sufficient historical perspective, let's start by listing out some major computer science achievements that definitely existed when CFF was being designed.

  • File format magic numbers
  • Archive formats that specify both the offset and size of the elements within
  • Archive formats that afford access to their data in O(number of items in the archive) rather than O(number of bytes in the file)
  • Data compression
CFF chooses to not do any of this nonsense. It also does not believe in consistent offset types. Sometimes the offsets within data objects refer to other objects by their order in the index they are in. Sometimes they refer to number of bytes from the beginning of the file. Sometimes they refer to number of bytes from the beginning of the object the offset data is written in. Sometimes it refers to something else. One of the downsides of this is that while some of the data is neatly organized into index structures with specified offsets, a lot of it is just free floating in the file and needs the equivalent of three pointer dereferences to access.

Said offsets are stored with a variable width encoding like so:

This makes writing subset CFF font files a pain. In order to write an offset value at some location X, you first must serialize everything up to that point to know where the value would be written. To know the value to write you have to serialize the entire font up to the point where that data is stored. Typically the data comes later in the file than its offset location. You know what that means? Yes, storing all these index locations and hotpatching them afterwards once you find out where the actual data pointed to ended up. Be sure to compute your patching locations correctly lest you end up in lengthy debugging sessions where your subset font files do not render correctly. In fairness all of the incorrect writes were within the data array and thus 100% memory safe, and, really, isn't that the only thing that actually matters?
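
In practice this ends up as a two-pass scheme: write placeholder bytes, remember where they went, and patch in the real value once it is known. A minimal sketch of the idea, with made-up names rather than CapyPDF's actual serialization code:

#include <cstddef>
#include <cstdint>
#include <vector>

struct FontWriter {
    struct PendingOffset {
        size_t patch_location; // where in the output the offset bytes sit
        int offset_size;       // how many big endian bytes the offset uses (1-4)
    };

    std::vector<uint8_t> output;
    std::vector<PendingOffset> pending;

    // Reserve room for an offset whose final value is not yet known.
    size_t write_offset_placeholder(int offset_size) {
        pending.push_back(PendingOffset{output.size(), offset_size});
        output.insert(output.end(), static_cast<size_t>(offset_size), uint8_t{0});
        return pending.size() - 1;
    }

    // Once the pointed-to data has been serialized and its location is known,
    // go back and write the real value. Get patch_location wrong and nothing
    // crashes, the subset font just renders garbage.
    void patch_offset(size_t id, uint32_t value) {
        const PendingOffset &p = pending.at(id);
        for(int i = 0; i < p.offset_size; ++i) {
            const int shift = 8 * (p.offset_size - 1 - i);
            output.at(p.patch_location + i) = uint8_t((value >> shift) & 0xff);
        }
    }
};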

One of the main data structures in a CFF file is a font dictionary stored in, as the docs say, "key-value pairs". This is not true. The "key-value dictionary" is neither key-value nor is it a dictionary. The entries must come in a specific order (sometimes) so it is not a dictionary. The entries are not stored as key-value pairs but as value-key pairs. The more accurate description of "value-key somewhat ordered array" does lack some punch so it is understandable that they went with common terminology. The backwards ordering of elements to some people confusion bring might, but it perfect sense makes, as the designers of the format a long history with PostScript had. Unknown is whether some of them German were.

Anyhow, after staring directly into the morass of madness for a sufficient amount of time the following picture emerges.

Final words

The CFF specification document contains the data needed to decipher CFF data streams in a nice tabular format, which would be easy to convert to an enum. Trying to copypaste it fails with an error message saying that the file prohibits copypasting. This is a bit rich coming from Adobe, whose current stance seems to be that they can take any document opened with their apps and use it for AI training. I'd like to conclude this blog post by sending the following message to the (assumed) middle manager who made the decision that publicly available specification documents should prohibit copypasting:

YOU GO IN THE CORNER AND THINK ABOUT WHAT YOU HAVE DONE! AND DON'T EVEN THINK ABOUT COMING BACK UNTIL YOU ARE READY TO APOLOGIZE TO EVERYONE FOR YOUR ACTIONS!

Tuesday, January 14, 2025

Measuring code size and performance

Are exceptions faster and/or bloatier than using error codes? Well...

The traditional wisdom is that exceptions are faster when not taken, slower when taken and lead to more bloated code. On the other hand there are cases where using exceptions makes code a lot smaller. In embedded development, even, where code size is often the limiting factor.

Artificial benchmarks aside, measuring the effect on real world code is fairly difficult. Basically you'd need to implement the exact same, nontrivial piece of code twice. One implementation would use exceptions, the other would use error codes but they should be otherwise identical. No-one is going to do that for fun or even idle curiosity.

CapyPDF has been written exclusively using C++23's new std::expected object for error handling. As every Go programmer knows, typing error checks over and over again is super annoying. Very early on I wrote macros to autopropagate errors. That raises an interesting question, namely could you commit horrible macro crimes to make the error handling use either error objects or exceptions?
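
What would such a crime even look like? Perhaps something along these lines. This is a made-up sketch with invented macro and flag names, not the actual macros used in CapyPDF; the same source compiles either into std::expected propagation or into plain values plus throws depending on a compile time switch.

#include <expected>
#include <stdexcept>
#include <string>
#include <utility>

#ifdef CAPY_USE_EXCEPTIONS // hypothetical switch, named for illustration only

// Exception build: functions return plain values and throw on failure,
// so "propagating" an error is just a function call.
#define RETVAL(T) T
#define ERC(var, expr) auto var = (expr)
#define RETERR(msg) throw std::runtime_error(msg)

#else

// Error object build: functions return std::expected and the macro
// hand-propagates failures to the caller.
#define RETVAL(T) std::expected<T, std::string>
#define ERC(var, expr) \
    auto var##_res = (expr); \
    if(!var##_res) \
        return std::unexpected(std::move(var##_res).error()); \
    auto var = std::move(*var##_res)
#define RETERR(msg) return std::unexpected(std::string(msg))

#endif

// The same code then builds in both modes.
RETVAL(int) parse_digit(const std::string &s) {
    if(s.size() != 1 || s[0] < '0' || s[0] > '9')
        RETERR("not a digit");
    return s[0] - '0';
}

RETVAL(int) parse_and_double(const std::string &s) {
    ERC(num, parse_digit(s));
    return 2 * num;
}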

It turns out that yes you can. After a thorough scrubbing of the ensuing shame from your body and soul you can start doing measurements. To get started I built and ran CapyPDF's benchmark application with the following option combinations:

  • Optimization -O1, -O2, -O3, -Os
  • LTO enabled and disabled
  • Exceptions enabled and disabled
  • RTTI enabled and disabled
  • NDEBUG enabled and disabled
The measurements are the stripped size of the resulting shared library and runtime of the test executable. The code and full measurement data can be found in this repo. The code size breakdown looks like this:

Performance goes like this:

Some interesting things to note:

  • The fastest runtime is 0.92 seconds with O3-lto-rtti-noexc-ndbg
  • The slowest is 1.2s with Os-nolto-rtti-noexc-ndbg
  • If we ignore Os, the slowest is 1.07s with O1-nolto-rtti-noexc-ndbg
  • The largest code is 724 kB with O3-nolto-nortti-exc-nondbg
  • The smallest is 335 kB with Os-lto-nortti-noexc-ndbg
  • Ignoring Os, the smallest is 470 kB with O1-lto-nortti-noexc-ndbg
Things noticed via eyeballing

  • Os leads to noticeably smaller binaries at the cost of performance
  • O3 makes binaries a lot bigger in exchange for a fairly modest performance gain
  • NDEBUG makes programs both smaller and faster, as one would expect
  • LTO typically improves both speed and code size
  • The fastest times for O1, O2 and O3 are within a few percent of each other, at 0.95, 0.94 and 0.92 seconds, respectively

Caveats

At the time of writing the upstream code uses error objects even when exceptions are enabled. To replicate these results you need to edit the source code.

The benchmark does not actually raise any errors. This test only measures the golden path.

The tests were run with GCC 14.2 on x86_64 Ubuntu 24.10.

Monday, December 23, 2024

CapyPDF 0.14 is out

I have just released version 0.14 of CapyPDF. This release has a ton of new functionality. So much, in fact, that I don't even remember it all. The reason for this is that it is actually starting to see real world usage, specifically as the new color managed PDF exporter for Inkscape. It has required a lot of refactoring work in the color code of Inkscape proper. This work has been done mostly by Doctormo, who has several videos on the issue.

The development cycle has consisted mostly of him reporting missing features like "specifying page labels is not supported", "patterns can be used for fill, but not for stroke" and "loading CMYK TIFF images with embedded color profiles does not work" and me then implementing said features or finding out how setjmp/longjmp actually works and debugging corrupted stack traces when it doesn't.

Major change coming in the next version

The API for CapyPDF is not stable, but in the next release it will be extra unstable. The reason is C strings. Null terminated UTF-8 strings are a natural text format for PDF, as strings in PDF must not contain the zero glyph. Thus there are many functions like this in the public C API:

void do_something(const char *text);

This works and is simple, but there is a common use case it can't handle. All strings must be zero terminated, so you can't point to the middle of an existing buffer, because it is not guaranteed to be zero terminated. Thus you always have to make a copy of the text you want to pass. In other words, you can't use C++'s string_view (or any equivalent string type) as a source of text data. The public API should support this use case.

Is this premature optimization? Maybe. But it is also a usability issue, as string views seem to be fairly common nowadays. There does not seem to be a perfect solution, but the best one I managed to crib seems to be to do this:

void do_something(const char *text, int32_t len_or_negative);

If the last argument is positive, use it as the length of the buffer. If it is negative, treat the char data as a zero terminated plain string. This requires changing all functions that take strings and makes the API more unpleasant to use.
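
As a caller-side sketch (reusing the placeholder do_something from above), the two cases might look like this:

#include <cstdint>
#include <string_view>

void do_something(const char *text, int32_t len_or_negative); // as proposed above

void example(std::string_view whole_document) {
    // Point into the middle of an existing buffer without copying;
    // the explicit length means no zero terminator is needed.
    std::string_view word = whole_document.substr(0, 5);
    do_something(word.data(), static_cast<int32_t>(word.size()));

    // Plain zero terminated C strings keep working by passing a negative length.
    do_something("hello", -1);
}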

If someone has an idea for a better API, do post a comment here.

Tuesday, December 17, 2024

Meson build definitions merged into Git's git repo

The developers of Git have been considering switching build systems for a while. No definitive decision has been made as of yet, but they have merged Meson build definitions into the main branch. Thus it is now possible, and even semi-supported, to develop and build Git with Meson instead of the vintage Makefile setup (which, AFAICT, remains the default build system for now).

The most interesting thing about this conversion is that the devs were very thorough in their evaluation of all the different possibilities. Those who are interested in the details or are possibly contemplating a build system switch on their own are recommended to read the merge's commit message.

Huge congratulations to everyone involved and thank you for putting in the work (FTR I did not work on this myself).

Friday, December 13, 2024

CMYK me baby one more time!

Did you know that Jpeg supports images in the CMYK colorspace? And that people are actually using them in the wild? This being the case, I needed to add support for them to CapyPDF. The development steps are quite simple: first you create a CMYK Jpeg file, then you create a test document that embeds it and finally you look at the result in a PDF renderer.

Off to a painter application then. This is what the test image looks like.

Then we update the Jpeg parsing code to detect CMYK images and write the corresponding metadata to the output PDF. What does the end result look like then?

Aaaaand now we have a problem. Specifically one of arbitrary color remapping. It might seem that this is just a case of inverted colors. It's not (I checked); something weirder is going on. For reference, Acrobat Reader's output looks identical.

At this point, rather than poking things at random and hoping for the best, a good strategy is to get more test data. Since Scribus is pretty much the gold standard of print quality PDF production, I went about recreating the test document in it.

Which failed immediately on loading the image.

Here we have Gwenview and Scribus presenting their interpretations of the exact same image. If you use Scribus to generate a PDF, it will convert the Jpeg into some three channel (i.e. RGB) ICC profile.

Take-home exercise

Where is the bug (or a hole in the spec) in this case:

  • The original CMYK jpeg is correct, but Scribus and PDF renderers read it in incorrectly?
  • The original image is incorrect and Gwenview has a separate inverse bug, so the two cancel each other out?
  • The image is correct but the metadata written in the file by CapyPDF is incorrect?
  • The PDF spec has a big chunk of UB here and the final result can be anything?
  • Aliens?
I don't know the correct answer. If someone out there does, do let me know.