Monday, February 10, 2025

C++ compiler daemon testing tool

In an earlier blog post I wrote about a potential way of speeding up C++ compilations (or any language that has a big up-front cost). The basic idea is to have a process that reads in all stdlib header code that is suspended. Compilations are done by sending the actual source file + flags to this process, which then forks and resumes compilation. Basically this is a way to persist the state of the compiler without writing (or executing) a single line of serialization code.

The obvious follow up question is what is the speedup of this scheme. That is difficult to say without actually implementing the system. There are way too many variables and uncertainties to make any sort of reasonable estimate.

So I implemented it.

Not in an actual compiler, heavens no, I don't have the chops for that. Instead I implemented a completely fake compiler that does the same steps a real compiler would need to take. It spawns the daemon process. It creates a Unix domain socket. It communicates with the running daemon. It produces output files. The main difference is that it does not actually do any compilation, instead it just sleeps to simulate work. Since all sleep durations are parameters, it is easy to test the "real" effect of various schemes.

The code is in this GH repo.

The default durations were handwavy estimates based on past experience. In past measurements, stdlib includes take by far the largest chunk of the total compilation time. Thus I estimated that compilation without this scheme would take 5 seconds per file whereas compilations with it would take 1 second. If you disagree with these assumptions, feel free to run the test yourself with your own time estimates.

The end result was that on this laptop that has 22 cores a project with 100 source files took 26 seconds to compile without the daemon and 7 seconds with it. This means the daemon version finished in just a hair over 25% of a "regular" build.

Wouldn't you want your compilations to finish in a quarter of the time with zero code changes? I sure would.

(In reality the speedup is probably less than that. How much? No idea. Someone's got to implement that to find out.)

Tuesday, February 4, 2025

The trials and tribulations of supporting CJK text in PDF

In the past I may have spoken critically on Truetype fonts and their usage in PDF files. Recently I have come to the conclusion that it may have been too harsh and that Truetype fonts are actually somewhat nice. Why? Because I have had to add support for CFF fonts to CapyPDF. This is a font format that comes from Adobe. It encodes textual PostScript drawing operations into binary bytecode. Wikipedia does not give dates, but it seems to have been developed in the late 80s - early 90s. The name CFF is an abbeviation for "complicated font format".

Double-checks notes.

Compact font format. Yes, that is what I meant to write. Most people reading this have probably not ever even seen a CFF file so you might be asking why is supporting CFF fonts even a thing nowadays? It's all quite simple. Many of the Truetype (and especially OpenType) fonts you see are not actually Truetype fonts. Instead they are Transfontners, glyphs in disguise. It is entirely valid to have a Truetype font that is merely an envelope holding a CFF font. As an example the Noto CJK fonts are like this. Aggregation of different formats is common in font files, and the main reason OpenType fonts have like four different and mutually incompatible ways of specifying color emoji. None of the participating entities were willing to accept anyone else's format so the end result was to add all of them. If you want Asian language support, you have to dive into the bowels of the CFF rabid hole.

As most people probably do not have sufficient historical perspective, let's start by listing out some major computer science achievements that definitely existed when CFF was being designed.

File format magic numbers
Archive formats that specify both the offset and size of the elements within
Archive formats that afford access to their data in O(number of items in the archive) rather than O(number of bytes in the file)
Data compression

CFF chooses to not do any of this nonsense. It also does not believe in consistent offset types. Sometimes the offsets within data objects refer to other objects by their order in the index they are in. Sometimes they refer to number of bytes from the beginning of the file. Sometimes they refer to number of bytes from the beginning of the object the offset data is written in. Sometimes it refers to something else. One of the downsides of this is that while some of the data is neatly organized into index structures with specified offsets, a lot of it is just free floating in the file and needs the equivalent of three pointer dereferences to access.

Said offsets are stored with a variable width encoding like so:

This makes writing subset CFF font files a pain. In order to write an offset value at some location X, you first must serialize everything up to that point to know where the value would be written. To know the value to write you have to serialize the the entire font up to the point where that data is stored. Typically the data comes later in the file than its offset location. You know what that means? Yes, storing all these index locations and hotpatching them afterwards once you find out where the actual data pointed to ended up in. Be sure to compute your patching locations correctly lest you end up in lengthy debugging sessions where your subset font files do not render correctly. In fairness all of the incorrect writes were within the data array and thus 100% memory safe, and, really, isn't that the only thing that actually matters?

One of the main data structures in a CFF file is a font dictionary stored in, as the docs say, "key-value pairs". This is not true. The "key-value dictionary" is neither key-value nor is it a dictionary. The entries must come in a specific order (sometimes) so it is not a dictionary. The entries are not stored as key-value pairs but as value-key pairs. The more accurate description of "value-key somewhat ordered array" does lack some punch so it is understandable that they went with common terminology. The backwards ordering of elements to some people confusion bring might, but it perfect sense makes, as the designers of the format a long history with PostScript had. Unknown is whether some of them German were.

Anyhow, after staring directly into the morass of madness for a sufficient amount of time the following picture emerges.

Final words

The CFF specification document contains data needed to decipher CFF data streams in nice tabular format, which is easy to convert to an enum. Trying it fails with an error message saying that the file has prohibited copypasting. This is a bit rich coming from Adobe, whose current stance seems to be that they can take any document opened with their apps and use it for AI training. I'd like to conclude this blog post by sending the following message to the (assumed) middle manager who made the decision that publicly available specification documents should prohibit copypasting:

YOU GO IN THE CORNER AND THINK ABOUT WHAT YOU HAVE DONE! AND DON'T EVEN THINK ABOUT COMING BACK UNTIL YOU ARE READY TO APOLOGIZE TO EVERYONE FOR YOU ACTIONS!

Tuesday, January 14, 2025

Measuring code size and performance

Are exceptions faster and/or bloatier than using error codes? Well...

The traditional wisdom is that exceptions are faster when not taken, slower when taken and lead to more bloated code. On the other hand there are cases where using exceptions makes code a lot smaller. In embedded development, even, where code size is often the limiting factor.

Artificial benchmarks aside, measuring the effect on real world code is fairly difficult. Basically you'd need to implement the exact same, nontrivial piece of code twice. One implementation would use exceptions, the other would use error codes but they should be otherwise identical. No-one is going to do that for fun or even idle curiosity.

CapyPDF has been written exclusively using C++ 23's new expected object for error handling. As every Go programmer knows, typing error checks over and over again is super annoying. Very early on I wrote macros to autopropagate errors. That props up an interesting question, namely could you commit horrible macro crimes to make the error handling either use error objects or exceptions?

It tuns out that yes you can. After a thorough scrubbing of the ensuring shame from your body and soul you can start doing measurements. To get started I built and ran CapyPDF's benchmark application with the following option combinations:

Optimization -O1, -O2, -O3, -Os
LTO enabled and disabled
Exceptions enabled and disabled
RTTI enabled and disabled
NDEBUG enabled and disabled

The measurements are the stripped size of the resulting shared library and runtime of the test executable. The code and full measurement data can be found in this repo. The code size breakdown looks like this:

Performance goes like this:

Some interesting things to note:

The fastest runtime of 0.92 seconds with O3-lto-rtti-noexc-ndbg
The slowest is 1.2s with Os-noltto-rtti-noexc-ndbg
If we ignore Os the slowes is 1.07s O1-noltto-rtti-noexc-ndbg
The largest code is 724 kb with O3-nolto-nortt-exc-nondbg
The smallest is 335 kB with Os-lto-nortti-noexc-ndbg
Ignoring Os the smallest is 470 kB with O1-lto-nortti-noexc-ndbg

Things noticed via eyeballing

Os leads to noticeably smaller binaries at the cost of performance
O3 makes binaries a lot bigger in exchange for a fairly modest performance gain
NDEBUG makes programs both smaller and faster, as one would expect
LTO typically improves both speed and code size
The fastest times for O1, O2 and O3 are within a few percent points of each other with 0.95, 094 and 0.92 seconds, respectively

Caveats

At the time of writing the upstream code uses error objects even when exceptions are enabled. To replicate these results you need to edit the source code.

The benchmark does not actually raise any errors. This test only measures the golden path.

The tests were run on GCC 14.2 on x86_64 Ubuntu 10/24.

Monday, December 23, 2024

CapyPDF 0.14 is out

I have just released version 0.14 of CapyPDF. This release has a ton of new functionality. So much, in fact, that I don't even remember them all. The reason for this is that it is actually starting to see real world usage, specifically as the new color managed PDF exporter for Inkscape. It has required a lot of refactoring work in the color code of Inkscape proper. This work has been done mostly by Doctormo, who has several videos on the issue.

The development cycle has consisted mostly of him reporting missing features like "specifying page labels is not supported", "patterns can be used for fill, but not for stroke" and "loading CMYK TIFF images with embedded color profiles does not work" and me then implementing said features or finding out how how setjmp/longjmp actually works and debugging corrupted stack traces when it doesn't.

Major change coming in the next version

The API for CapyPDF is not stable, but in the next release it will be extra unstable. The reason is C strings. Null terminated UTF-8 strings are a natural text format for PDF, as strings in PDF must not contain the zero glyph. Thus there are many functions like this in the public C API:

void do_something(const char *text);

This works and is simple, but there is a common use case it can't handle. All strings must be zero terminated so you can't point to a middle of an existing buffer, because it is not guaranteed to be zero terminated. Thus you always have to make a copy of the text you want to pass. In other words this means that you can't use C++'s string_view (or any equivalent string) as a source of text data. The public API should support this use case.

Is this premature optimization? Maybe. But is is also a usability issue as string views seem to be fairly common nowadays. There does not seem to be a perfect solution, but the best one I managed to crib seems to be to do this:

void do_something(const char *text, int32_t len_or_negative);

If the last argument is positive, use it as the length of the buffer. If i is negative then treat the char data as a zero terminated plain string. This requires changing all functions that take strings and makes the API more unpleasant to use.

If someone has an idea for a better API, do post a comment here.

Tuesday, December 17, 2024

Meson build definitions merged into Git's git repo

The developers of Git have been considering switchibg build systems for a while. No definitive decision have been made as of yet, but they gave merged Meson build definitions in the main branch. Thus it now possible, and even semi-supported, to develop and build Git with Meson instead of the vintage Makefile setup (which, AFAICT, remains as the default build system for now).

The most interesting thing about this conversion is that the devs were very thorough in their evaluation of all the different possibilities. Those who are interested in the details or are possibly contemplating a build system switch on their own are recommended to read the merge's commit message.

Huge congratulations for everyone involved and thank you for putting in the work (FTR i did not work on this myself).

Friday, December 13, 2024

CMYK me baby one more time!

Did you know that Jpeg supports images in the CMYK colorspace? And that people are actually using them in the wild? This being the case I needed to add support to them into CapyPDF. The development steps are quite simple, first you create a CMYK Jpeg file, then you create a test document that embeds it and finally look at the result in a PDF renderer.

Off to a painter application then. This is what the test image looks like.

Then we update the Jpeg parsing code to detect cmyk images and write the corresponding metadata to the output PDF. What does then end result look like then?

Aaaaand now we have a problem. Specifically one of an arbitrary color remapping. It might seem this is just a case of inverted colors. It's not (I checked), something weirder is going on. For reference Acrobat Reader's output looks identical.

At this point rather than poke things at random and hoping for the best, a good strategy is to get more test data. Since Scribus is pretty much a gold standard on print quality PDF production I went about recreating the test document in it.

Which failed immediately on loading the image.

Here we have Gwenview and Scribus presenting their interpretations of the exact same image. If you use Scribus to generate a PDF, it will convert the Jpeg into some three channel (i.e. RGB) ICC profile.

Take-home exercise

Where is the bug (or a hole in the spec) in this case:

The original CMYK jpeg is correct, but Scribus and PDF renderers read it in incorrectly?
The original image is incorrect and Gwenview has a separate inverse bug that cancel each other out?
The image is correct but the metadata written in the file by CapyPDF is incorrect?
The PDF spec has a big chunk of UB here and the final result can be anything?
Aliens?

I don't know the correct answer. If someone out there does, do let me know.

Thursday, December 5, 2024

Compiler daemon thought experiment

According to information I have picked up somewhere (but can't properly confirm via web searches ATM) there was a compiler in the 90s (the IBM VisualAge compiler maybe?) which had a special caching daemon mode. The basic idea was that you would send your code to that process and then it could return cached compile results without needing to reparse and reprocess same bits of code over and over. A sort of an in-compiler CCache, if you will. These compilers no longer seem to exist, probably because you can't just send snippets of code to be compiled, you have to send the entire set of code up to the point you want to compile. If it is different, for example because some headers are included in a different order, the results can not be reused. You have to send everything over and at that point it becomes distcc.

I was thinking about this some time ago (do not ask why, I don't know) and while this approach does not work in the general case, maybe it could be made to work for a common special case. However I am not a compiler developer so I have no idea if the following idea could work or not. But maybe someone skilled in the art might want to try this or maybe some university professor could make their students test the approach for course credit.

The basic idea is quite simple. Rather than trying to cache compiler internal state to disk somehow, persist it in a process without even attempting to be general.

The steps to take

Create a C++ project with a dozen source files or so. Each of those sources include some random set of std headers and have a single method that does something simple like returns the sum of its arguments. What they do is irrelevant, they just have to be slow to compile.

Create a PCH file that has all the std headers used in the source files. Compile that to a file.

Start compiling the actual sources one by one. Do not use parallelism to emphasize the time difference.

When the first compilation starts, read the PCH file contents into memory in the usual way. Then fork the process. One of the processes carries on compiling as usual. The second process opens a port and waits for connections, this process is the zygote server process.

When subsequent compilations are done, they connect to the port opened by the zygote process, send the compilation flags over the socket and wait for the server process to finish.

The zygote process reads the command line arguments over the socket and then forks itself. One process starts waiting on the socket again whereas the other compiles code according to the command line arguments it was given.

The performance boost comes from the fact that the zygote process already has stdlib headers in memory in compiler native data structures. In the optimal case loading the PCH file takes effectively zero time. What makes this work (in this test at least) is that the PCH file is the same for all compilations and it is the first thing the compiler starts processing. Thus it is always the same for all compilations. Conceptually at least, the actual compiler might do something else. There may be a dozen other reasons it might not work.

If someone tries this out, do let us know whether it actually worked.