Tuesday, February 4, 2025

The trials and tribulations of supporting CJK text in PDF

In the past I may have spoken critically of TrueType fonts and their usage in PDF files. Recently I have come to the conclusion that I may have been too harsh and that TrueType fonts are actually somewhat nice. Why? Because I have had to add support for CFF fonts to CapyPDF. This is a font format that comes from Adobe. It encodes textual PostScript drawing operations into binary bytecode. Wikipedia does not give dates, but it seems to have been developed in the late 80s to early 90s. The CFF part of the name is an abbreviation for "complicated font format".

Double-checks notes.

Compact font format. Yes, that is what I meant to write. Most people reading this have probably never even seen a CFF file, so you might be asking why supporting CFF fonts is even a thing nowadays. It's all quite simple. Many of the TrueType (and especially OpenType) fonts you see are not actually TrueType fonts. Instead they are Transfontners, glyphs in disguise. It is entirely valid to have a TrueType font that is merely an envelope holding a CFF font. As an example, the Noto CJK fonts are like this. Aggregation of different formats is common in font files, and it is the main reason OpenType fonts have something like four different and mutually incompatible ways of specifying color emoji. None of the participating entities were willing to accept anyone else's format, so the end result was to add all of them. If you want Asian language support, you have to dive into the bowels of the CFF rabid hole.

As most people probably do not have sufficient historical perspective, let's start by listing out some major computer science achievements that definitely existed when CFF was being designed.

  • File format magic numbers
  • Archive formats that specify both the offset and size of the elements within
  • Archive formats that afford access to their data in O(number of items in the archive) rather than O(number of bytes in the file)
  • Data compression
CFF chooses not to do any of this nonsense. It also does not believe in consistent offset types. Sometimes the offsets within data objects refer to other objects by their order in the index they are in. Sometimes they refer to the number of bytes from the beginning of the file. Sometimes they refer to the number of bytes from the beginning of the object the offset data is written in. Sometimes they refer to something else. One of the downsides of this is that while some of the data is neatly organized into index structures with specified offsets, a lot of it is just free floating in the file and needs the equivalent of three pointer dereferences to access.

Said offsets are stored with a variable-width encoding.
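To make that concrete, here is a rough sketch (not actual CapyPDF code), assuming the common convention of an offSize field that declares how many bytes, from one to four, each offset occupies, with the offsets themselves stored big-endian:

    #include <cstdint>
    #include <stdexcept>
    #include <vector>

    // Sketch only: append one offset using a variable-width scheme where
    // off_size (1-4) says how many big-endian bytes the offset occupies.
    void append_offset(std::vector<uint8_t> &out, uint32_t offset, int off_size) {
        if (off_size < 1 || off_size > 4) {
            throw std::runtime_error("offSize must be between 1 and 4");
        }
        for (int i = off_size - 1; i >= 0; --i) {
            out.push_back(uint8_t((offset >> (8 * i)) & 0xffu));
        }
    }

The same value can thus occupy anywhere from one to four bytes, so nothing that comes after it knows its final position until everything before it has been serialized.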

This makes writing subset CFF font files a pain. In order to write an offset value at some location X, you first must serialize everything up to that point to know where the value would be written. To know the value to write, you have to serialize the entire font up to the point where that data is stored. Typically the data comes later in the file than its offset location. You know what that means? Yes, storing all these index locations and hotpatching them afterwards once you find out where the actual data pointed to ended up. Be sure to compute your patching locations correctly lest you end up in lengthy debugging sessions where your subset font files do not render correctly. In fairness, all of the incorrect writes were within the data array and thus 100% memory safe, and, really, isn't that the only thing that actually matters?
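As a sketch of the hotpatching approach (hypothetical helper names, not CapyPDF's actual implementation): write placeholder bytes when the offset location is reached, remember where they went, and overwrite them once the real target position is known.

    #include <cstdint>
    #include <vector>

    struct OffsetPatch {
        size_t where;  // position of the placeholder bytes in the output
        int off_size;  // how many bytes the offset occupies
    };

    struct Serializer {
        std::vector<uint8_t> out;

        // Write zero bytes for now and remember where they went.
        OffsetPatch reserve_offset(int off_size) {
            OffsetPatch p{out.size(), off_size};
            out.insert(out.end(), size_t(off_size), uint8_t(0));
            return p;
        }

        // Later, once the pointed-to data has been written, fill in the value.
        void patch_offset(const OffsetPatch &p, uint32_t value) {
            for (int i = 0; i < p.off_size; ++i) {
                out[p.where + i] =
                    uint8_t((value >> (8 * (p.off_size - 1 - i))) & 0xffu);
            }
        }
    };

Get the where field wrong and you silently clobber some other part of the buffer, which is exactly the kind of lengthy debugging session mentioned above.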

One of the main data structures in a CFF file is a font dictionary stored in, as the docs say, "key-value pairs". This is not true. The "key-value dictionary" is neither key-value nor is it a dictionary. The entries must come in a specific order (sometimes) so it is not a dictionary. The entries are not stored as key-value pairs but as value-key pairs. The more accurate description of "value-key somewhat ordered array" does lack some punch so it is understandable that they went with common terminology. The backwards ordering of elements to some people confusion bring might, but it perfect sense makes, as the designers of the format a long history with PostScript had. Unknown is whether some of them German were.
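For illustration, a toy encoder for one such entry might look like the following (a sketch assuming operands are written first and the operator byte last, and using only the five-byte integer operand form; not real CapyPDF code):

    #include <cstdint>
    #include <vector>

    // Sketch only: a dictionary entry is serialized as value(s) first,
    // followed by the operator ("key") byte at the very end.
    void append_int_operand(std::vector<uint8_t> &out, int32_t v) {
        out.push_back(29); // prefix marking a full 32-bit big-endian integer
        for (int i = 3; i >= 0; --i) {
            out.push_back(uint8_t((uint32_t(v) >> (8 * i)) & 0xffu));
        }
    }

    void append_entry(std::vector<uint8_t> &out, int32_t value, uint8_t op) {
        append_int_operand(out, value); // the value goes first...
        out.push_back(op);              // ...and the "key" comes last.
    }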

Anyhow, after staring directly into the morass of madness for a sufficient amount of time, the following picture emerges.

Final words

The CFF specification document contains the data needed to decipher CFF data streams in a nice tabular format, which would be easy to convert to an enum. Trying to copy it fails with an error message saying that the file has prohibited copypasting. This is a bit rich coming from Adobe, whose current stance seems to be that they can take any document opened with their apps and use it for AI training. I'd like to conclude this blog post by sending the following message to the (assumed) middle manager who made the decision that publicly available specification documents should prohibit copypasting:

YOU GO IN THE CORNER AND THINK ABOUT WHAT YOU HAVE DONE! AND DON'T EVEN THINK ABOUT COMING BACK UNTIL YOU ARE READY TO APOLOGIZE TO EVERYONE FOR YOUR ACTIONS!

Tuesday, January 14, 2025

Measuring code size and performance

Are exceptions faster and/or bloatier than using error codes? Well...

The traditional wisdom is that exceptions are faster when not taken, slower when taken, and lead to more bloated code. On the other hand, there are cases where using exceptions makes code a lot smaller, even in embedded development, where code size is often the limiting factor.

Artificial benchmarks aside, measuring the effect on real world code is fairly difficult. Basically you'd need to implement the exact same, nontrivial piece of code twice. One implementation would use exceptions, the other would use error codes, but they should be otherwise identical. No one is going to do that for fun or even out of idle curiosity.

CapyPDF has been written exclusively using C++23's new std::expected for error handling. As every Go programmer knows, typing error checks over and over again is super annoying. Very early on I wrote macros to autopropagate errors. That brings up an interesting question: could you commit horrible macro crimes to make the error handling use either error objects or exceptions?
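As a sketch of the idea (hypothetical macro and type names, not CapyPDF's actual ones), the propagation macro can expand either to a std::expected check or to essentially nothing, with failures reported by throwing instead:

    #include <expected>
    #include <stdexcept>
    #include <string>

    // Sketch only: with CAPY_USE_EXCEPTIONS defined, functions return plain
    // values and throw on failure; without it they return std::expected and
    // propagate the error object manually.
    #ifdef CAPY_USE_EXCEPTIONS
      template <typename T> using Result = T;
      #define PROPAGATE(var, expr) auto var = (expr); // a failure already threw
      #define FAIL(msg) throw std::runtime_error(msg)
    #else
      template <typename T> using Result = std::expected<T, std::string>;
      #define PROPAGATE(var, expr)                                   \
          auto var##_r = (expr);                                     \
          if (!var##_r) return std::unexpected(var##_r.error());     \
          auto &var = *var##_r;
      #define FAIL(msg) return std::unexpected(std::string(msg))
    #endif

    Result<int> parse_glyph_count(bool data_is_valid) {
        if (!data_is_valid) FAIL("bad font data");
        return 42;
    }

    Result<int> load_font(bool data_is_valid) {
        PROPAGATE(count, parse_glyph_count(data_is_valid)); // the crime happens here
        return count + 1;
    }

Switching between the two styles then requires nothing more than toggling a single compile definition.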

It turns out that yes, you can. After a thorough scrubbing of the ensuing shame from your body and soul, you can start doing measurements. To get started I built and ran CapyPDF's benchmark application with the following option combinations:

  • Optimization level: -O1, -O2, -O3, -Os
  • LTO enabled and disabled
  • Exceptions enabled and disabled
  • RTTI enabled and disabled
  • NDEBUG enabled and disabled
The measurements are the stripped size of the resulting shared library and the runtime of the test executable. The code and full measurement data can be found in this repo. The code size breakdown looks like this:

Performance goes like this:

Some interesting things to note:

  • The fastest runtime is 0.92 seconds, with O3-lto-rtti-noexc-ndbg
  • The slowest is 1.2 seconds, with Os-nolto-rtti-noexc-ndbg
  • If we ignore Os, the slowest is 1.07 seconds, with O1-nolto-rtti-noexc-ndbg
  • The largest code size is 724 kB, with O3-nolto-nortti-exc-nondbg
  • The smallest is 335 kB, with Os-lto-nortti-noexc-ndbg
  • Ignoring Os, the smallest is 470 kB, with O1-lto-nortti-noexc-ndbg
Things noticed via eyeballing

  • Os leads to noticeably smaller binaries at the cost of performance
  • O3 makes binaries a lot bigger in exchange for a fairly modest performance gain
  • NDEBUG makes programs both smaller and faster, as one would expect
  • LTO typically improves both speed and code size
  • The fastest times for O1, O2 and O3 are within a few percentage points of each other, at 0.95, 0.94 and 0.92 seconds, respectively

Caveats

At the time of writing the upstream code uses error objects even when exceptions are enabled. To replicate these results you need to edit the source code.

The benchmark does not actually raise any errors. This test only measures the golden path.

The tests were run with GCC 14.2 on x86_64 Ubuntu 24.10.