Friday, June 21, 2024

Advanced text features and PDF

The basic text model of PDF is quite nice. On the other hand its basic design was a very late 80s "ASCII is everything everyone really needs, but we'll be super generous and provide up to 255 glyphs using a custom encoding that is not in use everywhere else". As you can probably guess, this causes a fair bit of problems in the modern world.

To properly understand the text that follows you should know that there are four different ways in which text and letters need to be represented to get things working:

  • Source text is the "original" written text in UTF-8 (typically)
  • Unicode codepoints represent unique Unicode IDs as specified by the Unicode standard
  • A glyph id uniquely specifies a glyph (basically a series of drawing operations), these are arbitrary and typically unique for each font
  • ActualText is sort of like an AltText for PDF but uses UTF-16BE as was the way of the future in the early 90s

Kerning

The most common advanced typography feature in use is probably kerning, that is, custom spacing between certain letter pairs like "AV" and "To". The PDF text model has native support for kerning and it even supports vertical and horizontal kerning. Unfortunately the way things are set up means that you can only specify horizontal kerning when laying out horizontal text and vertical kerning for vertical text. If your script requires both, you are not going to have a good time.

There are several approaches one can take. The simplest is to convert all text to path drawing operations, which can be placed anywhere with arbitrary precision. This works great for printed documents but also means that document sizes balloon and you can't copypaste text from the document, use screen readers or do any other operation that needs the actual text those shapes represent.

An alternative is to render each glyph as its own text object with exact coordinates. While verbose this works, but since every letter is separate, text selection becomes wonky again. PDF readers seem to have custom heuristics to try to detect these issues and fix text selection in post-processing. Sometimes it works better than at other times.

Everything in PDF drawing operations is based on matrices. Text has its own transform matrix that defines where the next glyph will go. We could specify kerning manually with a custom translation matrix that translates the rendering location by the amount needed. There are two main downsides to this. First of all it would mean that instead of having a stream of glyphs to render, you'd need to define 9 floating point numbers (actually 6 due to reasons) between every pair of glyphs. This would increase the size of you output by a factor of roughly ten. The other downside is that unlike for all other matrices, PDF does not permit you to multiply an existing text state matrix with a new one. You can only replace it completely. So the actual code path would become "tell PDF to draw a glyph, work out what changes it would make to the currently active text matrix, undo that, multiply that matrix with one that has the changes that you wanted to happen and proceed to the next glyph".

Glyph substitution

Most of the time (in most scripts anyway) source text's Unicode codepoints get mapped 1:1 to a font glyph in the final output. Perhaps the most common case where this does not happen is ligatures.

The actual rules when and how this happens are script, font and language dependent. This is something you do not want to do yourself, instead use a shaping engine like Harfbuzz. If you give it the source text as UTF-8 and a font that has the ffi ligature, it will return a list of four glyph ids in the font to use, the way they map back to the original text, kerning (if any) and all of that good stuff.

What it won't give you is the information of what ligatures it replaced your source text with. In this example it will tell you the glyph id of the ffi ligature (2132) but not which Unicode codepoint it corresponds to (0xFB03). You need to tell that number in PDF metadata for the text to work properly in copypaste operations. At first this does not seem like such a big problem, because we have access to the original font file and Freetype. You'd think you can just ask Freetype for the Unicode codepoint for a given font glyph, but you can't. There is a function for finding a glyph for a given Unicode codepoint but mot the other way around. The stackoverflow recommended way of doing this is to iterate over all glyphs until you find the one that is mapped to the desired codepoint. For extra challenge you need to write an ActualText tag in the PDF command stream so that when users copypaste that text they get the original form with each individual letter rather than the ffi Unicode glyph.

All of this means that glyph lookup is basically a O(n^2) operation if it was possible to do. Sometimes it isn't, as we shall now find out.

Alternate forms

OpenType fonts can have multiple different glyphs for the same Unicode codepoint, for example the small caps versions of Noto Serif look like this.

These are proper hand-drawn versions of the glyphs, not ones obtained by scaling down upper case letters. Using these is simple, you tell Harfbuzz to use the small caps versions when shaping and then it does everything for you. For this particular font upper case small caps glyphs are the same as regular upper case glyphs. The lower case ones have their own glyphs in the font. However, according to Freetype at least, those glyphs are not mapped to any Unicode codepoint. Conceptually a small caps lower case "m" should be the same as a regular lower case "m". For some reason it is not and, unless I'm missing something, there is no API that can tell you that. The only way to do it "properly" is to track this yourself based on your input text and requirements.

How does CapyPDF handle all this?

In the same way pretty much all PDF generator libraries do: by ignoring all of it. CapyPDF only provides the means to express all underlying functionality in the PDF library. It is the responsibility of the client application to form glyph sequences and related PDF metadata in the way that makes sense for their application and document structure.