Tuesday, January 17, 2023

PDF, text and fonts, a design by The Devil & Associates

PDF has a fairly straightforward model for rendering text. Basically you specify a font to use, define a transformation matrix if needed and specify the text string to render. The PDF renderer will then load the embedded font file, extract the curves needed for your letters and render them on canvas. And it works quite nicely.
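To make the model concrete, here is a minimal sketch of what such a text command looks like in a page's content stream. The helper names `pdf_literal` and `show_text` are made up for illustration, the font resource name `F1` is an assumption, and the sketch only handles ASCII input.

```python
def pdf_literal(text: str) -> str:
    """Wrap ASCII text as a PDF literal string, escaping backslashes and parentheses."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace("(", "\\(")
        .replace(")", "\\)")
    )
    return f"({escaped})"


def show_text(font_name: str, size: float, x: float, y: float, text: str) -> str:
    """Build the content-stream operators that draw `text` starting at (x, y)."""
    return "\n".join(
        [
            "BT",                           # begin text object
            f"/{font_name} {size:g} Tf",    # select font resource and size
            f"1 0 0 1 {x:g} {y:g} Tm",      # text matrix: plain translation
            f"{pdf_literal(text)} Tj",      # show the string
            "ET",                           # end text object
        ]
    )


# "F1" is a hypothetical resource name that the page's font dictionary must define.
print(show_text("F1", 12, 72, 720, "Hello (PDF)"))
```

The output is five short lines of operators, which is about as simple as the model gets.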

Assuming you are using plain ASCII. Because 127 glyphs should be enough for everybody.

If, for some weirdo reason, you belong to that ~80% minority population of planet Earth whose native tongue uses characters beyond ASCII, you are not going to be in a happy place. The whole situation is nicely summarized in this Computerphile video.




It may be comforting to know that nothing has changed since 1993. Dealing with fonts and text in PDF is still very much a case of voodoo incantations and guesswork. It does not really help that phrases containing words like "PDF", "text", and "fonts" are completely ungoogleable: every search returns a million hits about creating PDF files with existing tools, as opposed to how font generation works under the hood.

How does one create text with PDF?

What most people generating PDF would want is to be able to specify a font file, starting location and the text to render in UTF-8. Some people might want finer control over text rendering and instead provide an array of Unicode glyphs and coordinates. Cairo provides both of these. PDF provides neither. Even the ASCII subset PDF does provide is mostly useless by itself for the following reasons.

Let's start with font metadata. When you embed a font inside a PDF file you also need to provide the width of each character as a PDF array in the font specification dictionary. This data is exactly the same as is in the font file itself. The PDF renderer must know how to parse said files to get the glyph curves out but the specification still mandates that every PDF creator program must replicate this data. Don't believe me? This is what the PDF specification document says in subsection 9.3.4:

The width information for each glyph shall be stored both in the font dictionary and in the font program itself.

The two sets of widths shall be identical.

Why is this? A reasonable assumption is that the person who was tasked with adding TrueType font support at Adobe back in the day got lazy and just insisted that font embedders must provide this information in PDF metadata format because the code to process those was already implemented. But it gets worse. The glyph width data is specified with three items, FirstChar, LastChar and Widths. The last of these contains the widths of all glyphs between the two endpoints. This means that if you wanted to create a PDF document containing the single word παν語 (the stylised form of "Pango") then you'd need an array with some 35 thousand entries, all but four of which would be unused. You still need to define them, or at least, you would need to if PDF text worked this way. It does not. We'll come back to this later.
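The arithmetic behind that number can be checked with a few lines of Python. This is only a sketch of the hypothetical scheme described above, assuming character codes were plain Unicode code points; the function name `widths_span` is made up for illustration.

```python
def widths_span(text: str) -> tuple[int, int, int]:
    """Return (FirstChar, LastChar, array length) if codes were Unicode code points."""
    codes = sorted({ord(ch) for ch in text})
    first_char, last_char = codes[0], codes[-1]
    # Widths must cover every code from FirstChar to LastChar inclusive.
    return first_char, last_char, last_char - first_char + 1


word = "παν語"
first, last, length = widths_span(word)
unused = length - len(set(word))
print(first, last, length, unused)  # 945 35486 34542 34538
```

So the array would run from α (945) to 語 (35486), roughly 35 thousand entries for four glyphs.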

As a palate cleanser let's note that even though PDF documentation always speaks of glyph widths, this array does not contain glyph widths. What it actually contains is glyph advances, but given that the PDF text model predates TrueType font technology we can let that one pass.
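As a rough illustration of what goes into that array, the sketch below scales advances from font units to the thousandths of a text unit that font dictionaries use. The advance values and the 2048 units-per-em figure are invented for the example, and `pdf_widths` is a hypothetical helper, not part of any library.

```python
def pdf_widths(
    advances: dict[int, int], units_per_em: int, first_char: int, last_char: int
) -> list[int]:
    """Scale font-unit advances to 1/1000 text-space units for a Widths array."""
    scale = 1000 / units_per_em
    # Codes with no glyph still need an entry; 0 is used as a filler here.
    return [
        round(advances.get(code, 0) * scale)
        for code in range(first_char, last_char + 1)
    ]


# Hypothetical advances for "A" and "C" in a font with 2048 units per em.
advances = {65: 1366, 67: 1479}
widths = pdf_widths(advances, 2048, 65, 67)
print(f"/FirstChar 65 /LastChar 67 /Widths [{' '.join(map(str, widths))}]")
# /FirstChar 65 /LastChar 67 /Widths [667 0 722]
```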

What we can't ignore, though, is kerning. If you want your text to look good, you have to do proper kerning and the font file has all the information necessary to do that. PDF does not read that (and, as far as I can tell, you can't make it do so) but instead it requires that you specify kerning inline in your text rendering command. This means that in practice you can't tell PDF to render more than one glyph of text and expect it to do the right thing. Instead you have to manually process each glyph and add kerning directives between each pair of characters as necessary.
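In practice those inline directives are the numbers inside a `TJ` array. The sketch below is one way to build such an array, assuming you have already extracted a kerning table expressed in thousandths of an em; the table contents and the name `kerned_tj` are invented, and the text is assumed to contain no parentheses or backslashes.

```python
def kerned_tj(text: str, kerning: dict[tuple[str, str], int]) -> str:
    """Build a TJ operator, splitting the string wherever a kerning pair applies."""
    parts = []
    run = ""
    for i, ch in enumerate(text):
        run += ch
        nxt = text[i + 1] if i + 1 < len(text) else None
        adjust = kerning.get((ch, nxt), 0) if nxt is not None else 0
        if adjust:
            parts.append(f"({run})")
            # TJ numbers are subtracted from the position, so a negative
            # (tightening) kern becomes a positive number in the array.
            parts.append(str(-adjust))
            run = ""
    if run:
        parts.append(f"({run})")
    return "[" + " ".join(parts) + "] TJ"


# Hypothetical kerning pairs; real values would come from the font file.
pairs = {("A", "V"): -80, ("V", "A"): -80}
print(kerned_tj("AVA!", pairs))  # [(A) 80 (V) 80 (A!)] TJ
```

Runs without kerning stay together in one string, so only the pairs that need adjusting get split apart.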

The magic 255

Creating PDF text with glyph IDs and locations seems like a reasonable approach then. The PDF documentation even says that you can specify glyph IDs directly with one of two different quoting methods: octal numbering or UTF-16BE values. This works fine until you try to specify a glyph number bigger than 255, at which point it fails incomprehensibly. I spent days trying to debug this. There are several different indices that could be used, mapped to each other and so on. I could not for the life of me figure out a combination that worked. Every attempt to use glyph indices failed. There is no useful documentation for this; you basically have to read code for existing PDF creators and examine their output with a hex editor or bespoke tools. The various PDF validators I tried were not very helpful either because their error messages were of the type "this PDF file is not valid in some way lol ;-)".
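For reference, this is roughly what the two quoting styles look like as string syntax. The sketch only shows how the bytes are written; how a viewer interprets them depends on the font dictionary, which is exactly the part that is hard to get right. Both function names are made up for illustration.

```python
def octal_literal(codes: list[int]) -> str:
    """Write codes as a literal string of three-digit octal escapes."""
    for code in codes:
        if not 0 <= code <= 255:
            # An octal escape encodes a single byte, so it tops out at \377.
            raise ValueError(f"code {code} does not fit in one byte")
    return "(" + "".join(f"\\{code:03o}" for code in codes) + ")"


def hex_string_16be(codes: list[int]) -> str:
    """Write codes as a hex string with two big-endian bytes per code."""
    for code in codes:
        if not 0 <= code <= 0xFFFF:
            raise ValueError(f"code {code} does not fit in two bytes")
    return "<" + "".join(f"{code:04X}" for code in codes) + ">"


print(octal_literal([65, 200]))       # (\101\310)
print(hex_string_16be([65, 0x1234]))  # <00411234>
```

Note that an octal escape can never express a value above 255, which at least fits the wall described above.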

Eventually I tried creating a LibreOffice document that had more than 255 unique glyphs and exported that. The generated PDF file had two different subsets of the used font. The first had 255 characters and the second one had the rest. Cairo's PDF output is roughly similar. This leads me to believe that, contrary to what the documentation implies, it is not possible to have more than 255 glyphs in a single font in PDF. Or, if it is possible, then the way you go about getting that done is so bizarrely complicated that nobody has managed to get it working. Instead what you have to do is to create font subsetting code from scratch that packs all used glyphs into blocks of 255, map each input glyph to these packed IDs, and write a cmap file so that the PDF renderer can convert the packed glyph IDs back to Unicode values. You can't even use FreeType to do the subsetting, since it does not provide font creation functionality, only reading and rendering. Instead you get to do binary data mangling by hand. In big endian. Obviously.
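The bookkeeping half of that job (not the binary font rewriting) can be sketched in a few lines. This is an assumption-laden illustration rather than how LibreOffice or Cairo actually do it: codes run from 1 to 255 with 0 left unused, characters stand in for glyphs, and only the `bfchar` section of the cmap is produced. The names `pack_subsets` and `bfchar_blocks` are hypothetical.

```python
def pack_subsets(text: str, block_size: int = 255) -> dict[str, tuple[int, int]]:
    """Map each distinct character to (subset index, one-byte code in 1..block_size)."""
    mapping: dict[str, tuple[int, int]] = {}
    for ch in text:
        if ch not in mapping:
            index = len(mapping)
            mapping[ch] = (index // block_size, index % block_size + 1)
    return mapping


def bfchar_blocks(mapping: dict[str, tuple[int, int]], subset: int) -> str:
    """Emit the code-to-Unicode entries for one subset in bfchar syntax."""
    entries = sorted(
        (code, ch) for ch, (sub, code) in mapping.items() if sub == subset
    )
    lines = []
    # Kept to 100 entries per block, the customary limit for CMap files.
    for start in range(0, len(entries), 100):
        chunk = entries[start : start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        for code, ch in chunk:
            target = ch.encode("utf-16-be").hex().upper()  # UTF-16BE, incl. surrogates
            lines.append(f"<{code:02X}> <{target}>")
        lines.append("endbfchar")
    return "\n".join(lines)


mapping = pack_subsets("παν語")
print(mapping["語"])  # (0, 4)
print(bfchar_blocks(mapping, 0))
```

Each text run then has to be split by subset index, switching fonts whenever the index changes, before the one-byte codes are written out.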

If someone reading this knows for sure whether all of the above is true or if there is a simpler way (converting text to curves by hand does not count), do let me know in the comments.

5 comments:

  1. I worked on adding Unicode and TrueType font support to ReportLab, an open-source Python PDF generation library, back in 2003 or so. Font subsetting is the only way I was able to make it work. Incidentally, when you pick arbitrary Unicode characters to map to your glyph indexes, be very very sure to always assign space to glyph 32 in every subset, otherwise justified text will not work correctly!

    (BTW I was unable to post a comment here using Firefox -- the SIGN IN WITH GOOGLE button didn't do anything at all. I had to switch to Chromium. I hate blogspot.)

  2. Yes, fonts are hard. This is true when PDF is not in the mix and it gets a bit harder if PDF is in the mix.

    PDF text rendering is different from what you do with Word in that you need to specify the exact positions of *each* character. So you, or more correctly, the library you are using, places each character exactly at the location you specified. This means kerning information and other positional data available e.g. in GPOS, GSUB tables needs to be processed by the PDF creator. So this is all the job of the PDF creator, not of the PDF viewer which just takes this information and renders the glyphs. This way there is (or should be) no ambiguity on how text gets rendered. The text is static in this way.

    As for the widths also being specified in the metadata: one reason for this is that an application which does not do visual text rendering can find out the exact position of each glyph without needing all the code to properly decode the fonts. See the following example https://hexapdf.gettalong.org/examples/show_char_bboxes.html which just relies on the meta information in the PDF object to render boxes around each character.

    As for the widths array containing 35000 entries: One wouldn't do it this way. Instead you would subset the font and just include the glyphs you are actually using, making the widths array much smaller.

    So, LibreOffice seems to use what PDF calls "simple fonts". Those fonts can only have 255 characters per font object because they are single-byte fonts. Even if you are using more than 255 characters each font object will only allow using 255 characters. There is, however, another kind of font object in PDF called CID fonts which work differently. These are multi-byte fonts where each glyph is specified with one, two, three, ... bytes. And these fonts can actually have subsets with more than 255 characters. I'm actually using this approach in HexaPDF to avoid coordinating multiple subsets.

  3. So, I'm not sure about PDF peculiarities, but I suggest you look at HarfBuzz for your font subsetting needs. Google is using it (or will use it) to serve subsets for Google Fonts. The previously used subsetter is the one in the Python library fontTools. If your goal is to implement a text field thing for PDF rendering with correct handling of everything text, you may want to get in contact with the author of https://pypi.org/project/drawbot-skia/, because it's a hard problem (HarfBuzz will tell you the advance width of all glyphs, taking kerning and everything else into consideration, but it doesn't do line breaks, BiDi handling, and various other required things).

    Replies
    1. The code does not aim to provide any high level document format, it merely exposes PDF functionality as is. The API is probably going to be just like Cairo's: there is one function that does "best effort UTF-8 text" and another that only takes a list of glyphs and positions. It is up to the caller to do higher level text layout (probably using HarfBuzz themselves).

  4. This URL https://unicode-table.com/ has changed to https://symbl.cc/
