Friday, June 21, 2024

Advanced text features and PDF

The basic text model of PDF is quite nice. On the other hand its basic design was a very late 80s "ASCII is everything everyone really needs, but we'll be super generous and provide up to 255 glyphs using a custom encoding that is not in use anywhere else". As you can probably guess, this causes a fair bit of problems in the modern world.

To properly understand the text that follows you should know that there are four different ways in which text and letters need to be represented to get things working:

  • Source text is the "original" written text in UTF-8 (typically)
  • Unicode codepoints represent unique Unicode IDs as specified by the Unicode standard
  • A glyph id uniquely specifies a glyph (basically a series of drawing operations), these are arbitrary and typically unique for each font
  • ActualText is sort of like an AltText for PDF but uses UTF-16BE as was the way of the future in the early 90s
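To make the four representations concrete, here is a small Python sketch that uses only the standard library. The glyph ids are placeholders: 2132 is the ffi ligature id used as an example later in this post, the other values are made up, and real ids depend entirely on the font.

```python
# The same three letters in each of the representations listed above.
source_text = "ffi"

# Source text as UTF-8 bytes.
utf8_bytes = source_text.encode("utf-8")

# Unicode codepoints, one integer per character.
codepoints = [ord(ch) for ch in source_text]

# Glyph ids are font specific; these values are illustrative only.
glyph_ids_without_ligature = [71, 71, 74]
glyph_ids_with_ligature = [2132]

# ActualText as a PDF hex string: UTF-16BE preceded by the FE FF byte order mark.
actual_text = "<" + (b"\xfe\xff" + source_text.encode("utf-16-be")).hex().upper() + ">"

print(utf8_bytes, codepoints, actual_text)
```

Note that only the first, second and fourth forms can be computed from the text alone. The glyph ids have to come from the font and the shaper.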

Kerning

The most common advanced typography feature in use is probably kerning, that is, custom spacing between certain letter pairs like "AV" and "To". The PDF text model has native support for kerning and it even supports vertical and horizontal kerning. Unfortunately the way things are set up means that you can only specify horizontal kerning when laying out horizontal text and vertical kerning for vertical text. If your script requires both, you are not going to have a good time.
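In the content stream, this kind of per-pair adjustment is typically written with the TJ operator, which takes an array of glyph strings interleaved with numbers. The sketch below builds such an array. It assumes two-byte glyph codes written as hex strings and kerning values already scaled to thousandths of a text space unit; the function name and the glyph ids are made up for illustration.

```python
def tj_array(glyph_ids, kerns):
    """Build a TJ operator from glyph ids and per-pair kerning.

    kerns[i] is the adjustment between glyph i and glyph i+1 in
    thousandths of a text space unit; negative values pull the
    glyphs closer together.
    """
    if len(kerns) != len(glyph_ids) - 1:
        raise ValueError("need one kerning value per glyph pair")
    parts = []
    for i, gid in enumerate(glyph_ids):
        parts.append(f"<{gid:04X}>")
        if i < len(kerns) and kerns[i] != 0:
            # TJ subtracts the number from the pen position, so flip the sign.
            parts.append(str(-kerns[i]))
    return "[" + " ".join(parts) + "] TJ"

# Two hypothetical glyphs with a tightening kern of 80 units between them.
print(tj_array([36, 57], [-80]))  # [<0024> 80 <0039>] TJ
```

The numbers in the array move the pen only along the writing direction, which is the limitation described above.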

There are several approaches one can take. The simplest is to convert all text to path drawing operations, which can be placed anywhere with arbitrary precision. This works great for printed documents but also means that document sizes balloon and you can't copypaste text from the document, use screen readers or do any other operation that needs the actual text those shapes represent.

An alternative is to render each glyph as its own text object with exact coordinates. While verbose this works, but since every letter is separate, text selection becomes wonky again. PDF readers seem to have custom heuristics to try to detect these issues and fix text selection in post-processing. Sometimes it works better than at other times.

Everything in PDF drawing operations is based on matrices. Text has its own transform matrix that defines where the next glyph will go. We could specify kerning manually with a custom translation matrix that translates the rendering location by the amount needed. There are two main downsides to this. First of all it would mean that instead of having a stream of glyphs to render, you'd need to define 9 floating point numbers (actually 6 due to reasons) between every pair of glyphs. This would increase the size of your output by a factor of roughly ten. The other downside is that unlike for all other matrices, PDF does not permit you to multiply an existing text state matrix with a new one. You can only replace it completely. So the actual code path would become "tell PDF to draw a glyph, work out what changes it would make to the currently active text matrix, undo that, multiply that matrix with one that has the changes that you wanted to happen and proceed to the next glyph".
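As a rough illustration of that bookkeeping, the sketch below keeps the text matrix on the application side, multiplies it with a translation and writes the result out as a full replacement with the Tm operator. Only six numbers are needed because the third column of the 3x3 matrix is always 0 0 1. The helper names are made up, and the part that tracks how each drawn glyph advances the matrix is left out.

```python
def multiply(m, n):
    """Return m x n for PDF matrices given as [a, b, c, d, e, f]."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return [
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    ]

def translate(tx, ty):
    """Translation matrix in the same six-number form."""
    return [1, 0, 0, 1, tx, ty]

def tm_operator(matrix):
    """Format a matrix as a Tm operator, which replaces the text matrix."""
    return " ".join(f"{v:g}" for v in matrix) + " Tm"

# The matrix the application believes is currently active.
current = [2, 0, 0, 2, 100, 200]
# Shift by 10 units in text space, then apply the existing matrix.
adjusted = multiply(translate(10, 0), current)
print(tm_operator(adjusted))  # 2 0 0 2 120 200 Tm
```

Every such adjustment costs six numbers plus an operator in the output, which is where the size increase comes from.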

Glyph substitution

Most of the time (in most scripts anyway) source text's Unicode codepoints get mapped 1:1 to a font glyph in the final output. Perhaps the most common case where this does not happen is ligatures.

The actual rules when and how this happens are script, font and language dependent. This is something you do not want to do yourself, instead use a shaping engine like Harfbuzz. If you give it the source text as UTF-8 and a font that has the ffi ligature, it will return a list of four glyph ids in the font to use, the way they map back to the original text, kerning (if any) and all of that good stuff.

What it won't give you is the information of what ligatures it replaced your source text with. In this example it will tell you the glyph id of the ffi ligature (2132) but not which Unicode codepoint it corresponds to (0xFB03). You need to provide that number in the PDF metadata for the text to work properly in copypaste operations. At first this does not seem like such a big problem, because we have access to the original font file and Freetype. You'd think you can just ask Freetype for the Unicode codepoint for a given font glyph, but you can't. There is a function for finding a glyph for a given Unicode codepoint but not the other way around. The stackoverflow recommended way of doing this is to iterate over all glyphs until you find the one that is mapped to the desired codepoint. For extra challenge you need to write an ActualText tag in the PDF command stream so that when users copypaste that text they get the original form with each individual letter rather than the ffi Unicode glyph.

All of this means that glyph lookup is basically an O(n^2) operation, if it is possible at all. Sometimes it isn't, as we shall now find out.
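One way to avoid repeating the scan for every glyph is to walk the font's codepoint-to-glyph table once and build the reverse table up front. The sketch below uses a plain dictionary as a stand-in for that table; the contents are hypothetical, and with a real font the pairs would come from iterating over its character map.

```python
from collections import defaultdict

# Hypothetical codepoint -> glyph id table; a real one comes from the font.
charmap = {0x66: 71, 0x69: 74, 0xFB03: 2132, 0x41: 36, 0x391: 36}

def find_codepoint_by_scanning(charmap, glyph_id):
    """Linear scan, repeated for every glyph that needs a lookup."""
    for codepoint, gid in charmap.items():
        if gid == glyph_id:
            return codepoint
    return None

def build_reverse_map(charmap):
    """One pass over the table; later lookups are dictionary accesses."""
    reverse = defaultdict(list)
    for codepoint, gid in charmap.items():
        reverse[gid].append(codepoint)
    return dict(reverse)

reverse = build_reverse_map(charmap)
print(reverse.get(2132))  # [64259], i.e. 0xFB03
print(reverse.get(36))    # [65, 913], one glyph shared by two codepoints
print(reverse.get(9999))  # None, the glyph has no codepoint at all
```

The values are lists because a glyph can be reachable from several codepoints, and a missing key is the case where the font simply has no mapping for the glyph.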

Alternate forms

OpenType fonts can have multiple different glyphs for the same Unicode codepoint, for example Noto Serif has separate small caps versions of its letters.

These are proper hand-drawn versions of the glyphs, not ones obtained by scaling down upper case letters. Using these is simple, you tell Harfbuzz to use the small caps versions when shaping and then it does everything for you. For this particular font upper case small caps glyphs are the same as regular upper case glyphs. The lower case ones have their own glyphs in the font. However, according to Freetype at least, those glyphs are not mapped to any Unicode codepoint. Conceptually a small caps lower case "m" should be the same as a regular lower case "m". For some reason it is not and, unless I'm missing something, there is no API that can tell you that. The only way to do it "properly" is to track this yourself based on your input text and requirements.

How does CapyPDF handle all this?

In the same way pretty much all PDF generator libraries do: by ignoring all of it. CapyPDF only provides the means to express all underlying functionality in the PDF library. It is the responsibility of the client application to form glyph sequences and related PDF metadata in the way that makes sense for their application and document structure.

13 comments:

  1. Thanks for chronicling these things on the Internet for future generations Jussi. You are doing a real service. In a past life I worked on Scribus and KWord (later Calligra Words) and while I did not work on the nitty gritty details of exporting text to PDF I find it all interesting, and I'm sure others will find these posts golden down the line. Kudos.

  2. > You'd think you can just ask Freetype for the Unicode codepoint for a given font glyph, but you can't. There is a function for finding a glyph for a given Unicode codepoint but not the other way around.

    That’s because there is no such thing as a reverse lookup in modern smart fonts. An OpenType font is a glyph soup with rules to transform codepoints into soup references (and some of the transformations depend on the codepoint context, i.e. the codepoints before and after; the same codepoint can render into different glyphs depending on the context and the same glyph can be used to render several codepoints). And the soup ids are not stable from one version of a font to the next one, let alone from one font to another.

    The correct way to do cut and paste is to give back the original codepoint list and have the shaper re-render it depending on the context and the available fonts. Never pass glyph ids around they are internal ids not intended for reuse.

    > Unicode codepoints represent unique glyph IDs as specified by the Unicode standard

    Nope. Not remotely true except in the simplest cases.

    Replies
    1. > Never pass glyph ids around they are internal ids not intended for reuse.

      This is absolutely true in the general case. However when creating PDFs things are more complicated. Not only does the PDF generator library have to deal with glyph ids, it needs to create new subset fonts from the main font _and_ it needs to map those new subset fonts and their glyph ids to Unicode codepoints in order for text selection to work.

      All of this has to be done _after_ the shaper has done its thing. That's just how PDF likes to roll.

      > Nope. Not remotely true except in the simplest cases.

      Sorry, that is a typo. It should say "Unicode characters".

    2. Since this is all a bit abstract, let me give you a concrete example.

      You can design a font (and it will work in pdf files) that contains a CapyPDF symbol glyph and renders the CapyPDF codepoint list with this glyph. And you can decide the CapyPDF symbol is nice and the font should also render the JussiPakkanen codepoint list with the same glyph. That is essentially what so-called programming fonts do.

      If you ask Freetype “what is the codepoint of the CapyPDF glyph” it can not answer because first the Unicode Consortium did not standardise a codepoint for CapyPDF and second the CapyPDF glyph reverse maps both to CapyPDF and JussiPakkanen. Reverse mapping is essentially pointless that is why Freetype does not bother with it.

      And real-world fonts do much stranger things even without going into non-standard things like a CapyPDF glyph. A single codepoint can map to an assemblage of different glyphs for example (very common for non-latin asian scripts).

    3. Yes. I do know all of that already. The (simplified) point was more along the lines that sometimes you can't get the reverse mapping even though it seems like you should. So the application needs to track all of this, the PDF generator library can't do it on their behalf.

  3. > What it won't give you is the information of what ligatures it replaced your source text with. In this example it will tell you the glyph id of the ffi ligature (2132) but not which Unicode codepoint it corresponds to (0xFB03).

    When using HarfBuzz, the output glyph info has both the glyph id and a “cluster” field. The cluster is basically an index to the input string so that you can tell what part of the input string was mapped to this glyph. It looks simple, but it is rather powerful and can represent one-to-one, one-to-many, many-to-one and many-to-many relationships between input text and output glyphs.

    https://harfbuzz.github.io/clusters.html
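The cluster idea can be sketched without a shaper by working on a list of (glyph id, cluster) pairs shaped like the ones HarfBuzz reports. The pairs below are hypothetical output for the word "office" with an ffi ligature. The sketch assumes left-to-right text and cluster values that are character indices; with real HarfBuzz output they can be byte offsets, depending on how the text was added to the buffer.

```python
def text_per_glyph(source, shaped):
    """Map each shaped glyph to the slice of source text it came from.

    shaped is a list of (glyph_id, cluster) pairs in left-to-right order,
    with cluster values that index into source and never decrease.
    """
    # Distinct cluster starts, plus the end of the text as a sentinel.
    starts = sorted({cluster for _, cluster in shaped}) + [len(source)]
    end_of = {s: e for s, e in zip(starts, starts[1:])}
    return [(gid, source[cluster:end_of[cluster]]) for gid, cluster in shaped]

# Hypothetical shaper output for "office" with an ffi ligature.
shaped = [(82, 0), (2132, 1), (70, 4), (72, 5)]
for gid, text in text_per_glyph("office", shaped):
    print(gid, repr(text))
```

When several glyphs share one cluster value, each of them gets the same slice of text, so a caller still has to decide which glyph carries the text for that cluster.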

    Replies
    1. I know all of that, I even discuss that in the actual blog post. The problem is that Harfbuzz does (conceptually at least) two separate mappings. First it maps the three Unicode characters "ffi" into one Unicode character that represents the ligature (0xFB03). It then maps that to the glyph id for the current font and gives you the end result. There is no way to get information about the first substitution _only_. It is up to every downstream user (who care about this) to detect after the fact that "HB replaced 'ffi' in my source text with one glyph so this probably means that it represents the ffi ligature Unicode code point, so I'll write that info out to the PDF". PDF requires that you provide a glyph-to-unicode mapping for each of the glyphs you use in the document, so the generator _must_ know that glyph such-and-such is the ffi ligature. As discussed in the post you are not guaranteed to get this information from Freetype after the fact either. Granted, this is somewhat of a niche use case. If you just render text on screen then you don't really care about that. If you want to do PDF exactly right, then you do have to care.

      This is a known issue in Harfbuzz, Google finds several discussions (some on the HB bug tracker even) on this and all of them seem to end with "this is not supported ATM".

    2. > First it maps the three Unicode characters "ffi" into one Unicode character that represents the ligature (0xFB03). It then maps that to the glyph id for the current font and gives you the end result. There is no way to get information about the first substitution _only_.

      That is because no such substitution happens. HarfBuzz will first use the “cmap” table to map input characters to glyph indices, after that all operations (including substitutions) happen exclusively on glyph indices. There is no such thing as an intermediate Unicode substitution. The fact that in some fonts U+FB03 has a “cmap” entry pointing to the ffi glyph is irrelevant to HarfBuzz, and not all fonts do that. U+FB03 is a legacy code point that exists in Unicode for backward compatibility, and some fonts will map it to the ffi glyph because they already have the glyph or because of some other reason, but HarfBuzz does not consult this mapping in any way. Fonts can have any arbitrary ligature and very few of them have such legacy code points.

      > It is up to every downstream user (who care about this) to detect after the fact that "HB replaced 'ffi' in my source text with one glyph so this probably means that it represents the ffi ligature Unicode code point, so I'll write that info out to the PDF".

      They don’t. They need to know that this glyph came from the f f i characters, and HarfBuzz provides that information.

      > PDF requires that you provide a glyph-to-unicode mapping for each of the glyphs you use in the document, so the generator _must_ know that glyph such-and-such is the ffi ligature.

      Which HarfBuzz can tell you already. It tells you that this glyph represents those three input characters. PDF ToUnicode mapping is limited, but it supports one to many glyph to code point mapping i.e. the mapping can map one glyph to 1+ code points.

      > This is a known issue in Harfbuzz, Google finds several discussions (some on the HB bug tracker even) on this and all of them seem to end with "this is not supported ATM".

      I’m not aware of such a limitation. I worked on two PDF generators that use HarfBuzz and handle Unicode mapping properly, and I know of a third implementation as well. Namely, LibreOffice, LuaTeX (with luaotfload and HarfBuzz), and PangoCairo.

    3. Instead of talking about things in the abstract, let's look up a concrete example. Suppose we want to create a PDF document that consists only of a single "ffi" ligature glyph.

      The PDF would then contain a single subset font with just that one glyph.

      The PDF graphics stream would consist of one text object that sets up the font to the subset one and would draw the ffi glyph (at glyph id 1, presumably)

      That text object should be surrounded by an ActualText tag with the value "ffi" (not relevant for this issue, but mentioned for completeness)

      Finally, and most importantly, the subset font would need to have a cmap that says that glyph 1 refers to Unicode codepoint for the ligature which is 0xFB03.

      How is the PDF creator app supposed to know that Unicode value? More generally, how is the app supposed to know the Unicode values of _any_ such substitution made by Harfbuzz (when they can be represented with a single Unicode codepoint)?

      The closest thing to this that I could find was this: https://github.com/harfbuzz/harfbuzz/discussions/3267 but even that depends on using heuristics to reverse engineer what the shaper did after the fact.

    4. > The PDF would then contain a single subset font with just that one glyph.

      Right.

      > The PDF graphics stream would consist of one text object that sets up the font to the subset one and would draw the ffi glyph (at glyph id 1, presumably)

      Right.

      > That text object should be surrounded by an ActualText tag with the value "ffi" (not relevant for this issue, but mentioned for completeness)

      This shouldn’t be needed for this particular case, but might be needed for other cases, see below.

      > Finally, and most importantly, the subset font would need to have a cmap that says that glyph 1 refers to Unicode codepoint for the ligature which is 0xFB03.

      The font does not need a cmap table at all, since we are working with glyphs now not codepoints.

      > How is the PDF creator app supposed to know that Unicode value? More generally, how is the app supposed to know the Unicode values of _any_ such substitution made by Harfbuzz (when they can be represented with a single Unicode codepoint)?

      It does not need to know it, because its existence is irrelevant. What would it do if the ligature has no Unicode codepoint associated with it at all (some fonts have ft, fj, fb, ffb, etc. ligatures, and there are no legacy Unicode codepoints for these).

      What it needs to know is that this ligature corresponds to the string “ffi” i.e. the codepoints U+0066, U+0066, and U+0069. It would then add a /ToUnicode to the font dictionary that has a CMap with a mapping from the glyph id to the three codepoints:
      <0001> <006600660069>

      ToUnicode CMaps can handle one glyph to one code point mappings (i.e. reverse of font’s cmap table, as well as handle single substitutions like smallcaps, where it would map the small cap glyph to the codepoint of the lower case letters it was mapped from), as well as one glyph to multiple code points (like the ligature here).

      It, however, can’t handle multiple glyphs to one code point or multiple glyphs to multiple code points (both can happen in OpenType). It also requires the mapping to be unique, so if the same glyph is used for multiple code points (e.g. a font where the “A” glyph is used for both Latin A and Greek Alpha), only one of them can appear in the CMap table. In these cases /ActualText tags have to be used.
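A minimal sketch of producing such mapping lines is shown below. It assumes two-byte glyph codes and encodes the target text as UTF-16BE hex; the function names are made up, and the surrounding CMap header and codespace range that a complete /ToUnicode stream needs are left out.

```python
def to_unicode_hex(text):
    """Encode text as the UTF-16BE hex string used in a ToUnicode CMap."""
    return text.encode("utf-16-be").hex().upper()

def bfchar_block(mapping):
    """Format {glyph_id: text} as a beginbfchar ... endbfchar block."""
    lines = [f"{len(mapping)} beginbfchar"]
    for gid, text in sorted(mapping.items()):
        lines.append(f"<{gid:04X}> <{to_unicode_hex(text)}>")
    lines.append("endbfchar")
    return "\n".join(lines)

# Glyph 1 is the ffi ligature; glyph 2 stands in for a small caps "m".
print(bfchar_block({1: "ffi", 2: "m"}))
```

Because the dictionary is keyed by glyph id, each glyph can appear only once, which mirrors the uniqueness requirement described above.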

      > The closest thing to this that I could find was this: https://github.com/harfbuzz/harfbuzz/discussions/3267 but even that depends on using heuristics to reverse engineer what the shaper did after the fact.

      Yes, because there is no way to reliably and generically map glyphs to code points. For a given shaping output, the shaper can tell you where in the input text this glyph came from (which is how text editing in GUI applications works, otherwise applications would have no way to know where to insert text or what to copy when the user hits a certain glyph). What it can’t tell you is what code point any arbitrary glyph maps to, because fonts can do complex context-sensitive substitution, and without the said context the shaper can’t give you reliable information. Not to mention that a glyph can correspond to multiple code points or only a part of a code point (e.g. the “i” codepoint might result in two glyphs, a dotless i and a tittle).

    5. > The font does not need a cmap table at all, since we are working with glyphs now not codepoints.

      Sorry for my typo, it should say "the PDF file needs to have a cmap entry". Not the font itself. For comparison if you do this in PangoCairo, it does map glyph 1 to 0xFB03.

      Somehow Pango is able to deduce the substitution. I don't know how it does that..

      > What it needs to know is that this ligature corresponds to the string “ffi” i.e. the codepoints U+0066, U+0066, and U+0069. It would then add a /ToUnicode to the font dictionary that has a CMap with a mapping from the glyph id to the three codepoints:

      <0001> <006600660069>

      This is finally the relevant bit. I did not know you can specify multiple characters in the ToUnicode block. Now that I know what to look for, you can fairly easily find the part of the PDF spec that says that is also supported in PDF's subset.

      So from CapyPDF's point of view we need to support rendering a glyph that maps to a Unicode string in addition to mapping to a Unicode value (which we already support).

      It is still the end user's problem, but at least the problem is now simpler.

    6. > For comparison if you do this in PangoCairo, it does map glyph 1 to 0xFB03.
      >
      > Somehow Pango is able to deduce the substitution. I don't know how it does that..

      This seems to be coming from Cairo. It is doing reverse cmap mapping (looking into the font’s cmap table and mapping glyphs to Unicode) and it uses that over the value supplied by the API caller (Pango here) when what the caller provides is more than one letter. IMO, this is wrong and the value the caller supplies should take precedence, since reverse cmap mapping is a heuristic that can fail, and the case here is an example of its failure.
