Friday, June 21, 2024

Advanced text features and PDF

The basic text model of PDF is quite nice. On the other hand its basic design was a very late 80s "ASCII is everything everyone really needs, but we'll be super generous and provide up to 255 glyphs using a custom encoding that is not in use everywhere else". As you can probably guess, this causes a fair bit of problems in the modern world.

To properly understand the text that follows you should know that there are four different ways in which text and letters need to be represented to get things working:

Source text is the "original" written text in UTF-8 (typically)
Unicode codepoints represent unique Unicode IDs as specified by the Unicode standard
A glyph id uniquely specifies a glyph (basically a series of drawing operations), these are arbitrary and typically unique for each font
ActualText is sort of like an AltText for PDF but uses UTF-16BE as was the way of the future in the early 90s

Kerning

The most common advanced typography feature in use is probably kerning, that is, custom spacing between certain letter pairs like "AV" and "To". The PDF text model has native support for kerning and it even supports vertical and horizontal kerning. Unfortunately the way things are set up means that you can only specify horizontal kerning when laying out horizontal text and vertical kerning for vertical text. If your script requires both, you are not going to have a good time.

There are several approaches one can take. The simplest is to convert all text to path drawing operations, which can be placed anywhere with arbitrary precision. This works great for printed documents but also means that document sizes balloon and you can't copypaste text from the document, use screen readers or do any other operation that needs the actual text those shapes represent.

An alternative is to render each glyph as its own text object with exact coordinates. While verbose this works, but since every letter is separate, text selection becomes wonky again. PDF readers seem to have custom heuristics to try to detect these issues and fix text selection in post-processing. Sometimes it works better than at other times.

Everything in PDF drawing operations is based on matrices. Text has its own transform matrix that defines where the next glyph will go. We could specify kerning manually with a custom translation matrix that translates the rendering location by the amount needed. There are two main downsides to this. First of all it would mean that instead of having a stream of glyphs to render, you'd need to define 9 floating point numbers (actually 6 due to reasons) between every pair of glyphs. This would increase the size of you output by a factor of roughly ten. The other downside is that unlike for all other matrices, PDF does not permit you to multiply an existing text state matrix with a new one. You can only replace it completely. So the actual code path would become "tell PDF to draw a glyph, work out what changes it would make to the currently active text matrix, undo that, multiply that matrix with one that has the changes that you wanted to happen and proceed to the next glyph".

Glyph substitution

Most of the time (in most scripts anyway) source text's Unicode codepoints get mapped 1:1 to a font glyph in the final output. Perhaps the most common case where this does not happen is ligatures.

The actual rules when and how this happens are script, font and language dependent. This is something you do not want to do yourself, instead use a shaping engine like Harfbuzz. If you give it the source text as UTF-8 and a font that has the ffi ligature, it will return a list of four glyph ids in the font to use, the way they map back to the original text, kerning (if any) and all of that good stuff.

What it won't give you is the information of what ligatures it replaced your source text with. In this example it will tell you the glyph id of the ffi ligature (2132) but not which Unicode codepoint it corresponds to (0xFB03). You need to tell that number in PDF metadata for the text to work properly in copypaste operations. At first this does not seem like such a big problem, because we have access to the original font file and Freetype. You'd think you can just ask Freetype for the Unicode codepoint for a given font glyph, but you can't. There is a function for finding a glyph for a given Unicode codepoint but mot the other way around. The stackoverflow recommended way of doing this is to iterate over all glyphs until you find the one that is mapped to the desired codepoint. For extra challenge you need to write an ActualText tag in the PDF command stream so that when users copypaste that text they get the original form with each individual letter rather than the ffi Unicode glyph.

All of this means that glyph lookup is basically a O(n^2) operation if it was possible to do. Sometimes it isn't, as we shall now find out.

Alternate forms

OpenType fonts can have multiple different glyphs for the same Unicode codepoint, for example the small caps versions of Noto Serif look like this.

These are proper hand-drawn versions of the glyphs, not ones obtained by scaling down upper case letters. Using these is simple, you tell Harfbuzz to use the small caps versions when shaping and then it does everything for you. For this particular font upper case small caps glyphs are the same as regular upper case glyphs. The lower case ones have their own glyphs in the font. However, according to Freetype at least, those glyphs are not mapped to any Unicode codepoint. Conceptually a small caps lower case "m" should be the same as a regular lower case "m". For some reason it is not and, unless I'm missing something, there is no API that can tell you that. The only way to do it "properly" is to track this yourself based on your input text and requirements.

How does CapyPDF handle all this?

In the same way pretty much all PDF generator libraries do: by ignoring all of it. CapyPDF only provides the means to express all underlying functionality in the PDF library. It is the responsibility of the client application to form glyph sequences and related PDF metadata in the way that makes sense for their application and document structure.

13 comments:

AnonymousJune 22, 2024 at 9:51 PM
Thanks for chronicling these things on the Internet for future generations Jussi. You are doing a real service. In a past life I worked on Scribus and KWord (later Calligra Words) and while I did not work on the nitty gritty details of exporting text to PDF I find it all interesting, and I'm sure others will find these posts golden down the line. Kudos.
ReplyDelete
Replies
nimJune 24, 2024 at 7:18 PM
> You'd think you can just ask Freetype for the Unicode codepoint for a given font glyph, but you can't. There is a function for finding a glyph for a given Unicode codepoint but not the other way around.

That’s because there is no such thing a a reverse lookup in modern smart fonts. An Opentype font is a glyph soup with rules to transform codepoints into soup references (and some of the transformations depend on the codepoint context ie the codepoints before and after, the same codepoint can render into different glyphs depending on the context and the same glyph can be used to render several codepoints). And the soup ids are not stable from one version on a font to the next one let alone from one font to another.

The correct way to do cut and paste is to give back the original codepoint list and have the shaper re-render it depending on the context and the available fonts. Never pass glyph ids around they are internal ids not intended for reuse.

> Unicode codepoints represent unique glyph IDs as specified by the Unicode standard

Nope. Not remotely true except in the simplest cases.
ReplyDelete
Replies
Khaled HosnyJuly 3, 2024 at 9:09 PM
> What it won't give you is the information of what ligatures it replaced your source text with. In this example it will tell you the glyph id of the ffi ligature (2132) but not which Unicode codepoint it corresponds to (0xFB03).

When using HarfBuzz, the output glyph info has both the glyph id and a “cluster” field. The cluster is basically an index to the input string so that you can tell what part of the input string was mapped to this glyph. It looks simple, but it is rather powerful and can represent one-to-one, one-to-many, many-to-one and many-to-many relationships between input text to output glyphs.

https://harfbuzz.github.io/clusters.html
ReplyDelete
Replies

Add comment