Writing a book is not that difficult. Sure, it is laborious, but if you merely keep typing away day after day, eventually you end up with a manuscript. Writing a book that is "good" or one that other people would want to read is a lot harder. Still, even that is easy compared to trying to get a book published. According to various unreferenced sources on the Internet, out of all manuscripts submitted only 1 in 1000 to 1 in 10 000 gets accepted for publication. Probabilitywise this is roughly as unlikely casting five dice and getting six with all of them.

Having written a manuscript I went about tying to get it published. The common approach in most countries is that first you have to pitch your manuscript to a literary agent, and if you succeed, they will then try to pitch it to publishers. In Finland the the procedure is simpler, anyone can submit their manuscripts directly to book publishing houses without a middle man. While this makes things easier, it does not help with deciding how much the manuscript should be polished before submission. The more you polish the bigger your chances of getting published, but the longer it takes and the more work you have to do if the publisher wants to make changes to the content.

Eventually I ended up with a sort-of-agile approach. I first gathered a list of all book publishers that have published scifi recently (there were not many). Then I polished the manuscript enough so it had no obvious deficiencies and sent it to the first publisher on the list. Then I did a full revision of the text and sent it to the next one and so on. Eventually I had sent it to all of them. Very soon thereafter I had received either a rejection email or nothing at all from each one.

It's not who you are, but who you know

Since the content did not succeed in selling itself, it was time to start using connections. I have known Pertti Jarla, who is one of Finland's most popular cartoonists, for several years. He runs a small scale publishing company. Its most famous book thus far has been a re-translated version of Philip K. Dick's Do Androids Dream of Electric Sheep. I reached out to him and, long story short, the book should available in Finnish bookstores in a few weeks. The front cover looks like this.

More information in Finnish can be found on the publisher's web site. As for the obvious question of what would the book's title be in English, unfortunately the answer is "it's quite complicated to translate, actually". Basically what the title says is "A giant leap for mankind" but also not, and I have not managed to come up with a description or a translation that would not be a spoiler.

So you'll just have to wait for part VI: Getting translated. Which is an order of magnitude more difficult than getting published.

PDF has a fairly straightforward model for rendering text. Basically you specify a font to use, define a transformation matrix if needed and specify the text string to render. The PDF renderer will then load the embedded font file, extract the curves needed for your letters and render them on canvas. And it works quite nicely.

Assuming you are using plain ASCII. Because 127 glyphs should be enough for everybody.

If, for some weirdo reason, you belong to that ~80% minority population of planet Earth whose native tongue uses characters beyond ASCII, you are not going to be in a happy place. The whole situation is nicely summarized in this Computerphile video.

It may be comforting to know that nothing has changed since 1993. Dealing with fonts and text in PDF is still very much a case of voodoo incantations and guesswork. It does not really help that phrases containing words like "PDF", "text", and "fonts" are completely ungoogleable because they find a million hits on people trying to create PDF files with existing tools as opposed how font generation works under the hood.

How does one create text with PDF?

What most people generating PDF would want is to be able to specify a font file, starting location and the text to render in UTF-8. Some people might want finer control over text rendering and instead provide an array of Unicode glyphs and coordinates. Cairo provides both of these. PDF provides neither. Even the ASCII subset PDF does provide is mostly useless by itself for the following reasons.

Let's start with font metadata. When you embed a font inside a PDF file you also need to provide the width of each character as a PDF array in the font specification dictionary. This data is exactly the same as is in the font file itself. The PDF renderer must know how to parse said files to get the glyph curves out but the specification still mandates that every PDF creator program must replicate this data. Don't believe me? This is what the PDF specification document says in subsection 9.3.4:

The width information for each glyph shall be stored both in the font dictionary and in the font program itself.
The two sets of widths shall be identical.

Why is this? A reasonable assumption is that the person who was tasked with adding TrueType font support at Adobe back in the day got lazy and just insisted that font embedders must provide this information in PDF metadata format because the code to process those was already implemented. But it gets worse. The glyph width data is specified with three items, FirstChar, LastChar and Widths. The last of these contains the widths of all glyphs between the two endpoints. This means that if you wanted to create a PDF document containing the single word παν語 (the stylised form of "Pango") then you'd need an array with some 35 thousand entries, all but four of which would be unused. You still need to define them, or at least, you would need to if PDF text worked this way. It does not. We'll come back to this later.

As a palette cleanser let's note that even though PDF documentation always speaks of glyph widths, this array does not contain glyph widths. What it actually contains is glyph advancements, but given that the PDF text model predates TrueType font technology we can let that one pass.

What we can't ignore, though, is kerning. If you want your text to look good, you have to do proper kerning and the font file has all the information necessary to do that. PDF does not read that (and, as far as I can tell, you can't make it do so) but instead it requires that you specify kerning inline in your text rendering command. This means that in practice you can't tell PDF to render more than one glyph of text and expect it to do the right thing. Instead you have to manually process each glyph and add kerning directives between each pair of characters as necessary.

The magic 255

Creating PDF text with glyph IDs and locations seems like a reasonable approach then. The PDF documentation even says that you can specify glyph ids directly with one of two different quoting methods: octal numbering or UTF-16BE values. This works fine until you try to specify a glyph number bigger than 255, at which point fails incomprehensibly. I spent days trying to debug this. There are several different indices that could be used, mapped to each other and so on. I could not for the life of me figure out a combination that worked. Every attempt to use glyph indices failed. There is no useful documentation for this, you basically have to read code for existing PDF creators and examine their output with a hex editor or bespoke tools. The various PDF validators I tried were not very helpful either because their error messages were of the type "this PDF file is not valid in some way lol ;-)".

Eventually I tried creating a LO document that had more than 255 unique glyphs and exported that. The generated PDF file had two different subsets of the used font. The first had 255 characters and the second one had the rest. Cairo's PDF output is roughly similar. This leads me to believe that, contrary to what the documentation implies, it is not possible to have more than 255 glyphs in a single font in PDF. Or, if it is possible, then the way you go about getting that done is so bizarrely complicated that nobody has managed to get it working. Instead what you have to do is to create font subsetting code from scratch that packs all used glyphs into blocks of 255, map each input glyph to these packed ids, and write a cmap file so that the PDF renderer can convert the packed glyph ids back to Unicode values. You can't even use Freetype to do the subsetting, since it does not provide font creation functionality, only reading and rendering. Instead you get to do binary data mangling by hand. In big endian. Obviously.

If someone reading this knows for sure whether all of the above is true or if there is a simpler way (converting text to curves by hand does not count), do let me know in the comments.

Nibble Stew

Wednesday, January 25, 2023

Typesetting an entire book part V: Getting it published

It's not who you are, but who you know

Tuesday, January 17, 2023

PDF, text and fonts, a design by The Devil & Associates

How does one create text with PDF?

The magic 255