Monday, February 27, 2023

Unit testing PDF generation

How would you test PDF generation?

This turns out to be unexpectedly difficult because you need to check that the files are both syntactically and semantically valid. The former could be tested with existing PDF validators and converters but the latter is more difficult.

If you, say, try to render a red square, the end result should be that the PDF command stream has two commands, a re command and an f command. That could be verified simply by grepping the command stream with some regexps. It would not be very useful, though, as there is no guarantee that those commands actually produce a red square in the output document. There are dozens of ways to make the output stream not produce a red square in the intended location without breaking file "validity" in any way.
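To make the grepping approach concrete, here is a minimal sketch. It assumes the page's content stream has already been decompressed into text; the function name and the sample stream are made up for illustration and are not part of the project.

```python
import re

# Operands are numbers; "re" appends a rectangle, "f" fills the current path.
NUMBER = r"[-+]?(?:\d+\.?\d*|\.\d+)"
RE_CMD = re.compile(rf"(?:{NUMBER}\s+){{4}}re\b")
# A bare "f", not part of a longer token and not the even-odd variant "f*".
F_CMD = re.compile(r"(?<![\w*])f(?![\w*])")


def has_rect_and_fill(stream: str) -> bool:
    """Return True if a re command is followed somewhere later by an f command."""
    rect = RE_CMD.search(stream)
    if rect is None:
        return False
    return F_CMD.search(stream, rect.end()) is not None


# Hypothetical uncompressed content stream for a red square.
sample = "1 0 0 rg\n10 10 100 100 re\nf\n"
print(has_rect_and_fill(sample))
```

A check like this passes just as happily if a transformation matrix has moved the square off the page or a clipping path hides it, which is exactly the weakness described above.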

What even is the truth?

The only reliable way is to render the PDF file into an image and compare it to a ground truth image. Assuming the output is "close enough" then the generator can be said to have worked correctly. The problem, as is often the case, lies inside those quote marks. Fuzzy image equality is a difficult problem. Those interested in details can look it up online. For our case we'll just ignore it and require pixel perfect reproduction. This means that we can have test failures if we change the PDF rendering backend, run it on a different operating system or even just upgrade it to a new version.

The other problem comes from the desire to have a plain C API. Writing unit tests in C is cumbersome to say the least. Fortunately there is a simpler solution. Since the project ships its own Python bindings, we can write all of these tests using Python. This affords us all the niceties that exist in Python such as an extensive unit testing suite, the ability to easily spawn external processes and image difference operators (via PIL). After some boilerplate, writing a unit test reduces to this:

@validate_image('python_simple', 480, 640)
def test_simple(self, ofilename):
    ofile = pathlib.Path(ofilename)
    with a4pdf.Generator(ofile) as g:
        with g.page_draw_context() as ctx:
            ctx.set_rgb_nonstroke(1.0, 0.0, 0.0)
            ctx.cmd_re(10, 10, 100, 100)
            ctx.cmd_f()

Behind the scenes this will generate the PDF, render it with Ghostscript and compare the result to an existing image. If the output is not bitwise identical the test fails.

Get the code

The project is now available here.

Friday, February 17, 2023

PDF output in images

Generating PDF files is mostly (but not entirely) a serialization problem where you keep repeating the following loop:

  • Find out what functionality PDF has
  • Read the specification to find out how it is expressed using PDF document syntax
  • Come up with some sort of an API to express same
  • Serialize the latter into the former
  • Debug

This means that you have to spend a fair bit of time without much to show for it apart from documents with various black boxes in them. However once you have enough foundational code, then suddenly you can generate all sorts of fun images. Let's look at some now.

Paths are easy to define with lines, Béziers and the like, as are path paint styles like line caps and joins. Choosing between nonzero and even-odd winding rules is just a question of choosing a different paint operator.
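As a small illustration of the winding rule point, the sketch below builds a raw content stream by hand rather than going through the library. The helper name is made up. In PDF the fill operator f uses the nonzero rule and f* uses the even-odd rule.

```python
def nested_squares(even_odd: bool) -> str:
    """Return a content stream that fills two nested squares."""
    paint_op = "f*" if even_odd else "f"
    lines = [
        "1 0 0 rg",          # red nonstroking color
        "10 10 100 100 re",  # outer square
        "35 35 50 50 re",    # inner square, drawn in the same direction
        paint_op,            # the only line that differs between the rules
    ]
    return "\n".join(lines) + "\n"


print(nested_squares(even_odd=False))
print(nested_squares(even_odd=True))
```

With the nonzero rule the inner square is filled along with the outer one. With the even-odd rule it becomes a hole.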

PDF allows you to set any draw object as a "clipping path" which behaves like a stencil. Subsequent drawing operations are only applied to those pixels that are inside the specified clipping area. The painting model is uniform, text and paths are mostly interchangeable so text can be used as a clipping path. The gradient is a PNG image, not a vector object.

This color wheel looks fairly average, but it is defined in L*a*b* color space. Did you know that PDF has native support for L*a*b* colors without needing any ICC profiles? I sure didn't until I read the spec.
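For the curious, this is roughly what it looks like at the PDF syntax level. The sketch below only formats strings. The helper names and the resource name CS0 are made up, and the white point is the commonly used D65 value.

```python
def lab_colorspace(white_point=(0.9505, 1.0, 1.089),
                   ab_range=(-128, 127, -128, 127)) -> str:
    """Format a PDF Lab color space array. No ICC profile is involved."""
    wp = " ".join(f"{v:g}" for v in white_point)
    rng = " ".join(str(v) for v in ab_range)
    return f"[ /Lab << /WhitePoint [ {wp} ] /Range [ {rng} ] >> ]"


def lab_fill(resource_name: str, l: float, a: float, b: float) -> str:
    """Content stream commands that select the color space and set a fill color."""
    return f"/{resource_name} cs\n{l:g} {a:g} {b:g} sc\n"


print(lab_colorspace())
print(lab_fill("CS0", 50, 40, -30))
```

The array goes into the page's ColorSpace resource dictionary under the chosen name, after which the content stream can refer to it.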

And finally here are some shadings and patterns. The first two are your standard linear and spherical gradients, but the latter two are more interesting. In PDF you can specify a pattern, which is basically just a rectangular area. You can draw on it with almost all the same operators as on a page (you can't use patterns within patterns, though). You can then use said pattern to paint other objects and the PDF renderer will fill the space by tiling the pattern (yes, of course there is a transformation matrix you can specify). As text is not special you can draw a single character and fill it with a repeated instance of a different character.

Using it in Python

The code needed to generate an empty PDF document looks approximately like this:

import a4pdf
o = a4pdf.Options()
g = a4pdf.Generator('out.pdf', o)
with g.page_draw_context() as ctx:
    # Drawing commands would go here.
    pass

This snippet utilizes almost 100% of all available API thus far. So there's not much you can do with it yet.

Monday, February 13, 2023

Plain C API design, the real world Kobayashi Maru test

Designing APIs is hard. Designing good APIs that future people will not instantly classify as "total crap" is even harder. There are typically many competing requirements such as:

  • API stability
  • ABI stability (if you are into that sort of thing, some are not)
  • Maximize the amount of functionality supported
  • Minimize the number of functions exposed
  • Make the API as easy as possible to use
  • Make the API as difficult as possible to use incorrectly (preferably it should be impossible)
  • Make the API as easy as possible to use from scripting languages

Recently I have been trying to create a proper API for PDF generation so let's use that as an example.

Cairo, simple but limited

The API that Cairo exposes is on the whole pretty good. It has a fair bit of functions, but only one main "painter", the Cairo context. Cairo is a general drawing library with many backends, but the drawing commands map very closely to the ones in PDF. This is probably because Cairo's drawing model is patterned after PostScript, which is almost the same as PDF. Having only one context type means that the users do not have to manually keep track of life times between different object types, which is the source of many C bugs.

This approach works nicely with Cairo but not so well if you want to expose the full functionality of PDF directly, specifically patterns. In PDF you can specify a "pattern object". The basic use case for it is if you need to draw a repeating shape, like a brick wall, by specifying how to draw a single tile and then telling the PDF interpreter to "fill in" the area you specify with this pattern. (Cairo also has pattern support which behaves mostly the same but is ideologically slightly different. We'll ignore those for the rest of this text.)

When defining a pattern you can use almost but not exactly the same drawing commands as when doing regular painting on page surfaces. There are also at least two different pattern types with slightly varying semantics. Since we want to expose PDF functionality directly, we need to have one function for each command, like pdf_draw_cmd_l(ctx, x, y) to draw a line. The question then becomes how does one expose all this as types and functions.

Keep everything in a single object

The simplest thing objectwise would be to keep everything in a single god object and have functions like pdf_draw_page_cmd_l, pdf_draw_pattern1_cmd_l and pdf_draw_pattern2_cmd_l. This is a terrible API because everything is smooshed together and you need to remember to finish patterns before using them. Don't do this.

Fully separate object types

Another approach is to make each concept their own separate type. Then you can have functions like pdf_page_cmd_l(page, x, y), pdf_pattern_cmd_l(pattern, x, y) and so on. This also makes it easy to prevent using commands that are not supported. If, say, a command called bob is not supported on patterns, then all you have to do is to not implement the corresponding function pdf_pattern_cmd_bob.

The big downside is that there are a lot of drawing commands in PDF and in this approach almost all of them need to be defined three times, once for each context type. Their implementations are identical, so they all need to call a fourth function or the code needs to be triplicated.

A common context class

One approach is to abstract this and have a PaintContext class that internally knows whether it is used for page or pattern painting. This reduces the number of functions back to one: pdf_ctx_cmd_l(ctx, x, y). The main downside is that now it is possible to accidentally call a function that requires a page drawing context with a pattern drawing context and the type system will not stop you.

A second problem is that you can call the aforementioned bob command with a pattern context. The library needs to detect that and return an error code if it happens. What this means is that a bunch of functions that previously could not fail, can now return error codes. For consistency you might want to change all paint commands to return error codes instead, but then >90% of them never return anything except success.
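The trade-off is easier to see in code. The sketch below models the design in Python rather than C to keep it short. All names are illustrative, and bob is the same made-up page-only command as above.

```python
from enum import Enum, auto


class ContextKind(Enum):
    PAGE = auto()
    PATTERN = auto()


class ErrorCode(Enum):
    OK = 0
    UNSUPPORTED_IN_CONTEXT = 1


class PaintContext:
    """One context type for both pages and patterns."""

    def __init__(self, kind: ContextKind):
        self.kind = kind
        self.commands: list[str] = []

    def cmd_l(self, x: float, y: float) -> ErrorCode:
        # Valid in every context, so this can only ever return OK.
        self.commands.append(f"{x:g} {y:g} l")
        return ErrorCode.OK

    def cmd_bob(self) -> ErrorCode:
        # Page-only command: the check moves from compile time to run time.
        if self.kind is not ContextKind.PAGE:
            return ErrorCode.UNSUPPORTED_IN_CONTEXT
        self.commands.append("bob")
        return ErrorCode.OK


pattern = PaintContext(ContextKind.PATTERN)
print(pattern.cmd_l(10, 20))
print(pattern.cmd_bob())
```

Here cmd_l carries a return value only for the sake of consistency, which is the >90% case mentioned above.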

A common base class

The "object oriented" way of doing this would be to have a common base class for the painting functionality and then inherit that. In this approach functions that can take any context would have names like pdf_ctx_cmd_l(ctx, x, y) whereas functions that cannot would get specializations like pdf_page_cmd_bob. Since C does not have any OO functionality this would need to be reimplemented from scratch, probably using some GObject-style preprocessor macro hackery like pdf_ctx_cmd_l(PDF_CTX(page), x, y) or alternatively pdf_ctx_cmd_l(pdf_page_get_ctx(page), x, y). This works, but means a lot of typing for end users and macros are type unsafe even by C standards. If you use the wrong type, woe is you. Macros make providing wrappers harder because they require you to always compile some glue code rather than using something simple like Python's ctypes.

Is there a way to cheat?

I have not managed to come up with a way. Do let me know if you do.

Wednesday, February 8, 2023

More PDF, C API and Python

After a whole lot of bashing my head against the desk I finally managed to find out what Acrobat Reader's "error 14" means and to make both font subsetting and graphics generation work. This means you can now do things like this:

After this, improving the code to handle full PDF graphics seems mostly to be a question of adding functions for all primitives in the PDF graphics model. The big unknown thing is PDF form support, of which I know nothing. Being a new design it probably is a lot less convoluted than PDF fonts were.

Dependencies

The code is a few thousand lines of C++ 20. It requires surprisingly few dependencies:

  • Fmt
  • Zlib
  • Libpng
  • Freetype
  • LittleCMS

Some of these are not actually necessary. Fmtlib will be in the standard library in C++23. Libpng is only used to load PNG images from disk. The library could require its users to load graphics themselves and pass images in as pixel arrays. Interestingly doing font subsetting requires parsing the raw data of TrueType files by hand, so Freetype is not strictly mandatory, though it does make some things easier.

The only things you'd actually need are Zlib and LittleCMS. If one wanted to support CCITT Group 4 compression for 1 bit images, you'd need a dependency on libtiff.

A plain C API

The unfortunate side of library development is that if you want your library to be widely used, you have to provide a plain C API. For PDF it's not all that bad as you can mostly copy what Cairo does as its C API is quite nice to use. You might want to design this early on as getting the C API as easy and reliable to use as possible has effects on how the internal architecture works. As an example you should make all objects independent of each other. If the end user has to do things like "be sure to destroy all objects of type X before calling function F on object Y", then, because this is C, they are going to get it wrong and cause segfaults (at best).

Python integration

Once you have the C API, though, you can do all sorts of fun things, such as using Python's ctypes module. It takes a bit of typing and drudgery, but eventually you can create a "dependencyless" Python wrapper. With it you can do this to create an empty PDF file:

o = PdfOptions()
g = PdfGenerator(b"python.pdf", o)
g.new_page()

That's all you can do ATM, as these are the only methods exposed in the C API. Just implementing these made it very clear that the API is not good and needs to be changed.


Wednesday, February 1, 2023

PDF with font subsetting and a look in the future

After several days of head scratching, debugging and despair I finally got font subsetting working in PDF. The text renders correctly in Okular, goes through Ghostscript without errors and even passes an online PDF validator I found. But not Acrobat Reader, which chokes on it completely and refuses to show anything. Sigh.

The most likely cause is that the subset font that gets generated during this operation is not 100% valid. The approach I use is almost identical to what LO does, but for some reason their PDFs work. Opening both files in FontForge seems to indicate that the .notdef glyph definition is somehow not correct, but offers no help as to why.

In any case it seems like there would be a need for a library for PDF generation. Existing libs either do not handle non-RGB color spaces or are implemented in Java, Ruby or other languages that are hard to use from languages other than themselves. Many programs, like LO and Scribus, have their own libraries for generating PDF files. It would be nice if there could be a single library for that.

Is this a reasonable idea that people would actually be interested in? I don't know, but let's try to find out. I'm going to spend the next weekend at FOSDEM. So if you are going too and are interested in PDF generation, write a comment below or send me an email, I guess? Maybe we can have a shadow meeting in the cafeteria.

Wednesday, January 25, 2023

Typesetting an entire book part V: Getting it published

Writing a book is not that difficult. Sure, it is laborious, but if you merely keep typing away day after day, eventually you end up with a manuscript. Writing a book that is "good" or one that other people would want to read is a lot harder. Still, even that is easy compared to trying to get a book published. According to various unreferenced sources on the Internet, out of all manuscripts submitted only 1 in 1000 to 1 in 10 000 gets accepted for publication. Probabilitywise this is roughly as unlikely as casting five dice and getting a six with all of them.

Having written a manuscript I went about trying to get it published. The common approach in most countries is that first you have to pitch your manuscript to a literary agent, and if you succeed, they will then try to pitch it to publishers. In Finland the procedure is simpler: anyone can submit their manuscripts directly to book publishing houses without a middle man. While this makes things easier, it does not help with deciding how much the manuscript should be polished before submission. The more you polish the bigger your chances of getting published, but the longer it takes and the more work you have to do if the publisher wants to make changes to the content.

Eventually I ended up with a sort-of-agile approach. I first gathered a list of all book publishers that have published scifi recently (there were not many). Then I polished the manuscript enough so it had no obvious deficiencies and sent it to the first publisher on the list. Then I did a full revision of the text and sent it to the next one and so on. Eventually I had sent it to all of them. Very soon thereafter I had received either a rejection email or nothing at all from each one.

It's not who you are, but who you know

Since the content did not succeed in selling itself, it was time to start using connections. I have known Pertti Jarla, who is one of Finland's most popular cartoonists, for several years. He runs a small scale publishing company. Its most famous book thus far has been a re-translated version of Philip K. Dick's Do Androids Dream of Electric Sheep. I reached out to him and, long story short, the book should be available in Finnish bookstores in a few weeks. The front cover looks like this.

More information in Finnish can be found on the publisher's web site. As for the obvious question of what would the book's title be in English, unfortunately the answer is "it's quite complicated to translate, actually". Basically what the title says is "A giant leap for mankind" but also not, and I have not managed to come up with a description or a translation that would not be a spoiler.

So you'll just have to wait for part VI: Getting translated. Which is an order of magnitude more difficult than getting published.

Tuesday, January 17, 2023

PDF, text and fonts, a design by The Devil & Associates

PDF has a fairly straightforward model for rendering text. Basically you specify a font to use, define a transformation matrix if needed and specify the text string to render. The PDF renderer will then load the embedded font file, extract the curves needed for your letters and render them on canvas. And it works quite nicely.

Assuming you are using plain ASCII. Because 127 glyphs should be enough for everybody.

If, for some weirdo reason, you belong to that ~80% minority population of planet Earth whose native tongue uses characters beyond ASCII, you are not going to be in a happy place. The whole situation is nicely summarized in this Computerphile video.

It may be comforting to know that nothing has changed since 1993. Dealing with fonts and text in PDF is still very much a case of voodoo incantations and guesswork. It does not really help that phrases containing words like "PDF", "text", and "fonts" are completely ungoogleable because they find a million hits on people trying to create PDF files with existing tools as opposed to how font generation works under the hood.

How does one create text with PDF?

What most people generating PDF would want is to be able to specify a font file, starting location and the text to render in UTF-8. Some people might want finer control over text rendering and instead provide an array of Unicode glyphs and coordinates. Cairo provides both of these. PDF provides neither. Even the ASCII subset PDF does provide is mostly useless by itself for the following reasons.

Let's start with font metadata. When you embed a font inside a PDF file you also need to provide the width of each character as a PDF array in the font specification dictionary. This data is exactly the same as is in the font file itself. The PDF renderer must know how to parse said files to get the glyph curves out but the specification still mandates that every PDF creator program must replicate this data. Don't believe me? This is what the PDF specification document says in subsection 9.3.4:

The width information for each glyph shall be stored both in the font dictionary and in the font program itself.

The two sets of widths shall be identical.

Why is this? A reasonable assumption is that the person who was tasked with adding TrueType font support at Adobe back in the day got lazy and just insisted that font embedders must provide this information in PDF metadata format because the code to process those was already implemented. But it gets worse. The glyph width data is specified with three items, FirstChar, LastChar and Widths. The last of these contains the widths of all glyphs between the two endpoints. This means that if you wanted to create a PDF document containing the single word παν語 (the stylised form of "Pango") then you'd need an array with some 35 thousand entries, all but four of which would be unused. You still need to define them, or at least, you would need to if PDF text worked this way. It does not. We'll come back to this later.
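The arithmetic behind that number is easy to check. The sketch below assumes the naive model described above, where character codes are simply Unicode code points. The function name is made up.

```python
def widths_array_size(text: str) -> tuple[int, int, int]:
    """Return (FirstChar, LastChar, number of Widths entries) for the text."""
    codepoints = {ord(ch) for ch in text}
    first, last = min(codepoints), max(codepoints)
    return first, last, last - first + 1


word = "παν語"
first, last, count = widths_array_size(word)
unused = count - len(set(word))
print(first, last, count, unused)  # prints: 945 35486 34542 34538
```

The lowest code point is α (U+03B1) and the highest is 語 (U+8A9E), so the array would need 34542 entries, of which 34538 would be padding.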

As a palate cleanser let's note that even though PDF documentation always speaks of glyph widths, this array does not contain glyph widths. What it actually contains is glyph advancements, but given that the PDF text model predates TrueType font technology we can let that one pass.

What we can't ignore, though, is kerning. If you want your text to look good, you have to do proper kerning and the font file has all the information necessary to do that. PDF does not read that (and, as far as I can tell, you can't make it do so) but instead it requires that you specify kerning inline in your text rendering command. This means that in practice you can't tell PDF to render more than one glyph of text and expect it to do the right thing. Instead you have to manually process each glyph and add kerning directives between each pair of characters as necessary.
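In practice the inline kerning goes into the TJ operator, which takes an array of strings interleaved with numeric adjustments. Here is a minimal sketch that only handles plain ASCII text. The kerning table is a made-up stand-in for values you would read from the font file, expressed in thousandths of an em.

```python
def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def kerned_tj(text: str, kerning: dict[tuple[str, str], int]) -> str:
    """Build a TJ command with kerning adjustments between glyph pairs.

    TJ numbers are subtracted from the pen position, so the sign of the
    font's kerning value has to be flipped.
    """
    parts: list[str] = []
    run = ""
    for i, ch in enumerate(text):
        run += ch
        nxt = text[i + 1] if i + 1 < len(text) else None
        kern = kerning.get((ch, nxt), 0) if nxt is not None else 0
        if kern != 0:
            # Close the current string and emit the adjustment after it.
            parts.append(f"({escape_pdf_string(run)})")
            parts.append(str(-kern))
            run = ""
    if run:
        parts.append(f"({escape_pdf_string(run)})")
    return "[ " + " ".join(parts) + " ] TJ"


# Hypothetical kerning values for the pairs AV and VA.
print(kerned_tj("AVA", {("A", "V"): -80, ("V", "A"): -80}))
# prints: [ (A) 80 (V) 80 (A) ] TJ
```

Runs of characters without kerning between them can stay in one string, so in the best case the output is not much longer than the input.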

The magic 255

Creating PDF text with glyph IDs and locations seems like a reasonable approach then. The PDF documentation even says that you can specify glyph ids directly with one of two different quoting methods: octal numbering or UTF-16BE values. This works fine until you try to specify a glyph number bigger than 255, at which point it fails incomprehensibly. I spent days trying to debug this. There are several different indices that could be used, mapped to each other and so on. I could not for the life of me figure out a combination that worked. Every attempt to use glyph indices failed. There is no useful documentation for this, you basically have to read code for existing PDF creators and examine their output with a hex editor or bespoke tools. The various PDF validators I tried were not very helpful either because their error messages were of the type "this PDF file is not valid in some way lol ;-)".

Eventually I tried creating a LO document that had more than 255 unique glyphs and exported that. The generated PDF file had two different subsets of the used font. The first had 255 characters and the second one had the rest. Cairo's PDF output is roughly similar. This leads me to believe that, contrary to what the documentation implies, it is not possible to have more than 255 glyphs in a single font in PDF. Or, if it is possible, then the way you go about getting that done is so bizarrely complicated that nobody has managed to get it working. Instead what you have to do is to create font subsetting code from scratch that packs all used glyphs into blocks of 255, map each input glyph to these packed ids, and write a cmap file so that the PDF renderer can convert the packed glyph ids back to Unicode values. You can't even use Freetype to do the subsetting, since it does not provide font creation functionality, only reading and rendering. Instead you get to do binary data mangling by hand. In big endian. Obviously.
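The bookkeeping part of that, as opposed to the binary mangling part, can be sketched in a few lines. This is a guess at a workable scheme and not a description of what LO or Cairo actually do. In particular it assumes that id 0 in every subset is reserved for .notdef, and it works on characters rather than real glyph indices.

```python
def pack_glyphs(text: str, block_size: int = 255) -> dict[str, tuple[int, int]]:
    """Map each distinct character to (subset index, packed glyph id).

    Assumption: id 0 of every subset is reserved for .notdef, so packed
    ids run from 1 to block_size.
    """
    mapping: dict[str, tuple[int, int]] = {}
    for ch in text:
        if ch not in mapping:
            n = len(mapping)
            mapping[ch] = (n // block_size, n % block_size + 1)
    return mapping


def to_unicode_table(mapping: dict[str, tuple[int, int]], subset: int) -> dict[int, int]:
    """Reverse lookup for one subset: packed id -> Unicode code point."""
    return {gid: ord(ch) for ch, (s, gid) in mapping.items() if s == subset}


# 300 distinct characters need two subsets: 255 in the first, 45 in the second.
text = "".join(chr(0x400 + i) for i in range(300))
mapping = pack_glyphs(text)
print(mapping[chr(0x400)], mapping[chr(0x400 + 255)])
print(len(to_unicode_table(mapping, 0)), len(to_unicode_table(mapping, 1)))
```

The reverse table is what would be written out as the cmap for each subset font.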

If someone reading this knows for sure whether all of the above is true or if there is a simpler way (converting text to curves by hand does not count), do let me know in the comments.