Monday, February 27, 2023

Unit testing PDF generation

How would you test PDF generation?

This turns out to be unexpectedly difficult because you need to check that the files are both syntactically and semantically valid. The former could be tested with existing PDF validators and converters but the latter is more difficult.

If you, say, try to render a red square, the end result should be that the PDF command stream has two commands, a re command and an f command. That could be verified simply by grepping the command stream with some regexps. It would not be very useful, though, as there is no guarantee that those commands actually produce a red square in the output document. There are dozens of ways to make the output stream not produce a red square in the intended location without breaking file "validity" in any way.

What even is the truth?

The only reliable way is to render the PDF file into an image and compare it to a ground truth image. Assuming the output is "close enough" then the generator can be said to have worked correctly. The problem, as is often the case, lies inside those quote marks. Fuzzy image equality is a difficult problem. Those interested in details can look it up online. For our case we'll just ignore it and require pixel perfect reproduction. This means that we can have test failures if we change the PDF rendering backend, run it on a different operating system or even just upgrade it to a new version.

The other problem comes from the desire to have a plain C API. Writing unit tests in C is cumbersome to say the least. Fortunately there is a simpler solution. Since the project ships its own Python bindings, we can write all of these tests using Python. This affords us all the niceties that exist in Python such as an extensive unit testing suite, the ability to easily spawn external processes and image difference operators (via PIL). After some boilerplate, writing an unit tests reduces to this:

@validate_image('python_simple', 480, 640)
def test_simple(self, ofilename):
    ofile = pathlib.Path(ofilename)
    with a4pdf.Generator(ofile) as g:
        with g.page_draw_context() as ctx:
            ctx.set_rgb_nonstroke(1.0, 0.0, 0.0)
            ctx.cmd_re(10, 10, 100, 100)
            ctx.cmd_f()

Behind the scenes this will generate the PDF, render it with Ghostscript and compare the result to an existing image. If the output is not bitwise identical the test fails.

Get the code

The project is now available here.

Friday, February 17, 2023

PDF output in images

Generating PDF files is mostly (but not entirely) a serialization problem where you keep repeating the following loop:

  • Find out what functionality PDF has
  • Read the specification to find out how it is expressed using PDF document syntax
  • Come up with some sort of an API to express same
  • Serialize the latter into the former
  • Debug

This means that you have to spend a fair bit of time without much to show for it apart from documents with various black boxes in them. However once you have enough foundational code, then suddenly you can generate all sorts of fun images. Let's look at some now.

Paths are easy to define with lines, beziers and the like, as are path paint styles like line caps and joints. Choosing between nonzero and even-odd winding rules is just a question of choosing a different paint operator. 

PDF allows you to set any draw object as a "clipping path" which behaves like a stencil. Subsequent drawing operations are only applied to those pixels that are inside the specified clipping area. The painting model is uniform, text and paths are mostly interchangeable so text can be used as a clipping path. The gradient is a PNG image, not a vector object.

This color wheel looks fairly average, but it is defined in L*a*b* color space. Did you know that PDF has native support for L*a*b* colors without needing any ICC profiles? I sure didn't until I read the spec.

And finally here are some shadings and patterns. The first two are your standard linear and spherical gradients, but the latter two are more interesting. In PDF you can specify a pattern, which is basically just a rectangular area. You can draw on it with almost all the same operators as on a page (you can't use patterns within patterns, though). You can then use said pattern to paint other objects and the PDF renderer will fill the space by tiling the pattern (yes, of course there is a transformation matrix you can specify). As text is not special you can draw a single character and fill it with a repeated instance of a different character.

Using it in Python

The code needed to generate an empty PDF document looks approximately like this:

import a4pdf
o = a4pdf.Options()
g = a4pdf.Generator('out.pdf', o)
with g.page_draw_context() as ctx:
    # Drawing commands would go here.
    pass

This snippet utilizes almost 100% of all available API thus far. So there's not much you can do with it yet.

Monday, February 13, 2023

Plain C API design, the real world Kobayashi Maru test

Designing APIs is hard. Designing good APIs that future people will not instantly classify as "total crap" is even harder. There are typically many competing requirements such as:

  • API stability
  • ABI stability (if you are into that sort of thing, some are not)
  • Maximize the amount of functionality supported
  • Minimize the number of functions exposed
  • Make the API as easy as possible to use
  • Make the API as difficult as possible to use incorrectly (preferably it should be impossible)
  • Make the API as easy as possible to use from scripting languages

Recently I have been trying to create a proper API for PDF generation so let's use that as an example.

Cairo, simple but limited

The API that Cairo exposes is on the whole pretty good. It has a fair bit of functions, but only one main "painter", the Cairo context. Cairo is a general drawing library with many backends, but the drawing commands map very closely to the ones in PDF. This is probably because Cairo's drawing model is patterned after PostScript, which is almost the same as PDF. Having only one context type means that the users do not have to manually keep track of life times between different object types, which is the source of many C bugs.

This approach works nicely with Cairo but not so well if you want to expose the full functionality of PDF directly, specifically patterns. In PDF you can specify a "pattern object". The basic use case for it is if you need to draw a repeating shape, like a brick wall, by specifying how to draw a single tile and then telling the PDF interpreter to "fill in" the area you specify with this pattern. (Cairo also has pattern support which behaves mostly the same but is ideologically slightly different. We'll ignore those for the rest of this text.)

When defining a pattern you can use almost but not exactly the same drawing commands as when doing regular painting on page surfaces. There are also at least two different pattern types with slightly varying semantics. Since we want to expose PDF functionality directly, we need to have one function for each command, like pdf_draw_cmd_l(ctx, x, y) to draw a line. The question then becomes how does one expose all this as types and functions.

Keep everything in a single object

The simplest thing objectwise would be to keep everything in a single god object and have functions like pdf_draw_page_cmd_l, pdf_draw_pattern1_cmd_l and pdf_draw_pattern2_cmd_l. This is a terrible API because everything is smooshed together and you need to remember to finish patterns before using them. Don't do this.

Fully separate object types

Another approach is to make each concept their own separate type. Then you can have functions like pdf_page_cmd_l(page, x, y), pdf_pattern_cmd_l(pattern, x, y) and so on. This also makes it easy to prevent using commands that are not supported. If, say, a command called bob is not supported on patterns, then all you have to do is to not implement the corresponding function pdf_pattern_cmd_bob.

The big downside is that there are a lot of drawing commands in PDF and in this approach almost all of them need to be defined three times, once for each context type. Their implementations are identical, so they all need to call a fourth function or the code needs to be triplicated.

A common context class

One approach is to abstract this have a PaintContext class that internally knows whether it is used for page or pattern painting. This reduces the number of functions back to one. pdf_ctx_cmd_l(ctx, x, y). The main downside is that now it is possible to accidentally call a function that requires a page drawing context with a pattern drawing context and the type system will not stop you.

A second problem is that you can call the aforementioned bob command with a pattern context. The library needs to detect that and return an error code if it happens. What this means is that a bunch of functions that previously could not fail, can now return error codes. For consistency you might want to change all paint commands to return error codes instead, but then >90% of them never return anything except success.

A common base class

The "object oriented" way of doing this would be to have a common base class for the painting functionality and then inherit that. In this approach functions that can take any context would have names like pdf_ctx_cmd_l(ctx, x, y) wheres functions that don't get specializations like pdf_page_cmd_bob. Since C does not have any OO functionality this would need to be reimplemented from scratch, probably using some Gobject-style preprocessor macro hackery like pdf_ctx_cmd_l(PDF_CTX(page), x, y) or alternatively pdf_ctx_cmd_l(pdf_page_get_ctx(page), x, y). This works, but means a lot of typing for end users and macros are type unsafe even by C standards. If you use the wrong type, woe is you. Macros make providing wrappers harder because they require you to always compile some glue code rather than using something simple like Python's ctypes.

Is there a way to cheat?

I have not managed to come up with a way. Do let me know if you do.

Wednesday, February 8, 2023

More PDF, C API and Python

After a whole lot of bashing my head against the desk I finally managed to find out what Acrobat reader's "error 14" means, I managed to make both font subsetting and graphics generation work. Which means you can now do things like this:

After this improving the code to handle full PDF graphics seems mostly to be a question of adding functions for all primitives in the PDF graphics model. The big unknown thing is PDF form support, of which I know nothing. Being a new design it probably is a lot less convoluted than PDF fonts were.

Dependencies

The code is a few thousand lines of C++ 20. It requires surprisingly few dependencies:

  • Fmt
  • Zlib
  • Libpng
  • Freetype
  • LittleCMS
Some of these are not actually necessary. Fmtlib will be in the standard library in C++23. Libpng is only used to load PNG images from disk. The library could require its users to load graphics themselves and pass images in as pixel arrays. Interestingly doing font subsetting requires parsing the raw data of TrueType files by hand, so Freetype is not strictly mandatory, though it does make some things easier.

The only things you'd actually need are Zlib and LittleCMS. If one wanted to support CCIT Group 4 compression for 1 bit images, you'd need a dependency on libtiff.

A plain C API

The unfortunate side of library development is that if you want your library to be widely used, you have to provide a plain C API. For PDF it's not all that bad as you can mostly copy what Cairo does as its C API is quite nice to use. You might want to design this early on as getting the C API as easy and reliable to use as possible has effects on how the internal architecture works. As an example you should make all objects independent of each other. If the end user has to do things like "be sure to destroy all objects of type X before calling function F on object Y", then, because this is C, they are going to get it wrong and cause segfaults (at best).

Python integration

Once you have the C API, though, you can do all sorts of fun things, such as using Python's ctypes module. It takes a bit of typing and drudgery, but eventually you can create a "dependencyless" Python wrapper. With it you can do this to create an empty PDF file:

o = PdfOptions()
g = PdfGenerator(b"python.pdf", o)
g.new_page()

That's all you can do ATM, as these are the only methods exposed in the C API. Just implementing these made it very clear that the API is not good and needs to be changed.


Wednesday, February 1, 2023

PDF with font subsetting and a look in the future

After several days of head scratching, debugging and despair I finally got font subsetting working in PDF. The text renders correctly in Okular, goes througg Ghostscript without errors and even passes an online PDF validator I found. But not Acrobat Reader, which chokes on it completely and refuses to show anything. Sigh.

The most likely cause is that the subset font that gets generated during this operation is not 100% valid. The approach I use is almost identical to what LO does, but for some reason their PDFs work. Opening both files in FontForge seems to indicate that the .notdef glyph definition is somehow not correct, but offers no help as to why.

In any case it seems like there would be a need for a library for PDF generation. Existing libs either do not handle non-RGB color spaces or are implemented in Java, Ruby or other languages that are hard to use from languages other than themselves. Many programs, like LO and Scribus, have their own libraries for generating PDF files. It would be nice if there could be a single library for that.

Is this a reasonable idea that people would actually be interested in? I don't know, but let's try to find out. I'm going to spend the next weekend in FOSDEM. So if you are going too and are interested in PDF generation, write a comment below or send me an email, I guess? Maybe we can have a shadow meeting in the cafeteria.