Monday, February 27, 2023

Unit testing PDF generation

How would you test PDF generation?

This turns out to be unexpectedly difficult because you need to check that the files are both syntactically and semantically valid. The former could be tested with existing PDF validators and converters but the latter is more difficult.

If you, say, try to render a red square, the end result should be that the PDF command stream has two commands, a re command and an f command. That could be verified simply by grepping the command stream with some regexps. It would not be very useful, though, as there is no guarantee that those commands actually produce a red square in the output document. There are dozens of ways to make the output stream not produce a red square in the intended location without breaking file "validity" in any way.

What even is the truth?

The only reliable way is to render the PDF file into an image and compare it to a ground truth image. Assuming the output is "close enough" then the generator can be said to have worked correctly. The problem, as is often the case, lies inside those quote marks. Fuzzy image equality is a difficult problem. Those interested in details can look it up online. For our case we'll just ignore it and require pixel perfect reproduction. This means that we can have test failures if we change the PDF rendering backend, run it on a different operating system or even just upgrade it to a new version.

The other problem comes from the desire to have a plain C API. Writing unit tests in C is cumbersome to say the least. Fortunately there is a simpler solution. Since the project ships its own Python bindings, we can write all of these tests using Python. This affords us all the niceties that exist in Python such as an extensive unit testing suite, the ability to easily spawn external processes and image difference operators (via PIL). After some boilerplate, writing an unit tests reduces to this:

@validate_image('python_simple', 480, 640)
def test_simple(self, ofilename):
    ofile = pathlib.Path(ofilename)
    with a4pdf.Generator(ofile) as g:
        with g.page_draw_context() as ctx:
            ctx.set_rgb_nonstroke(1.0, 0.0, 0.0)
            ctx.cmd_re(10, 10, 100, 100)

Behind the scenes this will generate the PDF, render it with Ghostscript and compare the result to an existing image. If the output is not bitwise identical the test fails.

Get the code

The project is now available here.


  1. FYI - cairo uses a perceptual diff program in their unit test framework.

  2. too