Saturday, May 20, 2023

Annotated PDF, HTML, exporters

If one were to compare PDF to HTML, one interesting thing that comes up fairly quickly is that their evolution has been the exact opposite of each other.

HTML was originally about structure, with its h1 and p and ul tags and the like. Given this structure the web browser was then (mostly) free to lay out the text however it saw fit given the current screen size and browser window orientation. As usage grew the appearance of pages became more and more important and thus they had to invent a whole new syntax for specifying how the semantic content should be laid out visually. This eventually became CSS and the goal became, roughly, "pixel perfect layout everywhere".

PDF was originally just a page description language, basically a sequence of vector and raster paint operations. Its main goal was perfect visual fidelity as one of the main use cases was professional printing. The document had no understanding of what it contained. You either read as it was on the monitor or, more often, printed it out on paper and then read it. As PDF's popularity grew, new requirements appeared. First people wanted to copypaste text from PDF document (contrary to what you might expect, this is surprisingly tricky to implement). Then they needed to integrate it with screen readers and other accessibility tools, reformat the text to read it on mobile devices and so on. The solution was to create a from-scratch syntax in the PDF data stream alongside existing code to express the semantic contents of the document in a machine-readable format. This became known as tagged PDF, introduced in PDF 1.4.

The way they chose to implement it was to add new commands to the PDF data stream that are used to wrap existing draw operations. For example, drawing a single line of text looks like this:

(Hello, world!) Tj

Adding structural information means writing it like this instead:

/P << /MCID 0 >>
  BDC
    (Hello, world!) Tj
  EMC

and then adding a bunch of bookkeeping information in document metadata dictionaries. This is a bit tedious, but since you can't really write PDF files by hand, you implement this once in the PDF generation code and then use that.

Why so many PDF exporters?

There are many PDF generator libraries. Almost every "serious" program, like Scribus and Libreoffice, has its own bespoke library. This is not just because developers love reimplementing things from scratch. The underlying issue is that PDF as a specification is big. and generating valid PDFs have very specific semantic requirements that require the application developer to understand PDFs document model down to the finest detail.

Existing PDF generator libraries either ignore all of this and just do basic RGB graphics (Cairo) or provide their own document model which it can then convert to PDF constructs (HexaPDF). The latter works nicely for simple cases but it requires that you have to map your app's internal document structure to the library's document structure so that it then maps to PDF's document structure in the way you need it to. Often it becomes simpler to cut out the middle man and generate PDF directly.

The way A4PDF aims to solve the issue is to stay low level enough. It merely exposes PDF primitives directly and ensures that the syntax of the generated PDF is correct. It is up to the caller to ensure that the output is semantically correct. This should afford every application to easily and directly map their document output to PDF in just they way they want it (color managed or not, annotated or not and so on).

No comments:

Post a Comment