Friday, November 8, 2024

PDF/AAAARGH

Note: the PDF/A specification is not freely available so everything here is based on reverse engineering. It might be complete bunk.

There are many different "subspecies" of PDF. The most common are PDF/X and PDF/A. CapyPDF can already do PDF/X, so I figured it's time to look into PDF/A. Like, how much worse could it possibly be?

Specifying that a PDF file is PDF/X is straightforward. Each PDF has a Catalog dictionary that defines properties of the document. All you need to do is to add an OutputIntent dictionary and link it to the Catalog. The dictionary has a key that specifies the subtype. Setting that to /GTS_PDFX does the trick. There are many different versions of PDF/X so you need to define that as well. A simple solution would be to have a second key in that dictionary for specifying the subtype. Half of that expectation is correct. There is indeed a key you can set, but it is in a completely different part of the object tree called the Information dictionary. It's a bit weird but you implement it once and then forget it.
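As a rough sketch of where these pieces end up in the file, here is some Python that spells out the three dictionaries involved as PDF syntax. The object numbers, the title, the output condition identifier and the function name are all made up for illustration. /GTS_PDFXVersion is the Information dictionary key in question, and the version string is just one example value.

```python
def pdfx_marker_objects(pdfx_version: str = "PDF/X-3:2003") -> dict[str, str]:
    """Return the dictionaries that mark a file as PDF/X, as PDF syntax.

    Object numbers (2, 4, 5) are hypothetical placeholders.
    """
    # The OutputIntent dictionary carries the subtype in its /S key.
    output_intent = (
        "<< /Type /OutputIntent\n"
        "   /S /GTS_PDFX\n"
        "   /OutputConditionIdentifier (Custom)\n"
        "   /DestOutputProfile 5 0 R >>"
    )
    # The Catalog links to the OutputIntent through an array.
    catalog = (
        "<< /Type /Catalog\n"
        "   /Pages 2 0 R\n"
        "   /OutputIntents [ 4 0 R ] >>"
    )
    # The exact PDF/X version lives elsewhere, in the Information dictionary.
    info = (
        "<< /Title (Example)\n"
        f"   /GTS_PDFXVersion ({pdfx_version}) >>"
    )
    return {"catalog": catalog, "output_intent": output_intent, "info": info}


if __name__ == "__main__":
    for name, text in pdfx_marker_objects().items():
        print(f"% {name}\n{text}\n")
```

Note how the version string never appears in the OutputIntent itself, only in the Information dictionary.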

PDF/A has four different versions, namely 1, 2, 3 and 4, and each of these has several conformance levels that are specified with a single letter. Thus the way you specify that the file is a PDF/A document is that you write the value /GTS_PDFA1 to the intent dictionary. Yes, regardless of which version of PDF/A you want, this dictionary will say it is PDFA1.

What would be the mechanism, then, to specify the sub-version?

  1. In the Information dictionary, just like with PDF/X?
  2. In some other PDF object dictionary?
  3. In a standalone PDF object that is in fact an embedded XML document?
  4. Something even worse?

Depending on your interpretation, the correct answer is either 3 or 4. Here is the XML file in question as generated by LibreOffice. The payload parts are marked with red arrows.
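The payload is small enough that it can be sketched in a few lines of Python. This is not LibreOffice's output but a stripped-down approximation containing only the identification part: the pdfaid:part and pdfaid:conformance properties in the PDF/A identification namespace. The function name is made up, and a real file would also carry the replicated document metadata.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PDFA_ID_NS = "http://www.aiim.org/pdfa/ns/id/"


def pdfa_identification_xml(part: int, conformance: str) -> str:
    """Build a minimal XMP document that only says 'this is PDF/A-<part><conformance>'."""
    return (
        '<x:xmpmeta xmlns:x="adobe:ns:meta/">\n'
        f' <rdf:RDF xmlns:rdf="{RDF_NS}">\n'
        f'  <rdf:Description rdf:about="" xmlns:pdfaid="{PDFA_ID_NS}">\n'
        f'   <pdfaid:part>{part}</pdfaid:part>\n'
        f'   <pdfaid:conformance>{conformance}</pdfaid:conformance>\n'
        '  </rdf:Description>\n'
        ' </rdf:RDF>\n'
        '</x:xmpmeta>'
    )


def read_pdfa_level(xml_text: str) -> tuple[str, str]:
    """Parse the XML back and pull out the two values that actually matter."""
    root = ET.fromstring(xml_text)
    part = root.find(f".//{{{PDFA_ID_NS}}}part").text
    conformance = root.find(f".//{{{PDFA_ID_NS}}}conformance").text
    return part, conformance


if __name__ == "__main__":
    doc = pdfa_identification_xml(2, "B")
    print(doc)
    print(read_pdfa_level(doc))
```

Everything except the one digit and the one letter is scaffolding.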

The other bits are just document metadata replicated. PDF version 2.0 has gone even further and deprecated storing PDF metadata in PDF's own data structures. The structures that have been designed specifically for PDF documents, which all PDF processing software already knows how to handle and which tens of billions (?) of documents already use and which can thus never be removed? Those ones. As Sun Tzu famously said:

A man with one metadata block in his file format always knows what his document is called.

A man with two can never be sure. 

Thus far we have only been at level 3. So what more could possibly be added to this to make it even worse?

Spaces.

Yes, indeed. The screenshot does not show it, but the recommended way to use this specific XML format is to add a whole lot of whitespace below the XML snippet so it can be edited in place later if needed. This is highly suspicious for PDF/A for two main reasons. First of all, PDF/A is meant for archiving usage. Documents in it should not be edited afterwards. That is the entire point. Secondly, the PDF file format already has a way of replacing objects with newer versions.

The practical outcome of all this is that every single PDF/A document has approximately 5 kilobytes of fluff to represent two bytes of actual information. Said object cannot even be compressed because the RDF document must be stored uncompressed to be editable. Even though in PDF/A documents it will never be edited.
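To get a feel for the overhead, here is a hedged Python sketch that wraps an XML snippet in an XMP packet with trailing whitespace and then puts it in an uncompressed metadata stream. The 2048 bytes of padding, the object number and the function names are assumptions chosen for illustration, not values taken from any particular PDF generator.

```python
def padded_xmp_packet(xmp: str, padding_bytes: int = 2048) -> bytes:
    """Wrap XML in an XMP packet followed by whitespace padding for in-place edits."""
    header = '<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>\n'
    # Pad with lines of 99 spaces plus a newline, then a final partial line.
    full_lines, rest = divmod(padding_bytes, 100)
    padding = (" " * 99 + "\n") * full_lines + " " * rest
    trailer = '<?xpacket end="w"?>'  # "w" marks the packet as writable
    return (header + xmp + "\n" + padding + trailer).encode("utf-8")


def metadata_stream_object(packet: bytes, obj_num: int = 6) -> bytes:
    """Serialize the packet as a PDF metadata stream with no /Filter, i.e. uncompressed."""
    head = (
        f"{obj_num} 0 obj\n"
        f"<< /Type /Metadata /Subtype /XML /Length {len(packet)} >>\n"
        "stream\n"
    ).encode("ascii")
    return head + packet + b"\nendstream\nendobj\n"


if __name__ == "__main__":
    xmp = "<x:xmpmeta xmlns:x='adobe:ns:meta/'></x:xmpmeta>"  # placeholder payload
    bare = padded_xmp_packet(xmp, 0)
    padded = padded_xmp_packet(xmp)
    print("padding overhead in bytes:", len(padded) - len(bare))
    print(metadata_stream_object(padded)[:80])
```

The padding is pure dead weight, and because the stream dictionary has no /Filter entry, none of it gets squeezed out.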
