Tuesday, December 26, 2023

Tagged PDF funsies

HTML was originally designed as a file format that merely contains the logical structure of a document. End users could format it in whatever way was most suitable for them. For example, people with reading disabilities could make the text bigger or even use a screen reader. As time went on, web site developers wanted pixel-perfect control over the layout on end user machines, whether this made sense or not. This led to inventing a side channel to control layout. Since HTML was not originally designed for visual design, this led to an impedance mismatch which caused a lot of work and headscratching to make it work. There is no "proper" solution, so problems persist to this day.

PDF was originally designed as a file format for pixel-perfect layout of graphics on every conceivable device. In this way people could be sure that their design was not randomly mangled along the way. As time went on, people wanted to make PDF documents more broadly usable, for example to be able to copy-paste text out of them and to expose the logical structure of the document to end users, to the benefit of e.g. people with disabilities. This led to inventing a side channel to describe structure. Since PDF was not originally designed for semantic content, this led to an impedance mismatch which caused a lot of work and headscratching to make it work. There is no "proper" solution, so problems persist to this day.

Both of these formats also use JavaScript, but let's not go there.

In the case of PDF, the logical format is called "tagged PDF" and is implemented by writing magic tags and commands in the PDF stream to specify "marked content". A PDF generator also has to write many different dictionaries and arrays, all of which have criss-cross-references to each other, to make it work. Or that's my theory at least, since I have been unable to prove that CapyPDF's tagged PDF generation actually works. The best that can be said is that no PDF processor I have used it with has reported errors.
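To give a rough idea of what those "magic tags and commands" look like, here is a minimal sketch of wrapping a piece of page content in a marked content sequence. The `BDC` and `EMC` operators and the `MCID` property come from the PDF specification; the helper name `marked_content`, the font name `/F1` and the coordinates are made up for illustration, and this is not how CapyPDF itself is implemented.

```python
def marked_content(tag, mcid, body):
    """Wrap content stream operators in a BDC ... EMC marked content sequence.

    tag  -- structure type name without the leading slash, e.g. "P"
    mcid -- marked content identifier, unique within the page
    body -- the drawing/text operators to be marked
    """
    return f"/{tag} << /MCID {mcid} >> BDC\n{body}\nEMC"


# Hypothetical text drawing commands for a single paragraph.
text_ops = "BT\n/F1 12 Tf\n72 720 Td\n(Hello) Tj\nET"
print(marked_content("P", 0, text_ops))
```

The structure tree elsewhere in the file then refers back to this piece of content by its MCID, which is where the criss-cross-references start.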

Going through these lesser-used parts of the file format teaches you quite quickly that the PDF specification varies wildly in quality. For example, let's look at the aforementioned references. PDF has both dictionaries and arrays as native data types. Thus if you have to map arbitrary text keys to values you'd use a dictionary, whereas for mapping consecutive integers from zero upwards you'd use an array. Seems simple, right?

One of the data mappings needed for tagged PDF has gone above and beyond and reinvented this fairly simple structure. It has keys counting from zero upwards. Not only does the specification say that consecutive integers are needed, it even says that the PDF producer must write to a separate dictionary a key called ParentTreeNextKey. It goes on to say that when a new entry is added to this array (née dictionary), it must use the given key for the value. A more logical name for this would be ArraySize, but that is not even the worst of it.

Said array is actually a key-value store where every other entry is the key (or "index") as an integer and every other entry is the corresponding value. Yes, this means that the contents of the array look like this: [ 0 "value0" 1 "value1" 2 "value2" ... ]. The actual values happen to also be index arrays, but they contain only values. In case you don't believe me, here is a screenshot from the official PDF spec.

Presumably the rationale is that you could leave some elements out of the array. A simpler approach would have been to store an empty array instead, but one should not meddle with the affairs of Adobeans, for they are subtle and quick to anger.
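For the curious, the interleaved layout can be sketched in a few lines of Python. The function names `to_nums`, `from_nums` and `next_key` are my own inventions for illustration, and the string values are placeholders for whatever the real entries would be. The `next_key` helper assumes that ParentTreeNextKey is simply one past the largest key in use.

```python
def to_nums(entries):
    """Flatten {int key: value} into [key0, value0, key1, value1, ...]."""
    flat = []
    for key in sorted(entries):
        flat.extend([key, entries[key]])
    return flat


def from_nums(flat):
    """Rebuild the mapping from the interleaved key/value array."""
    if len(flat) % 2 != 0:
        raise ValueError("interleaved array must have an even number of entries")
    return dict(zip(flat[0::2], flat[1::2]))


def next_key(entries):
    """Key to use for the next entry (assumed: largest key plus one)."""
    return max(entries) + 1 if entries else 0


dense = {0: "value0", 1: "value1", 2: "value2"}
print(to_nums(dense))     # [0, 'value0', 1, 'value1', 2, 'value2']
print(next_key(dense))    # 3

# Leaving elements out is the one thing a plain array could not express.
sparse = {0: "value0", 5: "value5"}
print(to_nums(sparse))    # [0, 'value0', 5, 'value5']
```

With dense keys starting from zero, `next_key` gives exactly the same number as the length of an ordinary array would.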

Fortunately, at least declaring a PDF as tagged is simple. There is a specific key in one of the metadata dictionaries and when that is set to true, the file is considered tagged. pdfinfo agrees with this assessment.

Good, good. Just to be sure, let's validate that it behaves the same on Acrobat Reader.

I wonder if I still have some leftover glögi?
