Friday, February 23, 2024

Creating tagged PDFs with CapyPDF now sort of possible

There are many open source PDF generators available. Unfortunately they all have some limitations when it comes to generating tagged PDFs

  • Cairo does not support tagged PDFs at all
  • LaTeX can create tagged PDFs, but obviously only out of LaTeX documents
  • Scribus does not support tagged PDF
  • LibreOffice does support tagged PDF generation, but its code is not available as a standalone library, it can only be used via LibreOffice
  • HexaPDF does not support tagged PDFs (though they seem to be on the roadmap), but it is implemented in Ruby so most projects can't use it
  • Firefox can print pages to PDF, but the result is not tagged, even though the PDF document structure model is almost exactly the same as in HTML

There does not seem to be a library that provides for all of this with a "plain C" API that can be used to easily generated tagged PDFs using almost any programming language.

There still isn't, but at least now CapyPDF can generate simple tagged PDF documents. A sample document can be downloaded via this link. Here is a picture showing the document structure in Acrobat Pro.

It should also work in things like screen readers and other accessibility tools, but I have not tested it.

None of this is exposed in the C API, because this has a fairly large API surface and I have not yet come up with a good way to represent it.


Sunday, January 21, 2024

CapyPDF 0.8.0 released

Version 0.8.0 of the CapyPDF library has been released. The main new feature is support for form XObjects and printer's mark annotations.

Printer's marks are things like color bars, crop marks and registration marks (also known as "bullseye marks") that high end printers need for quality control. Here is a very simple document.

An experienced print operator would use the black lines to set up paper trimming. Traditionally these marks were drawn in the page's graphics stream. This is problematic because nowadays printers prefer to use their own custom marks instead of ones created by the document author. PDF solves this issue by moving these graphics operations to separate draw contexts (specifically "form XObjects", which are not actually forms, though they are XObjects) that can then be "glued" on top of the page. These annotations are shown in PDF viewer applications but they are not printed. I have no experience with high end RIP software, but presumably the operator can choose to either print the document's annotations or replace them with their custom QA marks.

As usual, to find out what features CapyPDF has and how to use them, look up either the public C header or the Python code used in unit tests.

Tuesday, January 2, 2024

C++ module tooling emulator playground

Developing tooling for C++ modules is challenging to say the least. Module implementation maturity in compilers varies, they all work slightly (well massively) differently, there are bugs and you also need a code base that uses modules. Because of these and other reasons there are maybe five people in the entire world who even think about this issue. This is bad, because it is supposed to be future foundational technology. It would benefit from more eyes.

Something ought to be done about this. So I did.

I created a fake "module only" C++ compiler, a fake linker, a fake module scanner and fake project generator. The compiler does not produce any code. It only does three things:

  1. Reads export statements from sources and writes corresponding fake module files
  2. Reads import statements from source files.
  3. Validates that all module files a source file imports exist on disk at the time of invocation
This did not take much effort. In fact the whole project is approximately 300 lines of Python and can be obtained from this repository.

With this it is possible for anyone interested to try to come up with a module building scheme that does not require generating O(N²) command line arguments on the fly. The current scanner implementation is taken almost directly from Meson and it works one build target at a time as opposed to a single source file at a time. With this approach the overhead of scanning is one process invocation per build target. A per-source approach would take two processes per source scanning: one for invoking the compiler to generate the standard JSON-format dependency file and another for converting that to the dyndep format that Ninja understands.

The setup assumes that the compiler supports a "write module files to this directory" command line argument. It is mandatory to avoid generating compiler arguments dynamically.

Or maybe it isn't and there is a way to make it work in some other way. At least now the tooling is available so anyone reading this can try to solve this problem.

Tuesday, December 26, 2023

Tagged PDF funsies

HTML was originally designed as a file format that merely contains the logical structure of a document. End users could format it in a way that was most suitable for them. For example people with reading disabilities could make the text bigger or even use a screen reader. As time went on web site developers wanted pixel perfect control over the layout on end user machines whether this made sense or not. This lead to inventing a side channel to control layout. Since HTML was not originally designed for visual design, this lead to an impedance mismatch which caused a lot of work and headscratching to make it work. There is no "proper" solution so problems persist to this day.

PDF was originally designed as a file format for pixel perfect layout of graphics on every conceivable device. In this way people could be sure that their design was not randomly mangled along the way. As time went on people wanted to make PDF documents more broadly usable, for example to be able to copypaste text out of them and to expose the logical structure of the document to end users to the benefit of e.g. people with disabilities. This lead to inventing a side channel to describe structure but since PDF was not originally designed for semantic content, this lead to an impedance mismatch which caused a lot of work and headscratching to make it work. There is no "proper" solution so problems persist to this day.

Both of these formats also use JavaScript, but let's not go there.

In the case of PDF, the logical format is called "tagged PDF" and is implemented by writing magic tags and commands in the PDF stream to specify "marked content". A PDF generator also has to write many different dictionaries and arrays all of which have criss-cross-references to each other to make it work. Or that's my theory at least, since I have been unable to prove that CapyPDF's tagged PDF generation actually works. At best what can be said that no PDF processor I have used it with has reported errors.

Going through these lesser used parts of the file format teaches you quite quickly that the PDF specification varies wildly in quality. For example let's look at the aforementioned references. PDF has both dictionaries and arrays as native data types. Thus if you have to map arbitrary text keys to values you'd use a dictionary whereas mapping consecutive integers from zero upwards you'd use an array. Seems simple, right?

One of the data mappings needed for tagged PDF has gone beyond and reinvented this fairly simple structure. It has keys counting from zero upwards. Not only does the specification say that consecutive integers are needed, it even says that the PDF producer must write to a separate dictionary a key called ParentTreeNextKey. It goes on to say that when a new entry is added to this array (nee dictionary) then it must use the given key for the value. A more logical name for this would be ArraySize but that is not even the worst of it.

Said array is actually a key-value store where every other entry is the key (or "index") as an integer and every other entry is the corresponding value. Yes. this means that the contents of the array look like this: [ 0 "value0" 1 "value1" 2 "value2" ... ]. The actual values happen to also be index arrays, but they contain only values. In case you don't believe me, here is a screenshot from the official PDF spec.

Presumably the rationale is that you could leave some elements out from the array. A simpler approach would have been to store an empty array instead, but one should not meddle with the affairs of adobeans, for they are subtle and quick to anger.

Fortunately at least declaring a PDF as tagged is simple. There is a specific key in one of the metadata dictionaries and when that is set to true, the file is considered tagged.  Pdfinfo agrees with this assessment.

Good, good. Just to be sure, let's validate that it behaves the same on Acrobat Reader.

I wonder if I still have some leftover glögi?

Wednesday, December 20, 2023

Even more breakage in the C++ module world

In an earlier blog post we looked into some of the problems that the current C++ module implementation (specifically using Clang and CMake) has.  Yesterday it got a reply detailing a further, fairly serious problem.

In order to understand the issue, let's first look at the steps used in the current implementation:

  1. CMake runs, parses its input files and generates a Ninja build file
  2. As it is not possible (with the current implementation) to know the full command line up front, CMake adds a command line argument like @module_arg_file. Clang treats these kinds of arguments as "read the contents of the specified file and pretend that its contents was given as command line arguments".
  3. When the actual build starts, it first invokes CMake to run the module scanner, parse its output and generate dyndep info as well as the command line arguments for every single compilation task in the project.
  4. When this is done, compilation proceeds as usual.
The "works" in the sense that you can compile projects using it. But the problem is that there is also a file called compilation_commands.json, which is used by many C and C++ tools. It has all the same compilation commands as the Ninja file, just in an easy to parse JSON format. This means that if the corresponding module argument files have not been generated, all those compilation commands are broken.

Thus if you want to run any tooling that uses the compilation database, you first have to do a full build. Otherwise the compilation arguments are not guaranteed to be correct and the whole thing crashes like a house of cards. To make things even nastier, one of the main users of the compilation database is Clang's code completion functionality. If you check out a new project, you can not use code completion in your editor until you have successfully built the entire project. It's also not guaranteed to produce correct results if you have done any edits on said source files (because the list of modules imported/exported may have changed).

And if that was not bad enough, this is not a theoretical issue, bugs caused by this are already happening.

I repeat the suggestion made in the original post that this is not a good foundation to place the future of C++ modules on. We should move to a better one, such as the one outlined in the post.

Monday, December 18, 2023

AI silliness: getting to a no

One of the most annoying thing about LLM chatbots is that they all talk like american used car salespersons or people who work in sales departments of software companies. That is, the answer to every question is always yes. Somehow everything must be positive. This got me thinking: what would you  need to do to get a "no" answer from a chatbot.

Let's start simple.

Well that would have been too easy, I guess. Can we at least get it to give "no" as an answer?

Okay, let's refine this further.

Not only does the answer have a lot more words than one, it is actually incorrect. The correct answer is not false, it is "no". We'll ignore it for now and try to condense the answer.

No, I don't think that is the word I'm searching for and the answer is even longer than the previous one even though I asked for a shorter one. But no matter, let's be even more specific.

That is almost a single word answer but it is still not a "no". A change of tactics is needed. Rather than trying to come up with a working question on our own, we can ask the LLM to generate it for us.

Our poor AI overlord does not seem to have a good grasp on what is and is not a paradox. Or maybe it is trying to distract us from the fact that it can't give a negative reply. Let's give it a much easier question to answer instead.

At this point I gave up.

Saturday, December 16, 2023

On the scalability of C++ module implementations or lack thereof

This is a follow up to an earlier blog post called The road to hell is paved with good intentions and C++ modules. As it was a bit on the gloomy side so maybe I should preface this post by reiterating that adding module support to C++ is a huge task. All people who have worked on the specification and implementations in the various toolchains have had to do an incredible amount of unrewarding work. Everything said here (and in the previous post) is not to badmouth their existing work. It is merely to say that where we currently are is not a place we want to stay in, we need to move forward to get a better foundation to base decades of future work on.

How many ways do I #include thee? Let me count the ways.

In the Reddit discussion on the previous post user mathstuf (whose flair indicates that they are a CMake dev) makes the following claim:

Clang and MSVC need the transitive closure of imported modules specified as well, so while you M0 might only import M1, if it imports M2 through M9, they need to be there too (GCC doesn't need this today, but Clang 17 warns if they're not there and I believe Clang 18 will error; MSVC has always errored without them).

"Needs to be defined" means adding a compiler flag like -fmodule-file=M0=path/to/M0.mod. Let's assume this to be true. What does this imply? How many compiler flags would this add in the general case?

We don't know, because there are no large scale C++ codebases that use modules exclusively. Thus we need do Fermi estimation based on the data we do have. Let's start simple with the following C++ source file.

#include<vector>

If this were a "module library" the contents would be import vector (we'll ignore the std part for simpllicity). On the Ubuntu machine I'm typing this using such a module would require using a command line argument -fmodule-file=vector=/usr/include/c++/13/vector.mod to match where the header is. But this is not enough. The header file itself is only 151 lines long. Inside it we find things like this:

#include <bits/stl_algobase.h>
#include <bits/allocator.h>
#include <bits/stl_construct.h>
#include <bits/stl_uninitialized.h>
#include <bits/stl_vector.h>
#include <bits/stl_bvector.h>
#include <bits/refwrap.h>
#include <bits/range_access.h>

If we assume that a "modular version" of vector would be split roughly among the same lines as the header version then each one of these includes would a import statement and a corresponding compiler arugment and so on recursively. 

How many is that in total?

Computing this is a fairly simple one-liner:

g++ -std=c++20 -E includecount.cpp|grep /usr/include/c |cut -d ' ' -f 3|sort|uniq

This yields the answer 55.

In a library split up in this way merely importing this one module means having to add 55 compiler flags to get it to compile.

This is of course simplistic and in a real world use case the library would probably be set up in a different way. But the point is that currently we do not know what that setup would be because these kinds of libraries do not exist. Even if they did exist the bigger fundamental problem here is that the number of compiler arguments for this must be at most one (or two for reasons discussed below). Anything more than that is wrong both conceptually and in practice.

Fine, but does it work in theory?

Suppose you don't use any external libraries or modules but instead hand craft your very own artisanal humongous batch code base. What would happen then? Let's assume that each source file specifies one module for simplicity.  At the bottom of the dependency chain are source files that area fully standalone. Above that are files that import one or more of the given standalone modules and so on.

In order to compute the total amount of work we need to compute the transitive closure of each file. This is unknown so again we have to estimate. A reasonable first order approximation is that each module's transitive closure encompasses half of the modules "below it". If we order all source files from "lowest level" to "highest level" we can say that for each source file i sizeof_transitive_closure(i) = i/2. Thus the total number of compiler arguments we need to specify all these transitive closures is (according to Wolfram Alpha):

Or, in other words, this is an O(N²) algorithm. Those do not scale. A quick check says that Clang's code base has almost 30 000 cpp files. That would produce approximately

If each of these was just 10 characters long, the corresponding Ninja file would be 10 gigabytes in size if it were possible to build it up front. It is not, instead you have to split it into a million horcruxes and then glue them together. 

For something Chromium all these numbers would be a lot bigger.

Back to the basics

Let's look at the problem from yet another angle. Conceptually, module files are identical to header files. We should treat them as such. There is decades of experience in making that scheme work, even for header files that are generated during build time (which is how modules work).

Instead of manually telling the compiler which file contains which module, we instead pass it a list of module search directories. It will then look up modules in those directories automatically. Just like it does for headers currently. This is simple, understandable and we even know that it scales, because this is how header inclusion has worked for almost 50 years now. Not only that, we also know that it scales to binary modules, because this is exactly how Fortran compilers do it. They compiler flag they use is -J.

The only downside is that this approach might not be easily usable in projects whose source code layout is batshit insane. Fortunately those do not need to be supported, since currently there are no such code bases. C++ modules have only been used by early adopters and test projects and those people are typically fine with some breakage in exchange for a much better experience going forward. There is no burden of backwards compatibility for modules yet. Let's improve things so we don't have it in the future either.