Tuesday, December 26, 2023

Tagged PDF funsies

HTML was originally designed as a file format that merely contains the logical structure of a document. End users could then format it in whatever way was most suitable for them. For example, people with reading disabilities could make the text bigger or even use a screen reader. As time went on, web site developers wanted pixel perfect control over the layout on end user machines, whether this made sense or not. This led to inventing a side channel to control layout. Since HTML was not originally designed for visual design, this led to an impedance mismatch which caused a lot of work and headscratching to make it work. There is no "proper" solution, so problems persist to this day.

PDF was originally designed as a file format for pixel perfect layout of graphics on every conceivable device. In this way people could be sure that their design was not randomly mangled along the way. As time went on, people wanted to make PDF documents more broadly usable, for example to be able to copy-paste text out of them and to expose the logical structure of the document to end users, to the benefit of e.g. people with disabilities. This led to inventing a side channel to describe structure, but since PDF was not originally designed for semantic content, this led to an impedance mismatch which caused a lot of work and headscratching to make it work. There is no "proper" solution, so problems persist to this day.

Both of these formats also use JavaScript, but let's not go there.

In the case of PDF, the logical format is called "tagged PDF" and is implemented by writing magic tags and commands in the PDF content stream to specify "marked content". A PDF generator also has to write many different dictionaries and arrays, all of which have criss-cross references to each other, to make it work. Or that's my theory at least, since I have been unable to prove that CapyPDF's tagged PDF generation actually works. At best it can be said that no PDF processor I have used it with has reported errors.
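
To give a flavour of what the marked content side looks like, here is a sketch (not CapyPDF's actual output) of the relevant part of a page's content stream, written as a Python byte string. The tag name and MCID are illustrative, and the structure tree objects that point back at this MCID are separate indirect objects that are not shown.

# BDC ... EMC wraps the drawing operators that belong to one structure element.
# The MCID is what the structure tree uses to refer back to this span of content.
marked_paragraph = b"""
/P << /MCID 0 >> BDC
  BT /F1 12 Tf 72 700 Td (Hello, tagged world) Tj ET
EMC
"""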

Going through these lesser used parts of the file format teaches you quite quickly that the PDF specification varies wildly in quality. For example, let's look at the aforementioned references. PDF has both dictionaries and arrays as native data types. Thus if you have to map arbitrary text keys to values you'd use a dictionary, whereas for mapping consecutive integers from zero upwards you'd use an array. Seems simple, right?

One of the data mappings needed for tagged PDF has gone beyond this and reinvented that fairly simple structure. It has keys counting from zero upwards. Not only does the specification say that consecutive integer keys are needed, it even says that the PDF producer must write into a separate dictionary a key called ParentTreeNextKey. It goes on to say that when a new entry is added to this array (née dictionary), it must use the given key for the value. A more logical name would be ArraySize, but that is not even the worst of it.

Said array is actually a key-value store where every other entry is a key (or "index") as an integer and every other entry is the corresponding value. Yes, this means that the contents of the array look like this: [ 0 "value0" 1 "value1" 2 "value2" ... ]. The actual values happen to also be index arrays, but they contain only the values. In case you don't believe me, here is a screenshot from the official PDF spec.
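
To make the shape concrete, here is my reading of that structure written as a Python literal, with indirect object references replaced by placeholder strings:

parent_tree = {
    "Nums": [
        0, ["ref to structure element 4", "ref to structure element 5"],
        1, ["ref to structure element 6"],
        2, ["ref to structure element 7"],
    ]
}
# A saner encoding would have been a plain dictionary {0: [...], 1: [...], 2: [...]}
# or just a list of lists.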

Presumably the rationale is that you could leave some elements out from the array. A simpler approach would have been to store an empty array instead, but one should not meddle with the affairs of adobeans, for they are subtle and quick to anger.

Fortunately at least declaring a PDF as tagged is simple. There is a specific key in one of the metadata dictionaries, and when it is set to true the file is considered tagged. Pdfinfo agrees with this assessment.
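
For the record, the key in question is Marked in the document catalog's MarkInfo dictionary. In Python-dict form the relevant fragment of the catalog looks roughly like this:

catalog_fragment = {
    "MarkInfo": {"Marked": True},  # this is what pdfinfo's "Tagged: yes" is based on
    "StructTreeRoot": "ref to the structure tree root",  # the actual tag data hangs off this
}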

Good, good. Just to be sure, let's validate that it behaves the same on Acrobat Reader.

I wonder if I still have some leftover glögi?

Wednesday, December 20, 2023

Even more breakage in the C++ module world

In an earlier blog post we looked into some of the problems that the current C++ module implementation (specifically using Clang and CMake) has. Yesterday that post got a reply detailing a further, fairly serious problem.

In order to understand the issue, let's first look at the steps used in the current implementation:

  1. CMake runs, parses its input files and generates a Ninja build file
  2. As it is not possible (with the current implementation) to know the full command line up front, CMake adds a command line argument like @module_arg_file. Clang treats these kinds of arguments as "read the contents of the specified file and pretend that its contents were given as command line arguments".
  3. When the actual build starts, it first invokes CMake to run the module scanner, parse its output and generate dyndep info as well as the command line arguments for every single compilation task in the project.
  4. When this is done, compilation proceeds as usual.
The "works" in the sense that you can compile projects using it. But the problem is that there is also a file called compilation_commands.json, which is used by many C and C++ tools. It has all the same compilation commands as the Ninja file, just in an easy to parse JSON format. This means that if the corresponding module argument files have not been generated, all those compilation commands are broken.

Thus if you want to run any tooling that uses the compilation database, you first have to do a full build. Otherwise the compilation arguments are not guaranteed to be correct and the whole thing collapses like a house of cards. To make things even nastier, one of the main users of the compilation database is Clang's code completion functionality. If you check out a new project, you can not use code completion in your editor until you have successfully built the entire project. It is also not guaranteed to produce correct results if you have done any edits on said source files (because the list of modules imported/exported may have changed).

And if that was not bad enough, this is not a theoretical issue; bugs caused by it are already happening.

I repeat the suggestion made in the original post that this is not a good foundation to place the future of C++ modules on. We should move to a better one, such as the one outlined in the post.

Monday, December 18, 2023

AI silliness: getting to a no

One of the most annoying things about LLM chatbots is that they all talk like American used car salespeople or people who work in the sales departments of software companies. That is, the answer to every question is always yes. Somehow everything must be positive. This got me thinking: what would you need to do to get a "no" answer from a chatbot?

Let's start simple.

Well that would have been too easy, I guess. Can we at least get it to give "no" as an answer?

Okay, let's refine this further.

Not only does the answer have a lot more words than one, it is actually incorrect. The correct answer is not false, it is "no". We'll ignore it for now and try to condense the answer.

No, I don't think that is the word I'm searching for and the answer is even longer than the previous one even though I asked for a shorter one. But no matter, let's be even more specific.

That is almost a single word answer but it is still not a "no". A change of tactics is needed. Rather than trying to come up with a working question on our own, we can ask the LLM to generate it for us.

Our poor AI overlord does not seem to have a good grasp on what is and is not a paradox. Or maybe it is trying to distract us from the fact that it can't give a negative reply. Let's give it a much easier question to answer instead.

At this point I gave up.

Saturday, December 16, 2023

On the scalability of C++ module implementations or lack thereof

This is a follow up to an earlier blog post called The road to hell is paved with good intentions and C++ modules. As that one was a bit on the gloomy side, maybe I should preface this post by reiterating that adding module support to C++ is a huge task. All the people who have worked on the specification and the implementations in the various toolchains have had to do an incredible amount of unrewarding work. Nothing said here (or in the previous post) is meant to badmouth their existing work. It is merely to say that where we currently are is not a place we want to stay in; we need to move forward to get a better foundation to base decades of future work on.

How many ways do I #include thee? Let me count the ways.

In the Reddit discussion on the previous post user mathstuf (whose flair indicates that they are a CMake dev) makes the following claim:

Clang and MSVC need the transitive closure of imported modules specified as well, so while you M0 might only import M1, if it imports M2 through M9, they need to be there too (GCC doesn't need this today, but Clang 17 warns if they're not there and I believe Clang 18 will error; MSVC has always errored without them).

"Needs to be defined" means adding a compiler flag like -fmodule-file=M0=path/to/M0.mod. Let's assume this to be true. What does this imply? How many compiler flags would this add in the general case?

We don't know, because there are no large scale C++ codebases that use modules exclusively. Thus we need to do Fermi estimation based on the data we do have. Let's start simple with the following C++ source file.

#include<vector>

If this were a "module library" the contents would be import vector (we'll ignore the std part for simpllicity). On the Ubuntu machine I'm typing this using such a module would require using a command line argument -fmodule-file=vector=/usr/include/c++/13/vector.mod to match where the header is. But this is not enough. The header file itself is only 151 lines long. Inside it we find things like this:

#include <bits/stl_algobase.h>
#include <bits/allocator.h>
#include <bits/stl_construct.h>
#include <bits/stl_uninitialized.h>
#include <bits/stl_vector.h>
#include <bits/stl_bvector.h>
#include <bits/refwrap.h>
#include <bits/range_access.h>

If we assume that a "modular version" of vector would be split roughly among the same lines as the header version then each one of these includes would a import statement and a corresponding compiler arugment and so on recursively. 

How many is that in total?

Computing this is a fairly simple one-liner:

g++ -std=c++20 -E includecount.cpp | grep /usr/include/c | cut -d ' ' -f 3 | sort | uniq | wc -l

This yields the answer 55.

In a library split up in this way merely importing this one module means having to add 55 compiler flags to get it to compile.

This is of course simplistic, and in a real world use case the library would probably be set up in a different way. But the point is that currently we do not know what that setup would be, because these kinds of libraries do not exist. Even if they did exist, the bigger fundamental problem here is that the number of compiler arguments for this must be at most one (or two, for reasons discussed below). Anything more than that is wrong both conceptually and in practice.

Fine, but does it work in theory?

Suppose you don't use any external libraries or modules but instead hand craft your very own artisanal humongous code base. What would happen then? Let's assume for simplicity that each source file specifies one module. At the bottom of the dependency chain are source files that are fully standalone. Above that are files that import one or more of the given standalone modules, and so on.

In order to compute the total amount of work we need to compute the transitive closure of each file. This is unknown, so again we have to estimate. A reasonable first order approximation is that each module's transitive closure encompasses half of the modules "below it". If we order all source files from "lowest level" to "highest level" we can say that for each source file i, sizeof_transitive_closure(i) = i/2. Thus the total number of compiler arguments needed to specify all these transitive closures is the sum of i/2 for i from 1 to N, which, according to Wolfram Alpha, works out to N(N+1)/4.

Or, in other words, this is an O(N²) algorithm. Those do not scale. A quick check says that Clang's code base has almost 30 000 cpp files. That would produce approximately 225 million compiler arguments in total.
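
If you don't trust the closed form, a brute force check in Python gives the same number:

total_flags = sum(i / 2 for i in range(1, 30_001))  # one transitive closure per source file
print(total_flags)  # 225007500.0, i.e. roughly 2.25 * 10^8 arguments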

If each of these was just 10 characters long, the corresponding Ninja file would be over two gigabytes in size (with realistically sized -fmodule-file= arguments, closer to ten gigabytes), if it were possible to build it up front. It is not; instead you have to split it into a million horcruxes and then glue them together.

For something like Chromium, all these numbers would be a lot bigger.

Back to the basics

Let's look at the problem from yet another angle. Conceptually, module files are identical to header files. We should treat them as such. There are decades of experience in making that scheme work, even for header files that are generated during the build (which is how modules work).

Instead of manually telling the compiler which file contains which module, we instead pass it a list of module search directories. It will then look up modules in those directories automatically, just like it does for headers currently. This is simple, understandable and we even know that it scales, because this is how header inclusion has worked for almost 50 years now. Not only that, we also know that it scales to binary modules, because this is exactly how Fortran compilers do it. The compiler flag they use is -J.

The only downside is that this approach might not be easily usable in projects whose source code layout is batshit insane. Fortunately those do not need to be supported, since currently there are no such code bases. C++ modules have only been used by early adopters and test projects and those people are typically fine with some breakage in exchange for a much better experience going forward. There is no burden of backwards compatibility for modules yet. Let's improve things so we don't have it in the future either.

Monday, December 4, 2023

CapyPDF 0.7.0 released

Version 0.7.0 of CapyPDF is out. There is nothing major as such, just a bunch of minor features. As CapyPDF already supports a fair bit of stuff, the easiest way of finding out what can be done is to look at the unit test file.

Gradients, Gouraud meshes and Coons patches can now be defined in grayscale and CMYK in addition to RGB.

Support has also been added for separations. In PDF lingo this means that you can define new inks in addition to the basic CMYK ones. Typically these include metal inks or foils, Pantone colors and different kinds of glosses and varnishes.

Thursday, November 9, 2023

CapyPDF, performance and shared libraries

People who are extremely performance conscious might not like the fact that CapyPDF ships as a shared library with a C API that hides all internal implementation details. This has several potential sources of slowdown:

  • Function calls can not be inlined to callers
  • Shared library function calls are indirect and thus slower for the CPU to invoke
  • All objects require a memory allocation, they can't be stored on the caller's stack

I have not optimized CapyPDF's performance all that much, so let's see if these things are an issue in practice. As a test we'll use the Python bindings to generate the following PDF document 100 times and measure the execution time.
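
The harness is nothing fancy; a sketch of it is below. The generate_document function is a placeholder for the code that builds the test document via CapyPDF's Python bindings, not an actual CapyPDF API.

import time

def generate_document(filename):
    # Placeholder: create the test document with CapyPDF's Python bindings here.
    ...

N = 100
start = time.perf_counter()
for i in range(N):
    generate_document(f"perftest_{i}.pdf")
elapsed = time.perf_counter() - start
print(f"{N / elapsed:.1f} documents per second")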

Running the test tells us that we can create the document roughly 9 times a second on a single core. This does not seem like a lot, but that's because the measurement was done on a Raspberry Pi 3B+. On an i7-1185G7 laptop processor the performance shoots up to 100 times a second.

For comparison I wrote the same program in native code; the Python version is roughly 10% slower. More interestingly, 40% of the time is not spent on generating the PDF file itself at all, but is instead taken by font subsetting.

A further thing to note is that in PDF any data stream can be compressed, typically with zlib. This test does not use compression at all. If the document had images, compressing the raw pixel data could easily take 10 to 100x longer than all other parts combined.

PDF does support embedding some image formats (e.g. jpegs) directly, but most of the time you need to generate raw pixel data and compress it yourself. In this case compression is easily the dominant term.
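
To get a feel for why compression dominates, here is a quick sketch that times zlib on synthetic full-page pixel data. The image size is made up for illustration, and random bytes are the worst case for a compressor; real pixel data lands somewhere in between.

import os
import time
import zlib

raw_pixels = os.urandom(2480 * 3508 * 3)  # ~26 MB, a stand-in for an A4 page at 300 dpi in RGB
start = time.perf_counter()
compressed = zlib.compress(raw_pixels, level=9)
print(f"{len(raw_pixels)} -> {len(compressed)} bytes in {time.perf_counter() - start:.2f} s")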

Conclusions

A "type hidden" C API does have some overhead, but in this particular case it is almost certainly unmeasurably small. You probably don't want to create a general array or hash map in this way if you need blazing speeds, but for this kind of high level functionality it is unlikely to be a performance bottleneck.

Saturday, November 4, 2023

CapyPDF 0.6.0 is out

Fans of PDF generators and large rodents rejoice, version 0.6.0 of CapyPDF is out. Here is the new official logo.

Probably the biggest new feature is that since Xcode has added support for C++23 features, the code now builds on macOS as well. Python wheels for Windows and macOS can be downloaded from PyPI.

Feature-wise the biggest change is probably the fact that one can generate native PDF gradients via the public C and Python APIs. Only RGB colorspace is supported for now.

For comparison, the PDF document shown above is 2 kB in size whereas the PNG image showing it is 60 kB.

Sunday, October 15, 2023

The road to hell is paved with good intentions and C++ modules

The major C++ compilers are starting to ship modules implementations so I figured I'd add more support for those in Meson. That resulted in this blog post. It will not be pleasant or fun. Should you choose to read it, you might want to keep your emergency kitten image reserve close at hand.

At first you run into all the usual tool roughness that you'd expect from new experimental features. For example, GCC generates dependency files that don't work with Ninja. Or Clang's module scanner does not support a -o command line argument to write results to a file, but instead dumps them to stdout. (If you are the sort of person who went "pfft, what a moron, he should learn to use shell redirection", then how would you like it if all compilers dumped their .o files to stdout?) The output of the scanner is not usable for actually building the code, mind you; it needs to be processed with a different tool to convert its "high level" JSON output to "file based" dep files that Ninja understands.

Then you get into deeper problems, like the fact that currently the three major C++ compilers do not support a common file extension for C++ module sources. (Or for output binaries, for that matter.) Thus the build system must have magic code to inject the "compile this thing as that thing" arguments for every source. Or, in the case of CMake, have the end user manually specify the format of each source file, even if it has an extension that uniquely identifies it as a module. No, I am not kidding.

Then things get even worse, but first let's set a ground rule.

The basic axiom for command line tools

One could write a blog post as long as, or longer than, this one just explaining why the following requirement is necessary. I'm not going to, but instead just state it as an axiom.

For a given data processing tool, one must be able to determine all command line arguments the tool needs to do its job without needing to examine contents of its input files.

As a very simple example, suppose you have an image file format named .bob, which can be either a png file or a jpg file and you have a program to display image files on screen. Invoking it as showimage image.bob would be good. Requiring users to invoke the program with showimage image.bob --image-is-actually-a=png would be bad.

Every time you end up with a design that would violate this requirement, the correct thing to do is to change the design somehow so that the violation goes away. If you disagree with this claim, you are free to do so. Just don't send me email about it.

A brief overview of how build systems work

The main pipeline commonly looks like this for systems that generate e.g. Makefiles or Ninja files:

  1. The build system reads and parses the project's build definitions
  2. It works out the targets, their sources and the compiler flags to use
  3. It writes out a static Makefile or Ninja file containing the full command line for every compilation and the dependencies between them
  4. Make or Ninja executes those commands, rerunning only the steps whose inputs have changed

The important step here is #3. A Ninja file is static, thus we need to know up front a) the compilation arguments to use and b) the interdependencies between files. The latter can't be known up front for things like Fortran or C++ modules. This is why Ninja has a functionality called dyndeps. It is a way to call an external program that scans the sources to be built and then writes a simple Makefile-esque snippet describing the order in which things should be built. This is a simple and understandable step that works nicely for Fortran modules but is by itself insufficient for C++ modules.

Once more into the weeds, my friends!

As an example, let's examine the Clang compiler. For compiling C++ modules it has a command line argument for specifying where it should put the output module file: -fmodule-output=somewhere/modulename.pcm. The output file name must be the same as the name of the module it exports (not strictly true, but sufficiently true for our purposes). Immediately we see the problem. When the Ninja file is first generated, we must already know which module said file exports. Or, in other words, we have to scan the contents of the file in order to know which command line argument we need to pass to it.

In reality even that is not sufficient. You need to parse the contents of said file and all its compiler arguments and possibly other things, because the following is perfectly valid code (or at least it compiles on Clang):

module;
#include<evil.h>
export module I_AM_A_DEFINE_GOOD_LUCK_FINDING_OUT_WHERE_I_COME_FROM;

In other words, in order to be able to compile this source file you first need to parse the source file and all included sources until you hit the export declaration, and then throw away the result. Simpler approaches don't work, because the only thing that can reliably parse a C++ source file is a full C++ compiler.

C++ has to support a lot of stupid things because of backwards compatibility. In this case that does not apply, because there is no existing code that would require this to keep working. Defining a module name with the preprocessor should just be made invalid (or at least ill-formed, no diagnostic required).

Even worse, now you get into problems with reconfiguration. You want to minimize the number of reconfigurations you do, because they are typically very slow compared to just rerunning Ninja. In order to be reliable, we must assume that any change in a module source file might change the module name it generates. Thus we need to change the compiler flags needed to build the file, which means recreating the Ninja file, which can only be done by running reconfigure.

Ninja does not support dynamic compiler argument generation. Which is good, because it makes things faster, simpler and more reliable. This does not stop CMake, which hacked it on top anyway. The compiler command they invoke looks like the following.

clang++ <some compiler flags> @path/to/somefile.modmap <more flags>

The @file syntax means "read the contents of the file and pretend that it contains command line arguments". The file itself is created at build time with a scanner program. Here's what its contents look like.

-x c++-module
-fmodule-output=CMakeFiles/modtest.dir/M0.pcm
-fmodule-file=M1=CMakeFiles/modtest.dir/M1.pcm
-fmodule-file=M2=CMakeFiles/modtest.dir/M2.pcm
-fmodule-file=M3=CMakeFiles/modtest.dir/M3.pcm
-fmodule-file=M4=CMakeFiles/modtest.dir/M4.pcm
-fmodule-file=M5=CMakeFiles/modtest.dir/M5.pcm
-fmodule-file=M6=CMakeFiles/modtest.dir/M6.pcm
-fmodule-file=M7=CMakeFiles/modtest.dir/M7.pcm
-fmodule-file=M8=CMakeFiles/modtest.dir/M8.pcm
-fmodule-file=M9=CMakeFiles/modtest.dir/M9.pcm

This file has the output filename as well as a listing of every other module file in this build target (only M1 is used by M0, so the other flags are superfluous). A file like this is created for each C++ source file and must remain on disk for the entire compilation. NTFS users, rejoice!

The solution that we have here is exceedingly clever. Unfortunately the word "clever" is used here in the pejorative sense, as in "aren't I super clever for managing to create this hideously complicated Rube Goldberg machine to solve a problem caused by people not communicating with each other". This is a "first beta" level solution. The one where you prove that something can be done and which you then improve to be actually good. But no, CMake has decreed module support as "done", so this is what you are going to be stuck with for the foreseeable future.

AFAICT this setup has been designed mostly by CMake devs. Which kind of makes you wonder. Why would a company that makes a lot of money consulting and training people on their product make the practical use of C++ modules really hard for competing projects to implement? That would make it an even bigger burden to replace their software with something else? What kind of business goal could that possibly serve?

An interlude that is 100% unrelated to anything discussed above

There is a proposal for Sandia National Laboratories to fund Kitware to create a new front end language to CMake. Page 4 of the Powerpoint presentation on that page looks like this (emphasis mine):

I have no way of knowing how large the code base in question is. The document mentions on page 10 that a port from Autotools to CMake cost $1.2 million, 500k of which was internal costs and 700k went to Kitware for implementing basic functionality missing in CMake. For comparison the Cocomo estimate for Meson in its entirety is only $1.3 million. I don't know how much code they have but I have ported several fairly hefty projects from one build system to another and none of them has gotten even close to $50k. The text does not say if the $10 million includes internal costs as well as external (it probably does), but the $1 million one seems to be purely external costs. The cost of updating existing CMake code to use the new syntax seems to be ignored (or that is my interpretation of the doc, anyway). 

Without knowing the true ratio of internal costs (training, recertification etc) over total costs it is hard to compare the two numbers. But just for reference, if one were to charge $100 an hour, $10 million would get you 100k hours. At 8 hours per day that means 12.5k days or 2500 weeks or 625 months or 56 years at 11 months per year. Even a million would last for over five years of full time effort. That's a lot of time to spend on converting build systems.

We now return to your regularly scheduled program.

Is there a better design?

Maybe. Let's start by listing the requirements:

  • All command lines used must be knowable a priori, that is, without scanning the contents of source files
  • Any compiler may choose to name its module files however it wants, but said mapping must be knowable just from the compiler name and version, i.o.w. it has to be documented
  • Changing source files must not cause, by itself, a reconfiguration, only a rescan for deps followed by a build of the changed files
  • The developer must not need to tell which sources are which module types in the build system, it is to be deduced automatically without needing to scan the contents of source files (i.e. by using the proper file extension)
  • Module files are per-compiler and per-version and only make sense within the build tree of a single project
  • If module files are to be installed, that is specified in a way that does not affect source files whose modules do not need installing.
Fortran modules already satisfy all of these.

We split requirements between the tools as follows.

  • It is the responsibility of the compiler to output files where told
  • It is the responsibility of the build system to invoke compilations in the correct order to satisfy module requirements
The immediate consequence of this is that the Ninja file must not contain any command line arguments with module names in them.

For now we assume a project having only one executable target. The scheme is extended to multiple targets later.

To get this working we need a dedicated subdirectory for built modules. It is the responsibility of the build system to guarantee that every source file within a given build target is given the same directory via a command line argument.

For Clang this would mean that instead of specifying -fmodule-output=moddir/M0.pcm meaning "write the module to this file" you'd say -fmodule-output-dir=moddir meaning "I don't care what the name of the module is, just write it in the given directory using your own naming scheme". That directory is also implicitly used as a module import directory so any source file can do import foo for any module foo defined in the same target without any additional command line arguments. Other targets' module import dirs can be added with a separate command line argument (not specified here).

With this setup we don't need anything else. The command line arguments are always the same, the dependency scanner can detect when a file generates a module, what its output name and path is going to be and it can generate Ninja dyndep info to make compilations happen in the correct order. This is, more or less, how Fortran modules work and also how Meson's WIP module builder currently works. It has the added benefit that the scanner only invokes one process per build target, not one process per source file. You can only do this if the build system enforces that all sources within a target have the same command line arguments. Meson does this. CMake does not.

The scheme listed above has been tested for small projects but not at any sort of scale. No fundamental blockers to large scale use are known at this time but they might exist.

Multiple targets and subprojects

If you have a module executable A that uses module library B, you obviously need to compile all module producers of B before compiling any module consumers of A. You also need to tell A where the module files of B are.

The former is easy to do with phony targets if you are willing to tolerate that all compilations of B (but not its linking) need to happen before any compilations of A. That causes a minor build time hit, but since one of the main selling points of modules was faster build times, you should still come out ahead.

There are two solutions to the latter. The first one is that when you add a link dependency of B to A, you also add its module output directory to the module input path of A. The other, and much simpler, solution is that the module output directory is shared between all build targets and subprojects within a single build tree. It could be called, for example, built-modules and be at the root dir of the build tree. The only real downside is that if your project defines a module with the same name in two different source files, there would be a clash. But if that is the case your build setup is broken already due to potential module-ODR violations, and bailing out is the sensible thing to do.

Monday, October 2, 2023

Could we make C arrays memory safe? Probably not, but let's try anyway

Preamble

This is a project that I have wanted to implement for a long time. However it has become quite clear that I don't have the time to do it. Thus you get this blog post instead. If someone wants to try to do this on their own, feel free. If you succeed, it would improve computer security by a fair bit.

Nothing in this blog post is actually new. Similar ideas have been posted in research papers such as this one. AFAICT no-one has done a full scale implementation for C.

Even if the scheme worked (it is entirely possible that it can not be feasibly implemented), it would still not make C a memory safe language by any stretch of the imagination. There are many other sources of memory unsafety in C. But it would make it somewhat safer, which would be a good thing all considered.

Let us begin

Suppose we have the following very simple C function:

int do_indexing(int *buf, const size_t bufsize, size_t index) {
    return buf[index];
}

Is this memory safe? Obviously not. There is no guarantee that the index variable points to a valid array entry (we assume that bufsize is always correct for simplicity). Let's do this then:

int do_indexing(int *buf, const size_t bufsize, size_t index) {
    if(index < bufsize)
      return buf[index];
    return -1;
}

Assuming bufsize is a valid size of the array, then yes, this is memory safe (size_t is unsigned and is used to simplify the checking logic, signed integers behave the same but need two checks). What we want to achieve is to make the first version emit a compiler warning (or error) whereas the latter would not. First we need to tell the compiler that bufsize defines the array size. The GCC page on function attributes does not seem to provide such a thing, so let's invent our own magic syntax for it:

int do_indexing(int *buf, 
    const size_t bufsize [[arraysize_of:buf]], 
    size_t index);

We're going to leave this out in subsequent code samples for simplicity. This is just an attribute, it does not affect ABI, code generation or anything else.

Now the compiler can in fact verify that the latter example is safe. When buf is dereferenced we know that the value of index is nonnegative and less than the size of the array.

Similarly the compiler can diagnose that the first example can lead to a buffer overflow, because the value of index can be anything.

In other words, now we have a system where the programmer has to write a formal proof that the index is valid before being allowed to use it in a dereferencing operation. This has roughly the same basic idea as the borrow checker in Rust, where the programmer needs to provide a formal proof that they are the only one holding an object before being allowed to mutate it.

Onwards to more examples

What this scheme requires is to know each variable's domain, that is, the range of possible values it could have. Then on every array dereference the domain of the index must not be wider than the domain of the array size variable. Let's go through the sample function step by step.

int do_indexing(int *buf, const size_t bufsize, size_t index) {
    // domain of bufsize: unknown
    // domain of index: unknown
    if(index < bufsize)
      // domain of bufsize: not zero
      // domain of index: [0, bufsize)
      return buf[index]; // Index is formally valid.
    return -1;
}

After the check we know that bufsize is not zero (because there exists an integer smaller than it) and that index is valid.

The problem thus reduces to determining the worst possible domain a variable could hold. For example an if branch:

// Domain of index: [0, bufsize)
if(something())
    ++index;
// Domain of index: [0, bufsize], the upper bound may now be out of bounds
return array[index]; // Error, index not valid

Loops with a known number of iterations follow from this by assuming the worst case for both bounds:

// Domain of index: [a, b]
for(int i=0; i<count; ++i) {
    if(something())
        ++index;
    else
        --index;
}
// Domain of index: [a-count, b+count]
// Detect underflow if a < count.
// Detect overflow if b > SIZE_MAX - count

A loop that runs an unknown number of times is surprisingly even simpler.

// Domain of index: [a, b)
while(keep_going()) {
    if(something())
        ++index;
    else
        --index;
}
// Domain of index: unknown

The checker does not need to be able to determine what the true domain is. It can merely throw up its hands and require the developer to add the necessary checks between this block and any subsequent array indexing operations.
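
The bookkeeping a checker like this needs is quite small. Here is a minimal Python sketch of the interval domain described above; it is an illustration of the idea, not a real analysis.

UNKNOWN = None  # the "throw up your hands" state

class Domain:
    """The possible values of a variable as a closed interval [lo, hi]."""
    def __init__(self, lo, hi):
        self.lo = lo
        self.hi = hi

def after_conditional_increment(d):
    # if(something()) ++index;  ->  the upper bound grows, the lower bound stays
    return UNKNOWN if d is UNKNOWN else Domain(d.lo, d.hi + 1)

def after_unbounded_loop(d):
    # A loop that runs an unknown number of times destroys all knowledge.
    return UNKNOWN

def index_provably_valid(index_domain, bufsize_lower_bound):
    # The dereference is allowed only if every possible value is in [0, bufsize).
    return (index_domain is not UNKNOWN
            and index_domain.lo >= 0
            and index_domain.hi < bufsize_lower_bound)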

If you want to get even fancier, you could create a second attribute for the index value, like so:

int do_indexing(int *buf, 
    const size_t bufsize [[arraysize_of:buf]], 
    size_t index [[arrayindex_of:buf]]
);

You would only be allowed to call this function after formally proving that the domain of index is [0, bufsize). This would give you array indexing operations that are both memory safe and fast as it can do the array dereferences without any runtime overhead (i.e. bounds checking). In theory you could also leave out the middle argument and still have safe dereferencing operations but a decent optimizer should be able to inline the function and optimize it out.

That's it? Sadly no

Suppose the value of bufsize is not constant and you do this:

// Domain of index: [0, bufsize)
shrink_buffer_by_one(&buf, &bufsize);
return buf[index]; // NOT SAFE

If the size of the array can change, then the domain might no longer be correct. In this case you'd need to record that the upper bound was the value bufsize had at an earlier point in the program, not its current value. A full implementation would get quite complicated quite quickly.

Tuesday, September 19, 2023

Circles do not exist

Many logos, drawings and other graphical designs have the following shape in them. What is this shape?

If you thought: "Ah-ha! I'm smart and read the title of this blog post so I know that this is most definitely not a circle."

Well it is. Specifically it is a raster image of a circle that I created with the Gimp just for this use.

However almost every "circle" you can see in printed media (and most purely digital ones) are not, in fact, circles. Why is this?

Since roughly the mid 80s all "high quality" print jobs have been done either in PostScript or, nowadays almost exclusively, in PDF. They use the same basic drawing model, which does not have a primitive for circles (or circle arcs). The only primitives they have are straight line segments, rectangles and Bézier curves. None of these can be used to express a circle accurately. You can only do an approximation of a circle but it is always slightly eccentric. The only way to create a proper circle is to have a raster image like the one above.
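
You can even measure how non-circular the usual approximation is. The sketch below samples one quadrant of the standard four-segment cubic Bézier approximation of the unit circle and reports its maximum deviation from a true circle; the construction is the textbook one, not anything PDF specific.

import math

KAPPA = 4 * (math.sqrt(2) - 1) / 3  # ~0.5523, the usual quarter-circle magic constant
# Control points for the quarter arc from (1, 0) to (0, 1).
p0, p1, p2, p3 = (1, 0), (1, KAPPA), (KAPPA, 1), (0, 1)

def bezier(t):
    u = 1 - t
    x = u**3 * p0[0] + 3 * u**2 * t * p1[0] + 3 * u * t**2 * p2[0] + t**3 * p3[0]
    y = u**3 * p0[1] + 3 * u**2 * t * p1[1] + 3 * u * t**2 * p2[1] + t**3 * p3[1]
    return x, y

worst = max(abs(math.hypot(*bezier(i / 1000)) - 1.0) for i in range(1001))
print(worst)  # about 0.00027, i.e. the radius is off by less than 0.03% at worst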

Does this matter in the real world?

For printing probably not. Almost nobody can tell the difference between a real circle and one that has been approximated with a Bézier curve with just four points. Furthermore, the human vision system is a bit weird and perfect circles look vertically elongated. You have to make them non-circular for people to consider them properly circular.

But suppose you want to use one of these things:

This is a laser cutter that takes its "print job" as a PDF file and uses its vector drawing commands to drive the cutting head. This means that it is impossible to use it to cut a true circle, say for a wheel. You'd need to attach the output to a lathe and sand it down to be round so that it actually functions as a wheel rather than as a vibration source.

Again one might ask whether this has any practical impact. For this case, again, probably not. But did you know that one of the use cases PDF is being considered for (and, based on Internet rumors, is already being used for) is as an interchange format for CAD drawings? Now it suddenly starts mattering. If you have any component where getting a really accurate circle shape is vital (like pistons and their holes), suddenly all your components are slightly misshapen. Which would not be fun.

Extra bonus information

Even though it is impossible to construct a path that is perfectly circular, PDF does provide a way to draw a filled circle. Here is the relevant snippet from the PDF 2.0 spec, subsection 8.5.3.2:

If a subpath is degenerate (consists of a single-point closed path or of two or more points at the same coordinates), the S operator shall paint it only if round line caps have been specified, producing a filled circle centred at the single point.

Who is willing to put money on the line that every PDF rendering implementation actually uses circles rather than doing the simple thing of approximating it with Béziers?

Sunday, September 10, 2023

A logo for CapyPDF

The two most important things about any software project are its logo and mascot. Here is a proposal for both for CapyPDF.

As you can probably tell I'm not a professional artist, but you gotta start somewhere. The original idea was to have a capybara head which is wearing the PDF logo much like a bow tie around its ear. The gist of it should come across, though it did look much better inside my brain. The PDF squiggle logo is hard to mold to the desired shape.

The font is Nimbus Sans, which is one of the original PostScript Core Fonts. More precisely it is a freely licensed metrically compatible version of Helvetica. This combines open source with the history of PDF quite nicely.

Friday, September 1, 2023

CapyPDF 0.5 is out

There are no actual release notes, but a bunch of stuff got added. Code is here. Many of these were things needed by Inkscape for its new CMYK PDF exporter. More info about that can be found on DoctorMo's YouTube channel.

Wednesday, August 16, 2023

The least convenient way of defining an alpha value

PDF's drawing model inherits from PostScript, which was originally designed in the early 80s. It works and is really nice, but the one glaring hole in it is the lack of transparency support. The original PDF spec from 1993 has no transparency support either; it was added in version 1.4, which came out a mere 8 years later in 2001.

Setting drawing operation colours in PDF is fairly straightforward. You list the values you want and call the color setting command, like this:

1.0 0.1 0.4 rg    % Sets RGB nonstroke colour
0.2 0.8 0.0 0.2 K % Sets CMYK stroke colour

The property name for alpha in PDF is CA and ca for stroking and nonstroking operations, respectively. This would imply that you should be able to do this:

0.9 ca  % Sets nonstroke alpha? [no]
0.2 CA  % Sets stroke alpha?    [no]

But of course you can't. Instead the alpha value can only be set via a Graphics State dictionary, which is a top level PDF object that you then need to create, use, link to the current page's used resources and all that jazz.

/Alpha01State gs % Set graphics state to the dictionary with the given name
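
The dictionary that Alpha01State refers to is itself tiny. Written as a Python dict (in the same spirit as the JSON notation used elsewhere on this blog for PDF dictionaries) it is roughly this, and it has to be hooked into the page's resource dictionary so the gs operator can find it by name:

alpha_state = {
    "Type": "ExtGState",
    "ca": 0.9,  # nonstroking alpha
    "CA": 0.2,  # stroking alpha
}

page_resources = {
    "ExtGState": {"Alpha01State": alpha_state},
}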

This works fine for use cases such as compositing layers that do not themselves use transparency. In that case you render the layer to a separate XObject, create a graphics dictionary with the given composition parameters, use it and then render the layer XObject. It does not work at all in pretty much any other case.

The more observant among you might have already noticed that you need to create a separate graphics state dictionary for every transparency value used in your document. If you are, say, a vector graphics application and allow artists to adjust the transparency for each object separately using a slider (as you should), then it means that you may easily have hundreds or thousands of different alpha values in your document. Which means that you need to create hundreds or thousands of different transparency graphics state dictionaries if you want to preserve document fidelity.

Why is it implemented like this?

As is common, we don't really know (or at least I don't). The Finger of Speculation points towards the most usual of suspects: backwards compatibility.

Suppose the alpha command had been added to the PDF spec. It would mean that trying to open documents with those commands in older PDF renderers would cause crashes due to unknown commands. Yes, even if the document had specified a newer minimum PDF version, because PDF renderers would try to open and show them anyway rather than popping up an error message saying "Document version is too new". OTOH adding new entries to a graphics state dictionary is backwards compatible, because the renderer just ignores keys it does not understand. This renders the document as if no transparency had ever been defined. The output is not correct, but at least mostly correct. Sadly the same can't be done in the command stream, because it is a structureless stream of tokens following The One True Way of RPN.

If the hypothesis above is true, then the inescapable conclusion is that no new commands can be added to the PDF command stream. Ever.

Friday, July 21, 2023

Creating a PDF/X-3 document with CapyPDF

The original motivation for creating CapyPDF was understanding how fully color managed and print-ready PDF files are actually put together. A reasonable measure of this is being able to generate fully spec-conforming PDF/X-3 files. Most of the building blocks were already there, so this was mostly a question of exposing all that in Python and adding the missing bits.

I have now done this and it seems to work. The script itself can be found in CapyPDF's Git repo and the generated PDF document can be downloaded using this link. It looks like this in Acrobat Pro.

It is not graphically the most flashy, but it requires quite a lot of functionality:

  • The document is A4 size but it is defined on a larger canvas. After printing it needs to be cut into the final shape. This is needed as the document has a bleed.
  • To obtain this the PDF has a bleed box (the blue box) as well as a trim box (the green box). These are not painted in the document itself, Acrobat Pro just displays them for visualisation purposes.
  • All colors in graphical operations are specified in CMYK.
  • The image is a color managed CMYK TIFF. It has an embedded ICC profile that is different from the one used in final printing. This is due to laziness, as I had this image lying around already. You don't want to do this in a real print job.
  • The heading has both a stroke and a fill
  • The printer marks are just something I threw on the page at random.
Here is a screenshot of the document passing PDF/X-3 validation in Acrobat Pro.

Sunday, July 16, 2023

PDF transparency groups and composition

The PDF specification has the following image as an example of how to do transparent graphics composition.

This seems simple but actually requires quite a lot of functionality:

  • Specifying CMYK gradients
  • Setting the blend mode for paint operations
  • Specifying transparency group xobjects
  • Specifying layer composition parameters
I had to implement a bunch of new functionality to get this working, but here is the same image reproduced with CapyPDF. (output PDF here)

The only difference here is that out of laziness I used a simple two color linear gradient rather than a rainbow one.

Nonportability

This is again one of those features that only Acrobat Reader can handle. I tried Okular, Ghostscript, Firefox, Chromium and Apple Preview, and they all rendered the file incorrectly. There was no consistency; each one was broken in a different way.

Wednesday, July 5, 2023

CapyPDF 0.4 release and presenter tool

I have just released version 0.4 of CapyPDF. You can get it either via Github or PyPI. The target of this release was to be able to create a pure Python script that can be used to generate PDF slides to be used in presentations. It does not read any input, just always produces the same hardcoded output, but since the point of this was to create a working backend, that does not really matter.

  • The source code for the script is here. It is roughly 200 lines long.
  • The PDF it creates can be accessed here.
  • A screenshot showing all pages looks like the following.

What the screenshot does not show, however, is that the file uses PDF transition effects. They are used both for between-page transitions as well as for within-page transitions, specifically for the bullet points on page 2. This is, as far as I can tell, the first time within-page navigation functionality has been implemented in an open source package, possibly even the first time ever outside of Adobe (I don't know if MS Office exports transitions, as I don't have it and can't test). As you can probably tell, this took a fair bit of black box debugging, because all PDF tools and validators simply ignore transition dictionaries. For example, transcoding PDFs with Ghostscript scrubs all transitions from the output.

The only way to see the transitions correctly is to open the PDF in Acrobat Reader and then enable presenter mode. PDF viewer developers who want to add support for presentation effects might want to use that as an example document. I don't guarantee that it's correct, though, only that it makes Acrobat Reader display transition effects.

Thursday, June 29, 2023

PDF and embedded videos

PDF has supported playing back video content since version 1.5. I could do the whole shtick and spiel routine of "surely this is working technology, as the specification is over 20 years old by now". But you already know that that is not the case. Probably you came here just to see how badly broken things are. So here you go:

Time specification (or doing pointlessly verbose nested dictionaries before it was cool)

A video in PDF is a screen annotation. It is defined with a PDF object dictionary that houses many different subdictionaries. In addition to playing back the entire video you can also specify that only a subsection of it be played back. To do the latter you need an annotation dictionary with an action subdictionary of subtype rendition, which has a subdictionary of type rendition (no, that is not a typo), which has a subdictionary for a media source, which has three subdictionaries: a MediaClip and the start and end times.

There are three different ways of specifying the start and end times: time (in seconds), frames or bookmarks. We'll ignore the last one and look at frames first. As most people are probably not familiar with the PDF dictionary syntax, I'm going to write those out in JSON. The actual content is the same, it is just written in a different syntax. Here's what it looks like:

{
  "B": {
    "S": "F",
    "F": 1000
  }
}

Here we define the start time "B", which is a MediaOffset. It has a subtype that uses frames ("S" is "F") and finally we have the frame number of 1000. Thus you might expect that specifying the time in seconds would mean changing the value of S to T and the key–value pair "F" -> 1000 to something like "V" -> 33.4. Deep in your heart you know that not to be true, because the required format is this.

{
  "B": {
    "S": "T",
    "T": {
      "S": "S",
      "V": 33.4
    }
  }
}

Expressing the time in seconds requires a whole new dictionary just to hold the number. The specification says that the value of that dictionary's "S" must always be "S". Kinda makes you wonder why they chose to do this. Probably to make it possible to add other time formats in the future. But in the 20 years since this specification was written no such functionality has appeared. Even if it did, you could easily support it by adding a new value type in the outer dictionary (something like "T2" in addition to "T").

Most people probably have heard the recommendation of not doing overly general solutions to problems you don't have yet. This is one of the reasons why.

Does it work in practice?

No. Here's what happened when I tried using a plain H.264 MP4 file (which is listed as a supported format on Adobe's site and which plays back in Windows and macOS using the system's native video player).

Okular


Instead of a video, Okular displays a screenshot of the user's desktop with additional rendering glitches. Those are probably courtesy of bugs in either DRM code or the graphics card's device driver.

Evince

Evince shows the first frame of the video and even a player GUI, but clicking on it does nothing and the video can't be played.

Apple Preview

Displays nothing. Shows no error message.

Firefox

Displays nothing. Shows no error message.

Chromium

Displays nothing. Shows no error message.

Acrobat Reader, macOS

Plays back videos correctly but only if you play it back in its entirety. If you specify a subrange it instead prints an error message saying that the video can't be presented "in the way the author intended" and does nothing.

Acrobat Reader, Windows

Video playback works. Even the subrange file that the macOS version failed on plays and the playback is correctly limited to the time span specified.

But!

Video playback is always wonky. Either the video is replaced with a line visualisation of the audio track or the player displays a black screen until the video stream hits a key frame and plays correctly after that.

In conclusion

Out of the major PDF implementations, 100% are broken in one way or another.

Friday, June 23, 2023

PDF subpage navigation

A common presentation requirement is that you want to have a list of bullet points that appear one by one as you click forward. Almost all PDF presentations that do this fake it by having multiple pages, one for each state. So if you have a presentation with one page and five bullet points, the PDF has six pages, one for the empty state and a further one for each bullet point appearing.

This need not be so. The PDF specification has a feature called subpage navigation (PDF 2.0 spec, 12.4.4.2). This kind of makes you wonder. Why are people not using this clearly useful functionality? After trying to implement it in CapyPDF the answer became obvious. Quite quickly.

While most of the PDF specification is fairly readable, this section is not. It is very confusing. Once you get over the initial bafflement you realize that the spec is actually self-contradictory. I was so badly confused that eventually I filed a bug report against the PDF specification itself.

That does not really help in the short term, so I did the only thing that can be done under these circumstances, namely looking at how existing software handles the issue. The answer turned out to be: they don't. The only PDF viewer that does anything with subpage navigation is Acrobat Reader. Even more annoyingly, just getting your hands on a PDF document that has subpage navigation defined is next to impossible. None of the presentation software I tried exports subpage navigation tags. Trying to use a search engine leads to disappointment as well. They won't let you search for "a PDF whose internal data structures contain the word PresSteps" but instead will helpfully search for "PDF files whose payload text contains the word PresSteps". As you might expect, this finds various versions of the PDF specification and not much else.

The most probable reason why PDF subpage navigation is not used is that nobody else is using it.

How does it actually work then?

Nobody knows for real. But after feeding Acrobat Reader a bunch of sample documents and seeing what it actually does we can create a hypothesis.

The basic operations used are optional content groups and navigation nodes. The former are supported by all PDF viewers, the latter are not. Basically each navigation node represents a state and when the user navigates forwards or backwards the PDF viewer runs a specified action that can be used to either hide or display an optional content group. A reasonable assumption for getting this working, then, would be to set the default visibility of all optional content groups to hidden and then have transitions that enable the bullet points one by one.

That does not work. Instead you have to do this:

Writing arbitrary state machines as graphs just to make bullet points appear one by one? We're in the big leagues now! The reason the root node exists is that when you enter a page, the root node's transition is performed automatically (in one part of the spec but not another, hence the self-contradiction). Note especially how there is no forward navigation for state 2 or backwards navigation for state 0. This is intentional; that is how you get the subpage navigation algorithm to leave the current page. I think. It kinda works in Acrobat Reader, but you have to press the arrow key twice to leave the page.
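
For the curious, here is roughly what one of the intermediate navigation nodes looks like, again as a Python dict. The key names reflect my reading of the (confusing) spec section, so treat them as approximate rather than authoritative; the page itself points to the root node via its PresSteps key.

state1_node = {
    "Type": "NavNode",
    "NA": {  # action run when navigating forward from this state
        "S": "SetOCGState",
        "State": ["ON", "ref to the OCG of bullet point 2"],
    },
    "PA": {  # action run when navigating backward to the previous state
        "S": "SetOCGState",
        "State": ["OFF", "ref to the OCG of bullet point 2"],
    },
    "Next": "ref to the state 2 node",
    "Prev": "ref to the state 0 node",
}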

On the other hand you could implement a full choose-your-own-adventure book as a single page PDF using only subpage navigation.

Wednesday, June 14, 2023

Functionality currently implemented in CapyPDF

CapyPDF has a fair bit of functionality and it might be difficult to tell from the outside what works and what does not. Here is a rough outline of implemented functionality.

In the public C API (and Python)

  • Basic draw commands in RGB, gray and CMYK
  • ICC profile support
  • Loading PNG, JPG and TIFF images (including CMYK TIFFs)
  • Embed JPG files directly without unpacking first
  • Using images as a paint mask
  • Handle color profiles embedded in images
  • Using builtin PDF fonts
  • Using TrueType fonts with font subsetting
  • All PDF font operators like extra padding, raise/lower and set as clipping path
  • Page transitions

Implemented but not exposed

The items listed here vary in implementation completeness from "sort of ready, just not exposed yet" to "the Mmest of MVP". You can only get to them by poking at the innards directly.

  • Document navigation tree
  • Overprinting
  • Structured and annotated PDF
  • Additional color channels (called separations in the PDF spec)
  • Form XObjects
  • File embedding
  • Annotations (only a few types)
  • L*a*b* color space support
  • ICC colors in primitive paint operations
  • Type 2, 3 and 4 shadings (i.e. gradients)
  • Color patterns

Saturday, June 10, 2023

A4PDF has been renamed to CapyPDF

As alluded to in the previous post, A4PDF has changed its name. The new project name is CapyPDF. The name refers to capybaras.


Original picture from here. I was in the process of drawing a proper mascot logo, but work on that stalled. Hopefully it'll get done at some point.

There is a new release, but it does not provide new functionality, just the renaming and some general code cleanups.

Tuesday, May 30, 2023

A4PDF release 0.2.0

I have just tagged release 0.2.0 of A4PDF, the fully color managed PDF generation library.

There are not that many new features exposed in the public API since 0.1.0. The main goal of this release has been to make the Python integration work, and thus the release is also available on PyPI. Due to reasons, the binary wheel is available only for Windows.

Trying it out

On Linux you probably want to do a Git checkout and run it from there.

On Windows the easiest way is to install the package via Pip.

On macOS you can't do anything, because Apple's compiler toolchain is too old to build the code.

What functionality does it provide?

There is no actual documentation, so the best bet is to look at the unit test file. There is a lot more functionality in the C++ code, but it is not exposed in the public API yet. These include things like (basics of) tagged PDF generation, annotations and external file embedding.

Impending name change

There is an official variant of PDF called PDF/A. There are several versions of it, including PDF/A-4. I did not know that when deciding on the name. Because having a library called A4PDF that produces PDF/A-4s is confusing, the name needs to be changed. The new name has not been decided yet, suggestions welcome.

Wednesday, May 24, 2023

Advanced dependency management and building Python wheels with Meson

One of the most complex parts of developing C and C++ programs (and programs in most other languages) is dependency management. When developing A4PDF I have used Ubuntu's default distro dependencies. This is very convenient, because you typically don't need to fiddle with getting them built, and they are battle tested and almost always work.

Unfortunately you can't use those in most dev environments, especially on Windows. So let's see how much work it takes to build the whole thing on Windows using only Visual Studio and to bundle it into a Python wheel so it can be installed and distributed. I would have wanted to also put it on PyPI, but currently there is a lockdown caused by spammers, so no go on that front.

Seems like a lot of effort? Let's start by listing all the dependencies:

  • fmt
  • Freetype
  • LibTIFF
  • LibPNG
  • LibJPEG (turbo)
  • LittleCMS2
  • Zlib
These are all available via WrapDB, so each one can be installed by executing a command like the following:

meson wrap install fmt

With that done Meson will automatically download and compile the dependencies from source. No changes need to be done in the main project's meson.build files. Linux builds will keep using system deps as if nothing happened.

Next we need to build a Python extension package. This is different from a Python extension module, as the project uses ctypes for Python <-> C interop. Fortunately, thanks to the contributors of Meson-Python, this comes down to writing an 18 line toml file. Everything else is handled automatically for you. Installing the package is then a question of running this command:

pip install .

After a minute or two of compilation the module is installed. Here is a screenshot of the built libraries in the system Python's site-packages.

Now we can open a fresh terminal and start using the module.

Random things of note

  • Everything here uses only Meson. There are no external dependency managers, unix userland emulators or special terminals that you have to use.
  • In theory this could work on macOS too, but the code is implemented in C++23 and Apple's toolchain is too old to support it.
  • The build definitions for A4PDF take only 155 lines (most of which is source and program name listings).
  • If, for whatever reason, you can't use WrapDB, you can host your own.