Monday, December 23, 2024

CapyPDF 0.14 is out

I have just released version 0.14 of CapyPDF. This release has a ton of new functionality. So much, in fact, that I don't even remember them all. The reason for this is that it is actually starting to see real world usage, specifically as the new color managed PDF exporter for Inkscape. It has required a lot of refactoring work in the color code of Inkscape proper. This work has been done mostly by Doctormo, who has several videos on the issue.

The development cycle has consisted mostly of him reporting missing features like "specifying page labels is not supported", "patterns can be used for fill, but not for stroke" and "loading CMYK TIFF images with embedded color profiles does not work" and me then implementing said features or finding out how how setjmp/longjmp actually works and debugging corrupted stack traces when it doesn't.

Major change coming in the next version

The API for CapyPDF is not stable, but in the next release it will be extra unstable. The reason is C strings. Null terminated UTF-8 strings are a natural text format for PDF, as strings in PDF must not contain the zero glyph. Thus there are many functions like this in the public C API:

void do_something(const char *text);

This works and is simple, but there is a common use case it can't handle. All strings must be zero terminated so you can't point to a middle of an existing buffer, because it is not guaranteed to be zero terminated. Thus you always have to make a copy of the text you want to pass. In other words this means that you can't use C++'s string_view (or any equivalent string) as a source of text data. The public API should support this use case.

Is this premature optimization? Maybe. But is is also a usability issue as string views seem to be fairly common nowadays. There does not seem to be a perfect solution, but the best one I managed to crib seems to be to do this:

void do_something(const char *text, int32_t len_or_negative);

If the last argument is positive, use it as the length of the buffer. If i is negative then treat the char data as a zero terminated plain string. This requires changing all functions that take strings and makes the API more unpleasant to use.

If someone has an idea for a better API, do post a comment here.

Tuesday, December 17, 2024

Meson build definitions merged into Git's git repo

The developers of Git have been considering switchibg build systems for a while. No definitive decision have been made as of yet, but they gave merged Meson build definitions in the main branch. Thus it now possible, and even semi-supported, to develop and build Git with Meson instead of the vintage Makefile setup (which, AFAICT, remains as the default build system for now).

The most interesting thing about this conversion is that the devs were very thorough in their evaluation of all the different possibilities. Those who are interested in the details or are possibly contemplating a build system switch on their own are recommended to read the merge's commit message.

Huge congratulations for everyone involved and thank you for putting in the work (FTR i did not work on this myself). 

Friday, December 13, 2024

CMYK me baby one more time!

Did you know that Jpeg supports images in the CMYK colorspace? And that people are actually using them in the wild? This being the case I needed to add support to them into CapyPDF. The development steps are quite simple, first you create a CMYK Jpeg file, then you create a test document that embeds it and finally look at the result in a PDF renderer.

Off to a painter application then. This is what the test image looks like.

Then we update the Jpeg parsing code to detect cmyk images and write the corresponding metadata to the output PDF. What does then end result look like then?

Aaaaand now we have a problem. Specifically one of an arbitrary color remapping. It might seem this is just a case of inverted colors. It's not (I checked), something weirder is going on. For reference Acrobat Reader's output looks identical.

At this point rather than poke things at random and hoping for the best, a good strategy is to get more test data. Since Scribus is pretty much a gold standard on print quality PDF production I went about recreating the test document in it.

Which failed immediately on loading the image.

Here we have Gwenview and Scribus presenting their interpretations of the exact same image. If you use Scribus to generate a PDF, it will convert the Jpeg into some three channel (i.e. RGB) ICC profile.

Take-home exercise

Where is the bug (or a hole in the spec) in this case:

  • The original CMYK jpeg is correct, but Scribus and PDF renderers read it in incorrectly?
  • The original image is incorrect and Gwenview has a separate inverse bug that cancel each other out?
  • The image is correct but the metadata written in the file by CapyPDF is incorrect?
  • The PDF spec has a big chunk of UB here and the final result can be anything?
  • Aliens?
I don't know the correct answer. If someone out there does, do let me know.

Thursday, December 5, 2024

Compiler daemon thought experiment

According to information I have picked up somewhere (but can't properly confirm via web searches ATM)  there was a compiler in the 90s (the IBM VisualAge compiler maybe?) which had a special caching daemon mode. The basic idea was that you would send your code to that process and then it could return cached compile results without needing to reparse and reprocess same bits of code over and over. A sort of an in-compiler CCache, if you will. These compilers no longer seem to exist, probably because you can't just send snippets of code to be compiled, you have to send the entire set of code up to the point you want to compile. If it is different, for example because some headers are included in a different order, the results can not be reused. You have to send everything over and at that point it becomes distcc.

I was thinking about this some time ago (do not ask why, I don't know) and while this approach does not work in the general case, maybe it could be made to work for a common special case. However I am not a compiler developer so I have no idea if the following idea could work or not. But maybe someone skilled in the art might want to try this or maybe some university professor could make their students test the approach for course credit.

The basic idea is quite simple. Rather than trying to cache compiler internal state to disk somehow, persist it in a process without even attempting to be general.

The steps to take

Create a C++ project with a dozen source files or so. Each of those sources include some random set of std headers and have a single method that does something simple like returns the sum of its arguments. What they do is irrelevant, they just have to be slow to compile.

Create a PCH file that has all the std headers used in the source files. Compile that to a file.

Start compiling the actual sources one by one. Do not use parallelism to emphasize the time difference.

When the first compilation starts, read the PCH file contents into memory in the usual way. Then fork the process. One of the processes carries on compiling as usual. The second process opens a port and waits for connections, this process is the zygote server process.

When subsequent compilations are done, they connect to the port opened by the zygote process, send the compilation flags over the socket and wait for the server process to finish.

The zygote process reads the command line arguments over the socket and then forks itself. One process starts waiting on the socket again whereas the other compiles code according to the command line arguments it was given.

The performance boost comes from the fact that the zygote process already has stdlib headers in memory in compiler native data structures. In the optimal case loading the PCH file takes effectively zero time. What makes this work (in this test at least) is that the PCH file is the same for all compilations and it is the first thing the compiler starts processing. Thus it is always the same for all compilations. Conceptually at least, the actual compiler might do something else. There may be a dozen other reasons it might not work.

If someone tries this out, do let us know whether it actually worked.

Friday, November 8, 2024

PDF/AAAARGH

Note: the PDF/A specification is not freely available so everything here is based on reverse engineering. It might be complete bunk.

There are many different "subspecies" of PDF. The most common are PDF/X and PDF/A. CapyPDF can already do PDF/X, so I figured it's time to look into PDF/A. Like, how much worse could it possibly be?

Specifying that a PDF file is PDF/X is straightforward. Each PDF has a Catalog dictionary that defines properties of the document. All you need to do is to add an OutputIntent dictionary and link it to the Catalog. The dictionary has a key that specifies the subtype. Setting that to /GTS_PDFX does the trick. There are many different versions of PDF/X so you need to define that as well. A simple solution would be to have a second key in that dictionary for specifying the subtype. Half of that expectation is correct. There is indeed a key you can set, but it is in a completely different part of the object tree called the Information dictionary. It's a bit weird but you implement it once and then forget it.

PDF/A has four different versions, namely 1, 2, 3, 4 and each of these have several conformance levels that are specified with a single letter. Thus the way you specify that the file is a PDF/A document is that you write the value /GTS_PDFA1 to the intent dictionary. Yes. regardless of which version of PDF/A you want, this dictionary will say it is PDFA1.

What would be the mechanism, then, to specify the sub version:

  1. In the Information dictionary, just like with PDF/X?
  2. In some other PDF object dictionary?
  3. In a standalone PDF object that is in fact an embedded XML document?
  4. Something even worse?
Depending on your interpretation, the correct answer is either 3 or 4. Here is the XML file in question as generated by LibreOffice. The payload parts are marked with red arrows.

The other bits are just document metadata replicated. PDF version 2.0 has gone even further and deprecated storing PDF metadata in PDF's own data structures. The sructures that have been designed specifically for PDF documents, which all PDF processing software already know how to handle and which tens of billions (?) of documents already use and which can thus never be removed? Those ones. As Sun Tzu famously said:

A man with one metadata block in his file format always knows what his document is called.

A man with two can never be sure. 

Thus far we have only been at level 3. So what more could possibly be added to this to make it even worse?

Spaces.

Yes, indeed. The screen shot does not show it, but the recommend way to use this specific XML format is to add a whole lot of whitespace below the XML snippet so it can be edited in place later if needed. This is highly suspicious for PDF/A for two main reasons. First of all PDF/A is meant for archiving usage. Documents in it should not be edited afterwards. That is the entire point. Secondly, the PDF file format already has a way of replacing objects with newer versions.

The practical outcome of all this is that every single PDF/A document has approximately 5 kilobytes of fluff to represent two bytes of actual information. Said object can not even be compressed because the RDF document must be stored uncompressed to be editable. Even though in PDF/A documents it will never be edited.

Wednesday, October 30, 2024

Happenings at work

A few months ago this happened.

Which, for those of you not up to date on your 1960s British television, is to say that I've resigned. I'm currently enjoying the unemployed life style. No, that is not me being cheeky or ironic. I'm actually enjoying being able to focus on my own free time projects and sleeping late.

Since I'm not a millionaire at some point I'll probably have to get a job again. But not for at least six months. Maybe more, maybe less, we'll see what happens.

This should not affect Meson users in any significant way. I plan to spend some time to work on some fundamental issues in the code base to make things better all round. But the most important thing for now is to land the option refactor monster.

Sunday, October 6, 2024

CapyPDF 0.12.0 released

I have just made the 0.12 release of CapyPDF. It does not really have new features, but the API has been overhauled. It is almost guaranteed that no code developed against 0.11 will work without code changes. Such is the joy of not having any users.

Experimental C++ wrapper

CapyPDF has a plain C API. This makes it stable and easy to use from any programming language. That also makes it cumbersome to use. Here is what you need to write to create a PDF file that has a single rectangle:

Given that this is C, you can't really do better. However I was asked if I could create something more ergonomic for C++ users. This seemed like an interesting challenge, so I did some experimentation, which ships with the 0.12 release.

The requirements

The C++ wrapper should fulfill the following requirements:

  • Fully type safe
  • No manual memory management
  • Ideally a single header
  • Zero overhead
  • Fast to compile
  • All objects are move-only
  • IDE code completion friendly
  • Does not need to maintain API or ABI stability (those who need that have to use the C API anyway)

The base

After trying a bunch of different things I eventually came up with this design:

Basically it just stores an underlying CapyPDF object type and a helper class used to deallocate it. The operators merely mean that the object can be cast to a pointer of the underlying type. Typically you want conversion operators to be explicit but these are not, because being implicit removes a ton of boilerplate code.

For example let's look what the Color object wrapper looks like. First you need the deleter:

The class definition itself is simple:

The class has no other members than the one from the base class. The last thing we need before we can start calling into CapyPDF functions is an error handler. Because all functions in the API have the exact same form, this can be done with a single macro:

This makes the constructor look like this:

Method calls bring all of these things together.

Because the wrapper types are implicitly convertible to the underlying pointer types you can pass them directly to the C API functions. Otherwise the wrapper code would be filled with .get() method calls or the like.

The end result

With all that done, the C code from the beginning of this post can be written like this:

Despite using templates, inheritance and stdlib types the end result can be proven to have zero overhead:

After this what remains is mostly the boring work of typing out all the wrapper calls.

Thursday, September 12, 2024

On M and S Type Processes in Software Development

I wrote a post about so called "M type" and "S type" processes in software development. Unfortunately it discusses the concept of human sexuality. Now, just to be sure, it does not have any of the "good stuff" as the kids might say. Nonetheless this blog is syndicated in places where such topics might be considered controversial or even unacceptable.

Thus I can't really post the text here. Those of you who are of legal age (whatever that means in your jurisdiction) and out of their own free will want to read such material, can access the PDF version of the article via this link.

Wednesday, August 21, 2024

Meson's New Option Setup ‒ The Largest Refactoring

The problem

Meson has had togglable options from almost the very beginning. These split into two camps. The first one is "common options" like optimizations, warning level, language standard version and so on. The second one is "per project" options that are specific to each project, such as which backend to use. For a long time things were quite nice but as people started using subprojects more and more, the need to configure common options on a per-subproject basis became more and more important.

Meson added a limited way of setting some options per subproject, but it was never really felt like a proper integrated solution. Doing it properly turns out to have a lot of requirements because you want to be able to:

  • Override any shared option for any subproject
  • Do this at runtime from the command line
  • You want to unset any override given
  • Convert existing per-project settings to the new override format
  • Provide an UI that is readable and sensible
  • Do all of this without needing to edit subproject build files

The last one of these is important. It means that you can use deps directly (i.e. from WrapDB) without any local patches.

What benefits do you get out of it?

The benefits are most easily seen via examples. Let's say you are developing a program that uses a dependency that does heavy number crunching. You need to build that (and only that) subproject with optimizations enabled, otherwise your development experience is intolerably slow. This is done by defining an augment, like so:

meson configure -Acruncher:optimization=2

A stronger version of this would be to compile all subprojects with optimizations but the top level project without them. This is how you'd do it:

meson configure -Doptimization=2 -A:optimization=0

Augments can be unset:

meson configure -Usubproject:option

This scheme permits you to do all sorts of useful things, like disable -Werror on specific projects, build some subprojects with a different language version (such as gnu99), compiling LGPL deps as shared libraries and everything else as a static library, and so on.

Implementing

This is a big internal change. How big? Big! This is the largest refactoring operation I have done in my life. It is big enough that it took me over two years of procrastination before I managed to gather enough strength to start work on this. Pretty much all of my Meson work in the last six months or so has been spent on this one issue. The feature is still not done, but the merge request already has 80 commits and 1700+ new lines and even that is an understatement. I have chopped off bits of the change and merged them on their own. All in all this meant that the schedule for most days of my summer vacation went like this:

  • Wake up
  • Work on Meson refactoring branch until fed up
  • Work on my next book until fed up
  • Maybe do something else
  • Sleep
FTR I don't recommend this style of working for anyone else. Or even to myself. But sometimes you just gotta.

The main reason this change is so complex lies in the architecture. In existing code each built target "knew" the option settings needed for it (options could and can be overridden in build files on a per-target basis). This does not work any more. Instead the code needs one place that encapsulates all option data and provides methods like "what is the value of option X when building target Y in subproject Z". Option code was everywhere, so changing this meant touching the entire code base and that the huge change blob must be landed in master atomically.

The only thing that made this change even remotely feasible was that Meson has an extensive test suite. The main code changes were done months ago, and all work since then has gone into making existing unit tests pass. They still don't pass, so work continues. Without this test suite there would have been hundreds of regressing projects, people would be angry and everyone would pin their Meson to an old version and refuse to update. These are the sorts of breakages that kill projects dead. So, write tests, even if it does not seem fun. Without them every project will eventually end up in a fork in the road where the choice is between "death by stagnation" and "death by breaking end users". Most projects are not Python 3. They probably won't survive a similar level of breakage.

Refactoring, types and Python

Python is, at the same time, my favourite programming language and very much not my favourite programming language. Python in the small is nice, readable, wonderful and productive. As the project size grows, the lack of static types becomes aggravating and eventually you end up debugging cases like "why does this argument that should be a dict is an array one out of 500 times at random". Types make these problems go away and make refactoring easy.

But not always.

For this very specific case the complete lack of types actually made the refactoring easier. Meson currently supports more than one hundred different compilers. I needed to change the way compiler classes work, but I did not know how. Thus I started by just using the GNU C compiler. I could change that (and its base class) as much as I wanted without having to care about any other compiler class. As long as I did not use any other compiler their code was not called and it did not matter that their method signatures were completely different. In a static language all type changed would need to be done up front just to make the dang thing compile.

Still, you can have my types when you drag them from my cold, dead fingers. But maybe this is something for language designers of the future to consider. It would be kind of cool to have a strictly typed language where you could add a compiler flag to say "convert all variables into Python style variant dictionaries and make all type checks, method invocations etc work at runtime". Yes, people would abuse the crap out of this feature, but the same can be said about every new feature.

When will this land?

It is not done yet, so we don't know. At the earliest this will be in the next release, but more likely in the one after that.

If you like trying out new things and living dangerously, you can try the code from this MR. Be sure to post comments on that page if you do.

Saturday, August 10, 2024

Refactoring Python dicts to proper classes

When doing a major refactoring in Meson, I came up with a interesting refactoring technique, which I have not seen before. Some search engineing did not find suitable hits. Obviously it is entirely possible that this is a known refactoring but I don't know its name. In any case, here's my version of it.

The problem

Suppose you have a Python class with the following snippet

class Something:
    def __init__(self):
        self.store = {}

Basically you have a dictionary as a member variable. This is then used all around the class that grows and grows. Then you either find a bug in how the dict is used or you want to add some functionality like, to pick an arbitrary requirement, all keys for this object that are strings, must begin with "s_".

Now you have a problem because you need to do arbitrary changes all around the code. You can't easily debug this. You can't add a breakpoint inside this specific dictionary's setter function (or maybe Python's debugger can do that but I don't know how to do that). Reading code that massages dictionaries directly is tricky, because it's all brackets and open code rather than calling named methods like do_operation_x.

The solution, step one

Create a Python class that looks like this:

class MeaningfulName:
    def __init__(self, *args, **kwargs):
        self.d = dict(*args, **kwargs)

    def contains(self, key):
        return key in self.d

    def __getitem__(self, key):
        return self.d[key]

    def __setitem__(self, key, value):
        self.d[key] = value

    ...

Basically you implement all the special methods that do nothing else than forward to the underlying dictionary. Then replace the self.store dictionary with this object. Nothing should have changed. Run tests to make sure. Then commit this to main. Let it sit in the code base for a while in case there are untested code paths that use functionality that you did not write.

Just doing this gives an advantage: it is easy to add breakpoints to methods that mutate the objects's state.

Step two

Pick any of the special dunder methods and rename it to a more meaningful name. Add validation code if you need. Run tests. Fix all errors by rewriting the calling code to use the new named method. Some methods might need to be replaced with multiple new methods that do slightly different things. For example you might want to add methods like set_value and update_if_changed.

Step three

Repeat step two until all dunder methods are gone.

Wednesday, July 17, 2024

Why refactoring is harder than you think, a pictorial representation

Suppose you are working on a legacy code base. It looks like this:

Within it you find a piece of functionality that does a single thing, but is implemented in a hideously complicated way. Like so.

You look at that and think: "That's awful. I know how to do it better. Faster. Easier. More maintainable." Then you set out to do just that. And you succeed in producing this thing.

Wow. Look at that slick and smooth niceness. Time to excise the bad implementation from your code base.

The new one is put in:

And now you are don ... oh, my!



Friday, June 21, 2024

Advanced text features and PDF

The basic text model of PDF is quite nice. On the other hand its basic design was a very late 80s "ASCII is everything everyone really needs, but we'll be super generous and provide up to 255 glyphs using a custom encoding that is not in use everywhere else". As you can probably guess, this causes a fair bit of problems in the modern world.

To properly understand the text that follows you should know that there are four different ways in which text and letters need to be represented to get things working:

  • Source text is the "original" written text in UTF-8 (typically)
  • Unicode codepoints represent unique Unicode IDs as specified by the Unicode standard
  • A glyph id uniquely specifies a glyph (basically a series of drawing operations), these are arbitrary and typically unique for each font
  • ActualText is sort of like an AltText for PDF but uses UTF-16BE as was the way of the future in the early 90s

Kerning

The most common advanced typography feature in use is probably kerning, that is, custom spacing between certain letter pairs like "AV" and "To". The PDF text model has native support for kerning and it even supports vertical and horizontal kerning. Unfortunately the way things are set up means that you can only specify horizontal kerning when laying out horizontal text and vertical kerning for vertical text. If your script requires both, you are not going to have a good time.

There are several approaches one can take. The simplest is to convert all text to path drawing operations, which can be placed anywhere with arbitrary precision. This works great for printed documents but also means that document sizes balloon and you can't copypaste text from the document, use screen readers or do any other operation that needs the actual text those shapes represent.

An alternative is to render each glyph as its own text object with exact coordinates. While verbose this works, but since every letter is separate, text selection becomes wonky again. PDF readers seem to have custom heuristics to try to detect these issues and fix text selection in post-processing. Sometimes it works better than at other times.

Everything in PDF drawing operations is based on matrices. Text has its own transform matrix that defines where the next glyph will go. We could specify kerning manually with a custom translation matrix that translates the rendering location by the amount needed. There are two main downsides to this. First of all it would mean that instead of having a stream of glyphs to render, you'd need to define 9 floating point numbers (actually 6 due to reasons) between every pair of glyphs. This would increase the size of you output by a factor of roughly ten. The other downside is that unlike for all other matrices, PDF does not permit you to multiply an existing text state matrix with a new one. You can only replace it completely. So the actual code path would become "tell PDF to draw a glyph, work out what changes it would make to the currently active text matrix, undo that, multiply that matrix with one that has the changes that you wanted to happen and proceed to the next glyph".

Glyph substitution

Most of the time (in most scripts anyway) source text's Unicode codepoints get mapped 1:1 to a font glyph in the final output. Perhaps the most common case where this does not happen is ligatures.

The actual rules when and how this happens are script, font and language dependent. This is something you do not want to do yourself, instead use a shaping engine like Harfbuzz. If you give it the source text as UTF-8 and a font that has the ffi ligature, it will return a list of four glyph ids in the font to use, the way they map back to the original text, kerning (if any) and all of that good stuff.

What it won't give you is the information of what ligatures it replaced your source text with. In this example it will tell you the glyph id of the ffi ligature (2132) but not which Unicode codepoint it corresponds to (0xFB03). You need to tell that number in PDF metadata for the text to work properly in copypaste operations. At first this does not seem like such a big problem, because we have access to the original font file and Freetype. You'd think you can just ask Freetype for the Unicode codepoint for a given font glyph, but you can't. There is a function for finding a glyph for a given Unicode codepoint but mot the other way around. The stackoverflow recommended way of doing this is to iterate over all glyphs until you find the one that is mapped to the desired codepoint. For extra challenge you need to write an ActualText tag in the PDF command stream so that when users copypaste that text they get the original form with each individual letter rather than the ffi Unicode glyph.

All of this means that glyph lookup is basically a O(n^2) operation if it was possible to do. Sometimes it isn't, as we shall now find out.

Alternate forms

OpenType fonts can have multiple different glyphs for the same Unicode codepoint, for example the small caps versions of Noto Serif look like this.

These are proper hand-drawn versions of the glyphs, not ones obtained by scaling down upper case letters. Using these is simple, you tell Harfbuzz to use the small caps versions when shaping and then it does everything for you. For this particular font upper case small caps glyphs are the same as regular upper case glyphs. The lower case ones have their own glyphs in the font. However, according to Freetype at least, those glyphs are not mapped to any Unicode codepoint. Conceptually a small caps lower case "m" should be the same as a regular lower case "m". For some reason it is not and, unless I'm missing something, there is no API that can tell you that. The only way to do it "properly" is to track this yourself based on your input text and requirements.

How does CapyPDF handle all this?

In the same way pretty much all PDF generator libraries do: by ignoring all of it. CapyPDF only provides the means to express all underlying functionality in the PDF library. It is the responsibility of the client application to form glyph sequences and related PDF metadata in the way that makes sense for their application and document structure.

Tuesday, May 14, 2024

Generative non-AI

In last week's episode of the Game Scoop podcast an idea was floated that modern computer game names are uninspiring and that better ones could be made by picking random words from existing NES titles. This felt like a fun programming challenge so I went and implemented it. Code and examples can be found in this GH repo.

Most of the game names created in this way are word salad gobbledigook or literally translated obscure anime titles (Prince Turtles Blaster Family). Running it a few times does give results that are actually quite interesting. They range from games that really should exist (Operation Metroid) to surprisingly reasonable (Gumshoe Foreman's Marble Stadium), to ones that actually made me laugh out loud (Punch-Out! Kids). Here's a list of some of my favourites:

  • Ice Space Piano
  • Castelian Devil Rainbow Bros.
  • The Lost Dinosaur Icarus
  • Mighty Hoops, Mighty Rivals
  • Rad Yoshi G
  • Snake Hammerin'
  • MD Totally Heavy
  • Disney's Die! Connors
  • Monopoly Ransom Manta Caper!
  • Revenge Marble
  • Kung-Fu Hogan's F-15
  • Sinister P.O.W.
  • Duck Combat Baseball

I emailed my findings back to the podcast host and they actually discussed it in this week's show (video here starting at approximately 35 minutes). All in all this was an interesting exercise. However pretty quickly after finishing the project I realized that doing things yourself is no longer what the cool kids are doing. Instead this is the sort of thing that is seemingly tailor-made for AI. All you have to do is to type in a prompt like "create 10 new titles for video games by only taking words from existing NES games" and post that to tiktokstagram.

I tried that and the results were absolute garbage. Since the prompt has to have the words "video game" and "NES", and LLMs work solely on the basis of "what is the most common thing (i.e. popular)", the output consists almost entirely of the most well known NES titles with maybe some words swapped. I tried to guide it by telling it to use "more random" words. The end result was a list of ten games of which eight were alliterative. So much for randomness.

But more importantly every single one of the recommendations the LLM created was boring. Uninspired. Bland. Waste of electricity, basically.

Thus we find that creating a list of game names with an LLM is easy but the end result is worthless and unusable. Doing the same task by hand did take a bit more effort but the end result was miles better because it found new and interesting combinations that a "popularity first" estimator seem to not be able to match. Which matches the preconception I had about LLMs from prior tests and seeing how other people have used them.

Sunday, April 21, 2024

C is dead, long live C (APIs)

In the 80s and 90s software development landscape was quite different from today (or so I have been told). Everything that needed performance was written in C and things that did not were written in Perl. Because computers of the time were really slow, almost everything was in C. If you needed performance and fast development, you could write a C extension to Perl.

As C was the only game in town, anyone could use pretty much any other library directly. The number of dependencies available was minuscule compared to today, but you could use all of them fairly easily. Then things changed, as they have a tendency to do. First Python took over Perl. Then more and more languages started eroding C's dominant position. This lead to a duplication of effort. For example if you were using Java and wanted to parse XML (which was the coolness of its day), you'd need an XML parser written in Java. Just dropping libxml in your Java source tree would not cut it (you could still use native code libs but most people chose not to).

The number of languages and ecosystems kept growing and nowadays we have dozens of them. But suppose you want to provide a library that does something useful and you'd like it to be usable by as many people as possible. This is especially relevant for providing closed source libraries but the same applies to open source libs as well. You especially do not want to rewrite and maintain multiple implementations of the code in different languages. So what do you do?

Let's start by going through a list of programming languages and seeing what sort of dependencies they can use natively (i.e. the toolchain or stdlib provides this support out of the box rather than requiring an addon, code generator, IDL tool or the like)

  • C: C
  • Perl: Perl and C
  • Python: Python and C
  • C++: C++ and C
  • Rust: Rust and C
  • Java: Java and C
  • Lua: Lua and C
  • D: D, subset of C++ and C
  • Swift: Swift, Objective C, C++ (eventually?) and C
  • PrettyMuchAnyNewLanguage: itself and C
The message is quite clear. The only thing in common is C, so that is what you have to use. The alternative is maintaining an implementation per language leaving languages you explicitly do not support out in the cold.

So even though C as a language is (most likely) going away, C APIs are not. In fact, designing C APIs is a skill that might even see a resurgence as the language ecosystem fractures even further. Note that providing a library with a C API does not mean having to implement it in C. All languages have ways of providing libraries whose external API is compatible with C. As an extreme example, Visual Studio's C runtime libraries are nowadays written in C++.

CapyPDF's design and things picked up along the way

One of the main design goals of CapyPDF was that it should provide a C API and be usable from any language. It should also (eventually) provide a stable API and ABI. This means that the ground truth of the library's functionality is the C header. This turns out to have design implications to the library's internals that might be difficult to add in after the fact.

Hide everything

Perhaps the most important declaration in widely usable C headers is this.

typedef struct _someObject SomeObject;

In C parlance this means "there is a struct type _someObject somewhere, create an alias to it called SomeObjectType". This means that the caller can create pointers to structs of type SomeObject but do nothing else with them. This leads to the common "opaque structs" C API way of doing things:

SomeObject *o = some_object_new();
some_object_do_something(o, "hello");
some_object_destroy(o);

This permits you to change the internal representation of the object while still maintaining stable public API and ABI. Avoid exposing the internals of structs whenever possible, because once made public they can never be changed.

Objects exposed via pointers must never move in memory

This one is fairly obvious when you think about it. Unfortunately it means that if you want to give users access to objects that are stored in an std::vector, you can't do it with pointers, which is the natural way of doing things in C. Pushing more entries in the vector will eventually cause the capacity to be exceeded so the storage will be reallocated and entries moved to the new backing store. This invalidates all pointers.

There are several solutions to this, but the simplest one is to access those objects via type safe indices instead. They are defined like this:

typedef struct { int32_t id; } SomeObjectId;

This struct behaves "like an integer" in that you can pass it around as an int but it does not implicitly convert to any other "integer" type.

Objects must be destructable in any order

It is easy to write into documentation that "objects of type X must be destroyed before any object Y that they use". Unfortunately garbage collected languages do not read your docs and thus provide no guarantee whatsoever on object destruction order. When used in this way any object must be destructable at any time regardless of the state of any other object.

This is the opposite of how modern languages want to work. For the case of CapyPDF especially page draw contexts were done in an RAII style where they would submit their changes upon destruction. For an internal API this is nice and usable but for a public C API it is not. The implicit action had to be replaced with an explicit function to add the page that takes both object pointers (the draw context and document) as arguments. This ensures that they both must exist and be valid at the point of call.

Use transactionality whenever possible

It would be nice if all objects were immutable but sadly that would mean that you can't actually do anything. A library must provide ways for end users to create, mutate and destroy objects. When possible try to do this with a builder object. That is, the user creates a "transactional change" that they want to do. They can call setters and such as much as they want, but they don't affect the "actual document". All of this new state is isolated in the builder object. Once the user is finished they submit the change to the main object which is then validated and either rejected or accepted as a whole. The builder object then becomes an empty shell that can be either reused or discarded.

CapyPDF is an append only library. Once something has been "committed" it can never be taken out again. This is also something to strive towards, because removing things is a lot harder than adding them.

Prefer copying to sharing

When the library is given some piece of data, it makes a private copy of it. Otherwise it would need to coordinate the life cycle of the shared piece of data with the caller. This is where bugs lie. Copying does cost some performance but makes a whole class of difficult bugs just go away. In the case of CapyPDF the performance hit turned out not to be an issue since most of the runtime is spent compressing the output with zlib.

Every function call can fail, even those that can't

Every function in the library returns an error code. Even those that have no way of failing, because circumstances can change in the future. Maybe some input that could be anything somehow needs to be validated now and you can't change the function definition as it would break API. Thus every function returns an error code (except the function that converts an error code into an error string). Sadly this means that all "return values" must be handled via out parameters.

ErrorCode some_object_new(SomeObject **out_ptr);

This is not great, but such is life. 

Think of C APIs as "in-process RPC"

When designing the API of CapyPDF it was helpful to think of it like a call to a remote endpoint somewhere out there on the Internet. This makes you want to design functions that are as high level as possible and try to ignore all implementation details you can, almost as if the C API was a slightly cumbersome DSL. 

Wednesday, April 17, 2024

CapyPDF 0.10.0 is out

Perhaps the most interesting feature is that this new version reduces the number of external dependencies by almost 15%. In other words the number of deps went from 7 to 6. This is due to Apple Clang finally shipping with std::format support so fmt::format could be removed. The actual change was pretty much a search & replace from fmt::format to std::format. Nice.

Other features include:

  • L*a*b* color support in paint operations
  • Reworked raster image APIs
  • Kerned text support
  • Support for all PDF/X versions, not just 3
  • Better outline support

But most importantly, there are now stickers:

Sadly you can't actually buy them anywhere, they can only be obtained by meeting me in person and asking for one.

Tuesday, April 2, 2024

Aesthetics matter

When I started working on Meson I had several goals: portability, performance, usability and so on. I particularly liked the last one of these, but to my surprise this interest was not shared by people at large, especially those who used Autotools. Eventually the discussion always degenerated with them saying some variant of this:

It does not matter that Autotools is implemented as a mixture of five different scripting languages mismashed together. It works, so why replace it with something that is, at best, a tiny bit better?

One person went so far as to ask me (in public in front of a crowd) why making builds faster is even a thing to waste effort on? Said person followed this by saying he began his every working day by starting a build and going to brew some coffee. When he came back to his computer everything was ready to start programming.

It annoyed me to no end that I did not have a good reply to these people at the time. Unfortunately a thing happened last week that changed this.

The XZ malicious code injection incident.

It would be easy to jump on a bandwagon and blame Autotools for the whole issue and demand it to be banned as an unfixable security vulnerability [1] and all that. But let's not do that. Instead let's look at the issue from a slightly wider perspective.

Take any project you are working on currently. It can either be a work project or an open source one. Now think about all the various components it has. Go through them one by one in your mind. Pause at each one. Ponder them. Does any one of them immediately conjure up the following reaction in your mind:

I'm not touching that shit!

If the answer is yes then congratulations, you have found the most likely attack vector against the project. Why? Because that part that is guaranteed to have the absolute worst code reviews for the simple reason that nobody wants to touch it with a ten foot pole [2]. It is the very definition of someone else's problem. In the case of Autotools the problem is even worse, because there are no tools to find bugs automatically. Static analysis? No [3]! Linters? No! Even something simple like compiler warnings? Lol no! The reason they don't exist is exactly the same as above: the whole problem space is so off-putting that even the people who could do something about it prefer to work on something more meaningful instead. Badness begets more badness and apathy. The fact that it does not halt and catch fire most of the time is seen as sufficient quality.

This is even more of a problem for open source projects. Commercial projects pay people a full living salary to deal with necessary non-glamorous work like this. Volunteer based open source projects can not. A major fraction of the motivation for contributing on an open source project is to work on something that is somehow "cool", "fun" or "interesting". Debugging issues caused by incorrect M4 substitutions somewhere in the guts of a ten layer deep sed/awk/grep/Make/xargs/subshell pipeline is not that.

The reports I have read do not state whether XZ's malicious payload was submitted PR or not, but let's do a thought experiment. Assume that you are the overworked maintainer of an open source project that gets a PR that changes a bunch of M4 files with a description "fixes issue X in Y". What would you do? If you are honest with yourself, you'd probably do the same thing I'd do: merge it in while thinking "I'm just glad someone else fixed this so I don't have to touch that shit [4]".

Thus we find that aesthetics and beauty in fact play a vital role in security, because those systems make people want to work on them. They attract eyeballs. There is never a risk of getting stuck maintaining some awful piece of garbage because you touched it last so it's your responsibility now [5]. Beauty truly is the mother of security, or, as the ancient romans used to say:

Pulchritudo mater securitatis! [6]

[1] Which you still might choose to do.

[2] For a more literal example, several "undefeatable" fortresses have been taken over by attackers entering via sewage pipes.

[3] And not only because all the languages in question are dynamic.

[4] Yes, I have done this with Meson. Several times. Every maintainer of even moderately popular open source project has done the same. Trying to deny it is not productive.

[5] This is especially common in corporations with the added limitation that you are not allowed to make any major changes to avoid breaking things. If you ever find yourself in this situation, find employment elsewhere sooner rather than later. Otherwise your career has reached a dead end you can't escape.

[6] At least according to Google translate, which is good enough for our modern post-truth world.

Wednesday, March 20, 2024

Color management and API design

API design is hard. This is not a smashingly new revelation, but let's look at a sample issue I have been working on for CapyPDF. The main problem we are trying to solve is creating "print quality" PDFs. That is, ones that can be used to print things like books, magazines, posters and other high quality materials. A core component of this is color management, specifically the handling of ICC profiles for raster images.

There are at least four slightly conflicting design goals.

Fine-grained control

An advanced user knows and understands the PDF spec and know exactly how they want it to come out. The library should provide for this and not do, for example, unexpected color conversions behind the user's back.

Easy to use for basic cases

OTOH if your needs are simple, such as just loading images from files on disk, converting them to the output colorspace (almost certainly CMYK) with minimal fuss.

Simplicity

The API should be simple and readable. Even more importantly it should be understandable in the sense that when the user calls certain functions, they should be able to "know" what is going to happen and the behaviour should be the same over multiple invocations.

Safety

The API should prevent you from doing invalid things, such as using an uncalibrated RGB image in a CMYK document.

A wild real world appears!

Thus far things seem simple, but they get awfully complex. PDF is used in many different ways and all of those have their own requirements. For high quality printing specifically there is a specification called PDF/X that many printing shops use. Some might not even accept material that is not in this format. One of the requirements of PDF/X is that all raster images must be color managed. It would seem that a simple approach would be to convert all images to the output color space on load. And this is where things break down.

For you see, PDF does not have a single color managed pipeline, logically it has two. Grayscale images are "different" from full color images. A PDF generator must never convert grayscale raster images (or colors in general, but we'll focus on images now) to "color" images. Not even if the end result were "mathematically equivalent". In high quality printing that is not enough. Suppose you have a pixel whose gray value is 10. Converting that to CMYK can lead to (at least) two different values, (10, 10, 10, 0) and (0, 0, 0, 10). You'd think that the latter would always happen, but in testing LittleCMS produced the former (it also has custom gray-preserving transforms, but I did not try those). Even though these values are mathematically equivalent they (may) produce different output when printed. The latter is pure gray while the former can look muddled and if there are any registration problems the constituent colors might be visible. The RIP can not know whether the "grayscale looking color" was intentional or not. Under some circumstances it might be exactly what the creator intended, thus it can't really be post processed away. The only correct way is to keep the image in the gray color space so the RIP has maximal information to do its thing.

But this causes its own problem, because most grayscale images are not color managed. What should you do with those? Requiring color profiles would not be a nice UI, because then most images would break. For 1-bit grayscale images a color profile would not even make any sense. Not to mention that the grayscale image might not be printed at all but it instead used as an image mask for graphics composition operations (basically it would be used as the alpha channel). In that case you definitely want to use raw pixel values to obtain linear mixing. Doing gamma correction on your transparency channel could lead to some funky effects.

Things get more complicated once you realize that there are 7 variations of PDF/X that permit and prohibit different things. I tried to work out the workflow by writing a full table on color modes and output spaces and what should happen with every combination. Half way through I got a headache and had to stop.

Current status

The original plan was to make things happen automatically and try to validate the semantics of the output document as much as possible. That got simplified a whole lot. Because the state space is just so massive it might turn out that eventually CapyPDF only provides you the tools to do color conversions yourself and then writes out the result without trying to do anything fancy to it. It would then be the responsibility of the user to validate all semantic requirements.

All of this is to say that if you are currently using CapyPDF, just be aware that in the next version all APIs dealing with raster images have changed completely.

Monday, March 4, 2024

CapyPDF 0.9.0 released

I have just released CapyPDF 0.9.0. It can be obtained either via Github or PyPI.

There is no major big feature for this release. The most notable is probably the ability to create structured (or "tagged") PDF files. The code supports using both the builtin tags as well as defining your own. Feel free to try it, just know that the API is guaranteed to change.

As a visual example, here is the full Python source code for one of the unit tests.

When run it creates a tagged PDF file.  Adobe Acrobat reports that the document has the following logical structure.

As you can (hopefully) tell, structure and content are the same in both of them.

Friday, February 23, 2024

Creating tagged PDFs with CapyPDF now sort of possible

There are many open source PDF generators available. Unfortunately they all have some limitations when it comes to generating tagged PDFs

  • Cairo does not support tagged PDFs at all
  • LaTeX can create tagged PDFs, but obviously only out of LaTeX documents
  • Scribus does not support tagged PDF
  • LibreOffice does support tagged PDF generation, but its code is not available as a standalone library, it can only be used via LibreOffice
  • HexaPDF does not support tagged PDFs (though they seem to be on the roadmap), but it is implemented in Ruby so most projects can't use it
  • Firefox can print pages to PDF, but the result is not tagged, even though the PDF document structure model is almost exactly the same as in HTML

There does not seem to be a library that provides for all of this with a "plain C" API that can be used to easily generated tagged PDFs using almost any programming language.

There still isn't, but at least now CapyPDF can generate simple tagged PDF documents. A sample document can be downloaded via this link. Here is a picture showing the document structure in Acrobat Pro.

It should also work in things like screen readers and other accessibility tools, but I have not tested it.

None of this is exposed in the C API, because this has a fairly large API surface and I have not yet come up with a good way to represent it.


Sunday, January 21, 2024

CapyPDF 0.8.0 released

Version 0.8.0 of the CapyPDF library has been released. The main new feature is support for form XObjects and printer's mark annotations.

Printer's marks are things like color bars, crop marks and registration marks (also known as "bullseye marks") that high end printers need for quality control. Here is a very simple document.

An experienced print operator would use the black lines to set up paper trimming. Traditionally these marks were drawn in the page's graphics stream. This is problematic because nowadays printers prefer to use their own custom marks instead of ones created by the document author. PDF solves this issue by moving these graphics operations to separate draw contexts (specifically "form XObjects", which are not actually forms, though they are XObjects) that can then be "glued" on top of the page. These annotations are shown in PDF viewer applications but they are not printed. I have no experience with high end RIP software, but presumably the operator can choose to either print the document's annotations or replace them with their custom QA marks.

As usual, to find out what features CapyPDF has and how to use them, look up either the public C header or the Python code used in unit tests.

Tuesday, January 2, 2024

C++ module tooling emulator playground

Developing tooling for C++ modules is challenging to say the least. Module implementation maturity in compilers varies, they all work slightly (well massively) differently, there are bugs and you also need a code base that uses modules. Because of these and other reasons there are maybe five people in the entire world who even think about this issue. This is bad, because it is supposed to be future foundational technology. It would benefit from more eyes.

Something ought to be done about this. So I did.

I created a fake "module only" C++ compiler, a fake linker, a fake module scanner and fake project generator. The compiler does not produce any code. It only does three things:

  1. Reads export statements from sources and writes corresponding fake module files
  2. Reads import statements from source files.
  3. Validates that all module files a source file imports exist on disk at the time of invocation
This did not take much effort. In fact the whole project is approximately 300 lines of Python and can be obtained from this repository.

With this it is possible for anyone interested to try to come up with a module building scheme that does not require generating O(N²) command line arguments on the fly. The current scanner implementation is taken almost directly from Meson and it works one build target at a time as opposed to a single source file at a time. With this approach the overhead of scanning is one process invocation per build target. A per-source approach would take two processes per source scanning: one for invoking the compiler to generate the standard JSON-format dependency file and another for converting that to the dyndep format that Ninja understands.

The setup assumes that the compiler supports a "write module files to this directory" command line argument. It is mandatory to avoid generating compiler arguments dynamically.

Or maybe it isn't and there is a way to make it work in some other way. At least now the tooling is available so anyone reading this can try to solve this problem.