Sunday, October 6, 2024

CapyPDF 0.12.0 released

I have just made the 0.12 release of CapyPDF. It does not really have new features, but the API has been overhauled. It is almost guaranteed that no code developed against 0.11 will work without code changes. Such is the joy of not having any users.

Experimental C++ wrapper

CapyPDF has a plain C API. This makes it stable and callable from any programming language. It also makes it cumbersome to use. Here is what you need to write to create a PDF file that has a single rectangle:
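
The original post shows the actual C code here. To give an idea of its shape, here is a rough sketch; the capy_* names below are stand-ins with approximately the right flavour rather than the exact CapyPDF API, but the boilerplate is the point: every call returns an error code that must be checked and every object must be destroyed by hand.

/* Illustrative sketch only, not the verbatim CapyPDF C API. */
CapyPDF_EC rc;
CapyPDF_Generator *gen = NULL;
CapyPDF_DrawContext *dc = NULL;

if ((rc = capy_generator_new("rectangle.pdf", &gen)) != 0)
    goto cleanup;
if ((rc = capy_page_draw_context_new(gen, &dc)) != 0)
    goto cleanup;
if ((rc = capy_dc_cmd_re(dc, 10, 10, 100, 100)) != 0) /* the rectangle */
    goto cleanup;
if ((rc = capy_dc_cmd_f(dc)) != 0)                    /* fill it */
    goto cleanup;
if ((rc = capy_generator_add_page(gen, dc)) != 0)
    goto cleanup;
if ((rc = capy_generator_write(gen)) != 0)
    goto cleanup;

cleanup:
    if (dc)
        capy_dc_destroy(dc);
    if (gen)
        capy_generator_destroy(gen);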

Given that this is C, you can't really do better. However I was asked if I could create something more ergonomic for C++ users. This seemed like an interesting challenge, so I did some experimentation, which ships with the 0.12 release.

The requirements

The C++ wrapper should fulfill the following requirements:

  • Fully type safe
  • No manual memory management
  • Ideally a single header
  • Zero overhead
  • Fast to compile
  • All objects are move-only
  • IDE code completion friendly
  • Does not need to maintain API or ABI stability (those who need that have to use the C API anyway)

The base

After trying a bunch of different things I eventually came up with this design:
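
The actual code ships with the 0.12 sources; the following is a minimal sketch of the same idea with illustrative names (CapyC and the CapyPDF_* types used below are approximations, not the verbatim API):

#include <memory>

// Sketch: the base stores the raw C object in a unique_ptr with a per-type
// deleter and converts implicitly to the underlying pointer types.
template<typename T, typename Deleter>
class CapyC {
public:
    operator T *() { return _d.get(); }
    operator const T *() const { return _d.get(); }

protected:
    std::unique_ptr<T, Deleter> _d;
};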

Basically it just stores an underlying CapyPDF object type and a helper class used to deallocate it. The operators merely mean that the object can be cast to a pointer of the underlying type. Typically you want conversion operators to be explicit but these are not, because being implicit removes a ton of boilerplate code.

For example let's look what the Color object wrapper looks like. First you need the deleter:
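
Assuming the C API exposes a per-type destruction function (the name below is a stand-in), the deleter just forwards to it:

struct ColorDeleter {
    void operator()(CapyPDF_Color *c) const {
        capy_color_destroy(c); // stand-in for the real destruction function
    }
};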

The class definition itself is simple:
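
Roughly like this in the sketch:

class Color : public CapyC<CapyPDF_Color, ColorDeleter> {
public:
    Color();

    void set_rgb(double r, double g, double b);
};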

The class has no members other than the one inherited from the base class. The last thing we need before we can start calling into CapyPDF functions is an error handler. Because all functions in the API have the exact same form, this can be done with a single macro:
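
Something along these lines, assuming every C function returns an error code and there is a helper for turning that code into a message (again, the capy_* name is a stand-in):

#include <stdexcept>

// Wraps any C API call: a nonzero return value becomes an exception.
#define CAPY_CPP_CHECK(command)                                  \
    do {                                                         \
        auto rc_ = (command);                                    \
        if (rc_ != 0) {                                          \
            throw std::runtime_error(capy_error_message(rc_));   \
        }                                                        \
    } while (false)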

This makes the constructor look like this:
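
In the sketch that amounts to: create the C object, check the result, hand ownership to the base class.

Color::Color() {
    CapyPDF_Color *c = nullptr;
    CAPY_CPP_CHECK(capy_color_new(&c)); // stand-in constructor function
    _d.reset(c);                        // the base class now owns the object
}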

Method calls bring all of these things together.
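
A setter in the sketch looks like this:

void Color::set_rgb(double r, double g, double b) {
    // *this converts implicitly to CapyPDF_Color *, so the wrapper can be
    // passed straight to the C function.
    CAPY_CPP_CHECK(capy_color_set_rgb(*this, r, g, b));
}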

Because the wrapper types are implicitly convertible to the underlying pointer types you can pass them directly to the C API functions. Otherwise the wrapper code would be filled with .get() method calls or the like.

The end result

With all that done, the C code from the beginning of this post can be written like this:
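
In the same illustrative naming as the C sketch above, it becomes a handful of lines with no manual error checking or cleanup:

// Sketch with approximate names; exceptions and destructors replace the
// explicit error checks and cleanup calls of the C version.
capypdf::Generator gen("rectangle.pdf");
capypdf::DrawContext dc = gen.new_page_draw_context();
dc.cmd_re(10, 10, 100, 100); // the rectangle
dc.cmd_f();                  // fill it
gen.add_page(dc);
gen.write();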

Despite using templates, inheritance and stdlib types, the end result can be shown to have zero overhead: the compiler boils the wrapper calls down to the same code you would get by calling the C API directly.

After this what remains is mostly the boring work of typing out all the wrapper calls.

Thursday, September 12, 2024

On M and S Type Processes in Software Development

I wrote a post about so called "M type" and "S type" processes in software development. Unfortunately it discusses the concept of human sexuality. Now, just to be sure, it does not have any of the "good stuff" as the kids might say. Nonetheless this blog is syndicated in places where such topics might be considered controversial or even unacceptable.

Thus I can't really post the text here. Those of you who are of legal age (whatever that means in your jurisdiction) and out of their own free will want to read such material, can access the PDF version of the article via this link.

Wednesday, August 21, 2024

Meson's New Option Setup ‒ The Largest Refactoring

The problem

Meson has had togglable options from almost the very beginning. These split into two camps. The first one is "common options" like optimizations, warning level, language standard version and so on. The second one is "per project" options that are specific to each project, such as which backend to use. For a long time things were quite nice, but as people started using subprojects more and more, the need to configure common options on a per-subproject basis grew ever more important.

Meson added a limited way of setting some options per subproject, but it never really felt like a properly integrated solution. Doing it properly turns out to have a lot of requirements, because you want to be able to:

  • Override any shared option for any subproject
  • Do this at runtime from the command line
  • Unset any override that has been given
  • Convert existing per-project settings to the new override format
  • Provide a UI that is readable and sensible
  • Do all of this without needing to edit subproject build files

The last one of these is important. It means that you can use dependencies directly (e.g. from WrapDB) without any local patches.

What benefits do you get out of it?

The benefits are most easily seen via examples. Let's say you are developing a program that uses a dependency that does heavy number crunching. You need to build that (and only that) subproject with optimizations enabled, otherwise your development experience is intolerably slow. This is done by defining an augment, like so:

meson configure -Acruncher:optimization=2

A stronger version of this would be to compile all subprojects with optimizations but the top level project without them. This is how you'd do it:

meson configure -Doptimization=2 -A:optimization=0

Augments can be unset:

meson configure -Usubproject:option

This scheme permits you to do all sorts of useful things, like disabling -Werror on specific projects, building some subprojects with a different language version (such as gnu99), compiling LGPL deps as shared libraries and everything else as static libraries, and so on.
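
For example, using Meson's standard built-in option names (the subproject names here are placeholders), those could look something like this:

meson configure -Anoisy_dep:werror=false
meson configure -Aold_dep:c_std=gnu99
meson configure -Ddefault_library=static -Algpl_dep:default_library=shared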

Implementing

This is a big internal change. How big? Big! This is the largest refactoring operation I have done in my life. It is big enough that it took me over two years of procrastination before I managed to gather enough strength to start work on this. Pretty much all of my Meson work in the last six months or so has been spent on this one issue. The feature is still not done, but the merge request already has 80 commits and 1700+ new lines and even that is an understatement. I have chopped off bits of the change and merged them on their own. All in all this meant that the schedule for most days of my summer vacation went like this:

  • Wake up
  • Work on Meson refactoring branch until fed up
  • Work on my next book until fed up
  • Maybe do something else
  • Sleep

FTR I don't recommend this style of working for anyone else. Or even for myself. But sometimes you just gotta.

The main reason this change is so complex lies in the architecture. In existing code each build target "knew" the option settings needed for it (options could and can be overridden in build files on a per-target basis). This does not work any more. Instead the code needs one place that encapsulates all option data and provides methods like "what is the value of option X when building target Y in subproject Z". Option code was everywhere, so changing this meant touching the entire code base, and the huge blob of changes had to land in master atomically.

The only thing that made this change even remotely feasible was that Meson has an extensive test suite. The main code changes were done months ago, and all work since then has gone into making existing unit tests pass. They still don't pass, so work continues. Without this test suite there would have been hundreds of regressed projects, people would be angry and everyone would pin their Meson to an old version and refuse to update. These are the sorts of breakages that kill projects dead. So, write tests, even if it does not seem fun. Without them every project will eventually end up at a fork in the road where the choice is between "death by stagnation" and "death by breaking end users". Most projects are not Python 3. They probably won't survive a similar level of breakage.

Refactoring, types and Python

Python is, at the same time, my favourite programming language and very much not my favourite programming language. Python in the small is nice, readable, wonderful and productive. As the project size grows, the lack of static types becomes aggravating and eventually you end up debugging cases like "why is this argument, which should be a dict, an array one time out of 500, seemingly at random". Types make these problems go away and make refactoring easy.

But not always.

For this very specific case the complete lack of types actually made the refactoring easier. Meson currently supports more than one hundred different compilers. I needed to change the way compiler classes work, but I did not know how. Thus I started by just using the GNU C compiler. I could change that (and its base class) as much as I wanted without having to care about any other compiler class. As long as I did not use any other compiler, their code was not called and it did not matter that their method signatures were completely different. In a static language all the type changes would need to be done up front just to make the dang thing compile.

Still, you can have my types when you pry them from my cold, dead fingers. But maybe this is something for language designers of the future to consider. It would be kind of cool to have a strictly typed language where you could add a compiler flag to say "convert all variables into Python style variant dictionaries and make all type checks, method invocations etc. work at runtime". Yes, people would abuse the crap out of this feature, but the same can be said about every new feature.

When will this land?

It is not done yet, so we don't know. At the earliest this will be in the next release, but more likely in the one after that.

If you like trying out new things and living dangerously, you can try the code from this MR. Be sure to post comments on that page if you do.

Saturday, August 10, 2024

Refactoring Python dicts to proper classes

When doing a major refactoring in Meson, I came up with an interesting refactoring technique, which I have not seen before. Some search engining did not find suitable hits. Obviously it is entirely possible that this is a known refactoring, but I don't know its name. In any case, here's my version of it.

The problem

Suppose you have a Python class with the following snippet:

class Something:
    def __init__(self):
        self.store = {}

Basically you have a dictionary as a member variable. This is then used all around the class, which grows and grows. Then you either find a bug in how the dict is used or you want to add some functionality, like (to pick an arbitrary requirement) requiring that all keys for this object that are strings must begin with "s_".

Now you have a problem, because you need to make arbitrary changes all around the code. You can't easily debug this. You can't add a breakpoint inside this specific dictionary's setter function (or maybe Python's debugger can do that, but I don't know how). Reading code that massages dictionaries directly is tricky, because it's all brackets and open code rather than calls to named methods like do_operation_x.

The solution, step one

Create a Python class that looks like this:

class MeaningfulName:
    def __init__(self, *args, **kwargs):
        self.d = dict(*args, **kwargs)

    def __contains__(self, key):
        return key in self.d

    def __getitem__(self, key):
        return self.d[key]

    def __setitem__(self, key, value):
        self.d[key] = value

    ...

Basically you implement all the special methods that do nothing else than forward to the underlying dictionary. Then replace the self.store dictionary with this object. Nothing should have changed. Run tests to make sure. Then commit this to main. Let it sit in the code base for a while in case there are untested code paths that use functionality that you did not write.

Just doing this gives an advantage: it is easy to add breakpoints to methods that mutate the object's state.

Step two

Pick any of the special dunder methods and rename it to something more meaningful. Add validation code if you need it. Run tests. Fix all errors by rewriting the calling code to use the new named method. Some methods might need to be replaced with multiple new methods that do slightly different things. For example you might want to add methods like set_value and update_if_changed.

Step three

Repeat step two until all dunder methods are gone.

Wednesday, July 17, 2024

Why refactoring is harder than you think, a pictorial representation

Suppose you are working on a legacy code base. It looks like this:

Within it you find a piece of functionality that does a single thing, but is implemented in a hideously complicated way. Like so.

You look at that and think: "That's awful. I know how to do it better. Faster. Easier. More maintainable." Then you set out to do just that. And you succeed in producing this thing.

Wow. Look at that slick and smooth niceness. Time to excise the bad implementation from your code base.

The new one is put in:

And now you are don ... oh, my!



Friday, June 21, 2024

Advanced text features and PDF

The basic text model of PDF is quite nice. On the other hand its basic design was a very late 80s "ASCII is everything everyone really needs, but we'll be super generous and provide up to 255 glyphs using a custom encoding that is not in use anywhere else". As you can probably guess, this causes a fair number of problems in the modern world.

To properly understand the text that follows you should know that there are four different ways in which text and letters need to be represented to get things working:

  • Source text is the "original" written text in UTF-8 (typically)
  • Unicode codepoints represent unique Unicode IDs as specified by the Unicode standard
  • A glyph id uniquely specifies a glyph (basically a series of drawing operations); these ids are arbitrary and typically unique to each font
  • ActualText is sort of like an AltText for PDF, but uses UTF-16BE, as was the way of the future in the early 90s

Kerning

The most common advanced typography feature in use is probably kerning, that is, custom spacing between certain letter pairs like "AV" and "To". The PDF text model has native support for kerning and it even supports vertical and horizontal kerning. Unfortunately the way things are set up means that you can only specify horizontal kerning when laying out horizontal text and vertical kerning for vertical text. If your script requires both, you are not going to have a good time.

There are several approaches one can take. The simplest is to convert all text to path drawing operations, which can be placed anywhere with arbitrary precision. This works great for printed documents but also means that document sizes balloon and you can't copypaste text from the document, use screen readers or do any other operation that needs the actual text those shapes represent.

An alternative is to render each glyph as its own text object with exact coordinates. While verbose this works, but since every letter is separate, text selection becomes wonky again. PDF readers seem to have custom heuristics to try to detect these issues and fix text selection in post-processing. Sometimes it works better than at other times.

Everything in PDF drawing operations is based on matrices. Text has its own transform matrix that defines where the next glyph will go. We could specify kerning manually with a custom translation matrix that translates the rendering location by the amount needed. There are two main downsides to this. First of all it would mean that instead of having a stream of glyphs to render, you'd need to define 9 floating point numbers (actually 6 due to reasons) between every pair of glyphs. This would increase the size of your output by a factor of roughly ten. The other downside is that unlike for all other matrices, PDF does not permit you to multiply an existing text state matrix with a new one. You can only replace it completely. So the actual code path would become "tell PDF to draw a glyph, work out what changes it would make to the currently active text matrix, undo that, multiply that matrix with one that has the changes that you wanted to happen and proceed to the next glyph".

Glyph substitution

Most of the time (in most scripts anyway) source text's Unicode codepoints get mapped 1:1 to a font glyph in the final output. Perhaps the most common case where this does not happen is ligatures.

The actual rules for when and how this happens are script, font and language dependent. This is something you do not want to do yourself; instead use a shaping engine like Harfbuzz. If you give it the source text as UTF-8 and a font that has the ffi ligature, it will return a list of four glyph ids in the font to use, the way they map back to the original text, kerning (if any) and all of that good stuff.
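
As a concrete illustration (the word "affine" and the font file name below are my own examples, not from the original post): shaping the six letters of "affine" with a font that has the ffi ligature comes back as four glyphs, each carrying a cluster value that points back into the source string. A minimal Harfbuzz sketch:

// Minimal shaping sketch with Harfbuzz; font file and word are examples.
#include <hb.h>
#include <cstdio>

int main() {
    hb_blob_t *blob = hb_blob_create_from_file("NotoSerif-Regular.ttf");
    hb_face_t *face = hb_face_create(blob, 0);
    hb_font_t *font = hb_font_create(face);

    hb_buffer_t *buf = hb_buffer_create();
    hb_buffer_add_utf8(buf, "affine", -1, 0, -1);
    hb_buffer_guess_segment_properties(buf);
    hb_shape(font, buf, nullptr, 0);

    unsigned int num_glyphs = 0;
    hb_glyph_info_t *info = hb_buffer_get_glyph_infos(buf, &num_glyphs);
    for (unsigned int i = 0; i < num_glyphs; ++i) {
        // After shaping, "codepoint" holds the font's glyph id (not a Unicode
        // codepoint) and "cluster" points back into the source UTF-8 string.
        printf("glyph id %u comes from byte offset %u\n",
               info[i].codepoint, info[i].cluster);
    }

    hb_buffer_destroy(buf);
    hb_font_destroy(font);
    hb_face_destroy(face);
    hb_blob_destroy(blob);
    return 0;
}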

What it won't give you is the information about which ligatures it substituted into your source text. In this example it will tell you the glyph id of the ffi ligature (2132) but not which Unicode codepoint it corresponds to (0xFB03). You need to record that number in the PDF metadata for the text to work properly in copypaste operations. At first this does not seem like such a big problem, because we have access to the original font file and Freetype. You'd think you could just ask Freetype for the Unicode codepoint of a given font glyph, but you can't. There is a function for finding the glyph for a given Unicode codepoint, but not the other way around. The stackoverflow-recommended way of doing this is to iterate over the font's character map until you find the entry that maps to the desired glyph. For extra challenge you need to write an ActualText tag in the PDF command stream so that when users copypaste that text they get the original letters rather than the ffi Unicode codepoint.
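
A sketch of that brute-force lookup with Freetype, done once up front so the linear scan is not repeated for every glyph (this is my own illustration, not CapyPDF code):

// Walk the character map with FT_Get_First_Char/FT_Get_Next_Char and cache
// glyph id -> codepoint.
#include <ft2build.h>
#include FT_FREETYPE_H

#include <cstdint>
#include <unordered_map>

std::unordered_map<uint32_t, uint32_t> build_reverse_cmap(FT_Face face) {
    std::unordered_map<uint32_t, uint32_t> glyph_to_codepoint;
    FT_UInt gid = 0;
    FT_ULong codepoint = FT_Get_First_Char(face, &gid);
    while (gid != 0) {
        // Keep the first codepoint seen for a glyph. Glyphs that are not in
        // the character map at all (such as the small caps forms discussed
        // below) will simply never appear in this map.
        glyph_to_codepoint.emplace((uint32_t)gid, (uint32_t)codepoint);
        codepoint = FT_Get_Next_Char(face, codepoint, &gid);
    }
    return glyph_to_codepoint;
}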

All of this means that glyph lookup is basically an O(n^2) operation, when it is possible to do at all. Sometimes it isn't, as we shall now find out.

Alternate forms

OpenType fonts can have multiple different glyphs for the same Unicode codepoint, for example the small caps versions of Noto Serif look like this.

These are proper hand-drawn versions of the glyphs, not ones obtained by scaling down upper case letters. Using these is simple: you tell Harfbuzz to use the small caps versions when shaping and it does everything for you. For this particular font the upper case small caps glyphs are the same as the regular upper case glyphs. The lower case ones have their own glyphs in the font. However, according to Freetype at least, those glyphs are not mapped to any Unicode codepoint. Conceptually a small caps lower case "m" should be the same as a regular lower case "m". For some reason it is not and, unless I'm missing something, there is no API that can tell you that. The only way to do it "properly" is to track this yourself based on your input text and requirements.

How does CapyPDF handle all this?

In the same way pretty much all PDF generator libraries do: by ignoring all of it. CapyPDF only provides the means to express the underlying PDF functionality. It is the responsibility of the client application to form glyph sequences and related PDF metadata in the way that makes sense for their application and document structure.

Tuesday, May 14, 2024

Generative non-AI

In last week's episode of the Game Scoop podcast an idea was floated that modern computer game names are uninspiring and that better ones could be made by picking random words from existing NES titles. This felt like a fun programming challenge so I went and implemented it. Code and examples can be found in this GH repo.

Most of the game names created in this way are word salad gobbledygook or literally translated obscure anime titles (Prince Turtles Blaster Family). Running it a few times does give results that are actually quite interesting. They range from games that really should exist (Operation Metroid) to the surprisingly reasonable (Gumshoe Foreman's Marble Stadium), to ones that actually made me laugh out loud (Punch-Out! Kids). Here's a list of some of my favourites:

  • Ice Space Piano
  • Castelian Devil Rainbow Bros.
  • The Lost Dinosaur Icarus
  • Mighty Hoops, Mighty Rivals
  • Rad Yoshi G
  • Snake Hammerin'
  • MD Totally Heavy
  • Disney's Die! Connors
  • Monopoly Ransom Manta Caper!
  • Revenge Marble
  • Kung-Fu Hogan's F-15
  • Sinister P.O.W.
  • Duck Combat Baseball

I emailed my findings back to the podcast host and they actually discussed it in this week's show (video here starting at approximately 35 minutes). All in all this was an interesting exercise. However pretty quickly after finishing the project I realized that doing things yourself is no longer what the cool kids are doing. Instead this is the sort of thing that is seemingly tailor-made for AI. All you have to do is to type in a prompt like "create 10 new titles for video games by only taking words from existing NES games" and post that to tiktokstagram.

I tried that and the results were absolute garbage. Since the prompt has to have the words "video game" and "NES", and LLMs work solely on the basis of "what is the most common thing (i.e. popular)", the output consists almost entirely of the most well known NES titles with maybe some words swapped. I tried to guide it by telling it to use "more random" words. The end result was a list of ten games of which eight were alliterative. So much for randomness.

But more importantly every single one of the recommendations the LLM created was boring. Uninspired. Bland. Waste of electricity, basically.

Thus we find that creating a list of game names with an LLM is easy but the end result is worthless and unusable. Doing the same task by hand took a bit more effort, but the end result was miles better, because it found new and interesting combinations that a "popularity first" estimator seems unable to match. Which matches the preconception I had about LLMs from prior tests and from seeing how other people have used them.