Tuesday, April 21, 2026

CapyPDF is approaching feature sufficiency

In the past I have written many blog posts on implementing various PDF features in CapyPDF. Typically they explain the feature being implemented, how confusing the documentation is, what perverse undocumented quirks one has to work around to get things working and so on. To save the effort of me writing and you reading yet another post of the same type, let me just say that you can now use CapyPDF to generate PDF forms that have widgets like text fields and radio buttons.

What makes this post special is that forms and widget annotations were pretty much the last major missing PDF feature Does that mean that it supports everything? No. Of course not. There is a whole bunch of subtlety to consider. Let's start with the fact that the PDF spec is massive, close to 1000 pages. Among its pages are features that are either not used or have been replaced by other features and deprecated.

The implementation principle of CapyPDF thus far has been "implement everything that needs special tracking, but only to the minimal level needed". This seems complicated but is in fact quite simple. As an example the PDF spec defines over 20 different kinds of annotations. Specifying them requires tracking each one and writing out appropriate entries in the document metadata structures. However once you have implemented that for one annotation type, the same code will work for all annotation types. Thus CapyPDF has only implemented a few of the most common annotations and the rest can be added later when someone actually needs them.

Many objects have lots of configuration options which are defined by adding keys and values to existing dictionaries. Again, only the most common ones are implemented, the rest are mostly a matter of adding functions to set those keys. There is no cross-referencing code that needs to be updated or so on. If nobody ever needs to specify the color with which a trim box should be drawn in a prepress preview application, there's no point in spending effort to make it happen.

The API should be mostly done, especially for drawing operations. The API for widgets probably needs to change. Especially since form submission actions are not done. I don't know if anything actually uses those, though. That work can be done based on user feedback.

Friday, April 17, 2026

Multi merge sort, or when optimizations aren't

In our previous episode we wrote a merge sort implementation that runs a bit faster than the one in stdlibc++. The question then becomes, could it be made even faster. If you go through the relevant literature one potential improvement is to do a multiway merge. That is, instead of merging two arrays into one, you merge four into one using, for example, a priority queue.

This seems like a slam dunk for performance.

  • Doubling the number of arrays to merge at a time halves the number of total passes needed
  • The priority queue has a known static maximum size, so it can be put on the stack, which is guaranteed to be in the cache all the time
  • Processing an element takes only log(#lists) comparisons
Implementing multimerge was conceptually straightforward but getting all the gritty details right took a fair bit of time. Once I got it working the end result was slower. And not by a little, either, but more than 30% slower. Trying some optimizations made it a bit faster but not noticeably so.

Why is this so? Maybe there are bugs that cause it to do extra work? Assuming that is not the case, what actually is? Measuring seems to indicate that a notable fraction of the runtime is spent in the priority queue code. Beyond that I got very little to nothing.

The best hypotheses I could come up with has to with the number of comparisons made. A classical merge sort does two if statements per output elements. One to determine which of the two lists has a smaller element at the front and one to see whether removing the element exhausted the list. The former is basically random and the latter is always false except when the last element is processed. This amounts to 0.5 mispredicted branches per element per round.

A priority queue has to do a bunch more work to preserve the heap property. The first iteration needs to check the root and its two children. That's three comparisons for value and two checks whether the children actually exist. Those are much less predictable than the comparisons in merge sort. Computers are really efficient at doing simple things, so it may be that the additional bookkeeping is so expensive that it negates the advantage of fewer rounds.

Or maybe it's something else. Who's to say? Certainly not me. If someone wants to play with the code, the implementation is here. I'll probably delete it at some point as it does not have really any advantage over the regular merge sort.

Monday, April 6, 2026

Sorting performance rabbit hole

In an earlier blog post we found out that Pystd's simple sorting algorithm implementations were 5-10% slower than their stdlibc++ counterparts. The obvious follow up nerd snipe is to ask "can we make the Pystd implementation faster than stdlibc++?"

For all tests below the data set used was 10 million consecutive 64 bit integers shuffled in a random order. The order was the same for all algorithms.

Stable sort

It turns out that the answer for stable sorting is "yes, surprisingly easily". I made a few obvious tweaks (whose details I don't even remember any more) and got the runtime down to 0.86 seconds. This is approximately 5% faster than std::stable_sort. Done. Onwards to unstable sort.

Unstable sort

This one was not, as they say, a picnic. I suspect that stdlib developers have spent more time optimizing std::sort than std::stable_sort simply because it is used a lot more.

After all the improvements I could think of were done, Pystd's implementation was consistently 5-10% percent slower. At this point I started cheating and examined how stdlibc++'s implementation worked to see if there are any optimization ideas to steal. Indeed there were, but they did not help.

Pystd's insertion sort moves elements by pairwise swaps. Stdlibc++ does it by moving the last item to a temporary, shifting the array elements onwards and then moving the stored item to its final location. I implemented that. It made things slower.

Stdlibc++'s moves use memmove instead of copying (at least according to code comments).  I implemented that. It made things slower.

Then I implemented shell sort to see if it made things faster. It didn't. It made them a lot slower. So did radix sort.

Then I reworked the way pivot selection is done and realized that if you do it in a specific way, some elements move to their correct partitions as a side effect of median selection. I implemented that and it did not make things faster. It did not make them slower, either, but the end result should be more resistant against bad pivot selection so I left it in.

At some point the implementation grew a bug which only appeared with very large data sets. For debugging purposes I reduce the limit where introsort switches from qsort to insertion sort from 16 to 8. I got the bug fixed but the change made sorting a lot slower. As it should.

But this raises a question, namely would increasing the limit from 16 to 32 make things faster? It turns out that it did. A lot. Out of all perf improvements I implemented, this was the one that yielded the biggest improvement by a fairly wide margin. Going to 64 elements made it even faster, but that made other algorithms using insertion sort slower, so 32 it is. For now at least.

After a few final tweaks I managed to finally beat stdlibc++. By how much you ask? Pystd's best observed time was 0.754 seconds while stdlibc++'s was 0.755 seconds. And it happened only once. But that's enough for me.