Tuesday, December 17, 2024

Meson build definitions merged into Git's git repo

The developers of Git have been considering switchibg build systems for a while. No definitive decision have been made as of yet, but they gave merged Meson build definitions in the main branch. Thus it now possible, and even semi-supported, to develop and build Git with Meson instead of the vintage Makefile setup (which, AFAICT, remains as the default build system for now).

The most interesting thing about this conversion is that the devs were very thorough in their evaluation of all the different possibilities. Those who are interested in the details or are possibly contemplating a build system switch on their own are recommended to read the merge's commit message.

Huge congratulations for everyone involved and thank you for putting in the work (FTR i did not work on this myself). 

Friday, December 13, 2024

CMYK me baby one more time!

Did you know that Jpeg supports images in the CMYK colorspace? And that people are actually using them in the wild? This being the case I needed to add support to them into CapyPDF. The development steps are quite simple, first you create a CMYK Jpeg file, then you create a test document that embeds it and finally look at the result in a PDF renderer.

Off to a painter application then. This is what the test image looks like.

Then we update the Jpeg parsing code to detect cmyk images and write the corresponding metadata to the output PDF. What does then end result look like then?

Aaaaand now we have a problem. Specifically one of an arbitrary color remapping. It might seem this is just a case of inverted colors. It's not (I checked), something weirder is going on. For reference Acrobat Reader's output looks identical.

At this point rather than poke things at random and hoping for the best, a good strategy is to get more test data. Since Scribus is pretty much a gold standard on print quality PDF production I went about recreating the test document in it.

Which failed immediately on loading the image.

Here we have Gwenview and Scribus presenting their interpretations of the exact same image. If you use Scribus to generate a PDF, it will convert the Jpeg into some three channel (i.e. RGB) ICC profile.

Take-home exercise

Where is the bug (or a hole in the spec) in this case:

  • The original CMYK jpeg is correct, but Scribus and PDF renderers read it in incorrectly?
  • The original image is incorrect and Gwenview has a separate inverse bug that cancel each other out?
  • The image is correct but the metadata written in the file by CapyPDF is incorrect?
  • The PDF spec has a big chunk of UB here and the final result can be anything?
  • Aliens?
I don't know the correct answer. If someone out there does, do let me know.

Thursday, December 5, 2024

Compiler daemon thought experiment

According to information I have picked up somewhere (but can't properly confirm via web searches ATM)  there was a compiler in the 90s (the IBM VisualAge compiler maybe?) which had a special caching daemon mode. The basic idea was that you would send your code to that process and then it could return cached compile results without needing to reparse and reprocess same bits of code over and over. A sort of an in-compiler CCache, if you will. These compilers no longer seem to exist, probably because you can't just send snippets of code to be compiled, you have to send the entire set of code up to the point you want to compile. If it is different, for example because some headers are included in a different order, the results can not be reused. You have to send everything over and at that point it becomes distcc.

I was thinking about this some time ago (do not ask why, I don't know) and while this approach does not work in the general case, maybe it could be made to work for a common special case. However I am not a compiler developer so I have no idea if the following idea could work or not. But maybe someone skilled in the art might want to try this or maybe some university professor could make their students test the approach for course credit.

The basic idea is quite simple. Rather than trying to cache compiler internal state to disk somehow, persist it in a process without even attempting to be general.

The steps to take

Create a C++ project with a dozen source files or so. Each of those sources include some random set of std headers and have a single method that does something simple like returns the sum of its arguments. What they do is irrelevant, they just have to be slow to compile.

Create a PCH file that has all the std headers used in the source files. Compile that to a file.

Start compiling the actual sources one by one. Do not use parallelism to emphasize the time difference.

When the first compilation starts, read the PCH file contents into memory in the usual way. Then fork the process. One of the processes carries on compiling as usual. The second process opens a port and waits for connections, this process is the zygote server process.

When subsequent compilations are done, they connect to the port opened by the zygote process, send the compilation flags over the socket and wait for the server process to finish.

The zygote process reads the command line arguments over the socket and then forks itself. One process starts waiting on the socket again whereas the other compiles code according to the command line arguments it was given.

The performance boost comes from the fact that the zygote process already has stdlib headers in memory in compiler native data structures. In the optimal case loading the PCH file takes effectively zero time. What makes this work (in this test at least) is that the PCH file is the same for all compilations and it is the first thing the compiler starts processing. Thus it is always the same for all compilations. Conceptually at least, the actual compiler might do something else. There may be a dozen other reasons it might not work.

If someone tries this out, do let us know whether it actually worked.

Friday, November 8, 2024

PDF/AAAARGH

Note: the PDF/A specification is not freely available so everything here is based on reverse engineering. It might be complete bunk.

There are many different "subspecies" of PDF. The most common are PDF/X and PDF/A. CapyPDF can already do PDF/X, so I figured it's time to look into PDF/A. Like, how much worse could it possibly be?

Specifying that a PDF file is PDF/X is straightforward. Each PDF has a Catalog dictionary that defines properties of the document. All you need to do is to add an OutputIntent dictionary and link it to the Catalog. The dictionary has a key that specifies the subtype. Setting that to /GTS_PDFX does the trick. There are many different versions of PDF/X so you need to define that as well. A simple solution would be to have a second key in that dictionary for specifying the subtype. Half of that expectation is correct. There is indeed a key you can set, but it is in a completely different part of the object tree called the Information dictionary. It's a bit weird but you implement it once and then forget it.

PDF/A has four different versions, namely 1, 2, 3, 4 and each of these have several conformance levels that are specified with a single letter. Thus the way you specify that the file is a PDF/A document is that you write the value /GTS_PDFA1 to the intent dictionary. Yes. regardless of which version of PDF/A you want, this dictionary will say it is PDFA1.

What would be the mechanism, then, to specify the sub version:

  1. In the Information dictionary, just like with PDF/X?
  2. In some other PDF object dictionary?
  3. In a standalone PDF object that is in fact an embedded XML document?
  4. Something even worse?
Depending on your interpretation, the correct answer is either 3 or 4. Here is the XML file in question as generated by LibreOffice. The payload parts are marked with red arrows.

The other bits are just document metadata replicated. PDF version 2.0 has gone even further and deprecated storing PDF metadata in PDF's own data structures. The sructures that have been designed specifically for PDF documents, which all PDF processing software already know how to handle and which tens of billions (?) of documents already use and which can thus never be removed? Those ones. As Sun Tzu famously said:

A man with one metadata block in his file format always knows what his document is called.

A man with two can never be sure. 

Thus far we have only been at level 3. So what more could possibly be added to this to make it even worse?

Spaces.

Yes, indeed. The screen shot does not show it, but the recommend way to use this specific XML format is to add a whole lot of whitespace below the XML snippet so it can be edited in place later if needed. This is highly suspicious for PDF/A for two main reasons. First of all PDF/A is meant for archiving usage. Documents in it should not be edited afterwards. That is the entire point. Secondly, the PDF file format already has a way of replacing objects with newer versions.

The practical outcome of all this is that every single PDF/A document has approximately 5 kilobytes of fluff to represent two bytes of actual information. Said object can not even be compressed because the RDF document must be stored uncompressed to be editable. Even though in PDF/A documents it will never be edited.

Wednesday, October 30, 2024

Happenings at work

A few months ago this happened.

Which, for those of you not up to date on your 1960s British television, is to say that I've resigned. I'm currently enjoying the unemployed life style. No, that is not me being cheeky or ironic. I'm actually enjoying being able to focus on my own free time projects and sleeping late.

Since I'm not a millionaire at some point I'll probably have to get a job again. But not for at least six months. Maybe more, maybe less, we'll see what happens.

This should not affect Meson users in any significant way. I plan to spend some time to work on some fundamental issues in the code base to make things better all round. But the most important thing for now is to land the option refactor monster.

Sunday, October 6, 2024

CapyPDF 0.12.0 released

I have just made the 0.12 release of CapyPDF. It does not really have new features, but the API has been overhauled. It is almost guaranteed that no code developed against 0.11 will work without code changes. Such is the joy of not having any users.

Experimental C++ wrapper

CapyPDF has a plain C API. This makes it stable and easy to use from any programming language. That also makes it cumbersome to use. Here is what you need to write to create a PDF file that has a single rectangle:

Given that this is C, you can't really do better. However I was asked if I could create something more ergonomic for C++ users. This seemed like an interesting challenge, so I did some experimentation, which ships with the 0.12 release.

The requirements

The C++ wrapper should fulfill the following requirements:

  • Fully type safe
  • No manual memory management
  • Ideally a single header
  • Zero overhead
  • Fast to compile
  • All objects are move-only
  • IDE code completion friendly
  • Does not need to maintain API or ABI stability (those who need that have to use the C API anyway)

The base

After trying a bunch of different things I eventually came up with this design:

Basically it just stores an underlying CapyPDF object type and a helper class used to deallocate it. The operators merely mean that the object can be cast to a pointer of the underlying type. Typically you want conversion operators to be explicit but these are not, because being implicit removes a ton of boilerplate code.

For example let's look what the Color object wrapper looks like. First you need the deleter:

The class definition itself is simple:

The class has no other members than the one from the base class. The last thing we need before we can start calling into CapyPDF functions is an error handler. Because all functions in the API have the exact same form, this can be done with a single macro:

This makes the constructor look like this:

Method calls bring all of these things together.

Because the wrapper types are implicitly convertible to the underlying pointer types you can pass them directly to the C API functions. Otherwise the wrapper code would be filled with .get() method calls or the like.

The end result

With all that done, the C code from the beginning of this post can be written like this:

Despite using templates, inheritance and stdlib types the end result can be proven to have zero overhead:

After this what remains is mostly the boring work of typing out all the wrapper calls.

Thursday, September 12, 2024

On M and S Type Processes in Software Development

I wrote a post about so called "M type" and "S type" processes in software development. Unfortunately it discusses the concept of human sexuality. Now, just to be sure, it does not have any of the "good stuff" as the kids might say. Nonetheless this blog is syndicated in places where such topics might be considered controversial or even unacceptable.

Thus I can't really post the text here. Those of you who are of legal age (whatever that means in your jurisdiction) and out of their own free will want to read such material, can access the PDF version of the article via this link.