Monday, December 23, 2024

CapyPDF 0.14 is out

I have just released version 0.14 of CapyPDF. This release has a ton of new functionality. So much, in fact, that I don't even remember it all. The reason for this is that CapyPDF is actually starting to see real world usage, specifically as the new color managed PDF exporter for Inkscape. This has required a lot of refactoring work in the color code of Inkscape proper. That work has been done mostly by Doctormo, who has several videos on the issue.

The development cycle has consisted mostly of him reporting missing features like "specifying page labels is not supported", "patterns can be used for fill, but not for stroke" and "loading CMYK TIFF images with embedded color profiles does not work", and me then implementing said features or finding out how setjmp/longjmp actually works and debugging corrupted stack traces when it doesn't.

Major change coming in the next version

The API for CapyPDF is not stable, but in the next release it will be extra unstable. The reason is C strings. Null terminated UTF-8 strings are a natural text format for PDF, as strings in PDF must not contain the zero glyph. Thus there are many functions like this in the public C API:

void do_something(const char *text);

This works and is simple, but there is a common use case it can't handle. All strings must be zero terminated, so you can't point to the middle of an existing buffer, because it is not guaranteed to be zero terminated. Thus you always have to make a copy of the text you want to pass. In other words, you can't use C++'s string_view (or any equivalent string type) as a source of text data. The public API should support this use case.

Is this premature optimization? Maybe. But it is also a usability issue, as string views seem to be fairly common nowadays. There does not seem to be a perfect solution, but the best one I managed to crib seems to be to do this:

void do_something(const char *text, int32_t len_or_negative);

If the last argument is positive, use it as the length of the buffer. If it is negative, then treat the char data as a zero terminated plain string. This requires changing all functions that take strings and makes the API more unpleasant to use.
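As a minimal sketch of how the length-or-negative convention could be consumed, here is one possible C++ shape. The helper `to_view` and the `std::string_view` overload are hypothetical and not part of CapyPDF's API, `do_something` is only a stand-in that reports how many bytes it received, and treating a length of zero as an empty string is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>

// Hypothetical helper (not a CapyPDF function): converts the
// (pointer, length-or-negative) pair into a view without copying.
inline std::string_view to_view(const char *text, int32_t len_or_negative) {
    assert(text != nullptr || len_or_negative == 0);
    if (len_or_negative < 0) {
        // Negative: the data is a plain zero terminated string.
        return std::string_view(text, std::strlen(text));
    }
    // Otherwise use exactly that many bytes. Zero is treated as an
    // empty string here, which is an assumption of this sketch.
    return std::string_view(text, static_cast<std::size_t>(len_or_negative));
}

// Stand-in for a C API function. It only reports how many bytes of
// text it was given so that the behavior is easy to observe.
inline std::size_t do_something(const char *text, int32_t len_or_negative) {
    return to_view(text, len_or_negative).size();
}

// Hypothetical C++ convenience overload: a string_view can be passed
// through without making a zero terminated copy first.
inline std::size_t do_something(std::string_view text) {
    return do_something(text.data(), static_cast<int32_t>(text.size()));
}
```

With this shape a C caller keeps passing a plain string and -1, while a caller holding a view into the middle of a larger buffer passes the pointer and the length.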

If someone has an idea for a better API, do post a comment here.

Tuesday, December 17, 2024

Meson build definitions merged into Git's git repo

The developers of Git have been considering switching build systems for a while. No definitive decision has been made as of yet, but they have merged Meson build definitions in the main branch. Thus it is now possible, and even semi-supported, to develop and build Git with Meson instead of the vintage Makefile setup (which, AFAICT, remains as the default build system for now).

The most interesting thing about this conversion is that the devs were very thorough in their evaluation of all the different possibilities. Those who are interested in the details or are possibly contemplating a build system switch on their own are recommended to read the merge's commit message.

Huge congratulations to everyone involved and thank you for putting in the work (FTR I did not work on this myself).

Friday, December 13, 2024

CMYK me baby one more time!

Did you know that Jpeg supports images in the CMYK colorspace? And that people are actually using them in the wild? This being the case I needed to add support for them to CapyPDF. The development steps are quite simple: first you create a CMYK Jpeg file, then you create a test document that embeds it and finally you look at the result in a PDF renderer.

Off to a painter application then. This is what the test image looks like.

Then we update the Jpeg parsing code to detect CMYK images and write the corresponding metadata to the output PDF. What does the end result look like then?
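For illustration, one way to detect such files is to walk the Jpeg marker segments and read the component count from the first start-of-frame segment. The sketch below is not CapyPDF's actual parser, and the function names are made up for this example. It also simplifies things by taking four components to mean CMYK, even though a four component Jpeg could also be stored as YCCK.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Returns the component count declared in the first start-of-frame
// (SOF) segment, or nullopt if the data does not look like a Jpeg.
inline std::optional<int> jpeg_component_count(const std::vector<uint8_t> &d) {
    // Every Jpeg stream starts with the SOI marker FF D8.
    if (d.size() < 4 || d[0] != 0xFF || d[1] != 0xD8) {
        return std::nullopt;
    }
    std::size_t pos = 2;
    while (pos + 4 <= d.size()) {
        if (d[pos] != 0xFF) {
            return std::nullopt; // Not at a marker, give up.
        }
        const uint8_t marker = d[pos + 1];
        if (marker == 0xFF) {
            ++pos; // Fill byte before the real marker.
            continue;
        }
        if (marker == 0x01 || (marker >= 0xD0 && marker <= 0xD8)) {
            pos += 2; // Standalone markers carry no length field.
            continue;
        }
        if (marker == 0xD9 || marker == 0xDA) {
            return std::nullopt; // Scan data or end reached without a SOF.
        }
        // The segment length is big endian and includes its own two bytes.
        const std::size_t seglen =
            (static_cast<std::size_t>(d[pos + 2]) << 8) | d[pos + 3];
        if (seglen < 2 || pos + 2 + seglen > d.size()) {
            return std::nullopt;
        }
        const bool is_sof = marker >= 0xC0 && marker <= 0xCF &&
                            marker != 0xC4 && marker != 0xC8 && marker != 0xCC;
        if (is_sof) {
            if (seglen < 8) {
                return std::nullopt;
            }
            // Layout after the length field: precision (1 byte),
            // height (2), width (2), number of components (1).
            assert(pos + 9 < d.size());
            return static_cast<int>(d[pos + 9]);
        }
        pos += 2 + seglen;
    }
    return std::nullopt;
}

// Simplification of this sketch: four components are taken to mean CMYK.
inline bool looks_like_cmyk_jpeg(const std::vector<uint8_t> &d) {
    return jpeg_component_count(d) == 4;
}
```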

Aaaaand now we have a problem. Specifically one of an arbitrary color remapping. It might seem this is just a case of inverted colors. It's not (I checked), something weirder is going on. For reference Acrobat Reader's output looks identical.

At this point, rather than poking things at random and hoping for the best, a good strategy is to get more test data. Since Scribus is pretty much a gold standard on print quality PDF production I went about recreating the test document in it.

Which failed immediately on loading the image.

Here we have Gwenview and Scribus presenting their interpretations of the exact same image. If you use Scribus to generate a PDF, it will convert the Jpeg into some three channel (i.e. RGB) ICC profile.

Take-home exercise

Where is the bug (or a hole in the spec) in this case:

  • The original CMYK jpeg is correct, but Scribus and PDF renderers read it in incorrectly?
  • The original image is incorrect and Gwenview has a separate inverse bug, so the two cancel each other out?
  • The image is correct but the metadata written in the file by CapyPDF is incorrect?
  • The PDF spec has a big chunk of UB here and the final result can be anything?
  • Aliens?
I don't know the correct answer. If someone out there does, do let me know.

Thursday, December 5, 2024

Compiler daemon thought experiment

According to information I have picked up somewhere (but can't properly confirm via web searches ATM) there was a compiler in the 90s (the IBM VisualAge compiler maybe?) which had a special caching daemon mode. The basic idea was that you would send your code to that process and then it could return cached compile results without needing to reparse and reprocess the same bits of code over and over. A sort of an in-compiler CCache, if you will. These compilers no longer seem to exist, probably because you can't just send snippets of code to be compiled, you have to send the entire set of code up to the point you want to compile. If it is different, for example because some headers are included in a different order, the results cannot be reused. You have to send everything over and at that point it becomes distcc.

I was thinking about this some time ago (do not ask why, I don't know) and while this approach does not work in the general case, maybe it could be made to work for a common special case. However I am not a compiler developer so I have no idea if the following idea could work or not. But maybe someone skilled in the art might want to try this or maybe some university professor could make their students test the approach for course credit.

The basic idea is quite simple. Rather than trying to cache compiler internal state to disk somehow, persist it in a process without even attempting to be general.

The steps to take

Create a C++ project with a dozen source files or so. Each of those sources includes some random set of std headers and has a single method that does something simple, like returning the sum of its arguments. What they do is irrelevant, they just have to be slow to compile.

Create a PCH file that has all the std headers used in the source files. Compile that to a file.

Start compiling the actual sources one by one. Do not use parallelism to emphasize the time difference.

When the first compilation starts, read the PCH file contents into memory in the usual way. Then fork the process. One of the processes carries on compiling as usual. The second process opens a port and waits for connections, this process is the zygote server process.

When subsequent compilations are done, they connect to the port opened by the zygote process, send the compilation flags over the socket and wait for the server process to finish.

The zygote process reads the command line arguments over the socket and then forks itself. One process starts waiting on the socket again whereas the other compiles code according to the command line arguments it was given.

The performance boost comes from the fact that the zygote process already has the stdlib headers in memory in compiler native data structures. In the optimal case loading the PCH file takes effectively zero time. What makes this work (in this test at least) is that the PCH file is the same for all compilations and it is the first thing the compiler starts processing, so the preloaded state is valid for every one of them. Conceptually at least. The actual compiler might do something else, and there may be a dozen other reasons it might not work.
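The process structure in the steps above can be sketched with plain POSIX calls. This is only a toy under stated assumptions, not a compiler patch: `PreloadedState` and `fake_compile` are made-up stand-ins for the loaded PCH data and the real compilation, the zygote listens on a loopback TCP port picked by the kernel, the flags are assumed to arrive in a single read, and the "quit" message used to shut the zygote down is an invention of this example.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>
#include <csignal>
#include <cstdint>
#include <string>

// Stand-in for the compiler state built by loading the PCH file.
struct PreloadedState { std::string pch_contents; };

// Stand-in for an actual compilation that uses the preloaded state.
inline std::string fake_compile(const PreloadedState &st, const std::string &flags) {
    return "compiled [" + flags + "] using " +
           std::to_string(st.pch_contents.size()) + " bytes of PCH state";
}

inline sockaddr_in loopback_address(uint16_t port) {
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);
    return addr;
}

// Forks off a zygote that serves requests on a loopback port picked by
// the kernel. Returns that port to the caller, or 0 on failure.
inline uint16_t start_zygote(const PreloadedState &st) {
    const int srv = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr = loopback_address(0);
    socklen_t len = sizeof(addr);
    if (srv < 0 ||
        bind(srv, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) != 0 ||
        listen(srv, 16) != 0 ||
        getsockname(srv, reinterpret_cast<sockaddr *>(&addr), &len) != 0) {
        close(srv);
        return 0;
    }
    const pid_t pid = fork();
    if (pid != 0) {
        // Original process (or fork failure): carry on as usual.
        close(srv);
        return pid > 0 ? ntohs(addr.sin_port) : 0;
    }
    // Zygote: the state in `st` is already in this process's memory.
    signal(SIGCHLD, SIG_IGN); // Let the kernel reap finished workers.
    for (;;) {
        const int conn = accept(srv, nullptr, nullptr);
        if (conn < 0) {
            _exit(1);
        }
        // Simplification: assume the flags arrive in a single read.
        char buf[4096];
        const ssize_t n = read(conn, buf, sizeof(buf));
        const std::string flags(buf, n > 0 ? static_cast<size_t>(n) : 0);
        if (flags == "quit") {
            close(conn);
            _exit(0);
        }
        if (fork() == 0) {
            // Worker: inherits the preloaded state, does one job, exits.
            close(srv);
            const std::string reply = fake_compile(st, flags);
            const bool ok = write(conn, reply.data(), reply.size()) ==
                            static_cast<ssize_t>(reply.size());
            close(conn);
            _exit(ok ? 0 : 1);
        }
        close(conn); // The zygote goes straight back to waiting.
    }
}

// Client side: send the flags, then wait until the worker is done.
inline std::string request_compile(uint16_t port, const std::string &flags) {
    assert(port != 0);
    const int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return {};
    }
    sockaddr_in addr = loopback_address(port);
    std::string reply;
    if (connect(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) == 0 &&
        write(fd, flags.data(), flags.size()) == static_cast<ssize_t>(flags.size())) {
        char buf[4096];
        ssize_t n;
        while ((n = read(fd, buf, sizeof(buf))) > 0) {
            reply.append(buf, static_cast<size_t>(n));
        }
    }
    close(fd);
    return reply;
}
```

The first compilation would call `start_zygote` right after loading the PCH and then continue normally, while every later compilation would only call `request_compile` with its flags. Whether a real compiler's internal state survives a fork in a usable form is exactly the open question of the experiment.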

If someone tries this out, do let us know whether it actually worked.