Wednesday, May 24, 2023

Advanced dependency management and building Python wheels with Meson

One of the most complex pieces of developing C and C++ programs (and most other languages) is dependency management. When developing A4PDF I have used Ubuntu's default distro dependencies. This is very convenient because you typically don't need to fiddle with getting them built and they are battle tested and almost always work.

Unfortunately you can't use those in most dev environments, especially Windows. So let's see how much work it takes to build the whole thing on Windows using only Visual Studio and to bundle the whole thing into a Python wheel so it can be installed and distributed. I would have wanted to also put it in PyPI but currently there is a lockdown caused by spammers so no go on that front.

Seems like a lot of effort? Let's start by listing all the dependencies:

  • fmt
  • Freetype
  • LibTIFF
  • LibPNG
  • LibJPEG (turbo)
  • LittleCMS2
  • Zlib

These are all available via WrapDB, so each one can be installed by executing a command like the following:

meson wrap install fmt

With that done Meson will automatically download and compile the dependencies from source. No changes need to be made in the main project's files. Linux builds will keep using system deps as if nothing happened.
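Repeating that command for every dependency is easy to script. The following is only a sketch: the WrapDB package names in WRAPS are my assumption of how the libraries listed above are spelled there, so check them with `meson wrap list` first. It prints the commands instead of running them.

```python
import shlex

# Assumed WrapDB names for the seven dependencies listed above.
WRAPS = ["fmt", "freetype2", "libtiff", "libpng", "libjpeg-turbo", "lcms2", "zlib"]


def wrap_install_commands(wraps):
    """Return one `meson wrap install` argument list per dependency."""
    return [["meson", "wrap", "install", name] for name in wraps]


for command in wrap_install_commands(WRAPS):
    # Printed rather than executed so the sketch has no side effects.
    # To run them for real, pass each list to subprocess.run(..., check=True)
    # from the project's source root.
    print(shlex.join(command))
```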

Next we need to build a Python extension package. This is different from a Python extension module, as the project uses ctypes for Python <-> C interop. Fortunately thanks to the contributors of Meson-Python this comes down to writing an 18 line toml file. Everything else is automatically handled for you. Installing the package is then a question of running this command:

pip install .

After a minute or two of compilation the module is installed. Here is a screenshot of the built libraries in the system Python's site-packages.

Now we can open a fresh terminal and start using the module.
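For readers who have not used ctypes: it binds to a plain shared library at runtime, so no compiled extension module is needed. The sketch below is not A4PDF's binding code. It uses `ctypes.pythonapi`, the interpreter's own C library, as a stand-in because it is always available. A real package would load its bundled library with `ctypes.CDLL` and declare its functions the same way.

```python
import ctypes
import sys

# Stand-in library: the C API of the running interpreter.
get_version = ctypes.pythonapi.Py_GetVersion
get_version.argtypes = []                 # const char *Py_GetVersion(void)
get_version.restype = ctypes.c_char_p     # converted to Python bytes on return

version = get_version().decode("utf-8")
print(version.split()[0])
```

Declaring `argtypes` and `restype` up front is what lets ctypes convert values correctly. Without them every return value is treated as a C int.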

Random things of note

  • Everything here uses only Meson. There are no external dependency managers, unix userland emulators or special terminals that you have to use.
  • In theory this could work on macOS too, but the code is implemented in C++23 and Apple's toolchain is too old to support it.
  • The build definitions for A4PDF take only 155 lines (most of which is source and program name listings).
  • If, for whatever reason, you can't use WrapDB, you can host your own.

Saturday, May 20, 2023

Annotated PDF, HTML, exporters

If one were to compare PDF to HTML, one interesting thing that comes up fairly quickly is that their evolution has been the exact opposite of each other.

HTML was originally about structure, with its h1 and p and ul tags and the like. Given this structure the web browser was then (mostly) free to lay out the text however it saw fit given the current screen size and browser window orientation. As usage grew the appearance of pages became more and more important and thus they had to invent a whole new syntax for specifying how the semantic content should be laid out visually. This eventually became CSS and the goal became, roughly, "pixel perfect layout everywhere".

PDF was originally just a page description language, basically a sequence of vector and raster paint operations. Its main goal was perfect visual fidelity as one of the main use cases was professional printing. The document had no understanding of what it contained. You either read it as it was on the monitor or, more often, printed it out on paper and then read it. As PDF's popularity grew, new requirements appeared. First people wanted to copypaste text from PDF documents (contrary to what you might expect, this is surprisingly tricky to implement). Then they needed to integrate it with screen readers and other accessibility tools, reformat the text to read it on mobile devices and so on. The solution was to create a from-scratch syntax in the PDF data stream alongside existing code to express the semantic contents of the document in a machine-readable format. This became known as tagged PDF, introduced in PDF 1.4.

The way they chose to implement it was to add new commands to the PDF data stream that are used to wrap existing draw operations. For example, drawing a single line of text looks like this:

(Hello, world!) Tj

Adding structural information means writing it like this instead:

/P << /MCID 0 >>
BDC
    (Hello, world!) Tj
EMC

and then adding a bunch of bookkeeping information in document metadata dictionaries. This is a bit tedious, but since you can't really write PDF files by hand, you implement this once in the PDF generation code and then use that.
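The "implement this once" part can be as small as a helper that wraps existing draw operations in a marked-content sequence. This is a minimal sketch, not A4PDF's code. The function name is made up, and it only produces the content stream side (the BDC and EMC operators), not the structure tree bookkeeping that has to go with it.

```python
def wrap_marked_content(tag, mcid, operations):
    """Wrap content stream operations in a BDC ... EMC marked-content sequence."""
    body = "\n".join("    " + line for line in operations)
    return f"/{tag} << /MCID {mcid} >>\nBDC\n{body}\nEMC"


print(wrap_marked_content("P", 0, ["(Hello, world!) Tj"]))
```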

Why so many PDF exporters?

There are many PDF generator libraries. Almost every "serious" program, like Scribus and LibreOffice, has its own bespoke library. This is not just because developers love reimplementing things from scratch. The underlying issue is that PDF as a specification is big, and generating valid PDFs has very specific semantic requirements that require the application developer to understand PDF's document model down to the finest detail.

Existing PDF generator libraries either ignore all of this and just do basic RGB graphics (Cairo) or provide their own document model which they can then convert to PDF constructs (HexaPDF). The latter works nicely for simple cases but it requires you to map your app's internal document structure to the library's document structure so that it then maps to PDF's document structure in the way you need it to. Often it becomes simpler to cut out the middle man and generate PDF directly.

The way A4PDF aims to solve the issue is to stay low level. It merely exposes PDF primitives directly and ensures that the syntax of the generated PDF is correct. It is up to the caller to ensure that the output is semantically correct. This should allow every application to easily and directly map its document output to PDF in just the way it wants (color managed or not, annotated or not and so on).

Thursday, May 11, 2023

The real reason why open source software is better

The prevailing consensus at the current time seems to be that open source software is of higher quality than corresponding proprietary ones. Several reasons have been put forth on why this is. One main reason given is that with open source any programmer in the world can inspect the code and contribute fixes. Closely tied to this is the fact that it is plain not possible to hide massive blunders in open source projects whereas behind closed walls it is trivial.

All of these and more are valid reasons for improved quality. But there are other, more sinister reasons that are usually not spoken of. In order to understand one of them, we need to first do a slight detour.

Process invocation on Windows

In this Twitter thread Bruce Dawson explains an issue he discovered while developing Chrome on Windows. The tl;dr version:

  • Invoking a new process on Windows takes 16 ms (in the thread other people report values of 30-60 ms)
  • 10 ms of this is taken by Windows Defender that scans the binary to be executed
  • Said executables are stored in a Defender excluded directory
  • 99% of the time the executable is either the C++ compiler or Python
  • The scan results are not cached, so every scan except the first one is a waste of time and energy
  • The Defender scanner process seems to be single threaded making it a bottleneck (not verified, might not be the case)
  • In a single build of Chrome, this wastes over 14 minutes of CPU and wall clock time

Other things of note in the thread and comments:

  • This issue was reported to Microsoft decades ago
  • The actual engineers working on this can't comment officially but it is implied that they want to fix it but are blocked by office politics
  • This can only be fixed by Microsoft and as far as anyone knows, there is no work being done despite this being a major inconvenience affecting a notable fraction of developers
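To get a feel for the numbers in the first list, here is a rough sketch that measures process startup cost on your own machine and works out how many launches the quoted waste implies. The function names are made up. The measured figure includes Python interpreter startup, so it is an upper bound on raw process creation, not a reproduction of the measurements in the thread.

```python
import subprocess
import sys
import time


def mean_spawn_seconds(runs=5):
    """Average wall clock time to start and finish a trivial child process."""
    start = time.perf_counter()
    for _ in range(runs):
        # The child is a Python interpreter that exits immediately.
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    return (time.perf_counter() - start) / runs


def implied_invocations(wasted_seconds, overhead_seconds):
    """How many launches a given per-launch overhead adds up to."""
    return round(wasted_seconds / overhead_seconds)


print(f"mean spawn time: {mean_spawn_seconds() * 1000:.1f} ms")
# 14 minutes of waste at 10 ms per launch, both figures from the list above.
print(implied_invocations(14 * 60, 0.010))   # 84000
```

In other words, the quoted 14 minutes corresponds to something on the order of 84 000 process launches in one build, if the 10 ms scan cost applies to each of them.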

What is the issue?

We don't know for sure due to corporate confidentiality reasons, but we can make some educated guesses. The most probable magic word above is "office politics". A fairly probable case is that somewhere in the middle damagement chain of Microsoft there is a person who has decreed that it is more important to create new features than spend resources on this as the code "already works" and then sets team priorities accordingly. Extra points if said person insists on using a Macbook as their work computer so their MBA friends won't make fun of them at the country club. If this is true then what we have is a case where a single person is making life miserable for tens (potentially hundreds) of thousands of people without really understanding the consequences of their decision. If this was up to the people on "the factory floor", the issue would probably have been fixed or at least mitigated years ago.

To repeat: I don't know if this is the case for Microsoft so the above dramatization is conjecture on my part. On the other hand I have seen exactly this scenario play out many times behind the scenes of various different corporations. For whatever reason this seems to be a common occurrence in nonpublic hierarchical organisations. Thus we can postulate why open source leads to better code in the end.

In open source development the technologically incompetent can not prevent the technologically competent from improving the product.

Shameless self-promotion

If you enjoyed this text and can read Finnish, you might enjoy my brand new book in which humanity's first interplanetary space travel is experienced as an allegory to a software startup. You can purchase it from at least these online stores: Link #1, Link #2. Also available at your local library.

Appendix: the cost and carbon footprint

Windows 11 has a bunch of helpful hints on how to reduce your carbon footprint. One of these is warning if your screen blanking timeout is less than the computer suspend timeout. At the same time they have in the core of their OS the gross inefficiency discussed above. Let us compare and contrast the two.

The display on a modern laptop consumes fairly little power. OTOH running a CPU flat out takes a lot of juice. Suspending earlier saves power when consumption is at its lowest, whereas virus scanners add load at the point of maximum resource usage. Many people close the lid of their computer when not using it so they would not benefit that much from different timeout settings. For developers avoiding process invocations is not possible, that is what the computer is expected to do.

Even more importantly, this also affects every single cloud computer running Windows including every Windows server, CI pipeline and, well, all of Azure. All of those machines are burning oil recomputing pointless virus checks. It is left as an exercise to the reader to compute how much energy has been wasted in, say, the last ten years of cloud operations over the globe (unless Microsoft runs Azure jobs with virus scanners disabled for efficiency, but surely they would not do that). Fixing the issue properly would take a lot of engineering effort and risk breaking existing applications, but MS would recoup the money investment in electricity savings from their own Azure server operations alone fairly quickly. I'm fairly sure there are ex-Googlers around who can give them pointers on how to calculate the exact break-even point.

All of this is to say that having said energy saving tips in the Windows UI is roughly equivalent to a Bitcoin enthusiast asking people to consider nature before printing their emails.

Saturday, April 29, 2023

The unbearable tightness of printing

Let's say you want to print a full colour comic book in the best possible quality. For simplicity we'll use this image as an example.

As you can probably guess, just putting this image in a PDF does not work, even if it had sufficient resolution. Instead what you need to do is to create two images. One for linework that is monochrome and has at least 600 PPI and one for colours, which is typically a 300 PPI colour managed CMYK TIFF.

The colour image is drawn first and then the monochrome image is drawn on top of it. In this way you get both smooth colours and crisp linework. Most people would stop here, but this is where the actual work begins. It is also where things start to wander into undocumented (or, rather, "implementation defined") territory.

Printing in true black

In computer monitors the blackest colour possible is when all colour components are off, or (0, 0, 0) in RGB values. Thus you might expect that the blackest CMYK colour is either (0, 0, 0, 1) or (1, 1, 1, 1). Surprisingly it is neither. The former looks grayish when printed whereas the latter can't be printed at all because of physical limitations. If you put too much ink in one place on the page, the underlying paper gets too wet, warps and might even rip. And tear.

Instead what you need to do is to use a colour called rich black. Each print shop has their own values for this, as the exact amount of inks to use to get the deepest black colour is dependent on the inks, paper and printing machine used. We'll use the value (0.1, 0.1, 0.1, 1.0) for rich black in this text.

Thus we need three different images rather than two.

First the colour image is laid down, then the image holding the areas that should be printed in rich black. This is a 300 PPI colour image with the colour value (0.1, 0.1, 0.1, 0) on pixels that should be painted with rich black. Finally the line work is drawn on top of the other two. The first two images can be combined into one. This is usually done by graphic artists when preparing their artwork to print. However the middle image can be automatically generated from the linework image with some Python so we're doing that to reduce manual work and the possibility of human error.
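The script used for this is not shown, so what follows is only a dependency-free sketch of the idea. It assumes the linework is a grid of 0/1 values (1 meaning ink), that the underlay is stored as 8-bit CMYK where 26 is roughly 10%, and that an output pixel is marked when any linework pixel in its block carries ink. All names are hypothetical.

```python
RICH_BLACK_UNDERLAY = (26, 26, 26, 0)   # about 10% C, M and Y, no K
NO_INK = (0, 0, 0, 0)


def rich_black_underlay(linework, factor=2):
    """Build the CMYK underlay from 1-bit linework (1 = ink, 0 = blank).

    `factor` is the resolution ratio, e.g. 2 for 600 PPI linework and a
    300 PPI underlay.
    """
    rows = []
    for top in range(0, len(linework), factor):
        block_rows = linework[top:top + factor]
        row = []
        for left in range(0, len(linework[0]), factor):
            inked = any(any(r[left:left + factor]) for r in block_rows)
            row.append(RICH_BLACK_UNDERLAY if inked else NO_INK)
        rows.append(row)
    return rows


linework = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
underlay = rich_black_underlay(linework)
```

A real implementation would read and write actual image files, but the per-pixel rule stays the same.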

If you create a PDF with these images you are still not done. In fact the output would be identical to the previous setup. There are still more quirks to handle.

Trapping and overprinting

Since all of the colours are printed separately they are susceptible to misregistration. That is, the various colours might shift relative to each other during the printing process. This causes visual artifacts in the edges between two colours. This is a fairly complicated topic; Wikipedia has more details. This issue can be fixed by trapping, that is, "spreading" the colour under the "edge" of the linework. Like so:

If you look closely at the middle image, the gray area is slightly smaller than in the previous picture. This shrunk image can be automatically generated from the linework image with morphological erode/dilate operations. Now we have everything needed to print things properly, but if you actually try it, it still won't work.
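Erosion is the operation that does the shrinking. Image libraries provide it ready made, but a plain Python sketch shows what happens. It assumes a 3x3 square neighbourhood and a mask of 0/1 values. The neighbourhood size actually needed depends on the resolution and the print shop's trapping requirements.

```python
def erode(mask):
    """Binary erosion with a 3x3 square.

    A pixel survives only if it and all eight neighbours are set.
    Pixels outside the image count as unset.
    """
    height, width = len(mask), len(mask[0])

    def is_set(y, x):
        return 0 <= y < height and 0 <= x < width and mask[y][x] == 1

    return [
        [
            1 if all(is_set(y + dy, x + dx)
                     for dy in (-1, 0, 1) for dx in (-1, 0, 1)) else 0
            for x in range(width)
        ]
        for y in range(height)
    ]


# A 3x3 block of ink inside a 5x5 image shrinks to its centre pixel.
mask = [[0] * 5] + [[0, 1, 1, 1, 0] for _ in range(3)] + [[0] * 5]
shrunk = erode(mask)
```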

The way the PDF imaging model works is that if you draw on the canvas with any colour, all colour channels of the existing colour on the page get affected. That is, if the existing colour on the canvas is (0.1, 0.1, 0.1, 0) and you draw on top of it with (0, 0, 0, 1) the output is (0, 0, 0, 1). All the work we did getting the proper rich black colour under the linework gets erased as if it was never there.
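As a per-pixel model this is easy to state in code. The sketch below shows the default replacement next to the behaviour we actually want, where a component of 0 in the new colour leaves the ink already on the page alone. The second branch is my simplified reading of how nonzero overprinting treats DeviceCMYK colours, not a full implementation of the imaging model.

```python
def paint(existing, new, overprint=False):
    """Return the CMYK value on the page after painting `new` over `existing`."""
    if not overprint:
        # Default: the new colour replaces every channel.
        return new
    # Overprint: zero components keep whatever ink is already there.
    return tuple(old if component == 0 else component
                 for old, component in zip(existing, new))


underlay = (0.1, 0.1, 0.1, 0)
line_black = (0, 0, 0, 1)
print(paint(underlay, line_black))                   # (0, 0, 0, 1)
print(paint(underlay, line_black, overprint=True))   # (0.1, 0.1, 0.1, 1)
```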

PDF has a feature called overprinting to handle this exact case (you could also use the "multiply" blend mode but it requires the use of transparency, which is still prohibited in some workflows). It does pretty much what it says on the tin. When overprinting is enabled any draw operations accumulate over the existing inks. Thus the final step is to enable overprinting for the final line work image and then Bob's your uncle?

In theory yes. In practice lol no, because this part of the PDF specification is about as hand-wavy as things go. There are several toggles that affect how overprinting gets handled. What they actually do is only explained in descriptive text. One of the outcomes of this is that every single generally available PDF viewer renders the output incorrectly. Poppler, Ghostscript, Apple Preview and even Adobe Acrobat Reader all produce outputs that are incorrect in different ways. They don't even warn you that the PDF uses overprinting and that the output might be incorrect. This makes developing and debugging this use case somewhat challenging.

The only way to get correct output is to use Adobe Acrobat Pro and tell it to enable overprint simulation. Fortunately I have a friend who has a 10 year old version (remember, back when you could actually buy software once and keep using it as opposed to a monthly license that can get yanked at any time?). After pestering him with an endless flow of test PDFs I finally managed to work out the exact steps needed to make this work:

  • Create a 300 PPI image with the colours, a 300 or 600 PPI monochrome image with the rich black areas and a 600 PPI monochrome image for the linework (the rich black image can be autogenerated from the linework image and/or precomposited in the colour image)
  • Load and draw the colour image as usual
  • Load the rich black image and store it as a PDF ImageMask rather than a plain image
  • Set nonstroke colour to (0.1, 0.1, 0.1, 0), set the rich black image as a stencil and fill it
  • Load the linework image as an ImageMask
  • Enable overprinting mode
  • Set overprinting mode to 1
  • Set nonstroke colour to (0, 0, 0, 1)
  • Draw the line image as a stencil
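In content stream terms the steps above come out roughly as follows. This is a sketch and not A4PDF output. The resource names (/Colour, /RichBlackMask, /LineMask, /Overprint) are made up, and it assumes the three images are already stored as XObjects, each covering the whole page.

```python
def comic_page_stream(width, height):
    """Content stream text for the colour, rich black and linework layers."""
    place = f"{width} 0 0 {height} 0 0 cm"   # scale the unit square to the page
    return "\n".join([
        f"q {place} /Colour Do Q",           # colour image, drawn as usual
        "0.1 0.1 0.1 0 k",                   # nonstroke colour for the underlay
        f"q {place} /RichBlackMask Do Q",    # stencil filled with that colour
        "/Overprint gs",                     # graphics state: overprint on, mode 1
        "0 0 0 1 k",                         # nonstroke colour for the linework
        f"q {place} /LineMask Do Q",         # linework stencil on top
    ])


# The graphics state dictionary that /Overprint is assumed to point to.
OVERPRINT_EXTGSTATE = "<< /Type /ExtGState /OP true /op true /OPM 1 >>"

print(comic_page_stream(595, 842))
```

The two mask images need /ImageMask true in their image dictionaries so that they get painted with the current nonstroke colour instead of their own pixel values.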

If you deviate from any of the above steps, the output will be silently wrong. If you process the resulting PDF with anything except Adobe's tool suite the end result might become silently wrong. As an example here is the output of colour separation using Adobe Acrobat and Ghostscript.

Acrobat has preserved the rich black values under the linework whereas Ghostscript has instead cleared the colour value to zero losing the "rich" part of black. Interestingly Ghostscript seems to handle overprinting correctly in basic PDF shape drawing operations but not in stencil drawing operations.

Or maybe it does and Acrobat is incorrect here. The only way to know for sure would be to print test samples on a dozen or so commercial offset printing presses, inspecting the plates manually and then seeing what ends up on paper. Sadly I don't have the resources for that.

Sunday, April 16, 2023

PDF forms, the standard that seemingly isn't

Having gotten the basic graphical output of A4PDF working I wanted to see if I could make PDF form generation work.

This was of course a terrible idea but sadly I lacked foresight.

After a lot of plumbing code it was time to start defining form widgets. I chose to start simple and create a form with a single togglable check button. This does not seem like an impossibly difficult problem and the official PDF specification even has a nice code sample for this:

The basic idea is simple. You define the "widget" and give it two "state objects" that contain PDF drawing operations. The idea is that the PDF renderer will draw one of the two on top of the basic PDF document depending on whether the checkbox is toggled or not. The code above sets things up so that the appearance of the checkbox is one of two different DingBat symbols. Their values are not shown in the specification, but presumably they are a checked box and an empty square.
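For reference, a widget of this kind boils down to a dictionary along the following lines. The helper is a hypothetical sketch and not what A4PDF or the specification sample contains verbatim. The object references for the two appearance streams are placeholders, and a working form additionally needs an AcroForm entry in the document catalog.

```python
def checkbox_widget(name, rect, on_ref, off_ref, checked=False):
    """Return the dictionary text for a check box field merged with its widget."""
    state = "/Yes" if checked else "/Off"
    x0, y0, x1, y1 = rect
    return "\n".join([
        "<<",
        "  /Type /Annot",
        "  /Subtype /Widget",
        "  /FT /Btn",                      # button field, no flags = check box
        f"  /T ({name})",                  # field name
        f"  /Rect [{x0} {y0} {x1} {y1}]",
        f"  /V {state}",                   # field value
        f"  /AS {state}",                  # which appearance stream to show
        f"  /AP << /N << /Yes {on_ref} /Off {off_ref} >> >>",
        ">>",
    ])


print(checkbox_widget("agree", (10, 10, 30, 30), "5 0 R", "6 0 R"))
```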

I created a test PDF with LibreOffice's form designer and then set about trying to recreate it. LO's form generator uses OpenSymbol for the checked status of the checkbox and an empty appearance for the off state. A4PDF uses the builtin Helvetica "X" character. The actual files can be downloaded here.

What we have here is a failure to communicate

No matter how much I tried I could not make form generation actually work. The output was always broken in weird ways I could not explain. Unfortunately this part of the PDF spec is not very helpful, because it does not give out full examples, only snippets, and search engines are worthless at finding any technical content when using "PDF" as a search term. It may even be that information about this is not available in public web sites. Who knows?

Anyhow, when regular debugging does not work, it's time to approach things sideways. Let's start by opening the LibreOffice test document with Okular:

This might seem to be working just fine, but people with sharp eyes might notice a problem. That check mark is not from OpenSymbol. FWICT it is the standard Qt checkbox widget. Still, the checkbox works and its appearance is passable. But what happens if you increase the zoom level?

Oh dear. That "Yes" text is the PDF-internal label given to the "on" state. Why is it displayed? No idea. It's time to bring out the heavy guns and see how things work in The Gold Standard of PDF Rendering, Adobe Reader.

Nope, that's not the OpenSymbol checkmark either. Adobe Reader seems to be ignoring the spec and drawing its own checkmarks instead. After seeing this I knew I had to try this on every PDF renderer I could reasonably get my hands on. Here are the results:

  • Okular
    • LO: incorrect appearance, breaks when zooming
    • A4PDF: shows both the "correct" checkmark as well as the Qt widget on top of each other, takes a noticeable amount of time after clicking until the widget state is updated
  • Evince
    • LO: does not respond to clicks
    • A4PDF: works correctly
  • Adobe Reader win64
    • LO: incorrect appearance
    • A4PDF: incorrect appearance, does not always respond to button clicks
  • Firefox
    • LO: incorrect appearance
    • A4PDF: incorrect appearance
  • Chromium
    • LO: incorrect appearance
    • A4PDF: works correctly
  • Apple Preview
    • LO: works correctly (though the offset is a bit wonky, probably an issue in the drawing commands themselves)
    • A4PDF: works correctly

The only viewer that seems to be working correctly in all cases is Apple Preview.

PDF has a ton of toggleable flags and the like to make things invisible when printing and so on. It is entirely possible that the PDF files are "incorrect" in some way. But still, the behaviour should either be the same on all viewers or they should report format errors. No errors are reported, though, even by this online validator.

Monday, April 3, 2023

Some details about creating print-quality PDFs

At its core, PDF is an image file format. In theory it is not at all different from the file formats of Gimp, Krita, Photoshop and the like. It consists of a bunch of raster and vector objects on top of each other. In practice there are several differences, the biggest of which is the following:

In PDF you can have images that have different color spaces and resolutions (that is, PPI values). This is by design as it is necessary to achieve high quality printing output.

As a typical example, comic books that are printed in color consist of two different images. The "bottom" one contains only the colors and is typically 300 PPI. On top of that you have the black linework, which is a 1 bit image at 600 or even 1200 PPI. Putting both the linework and colors in the same image file would not work. In the printout the lines would be fuzzy, even if the combined image did contain 1200 PPI.

A deeper explanation can be found in the usual places but the short version is that these two different image types need to be handled in completely opposite ways to make them look good when printed. When converting color images to printing plates the processing software prioritizes smoothness. On the other hand for monochrome images the system prioritizes sharpness. Doing this wrong means either getting color images that are blocky or linework that is fuzzy.

When working on A4PDF it was clear from the start that it needs to be able to create PDF files that can be used for commercial quality print jobs. To test this I wrote a Python script that recreates the cover of my recently published book originally typeset with Scribus. The end result was about 100 lines of code in total.

The background image

The main image is a single file without any adornments. It was provided by the illustrator as a single 8031 by 5953 image file. A fully color managed workflow demands the image to be in CMYK format and have a corresponding ICC color profile. There is basically only one file format that supports this use case: TIFF. Interestingly the specification for this file format was finalized in 1992. It is left as an exercise to the reader to determine how many image file formats have been introduced since that time.

A4PDF extracts the embedded ICC profile and stores it in the PDF file. It could also convert the image from the image's ICC colorspace to the specified output color space if they are different, but currently does not.

Text objects

All text color is white and is specified in CMYK colorspace, though it could also be specified in DeviceGray. Defining any object in RGB (even if the actual color was full white) could make the printing house reject the file as invalid and thus unsuitable for printing.

The author name in the front cover uses PDF's character spacing to "spread out" the text. The default character spacing in this font is too tight for use in covers.
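In the content stream, character spacing is the Tc operator inside a text object. Here is a minimal sketch with a made-up helper name. It assumes a font resource called /F1 whose encoding covers the text, and the spacing value is given in unscaled text space units.

```python
def escape_pdf_string(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def spaced_text(text, font, size, char_space, x, y):
    """Text object that widens the gap between glyphs with the Tc operator."""
    return "\n".join([
        "BT",
        f"  /{font} {size} Tf",      # select font and size
        f"  {char_space} Tc",        # extra space added after every glyph
        f"  {x} {y} Td",             # move to the start position
        f"  ({escape_pdf_string(text)}) Tj",
        "ET",
    ])


print(spaced_text("AUTHOR NAME", "F1", 24, 3, 100, 700))
```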

PDF can only produce horizontal text. Creating vertical text, as in the spine, requires you to modify the drawing state's transformation matrix stack. In practice this is almost identical to OpenGL, though the PostScript drawing model that PDF uses predates OpenGL by 8 years or so. In this case the text needs a rotate + translate. 
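The rotate + translate pair collapses into a single cm operator whose six numbers are the matrix entries. A small sketch of computing them follows. The function names are made up, and the angle is counterclockwise in degrees.

```python
import math


def _num(value):
    """Format a matrix entry compactly, turning tiny float noise into 0."""
    value = round(value, 6)
    if value == 0:
        value = 0.0              # also normalises -0.0
    return f"{value:.6f}".rstrip("0").rstrip(".")


def rotate_translate_cm(degrees, tx, ty):
    """`cm` operator that rotates by `degrees` and places the origin at (tx, ty)."""
    c = math.cos(math.radians(degrees))
    s = math.sin(math.radians(degrees))
    return " ".join(_num(v) for v in (c, s, -s, c, tx, ty)) + " cm"


print(rotate_translate_cm(90, 40, 100))   # 0 1 -1 0 40 100 cm
```

Wrapping the rotated text in q and Q restores the previous matrix afterwards, much like pushing and popping the matrix stack in old-style OpenGL.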

The bar code

Ideally this should be defined with PDF's vector drawing operations. Unfortunately that would require me to implement reading SVG files somehow. It turned out to be a lot less effort to export the SVG from Inkscape at 600 PPI and then convert that to a 1 bit image with the Gimp. The end result is pretty much the same.

This approach works in Scribus as well, but not in LibreOffice. It converts all 1 bit images to 8 bit grayscale meaning that it might be fuzzy when printed. LO used to do this correctly but the behaviour was intentionally changed at some point.

The logo

This is the logo used in the publisher's sci-fi books. You can probably guess that the first book in the series was Philip K. Dick's Do Androids Dream of Electric Sheep?

Like the bar code, this should optimally be defined with vector operations, but again for simplicity a raster image is used instead. However it is different from the bar code image in that it has an alpha channel so that the background starfield shows through. When you export a 1 bit image that has an alpha channel to a PNG, Gimp writes it out as an indexed image with 4 colors (black opaque, white opaque, black transparent, white transparent). A4PDF detects files of this type and stores them in the PDF as 1 bit monochrome images with a 1 bit alpha channel.
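The detection and conversion could be sketched like this. It is not A4PDF's implementation. It assumes the PNG has already been decoded into rows of palette indices plus an RGBA palette with exactly the four entries described above.

```python
def split_indexed(pixels, palette):
    """Split 4-colour indexed pixels into 1-bit ink and 1-bit alpha planes.

    `palette` maps each index to an (r, g, b, a) tuple whose colour is pure
    black or white and whose alpha is 0 or 255.
    """
    ink = [[1 if palette[i][:3] == (0, 0, 0) else 0 for i in row] for row in pixels]
    alpha = [[1 if palette[i][3] == 255 else 0 for i in row] for row in pixels]
    return ink, alpha


palette = [
    (0, 0, 0, 255),        # black opaque
    (255, 255, 255, 255),  # white opaque
    (0, 0, 0, 0),          # black transparent
    (255, 255, 255, 0),    # white transparent
]
ink, alpha = split_indexed([[0, 1], [2, 3]], palette)
```

The two planes map naturally onto a 1 bit image and a 1 bit mask in the PDF file.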

This is something even Scribus does not seem to handle correctly. In my testing it seemed to convert these kinds of images to 8 bit grayscale instead.

Saturday, April 1, 2023

Got the Star Trek - The Motion Picture Director's Edition box set? You might want to check your discs


Star Trek The Motion Picture The Complete Adventure box set claims to contain a special, longer cut of the film. However it seems that this is not the case for some editions. The British edition does contain the longer cut, but the Scandinavian one seems not to. The back of the box still claims that the box set does contain the longer cut.

The claim and evidence

This is the box set in question.

At the back of the box we find the following text:

I bought this at the end of last year but I could not find the longer edition anywhere in the menus. I mentioned this to a friend of mine who has the same box set and he had found it immediately. After a lot of debugging we discovered that he has the British edition of the box set whereas I have the Scandinavian one.

The British box set, which has the longer edition, consists of the following discs:

  • EU151495BLB Blu-ray The Director's Edition Bonus Disc
  • EU151496ULB 4K UltraHD The Director's Edition Feature Film, Special Features
  • EU151460ULB 4K UltraHD Feature Film, Special Features (Special Longer Version, Theatrical Version)
  • EU151496BLB Blu-ray The Director's Edition Feature Film, Special Features
  • EC150885BLB Blu-ray Feature Film, Special Features

My Scandinavian box set has the following discs:

  • eu151495blb bonus disc
  • eu151496ulb director's edition 4k ultrahd, Feature film
  • eu150884ulb 4k, regular version, feature film
  • eu151496blb blu-ray, director's edition
  • eu150885blb blu-ray, regular version, feature film

The only differences here are that one disc has a serial number with letters EC instead of EU and that the British edition has this disc:

Note how it says "special longer version" in microscopic letters under the Paramount logo. This text does not appear on any disc in the Scandinavian edition.

The Scandinavian edition does not have this disc. Instead it has the disc with the product id EU150884ULB whereas the corresponding British disc has the product id EU151460ULB. The missing content can not be on any of the other discs, because they are the same in both editions (the EU/EC issue notwithstanding).

I reported this to the store I bought the box set from. After a lot of convincing they agreed to order a replacement box set. When it arrived we inspected the contained discs and they were the same as in the box set I already had. This would imply that the entire print run is defective. After more convincing we managed to get the Finnish importer to report the issue to the distributor in Sweden.

They eventually replied that the longer edition is in "the extra disc in the set that is in a cardboard sleeve". This is not the case, as that disc contains only extras, is the same in the British box set and further is a regular Blu-Ray, not a 4k UHD one as it should be and as it is on the British edition.

We reported all this back up the chain and have not heard anything back for several weeks now.

What to do if you have a non-British (and presumably non-US) edition of this box set?

Check whether your discs contain the special longer edition. The British disc has this fairly unambiguous selection screen for it when you insert the disc:

If your box set does not contain this version of the film, report the issue to where you bought the box set from. If this really is a larger issue (as it would seem), the more bug reports they get the faster things will get fixed.

The film store said that they have sold tens of these box sets and that I was the first to notice the issue so don't assume that other people have already reported it. This is not an issue of the physical copy that I have because the replacement box set was defective in the same way.

Speculation: why is it broken?

The British disc that has the special longer edition does not contain Finnish subtitles (or possibly any non-English languages). When the box was being assembled someone found this out, decided that they can't ship a disc without localisation and replaced the disc with a regular version of the film that does not have the longer cut. But they did not change the back of the box, which states that the box set contains the longer edition which it does not seem to have.