Tuesday, May 30, 2023

A4PDF release 0.2.0

I have just tagged release 0.2.0 of A4PDF, the fully color managed PDF generation library.

There are not that many new features exposed in the public API since 0.1.0. The main goal of this release has been to make the Python integration work, and thus the release is also available on PyPI. Due to reasons the binary wheel is available only for Windows.

Trying it out

On Linux you probably want to do a Git checkout and run it from there.

On Windows the easiest way is to install the package via Pip.

On macOS you can't do anything, because Apple's compiler toolchain is too old to build the code.

What functionality does it provide?

There is no actual documentation, so the best bet is to look at the unit test file. There is a lot more functionality in the C++ code, but it is not exposed in the public API yet. These include things like (basics of) tagged PDF generation, annotations and external file embedding.

Impending name change

There is an official variant of PDF called PDF/A. There are several versions of it, including PDF/A-4. I did not know that when choosing the name. Because having a library called A4PDF that produces PDF/A-4 documents is confusing, the name needs to be changed. The new name has not been decided yet; suggestions welcome.

Wednesday, May 24, 2023

Advanced dependency management and building Python wheels with Meson

One of the most complex parts of developing C and C++ programs (and programs in most other languages) is dependency management. When developing A4PDF I have used Ubuntu's default distro dependencies. This is very convenient because you typically don't need to fiddle with getting them built, and they are battle tested and almost always work.

Unfortunately you can't use those in most dev environments, especially on Windows. So let's see how much work it takes to build the whole thing on Windows using only Visual Studio and to bundle the result into a Python wheel so it can be installed and distributed. I would have wanted to put it on PyPI as well, but there is currently a lockdown caused by spammers, so no go on that front.

Seems like a lot of effort? Let's start by listing all the dependencies:

  • fmt
  • Freetype
  • LibTIFF
  • LibPNG
  • LibJPEG (turbo)
  • LittleCMS2
  • Zlib

These are all available via WrapDB, so each one can be installed by executing a command like the following:

meson wrap install fmt

With that done (each command drops a .wrap file into the subprojects directory), Meson will automatically download and compile the dependencies from source. No changes need to be made in the main project's meson.build files. Linux builds will keep using system dependencies as if nothing had happened.

Next we need to build a Python extension package. This is different from a Python extension module: since the project uses ctypes for Python <-> C interop, the wheel needs to ship the shared library itself rather than a compiled extension module. Fortunately, thanks to the contributors of meson-python, this comes down to writing an 18 line toml file. Everything else is handled automatically for you. Installing the package is then a matter of running this command:

pip install .

After a minute or two of compilation the module is installed. Here is a screenshot of the built libraries in the system Python's site-packages.

Now we can open a fresh terminal and start using the module.
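
Because the interop goes through ctypes, the installed package is essentially plain Python plus a bundled shared library, and calling into it follows the standard ctypes pattern. Below is a minimal, generic sketch of that pattern on a Unix-like system, using libc's strlen as a stand-in; it is not A4PDF's actual API, whose entry points are not reproduced here.

import ctypes
import ctypes.util

# Load a shared library. A4PDF ships its own .dll/.so inside the wheel;
# here the C runtime is used as a stand-in so the sketch runs on its own.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the argument and return types of the C function to call.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"Hello from ctypes"))  # prints 17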

Random things of note

  • Everything here uses only Meson. There are no external dependency managers, Unix userland emulators, or special terminals that you have to use.
  • In theory this could work on macOS too, but the code is implemented in C++23 and Apple's toolchain is too old to support it.
  • The build definitions for A4PDF take only 155 lines (most of which is source and program name listings).
  • If, for whatever reason, you can't use WrapDB, you can host your own.

Saturday, May 20, 2023

Annotated PDF, HTML, exporters

If one were to compare PDF to HTML, one interesting thing that comes up fairly quickly is that they have evolved in exactly opposite directions.

HTML was originally about structure, with its h1 and p and ul tags and the like. Given this structure, the web browser was then (mostly) free to lay out the text however it saw fit based on the current screen size and browser window orientation. As usage grew, the appearance of pages became more and more important, and thus a whole new syntax had to be invented for specifying how the semantic content should be laid out visually. This eventually became CSS, and the goal became, roughly, "pixel perfect layout everywhere".

PDF was originally just a page description language, basically a sequence of vector and raster paint operations. Its main goal was perfect visual fidelity, as one of the main use cases was professional printing. The document had no understanding of what it contained. You either read it as it was on the monitor or, more often, printed it out on paper and then read it. As PDF's popularity grew, new requirements appeared. First people wanted to copy-paste text from PDF documents (contrary to what you might expect, this is surprisingly tricky to implement). Then they needed to integrate it with screen readers and other accessibility tools, reformat the text for reading on mobile devices, and so on. The solution was to create a from-scratch syntax in the PDF data stream, alongside the existing drawing operations, to express the semantic contents of the document in a machine-readable format. This became known as tagged PDF, introduced in PDF 1.4.

The way they chose to implement it was to add new commands to the PDF data stream that are used to wrap existing draw operations. For example, drawing a single line of text looks like this:

(Hello, world!) Tj

Adding structural information means writing it like this instead:

/P << /MCID 0 >>
  BDC
    (Hello, world!) Tj
  EMC

and then adding a bunch of bookkeeping information in document metadata dictionaries. This is a bit tedious, but since you can't really write PDF files by hand, you implement this once in the PDF generation code and then use that.

Why so many PDF exporters?

There are many PDF generator libraries. Almost every "serious" program, like Scribus and LibreOffice, has its own bespoke library. This is not just because developers love reimplementing things from scratch. The underlying issue is that PDF as a specification is big, and generating valid PDFs comes with very specific semantic requirements that force the application developer to understand PDF's document model down to the finest detail.

Existing PDF generator libraries either ignore all of this and just do basic RGB graphics (Cairo) or provide their own document model which they can then convert to PDF constructs (HexaPDF). The latter works nicely for simple cases, but it means you have to map your app's internal document structure to the library's document structure so that it in turn maps to PDF's document structure in the way you need it to. Often it becomes simpler to cut out the middle man and generate PDF directly.

The way A4PDF aims to solve the issue is to stay low level enough. It merely exposes PDF primitives directly and ensures that the syntax of the generated PDF is correct. It is up to the caller to ensure that the output is semantically correct. This should allow every application to easily and directly map its document output to PDF in just the way it wants (color managed or not, annotated or not, and so on).

Thursday, May 11, 2023

The real reason why open source software is better

The prevailing consensus at the current time seems to be that open source software is of higher quality than corresponding proprietary software. Several reasons have been put forth for why this is. One main reason given is that with open source any programmer in the world can inspect the code and contribute fixes. Closely tied to this is the fact that it is plain not possible to hide massive blunders in open source projects, whereas behind closed walls it is trivial.

All of these and more are valid reasons for improved quality. But there are other, more sinister reasons that are usually not spoken of. In order to understand one of them, we need to first do a slight detour.

Process invocation on Windows

In this Twitter thread Bruce Dawson explains an issue he discovered while developing Chrome on Windows. The tl;dr version:

  • Invoking a new process on Windows takes 16 ms (in the thread other people report values of 30-60 ms)
  • 10 ms of this is taken by Windows Defender that scans the binary to be executed
  • Said executables are stored in a Defender excluded directory
  • 99% of the time the executable is either the C++ compiler or Python
  • The scan results are not cached, so every scan except the first one is a waste of time and energy
  • The Defender scanner process seems to be single threaded, making it a bottleneck (not verified, might not be the case)
  • In a single build of Chrome, this wastes over 14 minutes of CPU and wall clock time
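
That last figure is easy to sanity check. Assuming roughly 10 ms of redundant scanning per process and on the order of 85,000 process invocations in a full Chrome build (the build size here is my assumption, not a number from the thread):

# Back-of-the-envelope check of the "over 14 minutes per build" claim.
scan_overhead_s = 0.010       # ~10 ms of redundant Defender scanning per process
process_launches = 85_000     # assumed process invocations in a full Chrome build

wasted_minutes = scan_overhead_s * process_launches / 60
print(f"{wasted_minutes:.1f} minutes wasted per build")  # ~14.2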

Other things of note in the thread and comments:

  • This issue was reported to Microsoft decades ago
  • The actual engineers working on this can't comment officially; it is implied that they want to fix it but are blocked by office politics
  • This can only be fixed by Microsoft, and as far as anyone knows no work is being done on it, despite it being a major inconvenience affecting a notable fraction of developers

What is the issue?

We don't know for sure due to corporate confidentiality reasons, but we can make some educated guesses. The most probable magic words above are "office politics". A fairly probable case is that somewhere in the middle damagement chain of Microsoft there is a person who has decreed that it is more important to create new features than to spend resources on this, as the code "already works", and who then sets team priorities accordingly. Extra points if said person insists on using a MacBook as their work computer so their MBA friends won't make fun of them at the country club. If this is true, then what we have is a case where a single person is making life miserable for tens (potentially hundreds) of thousands of people without really understanding the consequences of their decision. If this were up to the people on "the factory floor", the issue would probably have been fixed, or at least mitigated, years ago.

To repeat: I don't know if this is the case at Microsoft, so the above dramatization is conjecture on my part. On the other hand, I have seen exactly this scenario play out many times behind the scenes of various corporations. For whatever reason this seems to be a common occurrence in nonpublic hierarchical organisations. Thus we can postulate why open source leads to better code in the end.

In open source development the technologically incompetent can not prevent the technologically competent from improving the product.

Shameless self-promotion

If you enjoyed this text and can read Finnish, you might enjoy my brand new book in which humanity's first interplanetary space travel is experienced as an allegory of a software startup. You can purchase it from at least these online stores: Link #1, Link #2. Also available at your local library.

Appendix: the cost and carbon footprint

Windows 11 has a bunch of helpful hints on how to reduce your carbon footprint. One of these is warning if your screen blanking timeout is less than the computer suspend timeout. At the same time they have in the core of their OS the gross inefficiency discussed above. Let us compare and contrast the two.

The display on a modern laptop consumes fairly little power. OTOH running a CPU flat out takes a lot of juice. Suspending earlier saves power when consumption is at its lowest, whereas virus scanners add load at the point of maximum resource usage. Many people close the lid of their computer when not using it, so they would not benefit that much from different timeout settings. For developers, avoiding process invocations is not possible; that is what the computer is expected to do.

Even more importantly, this also affects every single cloud computer running Windows, including every Windows server, CI pipeline and, well, all of Azure. All of those machines are burning oil recomputing pointless virus checks. It is left as an exercise to the reader to compute how much energy has been wasted in, say, the last ten years of cloud operations over the globe (unless Microsoft runs Azure jobs with virus scanners disabled for efficiency, but surely they would not do that). Fixing the issue properly would take a lot of engineering effort and risk breaking existing applications, but MS would recoup the investment fairly quickly in electricity savings from its own Azure server operations alone. I'm fairly sure there are ex-Googlers around who can give them pointers on how to calculate the exact break-even point.
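
For readers who want to take a crack at that exercise, here is the shape of the calculation. Every input below is an assumption picked purely for illustration, not a measured figure:

# Purely illustrative: every number below is an assumption, not measured data.
windows_vms = 5_000_000            # assumed Windows cloud VMs doing real work
launches_per_vm_per_day = 50_000   # assumed process invocations per VM per day
scan_overhead_s = 0.010            # ~10 ms of redundant scanning per launch
extra_cpu_watts = 20               # assumed extra CPU draw while scanning
years = 10

wasted_cpu_seconds = windows_vms * launches_per_vm_per_day * scan_overhead_s * 365 * years
wasted_kwh = wasted_cpu_seconds * extra_cpu_watts / 3_600_000
print(f"{wasted_kwh:,.0f} kWh wasted over {years} years")

The result scales linearly with each input, so plug in your own estimates.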

All of this is to say that having said energy saving tips in the Windows UI is roughly equivalent to a Bitcoin enthusiast asking people to consider nature before printing their emails.