Wednesday, May 24, 2023

Advanced dependency management and building Python wheels with Meson

One of the most complex parts of developing C and C++ programs (and most other languages) is dependency management. When developing A4PDF I have used Ubuntu's default distro dependencies. This is very convenient: you typically don't need to fiddle with building them, and they are battle tested and almost always work.

Unfortunately you can't use those in most dev environments, especially on Windows. So let's see how much work it takes to build the whole thing on Windows using only Visual Studio and to bundle the result into a Python wheel that can be installed and distributed. I would also have liked to put it on PyPI, but that is currently locked down because of spammers, so no go on that front.

Seems like a lot of effort? Let's start by listing all the dependencies:

  • fmt
  • Freetype
  • LibTIFF
  • LibPNG
  • LibJPEG (turbo)
  • LittleCMS2
  • Zlib

These are all available via WrapDB, so each one can be installed by executing a command like the following:

meson wrap install fmt

With that done, Meson automatically downloads and compiles the dependencies from source. No changes need to be made to the main project's meson.build files. Linux builds keep using system deps as if nothing had happened.

Next we need to build a Python extension package. This is different from a Python extension module, as the project uses ctypes for Python <-> C interop. Fortunately, thanks to the contributors of Meson-Python, this comes down to writing an 18 line toml file. Everything else is handled automatically for you. Installing the package is then a question of running this command:

pip install .
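
That 18 line file looks roughly like the following. This is a minimal sketch from memory, not the project's actual file, and the metadata values are placeholders:

[build-system]
build-backend = 'mesonpy'
requires = ['meson-python']

[project]
name = 'a4pdf'
version = '0.1.0'
description = 'A PDF generation library'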

After a minute or two of compilation the module is installed. Here is a screenshot of the built libraries in the system Python's site-packages.

Now we can open a fresh terminal and start using the module.

Random things of note

  • Everything here uses only Meson. There are no external dependency managers, Unix userland emulators or special terminals that you have to use.
  • In theory this could work on macOS too, but the code is implemented in C++23 and Apple's toolchain is too old to support it.
  • The build definitions for A4PDF take only 155 lines (most of which is source and program name listings).
  • If, for whatever reason, you can't use WrapDB, you can host your own.

Saturday, May 20, 2023

Annotated PDF, HTML, exporters

If one were to compare PDF to HTML, one interesting thing that comes up fairly quickly is that they have evolved in exactly opposite directions.

HTML was originally about structure, with its h1 and p and ul tags and the like. Given this structure the web browser was then (mostly) free to lay out the text however it saw fit given the current screen size and browser window orientation. As usage grew the appearance of pages became more and more important and thus they had to invent a whole new syntax for specifying how the semantic content should be laid out visually. This eventually became CSS and the goal became, roughly, "pixel perfect layout everywhere".

PDF was originally just a page description language, basically a sequence of vector and raster paint operations. Its main goal was perfect visual fidelity, as one of the main use cases was professional printing. The document had no understanding of what it contained. You either read it as it was on the monitor or, more often, printed it out on paper and then read it. As PDF's popularity grew, new requirements appeared. First people wanted to copypaste text from PDF documents (contrary to what you might expect, this is surprisingly tricky to implement). Then they needed to integrate it with screen readers and other accessibility tools, reformat the text to read it on mobile devices and so on. The solution was to create a new from-scratch syntax in the PDF data stream, alongside the existing draw operations, to express the semantic contents of the document in a machine-readable format. This became known as tagged PDF, introduced in PDF 1.4.

The way they chose to implement it was to add new commands to the PDF data stream that are used to wrap existing draw operations. For example, drawing a single line of text looks like this:

(Hello, world!) Tj

Adding structural information means writing it like this instead:

/P << /MCID 0 >>
  BDC
    (Hello, world!) Tj
  EMC

and then adding a bunch of bookkeeping information in document metadata dictionaries. This is a bit tedious, but since you can't really write PDF files by hand, you implement this once in the PDF generation code and then use that.

Why so many PDF exporters?

There are many PDF generator libraries. Almost every "serious" program, like Scribus and LibreOffice, has its own bespoke library. This is not just because developers love reimplementing things from scratch. The underlying issue is that PDF as a specification is big, and generating valid PDFs imposes very specific semantic requirements that force the application developer to understand PDF's document model down to the finest detail.

Existing PDF generator libraries either ignore all of this and just do basic RGB graphics (Cairo) or provide their own document model which they then convert to PDF constructs (HexaPDF). The latter works nicely for simple cases, but it requires you to map your app's internal document structure to the library's document structure so that it in turn maps to PDF's document structure in the way you need it to. Often it becomes simpler to cut out the middle man and generate PDF directly.

The way A4PDF aims to solve the issue is to stay low level enough. It merely exposes PDF primitives directly and ensures that the syntax of the generated PDF is correct. It is up to the caller to ensure that the output is semantically correct. This should allow every application to map its document output to PDF easily and directly, in just the way it wants (color managed or not, annotated or not, and so on).

Thursday, May 11, 2023

The real reason why open source software is better

The prevailing consensus at the current time seems to be that open source software is of higher quality than corresponding proprietary software. Several reasons have been put forth for why this is. One main reason given is that with open source any programmer in the world can inspect the code and contribute fixes. Closely tied to this is the fact that it is simply not possible to hide massive blunders in open source projects, whereas behind closed walls it is trivial.

All of these and more are valid reasons for improved quality. But there are other, more sinister reasons that are usually not spoken of. In order to understand one of them, we need to first do a slight detour.

Process invocation on Windows

In this Twitter thread Bruce Dawson explains an issue he discovered while developing Chrome on Windows. The tl/dr version:

  • Invoking a new process on Windows takes 16 ms (in the thread other people report values of 30-60 ms)
  • 10 ms of this is taken by Windows Defender that scans the binary to be executed
  • Said executables are stored in a Defender excluded directory
  • 99% of the time the executable is either the C++ compiler or Python
  • The scan results are not cached, so every scan except the first one is a waste of time and energy
  • The Defender scanner process seems to be single threaded making it a bottleneck (not verified, might not be the case)
  • In a single build of Chrome, this wastes over 14 minutes of CPU and wall clock time
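
As a back-of-envelope sanity check of that last figure (the invocation count is my assumption, not a number from the thread):

# Rough check of the "over 14 minutes" claim. A Chrome build spawns
# tens of thousands of compiler and script processes.
invocations = 55_000
overhead_s = 0.016                     # 16 ms of overhead per process
print(invocations * overhead_s / 60)   # ~14.7 minutes per build
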
Other things of note in the thread and comments:

  • This issue has been reported to Microsoft decades ago
  • The actual engineers working on this can't comment officially, but it is implied that they would like to fix it and are blocked by office politics
  • This can only be fixed by Microsoft and as far as anyone knows, there is no work being done despite this being a major inconvenience affecting a notable fraction of developers

What is the issue?

We don't know for sure due to corporate confidentiality reasons, but we can make some educated guesses. The most probable magic word above is "office politics". A fairly probable case is that somewhere in the middle damagement chain of Microsoft there is a person who has decreed that it is more important to create new features than to spend resources on this, as the code "already works", and who sets team priorities accordingly. Extra points if said person insists on using a MacBook as their work computer so their MBA friends won't make fun of them at the country club. If this is true then what we have is a case where a single person is making life miserable for tens (potentially hundreds) of thousands of people without really understanding the consequences of their decision. If this were up to the people on "the factory floor", the issue would probably have been fixed, or at least mitigated, years ago.

To repeat: I don't know if this is the case for Microsoft, so the above dramatization is conjecture on my part. On the other hand I have seen exactly this scenario play out many times behind the scenes of various corporations. For whatever reason this seems to be a common occurrence in nonpublic hierarchical organisations. Thus we can postulate why open source leads to better code in the end.

In open source development the technologically incompetent can not prevent the technologically competent from improving the product.

Shameless self-promotion

If you enjoyed this text and can read Finnish, you might enjoy my brand new book, in which humanity's first interplanetary space travel is experienced as an allegory of a software startup. You can purchase it from at least these online stores: Link #1, Link #2. Also available at your local library.

Appendix: the cost and carbon footprint

Windows 11 has a bunch of helpful hints on how to reduce your carbon footprint. One of these is warning if your screen blanking timeout is less than the computer suspend timeout. At the same time they have in the core of their OS the gross inefficiency discussed above. Let us compare and contrast the two.

The display on a modern laptop consumes fairly little power. OTOH running a CPU flat out takes a lot of juice. Suspending earlier saves power when consumption is at its lowest, whereas virus scanners add load at the point of maximum resource usage. Many people close the lid of their computer when not using it, so they would not benefit much from different timeout settings. For developers, avoiding process invocation is not possible; that is what the computer is expected to do.

Even more importantly, this also affects every single cloud computer running Windows, including every Windows server, CI pipeline and, well, all of Azure. All of those machines are burning oil recomputing pointless virus checks. It is left as an exercise to the reader to compute how much energy has been wasted in, say, the last ten years of cloud operations over the globe (unless Microsoft runs Azure jobs with virus scanners disabled for efficiency, but surely they would not do that). Fixing the issue properly would take a lot of engineering effort and risk breaking existing applications, but MS would recoup the investment fairly quickly in electricity savings from their own Azure operations alone. I'm fairly sure there are ex-Googlers around who can give them pointers on how to calculate the exact break-even point.

All of this is to say that having said energy saving tips in the Windows UI is roughly equivalent to a Bitcoin enthusiast asking people to consider nature before printing their emails.

Saturday, April 29, 2023

The unbearable tightness of printing

Let's say you want to print a full colour comic book in the best possible quality. For simplicity we'll use this image as an example.

As you can probably guess, just putting this image in a PDF does not work, even if it had sufficient resolution. Instead what you need to do is to create two images: one for the linework, which is monochrome and has at least 600 PPI, and one for the colours, which is typically a 300 PPI colour managed CMYK TIFF.

The colour image is drawn first and then the monochrome image is drawn on top of it. In this way you get both smooth colours and crisp linework. Most people would stop here, but this is where the actual work begins. It is also where things start to wander into undocumented (or, rather, "implementation defined") territory.

Printing in true black

On computer monitors the blackest colour possible is when all colour components are off, or (0, 0, 0) in RGB values. Thus you might expect that the blackest CMYK colour is either (0, 0, 0, 1) or (1, 1, 1, 1). Surprisingly it is neither. The former looks grayish when printed, whereas the latter can't be printed at all because of physical limitations. If you put too much ink in one place on the page, the underlying paper gets too wet, warps and might even rip. And tear.

Instead what you need to do is to use a colour called rich black. Each print shop has their own values for this, as the exact amount of inks to use to get the deepest black colour is dependent on the inks, paper and printing machine used. We'll use the value (0.1, 0.1, 0.1, 1.0) for rich black in this text.

Thus we need three different images rather than two.

First the colour image is laid down, then the image holding the areas that should be printed in rich black. This is a 300 PPI colour image with the colour value (0.1, 0.1, 0.1, 0) on pixels that should be painted with rich black. Finally the linework is drawn on top of the other two. The first two images can be combined into one; this is usually done by graphic artists when preparing their artwork for print. However the middle image can be automatically generated from the linework image with some Python, so we do that to reduce manual work and the possibility of human error.

If you create a PDF with these images you are still not done. In fact the output would be identical to the previous setup. There are still more quirks to handle.

Trapping and overprinting

Since all of the colours are printed separately they are subject to misregistration. That is, the various colours might shift relative to each other during the printing process. This causes visual artifacts at the edges between two colours. This is a fairly complicated topic; Wikipedia has more details. The issue can be fixed by trapping, that is, "spreading" the colour under the "edge" of the linework. Like so:

If you look closely at the middle image, the gray area is slightly smaller than in the previous picture. This shrunk image can be automatically generated from the linework image with morphological erode/dilate operations. Now we have everything needed to print things properly, but if you actually try it, it still won't work.
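
As an aside, the shrinking step is only a few lines with Pillow. This is a sketch with made-up file names and kernel size; the real amount of spread depends on the print shop's requirements:

from PIL import Image, ImageFilter

# Work on the linework as 8 bit grayscale: ink is 0, empty paper is 255.
line = Image.open('linework.png').convert('L')
# A max filter picks the brightest pixel in each 3x3 neighbourhood,
# which erodes the black areas by one pixel.
shrunk = line.filter(ImageFilter.MaxFilter(3))
# Threshold back to 1 bit and save as the rich black stencil.
shrunk.point(lambda v: 255 if v >= 128 else 0).convert('1').save('richblack.png')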

The way the PDF imaging model works is that if you draw on the canvas with any colour, all colour channels of the existing colour on the page get affected. That is, if the existing colour on the canvas is (0.1, 0.1, 0.1, 0) and you draw on top of it with (0, 0, 0, 1) the output is (0, 0, 0, 1). All the work we did getting the proper rich black colour under the linework gets erased as if it was never there.

PDF has a feature called overprinting to handle this exact case (you could also use the "multiply" blend mode, but it requires the use of transparency, which is still prohibited in some workflows). It does pretty much what it says on the tin. When overprinting is enabled, draw operations accumulate over the existing inks. Thus the final step is to enable overprinting for the final linework image and then Bob's your uncle?

In theory yes. In practice lol no, because this part of the PDF specification is about as hand-wavy as things go. There are several toggles that affect how overprinting gets handled. What they actually do is only explained in descriptive text. One of the outcomes of this is that every single generally available PDF viewer renders the output incorrectly. Poppler, Ghostscript, Apple Preview and even Adobe Acrobat Reader all produce outputs that are incorrect in different ways. They don't even warn you that the PDF uses overprinting and that the output might be incorrect. This makes development and debugging this use case somewhat challenging.

The only way to get correct output is to use Adobe Acrobat Pro and tell it to enable overprint simulation. Fortunately I have a friend who has a 10-year-old version (remember, back when you could actually buy software once and keep using it, as opposed to a monthly license that can get yanked at any time?). After pestering him with an endless flow of test PDFs I finally managed to work out the exact steps needed to make this work:

  • Create a 300 PPI image with the colours, a 300 or 600 PPI monochrome image with the rich black areas and a 600 PPI monochrome image for the linework (the rich black image can be autogenerated from the linework image and/or precomposited in the colour image)
  • Load and draw the colour image as usual
  • Load the rich black image and store it as a PDF ImageMask rather than a plain image
  • Set nonstroke colour to (0.1, 0.1, 0.1, 0), set the rich black image as a stencil and fill it
  • Load the linework image as an imagemask
  • Enable overprinting mode
  • Set overprinting mode to 1
  • Set nonstroke colour to (0, 0, 0, 1)
  • Draw the line image as a stencil
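
As a rough illustration, the content stream those steps produce would look something like the following sketch. The resource names are invented, and /GSop stands for an ExtGState dictionary along the lines of << /OP true /op true /OPM 1 >> defined in the page resources:

stream = '\n'.join([
    'q',
    '0.1 0.1 0.1 0 k',   # rich black fill colour, zero K channel
    '/RichBlack Do',     # paint the rich black stencil with it
    '/GSop gs',          # enable overprinting with overprint mode 1
    '0 0 0 1 k',         # pure K for the linework
    '/Line Do',          # draw the linework on top
    'Q',
])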

If you deviate from any of the above steps, the output will be silently wrong. If you process the resulting PDF with anything except Adobe's tool suite the end result might become silently wrong. As an example here is the output of colour separation using Adobe Acrobat and Ghostscript.

Acrobat has preserved the rich black values under the linework, whereas Ghostscript has cleared the colour value to zero, losing the "rich" part of black. Interestingly, Ghostscript seems to handle overprinting correctly in basic PDF shape drawing operations but not in stencil drawing operations.

Or maybe it does and Acrobat is incorrect here. The only way to know for sure would be to print test samples on a dozen or so commercial offset printing presses, inspecting the plates manually and then seeing what ends up on paper. Sadly I don't have the resources for that.

Sunday, April 16, 2023

PDF forms, the standard that seemingly isn't

Having gotten the basic graphical output of A4PDF working I wanted to see if I could make PDF form generation work.

This was of course a terrible idea but sadly I lacked foresight.

After a lot of plumbing code it was time to start defining form widgets. I chose to start simple and create a form with a single togglable check button. This does not seem like an impossibly difficult problem and the official PDF specification even has a nice code sample for this:

The basic idea is simple. You define the "widget" and give it two "state objects" that contain PDF drawing operations. The idea is that the PDF renderer will draw one of the two on top of the basic PDF document depending on whether the checkbox is toggled or not. The code above sets things up so that the appearance of the checkbox is one of two different dingbat symbols. Their values are not shown in the specification, but presumably they are a checked box and an empty square.

I created a test PDF with LibreOffice's form designer and then set about trying to recreate it. LO's form generator uses OpenSymbol for the checked state of the checkbox and an empty appearance for the off state. A4PDF uses the builtin Helvetica "X" character. The actual files can be downloaded here.

What we have here is a failure to communicate

No matter how much I tried I could not make form generation actually work. The output was always broken in weird ways I could not explain. Unfortunately this part of the PDF spec is not very helpful, because it gives out only snippets rather than full examples, and search engines are worthless at finding technical content when using "PDF" as a search term. It may even be that this information is not available on any public web site. Who knows?

Anyhow, when regular debugging does not work, it's time to approach things sideways. Let's start by opening the LibreOffice test document with Okular:

This might seem to be working just fine, but people with sharp eyes might notice a problem. That check mark is not from OpenSymbol. FWICT it is the standard Qt checkbox widget. Still, the checkbox works and its appearance is passable. But what happens if you increase the zoom level?

Oh dear. That "Yes" text is the PDF-internal label given to the "on" state. Why is it displayed? No idea. It's time to bring out the heavy guns and see how things work in The Gold Standard of PDF Rendering, Adobe Reader.

Nope, that's not the OpenSymbol checkmark either. Adobe Reader seems to be ignoring the spec and drawing its own checkmarks instead. After seeing this I knew I had to try this on every PDF renderer I could reasonably get my hands on. Here's the results:

  • Okular
    • LO: incorrect appearance, breaks when zooming
    • A4PDF: shows both the "correct" checkmark as well as the Qt widget on top of each other, takes a noticeable amount of time after clicking until the widget state is updated
  • Evince
    • LO: does not respond to clicks
    • A4PDF: works correctly
  • Adobe Reader win64
    • LO: incorrect appearance
    • A4PDF: incorrect appearance, does not always respond to button clicks
  • Firefox
    • LO: Incorrect appearance
    • A4PDF: Incorrect appearance
  • Chromium
    • LO: Incorrect appearance
    • A4PDF: works correctly
  • Apple Preview:
    • LO: works correctly (though the offset is a bit wonky, probably an issue in the drawing commands themselves)
    • A4PDF: works correctly

The only viewer that seems to be working correctly in all cases is Apple Preview.

PDF has a ton of toggleable flags and the like to make things invisible when printing and so on. It is entirely possible that the PDF files are "incorrect" in some way. But still, the behaviour should either be the same in all viewers or they should report format errors. No errors are reported, though, even by this online validator.

Monday, April 3, 2023

Some details about creating print-quality PDFs

At its core, PDF is an image file format. In theory it is not at all different from the file formats of Gimp, Krita, Photoshop and the like. It consists of a bunch of raster and vector objects on top of each other. In practice there are several differences, the biggest of which is the following:

In PDF you can have images that have different color spaces and resolutions (that is, PPI values). This is by design as it is necessary to achieve high quality printing output.

As a typical example, comic books that are printed in color consist of two different images. The "bottom" one contains only the colors and is typically 300 PPI. On top of that you have the black linework, which is a 1 bit image at 600 or even 1200 PPI. Putting both the linework and the colors in the same image file would not work. In the printout the lines would be fuzzy, even if the combined image were 1200 PPI.

A deeper explanation can be found in the usual places, but the short version is that these two image types need to be handled in completely opposite ways to make them look good when printed. When converting color images to printing plates, the processing software prioritizes smoothness. For monochrome images, on the other hand, the system prioritizes sharpness. Doing this wrong means getting either color images that are blocky or linework that is fuzzy.

When working on A4PDF it was clear from the start that it needs to be able to create PDF files that can be used for commercial quality print jobs. To test this I wrote a Python script that recreates the cover of my recently published book, which was originally typeset with Scribus. The end result was about 100 lines of code in total.

The background image

The main image is a single file without any adornments. It was provided by the illustrator as a single 8031 by 5953 pixel image file. A fully color managed workflow demands that the image be in CMYK format and have a corresponding ICC color profile. There is basically only one file format that supports this use case: TIFF. Interestingly, the specification for this file format was finalized in 1992. It is left as an exercise to the reader to determine how many image file formats have been introduced since that time.

A4PDF extracts the embedded ICC profile and stores it in the PDF file. It could also convert the image from the image's ICC colorspace to the specified output color space if they are different, but currently does not.

Text objects

All text is white and specified in the CMYK colorspace, though it could also be specified in DeviceGray. Defining any object in RGB (even if the actual color were full white) could make the printing house reject the file as invalid and thus unsuitable for printing.

The author name in the front cover uses PDF's character spacing to "spread out" the text. The default character spacing in this font is too tight for use in covers.

PDF can only produce horizontal text. Creating vertical text, as in the spine, requires you to modify the drawing state's transformation matrix stack. In practice this is almost identical to OpenGL, though the PostScript drawing model that PDF uses predates OpenGL by 8 years or so. In this case the text needs a rotate + translate. 
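
As an illustration, the matrix for 90 degree spine text could be computed like this; the translation values are made up:

import math

# Build the six numbers of the PDF text matrix "a b c d e f Tm".
# For a pure rotation the matrix is [cos sin -sin cos 0 0].
angle = math.radians(90)
a, b = math.cos(angle), math.sin(angle)
print(f'{a:.3f} {b:.3f} {-b:.3f} {a:.3f} 300 50 Tm')
# -> 0.000 1.000 -1.000 0.000 300 50 Tm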

The bar code

Ideally this would be defined with PDF's vector drawing operations. Unfortunately that would require me to implement reading SVG files somehow. It turned out to be a lot less effort to export the SVG from Inkscape at 600 PPI and then convert it to a 1 bit image with Gimp. The end result is pretty much the same.

This approach works in Scribus as well, but not in LibreOffice, which converts all 1 bit images to 8 bit grayscale, meaning the output might be fuzzy when printed. LO used to handle this correctly, but the behaviour was intentionally changed at some point.

The logo

This is the logo used in the publisher's sci-fi books. You can probably guess that the first book in the series was Philip K. Dick's Do Androids Dream of Electric Sheep?

Like the bar code, this should optimally be defined with vector operations, but again for simplicity a raster image is used instead. However it is different from the bar code image in that it has an alpha channel so that the background starfield shows through. When you export a 1 bit image that has an alpha channel to a PNG, Gimp writes it out as an indexed image with 4 colors (black opaque, white opaque, black transparent, white transparent). A4PDF detects files of this type and stores them in the PDF as 1 bit monochrome images with a 1 bit alpha channel.

This is something even Scribus does not seem to handle correctly. In my testing it seemed to convert these kinds of images to 8 bit grayscale instead.

Saturday, April 1, 2023

Got the Star Trek - The Motion Picture Director's Edition box set? You might want to check your discs

TL/DR

The Star Trek The Motion Picture The Complete Adventure box set claims to contain a special, longer cut of the film. However, this seems not to be the case for some editions. The British edition does contain the longer cut, but the Scandinavian one seems not to, even though the back of the box still claims that it does.

The claim and evidence

This is the box set in question.

At the back of the box we find the following text:

I bought this at the end of last year but I could not find the longer edition anywhere in the menus. I mentioned this to a friend of mine who has the same box set and he had found it immediately. After a lot of debugging we discovered that he has the British edition of the box set whereas I have the Scandinavian one.

The British box set, which has the longer edition, consists of the following discs:

  • EU151495BLB Blu-ray The Director's Edition Bonus Disc
  • EU151496ULB 4K UltraHD The Director's Edition Feature Film, Special Features
  • EU151460ULB 4K UltraHD Feature Film, Special Features (Special Longer Version, Theatrical Version)
  • EU151496BLB Blu-ray The Director's Edition Feature Film, Special Features
  • EC150885BLB Blu-ray Feature Film, Special Features

My Scandinavian box set has the following discs:

  • eu151495blb bonus disc
  • eu151496ulb director's edition 4k ultrahd, Feature film
  • eu150884ulb 4k, regular version, feature film
  • eu151496blb blu-ray, director's edition
  • eu150885blb blu-ray, regular version, feature film

The only differences here are that one disc has a serial number with letters EC instead of EU and that the British edition has this disc:

Note how it says "special longer version" in microscopic letters under the Paramount logo. This text does not appear on any disc in the Scandinavian edition.

The Scandinavian edition does not have this disc. Instead it has a disc with the product id EU150884ULB, whereas the corresponding British disc has the product id EU151460ULB. The missing content can not be on any of the other discs, because they are the same in both editions (the EU/EC issue notwithstanding).

I reported this to the store I bought the box set from. After a lot of convincing they agreed to order a replacement box set. When it arrived we inspected the contained discs and they were the same as in the box set I already had. This would imply that the entire print run is defective. After more convincing we managed to get the Finnish importer to report the issue to the distributor in Sweden.

They eventually replied that the longer edition is on "the extra disc in the set that is in a cardboard sleeve". This is not the case, as that disc contains only extras, is the same in the British box set, and furthermore is a regular Blu-ray, not a 4K UHD one as it should be and as it is in the British edition.

We reported all this back up the chain and have not heard anything back for several weeks now.

What to do if you have a non-British (and presumably non-US) edition of this box set?

Check whether your discs contain the special longer edition. The British disc has this fairly unambiguous selection screen for it when you insert the disc:


If your box set does not contain this version of the film, report the issue to where you bought the box set from. If this really is a larger issue (as it would seem), the more bug reports they get the faster things will get fixed.

The film store said that they have sold tens of these box sets and that I was the first to notice the issue, so don't assume that other people have already reported it. This is not an issue with my individual physical copy, because the replacement box set was defective in the same way.

Speculation: why is it broken?

The British disc that has the special longer edition does not contain Finnish subtitles (or possibly any non-English languages). When the box was being assembled someone found this out, decided that they can't ship a disc without localisation, and replaced the disc with a regular version of the film that does not have the longer cut. But they did not change the back of the box, which still states that the box set contains the longer edition.

Wednesday, March 22, 2023

In which you find out that everything you assumed was wrong

The previous post listed an algorithm for converting TrueType glyph advances into PDF glyph widths. It worked, but was weird, kludgy, complicated and all around bad. It was also completely wrong. As in "not even remotely in the direction of the correct solution" but it happened to work by accident with the examples I had. Trying it with some more fonts yields the now familiar error.

What makes debugging this issue harder is that Freetype exposes the same information multiple times. There are three or four different places where glyph advancement can be read and multiple multipliers that could potentially be used. No combination of these provides a value that would be even relatively close to the correct one.

So let's work backwards. What is the multiplier that we need to get? A single example says that a glyph with a font advance of 1229 should have a PDF width value of 600. Therefore the ratio is 1229/600 = 2.048333, so we need to divide the advance by roughly two.

Waitaminute!

Hmmmmm.

After some debugging one can find a struct entry that is set to 1000 for fonts that work and 2048 for those that don't: face->units_per_EM. This scaling value is arbitrary, but most (though not all) Adobe Type 1 fonts use 1000, while most (though not all) TrueType fonts use 2048. PDF uses 1000, probably to match Type 1 conventions. This is even documented in the PDF specification 1.7, section 9.2.4, page 241: "For all font types except Type 3, the units of glyph space are one-thousandth of a unit of text space". This is incredibly easy to miss when reading the 700+ pages of text in the specification. I know because I did it. Several times.
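
For the record, the conversion is then just this, using the example values from above:

# Convert a glyph advance from font units to PDF width units.
advance = 1229           # glyph advance in font units
units_per_em = 2048      # face->units_per_EM for this TrueType font
pdf_width = advance * 1000 / units_per_em
print(round(pdf_width))  # -> 600, matching what Cairo and LO emit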

Monday, March 20, 2023

The joy of font debugging

Remember how in the previous blog post it was said that creating text in PDF would be "just a matter of setting some parameters"?

Well, let's start by creating the main text in two justified columns.

Ok, nice. Next we add an author name.

Whoopsies. After some debugging one can find out that this only happens if you use fewer than 32 characters from the font you are subsetting. Obviously. But no matter, after fixing this minor niggle we only need to display the title, author names and finally an email address. This is more of the same so nothing can go wrong.

After some hours of debugging it becomes clear that the values of the left side bearings are sometimes read from the source file using incorrect offsets (while still being 100% memory safe, as all accesses are inside the source data). Good. Now that that's fully fix...

This is where things get extremely weird. No matter where you look or how deeply you peruse the binary data files, nothing seems to be incorrect. Maybe this is a bug in the Noto Mono font used here? So you try Liberation Mono. It fails too. And then, just to be sure, you try Ubuntu Mono. It works correctly. As does Free Mono.

Hmmmmmhmhm.

Opening the file in FontForge says that the width of all characters is 1228 font units. That is also what Freetype reports. Which is comforting, because in the TrueType file format struct fields that are designated as 32 bit integers might be either a) 32 bit integers, b) 26.6 fixed point or c) 16.16 fixed point. You can't ever really be sure which, though, because it depends on the values of bitfields far away from the actual structs themselves.

Things get even weirder if you test exporting a PDF that uses those broken fonts with either Cairo or LibreOffice. In the generated PDF files the width of the characters in this font is set to 600, not 1228. Trying to read their source to find out how and why they do this is problematic, because they support many different font formats and thus convert the input data to their internal representation and then generate the output from that. Trying to understand how the input data correlates with the output data can give you a major headache without even trying too hard.

The actual solution is even weirder than all of the above. TrueType fonts store the horizontal metrics of glyphs in a table called hmtx. It stores both the glyph advance and the left side bearing. As a special case you can specify only the latter and use a common value for the former. This provides space savings of UP TO 2 BYTES PER CHARACTER, but the downside is more complex parsing. Further, going through Freetype's public structs reveals that they contain a field called x_scale. After a lot of trial and error you can eventually decipher the actual algorithm needed:

If the character has both a glyph advance and a left side bearing defined, then you use them as-is, but if it only has a left side bearing defined, then you must divide the default width value by the scale.
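
In code the rule might look roughly like this. This is a sketch with invented helper names; num_h_metrics is numberOfHMetrics from the hhea table:

def glyph_advance(advances, glyph_id, num_h_metrics, x_scale):
    # Glyphs below numberOfHMetrics have an explicit advance in hmtx.
    if glyph_id < num_h_metrics:
        return advances[glyph_id]
    # The rest share the last stored advance, which must be un-scaled.
    return advances[num_h_metrics - 1] / x_scale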

Then finally.

Addendum: Freetype

Freetype has many flags and knobs for specifying whether you want metrics in original font coordinates or "output coordinates". I could not come up with a combination that would have provided consistent values; some postprocessing seems to always be needed. This might be a bug, it might be a limitation of the TrueType format, it might be something completely different. I don't really know, and I don't have the energy to dig further to uncover the underlying issue.

Thursday, March 16, 2023

The PDF text model is quite nice, actually

As was discussed earlier, the way PDF handles fonts and glyphs is arcane and tedious. It takes a lot of boilerplate and hitting your shins against sharp stones to get working. However once you do and can turn to the higher level text functionality, things become a lot nicer. (Right-to-left, vertical and calligraphic scripts might be more difficult, but I don't know any of those.)

The PDF text drawing model provides a fairly wide selection of text operations.

If you, for example, want to produce a paragraph of justified text, you first need to calculate how the text should be split into lines and the additional word and character spacing parameters needed. Then the text can be rendered with the following pseudocode:

  • Create a text object with the correct font and position.
  • Set spacing values for the current line.
  • Output the current line of text (add kerning manually if it is in a format Freetype does not handle)
  • Repeat the above two steps until the paragraph is done
  • Close the text object

This shifts the burden of having to compute each letter's exact location from you to the PDF renderer. If you need more precision than this, then you need to dig out Harfbuzz and draw the glyphs one by one at precomputed coordinates.
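
As a sketch, the operator stream for a two-line paragraph might be built like this; the font name, position and spacing values are invented:

# Emit a justified paragraph as raw PDF text operators.
lines = [('It was the best of times,', 1.2),
         ('it was the worst of times.', 0.0)]
ops = ['BT', '/F1 10 Tf', '1 0 0 1 72 720 Tm', '12 TL']
for text, word_spacing in lines:
    ops.append(f'{word_spacing:.2f} Tw')  # per-line word spacing
    ops.append(f'({text}) Tj T*')         # show line, move to the next
ops.append('ET')
print('\n'.join(ops))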

Sunday, March 12, 2023

First A4PDF release, version 0.1.0 "embarrassment"

The time has come to make the first technical preview release of A4PDF, nicknamed embarrassment. The name stems from this statement:

If you're not embarrassed by the first version of your product, you've launched too late.

It does not do much yet, but the basics are there to draw shapes, text and images using a plain C API:

A pkg-config file is also provided. There is also a Python wrapper to run scripts like these:

Distro packaging

People should probably not do any distro packaging yet, as the library is neither ABI nor even API stable. However, if someone wants to build deb packages, the source portion of the Debian control file would look something like this:

Source: a4pdf
Maintainer: Bob McBob <bob@example.org>
Section: misc
Priority: optional
Standards-Version: 3.9.2
Build-Depends: debhelper (>= 10),
  liblcms2-dev,
  libpng-dev,
  libjpeg-dev,
  libgtk-4-dev,
  libfmt-dev,
  libfreetype-dev,
  meson,
  python3-pil,
  fonts-noto-core,
  ghostscript

Saturday, March 11, 2023

My book is finally available for purchase

A major difference between software and book projects is that the latter have a point when they can be considered complete and finished. For my debut novel, that time has come.

The text block has been created with a "mini-LaTeX" DTP program that I wrote basically from scratch. This caused "fun" things to happen. For example, I got an email from the printing house some four days before the unveiling event saying that the book contained words that were not hyphenated according to recommended style guides. I was aware of said style guides, had added handling for them and even had unit tests to ensure that they worked. And yet in production they did not work. This led to a very stressful debugging session where you know that the only person in the world who can fix it is you, and that there is a very strict and personal deadline.

The actual PDF generation was done with Cairo and Pango. Surprisingly there were zero issues with them: the printer accepted the files just fine and the printout looks great. The cover was made with Scribus, and it did have several issues, none of which had anything to do with Scribus itself; doing a fully color managed print job is to this day a bit tricky. I did have to postprocess Cairo's output with Ghostscript, because Cairo only produces PDFs in the RGB colorspace whereas the printer required grayscale PDFs.

The "back blurb"

Humanity has managed to create the technology needed for interstellar travel, and civilizations from outer space have invited humans to visit. The people of Earth immediately begin work on creating a space ship suitable for the journey, with stylish appearance being their number one priority. Eventually the ship gets under way, commanded by an egomaniacal captain and staffed by a nerve-wracked crew. What they don't know is who they are actually going to meet, what they should do once they get there and why the ship has an ice rink.

[Name of book would go here, but I could not come up with a proper translation as the original is a pun] mixes classical space sci-fi, scientifically accurate technology and dark comedy into a hypergolic stew, whose blast wave nothing can survive intact — not even space sex.

Where to get it?

Every now and then people ask me how they could support Meson financially. Buying this book is by far the best way to do that at the current time. Yes, it is in Finnish, so most people reading this blog post can't comprehend it, but reading it is optional, you can just buy it to keep on your coffee table for maximal hipster street cred. :)

Finnish people who prefer getting their books via libraries can request it via online forms such as this one.

Sunday, March 5, 2023

The code functionality tipping point

Software development is weirdly nonlinear. When you start working on a new project at first it does not really do much. Adding more and more code does not seem to help. The "end user visible" functionality is pretty poor and it does not seem to get visibly better. You can do something, but nothing that would be actually useful.

This goes on for some amount of time that can't be predicted.

And then, unexpectedly, the pieces come together and useful functionality jumps from "almost nothing" to "quite a lot, actually".

Case in point. Up until yesterday a4pdf was pretty much useless. But today you can take this piece of Python code:

to produce a PDF that looks like this:


Monday, February 27, 2023

Unit testing PDF generation

How would you test PDF generation?

This turns out to be unexpectedly difficult because you need to check that the files are both syntactically and semantically valid. The former could be tested with existing PDF validators and converters but the latter is more difficult.

If you, say, try to render a red square, the end result should be that the PDF command stream has two commands: a re command and an f command. That could be verified simply by grepping the command stream with some regexps. It would not be very useful, though, as there is no guarantee that those commands actually produce a red square in the output document. There are dozens of ways to make the output stream not produce a red square in the intended location without breaking the file's "validity" in any way.

What even is the truth?

The only reliable way is to render the PDF file into an image and compare it to a ground truth image. If the output is "close enough", the generator can be said to have worked correctly. The problem, as is often the case, lies inside those quote marks. Fuzzy image equality is a difficult problem; those interested in the details can look it up online. For our case we'll just ignore it and require pixel perfect reproduction. This means that we can get test failures if we change the PDF rendering backend, run the tests on a different operating system or even just upgrade the backend to a new version.

The other problem comes from the desire to have a plain C API. Writing unit tests in C is cumbersome to say the least. Fortunately there is a simpler solution. Since the project ships its own Python bindings, we can write all of these tests in Python. This affords us all the niceties that exist in Python, such as an extensive unit testing framework, the ability to easily spawn external processes and image difference operators (via PIL). After some boilerplate, writing a unit test reduces to this:

@validate_image('python_simple', 480, 640)
def test_simple(self, ofilename):
    ofile = pathlib.Path(ofilename)
    with a4pdf.Generator(ofile) as g:
        with g.page_draw_context() as ctx:
            ctx.set_rgb_nonstroke(1.0, 0.0, 0.0)
            ctx.cmd_re(10, 10, 100, 100)
            ctx.cmd_f()

Behind the scenes this will generate the PDF, render it with Ghostscript and compare the result to an existing image. If the output is not bitwise identical the test fails.
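
The comparison itself is a few lines of PIL. A sketch, assuming both images exist and have the same size and mode:

from PIL import Image, ImageChops

def images_identical(path1, path2):
    # getbbox() returns None when the difference image is all zeros,
    # i.e. when the two images are bitwise identical.
    diff = ImageChops.difference(Image.open(path1), Image.open(path2))
    return diff.getbbox() is None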

Get the code

The project is now available here.

Friday, February 17, 2023

PDF output in images

Generating PDF files is mostly (but not entirely) a serialization problem where you keep repeating the following loop:

  • Find out what functionality PDF has
  • Read the specification to find out how it is expressed using PDF document syntax
  • Come up with some sort of an API to express same
  • Serialize the latter into the former
  • Debug

This means that you have to spend a fair bit of time without much to show for it apart from documents with various black boxes in them. However once you have enough foundational code, then suddenly you can generate all sorts of fun images. Let's look at some now.

Paths are easy to define with lines, Béziers and the like, as are path paint styles like line caps and joins. Choosing between the nonzero and even-odd winding rules is just a question of choosing a different paint operator.

PDF allows you to set any draw object as a "clipping path", which behaves like a stencil: subsequent drawing operations are only applied to pixels that are inside the specified clipping area. The painting model is uniform, and text and paths are mostly interchangeable, so text can be used as a clipping path. The gradient here is a PNG image, not a vector object.

This color wheel looks fairly average, but it is defined in L*a*b* color space. Did you know that PDF has native support for L*a*b* colors without needing any ICC profiles? I sure didn't until I read the spec.

And finally here are some shadings and patterns. The first two are your standard linear and spherical gradients, but the latter two are more interesting. In PDF you can specify a pattern, which is basically just a rectangular area. You can draw on it with almost all the same operators as on a page (you can't use patterns within patterns, though). You can then use said pattern to paint other objects and the PDF renderer will fill the space by tiling the pattern (yes, of course there is a transformation matrix you can specify). As text is not special you can draw a single character and fill it with a repeated instance of a different character.

Using it in Python

The code needed to generate an empty PDF document looks approximately like this:

import a4pdf
o = a4pdf.Options()
g = a4pdf.Generator('out.pdf', o)
with g.page_draw_context() as ctx:
    # Drawing commands would go here.
    pass

This snippet utilizes almost 100% of the API available thus far, so there's not much you can do with it yet.

Monday, February 13, 2023

Plain C API design, the real world Kobayashi Maru test

Designing APIs is hard. Designing good APIs that future people will not instantly classify as "total crap" is even harder. There are typically many competing requirements such as:

  • API stability
  • ABI stability (if you are into that sort of thing, some are not)
  • Maximize the amount of functionality supported
  • Minimize the number of functions exposed
  • Make the API as easy as possible to use
  • Make the API as difficult as possible to use incorrectly (preferably it should be impossible)
  • Make the API as easy as possible to use from scripting languages

Recently I have been trying to create a proper API for PDF generation so let's use that as an example.

Cairo, simple but limited

The API that Cairo exposes is on the whole pretty good. It has a fair number of functions, but only one main "painter", the Cairo context. Cairo is a general drawing library with many backends, but its drawing commands map very closely to the ones in PDF. This is probably because Cairo's drawing model is patterned after PostScript, which is almost the same as PDF. Having only one context type means that users do not have to manually keep track of lifetimes between different object types, which is the source of many C bugs.

This approach works nicely with Cairo but not so well if you want to expose the full functionality of PDF directly, specifically patterns. In PDF you can specify a "pattern object". The basic use case for it is if you need to draw a repeating shape, like a brick wall, by specifying how to draw a single tile and then telling the PDF interpreter to "fill in" the area you specify with this pattern. (Cairo also has pattern support which behaves mostly the same but is ideologically slightly different. We'll ignore those for the rest of this text.)

When defining a pattern you can use almost, but not exactly, the same drawing commands as when doing regular painting on page surfaces. There are also at least two different pattern types with slightly varying semantics. Since we want to expose PDF functionality directly, we need to have one function for each command, like pdf_draw_cmd_l(ctx, x, y) to draw a line. The question then becomes: how does one expose all this as types and functions?

Keep everything in a single object

The simplest thing objectwise would be to keep everything in a single god object and have functions like pdf_draw_page_cmd_l, pdf_draw_pattern1_cmd_l and pdf_draw_pattern2_cmd_l. This is a terrible API because everything is smooshed together and you need to remember to finish patterns before using them. Don't do this.

Fully separate object types

Another approach is to make each concept its own separate type. Then you can have functions like pdf_page_cmd_l(page, x, y), pdf_pattern_cmd_l(pattern, x, y) and so on. This also makes it easy to prevent using commands that are not supported. If, say, a command called bob is not supported on patterns, then all you have to do is to not implement the corresponding function pdf_pattern_cmd_bob.

The big downside is that there are a lot of drawing commands in PDF and in this approach almost all of them need to be defined three times, once for each context type. Their implementations are identical, so they all need to call a fourth function or the code needs to be triplicated.

A common context class

One approach is to abstract this into a PaintContext class that internally knows whether it is used for page or pattern painting. This reduces the number of functions back to one: pdf_ctx_cmd_l(ctx, x, y). The main downside is that now it is possible to accidentally call a function that requires a page drawing context with a pattern drawing context, and the type system will not stop you.

A second problem is that you can call the aforementioned bob command with a pattern context. The library needs to detect that and return an error code if it happens. What this means is that a bunch of functions that previously could not fail, can now return error codes. For consistency you might want to change all paint commands to return error codes instead, but then >90% of them never return anything except success.

A common base class

The "object oriented" way of doing this would be to have a common base class for the painting functionality and then inherit that. In this approach functions that can take any context would have names like pdf_ctx_cmd_l(ctx, x, y) wheres functions that don't get specializations like pdf_page_cmd_bob. Since C does not have any OO functionality this would need to be reimplemented from scratch, probably using some Gobject-style preprocessor macro hackery like pdf_ctx_cmd_l(PDF_CTX(page), x, y) or alternatively pdf_ctx_cmd_l(pdf_page_get_ctx(page), x, y). This works, but means a lot of typing for end users and macros are type unsafe even by C standards. If you use the wrong type, woe is you. Macros make providing wrappers harder because they require you to always compile some glue code rather than using something simple like Python's ctypes.

Is there a way to cheat?

I have not managed to come up with a way. Do let me know if you do.

Wednesday, February 8, 2023

More PDF, C API and Python

After a whole lot of bashing my head against the desk I finally managed to find out what Acrobat Reader's "error 14" means, and I managed to make both font subsetting and graphics generation work. Which means you can now do things like this:

After this, improving the code to handle full PDF graphics seems mostly to be a question of adding functions for all the primitives in the PDF graphics model. The big unknown is PDF form support, of which I know nothing. Being a newer design, it is probably a lot less convoluted than PDF fonts.

Dependencies

The code is a few thousand lines of C++ 20. It requires surprisingly few dependencies:

  • Fmt
  • Zlib
  • Libpng
  • Freetype
  • LittleCMS

Some of these are not actually necessary. Fmtlib will be in the standard library in C++23. Libpng is only used to load PNG images from disk; the library could require its users to load graphics themselves and pass images in as pixel arrays. Interestingly, doing font subsetting requires parsing the raw data of TrueType files by hand, so Freetype is not strictly mandatory, though it does make some things easier.

The only things you'd actually need are Zlib and LittleCMS. If one wanted to support CCITT Group 4 compression for 1 bit images, a dependency on libtiff would also be needed.

A plain C API

The unfortunate side of library development is that if you want your library to be widely used, you have to provide a plain C API. For PDF it's not all that bad, as you can mostly copy what Cairo does, since its C API is quite nice to use. You should design this early on, as making the C API as easy and reliable to use as possible has effects on how the internal architecture works. As an example, you should make all objects independent of each other. If the end user has to do things like "be sure to destroy all objects of type X before calling function F on object Y", then, because this is C, they are going to get it wrong and cause segfaults (at best).

Python integration

Once you have the C API, though, you can do all sorts of fun things, such as using Python's ctypes module. It takes a bit of typing and drudgery, but eventually you can create a "dependencyless" Python wrapper. With it you can do this to create an empty PDF file:

o = PdfOptions()
g = PdfGenerator(b"python.pdf", o)
g.new_page()

That's all you can do ATM, as these are the only methods exposed in the C API. Just implementing these made it very clear that the API is not good and needs to be changed.
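
To give an idea of what the wrapper does under the hood, here is a sketch; the C function names and fields are illustrative, not the real A4PDF API:

import ctypes

lib = ctypes.cdll.LoadLibrary('liba4pdf.so')
# Declaring argument and return types makes ctypes do the conversions.
lib.a4pdf_generator_new.argtypes = [ctypes.c_char_p, ctypes.c_void_p]
lib.a4pdf_generator_new.restype = ctypes.c_void_p

class PdfGenerator:
    def __init__(self, filename, options):
        # The C object is an opaque pointer owned by this wrapper.
        self.generator = lib.a4pdf_generator_new(filename, options.opt)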


Wednesday, February 1, 2023

PDF with font subsetting and a look in the future

After several days of head scratching, debugging and despair I finally got font subsetting working in PDF. The text renders correctly in Okular, goes through Ghostscript without errors and even passes an online PDF validator I found. But not Acrobat Reader, which chokes on it completely and refuses to show anything. Sigh.

The most likely cause is that the subset font generated during this operation is not 100% valid. The approach I use is almost identical to what LO does, but for some reason their PDFs work. Opening both files in FontForge seems to indicate that the .notdef glyph definition is somehow incorrect, but offers no help as to why.

In any case it seems like there would be a need for a library for PDF generation. Existing libs either do not handle non-RGB color spaces or are implemented in Java, Ruby or other languages that are hard to use from languages other than themselves. Many programs, like LO and Scribus, have their own libraries for generating PDF files. It would be nice if there could be a single library for that.

Is this a reasonable idea that people would actually be interested in? I don't know, but let's try to find out. I'm going to spend the next weekend at FOSDEM. So if you are going too and are interested in PDF generation, write a comment below or send me an email, I guess? Maybe we can have a shadow meeting in the cafeteria.

Wednesday, January 25, 2023

Typesetting an entire book part V: Getting it published

Writing a book is not that difficult. Sure, it is laborious, but if you merely keep typing away day after day, eventually you end up with a manuscript. Writing a book that is "good", or one that other people would want to read, is a lot harder. Still, even that is easy compared to trying to get a book published. According to various unreferenced sources on the Internet, out of all manuscripts submitted only 1 in 1000 to 1 in 10 000 gets accepted for publication. Probabilitywise this is roughly as unlikely as casting five dice and getting a six on all of them (1/6^5 ≈ 1/7800).

Having written a manuscript I went about trying to get it published. The common approach in most countries is that first you have to pitch your manuscript to a literary agent, and if you succeed, they will then try to pitch it to publishers. In Finland the procedure is simpler: anyone can submit their manuscript directly to book publishing houses without a middle man. While this makes things easier, it does not help with deciding how much the manuscript should be polished before submission. The more you polish, the bigger your chances of getting published, but the longer it takes and the more work you have to do if the publisher wants changes to the content.

Eventually I ended up with a sort-of-agile approach. I first gathered a list of all book publishers that had published sci-fi recently (there were not many). Then I polished the manuscript enough that it had no obvious deficiencies and sent it to the first publisher on the list. Then I did a full revision of the text and sent it to the next one, and so on, until I had sent it to all of them. Very soon thereafter I had received either a rejection email or nothing at all from each one.

It's not who you are, but who you know

Since the content did not succeed in selling itself, it was time to start using connections. I have known Pertti Jarla, who is one of Finland's most popular cartoonists, for several years. He runs a small scale publishing company. Its most famous book thus far has been a re-translated version of Philip K. Dick's Do Androids Dream of Electric Sheep. I reached out to him and, long story short, the book should be available in Finnish bookstores in a few weeks. The front cover looks like this.

More information in Finnish can be found on the publisher's web site. As for the obvious question of what the book's title would be in English, unfortunately the answer is "it's quite complicated to translate, actually". Basically the title says "A giant leap for mankind", but also not, and I have not managed to come up with a description or a translation that would not be a spoiler.

So you'll just have to wait for part VI: Getting translated. Which is an order of magnitude more difficult than getting published.