Tuesday, September 19, 2023

Circles do not exist

Many logos, drawings and other graphical designs contain the following shape. What is this shape?

If you thought: "Ah-ha! I'm smart and read the title of this blog post so I know that this is most definitely not a circle."

Well it is. Specifically it is a raster image of a circle that I created with the Gimp just for this use.

However almost every "circle" you can see in printed media (and most purely digital ones) is not, in fact, a circle. Why is this?

Since roughly the mid 80s all "high quality" print jobs have been done either in PostScript or, nowadays almost exclusively, in PDF. They use the same basic drawing model, which does not have a primitive for circles (or circle arcs). The only primitives they have are straight line segments, rectangles and Bézier curves. None of these can be used to express a circle accurately. You can only do an approximation of a circle but it is always slightly eccentric. The only way to create a proper circle is to have a raster image like the one above.

Does this matter in the real world?

For printing probably not. Almost nobody can tell the difference between a real circle and one that has been approximated with just four Bézier segments. Furthermore, the human vision system is a bit weird and perfect circles look vertically elongated. You have to make them non-circular for people to consider them properly circular.
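To put a number on "almost nobody": the common approach is one cubic Bézier per quarter circle with the control points pushed out by the constant kappa. Here is a small Python sketch (purely illustrative, not CapyPDF code) that measures how far that approximation strays from a true unit circle:

import math

# The usual control point offset for approximating a quarter circle with one
# cubic Bézier segment.
kappa = 4.0 / 3.0 * (math.sqrt(2) - 1)  # ~0.5523

def cubic(p0, p1, p2, p3, t):
    u = 1 - t
    return tuple(u**3 * a + 3*u**2*t * b + 3*u*t**2 * c + t**3 * d
                 for a, b, c, d in zip(p0, p1, p2, p3))

# First quadrant of a unit circle, from (1, 0) to (0, 1).
p0, p1, p2, p3 = (1, 0), (1, kappa), (kappa, 1), (0, 1)
worst = max(abs(math.hypot(*cubic(p0, p1, p2, p3, i / 1000)) - 1.0)
            for i in range(1001))
print(worst)  # ~0.00027, i.e. the radius is off by less than 0.03% at worst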

But suppose you want to use one of these things:

This is a laser cutter that takes its "print job" as a PDF file and uses its vector drawing commands to drive the cutting head. This means that it is impossible to use it to print a wheel. You'd need to attach the output to a lathe and sand it down to be round so it actually functions as a wheel rather than as a vibration source.

Again one might ask whether this has any practical impact. For this case, again, probably not. But did you know that one of the use cases PDF is being considered for (and, based on Internet rumors, already being used for) is as an interchange format for CAD drawings? Now it suddenly starts mattering. If you have any component where getting a really accurate circle shape is vital (like pistons and their holes), suddenly all your components are slightly misshapen. Which would not be fun.

Extra bonus information

Even though it is impossible to construct a path that is perfectly circular, PDF does provide a way to draw a filled circle. Here is the relevant snippet from the PDF 2.0 spec, subsection 8.5.3.2:

If a subpath is degenerate (consists of a single-point closed path or of two or more points at the same coordinates), the S operator shall paint it only if round line caps have been specified, producing a filled circle centred at the single point.
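In other words a content stream fragment along these lines should come out as a filled circle whose diameter is the line width. This is just a sketch I have not run through a validator, with the coordinates invented, written as a Python string so the operators can carry comments:

degenerate_circle = """
1 J           % round line caps
20 w          % line width, which becomes the circle's diameter
100 100 m     % moveto
100 100 l     % lineto the same coordinates -> degenerate subpath
S             % stroke, painted as a filled circle per 8.5.3.2
"""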

Who is willing to put money on the line that every PDF rendering implementation actually uses circles rather than doing the simple thing of approximating it with Béziers?

Sunday, September 10, 2023

A logo for CapyPDF

The two most important things about any software project are its logo and mascot. Here is a proposal for both for CapyPDF.

As you can probably tell I'm not a professional artist, but you gotta start somewhere. The original idea was to have a capybara head which is wearing the PDF logo much like a bow tie around its ear. The gist of it should come across, though it did look much better inside my brain. The PDF squiggle logo is hard to mold to the desired shape.

The font is Nimbus Sans, which is one of the original PostScript Core Fonts. More precisely it is a freely licensed metrically compatible version of Helvetica. This combines open source with the history of PDF quite nicely.

Friday, September 1, 2023

CapyPDF 0.5 is out

There are no actual release notes, but a bunch of stuff got added. Code is here. Many of these were things needed by Inkscape for its new CMYK PDF exporter. More info about that can be found on DoctorMo's YouTube channel.

Wednesday, August 16, 2023

The least convenient way of defining an alpha value

PDF's drawing model inherits from PostScript, which was originally designed in the early 80s. It works and is really nice, but the one glaring hole in it is the lack of transparency support. The original PDF spec from 1993 has no transparency support either; it was added in version 1.4, which came out a mere 8 years later in 2001.

Setting drawing operation colours in PDF is fairly straightforward. You list the values you want and call the color setting command, like this:

1.0 0.1 0.4 rg    % Sets RGB nonstroke colour
0.2 0.8 0.0 0.2 K % Sets CMYK stroke colour

The property name for alpha in PDF is CA and ca for stroking and nonstroking operations, respectively. This would imply that you should be able to do this:

0.9 ca  % Sets nonstroke alpha? [no]
0.2 CA  % Sets stroke alpha?    [no]

But of course you can't. Instead the alpha value can only be set via a Graphics State dictionary, which is a top level PDF object that you then need to create, use, link to the current page's used resources and all that jazz.

/Alpha01State gs % Set graphics state to the dictionary with the given name

This works fine for use cases such as compositing layers that do not themselves use transparency. In that case you render the layer to a separate XObject, create a graphics dictionary with the given composition parameters, use it and then render the layer XObject. It does not work at all in pretty much any other case.

The more observant among you might have already noticed that you need to create a separate graphics state dictionary for every transparency value used in your document. If you are, say, a vector graphics application and allow artists to adjust the transparency for each object separately using a slider (as you should), then it means that you may easily have hundreds or thousands of different alpha values in your document. Which means that you need to create hundreds or thousands of different transparency graphics state dictionaries if you want to preserve document fidelity.
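A PDF generator pretty much has to keep a little cache keyed by the alpha value. Roughly like this (a sketch with made-up helper names, not the actual CapyPDF API):

gs_cache = {}  # alpha value -> graphics state name

def gs_name_for_alpha(doc, alpha):
    # Hypothetical helper: emits << /Type /ExtGState /ca alpha /CA alpha >>
    # as a top level object and registers it in the page's /Resources.
    if alpha not in gs_cache:
        name = f"GS{len(gs_cache)}"
        doc.add_graphics_state(name, stroke_alpha=alpha, nonstroke_alpha=alpha)
        gs_cache[alpha] = name
    return gs_cache[alpha]

# The content stream then gets e.g. "/GS0 gs" before the drawing operation.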

Why is it implemented like this?

As is common, we don't really know (or at least I don't). The Finger of Speculation points towards the most usual of suspects: backwards compatibility.

Suppose the alpha command had been added to the PDF spec. It would mean that trying to open documents with those commands in older PDF renderers would cause crashes due to unknown commands. Yes, even if the document had specified a newer minimum PDF version, because PDF renderers would try to open and show them anyway rather than popping an error message saying "Document version is too new". OTOH adding new entries to a graphics state dictionary is backwards compatible because the renderer would just ignore keys it does not understand. This renders the document as if no transparency had ever been defined. The output is not correct, but at least mostly correct. Sadly the same can't be done in the command stream because it is a structureless stream of tokens following The One True Way of RPN.

If the hypothesis above is true, then the inescapable conclusion is that no new commands can be added to the PDF command stream. Ever.

Friday, July 21, 2023

Creating a PDF/X-3 document with CapyPDF

The original motivation for creating CapyPDF was understanding how fully color managed and print-ready PDF files are actually put together. A reasonable measure of this would be being able to generate fully spec conforming PDF/X-3 files. Most of the building blocks were already there so this was mostly a question of exposing all that in Python and adding the missing bits.

I have now done this and it seems to work. The script itself can be found in CapyPDF's Git repo and the generated PDF document can be downloaded using this link. It looks like this in Acrobat Pro.

It is not graphically the most flashy, but it requires quite a lot of functionality:

  • The document is A4 size but it is defined on a larger canvas. After printing it needs to be cut into the final shape. This is needed as the document has a bleed.
  • To achieve this the PDF has a bleed box (the blue box) as well as a trim box (the green box). These are not painted in the document itself; Acrobat Pro just displays them for visualisation purposes. The box arithmetic is sketched out after this list.
  • All colors in graphical operations are specified in CMYK.
  • The image is a color managed CMYK TIFF. It has an embedded ICC profile that is different from the one used in final printing. This is due to laziness as I had this image laying around already. You don't want to do this in a real print job.
  • The heading has both a stroke and a fill.
  • The printer marks are just something I threw on the page at random.
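For the record, here is roughly how the page boxes end up being computed. The margins are invented for this example, they are not the values used in the actual document:

mm = 72 / 25.4           # one millimetre in PDF points
a4_w, a4_h = 210 * mm, 297 * mm
bleed = 3 * mm           # assumed bleed amount
marks = 15 * mm          # assumed extra margin for the printer marks

# Boxes are [llx, lly, urx, ury], with the trim box centred on the larger canvas.
media_box = [0, 0, a4_w + 2 * marks, a4_h + 2 * marks]
trim_box  = [marks, marks, marks + a4_w, marks + a4_h]
bleed_box = [marks - bleed, marks - bleed,
             marks + a4_w + bleed, marks + a4_h + bleed]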
Here is a screenshot of the document passing PDF/X-3 validation in Acrobat Pro.

Sunday, July 16, 2023

PDF transparency groups and composition

The PDF specification has the following image as an example of how to do transparent graphics composition.

This seems simple but actually requires quite a lot of functionality:

  • Specifying CMYK gradients
  • Setting the blend mode for paint operations
  • Specifying transparency group xobjects
  • Specifying layer composition parameters
I had to implement a bunch of new functionality to get this working, but here is the same image reproduced with CapyPDF. (output PDF here)

The only difference here is that out of laziness I used a simple two color linear gradient rather than a rainbow one.

Nonportability

This is again one of those features that only Acrobat Reader can handle. I tried Okular, Ghostscript, Firefox, Chromium and Apple Preview and they all rendered the file incorrectly. There was no consistency, each one was broken in a different way.

Wednesday, July 5, 2023

CapyPDF 0.4 release and presenter tool

I have just released version 0.4 of CapyPDF. You can get it either via Github or PyPI. The target of this release was to be able to create a pure Python script that can be used to generate PDF slides to be used in presentations. It does not read any input, just always produces the same hardcoded output, but since the point of this was to create a working backend, that does not really matter.

  • The source code for the script is here. It is roughly 200 lines long.
  • The PDF it creates can be accessed here.
  • A screenshot showing all pages looks like the following.

What the screenshot does not show, however, is that the file uses PDF transition effects. They are used both for between-page transitions and for within-page transitions, specifically for the bullet points on page 2. This is, as far as I can tell, the first time within-page navigation functionality has been implemented in an open source package, possibly even the first time ever outside of Adobe (dunno if MS Office exports transitions, I don't have it so can't test it). As you can probably tell this took a fair bit of black box debugging, because all PDF tools and validators simply ignore transition dictionaries. For example transcoding PDFs with Ghostscript scrubs all transitions from the output.

The only way to see the transitions correctly is to open the PDF in Acrobat Reader and then enable presenter mode. PDF viewer developers who want to add support for presentation effects might want to use that as an example document. I don't guarantee that it's correct, though, only that it makes Acrobat Reader display transition effects.

Thursday, June 29, 2023

PDF and embedded videos

PDF has supported playing back video content since version 1.5. I could do the whole shtick and spiel routine of "surely this is working technology as the specification is over 20 years old by now". But you already know that this is not the case. Probably you came here just to see how badly broken things are. So here you go:

Time specification (or doing pointlessly verbose nested dictionaries before it was cool)

A video in PDF is a screen annotation. It is defined with a PDF object dictionary that contains many different subdictionaries. In addition to playing back the entire video you can also specify that only a subsection of it should be played. To do the latter you need an annotation dictionary with an action subdictionary of subtype rendition, which has a subdictionary of type rendition (no, that is not a typo), which has a media source subdictionary, which has three subdictionaries: a MediaClip and the start and end times.

There are three different ways of specifying the start and end times: time (in seconds), frames or bookmarks. We'll ignore the last one and look at frames first. As most people are probably not familiar with the PDF dictionary syntax, I'm going to write those out in JSON. The actual content is the same, it is just written in a different syntax. Here's what it looks like:

{
  "B": {
    "S": "F",
    "F": 1000
  }
}

Here we define the start time "B", which is a MediaOffset. It has a subtype that uses frames ("S" is "F") and finally we have the frame number of 1000. Thus you might expect that specifying the time in seconds would mean changing the value of S to T and the key–value pair "F" -> 1000 to something like "V" -> 33.4. Deep in your heart you know that not to be true, because the required format is this:

{
  "B": {
    "S": "T",
    "T": {
      "S": "S",
      "V": 33.4
    }
  }
}

Expressing the time in seconds requires a whole new dictionary just to hold the number. The specification says that the value of that dictionary's "S" must always be "S". Kinda makes you wonder why they chose to do this. Probably to make it possible to add other time formats in the future. But in the 20 years since this specification was written no such functionality has appeared. Even if it did, you could easily support it by adding a new value type in the outer dictionary (something like "T2" in addition to "T").

Most people probably have heard the recommendation of not doing overly general solutions to problems you don't have yet. This is one of the reasons why.

Does it work in practice?

No. Here's what happened when I tried using a plain H.264 MP4 file (which is listed as a supported format on Adobe's site and which plays back in Windows and macOS using the system's native video player).

Okular


Instead of a video, Okular displays a screenshot of the user's desktop with additional rendering glitches. Those are probably courtesy of bugs in either DRM code or the graphics card's device driver.

Evince

Evince shows the first frame of the video and even a player GUI, but clicking on it does nothing and the video can't be played.

Apple Preview

Displays nothing. Shows no error message.

Firefox

Displays nothing. Shows no error message.

Chromium

Displays nothing. Shows no error message.

Acrobat Reader, macOS

Plays back the video correctly, but only if you play it back in its entirety. If you specify a subrange it instead prints an error message saying that the video can't be presented "in the way the author intended" and does nothing.

Acrobat Reader, Windows

Video playback works. Even the subrange file that the macOS version failed on plays and the playback is correctly limited to the time span specified.

But!

Video playback is always wonky. Either the video is replaced with a line visualisation of the audio track or the player displays a black screen until the video stream hits a key frame and plays correctly after that.

In conclusion

Out of the major PDF implementations, 100% are broken in one way or another.

Friday, June 23, 2023

PDF subpage navigation

A common presentation requirement is that you want to have a list of bullet points that appear one by one as you click forward. Almost all PDF presentations that do this fake it by having multiple pages, one for each state. So if you have a presentation with one page and five bullet points, the PDF has six pages, one for the empty state and a further one for each bullet point appearing.

This need not be so. The PDF specification has a feature called subpage navigation (PDF 2.0 spec, 12.4.4.2). This kind of makes you wonder. Why are people not using this clearly useful functionality? After trying to implement it in CapyPDF the answer became obvious. Quite quickly.

While most of the PDF specification is fairly readable, this section is not. It is very confusing. Once you get over the initial bafflement you realize that the spec is actually self-contradictory. I was so badly confused that eventually I filed a bug report against the PDF specification itself.

That does not really help in the short term so I did the only thing that can be done under these circumstances, namely looking at how existing software handles the issue. The answer turned out to be: they don't. The only PDF viewer that does anything with subpage navigation is Acrobat Reader. Even more annoyingly just getting your hands on a PDF document that has subpage navigation defined is next to impossible. None of the presentation software I tried exports subpage navigation tags. Trying to use a search engine leads to disappointment as well. They won't let you search for "a PDF whose internal data structures contain the word PresSteps" but instead will helpfully search for "PDF files whose payload text contains the word PresSteps". As you might expect this finds various versions of the PDF specification and nothing much else.

The most probable reason why PDF subpage navigation is not used is that nobody else is using it.

How does it actually work then?

Nobody knows for real. But after feeding Acrobat Reader a bunch of sample documents and seeing what it actually does we can create a hypothesis.

The basic operations used are optional content groups and navigation nodes. The former are supported by all PDF viewers, the latter are not. Basically each navigation node represents a state and when the user navigates forwards or backwards the PDF viewer runs a specified action that can be used to either hide or display an optional content group. A reasonable assumption for getting this working, then, would be to set the default visibility of all optional content groups to hidden and then have transitions that enable the bullet points one by one.

That does not work. Instead you have to do this:

Writing arbitrary state machines as graphs just to make bullet points appear? We're in the big leagues now! The reason the root node exists is that when you enter a page, the root node's transition is performed automatically (in one part of the spec but not another, hence the self-contradiction). Note especially how there is no forward navigation for state 2 or backwards navigation for state 0. This is intentional; that is how you get the subpage navigation algorithm to leave the current page. I think. It kinda works in Acrobat Reader but you have to press the arrow key twice to leave the page.
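For what it's worth, here is the shape of the data as I currently understand it, written out as Python dicts rather than PDF objects. Treat every detail here as a guess backed only by poking at Acrobat Reader:

def show(ocg):  # SetOCGState action that turns an optional content group on
    return {"S": "SetOCGState", "State": ["ON", ocg]}

def hide(ocg):  # and the corresponding one that turns it off
    return {"S": "SetOCGState", "State": ["OFF", ocg]}

# The page's /PresSteps entry points at the root node.
nodes = {
    "Root":   {"Type": "NavNode", "Next": "State0"},   # run automatically on page entry (maybe)
    "State0": {"Type": "NavNode", "Next": "State1",
               "NA": show("Bullet1")},                  # no Prev -> going back leaves the page
    "State1": {"Type": "NavNode", "Next": "State2", "Prev": "State0",
               "NA": show("Bullet2"), "PA": hide("Bullet1")},
    "State2": {"Type": "NavNode", "Prev": "State1",
               "PA": hide("Bullet2")},                  # no Next -> going forward leaves the page
}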

On the other hand you could implement a full choose-your-own-adventure book as a single page PDF using only subpage navigation.

Wednesday, June 14, 2023

Functionality currently implemented in CapyPDF

CapyPDF has a fair bit of functionality and it might be difficult to tell from the outside what works and what does not. Here is a rough outline of implemented functionality.

In the public C API (and Python)

  • Basic draw commands in RGB, gray and CMYK
  • ICC profile support
  • Loading PNG, JPG and TIFF images (including CMYK TIFFs)
  • Embed JPG files directly without unpacking first
  • Using images as a paint mask
  • Handle color profiles embedded in images
  • Using builtin PDF fonts
  • Using TrueType fonts with font subsetting
  • All PDF font operators like extra padding, raise/lower and set as clipping path
  • Page transitions

Implemented but not exposed

The items listed here vary in implementation completeness from "sort of ready, just not exposed yet" to "the M-est of MVPs". You can only get to them by poking at the innards directly.

  • Document navigation tree
  • Overprinting
  • Structured and annotated PDF
  • Additional color channels (called separations in the PDF spec)
  • Form XObjects
  • File embedding
  • Annotations (only a few types)
  • L*a*b* color space support
  • ICC colors in primitive paint operations
  • Type 2, 3 and 4 shadings (i.e. gradients)
  • Color patterns

Saturday, June 10, 2023

A4PDF has been renamed to CapyPDF

As alluded to in the previous post, A4PDF has changed its name. The new project name is CapyPDF. The name refers to capybaras.


Original picture from here. I was in the process of drawing a proper mascot logo, but work on that stalled. Hopefully it'll get done at some point.

There is a new release, but it does not provide new functionality, just the renaming and some general code cleanups.

Tuesday, May 30, 2023

A4PDF release 0.2.0

I have just tagged release 0.2.0 of A4PDF, the fully color managed PDF generation library.

There are not that many new exposed features added in the public API since 0.1.0. The main goal of this release has been to make the Python integration work and thus the release is also available on PyPI. Due to reasons the binary wheel is available only for Windows.

Trying it out

On Linux you probably want to do a Git checkout and run it from there.

On Windows the easiest way is to install the package via Pip.

On macOS you can't do anything, because Apple's compiler toolchain is too old to build the code.

What functionality does it provide?

There is no actual documentation, so the best bet is to look at the unit test file. There is a lot more functionality in the C++ code, but it is not exposed in the public API yet. These include things like (basics of) tagged PDF generation, annotations and external file embedding.

Impending name change

There is an official variant of PDF called PDF/A. There are several versions of it, including PDF/A-4. I did not know that when deciding on the name. Because having a library called A4PDF that produces PDF/A-4 files is confusing, the name needs to be changed. The new name has not been decided yet, suggestions welcome.

Wednesday, May 24, 2023

Advanced dependency management and building Python wheels with Meson

One of the most complex pieces of developing C and C++ programs (and most other languages) is dependency management. When developing A4PDF I have used Ubuntu's default distro dependencies. This is very convenient because you typically don't need to fiddle with getting them built and they are battle tested and almost always work.

Unfortunately you can't use those in most dev environments, especially on Windows. So let's see how much work it takes to build the whole thing on Windows using only Visual Studio and bundle it into a Python wheel so it can be installed and distributed. I would have wanted to also put it on PyPI but currently there is a lockdown caused by spammers so no go on that front.

Seems like a lot of effort? Let's start by listing all the dependencies:

  • fmt
  • Freetype
  • LibTIFF
  • LibPNG
  • LibJPEG (turbo)
  • LittleCMS2
  • Zlib
These are all available via WrapDB, so each one can be installed by executing a command like the following:

meson wrap install fmt

With that done Meson will automatically download and compile the dependencies from source. No changes need to be made to the main project's meson.build files. Linux builds will keep using system deps as if nothing happened.

Next we need to build a Python extension package. This is different from a Python extension module, as the project uses ctypes for Python <-> C interop. Fortunately thanks to the contributors of Meson-Python this comes down to writing an 18 line toml file. Everything else is automatically handled for you. Installing the package is then a question of running this command:

pip install .

After a minute or two of compilation the module is installed. Here is a screenshot of the built libraries in the system Python's site-packages.

Now we can open a fresh terminal and start using the module.

Random things of note

  • Everything here uses only Meson. There are no external dependency managers, unix userland emulators or special terminals that you have to use.
  • In theory this could work on macOS too, but the code is implemented in C++23 and Apple's toolchain is too old to support it.
  • The build definitions for A4PDF take only 155 lines (most of which is source and program name listings).
  • If, for whatever reason, you can't use WrapDB, you can host your own.

Saturday, May 20, 2023

Annotated PDF, HTML, exporters

If one were to compare PDF to HTML, one interesting thing that comes up fairly quickly is that their evolutions have been the exact opposites of each other.

HTML was originally about structure, with its h1 and p and ul tags and the like. Given this structure the web browser was then (mostly) free to lay out the text however it saw fit given the current screen size and browser window orientation. As usage grew the appearance of pages became more and more important and thus they had to invent a whole new syntax for specifying how the semantic content should be laid out visually. This eventually became CSS and the goal became, roughly, "pixel perfect layout everywhere".

PDF was originally just a page description language, basically a sequence of vector and raster paint operations. Its main goal was perfect visual fidelity, as one of the main use cases was professional printing. The document had no understanding of what it contained. You either read it as it was on the monitor or, more often, printed it out on paper and then read it. As PDF's popularity grew, new requirements appeared. First people wanted to copypaste text from PDF documents (contrary to what you might expect, this is surprisingly tricky to implement). Then they needed to integrate it with screen readers and other accessibility tools, reformat the text to read it on mobile devices and so on. The solution was to create a from-scratch syntax in the PDF data stream, alongside the existing drawing operations, to express the semantic contents of the document in a machine-readable format. This became known as tagged PDF, introduced in PDF 1.4.

The way they chose to implement it was to add new commands to the PDF data stream that are used to wrap existing draw operations. For example, drawing a single line of text looks like this:

(Hello, world!) Tj

Adding structural information means writing it like this instead:

/P << /MCID 0 >>
  BDC
    (Hello, world!) Tj
  EMC

and then adding a bunch of bookkeeping information in document metadata dictionaries. This is a bit tedious, but since you can't really write PDF files by hand, you implement this once in the PDF generation code and then use that.

Why so many PDF exporters?

There are many PDF generator libraries. Almost every "serious" program, like Scribus and LibreOffice, has its own bespoke library. This is not just because developers love reimplementing things from scratch. The underlying issue is that PDF as a specification is big, and generating valid PDFs has very specific semantic requirements that require the application developer to understand PDF's document model down to the finest detail.

Existing PDF generator libraries either ignore all of this and just do basic RGB graphics (Cairo) or provide their own document model which they can then convert to PDF constructs (HexaPDF). The latter works nicely for simple cases, but it requires you to map your app's internal document structure to the library's document structure so that it in turn maps to PDF's document structure in the way you need it to. Often it becomes simpler to cut out the middle man and generate PDF directly.

The way A4PDF aims to solve the issue is to stay low level enough. It merely exposes PDF primitives directly and ensures that the syntax of the generated PDF is correct. It is up to the caller to ensure that the output is semantically correct. This should allow every application to easily and directly map its document output to PDF in just the way it wants (color managed or not, annotated or not and so on).

Thursday, May 11, 2023

The real reason why open source software is better

The prevailing consensus at the current time seems to be that open source software is of higher quality than corresponding proprietary software. Several reasons have been put forth for why this is. One main reason given is that with open source any programmer in the world can inspect the code and contribute fixes. Closely tied to this is the fact that it is simply not possible to hide massive blunders in open source projects, whereas behind closed walls it is trivial.

All of these and more are valid reasons for improved quality. But there are other, more sinister reasons that are usually not spoken of. In order to understand one of them, we need to first do a slight detour.

Process invocation on Windows

In this Twitter thread Bruce Dawson explains an issue he discovered while developing Chrome on Windows. The tl/dr version:

  • Invoking a new process on Windows takes 16 ms (in the thread other people report values of 30-60 ms)
  • 10 ms of this is taken by Windows Defender that scans the binary to be executed
  • Said executables are stored in a Defender excluded directory
  • 99% of the time the executable is either the C++ compiler or Python
  • The scan results are not cached, so every scan except the first one is a waste of time and energy
  • The Defender scanner process seems to be single threaded making it a bottleneck (not verified, might not be the case)
  • In a single build of Chrome, this wastes over 14 minutes of CPU and wall clock time
Other things of note in the thread and comments:

  • This issue was reported to Microsoft decades ago
  • The actual engineers working on this can't comment officially but it is implied that they want to fix it but are blocked by office politics
  • This can only be fixed by Microsoft and as far as anyone knows, there is no work being done despite this being a major inconvenience affecting a notable fraction of developers

What is the issue?

We don't know for sure due to corporate confidentiality reasons, but we can make some educated guesses. The most probable magic word above is "office politics". A fairly probable case is that somewhere in the middle damagement chain of Microsoft there is a person who has decreed that it is more important to create new features than to spend resources on this as the code "already works", and then sets team priorities accordingly. Extra points if said person insists on using a MacBook as their work computer so their MBA friends won't make fun of them at the country club. If this is true then what we have is a case where a single person is making life miserable for tens (potentially hundreds) of thousands of people without really understanding the consequences of their decision. If this were up to the people on "the factory floor", the issue would probably have been fixed or at least mitigated years ago.

To repeat: I don't know if this is the case for Microsoft, so the above dramatization is conjecture on my part. On the other hand I have seen exactly this scenario play out many times behind the scenes of various corporations. For whatever reason this seems to be a common occurrence in nonpublic hierarchical organisations. Thus we can postulate why open source leads to better code in the end.

In open source development the technologically incompetent can not prevent the technologically competent from improving the product.

Shameless self-promotion

If you enjoyed this text and can read Finnish, you might enjoy my brand new book in which humanity's first interplanetary space travel is experienced as an allegory of a software startup. You can purchase it from at least these online stores: Link #1, Link #2. Also available at your local library.

Appendix: the cost and carbon footprint

Windows 11 has a bunch of helpful hints on how to reduce your carbon footprint. One of these is a warning if your screen blanking timeout is less than the computer suspend timeout. At the same time they have in the core of their OS the gross inefficiency discussed above. Let us compare and contrast the two.

The display on a modern laptop consumes fairly little power. OTOH running a CPU flat out takes a lot of juice. Suspending earlier saves power when consumption is at its lowest, whereas virus scanners add load at the point of maximum resource usage. Many people close the lid of their computer when not using it, so they would not benefit that much from different timeout settings. For developers avoiding process invocations is not possible; that is what the computer is expected to do.

Even more importantly, this also affects every single cloud computer running Windows, including every Windows server, CI pipeline and, well, all of Azure. All of those machines are burning oil recomputing pointless virus checks. It is left as an exercise to the reader to compute how much energy has been wasted in, say, the last ten years of cloud operations over the globe (unless Microsoft runs Azure jobs with virus scanners disabled for efficiency, but surely they would not do that). Fixing the issue properly would take a lot of engineering effort and risk breaking existing applications, but MS would recoup the investment through electricity savings from their own Azure server operations alone fairly quickly. I'm fairly sure there are ex-Googlers around who can give them pointers on how to calculate the exact break-even point.

All of this is to say that having said energy saving tips in the Windows UI is roughly equivalent to a Bitcoin enthusiast asking people to consider nature before printing their emails.

Saturday, April 29, 2023

The unbearable tightness of printing

Let's say you want to print a full colour comic book in the best possible quality. For simplicity we'll use this image as an example.

As you can probably guess, just putting this image in a PDF does not work, even if it had sufficient resolution. Instead what you need to do is create two images: one for the linework, which is monochrome and has at least 600 PPI, and one for the colours, which is typically a 300 PPI colour managed CMYK TIFF.

The colour image is drawn first and then the monochrome image is drawn on top of it. In this way you get both smooth colours and crisp linework. Most people would stop here, but this is where the actual work begins. It is also where things start to wander into undocumented (or, rather, "implementation defined") territory.

Printing in true black

In computer monitors the blackest colour possible is when all colour components are off, or (0, 0, 0) in RGB values. Thus you might expect that the blackest CMYK colour is either (0, 0, 0, 1) or (1, 1, 1, 1). Surprisingly it is neither. The former looks grayish when printed whereas the latter can't be printed at all because of physical limitations. If you put too much ink in one place on the page, the underlying paper gets too wet, warps and might even rip. And tear.

Instead what you need to do is to use a colour called rich black. Each print shop has their own values for this, as the exact amount of inks to use to get the deepest black colour is dependent on the inks, paper and printing machine used. We'll use the value (0.1, 0.1, 0.1, 1.0) for rich black in this text.

Thus we need three different images rather than two.

First the colour image is laid down, then the image holding the areas that should be printed in rich black. The latter is a 300 PPI colour image with the colour value (0.1, 0.1, 0.1, 0) on pixels that should be painted with rich black. Finally the linework is drawn on top of the other two. The first two images can be combined into one. This is usually done by graphic artists when preparing their artwork for print. However the middle image can be automatically generated from the linework image with some Python, so we're doing that to reduce manual work and the possibility of human error.

If you create a PDF with these images you are still not done. In fact the output would be identical to the previous setup. There are still more quirks to handle.

Trapping and overprinting

Since all of the colours are printed separately they are subject to misregistration. That is, the various colours might shift relative to each other during the printing process. This causes visual artifacts at the edges between two colours. This is a fairly complicated topic, Wikipedia has more details. The issue can be fixed by trapping, that is, "spreading" the colour under the "edge" of the linework. Like so:

If you look closely at the middle image, the gray area is slightly smaller than in the previous picture. This shrunk image can be automatically generated from the linework image with morphological erode/dilate operations. Now we have everything needed to print things properly, but if you actually try it, it still won't work.
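As an aside, the shrinking step mentioned above is nearly a one-liner if you have Pillow at hand. A sketch, with the file names invented:

from PIL import Image, ImageFilter

# Black linework on a white background, loaded as an 8 bit grayscale image.
linework = Image.open("linework.png").convert("L")
# MaxFilter grows the white areas, i.e. the black areas shrink ("choke"),
# so the rich black plate stays safely under the linework even if the
# plates shift a little relative to each other during printing.
richblack_area = linework.filter(ImageFilter.MaxFilter(3))
richblack_area.save("richblack_mask.png")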

The way the PDF imaging model works is that if you draw on the canvas with any colour, all colour channels of the existing colour on the page get affected. That is, if the existing colour on the canvas is (0.1, 0.1, 0.1, 0) and you draw on top of it with (0, 0, 0, 1) the output is (0, 0, 0, 1). All the work we did getting the proper rich black colour under the linework gets erased as if it was never there.

PDF has a feature called overprinting to handle this exact case (you could also use the "multiply" blend mode, but it requires the use of transparency, which is still prohibited in some workflows). It does pretty much what it says on the tin. When overprinting is enabled, draw operations accumulate over the existing inks. Thus the final step is to enable overprinting for the final linework image and then Bob's your uncle?

In theory yes. In practice lol no, because this part of the PDF specification is about as hand-wavy as things go. There are several toggles that affect how overprinting gets handled. What they actually do is only explained in descriptive text. One of the outcomes of this is that every single generally available PDF viewer renders the output incorrectly. Poppler, Ghostscript, Apple Preview and even Adobe Acrobat Reader all produce outputs that are incorrect in different ways. They don't even warn you that the PDF uses overprinting and that the output might be incorrect. This makes development and debugging this use case somewhat challenging.

The only way to get correct output is to use Adobe Acrobat Pro and tell it to enable overprint simulation. Fortunately I have a friend who has a 10 year old version (remember, back when you could actually buy software once and keep using it as opposed to a monthly license that can get yanked at any time?). After pestering him with an endless flow of test PDFs I finally managed to work out the exact steps needed to make this work:

  • Create a 300 PPI image with the colours, a 300 or 600 PPI monochrome image with the rich black areas and a 600 PPI monochrome image for the linework (the rich black image can be autogenerated from the linework image and/or precomposited in the colour image)
  • Load and draw the colour image as usual
  • Load the rich black image and store it as a PDF ImageMask rather than a plain image
  • Set nonstroke colour to (0.1, 0.1, 0.1, 0), set the rich black image as a stencil and fill it
  • Load the linework image as an imagemask
  • Enable overprinting
  • Set overprint mode (OPM) to 1
  • Set nonstroke colour to (0, 0, 0, 1)
  • Draw the line image as a stencil
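Put together, the relevant part of the content stream ends up looking roughly like this. The numbers and resource names are invented and /GSop is assumed to be an ExtGState dictionary containing /OP true, /op true and /OPM 1:

content = """
q
  595 0 0 842 0 0 cm   /ColourImage Do     % draw the CMYK colour image
Q
q
  0.1 0.1 0.1 0 k                          % rich black, without any K ink yet
  595 0 0 842 0 0 cm   /RichBlackMask Do   % fill it through the image mask
Q
q
  /GSop gs                                 % overprinting on, overprint mode 1
  0 0 0 1 k                                % pure K for the linework
  595 0 0 842 0 0 cm   /LineworkMask Do    % draw the linework as a stencil
Q
"""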

If you deviate from any of the above steps, the output will be silently wrong. If you process the resulting PDF with anything except Adobe's tool suite the end result might become silently wrong. As an example here is the output of colour separation using Adobe Acrobat and Ghostscript.

Acrobat has preserved the rich black values under the linework whereas Ghostscript has instead cleared the colour value to zero losing the "rich" part of black. Interestingly Ghostscript seems to handle overprinting correctly in basic PDF shape drawing operations but not in stencil drawing operations.

Or maybe it does and Acrobat is incorrect here. The only way to know for sure would be to print test samples on a dozen or so commercial offset printing presses, inspecting the plates manually and then seeing what ends up on paper. Sadly I don't have the resources for that.

Sunday, April 16, 2023

PDF forms, the standard that seemingly isn't

Having gotten the basic graphical output of A4PDF working I wanted to see if I could make PDF form generation work.

This was of course a terrible idea but sadly I lacked foresight.

After a lot of plumbing code it was time to start defining form widgets. I chose to start simple and create a form with a single togglable check button. This does not seem like an impossibly difficult problem and the official PDF specification even has a nice code sample for this:

The basic idea is simple. You define the "widget" and give it two "state objects" that contain PDF drawing operations. The idea is that the PDF renderer will draw one of the two on top of the basic PDF document depending on whether the checkbox is toggled or not. The code above sets things up so that the appearance of the checkbox is one of two different DingBat symbols. Their values are not shown in the specification, but presumably they are a checked box and an empty square.

I created a test PDF with LibreOffice's form designer and then set about trying to recreate it. LO's form generator uses OpenSymbol for the checked status of the checkbox and an empty appearance for the off state. A4PDF uses the builtin Helvetica "X" character. The actual files can be downloaded here.

What we have here is a failure to communicate

No matter how much I tried I could not make form generation actually work. The output was always broken in weird ways I could not explain. Unfortunately this part of the PDF spec is not very helpful, because it does not give full examples, only snippets, and search engines are worthless at finding any technical content when using "PDF" as a search term. It may even be that information about this is not available on public web sites. Who knows?

Anyhow, when regular debugging does not work, it's time to approach things sideways. Let's start by opening the LibreOffice test document with Okular:

This might seem to be working just fine, but people with sharp eyes might notice a problem. That check mark is not from OpenSymbol. FWICT it is the standard Qt checkbox widget. Still, the checkbox works and its appearance is passable. But what happens if you increase the zoom level?

Oh dear. That "Yes" text is the PDF-internal label given to the "on" state. Why is it displayed? No idea. It's time to bring out the heavy guns and see how things work in The Gold Standard of PDF Rendering, Adobe Reader.

Nope, that's not the OpenSymbol checkmark either. Adobe Reader seems to be ignoring the spec and drawing its own checkmarks instead. After seeing this I knew I had to try this on every PDF renderer I could reasonably get my hands on. Here's the results:

  • Okular
    • LO: incorrect appearance, breaks when zooming
    • A4PDF: shows both the "correct" checkmark as well as the Qt widget on top of each other, takes a noticeable amount of time after clicking until the widget state is updated
  • Evince
    • LO: does not respond to clicks
    • A4PDF: works correctly
  • Adobe Reader win64
    • LO: incorrect appearance
    • A4PDF: incorrect appearance, does not always respond to button clicks
  • Firefox
    • LO: Incorrect appearance
    • A4PDF: Incorrect appearance
  • Chromium
    • LO: Incorrect appearance
    • A4PDF: works correctly
  • Apple Preview:
    • LO: works correctly (though the offset is a bit wonky, probably an issue in the drawing commands themselves)
    • A4PDF: works correctly

The only viewer that seems to be working correctly in all cases is Apple Preview.

PDF has a ton of toggleable flags and the like to make things invisible when printing and so on. It is entirely possible that the PDF files are "incorrect" in some way. But still, either the behaviour should be the same in all viewers or they should report format errors. No errors are reported, though, even by this online validator.

Monday, April 3, 2023

Some details about creating print-quality PDFs

At its core, PDF is an image file format. In theory it is not at all different from the file formats of Gimp, Krita, Photoshop and the like. It consists of a bunch of raster and vector objects on top of each other. In practice there are several differences, the biggest of which is the following:

In PDF you can have images that have different color spaces and resolutions (that is, PPI values). This is by design as it is necessary to achieve high quality printing output.

As a typical example, comic books that are printed in color consist of two different images. The "bottom" one contains only the colors and is typically 300 PPI. On top of that you have the black linework, which is a 1 bit image at 600 or even 1200 PPI. Putting both the linework and colors in the same image file would not work. In the printout the lines would be fuzzy, even if the combined image was 1200 PPI.

A deeper explanation can be found in the usual places, but the short version is that these two different image types need to be handled in completely opposite ways to make them look good when printed. When converting color images to printing plates the processing software prioritizes smoothness. On the other hand for monochrome images the system prioritizes sharpness. Doing this wrong means getting either color images that are blocky or linework that is fuzzy.

When working on A4PDF it was clear from the start that it needs to be able to create PDF files that can be used for commercial quality print jobs. To test this I wrote a Python script that recreates the cover of my recently published book originally typeset with Scribus. The end result was about 100 lines of code in total.

The background image

The main image is a single file without any adornments. It was provided by the illustrator as a single 8031 by 5953 image file. A fully color managed workflow demands the image to be in CMYK format and have a corresponding ICC color profile. There is basically only one file format that supports this use case: TIFF. Interestingly the specification for this file format was finalized in 1992. It is left as an exercise to the reader to determine how many image file formats have been introduced since that time.

A4PDF extracts the embedded ICC profile and stores it in the PDF file. It could also convert the image from the image's ICC colorspace to the specified output color space if they are different, but currently does not.

Text objects

All text color is white and is specified in CMYK colorspace, though it could also be specified in DeviceGray. Defining any object in RGB (even if the actual color was full white) could make the printing house reject the file as invalid and thus unsuitable for printing.

The author name in the front cover uses PDF's character spacing to "spread out" the text. The default character spacing in this font is too tight for use in covers.

PDF can only produce horizontal text. Creating vertical text, as on the spine, requires you to modify the drawing state's transformation matrix stack. In practice this is almost identical to OpenGL, though the PostScript drawing model that PDF uses predates OpenGL by 8 years or so. In this case the text needs a rotate + translate.
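Concretely, the spine text needs a transformation matrix along these lines (the insertion point is made up for this example):

import math

# PDF matrices are [a b c d e f]; rotation by theta counter-clockwise is
# [cos sin -sin cos 0 0], with e and f doing the translation.
theta = math.radians(90)
tx, ty = 300, 50    # hypothetical insertion point on the spine
matrix = [math.cos(theta), math.sin(theta),
          -math.sin(theta), math.cos(theta), tx, ty]
# In the content stream this becomes "0 1 -1 0 300 50 cm" (cos 90° is zero),
# emitted before the text object that draws the spine text.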

The bar code

Ideally this should be defined with PDF's vector drawing operations. Unfortunately that would require me to implement reading SVG files somehow. It turned out to be a lot less effort to export the SVG from Inkscape at 600 PPI and then convert that to a 1 bit image with the Gimp. The end result is pretty much the same.

This approach works in Scribus as well, but not in LibreOffice. It converts all 1 bit images to 8 bit grayscale, meaning that the output might be fuzzy when printed. LO used to do this correctly but the behaviour was intentionally changed at some point.

The logo

This is the logo used in the publisher's sci-fi books. You can probably guess that the first book in the series was Philip K. Dick's Do Androids Dream of Electric Sheep?

Like the bar code, this should optimally be defined with vector operations, but again for simplicity a raster image is used instead. However it is different from the bar code image in that it has an alpha channel so that the background starfield shows through. When you export a 1 bit image that has an alpha channel to a PNG, Gimp writes it out as an indexed image with 4 colors (black opaque, white opaque, black transparent, white transparent). A4PDF detects files of this type and stores them in the PDF as 1 bit monochrome images with a 1 bit alpha channel.

This is something even Scribus does not seem to handle correctly. In my testing it seemed to convert these kinds of images to 8 bit grayscale instead.

Saturday, April 1, 2023

Got the Star Trek - The Motion Picture Director's Edition box set? You might want to check your discs

TL/DR

Star Trek The Motion Picture The Complete Adventure box set claims to contain a special, longer cut of the film. However it seems that this is not the case for some editions. The British edition does contain the longer cut, but the Scandinavian one seems not to. The back of the box still claims that the box set does contain the longer cut.

The claim and evidence

This is the box set in question.

At the back of the box we find the following text:

I bought this at the end of last year but I could not find the longer edition anywhere in the menus. I mentioned this to a friend of mine who has the same box set and he had found it immediately. After a lot of debugging we discovered that he has the British edition of the box set whereas I have the Scandinavian one.

The British box set, which has the longer edition, consists of the following discs:

  • EU151495BLB Blu-ray The Director's Edition Bonus Disc
  • EU151496ULB 4K UltraHD The Director's Edition Feature Film, Special Features
  • EU151460ULB 4K UltraHD Feature Film, Special Features (Special Longer Version, Theatrical Version)
  • EU151496BLB Blu-ray The Director's Edition Feature Film, Special Features
  • EC150885BLB Blu-ray Feature Film, Special Features

My Scandinavian box set has the following discs:

  • eu151495blb bonus disc
  • eu151496ulb director's edition 4k ultrahd, Feature film
  • eu150884ulb 4k, regular version, feature film
  • eu151496blb blu-ray, director's edition
  • eu150885blb blu-ray, regular version, feature film

The only differences here are that one disc has a serial number with letters EC instead of EU and that the British edition has this disc:

Note how it says "special longer version" in microscopic letters under the Paramount logo. This text does not appear on any disc in the Scandinavian edition.

The Scandinavian edition does not have this disc. Instead it has the disc with the product id EU150884ULB whereas the corresponding British disc has the product id EU151460ULB. The missing content cannot be on any of the other discs, because they are the same in both editions (the EU/EC issue notwithstanding).

I reported this to the store I bought the box set from. After a lot of convincing they agreed to order a replacement box set. When it arrived we inspected the contained discs and they were the same as in the box set I already had. This would imply that the entire print run is defective. After more convincing we managed to get the Finnish importer to report the issue to the distributor in Sweden.

They eventually replied that the longer edition is on "the extra disc in the set that is in a cardboard sleeve". This is not the case, as that disc contains only extras, is the same in the British box set and, further, is a regular Blu-ray, not a 4K UHD one as it should be and as it is in the British edition.

We reported all this back up the chain and have not heard anything back for several weeks now.

What to do if you have a non-British (and presumably non-US) edition of this box set?

Check whether your discs contain the special longer edition. The British disc has this fairly unambiguous selection screen for it when you insert the disc:


If your box set does not contain this version of the film, report the issue to where you bought the box set from. If this really is a larger issue (as it would seem), the more bug reports they get the faster things will get fixed.

The film store said that they have sold tens of these box sets and that I was the first to notice the issue, so don't assume that other people have already reported it. This is not an issue with my particular physical copy, because the replacement box set was defective in the same way.

Speculation: why is it broken?

The British disc that has the special longer edition does not contain Finnish subtitles (or possibly any non-English languages). When the box was being assembled someone found this out, decided that they can't ship a disc without localisation and replaced the disc with a regular version of the film that does not have the longer cut. But they did not change the back of the box, which states that the box set contains the longer edition which it does not seem to have.

Wednesday, March 22, 2023

In which you find out that everything you assumed was wrong

The previous post listed an algorithm for converting TrueType glyph advances into PDF glyph widths. It worked, but was weird, kludgy, complicated and all around bad. It was also completely wrong. As in "not even remotely in the direction of the correct solution" but it happened to work by accident with the examples I had. Trying it with some more fonts yields the now familiar error.

What makes debugging this issue harder is that Freetype exposes the same information multiple times. There are three or four different places where glyph advancement can be read and multiple multipliers that could potentially be used. No combination of these provides a value that would be even relatively close to the correct one.

So let's work backwards. What is the multiplier that we need? A single example says that a glyph with a font advance of 1229 should have a PDF width value of 600. Therefore 1229/600 = 2.048333, so we should divide by roughly two.

Waitaminute!

Hmmmmm.

After some debugging one can find a struct entry that is set to 1000 for fonts that work and 2048 for those that don't: face->units_per_EM. This scaling value is arbitrary, but most (though not all) Adobe Type 1 fonts use 1000 while most (though not all) TrueType fonts use 2048. PDF uses 1000, probably to match Type 1 conventions. This is even documented in the PDF specification 1.7, section 9.2.4, page 241: "For all font types except Type 3, the units of glyph space are one-thousandth of a unit of text space". This is incredibly easy to miss when reading the 700+ pages of text in the specification. I know because I did it. Several times.
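Written out, the conversion from the example above is just this:

units_per_em = 2048   # face->units_per_EM for the misbehaving fonts
advance = 1229        # glyph advance in font units

pdf_width = advance * 1000 / units_per_em   # PDF glyph space is 1/1000 of text space
print(round(pdf_width))                     # 600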

Monday, March 20, 2023

The joy of font debugging

Remember how in the previous blog post it was said that creating text in PDF would be "just a matter of setting some parameters"?

Well let's start by creating the main text in two justified columns.

Ok, nice. Next we add an author name.

Whoopsies. After some debugging one can find out that this only happens if you use fewer than 32 characters from the font you are subsetting. Obviously. But no matter, after fixing this minor niggle we only need to display the title, author names and finally an email address. This is more of the same so nothing can go wrong.

After some hours of debugging it becomes clear that the values of left side bearings are sometimes read from the source file using incorrect offsets (while still being 100% memory safe, all accesses are inside the source data). Good. Now that that's fully fix...

This is where things get extremely weird. No matter where you look or how deeply you peruse the binary data files, nothing seems to be incorrect. Maybe this is a bug in the Noto Mono font used here? So you try Liberation Mono. It fails too. And then, just to be sure, you try Ubuntu Mono. It works correctly. As does Free Mono.

Hmmmmmhmhm.

Opening the file in Fontforge says that the width of all characters is 1228 font units. That is also what Freetype reports. Which is comforting, because in the TrueType file format struct fields that are designated as 32 bit integers might be either a) a 32 bit integer, b) 26.6 fixed point or c) 16.16 fixed point. You can't ever really be sure which, though, because it depends on the values of bitfields far and away from the actual structs themselves.

Things get even weirder if you test exporting a PDF that uses those broken fonts with either Cairo or LibreOffice. In the generated PDF files the width of characters in this font is set to 600. Not 1228. Trying to read their source to find out how and why they do this is problematic, because they support many different font formats and thus convert the input data to their internal representation and then generate the output from that. Trying to understand how the input data correlates with the output data can give you a major headache without even trying too hard.

The actual solution is even weirder than all of the above. TrueType fonts store the horizontal metrics of glyphs in a table called hmtx. It stores both the glyph advance and left side bearings. As a special case you can specify only the latter and use a common value for the former. This provides space savings of UP TO 2 BYTES PER CHARACTER but the downside is more complex parsing. Further, going through Freetype's public structs reveals that they contain a field called x_scale. After a lot of trial and error you can eventually decipher the actual algorithm needed:

If the character has both glyph advance and left side bearings defined then you use them as-is, but if it only has left side bearings defined then you must divide the default width value by the scale.

Then finally.

Addendum: Freetype

Freetype has many flags and knobs to specify whether you want metrics in original font coordinates or "output coordinates". I could not come up with a combination that would have provided consistent values, some postprocessing seems to always be needed. This might be a bug, it might be a limitation of the TrueType format, it might be something completely different. I don't really know, and I don't have the energy to dig further to uncover the underlying issue.

Thursday, March 16, 2023

The PDF text model is quite nice, actually

As was discussed earlier, the way PDF handles fonts and glyphs is arcane and tedious. It takes a lot of boilerplate and hitting your shins against sharp stones to get working. However once you do and can turn to the higher level text functionality, things become a lot nicer. (Right-to-left, vertical and calligraphic scripts might be more difficult, but I don't know any of those.)

The PDF text drawing model provides a fairly wide selection of text operations.

If you, for example, want to produce a paragraph of justified text, first you need to calculate how the text should be split into lines and the additional word and character scaling parameters needed. Then the text can be rendered with the following pseudocode (a concrete content stream sketch follows after the list):

  • Create a text object with the correct font and position.
  • Set spacing values for the current line.
  • Output the current line of text (add kerning manually if it is in a format Freetype does not handle)
  • Repeat the above two steps until the paragraph is done
  • Close the text object
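As a concrete sketch, one justified line in the content stream might look like this (all values invented):

line_ops = """
BT
  /F1 10 Tf                                  % font and size
  12 TL                                      % leading, used by the ' operator
  72 700 Td                                  % start of the paragraph
  0.8 Tw                                     % word spacing computed for this line
  (A line of justified text goes here.) Tj
  0.35 Tw                                    % the next line gets its own value
  (And the next one continues the paragraph.) '
ET
"""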
This shifts the burden of having to compute each letter's exact location from you to the PDF renderer. If you need more precision than this, then you need to dig out Harfbuzz, and draw the glyphs one by one to precomputed coordinates.

Sunday, March 12, 2023

First A4PDF release, version 0.1.0 "embarrasment"

The time has come to make the first technical preview release of A4PDF, nicknamed embarrasment. The name stems from this statement.

If you're not embarrassed by the first version of your product, you've launched too late.

It does not do much yet, but the basics are there to draw shapes, text and images using a plain C API:

A pkg-config file is also provided. There is also a Python wrapper to run scripts like these:

Distro packaging

People should probably not do any distro packaging yet, as the library is neither ABI nor even API stable. However if someone wants to build deb packages, the source portion of the Debian control file would look something like this:

Source: a4pdf
Maintainer: Bob McBob <bob@example.org>
Section: misc
Priority: optional
Standards-Version: 3.9.2
Build-Depends: debhelper (>= 10),
  liblcms2-dev,
  libpng-dev,
  libjpeg-dev,
  libgtk-4-dev,
  libfmt-dev,
  libfreetype-dev,
  meson,
  python3-pil,
  fonts-noto-core,
  ghostscript