Wednesday, February 8, 2023

More PDF, C API and Python

After a whole lot of bashing my head against the desk I finally found out what Acrobat Reader's "error 14" means and managed to make both font subsetting and graphics generation work. Which means you can now do things like this:

After this, improving the code to handle full PDF graphics is mostly a question of adding functions for all the primitives in the PDF graphics model. The big unknown is PDF form support, of which I know nothing. Being a newer design, it is probably a lot less convoluted than PDF fonts were.

Dependencies

The code is a few thousand lines of C++20. It requires surprisingly few dependencies:

  • Fmt
  • Zlib
  • Libpng
  • Freetype
  • LittleCMS
Some of these are not actually necessary. Fmt's functionality is arriving in the standard library (std::format in C++20, std::print in C++23). Libpng is only used to load PNG images from disk; the library could instead require its users to load graphics themselves and pass images in as pixel arrays. Interestingly, font subsetting requires parsing the raw data of TrueType files by hand, so Freetype is not strictly mandatory, though it does make some things easier.

The only things you'd actually need are Zlib and LittleCMS. If you wanted to support CCITT Group 4 compression for 1-bit images, you'd also need a dependency on libtiff.

A plain C API

The unfortunate side of library development is that if you want your library to be widely used, you have to provide a plain C API. For PDF this is not all that bad, as you can mostly copy what Cairo does; its C API is quite nice to use. You might want to design this part early, because making the C API as easy and reliable to use as possible affects how the internal architecture works. For example, you should make all objects independent of each other. If the end user has to follow rules like "be sure to destroy all objects of type X before calling function F on object Y", then, because this is C, they are going to get it wrong and cause segfaults (at best).

Python integration

Once you have the C API, though, you can do all sorts of fun things, such as using Python's ctypes module. It takes a bit of typing and drudgery, but eventually you can create a "dependencyless" Python wrapper. With it you can do this to create an empty PDF file:

o = PdfOptions()
g = PdfGenerator(b"python.pdf", o)
g.new_page()

That's all you can do ATM, as these are the only methods exposed in the C API. Just implementing these made it very clear that the API is not good and needs to be changed.
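
For the curious, the wrapper itself is little more than ctypes boilerplate. A minimal sketch of what it might look like (the library name and the C entry points below are hypothetical placeholders, not the real exported symbols):

import ctypes

# Hypothetical library and function names, for illustration only.
lib = ctypes.cdll.LoadLibrary("libpdfgen.so")

lib.pdfgen_options_new.restype = ctypes.c_void_p
lib.pdfgen_generator_new.restype = ctypes.c_void_p
lib.pdfgen_generator_new.argtypes = [ctypes.c_char_p, ctypes.c_void_p]
lib.pdfgen_generator_new_page.argtypes = [ctypes.c_void_p]
lib.pdfgen_generator_destroy.argtypes = [ctypes.c_void_p]

class PdfOptions:
    def __init__(self):
        self.handle = lib.pdfgen_options_new()

class PdfGenerator:
    def __init__(self, filename, options):
        # filename is bytes, e.g. b"python.pdf"
        self.handle = lib.pdfgen_generator_new(filename, options.handle)

    def __del__(self):
        lib.pdfgen_generator_destroy(self.handle)

    def new_page(self):
        lib.pdfgen_generator_new_page(self.handle)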


Wednesday, February 1, 2023

PDF with font subsetting and a look in the future

After several days of head scratching, debugging and despair I finally got font subsetting working in PDF. The text renders correctly in Okular, goes through Ghostscript without errors and even passes an online PDF validator I found. But not Acrobat Reader, which chokes on it completely and refuses to show anything. Sigh.

The most likely cause is that the subset font that gets generated during this operation is not 100% valid. The approach I use is almost identical to what LO does, but for some reason their PDFs work. Opening both files in FontForge seems to indicate that the .notdef glyph definition is somehow not correct, but offers no help as to why.

In any case it seems there is a need for a standalone PDF generation library. Existing libs either do not handle non-RGB color spaces or are implemented in Java, Ruby or other languages that are hard to use from languages other than themselves. Many programs, like LO and Scribus, have their own libraries for generating PDF files. It would be nice if there were a single library for that.

Is this a reasonable idea that people would actually be interested in? I don't know, but let's try to find out. I'm going to spend next weekend at FOSDEM. So if you are going too and are interested in PDF generation, write a comment below or send me an email, I guess? Maybe we can have a shadow meeting in the cafeteria.

Wednesday, January 25, 2023

Typesetting an entire book part V: Getting it published

Writing a book is not that difficult. Sure, it is laborious, but if you merely keep typing away day after day, eventually you end up with a manuscript. Writing a book that is "good", or one that other people would want to read, is a lot harder. Still, even that is easy compared to trying to get a book published. According to various unreferenced sources on the Internet, out of all manuscripts submitted only 1 in 1000 to 1 in 10 000 gets accepted for publication. Probability-wise this is roughly as unlikely as casting five dice and getting a six on all of them.

Having written a manuscript I went about trying to get it published. The common approach in most countries is that first you have to pitch your manuscript to a literary agent, and if you succeed, they will then pitch it to publishers. In Finland the procedure is simpler: anyone can submit their manuscript directly to book publishing houses without a middleman. While this makes things easier, it does not help with deciding how much the manuscript should be polished before submission. The more you polish, the bigger your chances of getting published, but the longer it takes and the more work you have to do if the publisher wants to make changes to the content.

Eventually I ended up with a sort-of-agile approach. I first gathered a list of all book publishers that have published scifi recently (there were not many). Then I polished the manuscript enough so it had no obvious deficiencies and sent it to the first publisher on the list. Then I did a full revision of the text and sent it to the next one and so on. Eventually I had sent it to all of them. Very soon thereafter I had received either a rejection email or nothing at all from each one.

It's not who you are, but who you know

Since the content did not succeed in selling itself, it was time to start using connections. I have known Pertti Jarla, who is one of Finland's most popular cartoonists, for several years. He runs a small scale publishing company, whose most famous book thus far has been a re-translated version of Philip K. Dick's Do Androids Dream of Electric Sheep. I reached out to him and, long story short, the book should be available in Finnish bookstores in a few weeks. The front cover looks like this.

More information in Finnish can be found on the publisher's web site. As for the obvious question of what the book's title would be in English, unfortunately the answer is "it's quite complicated to translate, actually". Basically the title says "A giant leap for mankind", but also not, and I have not managed to come up with a description or a translation that would not be a spoiler.

So you'll just have to wait for part VI: Getting translated. Which is an order of magnitude more difficult than getting published.

Tuesday, January 17, 2023

PDF, text and fonts, a design by The Devil & Associates

PDF has a fairly straightforward model for rendering text. Basically you specify a font to use, define a transformation matrix if needed and specify the text string to render. The PDF renderer will then load the embedded font file, extract the curves needed for your letters and render them on canvas. And it works quite nicely.
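
In concrete terms, the text painting part of a page's content stream looks something like the sketch below (given as Python string building for brevity). BT and ET bracket a text object, Tf selects a font by the name it was given in the page's resource dictionary, Td positions the cursor and Tj paints the string:

# A sketch of the operators; not a complete PDF, just the content stream.
def show_text(font, size, x, y, text):
    return (f"BT\n"                # begin a text object
            f"{font} {size} Tf\n"  # select font and size
            f"{x} {y} Td\n"        # position the text cursor
            f"({text}) Tj\n"       # paint the string
            f"ET\n")               # end the text object

print(show_text("/F1", 12, 72, 720, "Hello"))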

Assuming you are using plain ASCII. Because 127 glyphs should be enough for everybody.

If, for some weirdo reason, you belong to that ~80% minority population of planet Earth whose native tongue uses characters beyond ASCII, you are not going to be in a happy place. The whole situation is nicely summarized in this Computerphile video.

It may be comforting to know that nothing has changed since 1993. Dealing with fonts and text in PDF is still very much a case of voodoo incantations and guesswork. It does not help that phrases containing words like "PDF", "text" and "fonts" are completely ungoogleable, because they return a million hits about creating PDF files with existing tools as opposed to how font embedding works under the hood.

How does one create text with PDF?

What most people generating PDF would want is to be able to specify a font file, a starting location and the text to render in UTF-8. Some people might want finer control over text rendering and instead provide an array of glyph ids and coordinates. Cairo provides both of these. PDF provides neither. Even the ASCII subset PDF does provide is mostly useless by itself, for the following reasons.

Let's start with font metadata. When you embed a font inside a PDF file you also need to provide the width of each character as a PDF array in the font specification dictionary. This data is exactly the same as is in the font file itself. The PDF renderer must know how to parse said files to get the glyph curves out but the specification still mandates that every PDF creator program must replicate this data. Don't believe me? This is what the PDF specification document says in subsection 9.3.4:

The width information for each glyph shall be stored both in the font dictionary and in the font program itself.

The two sets of widths shall be identical.

Why is this? A reasonable assumption is that the person who was tasked with adding TrueType font support at Adobe back in the day got lazy and just insisted that font embedders must provide this information in PDF metadata format because the code to process those was already implemented. But it gets worse. The glyph width data is specified with three items, FirstChar, LastChar and Widths. The last of these contains the widths of all glyphs between the two endpoints. This means that if you wanted to create a PDF document containing the single word παν語 (the stylised form of "Pango") then you'd need an array with some 35 thousand entries, all but four of which would be unused. You still need to define them, or at least, you would need to if PDF text worked this way. It does not. We'll come back to this later.
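
To make the redundancy concrete, here is a sketch of building that metadata for the four glyphs of παν語. The width values are made up; the point is the size of the array:

# Character codes of π, α, ν and 語 with made-up widths.
used = {0x03C0: 690, 0x03B1: 722, 0x03BD: 610, 0x8A9E: 1000}
first_char, last_char = min(used), max(used)
# Every code between FirstChar and LastChar needs an entry, used or not.
widths = [used.get(code, 0) for code in range(first_char, last_char + 1)]
print(f"/FirstChar {first_char} /LastChar {last_char}")
print(len(widths))  # 34542 entries, four of them meaningful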

As a palate cleanser, let's note that even though PDF documentation always speaks of glyph widths, this array does not contain glyph widths. What it actually contains is glyph advances, but given that the PDF text model predates TrueType font technology we can let that one pass.

What we can't ignore, though, is kerning. If you want your text to look good, you have to do proper kerning and the font file has all the information necessary to do that. PDF does not read that (and, as far as I can tell, you can't make it do so) but instead it requires that you specify kerning inline in your text rendering command. This means that in practice you can't tell PDF to render more than one glyph of text and expect it to do the right thing. Instead you have to manually process each glyph and add kerning directives between each pair of characters as necessary.
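
In practice this means emitting TJ arrays, where strings are interleaved with manual kerning adjustments expressed in thousandths of an em (a positive value pulls the following glyph closer). A sketch with a made-up kerning value:

# Build a TJ array from (string, kern_after) pairs.
def kerned_text(pairs):
    parts = []
    for text, kern in pairs:
        parts.append(f"({text})")
        if kern:
            parts.append(str(kern))
    return "[ " + " ".join(parts) + " ] TJ"

print(kerned_text([("A", 80), ("V", 0)]))  # [ (A) 80 (V) ] TJ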

The magic 255

Creating PDF text with glyph ids and locations seems like a reasonable approach then. The PDF documentation even says that you can specify glyph ids directly with one of two different quoting methods: octal numbering or UTF-16BE values. This works fine until you try to specify a glyph number bigger than 255, at which point it fails incomprehensibly. I spent days trying to debug this. There are several different indices that could be used, mapped to each other and so on. I could not for the life of me figure out a combination that worked. Every attempt to use glyph indices failed. There is no useful documentation for this; you basically have to read the code of existing PDF creators and examine their output with a hex editor or bespoke tools. The various PDF validators I tried were not very helpful either, because their error messages were of the type "this PDF file is not valid in some way lol ;-)".
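
For reference, here is roughly what the two quoting forms look like for a glyph id above 255, neither of which worked for me:

# Glyph id 0x0158 as a hex string and as octal escapes in a literal string.
def glyph_as_hex(gid):
    return f"<{gid:04X}> Tj"             # <0158> Tj

def glyph_as_octal(gid):
    hi, lo = gid >> 8, gid & 0xFF
    return f"(\\{hi:03o}\\{lo:03o}) Tj"  # (\001\130) Tj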

Eventually I tried creating a LO document that had more than 255 unique glyphs and exported that. The generated PDF file had two different subsets of the used font. The first had 255 characters and the second one had the rest. Cairo's PDF output is roughly similar. This leads me to believe that, contrary to what the documentation implies, it is not possible to have more than 255 glyphs in a single font in PDF. Or, if it is possible, then the way you go about getting that done is so bizarrely complicated that nobody has managed to get it working. Instead what you have to do is to create font subsetting code from scratch that packs all used glyphs into blocks of 255, map each input glyph to these packed ids, and write a cmap file so that the PDF renderer can convert the packed glyph ids back to Unicode values. You can't even use Freetype to do the subsetting, since it does not provide font creation functionality, only reading and rendering. Instead you get to do binary data mangling by hand. In big endian. Obviously.
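
A sketch of the packing scheme, assuming (as the LO and Cairo output suggests) at most 255 usable glyph slots per subset, with id 0 reserved for .notdef. Each subset then becomes its own embedded font with its own cmap:

class GlyphPacker:
    """Maps original glyph ids to (subset number, id within subset) pairs."""
    def __init__(self):
        self.mapping = {}

    def pack(self, glyph_id):
        if glyph_id not in self.mapping:
            n = len(self.mapping)
            # 255 usable slots per subset, id 0 reserved for .notdef.
            self.mapping[glyph_id] = (n // 255, n % 255 + 1)
        return self.mapping[glyph_id]

p = GlyphPacker()
print(p.pack(1000))  # (0, 1): first subset, first usable slot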

If someone reading this knows for sure whether all of the above is true or if there is a simpler way (converting text to curves by hand does not count), do let me know in the comments.

Thursday, December 29, 2022

A quantitative analysis of the Trade Federation's blockade of Naboo

The events of Star Wars Episode I The Phantom Menace are based around a blockade that the Trade Federation holds over the planet Naboo. The details are not explained in the source material but it is assumed that this means that no ship can take off or land on the planet. The blockade is implemented by having a fleet of heavily armed star ships around the planet. What we would like to find out is what sort of an operation this blockade was.

In this analysis we stick only with primary sources, that is, the actual video material. Details on the blockade are sparse. The best data we have is this image.

This is not much to work with, but let's start by estimating how high above the planet the blockade is (assuming that all ships are roughly the same distance from the planet). In order to calculate it from this image we need to know four things:

  1. The diameter of the planet
  2. The observed diameter of the planet on the imaging sensor
  3. The physical size of the image sensor
  4. The focal length of the lens
The gravity on Naboo seems to match that of the Earth pretty closely so we'll use an estimate of 6000 km for the planet's radius. Unfortunately we don't know what imaging systems were in use a long time ago in the galaxy far, far away so to get anywhere we have to assume that space aliens use the same imaging technology we have. This gives a reasonable estimate of 35 mm film paired to a 50 mm lens. Captured image width on 35 mm film is 22 mm, so we'll use that value for sensor width. Width is used instead of height to avoid having to deal with possible anamorphic distortions.

Next we need to estimate the planet's observed size on the imaging sensor. This requires some manual curve fitting in Inkscape.

Scaling until the captured image is 22 mm wide tells us that the planet's observed diameter is 30 mm. Plugging these numbers into the relevant equations tells us that the blockade is 2⋅6000⋅50/30 = 20⋅10³ km away from the planet's center. We call this radius r₁. It is established in multiple movies that space ships in Star Wars can take off and land anywhere on a planet. When the queen's ship escapes they have to fight their way through the blockade which would indicate that it covers the entire planet, otherwise they could have avoided the blockade completely just by changing their trajectory to fly through an area where there are no Trade Federation vessels.

How many ships would this require? In order to calculate that we'd need to know how to place an arbitrary number of points on a sphere so that they are all equidistant. There is no formula for that (or at least that is what I was told at school, did not verify) so let's do a lower bound estimate. We'll assume that the blockade ships are at most 10 km apart. If they were any further, the queen's ship would have had no problems flying between the gaps. Each ship thus covers a circular area whose radius is 10 km. We call this r₂. Assuming perfect distribution of blockade vessels we can compute that it takes A₁/A₂ = π⋅(r₁)²/(π⋅(r₂)²) = (r₁)²/(r₂)² = (20⋅10³)²/10² = 4⋅10⁶ or 4 million ships to fully cover the area.
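
If you want to check the arithmetic, here it is as a few lines of Python:

planet_radius = 6000      # km, assumed Earth-like
focal_length = 50         # mm
observed_diameter = 30    # mm, as measured in Inkscape
r1 = 2 * planet_radius * focal_length / observed_diameter
r2 = 10                   # km, the circle covered by one ship
ships = (r1 / r2) ** 2
print(f"{r1:.0f} km, {ships:.0f} ships")  # 20000 km, 4000000 ships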

This is not a profitable operation. Even if each ship had a crew complement of only 10, it would still mean having an invasion force of 40 million people just to operate the blockade. There is no way the tax revenue from Naboo (or any adjoining planets, or possibly the entire galaxy) could even begin to cover the costs of this operation.

The equatorial hypothesis

An alternative approach would be that space ships in the Star Wars universe can't launch from anywhere on the planet, only from equatorial sites taking advantage of the boost given by the planet's rotation.

In this case the blockade force would only need to cover a narrow band over the equator. It would need to block it wholly, however, to prevent launches from spaceports all around the planet. Using the numbers above we can calculate that having a ring of ships 10 km apart at blockade height takes approximately 2⋅π⋅r₁/r₂ = 2⋅π⋅20⋅10³/10 ≈ 13 000 ships. This is a bit more feasible but not sufficient, because any escaping ship could avoid the blockade by flying 10 km above or below the equatorial plane. Thus the blockade must have height as well, and at 13 000 ships per 10 km of height a 100 km tall blockade would take 130 000 ships and a 1000 km one 1.3 million. This is better but still not economically feasible.

The alternative planet size hypothesis

In the film Qui-Gon Jinn and Obi-Wan Kenobi are given a submersible called a bongo and told to proceed through the planet's core to get to Theed. The duration of this journey is not given so we have to estimate it somehow. The journey takes place during a military offensive and not much seems to have happened during it, so we assume that it took one hour. Based on visual inspection the bongo seems to travel at around 10 km/h. These measurements imply that Naboo's diameter is in fact only 10 km.

Plugging these numbers into the formulas above tells us that in this case the blockade is at a height of 16 km and would need to guard a surface area of roughly 900 km². The ship count estimation formula breaks down in this case, as it says that it only takes 3 ships to cover the entire surface area. In any case this area could be effectively blocked with just a dozen ships or so. That would be feasible, and it would explain other things, too.

If Naboo really is this kind of a supermassive mini planet it most likely has some rare and exotic materials on it. Exporting those to other parts of the galaxy would make financial sense and thus taxing them would be highly profitable. This would also explain why the Trade Federation chose to land their invasion force on the opposite side of the planet. It is as far from Theed's defenses as possible. This makes it a good place to stage a ground assault since moving troops to their final destination still only takes at most a few hours.

Friday, December 23, 2022

After exactly 10 years, Meson 1.0.0 is out

The first ever commit to the Meson repository was made 10 years ago to this day. To celebrate we have just released the long-awaited version 1.0.

The original design criterion for doing a 1.0 release was "when Meson does everything GStreamer needs". This happened, checks notes, three years ago (arguably even earlier than that). That is not the world's fastest reaction time, but that comes mostly down to our development policy. Meson aims to make releases at a steady pace and maintains backwards compatibility fairly well (not perfectly). There is unlikely to ever be a big breaking change, so there is no pressing technical need to bump the major version number.

Thus 1.0 is mostly a symbolic milestone rather than a technical one; end users should not really notice that big of a difference. This does not mean that the release is any less important, though. To celebrate, here is an assortment of random things that have happened over the years. Enjoy.

The greatest achievement

Perhaps the best example demonstrating the maturity of Meson is that I no longer make all the decisions. Most decisions, and especially the code that goes with them, come from a diverse group of people. I do very little of the actual development nowadays; I'm more of a product owner of sorts who can nudge the project in certain directions rather than turn the entire ship around on a dime. This is a bit sad, but absolutely necessary for the long term survival of the project. It means that if one of those malevolent buses that seem to stalk software developers succeeded in hitting me, its effect on the project would not be all that big.

Reimplementation

There are two main reasons for reimplementing an existing open source project from scratch. The first one is that the upstream developer is a jerk and people don't want to work with them. The second is that someone, somewhere sees the project as important enough to warrant a second, independent implementation. I'm happy to report that (as far as I know at least) Meson is in the second group, because there is a second from-scratch implementation of Meson called Muon.

Meson is implemented in Python, but from the very start the design treated that as just an implementation detail. We spent a fair bit of effort ensuring that the Python bits don't leak into the DSL, even by accident. There wasn't really any way of being sure about that short of doing a second implementation, and now there is one: Muon is implemented in plain C.

Collaborative design

We all know that disagreeing with other people on the Internet can be very discouraging. However sometimes it works out incredibly well, such as in this merge request. That MR was really the first time a new feature was proposed where the submitter had a very different idea of what the API should look like than I did. I distinctly remember feeling anxious about it at the time, because I basically had to tell the submitter that their work would not be accepted.

To my surprise everything went really well. Even though there were many people involved and they had wildly different ideas on how to get the feature done, there was no pouting, no stomping of feet, no shouting or the like (which, for the record, there had been in other similar discussions). Absolutely everybody involved really wanted to get the feature in and was willing to listen to others and change their stance based on the discussion. The final API turned out to be better than any of the individual proposals.

Thanks to contributors

According to Github statistics, a total of 717 different people have at least one commit in the repository. This number does not cover all the people who have contributed in other ways, like docs, bug triaging, converting existing projects and so on. It is customary to thank people who have made major contributions, like new features, in milestone posts like these.

I'm going to do something different instead. In addition to "the big stuff", any project has a ton of less glamorous work like bug fixing, refactoring, code cleanups and the like. These tasks are just as important as the glitzy ones, but they sadly go underappreciated in many organisations. To curb this trend I'd like to pick three people to thank for the simple reason that when it turned out that sh*t needed to be done, they rolled up their sleeves and did it. Over and over again.

The first is Dylan Baker, who has done major reorganisation work in the code, including adding a lot of type hints and fixing the myriad of bugs said type hints uncovered.

The second person is Eli Schwartz, who has done a ton of work all around, commented on many bug reports and on the Matrix channel. In fact he has done so much stuff that I suspect he never sleeps.

And finally we have Rosen Penev, who has done an astounding amount of work on WrapDB, both fixing existing wraps as well as updating them to new releases.

And finally: a secret

Meson gets a lot of bug reports. A lot a lot. Nirbheek Chauhan, one of the co-maintainers, once told me that Meson generates more bug email than all Gnome projects combined. I try my best to keep up with them, but the sad truth is that I don't have time to read most of them. Upon every release I have to clean up my mailbox by deleting all Meson bug mail.

The last time I did this I nuked more than 500 email threads in one go. No, not emails, email threads. So if you have wondered why your bug report has not gotten any replies, this is why. Simply reading the contents of Meson bug emails would be more than a full time job. Such is the price of success, I guess.

Monday, December 12, 2022

Print quality PDF generation, color separations, other fun stuff

Going from the simple color managed PDF generator discussed in the previous blog post to something more useful requires getting practical. So here is a screenshot of a "print ready" PDF document I generated with the code, showing a typical cover layout for a softcover book. As printers can't really print all the way to the edges of paper, the cover needs to be printed on a larger sheet and then cut to its final size.

It's not of artistically high quality, granted, but most of the technical bits are there:

  • The printed page is noticeably bigger than the "active area" and has a bunch of magic squiggles needed by the printing house
  • The output is fully color managed CMYK
  • The gray box represents the bleed area and in a real document the cover image would extend over it, but I left it like this for visualization purposes.
  • Text can be rendered and rotated (see spine)
  • The book title is rendered with gold ink, not CMYK inks
  • There are colorbars for quality control
  • The registration and cut marks (the "bullseyes" and straight lines at paper corners, respectively) are drawn on all plates using PDF's builtin functionality so they are guaranteed to be correctly aligned
  • None of the prepress marks are guaranteed to be actually correct; I just swiped them from various sources
The full PDF can be downloaded from this link. From this print PDF, we can generate separations (or "printing plates") for individual ink components using Ghostscript.
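
One way to do this is Ghostscript's tiffsep output device, which writes one grayscale TIFF per colorant (including spot inks like the gold) plus a composite preview. Invoked from Python it would look something like this; the file names are illustrative:

import subprocess

subprocess.run([
    "gs", "-dBATCH", "-dNOPAUSE",
    "-sDEVICE=tiffsep",    # one grayscale file per ink plus a composite
    "-o", "plate_%d.tif",  # Ghostscript fills in the page number
    "cover.pdf",
], check=True)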

Looking at this you can find several interesting things. For example, the gray box showing the bleed area is composed of C, M and Y inks instead of only K, even though it was originally defined as a pure gray in RGB. This is how LittleCMS chose to convert it, and it might or might not be what the original artist had in mind. High quality PDF generation is full of little quirks like this; blindly throwing numbers at color conversion functions is not enough to get good results, and end users might need fairly precise control over low level operations.

Another thing to note is how the renderer has left "holes" for the book title in CMYK plates even though all color is in the gold ink plate. This avoids mixing inks but on the other hand requires someone to do proper trapping. That is its own can of worms, but fortunately most people can let the RIP handle it (I think).