Friday, June 4, 2021

Formatting an entire book with LibreOffice, what's it like?

I have created full books using both LaTeX and Scribus. I have never done it with LibreOffice, though. The closest I've ever come was watching people write their masters' theses in Word, failing miserably and swearing profusely. To find out what it's really like, I chose to typeset an entire book from scratch using nothing else but LibreOffice.

The unfortunate thing about LibreOffice (which it inherits from MS Word compatibility) is that there is a "correct" way to use it which is the exact opposite way of how people instinctively want to use it. The proper way is to use styles for everything as opposed to applying fonts, sizes, spacing et al by hand. In practice every time I am given a Word document I check how it has been formatted. I have never seen a single document in the wild that would have gotten this right. Even something as simple as chapter indentation is almost always done with spaces.

Getting the text

Rather than using lorem ipsum or writing a book from scratch, I grabbed an existing book from Project Gutenberg. A random choice landed upon Gulliver's Travels, which turned out to be fortunate as it has several interesting and uncommon typographical choices. The source data of Project Gutenberg is UTF-8 text. All formatting has to be added afterwards. Here's what the first page ended up looking like after a few evenings' worth of work.

The source text file is line wrapped to 80 characters and chapters are separated by two or more newlines. This does not really work with LO, so the first step is to preprocess the text with a Python script so that every chapter of text is on its own (very long) line and then the text can be imported to LO. After import each sections must be assigned a proper style. The simplest approach is to select all text, apply the Text Body style and then manually seek all chapter headings and set them to Heading 1. That takes care of the formatting needs of ~95% of the text (though the remaining 5% take 10x more work).

Page layout

The original book's dimensions are not provided, so I took a random softcover book [1] from my shelf, measured it with a ruler and replicated the page settings. The book is set in the traditional style where everything up to the actual text has page numbers in roman numerals whereas the actual text uses arabic numerals. Setting it up was straightforward, though I had to create six different page styles to get the desired result.

Text layout challenges

Gulliver's Travels is a bit unusual in that every chapter begins with a small introductory text explaining what will happen in the chapter. Apparently readers in the 1720s were not afraid of spoilers.

In the Project Gutenberg source text these sections (and many others) were written in all capital letters. However it is likely that in the original book they were instead written in small caps. Fixing this would require retyping the text to be in lower case. Fortunately LO has an option in the format menu to convert text to lower case, which makes this operation fairly painless.

Another unusual thing is that the book does not have a regular table of contents, instead it duplicates these small text chapters.

LO has a builtin TOC generator but it can't handle this (I think) so the layout has to be recreated by hand with tables and manual cross reference fields. Controlling page breaks and the like is difficult and I could not make it work perfectly. The above picture has two bugs, the illustration cross reference should be in roman numerals (as it is on a preface page) but LO insists on formatting it using an arabic number. The last chapter on the left page gets split up and the page number is on the left page, whereas it should be on the right (bottom aligned). Even better would be if the chapter heading and text could be defined to always stick together and not be split over pages. There is a setting for this, but it does not seem to work inside tables

Pictures

There are several illustrations in the book and scans of the pictures were also provided by Project Gutenberg. Adding them in the document revealed that figure handling is clearly LO's weakest point (again, presumably because it inherits its model from Word). It seems that in this model each figure has an anchor point in the text and you can align the figure relative to that but the image must be on the same page as the anchor. Were it to go on the next page, LO adds a page break so that the two go to the same page. This leaves a potentially large empty space at the end of the previous page, which looks just plain weird.

In contrast this is something that LaTeX does exceptionally well with its floating figures. Basically it tries to add the figure on the current page and if it will not fit, it puts it on the next page. There does not seem to be a way to get this behaviour in LO. Or at least I could not find one, googling did not help and neither did asking for help on the lazyweb. Playing with images was also the only time I managed to crash LO, so be careful; save early, save often.

The only reasonably working solution seems to be page aligned images. This works but means that if text is edited, figures do not move along with the changes and get disconnected from their source locations. Thus image aligning must be the very last thing to be done. This approach also does not work if you are using master documents. Books with many images should probably be typeset with Scribus instead, especially if proper color management is required.

In conclusion

If you are very disciplined and use LO exactly as it should be used, the end result is actually really nice. You can, for example, change the font used for text in only one place (the base style) and the entire document gets fully reformatted, reflown and repaged in less than a second. This allows you to do invasive layout tests easily, such as finding out how much more space IBM Plex Serif takes when compared to Nimbus Roman [2]. The downside is that any cut corners will cause broken output that you can't find without manually inspecting the entire document.

IKEA effect notwithstanding laying out the text in proper form makes it a lot more enticing. The process of shaping raw text to form really makes it come alive in a sense. It would be nice if Project Gutenberg (or anyone else, really) provided properly formatted versions of their books (and in fact, some already are) because presentation really makes a difference for readability. Plain text and unformatted HTML is unfortunately quite drab to read.

[1] The Finnish edition of the first book in the Illuminatus trilogy, for the curious among you.

[2] Approximately 380 pages compared to 340.

Thursday, May 27, 2021

Managing dependencies with Meson + WrapDB

A recent blog post talked about how to build and manage dependencies with CMake and FetchContent. The example that they used was a simple GUI application using the SFML multimedia libraries and the Dear ImGui widget toolkit using the corresponding wrapper library. For comparison let's do the same with Meson.

The basics

The tl/dr version is that all you need is to install Meson, check out this repository and start the build. Meson will automatically download all dependencies as source, build them from scratch and create a single statically linked final executable. SFML's source archive contains a bunch of prebuilt dependencies for OpenAL and the like. Those are not used, everything is compiled from original sources. A total of 11 subprojects are used including things like libpng, Freetype and Flac.

They are downloaded from Meson's WrapDB dependency provider/package manager service. which combines upstream tarballs with user submitted Meson build definitions. The only exception is stb. It has no releases, instead it is expected to be used directly from Git. As WrapDB only provides actual releases, this dependency needs to be checked out from a custom Git repo. This is entirely transparent to the user, the only change is that the wrap file specifying where the dependency comes from points to a different place.

If you actually try to compile the code you might face some issues. It has only been properly tested on Windows. It will probably work on Linux and most definitely won't work on macOS. At the time of writing GNU's web mirror has an expired certificate so downloading Freetype's release tarball will fail. You can work around it by downloading it by hand and placing it in the subprojects/packagecache directory. The build of SFML might also fail as the code uses auto_ptr, which has been removed from some stdlibs. This has been fixed in master (but not in the way you might have expected) but the fix has not made it to a release yet.

What does it look like?

I would have added an inline image here, but for some reason Blogger's image uploader is broken and just fails (on two different OSs even). So here's an imgur link instead.

This picture shows the app running. To the left you can also see all the dependencies that were loaded during the build. It also tells you why people should do proper release tarballs rather than relying on Github's autogenerated ones. Since every project's files are named v1.0.0.zip, the risk of name collision is high.

What's the difference between the two?

CMake has a single flat project space (or at least that is how it is being used here) which is used like this:

FetchContent_Declare(
  sfml
  URL https://github.com/SFML/SFML/archive/refs/tags/2.5.1.zip
  URL_MD5 2c4438b3e5b2d81a6e626ecf72bf75be
)
add_subdirectory(sfml)

I.e. "download the given file, extract it in the current directory (whatever it may be) and enter it as if it was our own code". This is "easy" but problematic in that the subproject may change its parent project in interesting ways that usually lead to debugging and hair pulling.

In Meson every subproject is compiled in its own isolated sandbox. They can only communicate in specific, well defined and secured channels. This makes it easy to generate projects that can be built from source on Windows/macOS/Android/etc and which use system dependencies on Linux transparently. This equals less hassle for everyone involved.

There are other advantages as well. Meson provides a builtin option for determining whether a project should build its libraries shared or static. This option can be set on the command line per subproject. The sample application project is set up to build everything statically for convenience. However one of the dependencies, OpenAL, is LGPL, so for final distributions you'll probably need to build it as a shared library. This can be achieved with the following command:

meson configure -Dopenal-soft:default_library=shared

After this only the OpenAL dependency is built as a shared library whereas everything else is still static. As this is a builtin, no project needs to write their own options, flags and settings to select whether to build shared or static libraries. Better yet, no end user has to hunt around to discover whether the option to change is FOO_BUILD_SHARED, FOO_ENABLE_SHARED, FOO_SHARED_LIBS, SHARED_FOO, or something else.

Tuesday, May 18, 2021

Why all open source maintainers are jerks, the Drake equation hypothesis

Preface

This blog post is meant to be humorous. It is not a serious piece of scientifically rigorous research. In particular it is not aiming to justify bad behaviour or toxicity in any way, shape or form. Neither does it claim that this mechanism is the only source of negativity. If you think it is doing any of these things, then you are probably taking the whole thing too seriously and are reading into it meanings and implications that are not there. If it helps, you can think of the whole thing as part of a stand-up comedy routine.

Why is everybody a jerk?

It seems common knowledge that maintainers of major open source projects are rude. You have your linuses, lennarts, ulrichs, robs and so on. Why is that? What is it about project maintenance that brings out these supposed toxics? Why can't projects be manned by nice people? Surely that would be better.

Let's examine this mathematically. For that we need some definitions. For each project we have the following variables:

  • N, the total number of users
  • f_p, the fraction of users who would like to change the program to better match their needs
  • f_r, the fraction of users with a problem who file a change request
  • f_rej, the fraction of submitted change requests that get rejected
  • f_i, the fraction of of people who consider rejections as attacks against their person
  • f_c, the fraction of people who complain about their perceived injustices on public forums
  • f_j, the fraction of complainers formulating their complaints as malice on the maintainer's part

Thus we have the following formula for the amount of "maintainer X is a jerk" comments on the net:

J = N * f_p * f_r * f_rej * f_i * f_c * f_j

Once J reaches some threshold, the whisper network takes over and the fact that someone is a jerk becomes "common knowledge".

The only two variables that a maintainer can reasonably control are N and f_rej (note that f_i can't ever be brought to zero, because some people are incredibly good at taking everything personally) Perceived jerkness can thus only be avoided either by having no users or accepting every single change request ever filed. Neither of these is a good long term strategy.

Thursday, May 13, 2021

.C as a file extension for C++ is not portable

Some projects use .C as a file extension for C++ source code. This is ill-advised, because it is can't really be made to work automatically and reliably. Suppose we have a file source.C with the following contents:

class Foo {
public:
    int x;
};

Let's compile this with the default compiler on Linux:

$ cc -c -o /dev/null source.C

Note that that command is using the C compiler, not the C++ one. Still, the compiler will autodetect the type from the extension and compile it as C++. Now let's do the same thing using Visual Studio:

$ cl /nologo /c source.C
source.C(1): Error C2061 Syntax error: Identifier 'Foo'
<a bunch of other errors>

In this case Visual Studio has chosen to compile it as plain C. The defaults between these two compilers are the opposite and that leads to problems.

How to fix this?

The best solution is to change the file extension to an unambiguous one. The following is a simple ZSH snippet that does the renaming part:

for i in **/*.C; do git mv ${i} ${i:r}.cpp; done

Then you need to do the equivalent in your build files with search-and-replace.

If that is not possible, you need to use the /TP compiler switch with Visual Studio to make it compile the source as C++ rather than C. Note that if you use this argument on a target, then all files are built as C++, even the plain C ones. This is unreliable and can lead to weird bugs. Thus you should rename the files instead.

Wednesday, May 5, 2021

Is the space increase caused by static linking a problem?

Most recent programming languages want to link all of their dependencies statically rather than using shared libraries. This has many implications, but for now we'll only focus on one: executable size. It is generally accepted that executables created in this way are bigger than when static linking. The question is how much and whether it even mattesr. Proponents of static linking say the increase is irrelevant given current computers and gigabit networks. Opponents are of the, well, opposite opinion. Unfortunately there is very little real world measurements around for this.

Instead of arguing about hypotheticals, let's try to find some actual facts. Can we find a case where, within the last year or so, a major proponent of static linking has voluntarily switched to shared linking due to issues such as bandwidth savings. If such a case can be found, then it would indicate that, yes, the binary size increase caused by static linking is a real issue.

Android WebView, Chrome and the Trichrome library

Last year (?)  Android changed the way they provide both the Chrome browser and the System WebView app [1]. Originally both of them were fully isolated, but after the change both of them had a dependency on a new library called Trichrome, which is basically just a single shared library. According to news sites, the reasoning was this:

"Chrome is no longer used as a WebView implementation in Q+. We've moved to a new model for sharing common code between Chrome and WebView (called "Trichrome") which gives the same benefits of reduced download and install size while having fewer weird special cases and bugs."

Google has, for a long time, favored static linking. Yet, in this case, they have chosen to switch from static linking to shared linking on their flagship application on their flagship operating system. More importantly their reasons seem to be purely technical. This would indicate that shared linking does provide real world benefits compared to static linking.

[1] I originally became aware of this issue since this change broke updates on both of these apps and I had to fix it manually with apkmirror.

Tuesday, May 4, 2021

"Should we break the ABI" is the wrong question

The ongoing battle on breaking C++'s ABI seems to be gathering steam again. In a nutshell there are two sets of people in this debate. The first set wants to break ABI to gain performance and get rid of bugs, whereas the second set of people want to preserve the ABI to keep old programs working. Both sides have dug their heels in the ground and refuse to budge.

However debating whether the ABI should be broken or not is not really the issue. A more productive question would be "if we do the break, how do we keep both the old and new systems working at the same time during a transition period". That is the real issue. If you can create a good solution to this problem, then the original problem goes away because both sides get what they want. In other words, what you want to achieve is to be able to run a command like this:

prog_using_old_abi | prog_using_new_abi

and have it work transparently and reliably. It turns out that this is already possible. In fact many (most?) people are reading this blog post on a computer that already does exactly that.

Supporting multiple ABIs at the same time

On Debian-based systems this is called multi-arch support. It allows you to, for example, run 32 and 64 bit apps transparently on the same machine at the same time. IIRC it was even possible to upgrade a 32 bit OS install to 64 bits by first enabling the 64 bit arch and then disabling the 32 bit one. The basic gist of multiarch is that rather than installing libraries to /usr/lib, they get installed to something like /usr/lib/x86_64. The kernel and userspace tools know how to handle these two different binary types based on the metadata in ELF executables.

Given this we could define an entirely new processor type, let's call it x86_65, and add that as a new architecture. Since there is no existing code we can do arbitrary changes to the ABI and nothing breaks. Once that is done we can create the full toolchain, rebuild all OS packages with the new toolchain and install them. The old libraries remain and can be used to run all the old programs that could not be recompiled (for whatever reason). 

Eventually the old version can be thrown away. Things like old Steam games could still be run via something like Flatpak. Major corporations that have old programs they don't want to touch are the outstanding problem case. This just means that Red Hat and Suse can sell them extra-long term support for the old ABI userland + toolchain for an extra expensive price. This way those organizations who don't want to get with the times are the ones who end up paying for the stability and in return distro vendors get more money. This is good.

Simpler ABI tagging

Defining a whole virtual CPU just for this seems a bit heavy handed and would probably encounter a lot of resistance. It would be a lot smoother if there were a simpler way to version this ABI change. It turns out that there is. If you read the ELF specification, you will find that it has two fields for ABI versioning (and Wikipedia claims that the first one is usually left at zero). Using these fields the multiarch specification could be expanded to be a "multi-abi" spec. It would require a fair bit of work in various system components like the linker, RPM, Apt and the like to ensure the two different ABIs are never loaded in the same process. As an bonus you could do ABI breaking changes to libc at the same time (such as redefining intmax_t) There does not seem to be any immediate showstoppers, though, and the basic feasibility has already been proven by multiarch.

Obligatory Rust comparison

Rust does not have a stable ABI, in fact it is the exact opposite. Any compiler version is allowed to break the ABI in any way it wants to. This has been a constant source of friction for a long time and many people have tried to make Rust commit to some sort of a stable ABI. No-one has been successful. It is unlikely they will commit to distro-level ABI stability in the foreseeable future, possibly ever.

However if we could agree to steady and continuous ABI transitions like these every few years then that is something that they might agree to. If this happens then the end result would be beneficial to everyone involved. Paradoxically it would also mean that by having a well established incremental way to break ABI would lead to more ABI stability overall.

Sunday, April 25, 2021

The joys of creating Xcode project files

Meson's Xcode backend was originally written in 2014 or so and was never complete or even sufficiently good for daily use. The main reason for this was that at the time I got extremely frustrated with Xcode's file format and had to stop because continuing with would have lead to a burnout. It's just that unpleasant. In fact, when I was working on it originally I spontaneously said the following sentence out loud:

You know, I really wish this file was XML instead.

I did not think these words could be spoken with a straight face. But they were.

The Xcode project file is not really even a "build file format" in the sense that it would be a high level description of the build setup that a human could read, understand and modify. Instead it seems that Xcode has an internal enterprise-quality object model and the project file is just this data structure serialised out to disk in a sort-of-like-json-designed-in-1990 syntax. This format is called a plist or a property list. Apparently there is even an XML version of the format, but fortunately I've never seen one of those. Plists are at least plain text so you can read and write them, but sufficiently like a binary that you can't meaningfully diff them which must make revision control conflict resolution a joy.

The semantics of Xcode project files are not documented. The only way to really work with them is to define simple projects either with Xcode itself or with CMake, read the generated project file and try to reverse-engineer its contents from those. If you get it wrong Xcode prints a useless error message. The best you can hope for is that it prints the line number where the error occurred. Often it does not.

The essentials

An Xcode project file is essentially a single object dictionary inside a wrapper dictionary. The keys are unique IDs and the values are dictionaries, meaning the whole thing is a string–variant wrapper dictionary containing a string–variant dictionary of string–variant dictionaries. There is no static typing, each object contains an isa key which has a string telling what kind of an object it is. Everything in Xcode is defined by building these objects and then by referring to other objects via their unique ids (except when it isn't, more about this later).

Since everything has a unique ID, a reasonable expectation would be that a) it would be unique per object and b) you would use that ID to refer to the target. Neither of these are true. For example let's define a single build target, which in Xcode/CMake parlance is a PBXNativeTarget. Since plist has a native array type, the target's sources could be listed in an array of unique ids of the files. Instead the array contains a reference to a PBXSourcesBuildPhase object that has a few keys which are useless and always the same and an array of unique ids. Those do not point to file objects, as you might expect, but to a PBXBuildFile object, whose contents look like this:

24AA497CCE8B491FB93D4C76 = {
  isa = PBXBuildFile;
  fileRef = 58CFC111B9B64310B946BCE7 /* /path/to/file */;
};

There is no other data in this object. Its only reason for existing, as far as I can tell, is to point to the actual PBXFileReference object which contains the filesystem path. Thus each file actually gets two unique ids which can't be used interchangeably. But why stop at two?

In order to make the file appear in Xcode's GUI it needs more unique ids. One for the tree widget and another one it points to. Even this is not enough, though, because if the file is used in two different targets, you can not reuse the same unique ids (you can in the build definition, but not in the GUI definition just to make things more confusing). The end result being that if you have one source file that is used in two different build targets, then it gets at least four different unique id numbers.

Quoting

In some contexts Xcode does not use uids but filenames. In addition to build targets Xcode also provides PBXAggregateTargets, which are used to run custom build steps like code generation. The steps are defined in a PBXShellScriptBuildPhase object whose output array definition looks like this:

outputPaths = (
    /path/to/output/file,
);

Yep, those are good old filesystem paths. Even better, they are defined as an actual honest to $deity array rather than a space separated string. This is awesome! Surely it means that Xcode will properly quote all these file names when it invokes external commands.

Lol, no!

If your file names have special characters in them (like, say, all of Meson's test suite does by design) then you get to quote them manually. Simply adding double quotes is not enough, since they are swallowed by the plist parser. Instead you need to add additional quoted quote characters like this: "\"foo bar\"".  Seems simple, but what if you need to pass a backslash through the system, like if you want to define some string as "foo\bar"? The common wisdom is "don't do that" but this is a luxury we don't have, because people will expect it to work, do it anyway and report bugs when it fails.

To cut a long and frustrating debugging session short, the solution is that you need to replace every backslash with eight backslashes and then it will work. This implies that the string is interpreted by a shell-like thing three times. I could decipher where two of them occur but the third one remains a mystery. Any other number of backslashes does not work and only leads to incomprehensible error messages.

Quadratic slowdowns for great enjoyment

Fast iterations are one of the main ingredients of an enjoyable development experience. Unfortunately Xcode does not provide that for this use case. It is actually agonizingly slow. The basic Meson test suite consists of around 240 sample projects. When using the Ninja backend it takes 10 minutes to configure, build and run the tests for all of them. The Xcode backend takes 24 minutes to do the same. Looking at the console when xcodebuild starts it first reads its input files, then prints "planning build", pauses for a while and then starts working. This freeze seems to take longer than Meson took configuring and generating the project file. Xcodebuild lags even for projects that have only one build target and one source file. It makes you wonder what it is doing and how it is even possible to spend almost a full second planning the build of one file. It also makes you wonder how a pure Python program written by random people in their spare time outperforms the flagship development application created by the richest corporation in the world.

Granted, this is a very uncommon case as very few people need to regenerate their project files all the time. Still, this slowness makes developing an Xcode backend a tedious job. Every time you add functionality to fix one test case, you have to rerun the entire test suite. This means that the further along you get, the slower development becomes. By the end I was spending more time on Youtube waiting for tests to finish than I did writing code.

Final wild speculation

Xcode has a version compatibility selection box that looks like this:

This is extremely un-Apple-like. Backwards compatibility is not a thing Apple has ever really done. Typically you get support for the current version and, if you are really lucky, the previous one. Yet Xcode has support for versions going all the way back to 2008. In fact, this might be the product with the longest backwards compatibility story ever provided by Apple. Why is this?

We don't really know. However being the maintainer of a build system means that sometimes people tell you things. Those things may be false or fabricated, of course, so the following is just speculation. I certainly have no proof for any of it. Anyhow it seems that a lot the fundamental code that is used to build macOS exists only in Xcode projects created ages ago. The authors of said projects have since left the company and nobody touches those projects for fear of breaking things. If true this would indicate that the Xcode development team has to keep those old versions working. No matter what.