Thursday, March 26, 2020

It's not what programming languages do, it's what they shepherd you to

How many of you have listened to, read, or taken part in a discussion about programming languages that goes like the following:

Person A: "Programming language X is bad, code written in it is unreadable and horrible."

Person B: "No it's not. You can write good code in X, you just have to be disciplined."

Person A: "It does not work, if you look at existing code it is all awful."

Person B: "No! Wrong! Those are just people doing it badly. You can write readable code just fine."

After this the discussion repeats from the beginning until either one gets fed up and just leaves.

I'm guessing more than 99% of you have seen this, often multiple times. The sad part is that even though this happens all the time, nobody learns anything and the discussion soon begins anew. Let's see if we can do something about that. A good way to start is to come up with a name and a description for the underlying issue.
Shepherding: an invisible property of a programming language and its ecosystem that drives people into solving problems in ways that are natural for the programming language itself rather than in ways that are considered "better" in some sense, such as long term maintainability, readability and performance.
This is a bit abstract, so let's look at some examples.

Perl shepherds you into using regexps

Perl has several XML parsers available and they are presumably good at their jobs (I have never actually used one so I wouldn't know). Yet, in practice, many Perl scripts do XML (and HTML) manipulation with regexes, which is brittle and "wrong" for lack of a better term. This is a clear case of shepherding. Text manipulation in Perl is easy. Importing, calling and using an XML parser is not. And really all you need to do is to change that one string to a different string. It's tempting. It works. Surely it could not fail. Let's just do it and get on with other stuff. Boom, just like that you have been shepherded.

Note that there is nothing about Perl that forces you to do this. It provides all the tools needed to do the right thing. And yet people don't, because they are being shepherded (unconsciously) into doing the thing that is easy and fast in Perl.

Make shepherds you into embedding shell pipelines in Makefiles

Compiling code with Make is tolerable, but it fails quite badly when you need to generate source code, data files and the like. The sustainable solution would be to write a standalone program in a proper scripting language that has all the code logic needed and call that from Make with your inputs and outputs. This rarely happens. Instead people think "I know, I have an entire Unix userland available [1], I can just string together random text mangling tools in a pipeline, write it here and be done". This is how unmaintainability is born.

Nothing about Make forces people to behave like this. Make shepherds people into doing this. It is the easy, natural outcome when faced with the given problem.

Other examples

  • C shepherds you into manipulating data via pointers rather than value objects.
  • C++ shepherds you into providing dependencies as header-only libraries.
  • Java does not shepherd you into using classes and objects, it pretty much mandates them.
  • Turing complete configuration languages shepherd you into writing complex logic with them, even though they are usually not particularly good programming environments.
[1] Which you don't have on Windows. Not to mention that every Unix has slightly different command line arguments and semantics for basic commands, meaning shell pipelines are not actually portable.

Wednesday, March 11, 2020

The character that swallowed a week

In the last few posts we have looked at compiling LibreOffice from scratch using Meson. Contrary to what one might expect, it was not particularly difficult, just laborious. The codegen bits (yes, there are several) required some deciphering, but other than that it was fairly straightforward. Unfortunately just compiling the source code is not sufficient, as usually one also wants to run the result. Further, you'd want the new binaries to behave in the same way as the old ones. This is where things get interesting.

Trying to run the main LibreOffice application is not particularly useful because it will almost certainly not work. Fortunately LO provides a bunch of sample and test applications one can use. One of these is a demo app that starts, initialises the LO runtime and opens up a GUI window. Perfect. After a build (which takes an hour) and an install, the binary is ready to run and produces a … segfault. More annoyingly, it segfaults without any useful information that would aid in debugging.

Time to take out strace. Several attempts later we discover that LO tries to be dynamic and smart. LO consists of >150 shared libraries. Rather than link all of them together, what seems to be happening is that the code tries to determine where the main executable is, then looks up shared libraries relative to that binary, dlopens them and then goes on its merry way. If this fails it crashes somewhere, for some reason I chose not to dig into. This debugging brought up an interesting discovery about naming. One of the libraries is called SAL, so naturally you would expect the actual file to be called libsal.so. This is also what the Makefile defining the library calls it. But that is not its actual name. Instead it is libsallo.so. Somewhere within the bowels of the 150 000 lines of Make that make (ha) up the system, some (but not all) library names get an lo appended to them. There does not seem to be any logic to which ones, but fine, they can at least be fixed with manual work.
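
As an illustration, below is a minimal sketch of that kind of lookup, assuming Linux. The helper name is made up and this is not LO's actual code, just the general pattern.

#include <dlfcn.h>
#include <unistd.h>
#include <limits.h>
#include <stdexcept>
#include <string>

// Hypothetical helper: locate the running executable, then dlopen a
// library that sits in the same directory.
void* load_library_next_to_exe(const std::string& libname) {
    char exe_path[PATH_MAX] = {};
    // Linux-specific: /proc/self/exe points at the current executable.
    ssize_t len = readlink("/proc/self/exe", exe_path, sizeof(exe_path) - 1);
    if (len < 0)
        throw std::runtime_error("Could not locate the running executable.");
    std::string dir(exe_path, static_cast<size_t>(len));
    dir.erase(dir.rfind('/') + 1); // keep only the directory part
    // Note the suffix: the SAL library is installed as libsallo.so, not libsal.so.
    void* handle = dlopen((dir + libname).c_str(), RTLD_NOW | RTLD_GLOBAL);
    if (!handle)
        throw std::runtime_error(dlerror());
    return handle;
}

If the library is not where this heuristic expects it to be, or is named differently than expected, dlopen fails and the program falls over.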

After that the program failed trying to open some data files. This was easily fixed by swiping all the binary files from an existing working installation. Then it failed with a weird assert that seemed to indicate that the runtime object system had not been properly initialised. LO is built around a custom object model called UNO, or Universal Network Objects. As you can tell from the name, it comes from the 90s and has the following properties:
  • It provides an object model very close to the one in Java
  • Absolutely everything is done with inheritance, and all functionality is at least three namespaces deep
  • Objects are defined with a custom language that is compiled to C++ headers using a self built tool (I'm told this generates roughly a million lines of C++ headers in total)
  • All class types and objects are constructed at runtime
  • The runtime is written in C++, but the actual implementation is all void pointers and reinterpret casts (see the sketch below)
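
To give a flavour of that last point, here is a hypothetical sketch of the general style, with made-up names rather than real UNO code: values travel as a void pointer plus a runtime type name, and the receiving end casts them back based on that name.

#include <string>

// Made-up illustration of a type-erased value, not actual UNO code.
struct AnyLike {
    std::string type_name; // runtime type tag
    void* data;            // type-erased payload
};

int extract_int(const AnyLike& a) {
    if (a.type_name != "long")  // the caller has to trust the tag...
        return 0;               // ...a real implementation would signal an error
    return *reinterpret_cast<int*>(a.data); // ...and cast the payload back
}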

Break all the tools!

At this point the easy avenues had been explored and it was time to bring out gdb. While editing build definition files is something you can do with a simple editor, debugging really requires an IDE. The command line interface, even the curses one, is cumbersome compared to a full-fledged GUI with mouse hovers for variables and so on. Fortunately this is a simple task: create a new Eclipse project, build (taking an hour again), install, run and debug. It worked really nicely. For about a day.

Eventually Eclipse's code indexer crashed while doing a full reindexing operation. When you restart Eclipse after such a crash it restarts the indexing operation and promptly crashes again. The eventual solution was to manually delete all code that was not absolutely necessary for the test app, such as Writer, Calc and the rest. This made the working set small enough that it no longer crashed. And no, increasing the amount of memory allocated to Eclipse's JVM does not fix the issue; it crashes even when given 8 gigs of memory. With this done the sequence of steps leading to the crash could be debugged:

  1. The runtime tries to initialise itself, fails and throws its own custom exception type.
  2. The exception is caught, ignored and discarded, followed by a call to custom error handling code.
  3. That code retrieves the exception via other means and then tries to introspect the object for further information.
  4. Introspecting the error object for the reason why runtime initialisation failed requires the runtime to be initialised and working. As it is not, things crash.
Interestingly if you change the code at step 2 to just grab the exception and print its error message, it works correctly. Hooray for frameworks, I guess.
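
The following is a hypothetical reconstruction of that chicken-and-egg pattern with made-up names; it is not the actual LO code, but it shows the shape of both the problem and the workaround.

#include <stdexcept>

// Made-up stand-ins for the real machinery.
struct RuntimeInitFailure : std::runtime_error {
    using std::runtime_error::runtime_error;
};

bool runtime_ready = false;

void initialize_runtime() {
    // Step 1: initialisation fails and throws a custom exception type.
    throw RuntimeInitFailure("runtime data files not found");
}

void introspect_last_error() {
    // Steps 3 and 4: the error handler refetches the error and introspects it,
    // which needs the very runtime that just failed to come up.
    if (!runtime_ready)
        throw std::logic_error("introspection before runtime init"); // the real code crashes here
}

int main() {
    try {
        initialize_runtime();
    } catch (const RuntimeInitFailure& e) {
        // Step 2: the caught exception is thrown away and custom error
        // handling is called instead, which leads to the crash above.
        introspect_last_error();
        // The workaround: print e.what() here and skip the introspection machinery.
        (void)e;
    }
}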

After some fixes the issue was gone but a new one took its place. Somehow the code got into a loop where function A called B, which called C and so on for about 20 functions until something called back to function A. This led to an infinite loop and stack exhaustion.

At this point we have two supposedly identical programs: one built with gbuild and one built with Meson. They should behave identically but don't. This gives us an approach to solving the problem: single step both programs in a debugger until their execution differs. When that happens we know where the bug is. In practice this is less simple. Because the code uses a ton of OO helper classes and templates, even a single function call can lead to many stack frames that you have to step through manually. These helpers go away when optimisations are enabled, but in this particular case they can't be used because they make debugging quite difficult. Even -Og changes the code too much to be useful. So no optimisation it is, and every time you change the optimisation level to test it takes, you guessed it, an hour to build (or more if you try -O2).

Manually single stepping through code is the sort of tedious manual labor that humans are bad at. Fortunately gdb has a Python scripting interface, so one can run a script in each gdb instance that connects to a common server; the server orders both debuggers to single step, they report their current PC locations, and everything halts when the locations differ. This worked perfectly until the programs called into libc. For some reason the same calls into libc (specifically getenv) took a different number of steps to finish, so the programs desynched almost immediately. Fixing that seemed like too much work, so the idea was scrapped.

Manually single stepping through code is difficult because if you accidentally go too far, you have to start over again. Fortunately we now have rr, which allows you to step backwards in code execution as well. Recording a trace of one of the programs worked. The other program worked as well. Running both of them at the same time failed miserably for reasons that remained unclear.

Debuglander ][: The Sloggening

Nevertheless, at this point I had two debugging aids: an rr trace of what should happen and a live debugger showing what was actually happening. Now it was just a matter of finding out where their execution diverged. After about two days of debugger stepping I gave up; doing this sort of work by hand is just not something the human brain does very well (or at least mine doesn't). It was back to the old straw-grasping-at board.

Like most big frameworks, UNO has its own code that does special compiler magic. The Makefile lists several flags that should not be used to compile said code. I tried adding those to the Meson version. It did not help. There is also some assembly code that fiddles with registers. Maybe that was the problem? Nope again. Maybe one of the libraries is missing some preprocessor define that causes bad compilations? No. Everything seemed to be in order and doing the exact same thing as the original build system did.

LO does provide unit tests for some functionality. I had not built them with Meson yet, so I figured I'd convert some of those just to get something done. Eventually I converted one test that just exercises an Any object and it crashed in the exact same way. Now I had a reproducer for the bug with 10x to 100x less code. Once more unto the debuggers, dear friends, once more!

Eventually the desync point was found. One execution went to line 16 of a header file and the other went to the same header, but to line 61. It was getting fairly late, so simply noticing the difference between 16 and 61 took a while. Now we had the smoking gun, but there was one more twist to this story: the header file did not have a line 61; it only had about 30 lines of text.

Nevertheless the debugger reported line 61. It even allowed one to inspect variables and do all the things you would not expect to be able to do when your process execution is on a nonexistent line. What was happening? Was the debug info in the ELF files corrupted in some way? And then it finally hit me.

LibreOffice generates two different sets of headers for the same IDL files: regular and comprehensive (there is also a third bootstrap one, but let's ignore that). These headers have the same names and the same supposed behaviour but different implementations. You can #include either in your code and it will compile and sort-of-work. I don't know why the original developers decided to tempt the ODR violation gods in this way, and nobody on the LO IRC channel seemed to know either, but finally the core issue had been found. When reverse engineering the Makefiles I had picked up the code generation step that creates the regular headers rather than the comprehensive ones the original build uses.
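
As an illustration of why this is so dangerous, consider two headers with the same name but different contents. The names below are made up; the real files are generated UNO headers.

// regular/Thing.hxx (the "regular" flavour)
struct Thing {
    int value;
    int get() const { return value; }
};

// comprehensive/Thing.hxx (same header name, richer contents)
struct Thing {
    int value;
    int extra;                                // different size and layout
    int get() const { return value + extra; }
};

One library ends up compiled against the first layout and another against the second. Everything compiles and links without complaint, but any Thing passed between the two sides is read with the wrong layout, which is exactly the sort of undefined behaviour that surfaces as a debugger pointing at a line that does not exist.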

The fix consisted of changing one of the code generator's command line arguments from -l to -C, recompiling (which took an hour again) and running the binary. Finally, one week's worth of debugging work was rewarded with this:


A blank window has never looked sweeter.

Final commercial announcement

If you enjoyed this post and would like to know more about the Meson build system, consider buying my ebook.

Monday, March 2, 2020

Meson manual sales numbers and a profitability estimate

The Meson Manual has been available for purchase for about two months now. This is a sufficient amount of time to be able to estimate total sales amounts and the like. As one of the goals of the project was to see if this could be a reasonable way to compensate FOSS maintainers for their work, let's go through the numbers in detail.

Income

At the time of writing, 45 copies of the book have been sold amounting to around 1350€ of income. The first month's sales were about 700€ and the second one's about 500€. This is the common sales pattern where sales start at some level and then gradually decrease as time passes. Depending on how one calculates the drop-off, this would indicate total sales over a year of 2000-2500€.

Expenses

The ecommerce site charges ~10€ per month, which is 120€ per year. Credit card processing fees should land somewhere in the 200-300€ area depending on sales. Accounting services come to about 500€. Then there are many small things that probably sum up to 100-200€ in a year. After these immediate expenses roughly 800-1300€ remains. Unfortunately this is where things get uglier.

The book was launched in my presentation at linux.conf.au. Having some sort of a launch is a necessity, because just putting things on the Internet does not work. This cost about 2000€ total in travel and lodging expenses. This single expense takes out all remaining income and then some. One could have gone to a conference nearer by, but as Finland is far away from everywhere, the expenses are easily 500€ even for continental Europe. This also assumes that one would have gotten a talk accepted there, which is not at all guaranteed. In fact the vast majority of Meson talks I have submitted have been rejected. Thus, depending on how you count, we are now either about 700€ in the red (with the actual trip) or 300-800€ in the black in the best possible case (a cheaper conference nearby).

Writing a book takes time. A lot of time. The way I got mine was to make a deal with my employer to only work four days a week for several months. I spent that one day a week (specifically Wednesday) writing the book and working out all the requisite bureaucracy. In my day job I'm a consultant and my salary is directly determined by the number of hours billed from customers. Doing the math on this says that writing the manual cost me ~9000€ compared to just being at work.

With this included, the total financial outcome of the manual is a loss of between 8000€ and 9500€.

So was it worth it?

It depends? As a personal endeavor, writing, publishing and selling a full book is very satisfying and rewarding (even though writing it was at times incredibly tedious). But financially? No way. The break-even point seems to be about 10× the current sales. If 100× sales were possible, it might be enough for more people to take the risk and try to make a living this way. With these sales figures it's just not worth it.

Bonus chapter: visibility and marketing

The biggest problem for any project like this is marketing: how to get your project visible. This is either extremely hard or downright impossible. As an example, let's do some Fermi estimation on reachability on Twitter. Suppose you post about your new project on Twitter: how many of your followers will then buy the product? Ten percent is probably too much and one tenth of a percent is too small, so let's say one percent. Thus to sell 400 copies you'd need to have 40 000 direct Twitter followers. Retweets do not count, as their conversion rate is even lower.

Your only real chance is to get visibility in some news media, but that is also difficult. They get inundated by people and projects that want their products to be seen. Thus news sites publish almost exclusively about large established projects, as those are the things that interest their readers the most. Note that this should not be taken as a slight against these sites; it's just the nature of the beast. A news site posting mostly about crowdfunding campaigns of small unknown projects would not be very tempting to read and would go out of business quickly.

Advertising might work, but the downsides include a) it costs more money and b) the target audience for a programming manual is the set of people who are the most likely to use ad blockers or otherwise ignore online ads (I have never bought anything based on an ad I have seen online).

Sunday, March 1, 2020

Unity build test with Meson & LibreOffice

In a previous blog post we managed to build a notable chunk of LibreOffice with Meson. This has since been updated so you can build all top level apps (Writer, Calc, Impress, Draw). The results do not actually run, so the conversion may seem pointless. That is not the case, though, because once you have this build setup you can start doing interesting experiments on a large real world C++ code base. One of these is unity builds.

A unity build is a surprisingly simple technique to speed up builds, especially for C++. The idea is that instead of compiling source files individually, you build them in larger batches by creating files like the following and compiling those instead:

#include<source1.cpp>
#include<source2.cpp>
#include<source3.cpp>
// etc

One of the main downsides of this technique is writing and maintaining the unity files by hand. Fortunately Meson has built-in support for creating unity files for build targets. In the next release you can even specify how many source files you want to have in a single unity source. Enabling unity builds for a single target is simple and consists of adding the following keyword argument to a target's definition:

override_options: ['unity=on']

In this experiment we used LibreOffice's Writer target, which is a single shared library consisting of almost 700 source files. The unity block size was Meson's default of 4.

The results

Compiling source code as a unity build for the first time usually leads to build failures, and that is exactly what happened here as well. There are many reasons for this. The most common are name clashes between static and anonymous namespace functions, and between classes with common names that are used without namespace qualification. LO turned out to have at least three different classes called Size. All these issues need to be fixed either by renaming or by adding namespace qualifiers. This is boring manual work, but on the other hand you may find duplicated static functions across your code base.
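
As a hypothetical example (not actual LO code), both of the following files compile fine on their own, but break once a generated unity file glues them into a single translation unit:

// a.cpp
static int get_id() { return 1; }
namespace { struct Size { int w, h; }; }

// b.cpp
static int get_id() { return 2; }
namespace { struct Size { long width, height; }; }

// The generated unity source then does roughly this:
//   #include <a.cpp>
//   #include <b.cpp>
// Both get_id functions and both Size classes now live in the same
// translation unit, so the compiler reports redefinition errors.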

Once all the issues have been fixed we can do actual measurements. The test machine was an 8-thread i7-3770 with an SSD, and the buildtype was set to debug. The full build times were as follows:

Regular   10m 
Unity      4m 32s

The unity build takes less than half the time of the regular one, which is a fairly typical result. The difference would have been even bigger if we had used a bigger unity block size. Incremental builds were tested by deleting one object file and rebuilding.

Regular 26s
Unity   22s

Typically incremental unity builds are slower than regular ones, but in this case the unity build is actually faster. This is probably because unity builds produce smaller output files with less debug data, so the linker has less work to do. Increasing the unity block size would make the incremental build slower.

Converting the code to compile as a unity build took on the order of three hours. Ironically most of it was spent waiting for compilations to finish. Since a full build is around five and a half minutes faster, the time investment is recovered after the development team has done roughly 35 full builds. The code changes have been done with as little effort as possible, so polishing this to production quality would probably take 2-3 times as long.

Do try this at home!

The code is available in the unitytest branch of this repo. It has only been tested with Meson trunk on Linux. There are two targets that use unity builds: vcl and sw. You can toggle unity builds, as well as the unity block size, per target with the override_options keyword argument.