Wednesday, March 11, 2020

The character that swallowed a week

In the last few posts we have looked at compiling LibreOffice from scratch using Meson. Contrary to what one might expect it was not particularly difficult, just laborious. The codegen bits (yes, there are several) required some deciphering, but other than that it was fairly straightforward. Unfortunately just compiling source code is not sufficient, as usually one also wants to run the result. Further you'd want the new binaries to behave in the same way as the old ones. This is where things get interesting.

Trying to run the main LibreOffice application is not particularly useful because it will almost certainly not work. Fortunately LO provides a bunch of sample and test applications one can use. One of these is a demo app that starts, initialises the LO runtime and opens up a GUI window. Perfect. After a build (which takes an hour) and install the binary is ready to run and produces a … segfault. More annoyingly it produces a segfault without any useful information that would aid in debugging.

Time to take out strace. Several attempts later we discover that LO tries to be dynamic and smart. LO consists of >150 shared libraries. Rather than link all of them together, what seems to be happening is that the code tries to determine where the main executable is, then looks up shared libraries relative to that binary, dlopens them and then goes on its merry way. If this fails it crashes somewhere for some reason I chose not to dig into. This debugging brought up an interesting discovery about naming. One of the libraries is called SAL, so naturally you would expect the actual file to be named after it. This is also what the Makefile defining the library calls it. But that is not its actual name. Somewhere within the bowels of the 150 000 lines of Make that make (ha) up the system, some (but not all) library names get an lo appended to them, and SAL is one of those. There does not seem to be any logic to which ones, but fine, they can at least be fixed with manual work.
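The lookup scheme described above can be sketched roughly as follows. This is a minimal Python illustration and not LibreOffice's actual loader, which is C++. The function names are hypothetical, and the only assumption is that the library sits in the same directory as the executable.

```python
import ctypes
import sys
from pathlib import Path


def path_next_to_executable(executable: str, file_name: str) -> Path:
    """Return where file_name would live if it sits beside the executable."""
    # resolve() follows symlinks, so a symlinked launcher still points
    # at the real installation directory.
    return Path(executable).resolve().parent / file_name


def load_next_to_executable(file_name: str, executable: str = sys.executable):
    """dlopen a shared library located relative to the given binary."""
    candidate = path_next_to_executable(executable, file_name)
    # ctypes.CDLL raises OSError if the file is missing or cannot be loaded.
    # Failing loudly at this point is easier to debug than a later segfault.
    return ctypes.CDLL(str(candidate))
```

With a scheme like this, a library whose file name differs from what the loader asks for simply fails to load, which is why the naming mismatch mattered.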

After that the program failed trying to open some data files. This was easily fixed by swiping all binary files from an existing working installation. Then it failed with a weird assert failure that seemed to indicate that the runtime object system had not been properly initialised. LO is built around a custom object model called UNO or Universal Network Objects. As you can tell from the name, it comes from the 90s and has the following properties:
  • It provides an object model very close to the one in Java
  • Absolutely everything is done with inheritance, and all functionality is at least three namespaces deep
  • Objects are defined with a custom language that is compiled to C++ headers using a self-built tool (I'm told this generates roughly a million lines of C++ headers in total)
  • All class types and objects are constructed at runtime
  • The runtime is written in C++, but the actual implementation is all void pointers and reinterpret casts

Break all the tools!

At this point the easy avenues had been explored and it was time to bring out gdb. While editing build definition files is something you can do with a simple editor, debugging really requires an IDE. The command line interface, even the curses one, is cumbersome compared to a full-fledged GUI with mouse hovers for variables and so on. Fortunately this is a simple task: create a new Eclipse project, build (taking an hour again), install, run and debug. It worked really nicely. For about a day.

Eventually Eclipse's code indexer crashed because it was doing a full reindexing operation. When you restart Eclipse after such a crash it will restart the indexing operation and promptly crash again. The eventual solution was to manually delete all code that is not absolutely necessary for the test app, such as Writer, Calc and the rest. This shrank the working set enough that it did not crash any more. And no, increasing the amount of memory allocated to Eclipse's JVM does not fix the issue. It crashes even if given 8 gigs of memory. With this done the sequence of steps leading to the crash could be debugged:

  1. The runtime tries to initialise itself, fails and throws its own custom exception type.
  2. The exception is caught, ignored and discarded, followed by a call to custom error handling code.
  3. The code retrieves the exception via other methods, and then tries to introspect the object for further information.
  4. Introspecting the error object for the reason why runtime initialisation failed requires the runtime to be initialised and working. As it is not, things crash.

Interestingly if you change the code at step 2 to just grab the exception and print its error message, it works correctly. Hooray for frameworks, I guess.
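The failure mode in the steps above can be reproduced in miniature. The following Python sketch is only an analogy for the C++ code in question. Every class and function name in it is hypothetical, as is the error message.

```python
class RuntimeNotInitialised(Exception):
    """Raised when the (hypothetical) runtime fails to start."""


class Runtime:
    def __init__(self):
        self.initialised = False

    def initialise(self):
        # Step 1: initialisation fails with a custom exception type.
        raise RuntimeNotInitialised("initialisation failed")

    def introspect(self, obj):
        # Introspection only works on a runtime that started correctly.
        if not self.initialised:
            raise RuntimeError("introspection needs an initialised runtime")
        return {"type": type(obj).__name__, "message": str(obj)}


def start_with_introspection(runtime):
    try:
        runtime.initialise()
    except RuntimeNotInitialised as error:
        # Steps 3 and 4: describing the error goes through the runtime,
        # which is the very thing that just failed, so this raises again.
        return runtime.introspect(error)


def start_with_plain_message(runtime):
    try:
        runtime.initialise()
    except RuntimeNotInitialised as error:
        # The simple variant: just report the message the exception carries.
        return str(error)
```

Calling `start_with_plain_message(Runtime())` returns the message, while `start_with_introspection(Runtime())` raises a second error that hides the original one.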

After some fixes the issue was gone but a new one took its place. Somehow the code got into a loop where function A called B, which called C and so on for about 20 functions until something called back to function A. This led to an eternal loop and stack exhaustion.

At this point we have two supposedly identical programs: one built with gbuild and one built with Meson. They should behave identically but don't. This gives us an approach to solve the problem: single step both programs in a debugger until their execution differs. When that happens we know where the bug is. In practice this is less simple. Because the code uses a ton of OO helper classes and templates, even a single function call can lead into many stack frames that you have to step through manually. These helpers go away when optimisations are enabled, but in this particular case optimisations can't be used as they make debugging quite difficult. Even -Og changes the code too much to be useful. So no optimization it is and every time you change the optimization level to test it takes, you guessed it, an hour to build (or more if you try -O2).

Manually single stepping through code is the sort of tedious manual labor that humans are bad at. Fortunately gdb has a Python scripting interface. Thus one could write a script to run in each gdb instance that connects to a common server that orders them to single step and report their current PC location and then halt when they differ. This worked perfectly until the programs called into libc. For some reason the same calls into libc (specifically getenv) took a different number of steps to finish so the programs desynched almost immediately. Fixing that seemed to take too much work so that idea was scrapped.
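The comparison at the heart of that idea is small. The sketch below is not the script described above. It is a simplified stand-in that compares two already recorded lists of source locations instead of driving two live debuggers through a server. Inside gdb the locations could be collected with the Python API, for example by calling `gdb.execute("step")` and reading `gdb.selected_frame().find_sal()`. The file names and line numbers in the example are made up.

```python
from itertools import zip_longest


def first_divergence(trace_a, trace_b):
    """Return (step, loc_a, loc_b) for the first differing step, or None."""
    # zip_longest pads the shorter trace with None, so one program
    # stopping early also counts as a divergence.
    for step, (loc_a, loc_b) in enumerate(zip_longest(trace_a, trace_b)):
        if loc_a != loc_b:
            return step, loc_a, loc_b
    return None


# Hypothetical traces of (file, line) pairs from the two builds.
reference = [("main.cxx", 10), ("any.hxx", 42)]
candidate = [("main.cxx", 10), ("any.hxx", 43)]
print(first_divergence(reference, candidate))
# prints (1, ('any.hxx', 42), ('any.hxx', 43))
```

This does nothing about the libc problem: if the two programs take a different number of steps through the same call, the traces stop lining up and the first reported difference is noise.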

Manually single stepping through code is difficult because if you accidentally go too far, you have to start over again. Fortunately we now have rr, which allows you to step backwards in code execution as well. Recording a trace of one of the programs worked. The other program worked as well. Running both of them at the same time failed miserably for reasons that remained unclear.

Debuglander ][: The Sloggening

Nevertheless at this point I had two debugging aids: what actually should happen as an rr trace and what actually was happening in a live debugger. Now it was just a matter of finding out where their execution differs. After about two days of debugger stepping I gave up. Doing this sort of work by hand is just not something the human brain does very well (or at least mine doesn't). It was back to the old straw-grasping-at board.

Like most big frameworks, UNO had its own code that does special compiler magic. The Makefile lists several flags that should not be used to compile said code. I tried adding those in the Meson version. It did not help. There is also some assembly code that fiddles with registers. Maybe that was the problem? Nope again. Maybe one of the libraries is missing some preprocessor define that causes bad compilations? No. Everything seemed to be in order and doing the exact same thing as the original build system did.

LO does provide unit tests for some functionality. I had not built them with Meson yet, so I figured I'd fix some of those just to get something done. Eventually I converted one test that just exercises an Any object and it crashed in the exact same way. Now I had a reproducer for the bug with 10x to 100x less code. Once more unto the debuggers dear friends, once more!

Eventually the desync point was found. One execution went to a header file line 16 and the other went to the same header, but to line 61. It was getting fairly late so simply noticing the difference between 16 and 61 took a while. Now we had the smoking gun, but there was one more twist to this story: the header file did not have a line 61, it had only about 30 lines of text.

Nevertheless the debugger reported line 61. It even allowed one to inspect variables and all the things you would not expect to be able to do when your process execution is on a nonexisting line. What was happening? Was the debug info in the ELF files corrupted in some way? And then it finally hit me.

LibreOffice generates two different sets of headers for the same IDL files: regular and comprehensive (there is also a third bootstrap one, but let's ignore that). These headers have the same names and the same supposed behaviour but different implementations. You can #include either in your code and it will compile and sort-of-work. I don't know why the original developers had decided to tempt the ODR violation gods in this way and nobody on the LO IRC channel seemed to know either, but finally the core issue had been found. When reverse engineering the Makefiles I had found the code generation phase that creates the regular headers rather than the comprehensive ones.

The fix consisted of changing one of the code generator's command line arguments from -l to -C, recompiling (which took an hour again) and running the binary. Finally, one week's worth of debugging work was rewarded with this:

A blank window has never looked sweeter.

Final commercial announcement

If you enjoyed this post and would like to know more about the Meson build system, consider buying my ebook.

1 comment:

  1. Internal libraries/components get the "" treatment, external ones do not :-)

    Note that we have not one, not two, but 3 code generators! that compile some kind of IDL file.
    (it used to be worse, we used to have our own pre-processor, and our own branch of make)