Tuesday, November 8, 2016

Building code faster and why recursive Make is so slow

One of the most common reactions to Meson we have gotten has been that build time is dominated by the compiler, thus trying to make the build system faster is pointless. After all, Make just launches subproject processes and waits for them to finish, so there's nothing to be gained.

This is a reasonable assumption but also false. Even if we ignore the most obvious slow parts, such as Configure that can take up to 15 minutes to run on an embedded device to Libtool, which slows everything down for no good reason, there are still gains to be had.

Meson uses Ninja as its main backend. One of the (many) cool things about it is that it is being actively developed and maintained. A recent addition is the Ninja tracer framework. It takes the output of Ninja's log file and converts it to the timing format understood by Chrome's developer tools. It creates output that looks like this (thanks to Nirbheek for gathering the data).


This is a trace of building all of GStreamer using their new Meson based aggregate repo called gst-build. This build has been done using six parallel processes making up the six horizontal lanes in the diagram. Each colored square represents one build task such as compiling or linking with time increasing from left to right. This picture shows us both why Meson is so efficient and also why Make based builds are slow (though the latter requires some analysis).

The main reason for speed is that Ninja keeps all processor cores working all the time. It can do that because it has a global view of the problem. As an example if we build two targets A and B and B links to A, Ninja can start compiling the source code of B before it has finished linking A. This allows it to keep all cores pegged.

The perceptive reader might have noticed the blank space at the left end of the diagram. During that time only one core is doing anything, all others are idle. The process that is running is the GIR tool processing the gstreamer-1.0 library, which all other subprojects depend on. In theory we should be able to run compiles on other projects but for reliability reasons Meson assumes that some source might include the output of the tool as a header so it does not start compilations until the file is generated. This may seem weird, but there are projects that do these kinds of things and require that they work.

The white gap is also what causes Make to be so slow. Most projects write recursive makefiles because maintaining a non-recursive makefile for even a moderate sized project is a nightmare. When building recursively Make goes into each subdirectory in turn, builds it to completion and only then goes to the next one. The last step in almost every case is linking, which can only be done using one process. If we assume that linking takes one second, then for a project that has 20 subdirectories that adds up to 20 wasted seconds of wall time. For an n core machine it corresponds to (n-1)*(num_directories) seconds of CPU time. Those things add up pretty quickly. In addition to Make, the same issue crops up in Visual Studio project files and pretty much any build system that does not have a global view of the dependency tree.

Those interested in going through the original data can do so. Just open this link with either Chromium or Chrome and the chart will be displayed in devtools. Unfortunately this won't work with Firefox.

Bonus challenge

The output format that Chrome's dev tools understands is straightforward. It would be interesting if someone modified Make to output the same format and ran it on a moderate sized project using Make. GStreamer is an obvious candidate as it has both Meson and Autotools setups in master, though gst-build only works with Meson.

Wednesday, November 2, 2016

Exposing a C++ library with a stable plain C API

There has been a lot of talk recently about using Rust to create shared libraries that have a plain C ABI. This allows you to transition libraries piecewise to Rust. This is seen as a contrast to C++ which has an unstable ABI and thus is hard to maintain and to integrate with other programs and libraries (Python, Ruby etc) that only work with plain C calling conventions.

However just like you can export Rust code with a C ABI by writing the required boilerplate, the same can be done in C++. Since C++ has native support for C, this is very simple. As an example a class that looks like this (lot of stuff omitted for clarity):

class Foo {
public:
    Foo();
    ~Foo();

    void poke(int x);
};

becomes this:

struct CFoo;

struct CFoo* foo_create();
void foo_destroy(struct CFoo *f);
void foo_poke(struct CFoo *f, int x, char **error);

The implementation for each of these just casts CFoo* into Foo* and then calls the corresponding method. The last function handles errors roughly in the same way as GLib's GError. In case of an error the error message is put in the error parameter and it is the responsibility of the caller to check it. For simplicity this is a string, but it could be an error object as well.

This is straightforward but the big problem comes in reporting errors. Writing code to convert exceptions to error messages for each and every function is tedious and error-prone. Fortunately there is a simple solution called an hourglass shaped interface using a Lippincott function. With it the implementation of foo_poke looks like this:

void convert_exception(char **error) noexcept {
    char *msg;
    try {
        throw; // The magic happens here
    } catch(const std::exception &e) {
        msg = strdup(e.what());
    } catch(...) {
        msg = strdup("An unknown error happened.");
    }
    *error = msg;
}

void foo_poke(struct CFoo* f, int x, char **error) {
    try {
        reinterpret_cast<Foo*>(f)->poke(x);
    } catch(...) {
        convert_exception(error);
    }
}

Here we have moved all error handling away from the wrapper into a standalone helper function. All interface functions have now been reduced into just calling to the real function and, in case of errors, calling the converter. The reason this works is that C++ stores the exception that is currently in flight and allows it to be rethrown. You can think of it as an invisible argument to the converter function.

This simple construct allows us to expose C++ libraries with a plain C ABI with all the stability guarantees and interoperability goodness that entails. A full sample project can be found here. It also contains a C++ "unwrapper" on top of the plain C API that makes exceptions travel transparently over the interface (that is, using the exact same ABI).

Tuesday, October 11, 2016

Unix Userland as C++ Coroutines

The Unix userland concept is simple. It has standalone programs that do only one thing which are then chained together. A classical example is removing duplicates from a file, which looks like this:

cat filename.txt | sort | uniq

This is a very powerful concept. However there is a noticeable performance overhead. In order to do this, the shell needs to launch three different processes. In addition all communication between these programs happens over Unix pipes, meaning a round trip call into the kernel for each line of text transferred (roughly, not an exact number in all cases). Even though processes and pipes are cheap on Unix they still cause noticeable slowdown (as anyone who has run large shell scripts knows).

Meanwhile there has been a lot of effort to put coroutines into C++. A very simple (and not fully exhaustive) way of describing coroutines is to say that they basically implement the same functionality as Python generators. That is, this Python code:

def gen():
    yield 1
    yield 2
    yield 3

could be written in C++ like this (syntax is approximate and might not even compile):

int gen() {
    co_yield 1;
    co_yield 2;
    co_yield 3;
}

If we look more closely into the Unix userland specification we find that even though it is (always?) implemented with processes and pipes, it is not required to. If we use C's as if rule we are free to implement shell pipelines in any way we want as long as the output is the same. This gives us a straightforward guide to implement the Unix userland in C++ coroutines.

We can model each Unix command line application as a coroutine that reads input data, either stdout or stderr, one line at a time, operates on it and outputs the result in the same way. Then we parse a shell pipeline into its constituent commands, connect the output of element n as the input of element n + 1 and keep reading the output of the last pipeline element until EOF is reached. This allows us to run the entire pipeline without spawning a single process.

A simple implementation with a few basic commands is available on Github. Unfortunately real coroutines were not properly available on Linux at the time of writing, only a random nonmerged branch of Clang. Therefore the project contains a simple class based implementation that mimics coroutines. Interested readers are encouraged to go through the code and work out how much boilerplate code a proper coroutine implementation would eliminate.

Mini-FAQ

This implementation does not handle background processes/binary data/file redirection/invoking external programs/yourthinghere!

That is correct.

Isn't this just a reinvented Busybox?

Yes and no. Busybox provides all executables in a single binary but still uses processes to communicate between different parts of the pipeline. This implementation does not spawn any processes.

Couldn't you run each "process" in its own thread to achieve parallelism?

Yes. However I felt like doing this instead.

Monday, September 12, 2016

The unspeakable horror of Visual Studio PDB files

Disclaimer: everything said in the following blog post may be completely wrong. Since PDB files are (until very recently) completely undocumented everything below has been determined by trial and error. Doubt everything, believe nothing.

Debugging information, how hard can it be?

When compiling C-like languages, debug information is not a problem. It gets written in the object file along with code and when objects are linked to form an executable or shared library, each individual file's debug info is combined and written in the result file. If necessary, this information can then be stripped into a standalone file.

This seems like the simplest thing in the world. Which it is. Unless you are Microsoft.

Visual Studio stores debug information in standalone files with the extension .pdb. Whenever the compiler is invoked it writes an .obj file and then opens the pdb file to write the debug info in it (while removing the old one). At first glance this does not seem a bad design, and indeed it wasn't in the 80s when it was created (presumably).

Sadly multicore machines break this model completely. Individual compilation jobs should be executable in parallel (and in Unix they are) but with pdb files they can't be. Every compile process needs to obtain some sort of lock to be able to update the pdb file. Processes that could be perfectly isolated now spend a lot of time fighting with each other over access to the pdb file.

Fortunately you can tell VS to write each object file's pdb info in a standalone file which is then merged into the final pdb. It even works, unless you want to use precompiled headers.

VS writes a string inside the obj file a string pointing to the corresponding pdb file. However if you use precompiled headers it fails because the pdb strings are different in the source object and precompiled header object file and VS wants to join these two when generating the main .obj file. VS will then note that the strings are different and refuse to create the source object file because merging two files with serialised arrays is a known unsolvable problem in computer science. The merge would work if you compiled the precompiled header separately for each object file and gave them the same temporary pdb file name. In theory at least, this particular rat hole is not one I wish to get into so I did not try.

This does work if you give both compilations the same target pdb file (the one for the final target). This gives you the choice between a slowdown caused by locking or a slowdown caused by lack of precompiled headers. Or, if you prefer, the choice of not having debug information.

But then it gets worse.

If you choose the locking slowdown then you can't take object files from one target and use them in other targets. The usual reasons are either to get just one object file for a unit test without needing a recompilation or emulating -Wl,--whole-archive (available natively only in VS2015 or later) by putting all object files in a different target. Trying gets you a linking error due to an incorrect pdb file name.

There was a blog post recently that Microsoft is having problems in their daily builds because compiling Windows is starting to take over 24 hours. I'm fairly certain this is one of the main reasons why.

But then it gets worse.

Debug information for foo.exe is written to a file called foo.pdb. The debug information for a shared library foo.dll is also written to a file called foo.pdb.  That means you can't have a dll and an exe with the same name in the same directory. Unfortunately this is what you almost always want because Windows does not have rpath so you can't instruct an executable to look up its dependencies elsewhere (Though you can fake it with PATH. Yes, really.)

Fortunately you can specify the name of the output pdb, so you can tell VS to generate foo_exe.pdb and foo_lib.pdb. Unfortunately VS will also generate a dozen other files besides the pdb and whose names come from the target basename and which you can not change. No matter what you do, files will be clobbered. Even if they did not clobber the files would still be useless because VS the IDE assumes the file is called foo.pdb and refuses to work if it is not.

All this because writing debug information inside object files, which is where it belongs, is not supported.

But wait, it gets even worse.

Putting debug info in obj files was supported but is now deprecated.

In conclusion

Meson recently landed support for generating PDB files. Thanks to Nirbheek for getting this ball rolling. If you know that the above is complete bull and that it is possible to do all the things discussed above then please file bugs with information or, preferably, patches to fix them.

Friday, September 9, 2016

How to convert an Autotools project to Meson

Converting a project using Autotools into Meson is usually not particularly difficult, just grunt work. The Meson wiki has a guide and lots of other documentation on the subject. Here is an informal rough outline of the steps commonly needed.

Autoconf

Autoconf scripts can often seem impenetrable. Creating a converter script for them is as difficult a task as reimplementing all of Autoconf (and shell and a bunch of other stuff), which is not really feasible. Fortunately there is a sneaky shortcut to eliminate most of the work. The config.h file autoconf produces is easy to parse and autoconvert.

Meson has a helper script for exactly this use and applying it to a project is straightforward. Just copy config.h.in to config.h.meson. Then replace all lines that look like this:

#undef HAVE_ASPRINTF

into this:

#mesondefine HAVE_ASPRINTF

This can't be done automatically because some config headers have other #undef declarations so they can't be changed blindly. Then run the converter script on the resulting file. It will generate a skeleton project file that does the same checks in Meson. The checks it could not understand are left in the source as comments. This script usually needs some manual tweaking but should deal with most of the configure checks for you (unless you are doing something really low level like glib).

Automake

Converting automake is mostly just manual work of converting list of files for each target from Automake syntax to Meson syntax.

Except.

In some projects Automake setup houses the most baffling and incomprehensible parts of the entire build setup. In one project (which shall remain nameless) we had several project developers looking at the Make/shell pipeline/magic substitution/stuff declaration in the makefile just to understand what it was supposed to do (never mind how it was doing it).

Encountering one of these sucks. Fixing it will take a lot of effort, but the end result a notable reduction in technical debt and would make sense even if the project were to keep using Autotools. The most important thing when converting these to Meson is not to write it inside the Meson declaration. Always put these kind of large pipelines in standalone scripts (converted to Python if aiming to support Windows) and just invoke them from Meson. This makes the magic isolated and testable.

Finalisation

That should be the most of it. The rest is polishing things such as checking for dependencies (with pkg-config) and fixing things that break due to the different environment. As an example many autotools projects are only compiled and run in-source. Meson does not permit in-source builds so if any part of the project assumes an in-source build those need to be fixed. The usual culprit is unit tests.

That should be it. If you encounter any problems feel free to report bugs or chat with us at #mesonbuild on Freenode.

Sunday, August 28, 2016

Weird USB disk slowdown with multiple readers

I have a bunch of files and hard drives scattered about (as I'm sure many of you do, too). I wanted to transfer all of them into one drive, so to start off I wrote a Python program to go through all files on a drive, calculate their SHA-256 and write that to a file. Here is the core of the code that does the evaluation in parallel.

def scan(scanroot, ofilename):
    ofile = open(ofilename, 'w')
    with ProcessPoolExecutor() as pe:
        futures = []
        for root, dirs, files in os.walk(scanroot):
            for f in files:
                fname = os.path.join(root, f)
                futures.append(pe.submit(scan_file, fname))
        for f in futures:
            try:
                ofile.write(f.result())
            except Exception as e:
                print('ERROR:', str(e))

    ofile.close()


Very simple and works fine. But when I did this on a USB 3 disk on Linux (Ubuntu Wily) something weird happened. If you do the evaluation with just one process, the disk transfers data at a rate of 70 MB/s, which is a fraction slower the speed of an internal hard disk. When running 8 simultaneous jobs, the total transfer rate is 4 MB/s which is almost 20 times slower.

I have no idea what could be causing this but it seems to be specific to USB, internal hard drives handle multiple readers effortlessly.

Wednesday, July 20, 2016

Comparing GCC C cleanup attribute with C++ RAII

I recently got into a Twitter discussion about the GCC cleanup extension in C vs plain old C++ (and Rust, which does roughly the same). There were claims that the former is roughly the same as the latter. Let's examine this with a simple example. Let's start with the following C code.

Res* func() {
  Res *r = get_resource(); /* always succeeds for simplicity. */
  if(do_something(r)) {
    return r;
  } else {
    deallocate(r);
    return NULL;
}

This is straightforward: get a resource, do something with it and depending on outcome either return the resource or deallocate it and return NULL. This requires the developer to track the life cycle of r manually. This is prone to human errors such as forgetting the deallocate call, especially when more code is added to the function.

This code is trivial, but still representative of real world code. Usually you would have many things and resources going on in a function like this but for simplicity we omit all those. The question now becomes, how would one make this function truly reliable using GCC cleanup extension. By reliability we mean that the compiler must handle all cases and the developer must not need to write any manual management code.

The obvious approach is this:

Res* func() {
  Res *r __attribute__((cleanup(cleanfunc)));
  r = get_resource();
  if(do_something(r)) {
    return r;
  } else {
    return NULL;
  }
}

This does not work, though. The cleanup function is called always before function exit, so if do_something returns true, the function returns a pointer to a freed resource. Oops. To make it work you would need to do this:

Res* func() {
  Res *r __attribute__((cleanup(cleanfunc)));
  r = get_resource();
  if(do_something(r)) {
    Res *r2 = r;
    r = NULL;
    return r2;
  } else {
    return NULL;
  }
}

This is pointless manual work which is easy to get wrong, thus violating reliability requirements.

The only other option is to put the deallocator inside the else block. That does not work either, because it just replaces a call to deallocate with a manually written setup to call the deallocator upon scope exit (which happens immediately).

This is unavoidable with the cleanup attribute. It is always called. It can not automatically handle the case where the life cycle of a resource is conditional. Even if it were possible to tell GCC not to call the cleanup function, that call must be written by hand. Conditional life cycles always require manual work, thus making it unreliable (as in: a human needs to analyze and write code to fix the issue).

In C++ this would look (roughly) like the following.

std::unique_ptr<Res> func() {
  std::unique_ptr<Res> r(new Resource());
  if(do_something(r)) {
    return r;
  } else {
    return nullptr; // or throw an exception if you prefer
  }
}  

In this case the resource is always handled properly. It is not deallocated in the true branch but is deallocated in the false branch. No manual life cycle code is needed. Adding new code to this function is simple because the developer does not need to care about the life cycle of r, the compiler will enforce proper usage and does it more reliably than any human being.

Conclusions

GCC's cleanup attribute is a great tool for improving reliability to C. It should be used it whenever possible (note that it is not supported on MSVC so code that needs to be portable can't use it). However the cleanup attribute is not as easy to use or reliable as C++'s RAII primitives.