Tuesday, December 27, 2016

Comparing executable size of C++ exceptions vs plain C error structs

One of the Grand Pieces of Hacker Folklore is that you should not use C++ exceptions because they cause massive code bloat. Instead you should use error codes, as exemplified by plain C, Go, Rust and other languages. These opinions have caused many a heated debate both on and off the Internet.

One of the problems is that measuring the size difference between code that uses exceptions and code that uses error codes is hard to do accurately. Legend has it that when Symbian developers were considering whether to add exception support, they compiled a hello world application (which has no error paths at all), noted that with exceptions enabled it was somewhat bigger and declared this unacceptable. What they did not consider is that this increase in code size was constant, not linear in the code size. They then went on to code a completely nonstandard cleanup mechanism on top of C++, which was both terrible and bloaty.

A more methodological approach

There is really only one way to fully reliably test whether error codes or exceptions produce smaller code. That is to take a large program that uses one or the other mechanism and convert it to use the other without changing anything else. This can take months of work and is obviously not feasible in the real world but would make for an interesting research project.

Instead we can simulate the problem. It is relatively easy to generate code with the same functionality in C and in C++ and compare the size of the resulting binaries. Let's create a function that simulates a simple workload. First it grabs some resource, then calls some other functions and finally returns a result. In simple C++ it would look like this:

int func222() {
    DummyObject dobj;
    int a = func227();
    int b = func224();
    int c = func228();
    return a + b + c;
}

Here DummyObject is a simple object that exists only so the exception unwinding mechanism has something to destroy. Its constructor and destructor are defined elsewhere so that the optimizer can not remove it completely. We create 1000 of these functions, where each one calls three random functions with a higher number. Calls to functions numbered higher than 1000 go to function 1000 instead. Function 1000 just returns 1. This program does not do anything useful and its runtime is exponential, so you should probably not run it.
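
For reference, DummyObject can be as simple as this (a sketch; the actual definition in the repo may differ), with the constructor and destructor implemented in a separate source file so the optimizer can not elide the object:

class DummyObject {
public:
    DummyObject();   // defined in another translation unit
    ~DummyObject();  // ditto, so the compiler must emit unwinding code
};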

In plain C using glib style error "objects" the function looks like this:

int func222(char **error) {
    int a, b, c;
    struct Dummy *d = dummy_new();

    a = func227(error);
    if(*error) {
        dummy_delete(d);
        return -1;
    }
    b = func224(error);
    if(*error) {
        dummy_delete(d);
        return -1;
    }
    c = func228(error);
    if(*error) {
        dummy_delete(d);
        return -1;
    }
    dummy_delete(d);
    return a + b + c;
}

The functionality is the same; we just need to manually check whether the function calls succeeded and deallocate the resource on every path. The code to generate and build these programs is downloadable from this Github repo.
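
To give an idea of what the generator does, a minimal generator for the C++ variant could look roughly like this (a sketch written for this post; the actual script in the repo is structured differently and also produces the C variants):

// Hypothetical generator sketch: emits 1000 functions, each calling three
// pseudo-random functions with higher numbers, clamped to func1000.
// "dummy.h" stands for whatever header declares DummyObject.
#include <cstdio>
#include <cstdlib>
#include <algorithm>

int main() {
    const int num_funcs = 1000;
    FILE *out = std::fopen("generated.cpp", "w");
    std::fprintf(out, "#include \"dummy.h\"\n");
    // Forward declare everything so any function can call any other.
    for (int i = 1; i <= num_funcs; ++i)
        std::fprintf(out, "int func%d();\n", i);
    std::fprintf(out, "int func%d() { return 1; }\n", num_funcs);
    for (int i = 1; i < num_funcs; ++i) {
        const int a = std::min(num_funcs, i + 1 + std::rand() % 10);
        const int b = std::min(num_funcs, i + 1 + std::rand() % 10);
        const int c = std::min(num_funcs, i + 1 + std::rand() % 10);
        std::fprintf(out,
                     "int func%d() {\n"
                     "    DummyObject dobj;\n"
                     "    int a = func%d();\n"
                     "    int b = func%d();\n"
                     "    int c = func%d();\n"
                     "    return a + b + c;\n"
                     "}\n", i, a, b, c);
    }
    std::fclose(out);
    return 0;
}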

The results

We measure three different compilation methods: plain C, plain C++ and C++ with -fno-exceptions. The last one represents the use case where any error is immediately fatal and thus recovery is pointless. We tested with a bunch of compilers. The C++ version was C++14, except on GCC 4.8.4, which does not support it, so C++11 was used instead. The C version was gnu11. All tests were done on the AMD64 architecture except one, which was done on 32 bit ARM (Raspberry Pi 2 + Debian). -O2 and -g were used for optimization and debug info. Here are the sizes of the resulting stripped binaries.


The results are surprising in many ways. Perhaps the biggest one is that using C++ and exceptions produces smaller binaries than plain C every single time. The closest is Clang, but even there the C binaries take over 6% more space. The no-exceptions build is always the smallest, but this is expected as it has no error recovery mechanism at all.

Another thing to note is that GCC seems to have had a big C++ code size increase between versions 4.9.2 and 6.2.0. Clang produces much smaller C binaries than GCC. For C++ the difference is still there but the gap is not as big.

What is causing this?

I don't know for sure. It would be interesting to hear from Clang/GCC developers what is happening here and what actually goes on under the covers. But basically what it (probably) boils down to is that the compiler has more freedom to optimize the steps needed for stack unwinding. When the user does deallocation manually, the compiler needs to obey the code paths they have written even if they are not optimal. Autogenerated stack unwinding just needs to call a few destructors in any way it wants. Optimizing this is tedious work, the kind best left to a machine.

Conclusions

This is only one aspect of the (seemingly) eternal debate between error codes and exceptions. It does not prove (I say again, does not) that one is faster than the other, or even that one always produces more efficient or smaller code than the other. However what we can say is that there are cases where converting from error codes into full exceptions makes the final binary smaller. It would be interesting to know if this is true for other code bases as well. That requires a lot more testing, though.

The Github repo contains more stuff for those interested in going deeper. For example it contains a version of the C code that is a bit more size optimized (though a bit less readable) that only has error codes rather than full error structs. It is still bigger than the straightforward C++ implementation.

Monday, December 19, 2016

What is the global cost of Autoconf?

Waiting for autoconf to finish is annoying. But have you ever considered how much time and effort is wasted every day doing that? Neither have I, but let's estimate it anyway.

First we need an estimate of how many projects are using Autotools. 100 is too low and 10 000 is probably too high, so let's say 1000. We also need an assumption of how long an average autoconf run takes. Variability here is massive, ranging from 30 seconds to (tens of) minutes on big projects on small hardware. Let's say an average of one minute as a round number.

Next we need to know how many times builds are run during a day. Some projects are very active with dozens of developers whereas most have only one person doing occasional development. Let's say each project has on average 2 developers and each one does an average of 2 builds per day.

This gives us an estimate of how much time is spent globally just waiting for autoconf to finish:

1000 proj * 2 dev/proj * 2 build/dev * 1 minute/build =
   4000 minutes = 66 hours

66 hours. Every day. If you are a business person, feel free to do the math on how much that costs.

But it gets worse. Many organisations and companies build a lot of projects each day. As an example there are hundreds of companies that have their own (embedded) Linux distribution that they do a daily build on. Linux distros do rebuilds constantly and so on. If we assume 10 000 organisations that do a daily build and we do a lowball estimate of 5 dependencies per project (many projects are not full distros, but instead build a custom runtime package or something similar), we get different numbers:

10 000 organisations * 1 build/organisation * 5 dependencies/build * 1 min/dependency = 50 000 minutes = 833 CPU hours

That is, every single day over 800 CPU hours are burned just to run autoconf. Estimating CPU consumption is difficult but based on statistics and division we can estimate an average consumption of 20 Watts. This amounts to 16 kWh every day or, assuming 300 working days a year, 5 MWh a year.

That is a lot of energy to waste just to be absolutely sure that your C compiler ships with stdlib.h.

Saturday, December 17, 2016

On reversing btrees in job interviews

Hiring new people is hard. No-one has managed to find a perfect way to separate poorly performing job applicants from the good ones. The method used in most big companies nowadays is the coding interview. In it the applicant is given some task, such as reversing a btree, and they are asked to write the code to do that on a whiteboard without a computer.

At regular intervals this brings up Internet snarkiness.



The points these comments make are basically sound, but they miss the bigger point, which is this:

When you are asked to reverse a btree on a whiteboard, you are not being evaluated based on whether you can reverse a btree on a whiteboard.

Before we look into what you are evaluated on, some full disclosure may be in order. I have interviewed for Google a few times but each time they have chosen not to hire me. I have not interviewed people with this method nor have I received any training in doing so. Thus I don't know what aspects of interview performance are actually being evaluated, and presumably different companies have different requirements. These are just estimates which may be completely wrong.

Things tested before writing code

Getting an onsite interview is a challenge in itself. You have to pass multiple phone screens and programming tests just to get there. This test methodology is not foolproof. Every now and then (rarely, but still) a person manages to slip through the tests even if they would be unqualified for the job. Starting simple is an easy way to make sure that the applicant really is up to the challenge.

The second big test is one of attitude. While we all like to think that programming is about having fun solving hard problems with modern functional languages, creating world altering projects with ponies, rainbows and kittens, reality is more mundane. Over 80% of all costs in a software project are incurred during maintenance. This is usually nonglamorous and monotonous work. Even new projects have a ton of “grunt work” that just needs to be done. The attitude one displays when asked to do this sort of menial task can be very revealing. Some people get irritated because such a task is “beneath their intelligence”. These people are usually poor team players who you definitely do not want to hire.

The most important aspect of team work is communication. Yes, even more important than coding. Reversing a btree is a simple problem but the real work is in explaining what is being done and why. It is weird that even though communication is so important, very little effort is spent on teaching it (in many schools it is not taught at all). Being able to explain how to implement this kind of simple task is important, because a large fraction of your work consists of describing your solutions and ideas to other people. It does not matter if you can come up with the greatest ideas in the world if you can not make others “get” them.

Writing the code

Writing the code to reverse a btree should not take more than a few minutes. It is not hard, anyone who has done basic CS courses can get it done in a jiffy. But writing the code is an almost incidental part of the interview, in fact it is just a springboard to the actual problem.

Things tested after the code is written

Once the basic solution is done, you will get asked to expand it. For a btree questions might include:

  • If you used stack based iteration, what if the btree is deeper than there is stack space?
  • What if you need to be able to abort the reversal and restore the original tree given an external signal during reversal?
  • What if the btree is so large it won't fit in RAM? Or on one machine?
  • What if you need to reverse the tree while other threads are reading it?
  • What if you need to reverse the tree while other threads are mutating it?
  • And so on.

As we can see, even the most seemingly simple problem can be made arbitrarily hard. The point of the original question is not to test whether you can reverse the tree, but to find out how far you can go from there and where the limit of your knowledge lies. If you come out of an interview where you were asked to reverse a tree and the only thing you did was reverse the tree, you have most likely failed.

When the edge of knowledge has been reached comes the next evaluation point: what do you do when you don't know what to do. This is important, because a large fraction of programming is solving problems you have not solved before. It can also be very frustrating, because it goes against the natural problem solving instinct most people have. Saying “honestly, at this point I would look this up on Google” gets you the reply “Suppose you do that and find nothing, what do you do?”.

Different people react in different ways in this situation. These include:

  • freezing up completely
  • whining
  • trying random incoherent things
  • trying focused changes
  • trying a solution that won't solve the problem fully but is better than the current solution (and explicitly stating this before starting work)

This test is meant to reveal how the person works under pressure. This is not a hard requirement for many jobs, but even then those who can keep a cool head under stressful circumstances have a distinct advantage over those who can't.

Points of criticism

There are really only two reliable ways of assessing a candidate. The first one is word of mouth from someone who has worked with the person before. The second one is taking the candidate in and making them work with the team for a while. Neither of these approaches is scalable.

The coding interview aims to be a simplification of this process, where the person is tested on things close to “real world” problems. But it is not the real world, only a model. And like all models, it is a simplification that leaves some things out and emphasizes other things more than they should.

The above-mentioned lack of Internet connectivity is one. It can be very frustrating to be denied the most obvious and simplest approach to a problem (which is usually the correct one) for no apparent good reason. There actually is a reason: to test behavior when information is not available, but it can still seem arbitrary and cold, especially since looking up said information usually yields useful data even if the exact solution is not found.

A different problem has to do with thinking. The interviewers encourage the applicant to think out loud as much as possible. The point of this is to see the thought process, what approaches they are going through and what things they are discarding outright. The reasoning behind this is sound but on the other hand it actively makes thinking about the problem harder for some people.

Every time you start explaining your thought process you have to do a mental context switch and then another one to go back to thinking. Brain research has shown us that different people think in very different ways. Talking while pondering a problem is natural for some people. It is very taxing for others. This gives a natural advantage to people who talk while thinking, but there are not, as far as I know, any research results showing that these sorts of people perform better at a given job.

Mental context switches are not the most problematic thing. For some people explaining their own thought process can be impossible. I don't know how common this is (or if it's even a thing in general) but at least for me personally, I can't really explain how I solve problems. I think about them but I can't really conceptualize how the solution is reached. Once a solution presents itself, explaining it is simple, but all details about the path that led there remain in the dark.

People who think in this manner have a bit of an unfortunate handicap, because it is hard to tell the difference between a person thinking without speaking and a person who has just frozen with no idea what to do. Presumably interviewers are taught to notice this and help the interviewee but since all books and guidelines emphasize explaining your thought process, this might not always succeed perfectly.

But does it really matter?

That's the problem with human issues. Without massive research into the issue we don't know.

Friday, December 2, 2016

What does configure actually run

One of the claimed advantages of Autotools is that it depends (paraphrasing here) "only on make + sh". Some people add sed to the list as well. This is a nice talking point, but is it actually true? Let's find out.

Determining this is simple: we just run the GNU Hello program's configure script under strace like this:

strace -e execve  -f ./configure 2>stderr.txt > stdout.txt

This logs all command invocations of the process and its children into stderr.txt. Then we can massage it slightly with Python and get the following list of commands.

arch
as
basename
bash
cat
cc
cc1
chmod
collect2
conftest
cp
diff
dirname
echo
expr
file
gawk
gcc
getsysinfo
grep
hostinfo
hostname
install
ld
ln
ls
machine
make
mkdir
mktemp
msgfmt
msgmerge
mv
oslevel
print
rm
rmdir
sed
sh
sleep
sort
tr
true
uname
uniq
universe
wc
which
xgettext

This list contains a total of 49 different commands including heavyweights such as diff and gawk.

Thus we find that the answer to the question is no. Configure requires a lot more stuff than just shell plus make. In fact it implicitly requires a big chunk of the Unix userland.

Pedantic postscriptum

It could be said that all of these programs are not strictly required and that configure could (potentially) work without them present. This is probably correct, but many of these programs provide functionality which is essential and not provided by either plain shell or Make.

Tuesday, November 8, 2016

Building code faster and why recursive Make is so slow

One of the most common reactions to Meson we have gotten has been that build time is dominated by the compiler, so trying to make the build system faster is pointless. After all, Make just launches subprocesses and waits for them to finish, so there's nothing to be gained.

This is a reasonable assumption but also false. Even if we ignore the most obvious slow parts, ranging from configure, which can take up to 15 minutes to run on an embedded device, to Libtool, which slows everything down for no good reason, there are still gains to be had.

Meson uses Ninja as its main backend. One of the (many) cool things about it is that it is being actively developed and maintained. A recent addition is the Ninja tracer framework. It takes the output of Ninja's log file and converts it to the timing format understood by Chrome's developer tools. It creates output that looks like this (thanks to Nirbheek for gathering the data).


This is a trace of building all of GStreamer using their new Meson based aggregate repo called gst-build. This build was done using six parallel processes, which make up the six horizontal lanes in the diagram. Each colored square represents one build task, such as compiling or linking, with time increasing from left to right. This picture shows us both why Meson is so efficient and also why Make based builds are slow (though the latter requires some analysis).

The main reason for speed is that Ninja keeps all processor cores working all the time. It can do that because it has a global view of the problem. As an example if we build two targets A and B and B links to A, Ninja can start compiling the source code of B before it has finished linking A. This allows it to keep all cores pegged.

The perceptive reader might have noticed the blank space at the left end of the diagram. During that time only one core is doing anything, all others are idle. The process that is running is the GIR tool processing the gstreamer-1.0 library, which all other subprojects depend on. In theory we should be able to run compiles on other projects but for reliability reasons Meson assumes that some source might include the output of the tool as a header so it does not start compilations until the file is generated. This may seem weird, but there are projects that do these kinds of things and require that they work.

The white gap is also what causes Make to be so slow. Most projects write recursive makefiles because maintaining a non-recursive makefile for even a moderate sized project is a nightmare. When building recursively Make goes into each subdirectory in turn, builds it to completion and only then goes to the next one. The last step in almost every case is linking, which can only be done using one process. If we assume that linking takes one second, then for a project that has 20 subdirectories that adds up to 20 wasted seconds of wall time. For an n core machine it corresponds to (n-1)*(num_directories) seconds of CPU time. Those things add up pretty quickly. In addition to Make, the same issue crops up in Visual Studio project files and pretty much any build system that does not have a global view of the dependency tree.

Those interested in going through the original data can do so. Just open this link with either Chromium or Chrome and the chart will be displayed in devtools. Unfortunately this won't work with Firefox.

Bonus challenge

The output format that Chrome's dev tools understand is straightforward. It would be interesting if someone modified Make to output the same format and ran it on a moderate sized project. GStreamer is an obvious candidate as it has both Meson and Autotools setups in master, though gst-build only works with Meson.

Wednesday, November 2, 2016

Exposing a C++ library with a stable plain C API

There has been a lot of talk recently about using Rust to create shared libraries that have a plain C ABI. This allows you to transition libraries piecewise to Rust. This is seen as a contrast to C++ which has an unstable ABI and thus is hard to maintain and to integrate with other programs and libraries (Python, Ruby etc) that only work with plain C calling conventions.

However, just as you can export Rust code with a C ABI by writing the required boilerplate, the same can be done in C++. Since C++ has native support for C, this is very simple. As an example, a class that looks like this (a lot of stuff omitted for clarity):

class Foo {
public:
    Foo();
    ~Foo();

    void poke(int x);
};

becomes this:

struct CFoo;

struct CFoo* foo_create();
void foo_destroy(struct CFoo *f);
void foo_poke(struct CFoo *f, int x, char **error);

The implementation for each of these just casts CFoo* into Foo* and then calls the corresponding method. The last function handles errors roughly in the same way as GLib's GError. In case of an error the error message is put in the error parameter and it is the responsibility of the caller to check it. For simplicity this is a string, but it could be an error object as well.
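
For example, the create and destroy wrappers can be little more than casts (a sketch, assuming the declarations above are wrapped in extern "C" when compiled as C++ and that the constructor does not throw):

struct CFoo* foo_create() {
    return reinterpret_cast<struct CFoo*>(new Foo());
}

void foo_destroy(struct CFoo *f) {
    delete reinterpret_cast<Foo*>(f);
}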

This is straightforward but the big problem comes in reporting errors. Writing code to convert exceptions to error messages for each and every function is tedious and error-prone. Fortunately there is a simple solution called an hourglass shaped interface using a Lippincott function. With it the implementation of foo_poke looks like this:

void convert_exception(char **error) noexcept {
    char *msg;
    try {
        throw; // The magic happens here
    } catch(const std::exception &e) {
        msg = strdup(e.what());
    } catch(...) {
        msg = strdup("An unknown error happened.");
    }
    *error = msg;
}

void foo_poke(struct CFoo* f, int x, char **error) {
    try {
        reinterpret_cast<Foo*>(f)->poke(x);
    } catch(...) {
        convert_exception(error);
    }
}

Here we have moved all error handling away from the wrapper into a standalone helper function. All interface functions have now been reduced into just calling to the real function and, in case of errors, calling the converter. The reason this works is that C++ stores the exception that is currently in flight and allows it to be rethrown. You can think of it as an invisible argument to the converter function.
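
From the plain C side the wrapper is then used like any other C API (a hypothetical caller; the header name is made up, and the caller owns the strdup'd error message):

#include <stdio.h>
#include <stdlib.h>
#include "cfoo.h" /* hypothetical header declaring the C API above */

int main(void) {
    char *error = NULL;
    struct CFoo *f = foo_create();
    foo_poke(f, 42, &error);
    if (error) {
        fprintf(stderr, "poke failed: %s\n", error);
        free(error); /* message came from strdup, so plain free */
    }
    foo_destroy(f);
    return 0;
}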

This simple construct allows us to expose C++ libraries with a plain C ABI with all the stability guarantees and interoperability goodness that entails. A full sample project can be found here. It also contains a C++ "unwrapper" on top of the plain C API that makes exceptions travel transparently over the interface (that is, using the exact same ABI).

Tuesday, October 11, 2016

Unix Userland as C++ Coroutines

The Unix userland concept is simple. It consists of standalone programs that each do only one thing, which are then chained together. A classical example is removing duplicates from a file, which looks like this:

cat filename.txt | sort | uniq

This is a very powerful concept. However there is a noticeable performance overhead. In order to do this, the shell needs to launch three different processes. In addition all communication between these programs happens over Unix pipes, meaning a round trip call into the kernel for each line of text transferred (roughly, not an exact number in all cases). Even though processes and pipes are cheap on Unix they still cause noticeable slowdown (as anyone who has run large shell scripts knows).

Meanwhile there has been a lot of effort to put coroutines into C++. A very simple (and not fully exhaustive) way of describing coroutines is to say that they basically implement the same functionality as Python generators. That is, this Python code:

def gen():
    yield 1
    yield 2
    yield 3

could be written in C++ like this (syntax is approximate and might not even compile):

int gen() {
    co_yield 1;
    co_yield 2;
    co_yield 3;
}

If we look more closely at the Unix userland specification, we find that even though it is (always?) implemented with processes and pipes, it is not required to be. If we use C's as-if rule, we are free to implement shell pipelines in any way we want as long as the output is the same. This gives us a straightforward guide to implementing the Unix userland with C++ coroutines.

We can model each Unix command line application as a coroutine that reads its input one line at a time, operates on it and writes the result, to either stdout or stderr, in the same way. Then we parse a shell pipeline into its constituent commands, connect the output of element n to the input of element n + 1 and keep reading the output of the last pipeline element until EOF is reached. This allows us to run the entire pipeline without spawning a single process.

A simple implementation with a few basic commands is available on Github. Unfortunately real coroutines were not properly available on Linux at the time of writing, only in a random nonmerged branch of Clang. Therefore the project contains a simple class based implementation that mimics coroutines. Interested readers are encouraged to go through the code and work out how much boilerplate a proper coroutine implementation would eliminate.
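
To give a flavor of the class based approach, a pipeline element can look roughly like this (a sketch invented for this post, not the actual API of the repo):

#include <optional>
#include <string>

// Each pipeline element pulls lines from the previous one via next_line(),
// which returns an empty optional at EOF.
struct Source {
    virtual ~Source() = default;
    virtual std::optional<std::string> next_line() = 0;
};

// "uniq": drop consecutive duplicate lines coming from the upstream element.
struct Uniq : Source {
    Source &input;
    std::optional<std::string> prev;
    explicit Uniq(Source &in) : input(in) {}
    std::optional<std::string> next_line() override {
        while (auto line = input.next_line()) {
            if (!prev || *line != *prev) {
                prev = line;
                return line;
            }
        }
        return std::nullopt;
    }
};

cat and sort become similar classes, and the driver simply keeps calling next_line() on the last element of the chain until it returns an empty value.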

Mini-FAQ

This implementation does not handle background processes/binary data/file redirection/invoking external programs/yourthinghere!

That is correct.

Isn't this just a reinvented Busybox?

Yes and no. Busybox provides all executables in a single binary but still uses processes to communicate between different parts of the pipeline. This implementation does not spawn any processes.

Couldn't you run each "process" in its own thread to achieve parallelism?

Yes. However I felt like doing this instead.

Monday, September 12, 2016

The unspeakable horror of Visual Studio PDB files

Disclaimer: everything said in the following blog post may be completely wrong. Since PDB files are (until very recently) completely undocumented everything below has been determined by trial and error. Doubt everything, believe nothing.

Debugging information, how hard can it be?

When compiling C-like languages, debug information is not a problem. It gets written into the object file along with the code, and when objects are linked to form an executable or shared library, each individual file's debug info is combined and written into the result file. If necessary, this information can then be stripped into a standalone file.

This seems like the simplest thing in the world. Which it is. Unless you are Microsoft.

Visual Studio stores debug information in standalone files with the extension .pdb. Whenever the compiler is invoked it writes an .obj file and then opens the pdb file to write the debug info in it (while removing the old one). At first glance this does not seem a bad design, and indeed it wasn't in the 80s when it was created (presumably).

Sadly multicore machines break this model completely. Individual compilation jobs should be executable in parallel (and in Unix they are) but with pdb files they can't be. Every compile process needs to obtain some sort of lock to be able to update the pdb file. Processes that could be perfectly isolated now spend a lot of time fighting with each other over access to the pdb file.

Fortunately you can tell VS to write each object file's pdb info in a standalone file which is then merged into the final pdb. It even works, unless you want to use precompiled headers.

VS writes a string inside the obj file pointing to the corresponding pdb file. However if you use precompiled headers it fails, because the pdb strings are different in the source object and the precompiled header object file and VS wants to join these two when generating the main .obj file. VS will then note that the strings are different and refuse to create the source object file, because merging two files with serialised arrays is a known unsolvable problem in computer science. The merge would work if you compiled the precompiled header separately for each object file and gave them the same temporary pdb file name. In theory at least; this particular rat hole is not one I wish to get into, so I did not try.

This does work if you give both compilations the same target pdb file (the one for the final target). This gives you the choice between a slowdown caused by locking or a slowdown caused by lack of precompiled headers. Or, if you prefer, the choice of not having debug information.

But then it gets worse.

If you choose the locking slowdown then you can't take object files from one target and use them in other targets. The usual reasons are either to get just one object file for a unit test without needing a recompilation or to emulate -Wl,--whole-archive (available natively only in VS2015 or later) by putting all object files in a different target. Trying gets you a linking error due to an incorrect pdb file name.

There was a blog post recently that Microsoft is having problems in their daily builds because compiling Windows is starting to take over 24 hours. I'm fairly certain this is one of the main reasons why.

But then it gets worse.

Debug information for foo.exe is written to a file called foo.pdb. The debug information for a shared library foo.dll is also written to a file called foo.pdb. That means you can't have a dll and an exe with the same name in the same directory. Unfortunately this is what you almost always want, because Windows does not have rpath, so you can't instruct an executable to look up its dependencies elsewhere (though you can fake it with PATH, yes, really).

Fortunately you can specify the name of the output pdb, so you can tell VS to generate foo_exe.pdb and foo_lib.pdb. Unfortunately VS will also generate a dozen other files besides the pdb, whose names come from the target basename and which you can not change. No matter what you do, files will be clobbered. Even if they did not clobber each other, the files would still be useless, because VS the IDE assumes the file is called foo.pdb and refuses to work if it is not.

All this because writing debug information inside object files, which is where it belongs, is not supported.

But wait, it gets even worse.

Putting debug info in obj files was supported but is now deprecated.

In conclusion

Meson recently landed support for generating PDB files. Thanks to Nirbheek for getting this ball rolling. If you know that the above is complete bull and that it is possible to do all the things discussed above then please file bugs with information or, preferably, patches to fix them.

Friday, September 9, 2016

How to convert an Autotools project to Meson

Converting a project using Autotools into Meson is usually not particularly difficult, just grunt work. The Meson wiki has a guide and lots of other documentation on the subject. Here is an informal rough outline of the steps commonly needed.

Autoconf

Autoconf scripts can often seem impenetrable. Creating a converter script for them is as difficult a task as reimplementing all of Autoconf (and shell and a bunch of other stuff), which is not really feasible. Fortunately there is a sneaky shortcut to eliminate most of the work. The config.h file autoconf produces is easy to parse and autoconvert.

Meson has a helper script for exactly this use and applying it to a project is straightforward. Just copy config.h.in to config.h.meson. Then replace all lines that look like this:

#undef HAVE_ASPRINTF

into this:

#mesondefine HAVE_ASPRINTF

This can't be done automatically because some config headers have other #undef declarations so they can't be changed blindly. Then run the converter script on the resulting file. It will generate a skeleton project file that does the same checks in Meson. The checks it could not understand are left in the source as comments. This script usually needs some manual tweaking but should deal with most of the configure checks for you (unless you are doing something really low level like glib).

Automake

Converting Automake is mostly just the manual work of converting the list of files for each target from Automake syntax to Meson syntax.

Except.

In some projects the Automake setup houses the most baffling and incomprehensible parts of the entire build setup. In one project (which shall remain nameless) we had several project developers looking at the Make/shell pipeline/magic substitution/stuff declaration in the makefile just to understand what it was supposed to do (never mind how it was doing it).

Encountering one of these sucks. Fixing it will take a lot of effort, but the end result is a notable reduction in technical debt and would make sense even if the project were to keep using Autotools. The most important thing when converting these to Meson is not to write them inside the Meson declaration. Always put these kinds of large pipelines in standalone scripts (converted to Python if aiming to support Windows) and just invoke them from Meson. This keeps the magic isolated and testable.

Finalisation

That should be the most of it. The rest is polishing things such as checking for dependencies (with pkg-config) and fixing things that break due to the different environment. As an example many autotools projects are only compiled and run in-source. Meson does not permit in-source builds so if any part of the project assumes an in-source build those need to be fixed. The usual culprit is unit tests.

That should be it. If you encounter any problems feel free to report bugs or chat with us at #mesonbuild on Freenode.

Sunday, August 28, 2016

Weird USB disk slowdown with multiple readers

I have a bunch of files and hard drives scattered about (as I'm sure many of you do, too). I wanted to transfer all of them into one drive, so to start off I wrote a Python program to go through all files on a drive, calculate their SHA-256 and write that to a file. Here is the core of the code that does the evaluation in parallel.

import os
from concurrent.futures import ProcessPoolExecutor

def scan(scanroot, ofilename):
    ofile = open(ofilename, 'w')
    with ProcessPoolExecutor() as pe:
        futures = []
        for root, dirs, files in os.walk(scanroot):
            for f in files:
                fname = os.path.join(root, f)
                futures.append(pe.submit(scan_file, fname))
        for f in futures:
            try:
                ofile.write(f.result())
            except Exception as e:
                print('ERROR:', str(e))

    ofile.close()


Very simple, and it works fine. But when I did this on a USB 3 disk on Linux (Ubuntu Wily) something weird happened. If you do the evaluation with just one process, the disk transfers data at a rate of 70 MB/s, which is only a fraction slower than an internal hard disk. When running 8 simultaneous jobs, the total transfer rate is 4 MB/s, which is almost 20 times slower.

I have no idea what could be causing this but it seems to be specific to USB, internal hard drives handle multiple readers effortlessly.

Wednesday, July 20, 2016

Comparing GCC C cleanup attribute with C++ RAII

I recently got into a Twitter discussion about the GCC cleanup extension in C vs plain old C++ (and Rust, which does roughly the same). There were claims that the former is roughly the same as the latter. Let's examine this with a simple example, starting with the following C code.

Res* func() {
  Res *r = get_resource(); /* always succeeds for simplicity. */
  if(do_something(r)) {
    return r;
  } else {
    deallocate(r);
    return NULL;
  }
}

This is straightforward: get a resource, do something with it and depending on outcome either return the resource or deallocate it and return NULL. This requires the developer to track the life cycle of r manually. This is prone to human errors such as forgetting the deallocate call, especially when more code is added to the function.

This code is trivial, but still representative of real world code. Usually you would have many things and resources going on in a function like this, but for simplicity we omit all those. The question now becomes: how would one make this function truly reliable using the GCC cleanup extension? By reliability we mean that the compiler must handle all cases and the developer must not need to write any manual management code.

The obvious approach is this:

Res* func() {
  Res *r __attribute__((cleanup(cleanfunc)));
  r = get_resource();
  if(do_something(r)) {
    return r;
  } else {
    return NULL;
  }
}
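
Here cleanfunc would be something along these lines (a sketch; GCC calls the cleanup function with a pointer to the variable that is going out of scope):

/* Called automatically when r goes out of scope. */
void cleanfunc(Res **r) {
    if (*r)
        deallocate(*r);
}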

This does not work, though. The cleanup function is always called before function exit, so if do_something returns true, the function returns a pointer to a freed resource. Oops. To make it work you would need to do this:

Res* func() {
  Res *r __attribute__((cleanup(cleanfunc)));
  r = get_resource();
  if(do_something(r)) {
    Res *r2 = r;
    r = NULL;
    return r2;
  } else {
    return NULL;
  }
}

This is pointless manual work which is easy to get wrong, thus violating the reliability requirement.

The only other option is to put the deallocator inside the else block. That does not work either, because it just replaces a call to deallocate with a manually written setup to call the deallocator upon scope exit (which happens immediately).

This is unavoidable with the cleanup attribute. It is always called. It can not automatically handle the case where the life cycle of a resource is conditional. Even if it were possible to tell GCC not to call the cleanup function, that call must be written by hand. Conditional life cycles always require manual work, thus making it unreliable (as in: a human needs to analyze and write code to fix the issue).

In C++ this would look (roughly) like the following.

std::unique_ptr<Res> func() {
  std::unique_ptr<Res> r(new Res());
  if(do_something(r.get())) {
    return r;
  } else {
    return nullptr; // or throw an exception if you prefer
  }
}

In this case the resource is always handled properly. It is not deallocated in the true branch but is deallocated in the false branch. No manual life cycle code is needed. Adding new code to this function is simple because the developer does not need to care about the life cycle of r, the compiler will enforce proper usage and does it more reliably than any human being.

Conclusions

GCC's cleanup attribute is a great tool for improving the reliability of C. It should be used whenever possible (note that it is not supported by MSVC, so code that needs to be portable can't use it). However, the cleanup attribute is not as easy to use or as reliable as C++'s RAII primitives.

Monday, June 27, 2016

Running COBOL on Arduino using the Meson Build System

IDENTIFICATION DIVISION.
PROGRAM-ID. COBOL-ARDUINO.

DATA DIVISION.

In my previous post we found out how to compile projects for the Arduino, which is a bare metal 8 bit microcontroller. One of the advantages of using a full fledged build system is that you can use all the tools in the world as opposed to the basic ones that come with the Arduino IDE. Like, for example, the GNU COBOL compiler.



PROCEDURE DIVISION.

The implementation is actually quite simple. GNU COBOL does not compile to machine code. Instead it transpiles to plain C and does most of the COBOLy things with a runtime library. The design is pretty clever and there is very little setup and configuration. Best of all, it is easy (well, relatively easy at least) to write code that does not cause the COBOL runtime to call into malloc.

The rest of the implementation work is straightforward. We transpile the COBOL code into a C file and then try to compile that. First it fails due to a missing libcob.h. We just create one that is empty. After that it fails with missing struct and function definitions. Struct definitions can be stolen from the real libcob.h. The functions are a bit more work, but writing skeleton implementations that do just the bare minimum to get the program running did not take particularly long.

Then it is just a question of hooking the COBOL compilation to the build system and we are done. The code is available in this Github repo.

EXIT PROGRAM.

Friday, June 24, 2016

Meson build system now has Arduino support

After some midsummer hacking, Meson now has support for building and uploading Arduino projects (caveat: this MR needs to land in master first). This is what it looks like.



A sample project to clone has been set up in Github. It currently only supports Linux and Arduino Uno but the repo has documentation on how to adapt it to your configuration.

Monday, May 16, 2016

Performance testing a new parallel zip file implementation

Previously we looked into rewriting a zip file unpacker from scratch rather than modernising an existing one, as well as measuring its performance. Now we have implemented fully parallel zip file compression, so it is time to measure its performance against existing applications. The new implementation is called Parzip and can be obtained from its Github repo.

Test setup

Parzip uses only LZMA compression (though it can uncompress deflated archives). We tested it against the following programs:

  • Info-Zip, the default zip/unzip implementation on most Unix platforms
  • 7-Zip, the first archiver to use LZMA
  • Tar + gzip, the default way of compressing files on Unix
  • Tar + xz, the same as above but using LZMA compression instead of deflate

The test corpus was a checkout of the Clang compiler, including full source code, SVN directories and a build tree. This amounts to roughly 27 gigabytes of data. The test machine was an i7 machine with 8 cores and a spinning disk running Debian stable. Parzip was self-compiled and used liblzma from the Debian repository; all other binaries were the default versions provided by Debian. All compression settings were left at their default values.

Results

Let's start with the most important measurement: how long did each app take to compress the directory tree.

Info-Zip and gzip both use the deflate algorithm which is a lot faster than LZMA. Here we can see how they are the fastest even though they only use one thread for compression. Parzip is the fastest LZMA compressor by quite a wide margin. Xz is the slowest because it uses only one thread. 7-zip has some threading support which makes it faster than xz, but it loses heavily to Parzip.

Decompression times show a similar trend.

Decompression is a lot faster than compression (this graph is in seconds as opposed to minutes). The interesting point here is that Parzip is much faster than Info-Zip and gzip even though it uses LZMA. Being able to use all cores really helps here.

Finally let's look at sizes of the compressed archives.

Nothing particularly surprising here. LZMA files are roughly the same size with xz being the smallest, 7-zip following close by and Parzip holding the third place. Programs using deflate achieve noticeably worse compression ratio. Info-Zip is the clear loser here, maybe due to it using its own deflate implementation rather than zlib.

Big iron test

We also tested the performance of xz and Parzip on a 48 core heavy duty server.

XZ takes almost three hours to compress the data set while Parzip can achieve the same in just over 20 minutes. This is almost 90% faster. The theoretical best possible speedup would be roughly 97%, but due to overhead and other issues we can't reach those numbers.

Decompression times are similar, with Parzip being 93% faster. This is due to decompression parallelising better than compression. Compression needs to write its results into a single file whereas decompression can write its outputs to the file system independently of each other.

Conclusions

Using modern tools and technologies it is possible to create a parallel file compressor that achieves almost the same compression ratio as the current leading technology (tar + xz) while providing almost linear scaling with the number of processor cores available.

Caveat

If you plan on doing your own measurements (which you should, it's educational), be careful when analysing the results. If you have a slow hard disk and a lot of decompression threads, they may saturate the disk's write capacity.

Wednesday, May 4, 2016

Jzip parallel unzipper performance results

In my previous blog post I wrote about reimplementing Info-Zip in modern C++ so that we can unpack files in parallel. This sparked a lively Reddit discussion. One of the main points raised was that once this implementation gets support for all the other stuff Info-Zip does, it will be just as large.

Testing this is simple: just implement all the missing features. Apart from encryption support (which you should not use; use gpg instead), the work is now finished and available in the jzip Github repo. Here are the results.

Lines of code (as counted by wc)


Info-Zip: 82 057 lines of C
jzip: 1091 lines of C++

Stripped binary size


Info-Zip: 159 kB
jzip: 51 kB

Performance

Performance was tested by zipping a full Clang source + build tree into one zip file. This includes the source, svn directories and all build artifacts. The total size was 9.3 gigabytes. Extraction times were as follows.

Info-Zip: 5m 38s
jzip: 2m 47s

Jzip is roughly twice as fast. This is a bit of an underwhelming result given that the test machine has 8 cores. Further examination showed that the reason for this was that jzip saturates the hard drive's write capacity.

Conclusions

Using a few evenings worth of spare time it is possible to reimplement an established (but relatively straightforward) product with two orders of magnitude less code and massively better performance.

Update: more measurements

Using a 48 core machine with fast disks.

Info-zip: 3m 32s
jzip: 12s

On this machine jzip is 95% faster.

On the other hand when running on a machine with a slow disk, jzip may be up to 30% slower because of write contention overhead.

Sunday, April 24, 2016

Rewriting from scratch, should you do it?

One thing that has bothered me for a while is that the unzip command line tool only decompresses one file at a time. As a weekend project I wanted to see if I could make it work in parallel. One could write this functionality from scratch, but this gave me an opportunity to really look into incremental development.

All developers love writing stuff from scratch rather than fixing an existing solution (yes, guilty as charged). However the accepted wisdom is that you should never do a from-scratch rewrite but instead improve what you have via incremental improvements. Thus I downloaded the sources of Info-Zip and got to work.

Info-Zip's code base is quite peculiar. It predates things such as ANSI C and has support for tons of crazy long dead hardware. MS-DOS ranks among the most recently added platforms. There is a lot of code for 16 bit processors, near and far pointers and all that fun stuff your grandad used to complain about. There are even completely bizarre things such as this snippet:

#ifndef const
#  define const const
#endif

The code base contains roughly 80 000 lines of K&R C. This should prove an interesting challenge. Those wanting to play along can get the code from Github.

Compiling the code turned out to be quite simple. There is no configure script or the like, everything is #ifdeffed inside the source files. You just compile them into an app and then you have a working exe. The downside is that the source has more preprocessor code than actual code (only a slight exaggeration).

Originally the code used a single (huge) global struct that houses everything. At some point the developers needed to make the code reentrant. Usually this means changing every function to take the global state struct as a function argument instead. These people chose not to do that. Instead they created a C preprocessor macro system that can be used to pass the struct as an argument but can also compile the code so it has the old style global struct. I have no idea why they did that. The only explanation that makes any sort of sense is that pushing the pointer onto the stack on every function call was too expensive on 16 bit and smaller platforms. This is just speculation, though, but if anyone knows for sure please let me know.

This meant that every single function definition was a weird concoction of preprocessor macros and K&R syntax. For details see this commit that eventually killed it.

Getting rid of all the cruft was not particularly difficult, only tedious. The original developers were very pedantic about flagging their #if/#endif pairs, so killing dead code was straightforward. The downside was that what remained was awful. The code had more asterisks than letters. A typical function was hundreds of lines long. Some preprocessor symbols were defined in opposite ways in different header files, but things worked because other preprocessor clauses kept all but one from being evaluated (the code confused Eclipse's syntax highlighter, so it was really hard to see what was actually happening).

Ten or so hours of solid work later most of the dead cruft was deleted and the code base had shrunk to 30 000 lines of code. At this point looking into adding threading was starting to become feasible. After going through the code that iterates the zip index and extracts files it became a lot less feasible. As an example, the inflate function was not isolated from the rest of the code. All its arguments were given in The One Big Struct and it fiddled with it constantly. Those would need to be fully separated to make anything work.

That nagging sound in your ear

While fixing the code I kept hearing the call of the rewrite siren. Just rewrite from scratch, it would say. It's a lot less work. Go on! Just try it! You know you want to!

Eventually the lure got too strong so I opened the Wikipedia page on Zip file format. Three hours and 373 lines of C++ later I had a parallel unzipper written from scratch. Granted it does not do advanced stuff like encryption, ZIP64 or creating subdirectories for files that it writes. But it works! Code is available in this repo.

Even better, adding multithreading took one commit with 22 additions and 7 deletions. The build definition is 10 lines of Meson instead of 1000+ lines of incomprehensible Make.

There really is no reason, business or otherwise, to modernise the codebase of Info-Zip. With contemporary tools, libraries and methodologies you can create code that is an order of magnitude simpler, clearer, more maintainable and just all around more pleasant to work with than existing code. In a fraction of the time.

Sometimes rewriting from scratch is the correct thing to do.

This is the exact opposite of what I set out to prove but that's research for you.

Update

The program in question has now been expanded to do full Zip packing + unpacking. See here for benchmarks.

Tuesday, April 19, 2016

A look into Linux software deployment

One of the great virtues of Linux distributions is that they provide an easy way to install massive amounts of software. They also provide automatic updates so the end user does not have to care. This mechanism is considered by many to be the best way to provide software for end users.

But is it really?

Discussions on this topic usually stir strong emotions, but let's try to examine this issue through practical use cases.

Suppose a developer releases a new game. He wants to have this game available for all Linux users immediately and, conversely, all Linux users want to play his game as soon as possible. Further, let's set a few more reasonable requirements for this use case:

  • The game package should be runnable immediately after the developer makes a release.
  • The end user should be able to install and play the game without requiring root privileges.
  • The end user should not be required to compile any source code.
  • The developer should be able to create only one binary package that will run on all distros.

These fairly reasonable requirements are easy to fulfill on all common platforms including Windows, Android, iOS and OSX. However on Linux this is not possible.

The delay between an application's release and its availability in distros is usually huge. The shortest time between releases is Ubuntu's, at once every six months. That means that, on average, it takes three months from release to being usable. And this is assuming that the app is already in Debian/Ubuntu, gets packaged immediately and so on.

Some of you might be scoffing at this point. You might think that you just have to time the releases to right before the distro freezes and that this is not a problem. But it is. Even if you time your releases to Ubuntu (as opposed to, say, Fedora), the final freeze of Ubuntu causes complications. During freeze new packages are not allowed in the archive any more. This takes a month or more. More importantly, forcing app developers to time their releases on distro cadences is not workable, because app developers have their own schedules, deadlines and priorities.

There are many other reasons why these sorts of delays are bad. Rather than go through them one by one, let's do a thought experiment instead. Suppose Steve Jobs were still alive and holding a press conference on the next release of iOS. He finishes it up with this:

- "The next version of iOS will be released January. All applications that want to be in the app store must be submitted in full before the end of December. If you miss the deadline the next chance to add an app to the store will be in two years. One more thing: once the release ships you can't push updates to your apps. Enjoy the rest of the conference."

Even Distortion Field Steve himself could not pull that one off. He would be laughed off the stage. Yet this is the way the Linux community expects application deployment to be done.

A closer look at the numbers

Let's do a different kind of a thought experiment. Suppose I have a magic wand. I wave it once, which causes every application that is in the Android and iOS app stores to become free software. I wave it a second time and every one of those apps grows a Linux port with full Debian packaging and is then submitted to Debian. How long will it take until they are all processed and available in the archive?

Go ahead and do a rough evaluation. I'll wait.

The answer is obvious: it will never happen! There are two major reasons. First of all, the Debian project simply does not have enough resources to review tens of thousands of new packages. The second reason is that even if they did, no-one wants to review 20 different fart apps for DFSG violations.

Now you might say that this is a good thing and that nobody really needs 20 fart apps. This might be true but this limitation affects all other apps as well. If apps should be provided by distros then the people doing application review become a filter between application developers and end users. The only apps that have a chance of being accepted are those some reviewer has a personal interest in. All others get dropped to the sidelines. To an application developer this can be massively demoralizing. Why on earth should he fulfill the whims of a random third party just to be able to provide software to his end users?

There is also a wonderful analogy to censorship here, but I'll let you work that one out yourselves.

Javascript frameworks: same problem, different angle

Let's leave applications for a while and look at something completely different: web frameworks. There are several hundred of them and what most seem to have in common is that they don't come in distro packages. Instead you install them like this:

curl http://myawesomejsframework.com/installer.sh | sudo sh

This is, of course, horrible. Any person with even rudimentary knowledge of operating systems, security or related fields will recoil in horror when seeing unverified scripts being run with root privileges. And then they are told that this is actually going out of style and that you should instead pull the app as a ready-to-run Docker container. That you just run. On your production servers. Without any checksumming or verification.

This is how you drive people into depression and liver damage.

After the initial shock passes, the obvious follow-up question is why. Why on earth would people do this instead of using proven good distro packages? Usually this question is accompanied by several words that start with the letter f.

If you stop and think about the issue objectively, the reason to do this is obvious.

Distro packages are not solving the problems people have.

This is so simple and unambiguous that people have a hard time grasping it. So here it is again, but this time written in bold:

Distro packages are not solving the problems people have!

Typical web app deployment happens on Debian stable, which gets released once every few years. This is really nice and cool, but the problem is that the web moves a lot faster. Debian's package versions are frozen upon release, and deploying a web app on a year-old framework (let alone a two or three year old one) is not feasible. It is just too old.

The follow-up to this is that you should instead "build your own deb packages for the framework". This does not work either. Newer versions of frameworks usually require newer versions of their dependencies, which in turn might require newer versions of their own dependencies, and so on. This is a lot of work, not to mention that the more system packages you replace, the bigger the risk of breaking some part of the base distro.

If you examine this problem with a wider perspective you find that these non-distro package ways of installing software exist for a reason. All the shell scripts that run as root, internal package managers, xdg-app, statically linked binaries, Docker images, Snappy packages, OSX app bundles, SteamOS games, Windows executables, Android and iOS apps and so on exist for only one reason:

They provide dependencies that are not available on the base system.

If the app does not provide the dependencies itself, it will not work.

But what does it really mean?

A wise man once said that all wisdom begins by acknowledging the facts. Let's go through what we have learned thus far.

Fact #1: There are two systems at play: the operating system and the apps on top of it. These have massively different change rates. Distros change every few years. Apps and web sites change daily. Those changes must be deployable to users immediately.

Fact #2: The requirements and dependencies of apps may conflict with the dependencies of the platform. Enforcing a single version of a dependency across the entire system is not possible.

If you take these facts and follow them to their logical conclusion, you find that putting distro core packages and third party applications in the same namespace, which is what mandating the use of distro packages for everything does, cannot work.

Instead what we need to accept is that any mechanism of app delivery must consist of two isolated parts. The first is the platform, which can be created with distro packages just like now. The second part is the app delivery on top. App and framework developers need to be able to provide self-contained packages directly to end users. Requiring approval of any sort from an unrelated team of volunteers is a failure. This is basically what other operating systems and app stores have done for decades. This approach has its own share of problems, the biggest of which is apps embedding unsafe versions of libraries such as OpenSSL. This is a solvable problem, but since this blog posting is already getting a bit too long, I refer you to the link at the end of this post for the technical details.

Switching away from the current Linux packaging model is a big change. Some of it has already started happening in the form of initiatives such as xdg-app, Snappy and Appimage. Like any change this is going to be painful for some, especially for distributions, whose role is going to diminish. The psychological shift has already taken hold, from web frameworks (some of which outright forbid distros from packaging them) to Steam games. Direct deployment seems to be where things are headed. When the world around you changes, you can do one of two things.

Adapt or perish.

Closing words

If you are interested in the mechanics of building app packages and providing security and dependencies for them, watch this presentation on the subject that I gave at LCA 2016.

Sunday, April 10, 2016

Testing performance of build optimisations

This blog post examines the performance effect of various compiler options. The actual test consists of compressing a 270 megabyte unpacked tar file with zlib. As training data for profile guided optimisation we used a different, 5 megabyte tar file.

Each measurement was run five times and the fastest time was chosen. Both the zlib library and the code using it are compiled from scratch. For this we used the Wrap dependency system of the Meson build system. This should make the test portable to all platforms (we tested only Linux + GCC). The test code is available on Github.
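
To give a rough idea of what the measured workload looks like, here is a minimal sketch of a zlib compression benchmark. The file name, buffer handling and timing code are illustrative only and not taken from the actual harness, which lives in the Github repo mentioned above.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <zlib.h>

int main(int argc, char **argv) {
    /* Hypothetical input file; the real test uses a 270 MB unpacked tar. */
    const char *fname = argc > 1 ? argv[1] : "testdata.tar";
    FILE *f = fopen(fname, "rb");
    if(!f) {
        perror("fopen");
        return 1;
    }
    fseek(f, 0, SEEK_END);
    long insize = ftell(f);
    rewind(f);
    unsigned char *src = malloc(insize);
    if(fread(src, 1, insize, f) != (size_t)insize) {
        perror("fread");
        return 1;
    }
    fclose(f);

    /* compressBound gives a worst case size for the output buffer. */
    uLongf outsize = compressBound(insize);
    unsigned char *dst = malloc(outsize);

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    if(compress(dst, &outsize, src, insize) != Z_OK) {
        fprintf(stderr, "Compression failed.\n");
        return 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed = (end.tv_sec - start.tv_sec)
        + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("Compressed %ld bytes to %lu bytes in %.3f seconds.\n",
           insize, (unsigned long)outsize, elapsed);
    free(src);
    free(dst);
    return 0;
}

The real harness additionally builds zlib itself from source through Meson's Wrap system, so that the compiler options under test apply to the library as well as to the code calling it.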

We used two different machines for testing. The first one is a desktop machine with an Intel i7 running the latest Ubuntu and GCC 5.2.1. The other is a Raspberry Pi 2 with Raspbian and GCC 4.9.2. All tests were run with both basic optimization (-O2) and release optimization (-O3).

Here are the results for the desktop machine. Note that the vertical scale does not go down to zero, in order to make the differences more visible. The first measurement uses a shared library for zlib; all others use static linking.


Overall the differences are quite small: the gap between the fastest and slowest time is roughly 4%. Other things to note:

  • static libraries are noticeably faster than shared ones
  • -O2 and -O3 do not have a clear performance order, sometimes one is faster, sometimes the other
  • pgo seems to have a bigger performance impact than lto

The times for Raspberry Pi look similar.



The performance difference between the fastest and slowest options is 5%, which is roughly the same as on the desktop. On this platform pgo also has a bigger impact than lto. However, unlike on the desktop, where lto was basically unmeasurable, here it does have a noticeable (if small, roughly 1%) impact.

Sunday, April 3, 2016

Simple new features of the Meson build system

One of the main design goals of Meson has been to make the 95% use case as simple as possible. This means that instead of providing a framework that people can use to solve their problems themselves, we'd rather just provide the solution.

For a simple example, let's look at using the address sanitizer. On other build systems you usually need to copy a macro someone else has written into your build definition, or write a toggle that adds the compiler flags manually. This is not a lot of work, but everyone needs to do it, and these things add up. Different projects might also name the option differently, so when switching between projects you always need to learn and remember what it was called in each one.

At its core this really is a one-bit issue. Either you want to use the address sanitizer or you don't. With this in mind, Meson provides an interface that is just as simple. You can either enable the address sanitizer during build tree configuration:

meson -Db_sanitizer=address <source_dir> <build_dir> 

Or you can configure it afterwards with the configuration tool:

mesonconf -Db_sanitizer=address <build_dir>
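
As a rough illustration (this snippet is made up for this post and is not part of Meson or any real project), consider the following buggy program. It builds and may even seem to run fine normally, but once the build directory has been configured with b_sanitizer=address as above, running it produces an AddressSanitizer heap-buffer-overflow report pointing at the bad write, without any changes to the build definition.

#include <stdlib.h>

int main(void) {
    int *arr = malloc(4 * sizeof(int));
    /* Writes one element past the end of the allocation. With the
     * address sanitizer enabled this aborts with a heap-buffer-overflow
     * report; without it the bug may go completely unnoticed. */
    arr[4] = 42;
    free(arr);
    return 0;
}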

There are a bunch of other options that you can also enable with these commands (for the remainder of the document we only use the latter).

Link-time optimization:

mesonconf -Db_lto=true <build_dir>

Warnings as errors:

mesonconf -Dwerror=true <build_dir>

Coverage results:

mesonconf -Db_coverage=true <build_dir>

Profile guided optimization:

mesonconf -Db_pgo=generate (or =use) <build_dir>

Compiler warning levels (1 is -Wall, 3 is almost everything):

mesonconf -Dwarning_level=2 <build_dir>
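
To make the warning levels a bit more concrete, here is a small illustrative example (not from any real project). Level 2 adds -Wextra on top of -Wall, so the following compiles silently at warning_level=1 but produces an unused-parameter warning at warning_level=2:

#include <stdio.h>

/* The context parameter is deliberately unused. -Wall (warning_level=1)
 * does not complain about it, but -Wextra (added at warning_level=2) does. */
static int print_twice(const char *msg, void *context) {
    printf("%s\n%s\n", msg, msg);
    return 0;
}

int main(void) {
    return print_twice("hello", NULL);
}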

The basic goal of this option system is clear: you, the user, should only describe what you want to happen. It is the headache of the build system to make it happen.