Wednesday, January 4, 2017

Clarifications to xz compression article

My previous article on beating the compression performance of xz went a bit viral. Comments on various discussion forums seemed to bring up the same slightly erroneous points, so here are a few clarifications.

Random access to files is a useless feature

For most use cases this is probably true. But that was not the main point of the experiment. The point was to enable parallel compression without making the final size much bigger. Even more important was to enable parallel decompression. Very few file formats support that without losing compression performance. As a classic example, PKZIP supports parallel decompression, but its archives are a lot bigger than packed tar files.
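
To illustrate the general idea (this is a sketch, not the actual format used in the experiment): if an archive is stored as independently compressed blocks, decompression parallelizes trivially, because each block is a self-contained stream.

import lzma
from concurrent.futures import ProcessPoolExecutor

def decompress_block(block):
    # Each block is a complete, self-contained xz stream.
    return lzma.decompress(block)

def parallel_decompress(blocks):
    # A single monolithic .xz stream offers no such split points, so
    # this only works when the blocks were compressed separately.
    with ProcessPoolExecutor() as pool:
        return b"".join(pool.map(decompress_block, blocks))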

Tarballs are just fine and need no changes

Many people were of the opinion that tarballs are "good enough for source distribution". This is probably correct. Unfortunately when people hear the word "tar" they immediately think of source tarballs. That is only a small fraction of tar usage in the world. Many file formats have tar files (or equivalent) hidden inside them. Thus all operations on said files are serial and can use only one CPU at a time.

As an example, the deb file format is an ar archive holding a bunch of metadata plus the actual payload as an embedded tar file. RPM is the same, except that it embeds a cpio archive rather than tar. This means that installing updates is an inherently serial operation. Update packages can only be installed one at a time, and because they are in a tarlike format, only one CPU can be used for processing them. This is particularly slow on platforms such as the Raspberry Pi that have underpowered multicore CPUs.

When running workloads such as a CI builder, there are multiple steps that are serial when using tar-like file formats. As an example pbuilder has the following steps:
  • unpack base image <- serial
  • install updates <- serial
  • install dependencies <- serial
  • unpack source <- serial
  • compile <- parallel
  • compress final result <- serial
If you are building LibreOffice, Firefox or Chromium, the compilation step dominates. For smaller packages the bottleneck may be elsewhere. As an extreme example, the Debian package build of Meson installs circa 1.5 GB of dependency packages, but the actual build consists of a Python distutils invocation that takes all of one second (running the test suite does take a while, though).

This is not suitable for production due to X

It was never meant to be. It was merely an experiment to see whether a better file format is possible.

Pack/unpack operations are bound by disk, not by CPU

This is true on some platforms but not all of them. LZMA compression is almost always CPU bound even on the fastest CPUs and so is decompression if you have a fast SSD. Disks are getting faster all the time, CPUs are not, so CPUs will become even more of a bottleneck in the future.

What we have is good enough, the benefits are not worth an overhaul and even if they were inertia would prevent uptake

Maybe. But experiments are fun to do regardless.

Monday, January 2, 2017

Beating the compression performance of xz, or has the time come to dump tar?

Tar is one of the basic tools of Unix. It has been in use, essentially unchanged, since 1979. It does one thing and it does it quite well. At its core it is a stream format for describing files: a sequence of consecutive metadata and content pairs. This is a simple and reliable format, but it has a big downside: individual entries can not be accessed without processing the entire file from the beginning. Wikipedia has further details.

Compression makes this even worse. Accessing byte n requires unpacking every byte from the beginning of the file. This is unfortunate in itself but it has an even bigger downside. Compression and decompression are inherently serial operations. They can not be run in parallel. This was not really an issue in the seventies, but nowadays even a Raspberry Pi has four cores. Using only one seems wasteful.
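
To make the cost concrete, here is a sketch in Python (the archive and member names are made up): reading even a single member of a .tar.xz means decompressing everything stored before it.

import tarfile

# 'r|xz' opens the archive as a forward-only decompressed stream;
# everything stored before the wanted member gets unpacked anyway.
with tarfile.open("sources.tar.xz", "r|xz") as tf:
    for member in tf:
        if member.name == "project/src/main.c":  # hypothetical path
            data = tf.extractfile(member).read()
            break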

There are some attempts to work around this, such as pigz, but they just chop the full file into constant sized blocks and compress them in parallel. Even in this case parallel decompression is not possible.

Why is tar used then?

In addition to inertia, tar has one major feature on its side: it compresses really well. Modern compressors such as lzma (and even zlib) need a lot of data to achieve high compression ratios. Tar clumps everything into one file, which suits compressors perfectly. Let's examine how much this matters. For testing we took the Linux kernel 4.9 source tree. We first recompressed it with xz using the same encoder settings as for the other systems. The unpacked source is 762 MB and the compressed size is 92 MB, which is our baseline.

The obvious comparison to tar + xz is a zip file using LZMA compression. There is an implementation called Parzip that supports parallel LZMA compression out of the box. When run it produces a zip file that is 164 MB in size. It is easy to see why tar + xz is so popular: the main point of compression is to save space, and a file that is almost twice the size is unacceptable.

One inefficiency of PKZIP is that the metadata format is both messy and uncompressed. Creating a new file format that stores the index as a single compressed lump at the end of the file is straightforward. The code for this (and the rest of the examples listed here) can be found in the jpak Github repo. Doing this creates a file that is 158 MB in size. This is better, but the compression ratio is still unusably bad.
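
The layout is roughly the following (a sketch of the general idea; jpak's real on-disk format may differ): file contents first, then a single LZMA-compressed index, then a fixed-size footer recording where the index starts.

import json, lzma, struct

def write_archive(out, entries):
    # entries: iterable of (name, payload) pairs. Sketch only.
    index = []
    for name, payload in entries:
        index.append({"name": name, "offset": out.tell(), "size": len(payload)})
        out.write(payload)
    index_offset = out.tell()
    out.write(lzma.compress(json.dumps(index).encode()))
    out.write(struct.pack("<Q", index_offset))  # 8-byte footer

def read_index(f):
    f.seek(0, 2)                      # seek to end of file
    file_size = f.tell()
    f.seek(file_size - 8)
    (index_offset,) = struct.unpack("<Q", f.read(8))
    f.seek(index_offset)
    blob = f.read(file_size - 8 - index_offset)
    return json.loads(lzma.decompress(blob))

With the index in hand, any individual file can then be fetched with a single seek.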

If we wish to preserve perfect O(1) random access to the compressed data, this is probably the best we can do. But if we loosen the requirements a bit we can get further. An archive basically consists of a sequence of files of different sizes. If we start from the top and pick files until their total uncompressed size reaches some predetermined limit, concatenate them into a single clump and compress that, we can give the compressor large swathes of data at a time and, in addition, compress the different clumps in parallel. To access a random file we need to decompress only the clump it is in, not the entire archive from the beginning. This makes the operation take O(clumpsize) time.
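
In code, the clumping could look something like this sketch (not jpak's actual implementation):

import lzma, os
from concurrent.futures import ProcessPoolExecutor

def make_clumps(paths, limit=10 * 1024 * 1024):
    # Greedily pack files into clumps of roughly `limit` uncompressed bytes.
    clumps, clump, size = [], [], 0
    for path in paths:
        clump.append(path)
        size += os.path.getsize(path)
        if size >= limit:
            clumps.append(clump)
            clump, size = [], 0
    if clump:
        clumps.append(clump)
    return clumps

def compress_clump(paths):
    data = b"".join(open(p, "rb").read() for p in paths)
    return lzma.compress(data)

def compress_all(paths):
    # Each clump becomes an independent LZMA stream, so both this step
    # and later decompression can use every core.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(compress_clump, make_clumps(paths)))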

If we do this using a clump size of 1 MB (uncompressed), the result is a 102 MB file. This is a big improvement but still lags behind tar + xz. Increasing the clump size to 10 MB produces a 93 MB file. Upping the size to 100 MB (which means that the archive will consist of 8 clumps) produces a 91 MB file. This is smaller than what xz produces.


Wait, what? How can it be smaller than tar+xz?

The key here lies in the tar file format. It interleaves file data and metadata. Compressors do not really like this, because the two kinds of data usually have different statistics, and putting similar data next to each other improves compression ratios. Jpak stores the file metadata in a structure-of-arrays formation. That is, each metadata entry is a structure with properties A, B, C and so on. Jpak first writes property A of all entries, followed by all properties B, and so on. This layout compresses better than tar's intermixed one. Or that is the working theory at the moment; a thorough analysis has not been done.
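
As a concrete illustration of the idea (jpak's actual encoding may differ in its details), writing the metadata as arrays of properties could look like this:

import lzma, struct

def pack_metadata(entries):
    # All names together, then all sizes, then all modes. Runs of
    # similar values sit next to each other, which LZMA models far
    # better than tar's name/size/mode/name/size/mode interleaving.
    names = b"\0".join(e["name"].encode() for e in entries) + b"\0"
    sizes = struct.pack(f"<{len(entries)}Q", *(e["size"] for e in entries))
    modes = struct.pack(f"<{len(entries)}I", *(e["mode"] for e in entries))
    return lzma.compress(names + sizes + modes)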

It should be noted that the actual compression method is the same in every case: LZMA as provided by liblzma. The only difference is in how the data is laid out in the file. Changing the layout allows you to do the compression and decompression in parallel, get a random access index to your data and achieve as good or even better compression ratios with the downside being slightly slower access to individual files.

The small print

The current implementation does not actually do the operations in parallel (though Parzip does). The code has not been tested very thoroughly, so it might fail in many interesting ways. Do not use it to store any data you care about. The current implementation stores user and group metadata only as numbers rather than as strings, so it is not exactly equivalent to tar (though user/group strings are usually the same for most files, so they compress really well). Other metadata bits may be missing as well.

Tuesday, December 27, 2016

Comparing executable size of C++ exceptions vs plain C error structs

One of the Grand Pieces of Hacker Folklore is that you should not use C++ exceptions because they cause massive code bloat. Instead you should use error codes, as exemplified by plain C, Go, Rust and other languages. These opinions have caused many a heated debate both on and off the Internet.

One of the problems is that accurately measuring the size difference between code that uses exceptions and code that uses error codes is hard to do properly. Legend has it that when Symbian developers were considering whether to add exception support, they compiled a helloworld application (which has no error paths at all), noted that with exceptions enabled it was somewhat bigger, and declared this unacceptable. What they did not consider is that this increase in code size was constant, not linear in the code size. They then went on to code a completely nonstandard cleanup mechanism on top of C++, which was both terrible and bloated.

A more methodological approach

There is really only one way to fully reliably test whether error codes or exceptions produce smaller code. That is to take a large program that uses one or the other mechanism and convert it to use the other without changing anything else. This can take months of work and is obviously not feasible in the real world but would make for an interesting research project.

Instead we can simulate the problem. It is relatively easy to generate code with the same functionality in C and in C++ and compare the results. Let's create a function that simulates a simple workload. First it grabs some resource, then calls some other functions and finally returns a result. In simple C++ it would look like this:

int func222() {
    DummyObject dobj;
    int a = func227();
    int b = func224();
    int c = func228();
    return a + b + c;
}

Here DummyObject is a simple object that exists only so the exception unwinding mechanism has something to destroy. Its constructor and destructor are defined elsewhere so that the optimizer can not remove it completely. We create 1000 of these functions, where each one calls into three random functions with a higher number. Calls to functions bigger than 1000 are done to function 1000 instead, and function 1000 just returns 1. This program does not do anything useful and its runtime is exponential, so you should probably not run it.
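
The generator itself is small. Here is a sketch of the idea in Python (the real script is in the Github repo mentioned below and differs in details, such as the range the random callees are drawn from; a declaration of DummyObject is assumed to be prepended to the output):

import random

def gen_cpp(num_funcs=1000):
    # Prototypes first so any function can call any higher-numbered one.
    out = ["int func%d();" % i for i in range(num_funcs + 1)]
    out.append("int func%d() { return 1; }" % num_funcs)
    for i in range(num_funcs):
        # Three random higher-numbered callees, clamped to func1000.
        callees = [min(random.randint(i + 1, i + 10), num_funcs)
                   for _ in range(3)]
        out.append("int func%d() {\n"
                   "    DummyObject dobj;\n"
                   "    int a = func%d();\n"
                   "    int b = func%d();\n"
                   "    int c = func%d();\n"
                   "    return a + b + c;\n"
                   "}" % (i, *callees))
    return "\n".join(out)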

In plain C using glib style error "objects" the function looks like this:

int func222(char **error) {
    int a, b, c;
    struct Dummy *d = dummy_new();

    a = func227(error);
    if(*error) {
        dummy_delete(d);
        return -1;
    }
    b = func224(error);
    if(*error) {
        dummy_delete(d);
        return -1;
    }
    c = func228(error);
    if(*error) {
        dummy_delete(d);
        return -1;
    }
    dummy_delete(d);
    return a + b + c;
}

The functionality is the same; we just need to manually check whether the function calls succeeded and deallocate the resource on every exit path. The code to generate and run this can be downloaded from this Github repo.

The results

We measure three different compilation methods: plain C, plain C++, and C++ with -fno-exceptions. The last one represents the use case where any error is immediately fatal and thus recovery is pointless. We tested with a bunch of compilers. The C++ version used was C++14, except on GCC 4.8.4, which does not support it, so C++11 was used instead. The C version was gnu11. All tests were done on the AMD64 architecture except one, which was done on 32 bit ARM (Raspberry Pi 2 + Debian). -O2 and -g were used for optimization and debug info. Here are the sizes of the resulting stripped binaries.
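
The measurement harness boils down to something like this sketch (source and binary names here are made up; the real driver lives in the aforementioned Github repo):

import os, subprocess

VARIANTS = [
    ("plain_c",      ["gcc", "-std=gnu11", "errors.c", "dummy.c"]),
    ("cpp",          ["g++", "-std=c++14", "exceptions.cpp", "dummy.cpp"]),
    ("cpp_noexcept", ["g++", "-std=c++14", "-fno-exceptions",
                      "exceptions.cpp", "dummy.cpp"]),
]

for name, cmd in VARIANTS:
    subprocess.run(cmd + ["-O2", "-g", "-o", name], check=True)
    subprocess.run(["strip", name], check=True)   # measure stripped size
    print(name, os.path.getsize(name), "bytes")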


The results are surprising in many ways. Perhaps the biggest one is that using C++ and exceptions produces smaller binaries than plain C every single time. The closest is Clang, but even there the C binaries take over 6% more space. Noexcept is always the smallest, but this is expected as it has no error recovery mechanism at all.

Another thing to note is that GCC seems to have had a big C++ code size increase between versions 4.9.2 and 6.2.0. Clang seems to produce a lot smaller C binaries than GCC. For C++ the difference is there but the gap is not as big.

What is causing this?

I don't know for sure. It would be interesting to hear from Clang/GCC developers what is happening here and what actually goes on under the covers. But what it (probably) boils down to is that the compiler has more freedom to optimize the steps needed for stack unwinding. When the user does deallocation manually, the compiler needs to obey the code paths they have written, even if they are not optimal. Autogenerated stack unwinding just needs to call a few destructors in any way it wants. Optimizing this is tedious work, the kind best left to a machine.

Conclusions

This is only one aspect of the (seemingly) eternal debate between error codes and exceptions. It does not prove (I say again, does not) that one is faster than the other, or even that one always produces more efficient or smaller code than the other. However what we can say is that there are cases where converting from error codes into full exceptions makes the final binary smaller. It would be interesting to know if this is true for other code bases as well. That requires a lot more testing, though.

The Github repo contains more stuff for those interested in going deeper. For example it contains a version of the C code that is a bit more size optimized (though a bit less readable) that only has error codes rather than full error structs. It is still bigger than the straightforward C++ implementation.

Monday, December 19, 2016

What is the global cost of Autoconf?

Waiting for autoconf to finish is annoying. But have you ever considered how much time and effort is wasted every day doing that? Neither have I, but let's estimate it anyway.

First we need an estimate of how many projects are using Autotools. 100 is too low and 10 000 is probably too high, so let's say 1000. We also need an assumption of how long an average autoconf run takes. Variability here is massive, ranging from 30 seconds to (tens of) minutes for big projects on small hardware. Let's say an average of one minute as a round number.

Next we need to know how many times builds are run during a day. Some projects are very active with dozens of developers, whereas most have only one person doing occasional development. Let's say each project has on average 2 developers and each one does an average of 2 builds per day.

This gives us an estimate of how much time is spent globally just waiting for autoconf to finish:

1000 proj * 2 dev/proj * 2 build/dev * 1 minute/build =
   4000 minutes = 66 hours

66 hours. Every day. If you are a business person, feel free to do the math on how much that costs.

But it gets worse. Many organisations and companies build a lot of projects each day. As an example there are hundreds of companies that have their own (embedded) Linux distribution that they do a daily build on. Linux distros do rebuilds constantly and so on. If we assume 10 000 organisations that do a daily build and we do a lowball estimate of 5 dependencies per project (many projects are not full distros, but instead build a custom runtime package or something similar), we get different numbers:

10 000 organisations * 1 build/organisation * 5 dependencies/build * 1 min/dependency = 50 000 minutes = 833 CPU hours

That is, every single day over 800 CPU hours are burned just to run autoconf. Estimating CPU power consumption is difficult, but let's use an average draw of 20 watts as a round number. This amounts to roughly 16 kWh every day or, assuming 300 working days a year, 5 MWh a year.
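
For the record, the whole back-of-envelope calculation in runnable form (every constant is a guess, as explained above):

dev_minutes = 1000 * 2 * 2 * 1     # projects * devs * builds/day * min/build
org_minutes = 10_000 * 1 * 5 * 1   # orgs * builds/day * deps * min/dep

print(dev_minutes / 60)            # ~66 hours of developer waiting per day
cpu_hours = org_minutes / 60       # ~833 CPU hours per day
kwh_per_day = cpu_hours * 20 / 1000          # at an average draw of 20 W
print(cpu_hours, kwh_per_day)                # ~833 h, ~16.7 kWh
print(kwh_per_day * 300 / 1000)              # ~5 MWh per year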

That is a lot of energy to waste just to be absolutely sure that your C compiler ships with stdlib.h.

Saturday, December 17, 2016

On reversing btrees in job interviews

Hiring new people is hard. No-one has managed to find a perfect way to separate poorly performing job applicants from the good ones. The method used in most big companies nowadays is the coding interview. In it the applicant is given some task, such as reversing a btree, and they are asked to write the code to do that on a whiteboard without a computer.

At regular intervals this brings up Internet snarkiness.



The points these comments make are basically sound, but they miss the bigger point, which is this:

When you are asked to reverse a btree on a whiteboard, you are not being evaluated based on whether you can reverse a btree on a whiteboard.

Before we look into what you are evaluated on, some full disclosure may be in order. I have interviewed for Google a few times but each time they have chosen not to hire me. I have not interviewed people with this method nor have I received any training in doing so. Thus I don't know what aspects of interview performance are actually being evaluated, and presumably different companies have different requirements. These are just estimates which may be completely wrong.

Things tested before writing code

Getting an onsite interview is a challenge in itself. You have to pass multiple phone screens and programming tests just to get there. This testing methodology is not foolproof. Every now and then (rarely, but still) a person manages to slip through the tests even though they are unqualified for the job. Starting simple is an easy way to make sure that the applicant really is up to the challenge.

The second big test is one of attitude. While we all like to think that programming is about having fun solving hard problems with modern functional languages, creating world-altering projects with ponies, rainbows and kittens, reality is more mundane. Over 80% of all costs in a software project are incurred during maintenance. This is usually unglamorous and monotonous work. Even new projects have a ton of “grunt work” that just needs to be done. The attitude one displays when asked to do this sort of menial task can be very revealing. Some people get irritated because such a task is “beneath their intelligence”. These people are usually poor team players, who you definitely do not want to hire.

The most important aspect of team work is communication. Yes, even more important than coding. Reversing a btree is a simple problem, but the real work is in explaining what is being done and why. It is weird that even though communication is so important, very little effort is spent on teaching it (in many schools it is not taught at all). Being able to explain how to implement this kind of simple task is important, because a large fraction of your work consists of describing your solutions and ideas to other people. It does not matter if you can come up with the greatest ideas in the world if you can not make others “get” them.

Writing the code

Writing the code to reverse a btree should not take more than a few minutes. It is not hard; anyone who has done basic CS courses can get it done in a jiffy. But writing the code is an almost incidental part of the interview. In fact it is just a springboard to the actual problem.
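
For calibration, here is roughly what is being asked for, sketched in Python (interview questions phrased this way usually mean a plain binary tree):

class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def reverse(node):
    # Mirror the tree by swapping children recursively.
    if node is not None:
        node.left, node.right = reverse(node.right), reverse(node.left)
    return node

Note that this recursive version is exactly the kind of answer the first follow-up question below is designed to poke at.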

Things tested after the code is written

Once the basic solution is done, you will get asked to expand it. For a btree, the questions might include:

  • If you used stack based iteration, what if the btree is deeper than there is stack space?
  • What if you need to be able to abort the reversal and restore the original tree given an external signal during reversal?
  • What if the btree is so large it won't fit in RAM? Or on one machine?
  • What if you need to reverse the tree while other threads are reading it?
  • What if you need to reverse the tree while other threads are mutating it?
  • And so on.

As we can see, even the most seemingly simple problem can be made arbitrarily hard. The point of the original question is not to test whether you can reverse the tree, but to find how far you can go from there and where the limit of your knowledge lies. If you come out of an interview where you were asked to reverse a tree and the only thing you did was reverse the tree, you have most likely failed.

When the edge of knowledge has been reached comes the next evaluation point: what do you do when you don't know what to do. This is important because a large fraction of programming is solving problems you have not solved before. It can also be very frustrating because it goes against the natural problem solving instinct most people have. Saying “honestly at this point I would look this up on Google” gives you a reply of “Suppose you do that and find nothing, what do you do?”.
Different people react in different ways in this situation. These include:

  • freezing up completely
  • whining
  • trying random incoherent things
  • trying focused changes
  • trying a solution that won't solve the problem fully but is better than the current solution (and explicitly stating this before starting work)

This test is meant to reveal how the person works under pressure. This is not a hard requirement for many jobs, but even then, those who can keep a cool head under stressful circumstances have a distinct advantage over those who can't.

Points of criticism

There are really only two reliable ways of assessing a candidate. The first one is word of mouth from someone who has worked with the person before. The second one is taking the candidate in and having them work with the team for a while. Neither of these approaches is scalable.

The coding interview aims to be a simplification of this process, where the person is tested on things close to “real world” problems. But it is not the real world, only a model. And like all models, it is a simplification that leaves some things out and emphasizes other things more than it should.

The above-mentioned lack of Internet connectivity is one example. It can be very frustrating to be denied the most obvious and simplest approach (which is usually the correct approach) to a problem for no seemingly good reason. There actually is a reason: to test behavior when information is not available. But it can still seem arbitrary and cold, especially since looking up said information usually yields useful data even if the exact solution is not found.

A different problem has to do with thinking. The interviewers encourage the applicant to think out loud as much as possible. The point of this is to see the thought process, what approaches they are going through and what things they are discarding outright. The reasoning behind this is sound but on the other hand it actively makes thinking about the problem harder for some people.

Every time you start explaining your thought process you have to do a mental context switch and then another one to go back to thinking. Brain research has shown us that different people think in very different ways. Talking while pondering a problem is natural for some people; it is very taxing for others. This gives a natural advantage to people who talk while thinking, but there are not, as far as I know, any research results showing that these sorts of people perform better at a given job.

Mental context switches are not the most problematic thing. For some people, explaining their own thought process can be impossible. I don't know how common this is (or if it's even a thing in general), but at least for me personally, I can't really explain how I solve problems. I think about them, but I can't really conceptualize how the solution is reached. Once a solution presents itself, explaining it is simple, but all details about the path that led there remain in the dark.

People who think in this manner have a bit of an unfortunate handicap, because it is hard to tell the difference between a person thinking without speaking and a person who has just frozen with no idea what to do. Presumably interviewers are taught to notice this and help the interviewee but since all books and guidelines emphasize explaining your thought process, this might not always succeed perfectly.

But does it really matter?

That's the problem with human issues. Without massive research into the issue, we don't know.

Friday, December 2, 2016

What does configure actually run

One of the claimed advantages of Autotools is that it depends (paraphrasing here) "only on make + sh". Some people add sed to the list as well. This is a nice talking point, but is it actually true? Let's find out.

Determining this is simple: we just run the GNU Hello program's configure script under strace, like this:

strace -e execve -f ./configure 2>stderr.txt > stdout.txt

This puts all command invocations of the process and its children into stderr.txt. Then we can massage it slightly with Python, as sketched below, and get the following list of commands.
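
The massaging step can be as simple as this sketch (an approximation; the exact script is not reproduced here):

import os
import re

binaries = set()
with open("stderr.txt") as f:
    for line in f:
        # strace prints lines like: execve("/usr/bin/grep", [...], ...) = 0
        m ='execve\("([^"]+)"', line)
        if m:
            binaries.add(os.path.basename(
for name in sorted(binaries):
    print(name)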

arch
as
basename
bash
cat
cc
cc1
chmod
collect2
conftest
cp
diff
dirname
echo
expr
file
gawk
gcc
getsysinfo
grep
hostinfo
hostname
install
ld
ln
ls
machine
make
mkdir
mktemp
msgfmt
msgmerge
mv
oslevel
print
rm
rmdir
sed
sh
sleep
sort
tr
true
uname
uniq
universe
wc
which
xgettext

This list contains a total of 49 different commands including heavyweights such as diff and gawk.

Thus we find that the answer to the question is no. Configure requires a lot more than just shell plus make. In fact it implicitly requires a big chunk of the Unix userland.

Pedantic postscriptum

It could be said that not all of these programs are strictly required and that configure could (potentially) work without them present. This is probably correct, but many of these programs provide essential functionality that is not provided by either plain shell or Make.

Tuesday, November 8, 2016

Building code faster and why recursive Make is so slow

One of the most common reactions to Meson we have gotten is that build time is dominated by the compiler, so trying to make the build system faster is pointless. After all, Make just launches compiler processes and waits for them to finish, so there is nothing to be gained.

This is a reasonable assumption, but also false. Even if we ignore the most obvious slow parts, ranging from configure (which can take up to 15 minutes to run on an embedded device) to Libtool (which slows everything down for no good reason), there are still gains to be had.

Meson uses Ninja as its main backend. One of the (many) cool things about Ninja is that it is being actively developed and maintained. A recent addition is the Ninja tracer framework, which takes Ninja's build log and converts it to the timing format understood by Chrome's developer tools. It creates output that looks like this (thanks to Nirbheek for gathering the data).


This is a trace of building all of GStreamer using their new Meson based aggregate repo called gst-build. This build has been done using six parallel processes, making up the six horizontal lanes in the diagram. Each colored square represents one build task, such as compiling or linking, with time increasing from left to right. This picture shows us both why Meson is so efficient and why Make based builds are slow (though the latter requires some analysis).

The main reason for speed is that Ninja keeps all processor cores working all the time. It can do that because it has a global view of the problem. As an example if we build two targets A and B and B links to A, Ninja can start compiling the source code of B before it has finished linking A. This allows it to keep all cores pegged.

The perceptive reader might have noticed the blank space at the left end of the diagram. During that time only one core is doing anything, all others are idle. The process that is running is the GIR tool processing the gstreamer-1.0 library, which all other subprojects depend on. In theory we should be able to run compiles on other projects but for reliability reasons Meson assumes that some source might include the output of the tool as a header so it does not start compilations until the file is generated. This may seem weird, but there are projects that do these kinds of things and require that they work.

This kind of gap is also what makes Make so slow. Most projects write recursive makefiles because maintaining a non-recursive makefile for even a moderately sized project is a nightmare. When building recursively, Make goes into each subdirectory in turn, builds it to completion and only then moves to the next one. The last step in almost every case is linking, which can only be done as a single process. If we assume that linking takes one second, then a project with 20 subdirectories accumulates 20 seconds of wall time during which only one core is doing work. For an n core machine that corresponds to (n-1)*(number of directories) seconds of wasted CPU time. Those things add up pretty quickly. In addition to Make, the same issue crops up in Visual Studio project files and pretty much any build system that does not have a global view of the dependency tree.

Those interested in going through the original data can do so. Just open this link with either Chromium or Chrome and the chart will be displayed in devtools. Unfortunately this won't work with Firefox.

Bonus challenge

The output format that Chrome's dev tools understands is straightforward. It would be interesting if someone modified Make to output the same format and ran it on a moderately sized project. GStreamer is an obvious candidate, as it has both Meson and Autotools setups in master (though gst-build only works with Meson).
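
To give an idea of how simple the format is, here is a sketch of the Ninja side of that conversion (assuming the v5 .ninja_log layout of tab-separated start/end times in milliseconds, mtime, output path and command hash; the real tracer tool may differ). A Make version would mainly need to produce the same start/end timestamps.

import json

def ninja_log_to_trace(log_path=".ninja_log"):
    jobs = []
    with open(log_path) as f:
        for line in f:
            if line.startswith("#"):      # skip the "# ninja log v5" header
                continue
            start, end, _, output, _ = line.rstrip("\n").split("\t")
            jobs.append((int(start), int(end), output))
    jobs.sort()
    events, lanes = [], []                # lanes[i] = end time of lane i's last job
    for start, end, output in jobs:
        # Greedy lane assignment: overlapping jobs go to different "threads"
        # so devtools draws them as parallel horizontal lanes.
        for tid, lane_end in enumerate(lanes):
            if lane_end <= start:
                lanes[tid] = end
                break
        else:
            tid = len(lanes)
            lanes.append(end)
        events.append({"name": output, "ph": "X", "pid": 0, "tid": tid,
                       "ts": start * 1000, "dur": (end - start) * 1000})
    return json.dumps(events)

print(ninja_log_to_trace())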