Sunday, August 28, 2016

Weird USB disk slowdown with multiple readers

I have a bunch of files and hard drives scattered about (as I'm sure many of you do, too). I wanted to transfer all of them onto one drive, so to start off I wrote a Python program that goes through all files on a drive, calculates their SHA-256 checksums and writes the results to a file. Here is the core of the code that does the evaluation in parallel.

import os
from concurrent.futures import ProcessPoolExecutor

def scan(scanroot, ofilename):
    ofile = open(ofilename, 'w')
    with ProcessPoolExecutor() as pe:
        futures = []
        for root, dirs, files in os.walk(scanroot):
            for f in files:
                fname = os.path.join(root, f)
                # scan_file (defined elsewhere) computes the file's SHA-256
                # and returns one line of output for the result file.
                futures.append(pe.submit(scan_file, fname))
        for f in futures:
            try:
                ofile.write(f.result())
            except Exception as e:
                print('ERROR:', str(e))

    ofile.close()


Very simple and it works fine. But when I did this on a USB 3 disk on Linux (Ubuntu Wily) something weird happened. If you do the evaluation with just one process, the disk transfers data at a rate of 70 MB/s, which is only a fraction slower than an internal hard disk. When running 8 simultaneous jobs, the total transfer rate drops to 4 MB/s, which is almost 20 times slower.

I have no idea what could be causing this, but it seems to be specific to USB; internal hard drives handle multiple readers effortlessly.

Wednesday, July 20, 2016

Comparing GCC C cleanup attribute with C++ RAII

I recently got into a Twitter discussion about the GCC cleanup extension in C vs plain old C++ (and Rust, which does roughly the same). There were claims that the former is roughly the same as the latter. Let's examine this with a simple example. Let's start with the following C code.

Res* func() {
  Res *r = get_resource(); /* always succeeds for simplicity. */
  if(do_something(r)) {
    return r;
  } else {
    deallocate(r);
    return NULL;
  }
}

This is straightforward: get a resource, do something with it and, depending on the outcome, either return the resource or deallocate it and return NULL. This requires the developer to track the life cycle of r manually, which is prone to human errors such as forgetting the deallocate call, especially when more code is added to the function.

This code is trivial, but still representative of real world code. Usually you would have many more things and resources going on in a function like this, but for simplicity we omit all of those. The question now becomes: how would one make this function truly reliable using the GCC cleanup extension? By reliability we mean that the compiler must handle all cases and the developer must not need to write any manual management code.

The obvious approach is this:

Res* func() {
  Res *r __attribute__((cleanup(cleanfunc)));
  r = get_resource();
  if(do_something(r)) {
    return r;
  } else {
    return NULL;
  }
}

This does not work, though. The cleanup function is always called before function exit, so if do_something returns true, the function returns a pointer to a freed resource. Oops. To make it work you would need to do this:

Res* func() {
  Res *r __attribute__((cleanup(cleanfunc)));
  r = get_resource();
  if(do_something(r)) {
    Res *r2 = r;
    r = NULL;
    return r2;
  } else {
    return NULL;
  }
}

This is pointless manual work which is easy to get wrong, thus violating reliability requirements.

The only other option is to move the cleanup declaration inside the else block. That does not work either, because it just replaces a direct call to deallocate with a manually written setup that calls the deallocator upon scope exit (which happens immediately).

This is unavoidable with the cleanup attribute. It is always called. It cannot automatically handle the case where the life cycle of a resource is conditional. Even if it were possible to tell GCC not to call the cleanup function, that call would have to be written by hand. Conditional life cycles always require manual work, thus making the approach unreliable (as in: a human needs to analyze and write code to fix the issue).
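For reference, the cleanup handler used in the snippets above could look roughly like the following sketch (this is my own illustration, not code from any real project). GCC invokes the handler with a pointer to the annotated variable, which is also why setting r to NULL before returning is what skips the deallocation.

/* Sketch of a possible cleanfunc, for illustration only.
 * GCC calls it with a pointer to the annotated variable (Res**),
 * so the NULL check is what makes the pointer hand-off trick work. */
void cleanfunc(Res **rp) {
  if(*rp != NULL) {
    deallocate(*rp);
  }
}

Note that even with such a handler, the r2/NULL dance above still has to be written by hand.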

In C++ this would look (roughly) like the following.

std::unique_ptr<Res> func() {
  std::unique_ptr<Res> r(new Res());
  if(do_something(r.get())) {
    return r;
  } else {
    return nullptr; // or throw an exception if you prefer
  }
}

In this case the resource is always handled properly. It is not deallocated in the true branch but is deallocated in the false branch. No manual life cycle code is needed. Adding new code to this function is simple because the developer does not need to care about the life cycle of r; the compiler enforces proper usage and does it more reliably than any human being.

Conclusions

GCC's cleanup attribute is a great tool for improving the reliability of C. It should be used whenever possible (note that it is not supported by MSVC, so code that needs to be portable can't use it). However, the cleanup attribute is not as easy to use or as reliable as C++'s RAII primitives.

Monday, June 27, 2016

Running COBOL on Arduino using the Meson Build System

IDENTIFICATION DIVISION.
PROGRAM-ID. COBOL-ARDUINO.

DATA DIVISION.

In my previous post we found out how to compile projects for the Arduino, which is a bare metal 8 bit microcontroller. One of the advantages of using a full fledged build system is that you can use all the tools in the world as opposed to the basic ones that come with the Arduino IDE. Like, for example, the GNU COBOL compiler.



PROCEDURE DIVISION.

The implementation is actually quite simple. GNU COBOL does not compile to machine code. Instead it transpiles to plain C and does most of the COBOLy things with a runtime library. The design is pretty clever and there is very little setup and configuration. Best of all, it is easy (well, relatively easy at least) to write code that does not cause the COBOL runtime to call into malloc.

The rest of the implementation work is straightforward. We transpile the COBOL code into a C file and then try to compile that. First it fails due to a missing libcob.h, so we just create one that is empty. After that it fails with missing struct and function definitions. Struct definitions can be stolen from the real libcob.h. The functions are a bit more work, but writing skeleton implementations that do just the bare minimum to get the program running did not take particularly long.
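To give a flavour of what those skeletons look like, here is a rough sketch. The struct layout and function name below are placeholders made up for illustration; the real names are whatever the generated C file happens to reference.

/* Illustrative sketch only: these names are placeholders, not the real
 * libcob API. Each struct the generated C needs is copied over from the
 * real libcob.h, and each missing runtime function gets a skeleton body
 * that does the bare minimum and never calls malloc. */
struct fake_cob_field {
  unsigned int size;
  unsigned char *data;
};

void fake_cob_display(const struct fake_cob_field *f) {
  /* The bare minimum: on a real board this might push f->data out over
   * the serial port, or simply do nothing. */
  (void)f;
}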

Then it is just a question of hooking the COBOL compilation step into the build system and we are done. The code is available in this Github repo.

EXIT PROGRAM.

Friday, June 24, 2016

Meson build system now has Arduino support

After some midsummer hacking, Meson now has support for building and uploading Arduino projects (caveat: this MR needs to land in master first). This is what it looks like.



A sample project to clone has been set up in Github. It currently only supports Linux and Arduino Uno but the repo has documentation on how to adapt it to your configuration.

Monday, May 16, 2016

Performance testing a new parallel zip file implementation

Previously we looked into rewriting a zip file unpacker from scratch rather than modernising an existing one, and measured its performance. Now we have implemented fully parallel zip file compression, so it is time to measure its performance against existing applications. The new implementation is called Parzip and can be obtained from its Github repo.

Test setup

Parzip uses only LZMA compression (though it can uncompress deflated archives). We tested it against the following programs:

  • Info-Zip, the default zip/unzip implementation on most Unix platforms
  • 7-Zip, the first archiver to use LZMA
  • Tar + gzip, the default way of compressing files on Unix
  • Tar + xz, the same as above but using LZMA compression instead of deflate

The test corpus was a checkout of the Clang compiler, including the full source code, SVN directories and a build tree. This amounts to roughly 27 gigabytes of data. The test machine was an i7 machine with 8 cores and a spinning disk, running Debian stable. Parzip was compiled from source and used liblzma from the Debian repository; all other binaries were the default versions provided by Debian. All compression settings were left at their default values.

Results

Let's start with the most important measurement: how long did each app take to compress the directory tree.
Info-Zip and gzip both use the deflate algorithm, which is a lot faster than LZMA. Here we can see that they are the fastest even though they only use one thread for compression. Parzip is the fastest LZMA compressor by quite a wide margin. Xz is the slowest because it uses only one thread. 7-zip has some threading support, which makes it faster than xz, but it still loses heavily to Parzip.

Decompression times show a similar trend.
Decompression is a lot faster than compression (this graph is in seconds as opposed to minutes). The interesting point here is that Parzip is much faster than Info-Zip and gzip even though it uses LZMA. Being able to use all cores really helps here.

Finally let's look at sizes of the compressed archives.
Nothing particularly surprising here. The LZMA-based files are roughly the same size, with xz being the smallest, 7-zip following close behind and Parzip holding third place. Programs using deflate achieve a noticeably worse compression ratio. Info-Zip is the clear loser here, maybe due to it using its own deflate implementation rather than zlib.

Big iron test

We also tested the performance of xz and Parzip on a 48 core heavy duty server.
XZ takes almost three hours to compress the data set while Parzip can achieve the same in just over 20 minutes. This is almost 90% faster. The theoretical best possible speedup would be roughly 97%, but due to overhead and other issues we can't reach those numbers.
Decompression times are similar, with Parzip being 93% faster. This is due to decompression parallelising better than compression: compression needs to write its results into a single file, whereas decompression can write its outputs to the file system independently of each other.

Conclusions

Using modern tools and technologies it is possible to create a parallel file compressor that achieves almost the same compression ratio as the current leading technology (tar + xz) while providing almost linear scaling with the number of processor cores available.

Caveat

If you plan on doing your own measurements (which you should, it's educational), be careful when analysing the results. If you have a slow hard disk and a lot of decompression threads, they may saturate the disk's write capacity.

Wednesday, May 4, 2016

Jzip parallel unzipper performance results

In my previous blog post I wrote about reimplementing Info-Zip in modern C++ so that we can unpack files in parallel. This sparked a lively Reddit discussion. One of the main points raised was that once this implementation gets support for all other stuff Info-Zip does, then it will be just as large.

Testing this is simple: just implement all the missing features. Apart from encryption support (which you should not use anyway; use gpg instead), the work is now finished and available in the jzip Github repo. Here are the results.

Lines of code (as counted by wc)


Info-Zip: 82 057 lines of C
jzip: 1091 lines of C++

Stripped binary size


Info-Zip: 159 kB
jzip: 51 kB

Performance

Performance was tested by zipping a full Clang source + build tree into one zip file. This includes the source, SVN directories and all build artifacts. The total size was 9.3 gigabytes. Extraction times were as follows.

Info-Zip: 5m 38s
jzip: 2m 47s

Jzip is roughly twice as fast. This is a somewhat underwhelming result given that the test machine has 8 cores. Further examination showed that the reason for this was that jzip saturates the hard drive's write capacity.

Conclusions

Using a few evenings' worth of spare time it is possible to reimplement an established (but relatively straightforward) product with two orders of magnitude less code and massively better performance.

Update: more measurements

Using a 48 core machine with fast disks.

Info-zip: 3m 32s
jzip: 12s

On this machine jzip is 95% faster.

On the other hand when running on a machine with a slow disk, jzip may be up to 30% slower because of write contention overhead.

Sunday, April 24, 2016

Rewriting from scratch, should you do it?

One thing that has bothered me for a while is that the unzip command line tool only decompresses one file at a time. As a weekend project I wanted to see if I could make it work in parallel. One could write this functionality from scratch, but this gave me an opportunity to really look into incremental development.

All developers love writing stuff from scratch rather than fixing existing solutions (yes, guilty as charged). However, the accepted wisdom is that you should never do a from-scratch rewrite but instead improve what you have via incremental changes. Thus I downloaded the sources of Info-Zip and got to work.

Info-Zip's code base is quite peculiar. It predates things such as ANSI C and has support for tons of crazy long dead hardware. MS-DOS ranks among the most recently added platforms. There is a lot of code for 16 bit processors, near and far pointers and all that fun stuff your grandad used to complain about. There are even completely bizarre things such as this snippet:

#ifndef const
#  define const const
#endif

The code base contains roughly 80 000 lines of K&R C. This should prove an interesting challenge. Those wanting to play along can get the code from Github.

Compiling the code turned out to be quite simple. There is no configure script or the like; everything is #ifdeffed inside the source files. You just compile them into an app and then you have a working exe. The downside is that the source has more preprocessor code than actual code (only a slight exaggeration).

Originally the code used a single (huge) global struct that housed everything. At some point the developers needed to make the code reentrant. Usually this means changing every function to take the global state struct as a function argument instead. These people chose not to do this. Instead they created a C preprocessor macro system that can pass the struct as an argument but can also compile the code so that it keeps the old style global struct. I have no idea why they did that. The only explanation that makes any sort of sense is that pushing the pointer onto the stack on every function call was too expensive on 16 bit and smaller platforms. This is just speculation, though; if anyone knows for sure, please let me know.
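Roughly speaking the scheme works like the sketch below. The macro and struct names here are illustrative rather than the exact ones in the Info-Zip source, but the shape is the same: one set of macros either threads a struct pointer through every function or makes it vanish and falls back to a global.

/* Illustrative reconstruction of the idea, not the actual Info-Zip macros. */
#ifdef REENTRANT
#  define __G__   struct Globals *pG,   /* extra leading parameter            */
#  define G       (*pG)                 /* every state access goes through pG */
#else
#  define __G__                         /* the parameter disappears...        */
#  define G       global_state          /* ...and G is one big global struct  */
#endif

int extract_file(__G__ const char *name);  /* expands to either form */

Call sites need a matching macro to pass pG along as well, so the layer ends up touching essentially every function in the code base.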

This meant that every single function definition was a weird concoction of preprocessor macros and K&R syntax. For details see this commit that eventually killed it.

Getting rid of all the cruft was not particularly difficult, only tedious. The original developers were very pedantic about flagging their #if/#endif pairs, so killing dead code was straightforward. The downside was that what remained after that was awful. The code had more asterisks than letters. A typical function was hundreds of lines long. Some preprocessor symbols were defined in opposite ways in different header files, but things worked because other preprocessor clauses kept all but one definition from being evaluated (the code confused Eclipse's syntax highlighter, so it was really hard to see what was actually happening).

Ten or so hours of solid work later, most of the dead cruft was deleted and the code base had shrunk to 30 000 lines of code. At this point adding threading was starting to look feasible. After going through the code that iterates the zip index and extracts files, it became a lot less feasible. As an example, the inflate function was not isolated from the rest of the code: all its arguments were passed in The One Big Struct, which it fiddled with constantly. Those would need to be fully separated to make anything work.

That nagging sound in your ear

While fixing the code I kept hearing the call of the rewrite siren. Just rewrite from scratch, it would say. It's a lot less work. Go on! Just try it! You know you want to!

Eventually the lure got too strong, so I opened the Wikipedia page on the Zip file format. Three hours and 373 lines of C++ later I had a parallel unzipper written from scratch. Granted, it does not do advanced stuff like encryption, ZIP64 or creating subdirectories for the files it writes. But it works! The code is available in this repo.
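Part of why so little code is needed is that a zip archive is indexed from its end: you scan backwards for the end-of-central-directory record and every other offset follows from there. A rough sketch of that first step (plain C for illustration, not the actual jzip code):

/* Sketch only: find the offset of the end-of-central-directory record
 * (signature "PK\5\6") by scanning backwards from the end of an archive
 * that has been read or mapped into memory. */
long find_eocd(const unsigned char *data, long size) {
  long i;
  for(i = size - 22; i >= 0; i--) {   /* the EOCD record is at least 22 bytes */
    if(data[i] == 0x50 && data[i+1] == 0x4b &&
       data[i+2] == 0x05 && data[i+3] == 0x06) {
      return i;
    }
  }
  return -1;
}

The EOCD record stores the size and offset of the central directory, and each central directory entry points at the local header of one member, which is also what makes extracting entries in parallel so natural.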

Even better, adding multithreading took one commit with 22 additions and 7 deletions. The build definition is 10 lines of Meson instead of 1000+ lines of incomprehensible Make.

There really is no reason, business or otherwise, to modernise the codebase of Info-Zip. With contemporary tools, libraries and methodologies you can create code that is an order of magnitude simpler, clearer, more maintainable and just all around more pleasant to work with than existing code. In a fraction of the time.

Sometimes rewriting from scratch is the correct thing to do.

This is the exact opposite of what I set out to prove but that's research for you.

Update

The program in question has now been expanded to do full Zip packing + unpacking. See here for benchmarks.