Sunday, January 22, 2017

A Python extension module using C, C++, FORTRAN and Rust

One of the advantages of using a a general build system is that combining code written in different languages is easy. To demonstrate this I wrote a simple Python module called Polysnake. It compiles and links four different compiled languages into one shared extension module. The languages are C, C++, FORTRAN and Rust.

The build system setup consists of five declarations in Meson. It should be doable in four, but Rust is a bit special. Ignoring the boilerplate the core business logic looks like this:

rustlib = static_library('func', 'func.rs')

py3_mod.extension_module('polysnake',
  'polysnake.c',
  'func.cpp',
  'ffunc.f90',
  link_with : rustlib,
  dependencies : py3_dep)

The code is available on Github.

Compiling and running.

Compiling is done using the default Meson commands:

meson build
ninja -C build

Once built the main script can be run with this command:

PYTHONPATH=build ./main.py

The script just calls into the module and prints the string it returns. The output looks like this:

Combining many languages is simple.

This line is created in C.
This line is created in FORTRAN.
This line is created in C++.
This line is created in Rust.

Why not COBOL?

Saturday, January 14, 2017

Using a smart card for decryption from the command line

In Finland the national ID card contains a fully featured smart card, much like in Belgium and several other countries. It can be used for all sorts of cool things, like SSH logins. Unfortunately using this card for HW crypto operations is not easy and there is not a lot of easily digestible documentation available. Here's how you would do some simple operations from the command line.

The commands here should work on most other smart cards with very little changes. Note that these are just examples, they are not hardened for security at all. In production you'd use the libraries directly, keep all data in memory rather than putting it in temp files and so on.

First you plug in the device and card and check the status:

$ pkcs15-tool -k

Private RSA Key [todentamis- ja salausavain]
Object Flags   : [0x1], private
Usage          : [0x26], decrypt, sign, unwrap
Access Flags   : [0x1D], sensitive, alwaysSensitive, neverExtract, local
Access Rules   : execute:01;
ModLength      : 2048
Key ref        : 0 (0x0)
Native         : yes
Path           : XXXXXXXXXXXX
Auth ID        : 01
ID             : 45
MD:guid        : XXXXXXXXXXXX

Private RSA Key [allekirjoitusavain]
Object Flags   : [0x1], private
Usage          : [0x200], nonRepudiation
Access Flags   : [0x1D], sensitive, alwaysSensitive, neverExtract, local
Access Rules   : execute:02;
ModLength      : 2048
Key ref        : 0 (0x0)
Native         : yes
Path           : XXXXXXXXXXXX
Auth ID        : 02
ID             : 46
MD:guid        : XXXXXXXXXXXX

This card has two keys. We use the first one whose usage is "decrypt". Its ID number is 45. Now we need to extract the public key from the card:

$ pkcs15-tool --read-public-key 45 > mykey.pub

Next we generate some data and encrypt it with the public key. The important thing to note here is that you can only encrypt a small amount of data, on the order of a few dozen bytes.

$ echo secret message > original

Then we encrypt this message with OpenSSL using the extracted public key.

$ openssl rsautl -encrypt -inkey mykey.pub -pubin -in original -out encrypted

The file encrypted now contains the encrypted message. The only way to decrypt it is to transfer the data to the smart card for decryption, because the private key can only be accessed inside the card. This is achieved with the following command.

$ pkcs11-tool --decrypt -v --input-file encrypted --output-file decrypted --id 45 -l -m RSA-PKCS --pin 1234

After this the decrypted contents have been written to the file decrypted. The important bit here is -m RSA-PKCS, which specifies the exact form of RSA to use. There are several and if you use the wrong one the output will be random bytes without any errors or warnings.

What can this be used for?

Passwordless full disk encryption is one potential use. Combining a card reader with a Raspberry Pi and an electrical lock makes for a pretty spiffy DIY physical access control system.

Sunday, January 8, 2017

Measuring execution performance of C++ exceptions vs error codes

In an earlier blog post we found out that C++ exceptions produce smaller executables than corresponding code using error codes. The most common comment to it was that executable size is not that important, performance is what matters. Since we have the tooling, let's do perf measurements as well.

Test setup

We generate code that calls a few other functions that call other functions and so on until we reach the bottommost function. In that function we either raise an exception or error or return success. This gives us two parameters to vary: the depth of the call hierarchy and what percentage of calls produce an error.

The code itself is simple. The C++ version looks like this:

int func0() {
    int num = random() % 5;
    if(num == 0) {
        return func1();
    }
    if(num == 1) {
        return func2();
    }
    if(num == 2) {
        return func3();
    }
    if(num == 3) {
        return func4();
    }
    return func5();
}

We generate a random number and based on that call a function deeper in the hierarchy. This means that every time we call the test it goes from the top function to the final function in a random way. We use the C random number generator and seed it with a known value so both C and C++ have the same call chain path, even though they are different binaries. Note that this function calls on average a function whose number is its own number plus three. So if we produce 100 functions, the call stack is on average 100/3 = 33 entries deep when the error is generated.

The plain C version that uses error objects is almost identical:

int func0(struct Error **error) {
    int num = random() % 5;
    int res;
    if(num == 0) {
        res = func1(error);
    } else if(num == 1) {
        res = func2(error);
    } else if(num == 2) {
        res = func3(error);
    } else if(num == 3) {
        res = func4(error);
    } else {
        res = func5(error);
    }
    /* This is a no-op in this specific code but is here to simulate
     * real code that would check the error and based on that
     * do something. We must have a branch on the error condition,
     * because that is what real world code would have as well.
     */
    if(*error) {
        return -1;
    }

    return res;
}

The code for generating this code and running the measurements can be found in this Github repo.

Results

We do not care so much about how long the executables take to run, only which one of them runs faster. Thus we generate output results that look like this (Ubuntu 16/10, GCC 6.2, Intel i7-3770):

CCeCCCCCCC
CCCCCCccec
eEcCCCeECE
cceEEeECCe
EEEcEEEeCE
EEEEEeccEE
EEEEEEEEEE
EEEEEEEEEE
EEEEEEEEEE
EEEEEEEEEE

In this diagram C means that plain C error codes are faster and E that exceptions are faster for that particular pair of parameters. For easier reading all cases where error codes win have been given a red background color. Each row represents a different number of functions. The top row has 50 functions (call depth of 16) and each row adds 50 so the bottommost row has 500 functions.

The columns represent different error rates. The leftmost column has 0% errors, the next one has 1% and so on. A capital letter means that the runtime difference was bigger than 10%. Each executable was executed five times and the fastest run was taken in each case. Full results look like the following:

GCC 6.2      Clang 3.8.1  Clang 4.0 trunk

CCeCCCCCCC   EecCcCcCcC   ECcECcEeCC
CCCCCCccec   CceEEeEcEC   EEeEECEcCC
eEcCCCeECE   EEEEEECEeE   EEEECECeCC
cceEEeECCe   EcEeEECEEE   EEEEceeEeC
EEEcEEEeCE   EEEeEEEEEe   EEEEEEEEEE
EEEEEeccEE   EEEEEEEEEE   EEEEEEEEEE
EEEEEEEEEE   EEEEEEEEEE   EEEEEEEEEE
EEEEEEEEEE   EEEEEEEEEE   EEEEEEEEEE
EEEEEEEEEE   EEEEEEEEEE   EEEEEEEEEE
EEEEEEEEEE   EEEEEEEEEE   EEEEEEEEEE


Immediately we see that once the stack depth grows above a certain size (here 200/3 = 66), exceptions are always faster. This is not very interesting, because call stacks are usually not this deep (enterprise Java notwithstanding). For lower depths there is a lot of noise, especially for GCC. However we can see that for small error rates exceptions are sometimes faster, especially for Clang. Because of the noise we redid the measurements several times. This changed the individual measurements but produced the same overall shape where small errors seem to be faster with exceptions but as the error rate grows error codes become faster.

On a Raspberry Pi the results look like this:

GCC 4.9.2

eeccCCCCCC
EeeeccccCC
Eeeeeecccc
Eeeeeeeccc
EEeeeeeeec
EEEEEeeeee
EEEEEEEEee
EEEEEEEEEE
EEEEEEEEEE
EEEEEEEEEE

Here the results are a lot clearer and consistent, probably due to the simpler architecture of the ARM processor. Exceptions are faster except for small call depths and large error rates. When using 50 functions error objects only become noticeable faster at 4% error rate.

Finally let's do the same measurements on an i5-6600K.

GCC 4.8.4    gcc 5.4.0   Clang 3.{4, 6, 8}

ECCCCCCCCCC  eeccCCCCCC  EeeccCCCCC
EecCCCCCCCC  EEEeeeeccc  EEEEEeeeee
EEecCCCCCCC  EEEEEEeeee  EEEEEEEEEE
EEEeecCCCCC  EEEEEEEEEe  EEEEEEEEEE
EEEEeecccCC  EEEEEEEEEE  EEEEEEEEEE
EEEEEeeeccc  EEEEEEEEEE  EEEEEEEEEE
EEEEEEeeeec  EEEEEEEEEE  EEEEEEEEEE
EEEEEEEEeee  EEEEEEEEEE  EEEEEEEEEE
EEEEEEEEEee  EEEEEEEEEE  EEEEEEEEEE
EEEEEEEEEEe  EEEEEEEEEE  EEEEEEEEEE

Here we see that the compiler used can make a massive difference. GCC 4.8.4 favors error codes but 5.4.0 is the opposite as are all versions of Clang (which produce identical results). On this platform exceptions seem to be even more performant than on the Pi.

The top left corner seems to be the most interesting part of this entire diagram, as it represents shallow call chains with few errors. This is similar to most real world code used in, for example, XML or JSON parsing. Let's do some more measurements focusing on this area.

i5               Raspberry Pi

GCC    Clang     GCC    Clang
5.4.0  3.8       4.9.2  3.5.0

cCCCC  eCCCC     eCCCC  eCCCC
ecCCC  EeCCC     ecCCC  ecCCC
ecCCC  EeccC     eccCC  eccCC
eccCC  EEecc     eecCC  eecCC
eeccC  EEEee     eeccC  eeccC


In these measurements the parameters go from 10 functions at the top row to 50 in the bottom row in increments of 10 and the colums go from 0% error on the left to 4% on the right in increments of 1. Here we see that exceptions keep their performance advantage when there are no errors, especially when using Clang, but error codes get better as soon as the error rate goes up even a little.

Conclusions

Whenever a discussion on C++ exceptions occurs, there is That Guy who comes in and says "C++ exceptions are slow, don't use them". At this point the discussion usually breaks down on fighting about whether exceptions are fast or not. These discussion are ultimately pointless because asking whether exceptions are "fast" is the wrong question.

The proper question to ask is "under which circumstances are exceptions faster". As we have seen here, the answer is surprisingly complex. It depends on many factors including which platform, compiler, code size and error rate is used. Blanket statements about performance of X as compared to Y are usually simplifying, misleading and just plain wrong. Such is the case here as well.

(Incidentally if someone can explain why i7 has those weird performance fluctuations, please post in the comments.)

Thanks to Juhani Simola for running the i5 measurements.

Wednesday, January 4, 2017

Clarifications to xz compression article

My previous article on beating the compression performance of xz went a bit viral. Comments on various discussion forums seemed to bring up the same, slightly erroneous, points so here are a few clarifications.

Random access to files is a useless feature

For most use cases this is probably true. But that was not the main point of the experiment. The point was to enable parallel compression without making the final size much bigger. Even more important was to enable parallel decompression. Very few file formats support that without losing compression performance. As a classical example PKZIP supports parallel decompression but its archives are a lot bigger than packed tar files.

Tarballs are just fine and need no changes

Many people were of the opinion that tarballs are "good enough for source distribution". This is probably correct. Unfortunately when people hear the word "tar" they immediately think of source tarballs. That is only a small fraction of tar usage in the world. Many file formats have tar files (or equivalent) hidden inside them. Thus all operations on said files are serial and can use only one CPU at a time.

As an example, the deb file format consists of a bunch of metadata and an embedded tar file. RPM is the same, except that it has a cpio archive rather than tar. This means that installing updates is an inherently serial operation. Update packages can only be installed one at a time, and because they are in a tarlike format, only one CPU can be used for processing them. This is particularly slow on platforms such as the Raspberry Pi that have underpowered multicore CPUs.

When running workloads such as a CI builder, there are multiple steps that are serial when using tar-like file formats. As an example pbuilder has the following steps:
  • unpack base image <- serial
  • install updates <- serial
  • install dependencies <- serial
  • unpack source <- serial
  • compile <- parallel
  • compress final result <- serial
If you are building Libreoffice, Firefox or Chromium, the compilation step dominates. For smaller packages the bottleneck may be elsewhere. As an extreme example, the Debian package build of Meson installs circa 1.5 GB of dependency packages but the actual build consists of a Python distutils invocation that takes all of one second (running the test suite does take a while, though).

This is not suitable for production due to X

It was never meant to. It was merely an experiment to see if a better file format were possible.

Pack/unpack operations are bound by disk, not by CPU

This is true on some platforms but not all of them. LZMA compression is almost always CPU bound even on the fastest CPUs and so is decompression if you have a fast SSD. Disks are getting faster all the time, CPUs are not, so CPUs will become even more of a bottleneck in the future.

What we have is good enough, the benefits are not worth an overhaul and even if they were inertia would prevent uptake

Maybe. But experiments are fun to do regardless.

Monday, January 2, 2017

Beating the compression performance of xz, or has the time come to dump tar?

Tar is one of the basic tools of Unix. It has been in use, basically unchanged, since 1979. It does one thing and it does it quite well. Basically it is a stream format for describing files. It consists of consecutive file metadata and content pairs. This is a simple and reliable format but it has a big downside: individual entries can not be accessed without processing the entire file from the beginning. Wikipedia has further details.

Compression makes this even worse. Accessing byte n requires unpacking every byte from the beginning of the file. This is unfortunate in itself but it has an even bigger downside. Compression and decompression are inherently serial operations. They can not be run in parallel. This was not really an issue in the seventies, but nowadays even a Raspberry Pi has four cores. Using only one seems wasteful.

There are some attempts to work around this, such as pigz, but they just chop the full file into constant sized blocks and compress them in parallel. Even in this case parallel decompression is not possible.

Why is tar used then?

In addition to inertia, tar has one major feature on its side: it compresses really well. Modern compressors like lzma (and even zlib) like having a lot of data to achieve high compression ratios. Tar clumps everything into one file, which is good for compressors. Let's examine how much. For testing we took Linux kernel 4.9 source tree. We first recompressed it with xz using the same encoder settings as for the other systems. The unpacked source is 762MB and the compressed size is 92 megabytes, which is our baseline.

The obvious comparison to tar + xz is a zip file using LZMA compression. There is an implementation called Parzip that supports parallel LZMA compression out of the box. When run it produces a zip file that is 164 megabytes in size. It is easy to see why tar + xz is so popular. The main point of compression is to save space and having a file that is almost twice the size is unacceptable.

One inefficiency of pkzip is that the metadata format is both messy and not compressed. Creating a new file format that stores the index as a single compressed lump at the end of the file is straightforward. The code for this (and the rest of the examples listed here) can be found in the jpak Github repo. Doing this creates a file that is 158 megabytes in size. This is better, but the compression ratio is still unusably bad.

If we wish to preserve perfect O(1) random file access to the compressed data this is probably the best we can do. But if loosen the requirements a bit we can get further. An archive basically consists of a sequence of files of different size. If we start from the top and pick files until their total uncompressed size reaches some predetermined limit, concatenate them into a single clump and compress that we can give the compressor large swatches of data at a time and in addition can compress the different clumps in parallel. To access a random file we need to decompress only the clump it is in, not the entire file from the beginning. This makes the operation take O(clumpsize) time.

If we do this using a clump size of 1 MB (uncompressed) the result is 102 MB file. This is a big improvement but still lags behind tar + xz. Increasing the clump size to 10MB produces a 93 MB file. Upping the size to 100 MB (which means that the file will consist of 8 clumps) produces a 91 MB file. This is smaller than what xz produces.


Wait, what? How can it be smaller than tar+xz?

The key here lies in the tar file format. It interleaves file data and metadata. Compressors do not really like this, because these two items usually have different statistics. Putting data that is similar next to each other improves compression ratios. Jpak stores the file metadata in a structure-of-arrays formation. That is, each metadata is a structure with properties A, B, C and so on. Jpak first writes the property A of all entries followed by all properties B and so on. This layout compresses better than tar's intermixed layout. Or that's the working theory at the moment, there has not been a thorough analysis.

It should be noted that the actual compression method is the same in every case: LZMA as provided by liblzma. The only difference is in how the data is laid out in the file. Changing the layout allows you to do the compression and decompression in parallel, get a random access index to your data and achieve as good or even better compression ratios with the downside being slightly slower access to individual files.

The small print

The current implementation does not actually do the operations in parallel (though Parzip does). The code has not been tested very thoroughly, so it might fail in many interesting ways. Do not use it to store any data you care about. The current implementation stores user and group metadata only as numbers rather than in strings, so it is not exactly equivalent to tar (though user/group strings are usually the same for most files so they compress really well). Other metadata bits may be missing as well.

Tuesday, December 27, 2016

Comparing executable size of C++ exceptions vs plain C error structs

One of the Grand Pieces of Hacker Folklore is that you should not use C++ exceptions because they cause massive code bloat. Instead you should use error codes as exemplified by plain C, Go, Rust and other languages. These opinions have caused many a heated debates both on and off the Internet.

One of the problems is that accurately measuring the size difference of code that uses exceptions vs one that uses error codes is hard to do properly. Legend has it that when Symbian developers were considering whether to add exception support, they compiled a helloworld application (that has no error paths at all), noted that with exceptions enabled it was somewhat bigger and declared this unacceptable. What they did not consider is that this increase in code size was constant, not linear to the code size. They then went on to code a completely nonstandard cleanup mechanism on top of C++, which was both terrible and bloaty.

A more methodological approach

There is really only one way to fully reliably test whether error codes or exceptions produce smaller code. That is to take a large program that uses one or the other mechanism and convert it to use the other without changing anything else. This can take months of work and is obviously not feasible in the real world but would make for an interesting research project.

Instead we can simulate the problem. It is relatively easy to generate code with the same functionality in C and in C++ and compare their performance. Let's create a function that simulates an simple work load. First it grabs some resource, then calls some other functions and finally returns a result. In simple C++ it would look like this:

int func222() {
    DummyObject dobj;
    int a = func227();
    int b = func224();
    int c = func228();
    return a + b + c;
}

Here DummyObject is just a simple object that is just there so the exception unwinding mechanism has something to destroy. Its constructor and destructor are defined elsewhere so that the optimizer can not remove it completely. We create 1000 of these functions where each one calls into three random functions with a higher number. Calls to functions bigger than 1000 are done to function 1000 instead. Function 1000 just returns 1. This program does not do anything useful and its runtime is exponential, so you should probably not run it. 

In plain C using glib style error "objects" the function looks like this:

int func222(char **error) {
    int a, b, c;
    struct Dummy *d = dummy_new();

    a = func227(error);
    if(*error) {
        dummy_delete(d);
        return -1;
    }
    b = func224(error);
    if(*error) {
        dummy_delete(d);
        return -1;
    }
    c = func228(error);
    if(*error) {
        dummy_delete(d);
        return -1;
    }
    dummy_delete(d);
    return a + b + c;
}

The functionality is the same, we just need to manually check whether the function calls succeeded and deallocate the resource on every path. The code to generate and run this code is downloadable from this Github repo.

The results

We measure three different compilation methods. Plain C, plain C++ and C++ with -fno-exceptions. The last one represents the use case where any error is immediately fatal and thus recovery is pointless. We tested with a bunch of compilers. C++ version was 14 except on GCC 4.8.4 which does not support it so C++11 was used instead. C version was gnu11. All tests were done on AMD64 architecture except one which was done on 32 bit ARM (Raspberry Pi 2 + Debian). -O2 and -g were used for optimization and debug info. Here are the sizes of the resulting stripped binaries.


The results are surprising in many ways. Perhaps the biggest one is that using C++ and exceptions produces a smaller binaries than plain C every single time. The closest is Clang but even there C binaries take over 6% more space. Noexcept is always the smallest, but this is expected as it has no error recovery mechanism at all.

Another thing to note is that GCC seems to have had a big C++ code size increase between versions 4.9.2 and 6.2.0. Clang seems to produce a lot smaller C binaries than GCC. For C++ the difference is there but the gap is not as big.

What is causing this?

I don't know for sure. It would be interesting to hear from Clang/GCC developers what is happening here and what actually happens under the covers. But basically what it (probably) boils down to is that the compiler has more freedom to optimize the steps needed for stack unwinding. When the user does deallocation manually, the compiler needs to obey the code paths they have written even if they are not optimal). Autogenerated stack unwinding just needs to call a few destructors in any way it wants. Optimizing this is tedious work, the kind best left to a machine.

Conclusions

This is only one aspect of the (seemingly) eternal debate between error codes and exceptions. It does not prove (I say again, does not) that one is faster than the other, or even that one always produces more efficient or smaller code than the other. However what we can say is that there are cases where converting from error codes into full exceptions makes the final binary smaller. It would be interesting to know if this is true for other code bases as well. That requires a lot more testing, though.

The Github repo contains more stuff for those interested in going deeper. For example it contains a version of the C code that is a bit more size optimized (though a bit less readable) that only has error codes rather than full error structs. It is still bigger than the straightforward C++ implementation.

Monday, December 19, 2016

What is the global cost of Autoconf?

Waiting for autoconf to finish is annoying. But have you ever considered how much time and effort is wasted every day doing that? Neither have I, but let's estimate it anyway.

First we need an estimate of how many projects are using Autotools. 100 is too low and 10 000 is probably too high, so let's say 1000. We also need an assumption of how long an average autoconf run takes. Variability here is massive ranging from 30 seconds to (tens of) minutes on big projects on small hardware. Let's say an average of one minute as a round number.

Next we need to know how many times builds are run during a day. Some projects are very active with dozens of developers whereas most have only one person doing occasional development. Let's say each project has on average 2 developers and each one does an average of 4 builds per day.

This gives us an estimate of how much time is spent globally just waiting for autoconf to finish:

1000 proj * 2 dev/proj * 2 build/dev * 1 minute/build =
   4000 minutes = 66 hours

66 hours. Every day. If you are a business person, feel free to do the math on how much that costs.

But it gets worse. Many organisations and companies build a lot of projects each day. As an example there are hundreds of companies that have their own (embedded) Linux distribution that they do a daily build on. Linux distros do rebuilds constantly and so on. If we assume 10 000 organisations that do a daily build and we do a lowball estimate of 5 dependencies per project (many projects are not full distros, but instead build a custom runtime package or something similar), we get different numbers:

10 000 organisations * 1 build/organisation * 5 dependencies/build * 1 min/dependency = 50 000 minutes = 833 CPU hours

That is, every single day over 800 CPU hours are burned just to run autoconf. Estimating CPU consumption is difficult but based on statistics and division we can estimate an average consumption of 20 Watts. This amounts to 16 kWh every day or, assuming 300 working days a year, 5 MWh a year.

That is a lot of energy to waste just to be absolutely sure that your C compiler ships with stdlib.h.