Sunday, September 3, 2017

Comparing C and C++ usage and performance with a real world project

The relative performance of C and C++ is the stuff of folk legends and Very Strong Opinions. There are microbenchmarks that can prove differences in performance in any direction one could wish for, but, as always, they are not conclusive in any way. For an actual comparison you'd need to take a complete, non-trivial program in one language, translate it to the other without making any functional changes and then compare the results. The problem here is that this sort of conversion does not exist.

So I made one myself.

I took the well known pkg-config program, which is written in plain C using GLib, and converted it to C++. The original code is available at Freedesktop and the converted version is on Github (branches master and cpp). The C++ version does not have any dependencies outside of the C++ standard library, whereas the C version depends on GLib and by extension pcre (which is an internal dependency of GLib; pkg-config does not use regular expressions).

All tests were run on Ubuntu 17.04. The C++ version was tested both with GCC/stdlibc++ and Clang/libc++. Measurements were done with the gtk+ test in pkg-config's test suite.

The results in a single table

                             C++ stdlibc++   C++ libc++        C

Optimized exe size                   180kB        153kB     47kB
Minsize exe size                     100kB        141kB     43kB
3rd party dep size                       0            0    1.5MB
Compile time                          3.9s         3.3s     0.1s
Run time                             0.01s       0.005s   0.004s
Lines of code                         3385         3385     3388
Memory allocations                    9592         8571     5549
Explicit deallocation calls              0            0       79
Memory leaks                             0            0    >1000
Peak memory consumption              136kB         53kB     56kB

Binary sizes and builds

The first thing to note is that the C version is a lot smaller than the corresponding C++ executables. However, if you factor in the size of the external third party dependency binaries, i.e. the shared libraries of GLib and pcre, the C version is an order of magnitude bigger. One could argue endlessly about the correct way to calculate these sizes (because a system provided library is shared among many users) but we're not going to do that here.

C++ is known for its slow build times and that is also apparent here. Again it should be noted that compiling the C dependencies takes several minutes so if you are on a platform where dependencies are built from source, the C version is a lot slower.

The runtime is fast for all three versions. This is expected because pkg-config is a fairly simple program. The stdlibc++ version is slower than the other two, whose runtimes are within measurement error of each other.

Memory

Memory and resource management has traditionally been the weak point of C, where the programmer is responsible for shepherding and freeing every single resource. This can be clearly seen in the result table above. Perhaps the most striking fact is that there are 79 explicit (that is, written and maintained by the developer) resource release calls. That means that more than 2% of all statements in the entire code base are resource deallocation calls.

Every manual resource deallocation call is a potential bug. This is confirmed by the number of memory leaks as reported by Valgrind. There are more than 1000 of them, several dozen of which are marked as "definitely lost". The C++ implementation on the other hand uses value types such as std::string and RAII consistently. Every resource is deallocated automatically by the compiler, which, as we can see, does it perfectly. There are no resource leaks.
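
To make the difference concrete, here is a small illustrative sketch (not actual pkg-config code, the struct and field names are invented) of what manual cleanup looks like next to the value type equivalent:

// Illustrative only; PackageC/PackageCpp and their fields are made up.
#include <cstdlib>
#include <string>
#include <vector>

// C style: every owned field must be released by hand, on every path.
struct PackageC {
    char *name;   // owned, must be freed
    char **deps;  // owned NULL-terminated array, must be freed
};

void package_free(PackageC *p) {
    for (char **d = p->deps; d && *d; ++d)
        std::free(*d);
    std::free(p->deps);
    std::free(p->name);
}

// C++ style: value types clean up after themselves, so the explicit
// release calls simply disappear.
struct PackageCpp {
    std::string name;
    std::vector<std::string> deps;
};  // no destructor needed; the compiler generates the cleanup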

Memory consumption is also interesting. The C version works by creating an array of package objects and strings. Then it creates a hash table with pointers that point to said array. This is the classical C "sea of aliased pointers" problem, where the developer must keep track of the origin and meaning of every single pointer with no help from the compiler.

The C++ version has no pointers but instead uses value types. This means that all data is stored twice: once in the array and a second time in the hash table. This could probably be optimized away but was left as is for the purposes of this experiment. Even with this duplication we find that the version using libc++ uses less memory than the GLib one. Stdlibc++ uses a fair bit more memory than the other two. To see why, let's look at some Massif graphs, starting with stdlibc++.


This shows that for some reason stdlibc++ allocates one chunk of 70 kB during startup. If we ignore this allocation the memory consumption is about 60 kB which is roughly the same as for the other two executables.

Plain C looks like this.


The most notable thing here is that Massif can not tell the difference between different allocation sources but instead lumps everything under g_malloc0. The C++ version shows allocations per container type which is extremely useful.

Finally, here is the chart for libc++.



Libc++ does not have an initial allocation like stdlibc++, so its memory usage is lower. Its containers also seem to be more optimized, so it uses less memory overall. Memory consumption could probably be reduced further by using a linear probing hash map (which is what GLib does internally) rather than the node-based one required by the C++ standard, but that would mean adding an external dependency, which we want to avoid.

The conversion job

One of the many talking points of Rust is that converting C to it is easy. This is spoken of in quotes such as "Rust is the only language that allows you to convert existing C code into a memory safe language piece by piece" (link to original purposefully omitted to protect the innocent). Depending on your definition of a "memory safe language" this statement is either true or complete bunk.

If you are of the opinion that Rust is the only memory safe language then the statement is obviously true.

If not then this statement is fairly vacuous. Every programming language that has support for plain C ABI and calling conventions, which is to say almost every one of them, has supported transitioning from C code one function at a time. Pascal, D, Java with JNI, even Fortran have been capable of doing this for decades.

C++ can also do this but it goes even further: it supports replacing C structures one element at a time. Pkg-config had many structs which consisted of things like GLists of char pointers. In any other programming language, changing such an element means converting the entire struct from C into the new language in a single step. That in turn means changing all code that uses said struct in one commit, which is usually huge and touches a large fraction of the code base.

In C++ you can convert only a fraction of the struct, such as replacing one of the string lists with a std::vector<std::string>. Other elements of the struct can remain unchanged. This means smaller, more understandable commits. The extra bonus here is that these changes do not affect functionality in any way. There are no test suite regressions during the update process, even when working with Frankenstein structs that are half C and half C++.
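
A hedged sketch of what such a half converted struct might look like mid-transition (the names are invented and GList is only forward declared so the snippet stays self-contained):

// Transitional "frankenstruct", names invented for illustration.
#include <string>
#include <vector>

struct _GList;
typedef struct _GList GList;  // stand-in for the GLib declaration

struct Package {
    // Already converted: owned by the struct, released automatically.
    std::string key;
    std::vector<std::string> requires_private;

    // Not converted yet: still plain C, still freed manually in the
    // existing cleanup code.
    char *version;
    GList *libs;
};

Code that only touches the converted members can already drop its manual cleanup, while everything else keeps compiling and passing the test suite until its turn comes.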

Following this train of thought to its final station yields slightly paradoxical results. If you have a legacy C code base that you want to convert to D or Rust or whatever, it might make sense to convert it to C++ first. This allows you to do the hard de-C-ification in smaller steps. The result is modern C++ with RAII and value types that is a lot simpler to convert to the final target language.

The only other programming language in common use that is capable of doing this is Objective C but it has the unfortunate downside of being Objective C.

Conclusions

Converting an existing C program into C++ can yield programs that are as fast, have fewer dependencies and consume less memory. The downsides include a slightly bigger executable and slower compilation times.

Tuesday, August 29, 2017

Dependencies, why are they so hard to get right?

The life cycle of any programming project looks roughly like this.


We start with the plain source code. It gets processed by the build system to produce so-called "build tree artifacts". These are executables, libraries and the like, but they are slightly special: every build system sprinkles its own special magic into the outputs, so the files inside a build tree can not usually be run directly and the file system layout can be anything. The build tree is each build system's internal implementation detail, which is usually not documented and definitely not stable. The only thing that can reliably operate on items in the build directory is the build system itself.

The final stage is the "staging directory", which usually is the system tree, for example /usr/lib on Unix machines, but it can also be e.g. an app bundle dir on OSX or a standalone dir used to generate an MSI installer package on Windows. The important step here is installation. Conceptually it scrubs all traces of the build system's internal info and makes the outputs conform to the standards of the current operating system.

The different dependency types

Based on this there are three different ways to obtain dependencies.


The first and simplest one is to take the source code of your dependency, put it inside your own project and pretend it is a native part of your project. Examples of this include the SQLite amalgamation file and some header-only C++ libraries. This way of obtaining dependencies is not generally recommended or interesting so we'll ignore it for the remainder of this post.

Next we'll look into the final case. Dependencies that are installed on the system are relatively easy to use as they are guaranteed to exist before any compilation steps are undertaken and they don't change during build steps. The most important thing to note here is that these dependencies must provide their own usage information in a build system independent format that is preferably fully declarative. The most widely accepted solution here is pkg-config but there can be others, as long as it is fully build system independent.

Which leaves us with the middle case: build system internal dependencies. There are many implementations of this, ranging from Meson subprojects to CMake internal projects and many new languages such as D and Rust which insist on compiling all dependencies by themselves all the time. This is where things get complicated.

Since the internal states of build trees differ, it is easy to see that you can not mix two different build systems within one single build tree. Or, rather, you could, but it would require one of them to be in charge and the other one to do all of the following:
  • conform to the file layout of the master project
  • conform to the file format internals of the master project (which, if you remember, are undocumented and unstable)
  • export full information about what it generates, where and how to the master project in a fully documented format
  • accept dependency information for any dependency built by the master project in a standardized format
And there's a bunch more. If you go to any build system developer and tell them to add these features to their system, they will first laugh at you and then tell you that it will absolutely never happen.

This is totally understandable. Pairing together the output of two wildly different unstable interfaces in a reliable way is not fun or often even possible. But it gets worse.

Lucy in the Sky with Diamond Dependency Graphs

Suppose that your dependency graph looks like this.

The main program uses two libraries, libbaz and libbob. Each of them builds with a different build system, each of which has its own package manager functionality. They both depend on a common library, libfoo. As an example, libbob might be a language wrapper for libfoo whereas libbaz only uses it internally. It is crucially important that the combined project has one, and only one, copy of libfoo and that it is shared by both dependents. Duplicate dependencies lead, at best, to link time errors and, at worst, to ten hour debugging sessions of madness in production.

The question then becomes: who should build libfoo? If it is provided as a system dependency this is not an issue, but for build tree dependencies things break horribly. Each package manager will most likely insist on compiling all of its own dependencies (in its own special format) and plainly refuse to work with anything else. What if we want the main program to build libfoo instead (as it is the one in charge)? This quagmire is the main reason why certain language advocates' view of "just call into our build tool [which does not support any way of injecting external dependency information] from your build tool and things will work" is ultimately unworkable.

What have we learned?

  1. Everything is terrible and broken.
  2. Every project must provide a completely build system agnostic way of declaring how it is to be used when it is provided as a system dependency.
  3. Every build system must support reading said dependency information.
  4. Mixing multiple build systems in a single build directory is madness.

Saturday, August 19, 2017

Apple laptops have become garbage

When OSX launched it quite quickly attracted a lot of Linux users and developers. There were three main reasons for this:

  1. Everything worked out of the box
  2. The hardware was great, even sexy
  3. It was a full Unix laptop
It is interesting, then, that none of these things really hold true any more.

Everything works out of the box

I have an Android phone. One of the things one would like to do with it is to take pictures and then transfer them to a computer. On Linux and Windows this is straightforward: you plug in the USB cable, select "share pictures" on the phone and the operating system pops up a file dialog. Very simple.

In OSX this does not work. Because Android is a competitor to the iPhone (which makes Apple most of its money nowadays) it is in Apple's business interest to not work together with competing products. They have actively and purposefully chosen to make things worse for you, the paying customer, for their own gain. Google provides a file transfer helper application but since it is not hooked inside the OS its UX is not very good.

But let's say you personally don't care about that. Maybe you are a fully satisfied iPhone user. Very well, let's look at something completely different: external monitors. In this year's EuroPython conference introductory presentation the speaker took the time to explicitly say that if any presenter had a latest-model MacBook Pro, it would not work with the venue's projectors. Things have really turned on their head, because up to a few years ago Macs were pretty much the only laptops that always worked.

This problem is not limited to projectors. At home I have an HP monitor that has been connected to many a different video source and has worked flawlessly. The only exception is the new work laptop. Connecting it to this monitor makes the system go completely wonky. On every connection it does an impressive impersonation of the dance floor of a German gay bar, with colors flickering, things switching on and off and changing size for about ten seconds or so. Then it works. Until the screen saver kicks in and the whole cycle repeats.

If this was not enough every now and then the terminal application crashes. It just goes completely blank and does not respond to anything. This is a fairly impressive feat for an application that reached feature stability in 1993 or thereabouts.

Great hardware

One of the things I do in my day job is mobile app development (specifically Android). This means connecting an external display, mouse and keyboard to the work laptop. Since Macs have only two USB ports, they are already fully taken and there is nowhere to plug in the development phone. The choices here are to either unplug the mouse whenever you need to deploy or debug on the device or to use a USB hub.

Using dongles for connectivity is annoying but at least with a hub one can get things working. Except no. I have a nice USB hub that I have used for many years on many devices that works like a charm. Except on this work computer. Connecting anything through that hub causes something to break so the keyboard stops working every two minutes. The only solution is to unplug the hub and then replug it again. Or, more specifically, not to use the hub but instead live without an external mouse. This is even more ridiculous when you consider that Apple was the main pioneer for driving USB adoption back in the day.

Newer laptop models are even worse. They have only USB-C connectors and each consecutive model seems to have fewer and fewer of them. Maybe their eventual goal is to have a laptop with no external connection slots, not even a battery charger port. The machine would ship from the factory pre-charged and once the juice runs out (with up to 10 hours of battery life™) you have to throw it away and buy a new one. It would make for good business.

After the introduction of the Retina display (which is awesome) the only notable hardware innovation has been the emojibar. It took the concept of function buttons and made it worse.

Full Unix support

When OSX launched it was a great Unix platform. It still is pretty much the same as it was then, but by modern standards it is ridiculously outdated. There is no Python 3 out of the box, and Python 2 is several versions behind the latest upstream release. Other tools are even worse. Perl is 5.18 from 2014 or so, Bash is 3.2 with a copyright year of 2007, Emacs is from 2014 and Vim from 2013. This is annoying even for people who don't use Macs but just maintain software that supports OSX. Having to maintain compatibility with these sorts of stone age tools is not fun.

What is causing this dip in quality?

There are many things one could say about the current state of affairs. However there is already someone who has put it into words much more eloquently than any of us ever could. Take it away, Steve:

Post scriptum

Yes, this blog post was written on a MacBook, but it is one of the older models which were still good. I personally need to maintain a piece of software that has native support for OSX so I'm probably going to keep on using it for the foreseeable future. That being said, if someone starts selling a laptop with a RISC-V processor, a retina-level display and a matte screen, I'm probably going to be first in line to get one.

Monday, August 7, 2017

Reconstructing old game PC speaker music

Back when dinosaurs walked the earth regular PCs did not have sound cards by default. Instead they had a small piezoelectric speaker that could only produce simple beeps. The sound had a distinctive feel and was described with words such as "ear-piercing", "horrible" and "SHUT DOWN THAT INFERNAL RACKET THIS INSTANT OR SO HELP ME GOD".

The biggest limitation of the sound system was that it could only play one constant tone at a time. This is roughly equivalent to playing the piano with one finger and only pressing one key at a time. Which meant that the music in games of the era had to be simple. (Demoscene people could do crazy things with this hardware but it's not relevant for this post so we'll ignore it.)

An interesting challenge, then, is whether you could take a recording of game music of that era, automatically detect the notes that were played, reconstruct the melody and play it back with modern audio devices. It seems like a fairly simple problem and indeed there are ready-made solutions for detecting the loudest note in a given block of audio data. This works fairly well but has one major problem. Music changes from one note to another seamlessly, and if you just chop the audio into constant-sized blocks, you get blocks with two different consecutive notes in them. This confuses pitch detectors. In order to split the sound into single note blocks you'd need to know the length of each note, and you can't determine that unless you have detected the pitches.

This circular problem could probably be solved with some sort of an incremental refinement search or having a detector for blocks with note changes. We're not going to do that. Let's look at the actual waveform instead.
This shows that the original signal consists of square waves, which makes this specific pitch detector a lot simpler to write. All we need to do is to detect when the signal transitions between the "up" and "down" values. This is called a zero-crossing detector. When we add the duration of one "up" and the following "down" segment we have the duration of one full duty cycle. The frequency being played is the inverse of this value.
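
To make that concrete, here is a minimal sketch of such a detector, assuming the samples have already been decoded into a float buffer where positive values are the "up" half of the square wave and negative values the "down" half (the function name and signature are made up for this example):

#include <cstddef>
#include <vector>

// Zero-crossing pitch detector for a square wave signal.
// Returns one frequency estimate per full duty cycle.
std::vector<double> detect_frequencies(const std::vector<float> &samples,
                                        double sample_rate) {
    std::vector<double> freqs;
    bool was_positive = !samples.empty() && samples[0] >= 0.0f;
    size_t cycle_start = 0;
    int crossings = 0;
    for (size_t i = 1; i < samples.size(); ++i) {
        bool is_positive = samples[i] >= 0.0f;
        if (is_positive == was_positive)
            continue;                 // still in the same half of the wave
        was_positive = is_positive;
        if (++crossings == 2) {       // one "up" plus one "down" segment seen
            size_t cycle_length = i - cycle_start;
            freqs.push_back(sample_rate / cycle_length);
            cycle_start = i;
            crossings = 0;
        }
    }
    return freqs;
}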

With this algorithm we can get an almost cycle-accurate reconstruction of the original sound data. The problem is that it takes a lot of space, so we need to merge consecutive cycles if they are close enough to each other. This requires a bit of tolerance and guesswork, since the original analog components were not of the highest quality and there is noticeable jitter in note lengths. With some polish and postprocessing you get an end result that goes something like this. Enjoy.

Monday, July 24, 2017

Managing the build definitions of a big project with many subprojects and interdependencies

Last week the news broke that Boost is switching from their own build system to CMake. This made me finally look properly into how Boost is built and what lessons we can learn from it. The results turned out to be quite interesting.

For those interested in diving into Boost's code, note that the source layout in the Git repos is different from what it is in the release tarballs. The latter have a sort of "preinstalled header" directory with all public headers, whereas in Git they are inside each individual repository. There also seem to be two different sets of build definitions, one for each.

Creating a sample project

My first idea was to convert a subset of Boost into Meson for a direct comparison. I spent a lot of time looking at the Jamfiles and could not understand a single thing about them. So instead I created a demonstration project called Liftoff, which can be downloaded from Github. The project had the following requirements:
  • support many standalone subprojects
  • subprojects can depend on other subprojects
  • shared dependencies are built only once, every project using it gets the same instance
  • subprojects can be built either as shared or static libraries or used in a header only mode
  • can build either all projects or only one + all its dependencies
  • any dependency can also be obtained from the system if it is available
  • monorepo layout, but support splitting it up into many individual repos if desired

The libraries

The project consists of four independent subprojects:
  • lo_test, a simple unit testing framework
  • lo_adder, a helper module for adding integers, depends on lo_test
  • lo_strings, a helper module for manipulating strings, has no dependencies
  • lo_shuttle, an application to launch shuttles, depends on all other modules
Note how both lo_adder and lo_shuttle depend on lo_test. Each subproject comes with a header and unit tests, some come with a dependency library as well.

The dependency bit

The core idea behind Meson's dependency system is that projects can declare dependency objects which specify how the dependency should be used (sort of like a Meson-internal pkg-config file). This is what it looks like for the string library:

lo_strings_dep = declare_dependency(link_with : string_lib,
  include_directories : include_directories('.'),
)

Other projects can then request this dependency object and use it to build their targets like this:

string_dep = dependency('lo_strings', fallback : ['lo_strings', 'lo_strings_dep'])

This is Meson nomenclature for "try to find the dependency from the system and if not found use the one in the given subproject". This dependency object can then be used in build targets and the build system takes care of the rest.

Building it

The build command from the command line is this:

meson build
ninja -C build test

This builds and runs all tests. Once you have it built, here are things to try:
  • toggle between shared and static libraries with mesonconf -Ddefault_library=shared [or static]
  • note how the test library is built only once, even though it is used by two different subprojects
  • do a mesonconf -Dmodule=lo_strings and build, note that no other subproject is built anymore
  • do a mesonconf -Dmodule=lo_adder and build, note that lo_test is built automatically, because it is a direct dependency of lo_adder

"Header only" dependencies

Some projects want to ship header only libraries but to also make it possible to build a helper library, usually to cut down on build times. This can be done but it is usually not pretty. You need to write "implementation header files" and do magic preprocessor incantations to ensure things are built in proper locations. We could replicate all of that in Meson if we wanted to, after all it's only grunt work. But we're not going to do that.

Instead we are going to do something fancier.

The main problem here is that traditionally there has been no way to tell that a dependency should also come with some source files that should be compiled in the dependent target. However in Meson this is supported. The lo_strings subproject can be set up to build in this way with the following command:

mesonconf build -Dlo_strings:header_only=true

When the project is built after this, the lo_strings project is not built, instead its source files are put inside the dependent targets and built there. Note that the build definition files for the dependent targets do not change at all. They are identical regardless of where your dependency comes from or how it should be built. Also switching between how things should be built does not require changing the build definition files, it can be toggled from "the outside".

How much space do the build definitions take in total?

66 lines.

Tuesday, July 18, 2017

Experiment: binary size reduction by using common function tails

In embedded development the most important feature of any program is its size. The raw performance does not usually matter that much, but size does. A program that is even one byte larger than available flash size is useless.

GCC, Clang and other free compilers do an admirable job of creating small executables when asked to with the -Os compiler switch. However there are still optimizations that could be added. Suppose we have two functions that look like this:

int func1();  // defined elsewhere
int func2();
int func3();

int funca() {
  int i = 0;
  i+=func2();
  return i+func1();
}

int funcb() {
  int i = 1;
  i+=func3();
  return i+func1();
}

They would get compiled into the following asm on x86-64:

funca():
        push    rbp
        mov     rbp, rsp
        sub     rsp, 16
        mov     DWORD PTR [rbp-4], 0
        call    func2()
        add     DWORD PTR [rbp-4], eax
        call    func1()
        mov     edx, eax
        mov     eax, DWORD PTR [rbp-4]
        add     eax, edx
        leave
        ret
funcb():
        push    rbp
        mov     rbp, rsp
        sub     rsp, 16
        mov     DWORD PTR [rbp-4], 1
        call    func3()
        add     DWORD PTR [rbp-4], eax
        call    func1()
        mov     edx, eax
        mov     eax, DWORD PTR [rbp-4]
        add     eax, edx
        leave
        ret

If you look carefully, the last 7 instructions on both of these functions are identical. In fact the code above can be rewritten to this:

funca():
        push    rbp
        mov     rbp, rsp
        sub     rsp, 16
        mov     DWORD PTR [rbp-4], 0
        call    func2()
common_tail:
        add     DWORD PTR [rbp-4], eax
        call    func1()
        mov     edx, eax
        mov     eax, DWORD PTR [rbp-4]
        add     eax, edx
        leave
        ret
funcb():
        push    rbp
        mov     rbp, rsp
        sub     rsp, 16
        mov     DWORD PTR [rbp-4], 1
        call    func3()
        jmp common_tail

Depending on your point of view this can be seen as either a cool hack or an insult to everything that is good and proper in the world. funcb does an unconditional jump inside the body of an unrelated function. The reason this works is that we know that the functions end in a ret instruction, which pops a return address from the stack and jumps to it (that is, back to the "parent function" that called the current function). Thus both code segments are identical and can be collapsed into one. This is an optimisation that can only be done at the assembly level, because C prohibits gotos between functions.

How much does this save?

To test this I wrote a simple Python script that parses assembly output, finds the ends of functions and replaces common tails with jumps as described above. It uses a simple heuristic and only does the reduction if there are three or more common instructions. Then I ran it on the assembly output of SQLite's "amalgamation" source file. That resulted in reductions such as this one:

Ltail_packer_57:
        setne   %al
Ltail_packer_1270:
        andb    $1, %al
        movzbl  %al, %eax
        popq    %rbp
        retq

This function tail is used in two different ways, sometimes with the setne instruction and sometimes without. In total the asm file contained 1801 functions. Out of those, 1522 could be dedupped. The most common removals looked like this:

       addq    $48, %rsp
       popq    %rbp
       retq

That is, the common function suffix. Interestingly, when the dedupped asm is compiled, the output is about 10k bigger than without dedupping. The original code was 987 kB. I did not measure where the difference comes from. It could be because the extra labels need extra metadata, or because the jmp instruction takes more space than the instructions it replaces since the jump might need a 32 bit offset. A smarter implementation would try to minimize the jump distance so it would fit in 16 bits and thus in a smaller opcode. (I'm not sure if x86-64 has those opcodes so the previous comment might be wrong in practice, but the reasoning behind it is still valid.)
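
For the curious, the core of such a deduplication pass boils down to a longest common suffix computation over per-function instruction lists. The actual script is Python, but the idea can be sketched in C++ along these lines (the three instruction threshold comes from the description above, everything else is invented for the example):

#include <cstddef>
#include <string>
#include <vector>

// Length of the longest common suffix of two instruction sequences.
size_t common_tail_length(const std::vector<std::string> &a,
                          const std::vector<std::string> &b) {
    size_t n = 0;
    while (n < a.size() && n < b.size() &&
           a[a.size() - 1 - n] == b[b.size() - 1 - n])
        ++n;
    return n;
}

// If the shared tail is long enough, keep it only in 'a' (behind a label)
// and replace it in 'b' with a jump to that label.
void merge_tails(std::vector<std::string> &a, std::vector<std::string> &b,
                 const std::string &label) {
    size_t n = common_tail_length(a, b);
    if (n < 3)
        return;  // not worth replacing with a jmp
    a.insert(a.end() - n, label + ":");
    b.erase(b.end() - n, b.end());
    b.push_back("        jmp " + label);
}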

Is this actually worth doing?

On x86 probably not, because the machines have a lot of RAM to spare and people running them usually only care about raw performance. The x86 instruction set is also quite compact because it has a variable size encoding. The situation is different on ARM and other embedded platforms. They have fewer instructions and a constant encoding size (usually 32 bits). This means longer instruction sequences, which gives more potential for size reductions. Some embedded compilers do this optimization, so The Real World would seem to indicate that it is worth it.

I wanted to run the test on ARM assembly as well, but parsing it for function tails is much more difficult than for x86 asm so I just gave up. Thus knowing the real benefits would require getting comments from an actual compiler engineer. I don't even pretend to be one on the Internet, so I just filed a feature request on this to the Clang bug tracker.

Wednesday, July 12, 2017

Is every build system using Ninja just as fast as every other?

One of the most common arguments against Meson is that "it is only fast because it uses Ninja rather than Make, using any other Ninja build generator would be just as fast". This is always stated as fact without any supporting evidence or measurements. But is this really the case? Let's find out.

For testing one needs a project that has both CMake and Meson build definitions. I'm not aware of any so I created one myself. I took the source code of the Mediascanner 2 project, which is using CMake and converted it to use Meson. This project was chosen solely based on the fact that I wrote the original CMake definitions ages ago so I should have a fairly good understanding of the code base. The project itself is a fairly typical small-to-medium project written in C++ with a handful of system dependencies.

Compiling and running the project on a regular laptop gives a fairly straightforward answer. Both build systems are roughly as fast. So, case closed then?

Well no, not really. The project is small and machines today are very fast so a similar result is not very surprising. For a better test we would need to either convert a much larger project or use a slower machine. The former is not really feasible, but the latter can be achieved simply by running the tests on a Raspberry Pi 2. This is a fairly good real world test, as compiling programs of this size on a raspi is a common task.

The measurements

The tests were run on a Raspberry Pi 2 running Jessie with a sid chroot. The CMake version was the latest in sid whereas Meson trunk was used. Each measurement was done several times and the fastest time was always chosen. If you try to replicate these results yourself, note that there is a lot of variance between consecutive runs so you have to run them many times. The source code can be found in this repository.

The first measurement is how long running the first configuration step takes.

CMake takes 12 seconds whereas Meson gets by with only four. This is fairly surprising, as CMake is a C++ executable whereas Meson is implemented in Python, so one would expect the former to be faster. The configuration step is run seldom, so it's ultimately not that interesting. Most of the time is probably spent on a full build, so let's look at that next.
CMake takes 2 minutes and 21 seconds to do a full build. Meson is 31 seconds, or about 20%, faster, clocking in at 1 minute 50 seconds. Both systems build the same files and they have the same number of build steps, 63, as reported by Ninja. Finally let's look at the most common task during development: incremental compilation. This is achieved by touching one source file and running Ninja.

In this case Meson is almost 50% faster than CMake. This is due to the smart link skipping functionality that we stole from LibreOffice (who stole it from Chromium :). This improvement can have a big impact during day to day development, as it allows faster iteration cycles.

Conclusions

Ninja is awesome but not made of magic. The quality of the generated Ninja file has a direct impact on build times. Based on experiments run here it seems that Meson performs consistently faster than CMake.

Monday, June 26, 2017

Writing your own simple GUI SSH client

Most Unix users are accustomed to using SSH from the command line. On Windows and other platforms GUI tools are popular and they can do some nice tricks, such as opening graphical file transfer windows and initiating port forwards on an existing connection. You can do all the same things with the command line client, but you have to specify everything you want to use when first opening the connection.

This got me curious. How many lines of code would one need to build a GUI client that does all that on Linux? The answer turned out to be around 1500 lines of code whose job is mostly to glue together the libvte terminal emulator widget and the libssh network library. This is what it looks like:

Port forwardings can be opened at any time. This allows you to e.g. forward http traffic through your own proxy server to go around draconian firewalls.
File transfers also work.
A lot of things do not work, such as reverse port forwards or changing the remote directory in file transfers. Some key combinations that have ctrl/alt/etc modifiers can't be sent over the terminal. I don't really know why, but it seems the vte terminal does some internal state tracking to know which modifiers are active. There does not seem to be a way to smuggle the corresponding raw keycodes out; vte seems to send them directly to the terminal it usually controls. I also did not find an easy way of getting full keyboard status from raw GTK+ events.

Thursday, June 15, 2017

Of XML, GUIDs and installers

Every now and then I have to use Windows. I'd really like to use Emacs for editing so I need to install it. Unfortunately the official releases from the FSF are zip archives that you need to manually unpack, add shortcuts and all that stuff. This gets tedious and it would be nice if there were MSI installer packages to do all that for you. So I figured I'd create some.

Conceptually what the installer needs to do is the following:

  1. Create the c:\Program Files\GNU Emacs directory
  2. Unzip contents of the official release zip file in said directory
Seems simple, right? There are two main tools for creating installers. The first one is NSIS, but it only creates exe installers, not msi. It also has a scripting language designed by a shaman who has licked a few frogs too many. So let's ignore that.

The other tool is WiX. It creates nice installers but it is Enterprise. Very, Very Enterprise. And XML. But mostly Enterprise. For starters the installer needs to have a GUID (basically a 128 bit random number). It also needs an "upgrade GUID". And a "package GUID". The first two must be the same over all installer versions but the latter must not be.

Having conjured the necessary amount of GUIDs (but not too many), you are ready to tackle copying files around. As you probably guessed, each one of them needs its own GUID. But if you thought that each one would require an XML element of its own, you are sadly mistaken. Every file needs two nested XML elements. The files also need a container. And a container-container.

Did I mention that the documentation consists almost entirely of XML fragments? So it is all but impossible to tell which tag should follow which and which ones should be nested?

The Emacs distribution has a lot of files, so obviously writing the XML by hand is out of the question. Unfortunately the WiX toolkit ships a helper utility called heat to generate the XML from a directory tree. Yes, I really did say unfortunately. While the tool pretends to work, in reality it does not. Following the documentation you might try doing something like this (simplified for purposes of exposition):

heat <directory with unpacked files> <output xml file> other_flags

This does create an installer which installs all the files. But it also does a little bit extra. If your unpack directory was called unpack, then the files will be installed to c:\Program Files\GNU Emacs\unpack. How might we get rid of that extra directory segment? The correct answer is that the system is optimized for in-source Visual Studio builds and trying to do anything else is doomed to fail. But let's try it anyway.

First one might look for a command line switch, like patch's -p1, to discard the first path segment. It does not seem to exist.

Next one might try to be clever and cd inside the unpack dir and do this (again simplified):

heat . <output xml file> other_flags

The system reports no error but the output is identical to the previous attempt. The script helpfully notices that you are using a period for the directory name and will do a lookup in the parent directory to see what it would be called and substitutes it in. Because!

Since there are only a few directories in the top level one might try something along the lines of:

heat dir1 dir2 dir3 dir4 <output xml file> other args

Which again succeeds without error. However the output installer only has files from dir1. The tool parses all the input and then dutifully throws away all entries but the first without so much as a warning.

The way to make this work is to generate the XML files and then monkey patch all paths in them before passing them on to the installer generator. For That Is the Enterprise Way! Just be glad there are no files at the root directory.

Join us next time to learn how a randomly generated daddy GUID comes together with a randomly generated mommy GUID to produce a new entry in the start menu.

Is this available somewhere?

The script is here. There are no downloadable binaries because I don't have the resources to maintain those. It would be cool if someone (possibly even the FSF) would provide them.

Wednesday, June 7, 2017

Optimizing code: even the simplest things are unbelievably complex

In the previous post we looked at optimizing this simple function.

uint64_t result = 0;
for(size_t i=0; i<bufsize; i++) {
  if(buf[i] >= 128) {
    result += buf[i];
  }
}

We shall now do more measurements with real world hardware and compilers. The algorithms we use are the following:

  • simple: the above loop as-is
  • lookup: create a lookup table where entries less than 128 have value zero and the rest have the same value as the index
  • bit fiddling: convert the if into a branchless bitmask operation
  • partition: run std::partition on the data and only add the first half
  • zeroing: go over the data and set values not matching to zero, then add all
  • bucket: keep an array of 255 entries and count the number of times each number appears
  • multiply: convert if to a multiplication by 0 or 1, then add
  • parallel add: add several chars in a single packed 64 bit addition
Those interested in the actual implementations should look them up in the repo.
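
As a taster, a couple of the less obvious variants can be sketched roughly like this (assuming buf is an array of uint8_t and bufsize its length; the actual implementations in the repo may differ in details):

#include <cstddef>
#include <cstdint>

// "multiply": the comparison result (0 or 1) is used as a factor,
// so the loop body has no branch.
uint64_t sum_multiply(const uint8_t *buf, size_t bufsize) {
    uint64_t result = 0;
    for (size_t i = 0; i < bufsize; ++i)
        result += buf[i] * static_cast<uint64_t>(buf[i] >= 128);
    return result;
}

// "zeroing": the first pass rewrites non-matching values to zero, the
// second pass adds everything unconditionally (mutates the input).
uint64_t sum_zeroing(uint8_t *buf, size_t bufsize) {
    for (size_t i = 0; i < bufsize; ++i)
        if (buf[i] < 128)
            buf[i] = 0;
    uint64_t result = 0;
    for (size_t i = 0; i < bufsize; ++i)
        result += buf[i];
    return result;
}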

The hardware used is the following:

  • Raspberry Pi, Raspbian, GCC 4.9.2, Clang 3.5.0
  • Ubuntu zesty, GCC 6.3.0, Clang 4.0.0
  • Macbook Pro i5, XCode 8
  • Windows 10, Visual Studio 2017, run in VirtualBox
The test suite runs all available compilers with a selection of optimization types, CPU features (SSE, AVX, Neon etc) and measures the times taken.

The results


Let's start by looking at the simplest build setup.

This seems quite reasonable. Parallel addition is the fastest, the others are roughly as fast as each other, and the two algorithms that reorder the input array are the slowest. For comparison the Raspberry Pi looks like this:
Everything is much flatter as one would expect. Since everything is going smoothly, let's look at the first measurement again, except this time we sort the input data before evaluating. One would expect that the simple loop becomes faster because the branch predictor has an easier task, partitioning becomes faster and nothing becomes noticeably slower.
Well ... ummm ... one out of three ain't bad, I guess. At this point I should probably confess that I don't have a proper grasp on why these things are happening. Any speculation to follow might be completely wrong. The reason bucket slows down is the easier of these two to explain. Since the input is sorted, consecutive iterations of the loop attempt to write to the same memory address, which leads to contention. When the data was random, each iteration wrote to a random location which leads to fewer collisions.

The reason the simple loop does not get faster may be that the processor evaluates both branches of the if clause in any case, so better branch prediction does not matter. On the other hand Visual Studio does this:

Bucket is slower for sorted as above, but the simple loop is an order of magnitude slower on unsorted data. Ideas on what could be the cause of that are welcome.

What is the fastest combination?

The fastest combination for each hardware platform is the following.
  • Ubuntu: bit fiddle, g++, release build, -msse, unsorted
  • Raspi: bit fiddle, g++, release build, -mfpu=neon, sorted
  • OSX: simple loop, Clang++, debugoptimized build, -msse4.2, sorted
  • VS2017: lut, debugoptimized build, unsorted
This is basically random. There does not seem to be any one algorithm that is consistently the fastest, every one of them is noticeably slower than others under some circumstances. Even weirder, things that you would expect to be straightforward and true are not. Here are some things to scratch your head over:
  • AVX instructions are never the fastest, and on an i7 the fastest is plain old SSE (for the record MMX was not tested)
  • With Clang, enabling Neon instructions makes everything a lot slower
  • On the Raspberry Pi doing a read only table lookup using Neon is slower than with regular instructions
  • On an i7 multiplication is sometimes faster than arithmetic shifting

Wednesday, May 31, 2017

Gee, optimization sure is hard

In a recent Reddit discussion the following piece of code was presented:

for (unsigned c = 0; c < arraySize; ++c) {
    if (data[c] >= 128)
        sum += data[c];
}

This code snippet seems to be optimal but is it? There is a hard to predict branch and a data dependency, both of which can cause slowdowns. To see if this can be implemented faster I created a test repo with a bunch of alternative implementations. Let's see how they stack up.

The implementations

The simplest is the simple loop as described above.

A version using a lookup table has a helper table of 256 entries. Values larger or equal to 128 have identity value and values smaller than 128 have the value zero. The body of the loop then becomes result += lut[buf[i]], which is branch free.

The bit fiddling approach goes through the values one by one and first calculates a mask value which is b >> 7. This is an arithmetic right shift whose outcome is either all zeros or all ones depending on whether the value of b is less than 128. Then we add the value to the result by ANDing it with the mask. The core of this loop is result += buf[i] & (((int8_t)buf[i]) >> 7).

The partitioning approach divides the data array with std::partition and then only the part with values we want are added.

The zeroing approach goes over the array and sets all values that are less than 128 to zero. Then it goes over it again and adds all entries unconditionally. This goes over the data twice (and mutates it) but there are no data dependencies.

The bucket approach has an array of 256 entries. The loop goes over the data and increments the count for each input byte like this: ++counts[buf[i]]. The result is then obtained by going over entries 128-255 and evaluating result += i*counts[i].
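
Written out as compilable functions, the lookup table, bit fiddling and bucket variants look roughly like this (a sketch; the versions in the test repo may differ in details):

#include <cstddef>
#include <cstdint>

// Lookup table: entries below 128 map to zero, the rest to themselves.
uint64_t sum_lut(const uint8_t *buf, size_t bufsize) {
    uint8_t lut[256];
    for (int i = 0; i < 256; ++i)
        lut[i] = (i >= 128) ? i : 0;
    uint64_t result = 0;
    for (size_t i = 0; i < bufsize; ++i)
        result += lut[buf[i]];
    return result;
}

// Bit fiddling: the arithmetic shift yields an all-ones or all-zeros mask.
uint64_t sum_bitfiddle(const uint8_t *buf, size_t bufsize) {
    uint64_t result = 0;
    for (size_t i = 0; i < bufsize; ++i)
        result += buf[i] & (((int8_t)buf[i]) >> 7);
    return result;
}

// Bucket: count occurrences, then weight each count by its value.
uint64_t sum_bucket(const uint8_t *buf, size_t bufsize) {
    uint64_t counts[256] = {0};
    for (size_t i = 0; i < bufsize; ++i)
        ++counts[buf[i]];
    uint64_t result = 0;
    for (int i = 128; i < 256; ++i)
        result += static_cast<uint64_t>(i) * counts[i];
    return result;
}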

So which one is the fastest?

Well that depends. A lot. On pretty much everything.

On x86_64 lookup table and bucket are the fastest, but only when using -O2. And GCC. For some reason Clang can't optimize the bucket version and it is consistently slower. On O3 bit fiddling becomes faster.

On a Raspberry Pi and -O2, the fastest are bit fiddling and bucket. But when using -O3, the simple loop is the fastest by a noticeable margin.

There does not seem to be a version that is consistently the fastest. However the versions that mutate their input array (partition and zeroing) are consistently the slowest.

What should I do to get teh fastest codez for my program?

This is a difficult question and depends on many things. The most straightforward solution is to write the simplest code you possibly can. It is easiest to understand and modify and is also the simplest for the compilers to optimize for the largest number of targets.

After that find out if you have a bottleneck. Measure. Measure again. Write an optimized version if you really need it. Measure. Measure with many different compilers and platforms. Keep measuring and fixing issues until the project reaches end of life.

Monday, May 15, 2017

Emulating the Rust borrow checker with C++ part II: the borrowining

The most perceptive among you might have noticed that the last blog post did not actually do any borrowing, only single owner semantics with moves. Let's fix that. But first a caveat.

As far as I can tell, it is not possible to emulate Rust's borrow checker fully in compile time with C++. You can get pretty close (for some features at least) but there is some runtime overhead and violations are detected only at runtime, not compile time. See the end of this post for a description why that is. Still, it's better than nothing.

At the core is a resource whose ownership we want to track. We want to allow either one accessor that can alter the object or many read-only accessors. We put the item inside a class that looks like this:

template<typename T>
class Owner final {
 private:
    T i;
    int roBorrows = 0;
    bool rwBorrow = false;
<rest of class omitted for brevity>

The holder object does not give out pointers or references to the underlying object. Instead you can only request either a read only or read-write (or mutable) reference to it. The read only function looks like this (the rw version is almost identical):

template<typename T>
RoBorrow<T> Owner<T>::borrowReadOnly() {
    if(rwBorrow) {
        throw std::runtime_error("Tried to create read only borrow when a rw borrow exists.");
    } else {
        return ::RoBorrow<T>(this); // increments borrow count
    }
}

Creating read only borrows only succeeds if there are no rw borrows. This code throws an exception but it could also call std::terminate or something else. A rw borrow would fail in a similar way if there are any ro borrows.

The interesting piece is the implementation of the proxy object RoBorrow<T>. It is the kind of move-only type that was described in part I. When it goes out of scope its destructor decrements the owner's borrow count:

~RoBorrow() { if (o) { o->decrementRoBorrows();} }

The magic lies in the conversion operator that is slightly different than the original version:

operator const T&() const { return o->i; }

The conversion operator only gives out a const reference to the underlying object, which means only const operations can be invoked on it. Obviously this does not work for e.g. raw file descriptors, because a "const int" does not protect the state of the kernel fd object. In order to call non-const functions you first need to get an RwBorrow, whose conversion operator gives a non-const reference, and Owner will only provide one if there are no other borrows outstanding.
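
Putting the pieces together, a condensed sketch of the whole arrangement might look something like the following. It is simplified compared to the class described above: the RwBorrow side is only hinted at and the proxy pokes the counters directly via friendship instead of calling helper methods.

#include <stdexcept>
#include <utility>

template<typename T> class Owner;

template<typename T>
class RoBorrow final {
    Owner<T> *o;
public:
    explicit RoBorrow(Owner<T> *owner) : o(owner) { ++o->roBorrows; }
    RoBorrow(const RoBorrow &) = delete;
    RoBorrow& operator=(const RoBorrow &) = delete;
    RoBorrow(RoBorrow &&other) : o(other.o) { other.o = nullptr; }
    ~RoBorrow() { if (o) { --o->roBorrows; } }
    operator const T&() const { return o->i; }
};

template<typename T>
class Owner final {
    friend class RoBorrow<T>;
    T i;
    int roBorrows = 0;
    bool rwBorrow = false;
public:
    explicit Owner(T value) : i(std::move(value)) {}
    RoBorrow<T> borrowReadOnly() {
        if (rwBorrow)
            throw std::runtime_error("Tried to create read only borrow when a rw borrow exists.");
        return RoBorrow<T>(this);
    }
    // borrowReadWrite() would do the mirror image check (roBorrows == 0,
    // rwBorrow == false) and hand out an RwBorrow whose conversion
    // operator returns a non-const T&.
};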

When using this mechanism the following code fails (at runtime):

 auto b1 = owner.borrowReadOnly();
 auto b2 = owner.borrowReadWrite();
 
But this code works:

{
    auto b1 = owner.borrowReadOnly();
    auto b2 = owner.borrowReadOnly();
}
{
    auto b3 = owner.borrowReadWrite();
}
{
    auto b4 = owner.borrowReadOnly();
}

because the borrows go out of scope and are destroyed decrementing borrow counts to zero.

Why can't this be done at compile time?

I'm not a TMP specialist so maybe it can but as far as I know it is not possible. I'd love to be proven wrong on this one. The issue is due to limitations of constexpr which can be distilled into this dummy function:

template<typename T>
void constexpr foo(bool b) {
    T x;
    constexpr int refcount = 1;
    if constexpr(b) {
        // do something
        --refcount;
    } else {
        // do something else
        --refcount;
    }
    static_assert(refcount == 0);
}

The static assert is obviously true no matter which code path is executed, but the compiler can't prove that. First of all the if clause does not compile because its argument is not a compile-time constant. Second of all, the refcount decrements do not compile because constexpr variables can not be mutated during evaluation. You can create a new variable with the value refcount - 1, but it would go out of scope when the subblocks end and there is no phi-node-like concept to get the value out to the outer scope. Finally, destructors can not be constexpr, so even if we could mutate the count variable, the decrement would not happen automatically (even though in theory the compiler has all the information it needs to do that).

Saturday, May 13, 2017

Emulating the Rust borrow checker with C++ move-only types

Perhaps the greatest thing to come out of C++11 is the notion of move semantics. Originally it was designed to make it efficient to return big objects like matrices from functions. In move operations the destination "steals the guts" of the source object rather than making a copy of it. Usually this means taking hold of some pointer to a reserved memory block and assigning the source's pointer to nullptr.

This mechanism is not reserved to pointers and can be used for any data type. As an example let's write a move-only integer class. A typical use for this would be to store file descriptors to ensure that they are closed. The code is straightforward, let's go through it line by line:

class MoveOnlyInt final {
 private:
    int i;

    void release() { /* call to release function here */ }

This is basic boilerplate. The release function would typically be close, but can be anything.

 public:
    explicit MoveOnlyInt(int i) : i(i) {}
    ~MoveOnlyInt() { release(); }

We can construct this from a regular integer and destroying this object means calling the release function. Simple. Now to the actual meat and bones.

    MoveOnlyInt() = delete;
    MoveOnlyInt(const MoveOnlyInt &) = delete;
    MoveOnlyInt& operator=(const MoveOnlyInt &) = delete;

These declarations specify that this object can not be copied (which is what C++ does by default). Thus creating two objects that hold the same underlying integer is not possible unless you explicitly create two objects with the same bare integer.

    MoveOnlyInt(MoveOnlyInt &&other) { i = other.i; other.i = -1; }
    MoveOnlyInt& operator=(MoveOnlyInt &&other) { release(); i = other.i; other.i = -1; return *this; }

Here we define the move operators. A move means releasing the currently held integer (if it exists) and grabbing the source's integer instead. This is all we need to have a move only type. The last thing to add is a small helper function.

    operator int() const { return i; }
};

This means that this object is convertible to a plain integer. In practice this means that if you have a function like this:

void do_something(int x);

then you can call it like this:

MoveOnlyInt x(42);
do_something(x);

You could achieve the same by creating a get() function that returns the underlying integer but this is nicer and more usable. This is not quite as useful for integers but extremely nice when using plain C libraries with opaque structs. Then your move only wrapper class can be used directly when calling plain C functions.
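
As an example, here is a hedged sketch of the same pattern wrapped around an opaque C handle, using a plain FILE* just to keep the example standard; any C library handle with a create/destroy function pair works the same way:

#include <cstdio>

class File final {
 private:
    std::FILE *f;

 public:
    explicit File(std::FILE *f) : f(f) {}
    ~File() { if (f) { std::fclose(f); } }

    File(const File &) = delete;
    File& operator=(const File &) = delete;

    File(File &&other) : f(other.f) { other.f = nullptr; }
    File& operator=(File &&other) {
        if (this != &other) {
            if (f) { std::fclose(f); }
            f = other.f;
            other.f = nullptr;
        }
        return *this;
    }

    operator std::FILE*() const { return f; }  // usable directly in C calls
};

// Usage: the wrapper converts implicitly wherever a FILE* is expected.
// File f(std::fopen("data.txt", "r"));
// std::fgetc(f);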

What does this allow us to do?

All sorts of things, the object behaves much in the same way as Rust's movable types (but is not 100% identical). You can for example return it from a function, which transfers ownership in a compiler enforced way:

MoveOnlyInt returnObject() {
    MoveOnlyInt retval(42);
    return retval;
}

If you try to pass an object as an argument like this:

int byValue(MoveOnlyInt mo);
...
byValue(mo);

a regular class would get copied but for a move-only type you get a compiler error:

./prog.cpp:39:13: error: call to deleted constructor of 'MoveOnlyInt'
    byValue(mo);
            ^~
../prog.cpp:13:5: note: 'MoveOnlyInt' has been explicitly marked deleted here
    MoveOnlyInt(const MoveOnlyInt &) = delete;
    ^
../prog.cpp:22:28: note: passing argument to parameter here
int byValue(MoveOnlyInt mo) {

Instead you have to explicitly tell the compiler to move the object:

byValue(std::move(mo));

Some of you might have spotted a potential issue. Since a MoveOnlyInt object can be converted to an int and there is a constructor that takes an int, that could create two objects for the same underlying integer. Like this:

MoveOnlyInt mo(42);
MoveOnlyInt other(mo);

The compiler output looks like the following:

../prog.cpp:37:14: error: call to deleted constructor of 'MoveOnlyInt'
    MoveOnlyInt other(mo);
             ^     ~~
../prog.cpp:13:5: note: 'MoveOnlyInt' has been explicitly marked deleted here
    MoveOnlyInt(const MoveOnlyInt &) = delete;

The compiler prevents creating invalid objects.

The wrapper object has zero memory overhead compared to a plain int and code generation changes are minimal. The interested reader is encouraged to play around with the compiler explorer to learn what those are.

Is this just as good as native Rust then?

No. Rust's borrow checker does more and is stricter. For example in C++ you can use a moved-from object, which may yield a null pointer dereference if your underlying type was a pointer. You won't get a use-after-free error, though. On the other hand this class is 18 lines of code that can be applied to any existing C++ code base immediately whereas Rust is a whole new programming language, framework and ecosystem.

Anyone is also free to do the wrong thing and take a copy of the integer value without telling anyone but this issue remains in every other language as well, even in unsafe Rust.


Tuesday, May 9, 2017

A list of common Meson antipatterns

In the vein of a blog post I wrote years ago, here are some Meson antipatterns that should be avoided in build definitions.

Adding -g, -Wall etc. manually

The basic design principle of Meson is that common operations and features should work out of the box without any user intervention. That means that debug information should be enabled for debug builds (which are the default). Similarly basic warning flags should always be on, because anyone coding without -Wall is not doing software development, but practicing astrology.

Many people seem to have been burned by existing build systems and add these flags manually as the first thing they do. There is no need; Meson does it for you. Similarly, instead of manually adding flags such as -Wextra, look into changing the warning_level option.

Adding these flags manually is also bad for portability because there are compilers that do not support them.

Building full paths to files in the source tree

This pattern looks something like this:

myfiles = files('@0@/@1@'.format(meson.current_source_dir(), 'file.c'), [other files here])

This makes sense if you are used to Make's "everything is a string" semantics. Meson is not like that; it tries to take care of mundane things for you. The simpler version of the above is this:

myfiles = files('file.c', [other files here])

This will verify the existence of the file when called and will store the full location. The output can be used in any other subdirectory in any target and Meson will evaluate the correct path for you. There may be some corner cases where file objects are not accepted even though they should be. Please file bugs if you encounter one of these.

The former syntax works in current releases but is already an error in Git master.

Using target_machine in cross compilation

Meson supports the full Canadian Cross environment. It is very powerful but unfortunately quite confusing until you wrap your brain around it. Many people choose to use target_machine instead of host_machine for their cross compilation, even though that is incorrect and leads to interesting bugs. Fortunately there is a simple rule of thumb which is correct 99% of the time:
  • build_machine is the desktop or laptop computer you are using to do your development
  • host_machine is the low powered IoT device, ARM board or equivalent that will run your program
Unless you really know what you are doing and are building one of the very few packages that need to care (Binutils, GCC, Clang, some others), ignore target_machine. Pretend it does not even exist and everything will become simpler and more correct.

Manually setting up dependencies between targets

This happens often when defining custom targets or run targets that operate on the output of other custom targets. In Meson, outputs carry dependency information with them. If you use the output of one target as an input to another, Meson will automatically set up all the dependencies. A recurring antipattern seen in projects in the wild is going through strange contortions to build a string representing the path of a target output and then adding the dependency object to the target's extra dependency keyword argument. That keyword argument is only meant for programs that do not take all their inputs as arguments because, for example, they do file globbing internally.

As with files, there may be cases where using the output object directly fails. These are all bugs; please report them.

Monday, May 1, 2017

The Infinity Horse

The early 1900s saw the popularisation of the automobile. Everyone was jumping on the proverbial bandwagon of this new transport technology. It was cool and exciting so people wanted to be part of it, even if they might not have had full technical understanding of the domain.

One of those movers was a company in the business of raising, selling, renting and feeding horses. The managers understood that horses were a dead-end business and that the future was in motorized vehicles. Thus they started putting a lot of money and effort into creating their own car.

Engineers were hired, meetings were had and factories were being prepped for production. After a few months of work the main obstacles had been overcome and the plan to build and launch a car brand was looking good. That is when the CEO scheduled a surprise meeting with the chief engineer of the project.

"We had a top management meetin last week and what we came up with was spectacular! We are going to completely revolutionize the way people think about cars!"

"Really," said the surprised engineer.

"Yes! What is the most annoying thing about cars?" he asked and without giving the engineer a chance to reply continued "Running out of fuel. That never happens with horses, so this is really a step backwards."

The engineer felt like pointing out that there is such a thing as a horse that has run out of fuel. It's called a dead horse, and the difference between it and a car is that adding fuel to a dead horse does not bring it back to life. But he chose not to mention this and instead wanted to hear the new master plan.

"So what we are going to do - and this is going to be a world first - is to create a car that never runs out of fuel. We shall call it The Infinity Horse. Marketing department is already creating a campaign for this. I have seen some of the sketches and they are just out of this world!"

"Errrm, never runs out of fuel?" the engineer said with a puzzled look on his face.

"Never!"

"That's not really possible, how are you going to do that?"

"That's your job isn't it? Just make it happen, it can't be that hard."

A very unpleasant two weeks followed. The engineer did everything in his power to explain that the aim was impossible, which did not work, and to suggest alternative solutions, which did not work either. Nothing short of perfection was acceptable. As an example, making the gas tank bigger was not an option: while it made the problem occur less often, it was not enough.

"The Vision is The Infinity Horse. Unless we have this feature we have nothing," the CEO repeated every time issues were raised.

This drove our engineer to despair. He was being crushed between the unbreakable laws of physics and the immovable opinion of the CEO. Grasping at straws he started going through automotive parts catalogues. There he saw a new product: a warning light that activated when the car was running out of fuel. And then it came to him.

He took the warning light and modified it slightly. Instead of turning on a light, it shut down the engine. This would fulfill the requirement. The car would never run out of fuel, because it could easily be proven that there was still gas in the tank. Said fuel could not be used for driving, but it was there. The only thing this modification did was reduce the maximum range, making the car worse by every conceivable objective metric.

The CEO loved it. The Infinity Horse had been achieved and it was perfect! If this system caused people to get stranded, well, that's their own fault for not doing due diligence and adding enough fuel to the tank before embarking on their journey.