lauantai 18. elokuuta 2018

Linker symbol lookup order does not work the way you think

A common problem in linking problems has to do with circular dependencies. Suppose you have a program that looks like this:



Here program calls into function one, which is in library A. That calls into function two, which is in library B. Finally that calls into function three, which is back in library A again.

Let's assume that we use the following linker line to build the final executable:

gcc -o program prog.o liba.a libb.a

Because linkers were originally designed in the 70s, they are optimized for minimal resource usage. In this particular case the linker will first process the object file and then library A. It will detect that function one is used so it will take that function's implementation and then throw the rest of library A away. It will then process library B, take function two in the final program and note that function three is also needed. Because library A was thrown away the linker can not find three and errors out. The fix to this is to specify A twice on the command line.

This is how everyone has been told things work and if you search the Internet you will find many pages explaining this and how to set it up linker command lines correctly.

But is this what actually happens?

Let's start with Visual Studio

Suppose you were to do this in Visual Studio. What do you think would happen? There are four different possiblities:
  1. Linking fails with missing symbol three.
  2. Linking succeeds and program works.
  3. Either 1. or 2. happens, but there is not enough information to tell which.
  4. Linking succeeds but the final executable does not run.
The correct answer is 2. Visual Studio's linker is smart, keeps all specified libraries open and uses them to resolve symbols. This means that you don't have to add any library on the command line twice.

Onwards to macOS

Here we have the same question as above but using macOS's default LLD linker. The choices are also the same as above.

The correct answer is also 2. LLD keeps symbols around just like Visual Studio.

What about Linux?

What happens if you do the same thing on Linux using the default GNU linker? The choices are again the same as above.

Most people would probably guess that the correct answer here is 1. But it's not. What actually happens is 3. That is, the linking can either succeed or fail depending on external circumstances.

The difference here is whether functions one and three are defined in the same source file (and thus end up in the same object file) or not. If they are in the same source file, then linking will succeed and if they are in separate files, then it fails. This would indicate that the internal implementation of GNU ld does not work at the symbol level but instead just copies object files out from the AR archive wholesale if any of their symbols are used.

What does this mean?

For example it means that if you build your targets with unity builds, their entire symbol resolution logic changes. This is probably quite rare but can be extremely confusing when it happens. You might also have a fully working build, which breaks if you move a function from one file to another. This is a thing that really should not happen but when it does things get very confusing.

The bigger issue here is that symbol resolution works differently on different platforms. Normally this should not be an issue because symbol names must be unique (or they must be weak symbols but let's not go there) or the behaviour is undefined. It does, however, place a big burden on cross platform projects and build systems because you need to have very complex logic in place if you wish to deduplicate linker flags. This is a fairly common occurrance even if you don't have circular dependencies. For example when building GStreamer with Meson some time ago the undeduplicated linker line contained hundreds duplicated library entries (it still does but not nearly as many).

The best possible solution would be if GNU ld started behaving the same way as VS linker and LLD. That way all major platforms would behave the same and things would get a lot simpler. In the mean time one should be able to simulate this with linker grouping flags:
  1. Go through all linker arguments and split them to libraries that use link_whole and those that don't. Throw away any existing linker grouping.
  2. Deduplicate and put the former at the beginning of the link line with the requisite link_full arguments.
  3. Deduplicate all entries in the list of libraries that don't get linked fully.
  4. Put the result of 3 on the command line in a single linker group.
This should work and would match fairly accurately what VS and LLD already do, so at least all cross platform projects should work out of the box already.

What about other platforms?

The code is here, feel free to try it out yourself.

perjantai 17. elokuuta 2018

The Internet of 200 Kilogram Things: Challenges of Managing a Fleet of Slot Machines

In a previous post we talked about Finland's Linux powered slot machines. It was mentioned that there are about 20 000 of these machines in total. It turns out that managing and maintaining all those machines is a not as easy as it may first appear.

In the modern time of The Cloud, 20 thousand machines might not seem like much. Basic cloud management software such as Kubernetes scales to hundreds of thousands, even millions of machines without even breaking a sweat. Having "only" 20 thousand machines may seem like a small and simple thing that can be managed by one intern in their spare time. In reality things get difficult as there are many unique challenges to managing slot machines as opposed to regular servers.

The data center

Large scale computer fleets are housed in data centers. Slot machines are not. They are scattered across Finland in supermarkets and gas stations. This means that any management solution based on central control is useless. Another way of looking at this is that the data center housing the machines is around 337 thousand square kilometers in size. It is left as an exercise to the reader to calculate the average distance between two nearest machines assuming they are distributed evenly over the surface area.

Every machine is needed

The mantra of current data center design is that every machine must be expendable. That is, any computer may break down at any time, but the end user does not notice this because all operations are hidden behind a reliable layer. Workloads can be transferred from one machine to another either in the same rack, or possibly even to the other side of the world without anyone noticing.

Slot machines have the exact opposite requirements. Every machine must keep working all the time. If any machine breaks down, money is lost. Transferring the work load from a broken machine in the countryside to Frankfurt or Washington is not feasible, because it would require also moving the players to the new location. This is not very profitable, as atoms are much more expensive and slow to transfer between continents than electrons.

The reliability requirements are further increased by the distributed locations of the machines. It is not uncommon that in the sparsely populated areas the closest maintenance person may be more than 400 km away.

The Internet connection

Data centers nowadays have 10 Gb Ethernet connections or something even faster. In contrast it is the responsibility of the machine operator to provide a net connection to a slot machine. This means that the connections vary quite a lot. At the lowest end are locations that get poor quality 3G reception some of the time.

Remote management is also an issue. Some machines are housed in corporate networks behind ten different firewalls all administered by different IT provider organisations, some of which may be outsourced. Others are slightly less well protected but flakier. Being able to directly access any machine is the norm in data centers. Devices housed in random networks do not have this luxury.

The money problem

Slot machines deal with physical money. That makes them a prime target for criminals. The devices also have no physical security: you must be able to physically touch them to be able to play them. This is a challenging and unusual combination from a security point of view. Most companies would not leave their production servers outside for people to fiddle around with, but for these devices it is a mandatory requirement.

The beer attack

Many machines are located in bars. That means that they need to withstand the forces of angry intoxicated players. And, as we all know, drunk people are surprisingly inventive. A few years ago some people noticed that the machines have ventilation holes. They then noticed that pouring a pint of beer in those holes would cause a short circuit inside the machine causing all the coins to be spit out.

This issue was fixed fairly quickly, because you really don't want to be in a situation where drunk people would have financial motivation to pour liquids on high voltage equipment in crowded rooms. This is not a problem one has to face in most data centers.

Update challenges

There are roughly two different ways of updating an operating system install: image based updates and package based updates. Neither of these works particularly well in slot machine usage. Games are big, so downloading full images is not feasible, especially for machines that have poor network connections. Package based updates have the major downside that they are not atomic. In desktop and server usage this is not really an issue because you can apply updates at a known good time. For remote devices this does not work because they can be powered off at any time without any warning. If this happens during an upgrade you have a broken machine requiring a physical visit from a maintenance person. As mentioned above this is slow and expensive.

sunnuntai 12. elokuuta 2018

Implementing a distributed compilation cluster

Slow compilation times are a perennial problem. There have been many attempts at caching and distributing the problem such as distcc and Icecream. The main bottleneck on both of these is that some work must be done on the "user's desktop" machine which is then transferred over the network. Depending on the implementation this may include things such as fully preprocessing the source file and then sending the result over the net (so it can be compiled on the worker machine without needing any system headers).

This means that the user machine can easily become the bottleneck. In order to remove this slowdown all the work would need to be done on worker machines. Thus the architecture we need would be something like this:


In this configuration the entire source tree is on a shared network drive (such as NFS). It is mounted in the same path on all build workers as well as the user's desktop machine. All workers and the desktop machine also must have an identical setup, that is, same compilers and installed dependencies. This is fairly easy to achieve with Docker or any similar container technology.

The main change needed to distribute the work is to create a compiler wrapper script, much like distcc or icecc, that sends the compilation request to the work distributor. It consists only of a command line to execute and the path to run it in. The distributor looks up the machine with the smallest load, sends the command, waits for the result and then returns the result to the developer machine.

Note that the input or output files do not need to be transferred between the developer machine and the workers. It is taken care of automatically by NFS. This includes any changes made by the user on their local checkout which are not in revision control. The code that implements all of this (in an extremely simple, quick, dirty and unreliable way) can be found in this Github repo. The implementation is under 300 lines of Python.

Experimental results

Since I don't have a data center to spare I tested this on a single 8 core i7 computer. The "native OS" ran the NFS server and work distributor. The workers were two cloned Virtualbox images each having 2 cores. For testing I compiled LLVM, which is a fairly big C++ code base.

Using the wrapper is straightforward and consists of setting up the original build directory with this:

FORCE_INLINE=1 CXX='/path/to/wrapper workserver_address g++' cmake <options>

Force inline is needed so configuration tests are run on the local machine. They write to /tmp, which is not shared and the executables might be run on a different machine than where they are compiled leading to failures. This could also be solved by having a shared temporary folder but that would increase the complexity of this simple experiment.

Compiling the source just over NFS in a single machine using 2 cores took about an hour. Compiling it with two workers took about 47 minutes. This is not particularly close to the optimal time of 30 minutes so there is a fair bit of overhead in the implementation. Most of this is probably due to NFS and the fact that absolutely everything ran on the same physical machine. NFS also had coherency problems. Sometimes some process invocations could not see files created by their dependency tasks. The most common case was linker invocations, which were missing one or more object files. Restarting the build always made it pass. I tried to add sync commands as necessary but could not make it 100% reliable.

Miscellaneous things of note

In this test only compilation was parallelised. However this same approach works with every executable that is standalone, that is, it does not need to talk to any other ongoing process via IPC. Every build system that supports setting the compiler manually can be used with this scheme. It also works for parallelising tests for build systems that support invoking tests with an arbitrary runner. For example, in Meson you could do this:

meson test --wrapper='/path/to/wrapper workserver_address'

The system also works (in theory) identically on other operating systems such as macOS and Windows. Setting up the environment is even easier because most projects do not use "system dependencies" on those platforms, only the compiler. Thus on Windows you could mount a smb drive with the code on, say, D:\code on all machines and, assuming they have the same version of Visual Studio, it should just work (not actually tested).

Adding caching support is fairly easy. All machines need to have a common directory mounted, point CCACHE_DIR to that and set the wrapper command on the desktop machine to:

CXX='/path/to/wrapper workserver_address ccache g++'

torstai 26. heinäkuuta 2018

Building native multiplatform GUI apps with Meson

A recent trend in multiplatform GUI applications is to create the core business logic of the application in something like C++, have it (optionally) expose a plain C interface and then create a gui on top of that using the native widget set of each supported platform. This means that the application uses GTK on Linux and other unixes, Cocoa on macOS, win32 API on Windows, Java widgets on Android and so on. This makes the application fully native on all platforms. The tradeoff is having to write the gui multiple times against not having to wrangle a multiplatform widget toolkit as your dependency.

Regardless of how you build your guis you need to have a build system that can build the application under all these different environments from a single code base. To this end I created a sample application called Platypus, which can be downloaded from this Github repo.

The code and compilation

The application itself is extremely simple. It consists of one shared library that returns a random number between 0 and 100 when called. It is implemented using C++ 11's random number generator functionality to ensure each platform has a toolchain new enough to handle it. The GUI applications built on top of it have a text label and a button. Pressing the button updates the text label with a new random number. There is also a test program that verifies that the library is working.

The GTK version is a plain C application. The gui is defined using a Glade interface definition file rather than building it by hand.

The macOS version has a GUI written in Objective C. The gui is defined as a XIB file created with XCode. It is built into a standard app bundle.

The Windows application is written in C++ (though it does not really use any C++ features) and has a gui laid out by hand.

All these guis have full platform integration with icons, an Info.plist, .desktop files and so on.

The installers

The GTK version can be built as a Flatpak in the usual way. The build manifest can be found in the repository's root.

The macOS version builds a standard .dmg installer that can be directly shipped to end users.

The Windows version builds an .MSI installer providing complete install/uninstall integration.

How complicated is it?

The entire build definition consists of 107 lines of Meson.

Screenshots

Here is the plain GTK version running as a Flatpak application on Kubuntu. Window icons and desktop integration work as you would expect.


Here is the macOS version showing the drive image, the installer window and the application running with proper platform integration.


Finally here is the Windows application showing the installed path location under Program Files, the application itself and the automatic integration to Windows' application uninstaller system.


Future plans

It would be cool to add an Android application as well as an iOS application written in Swift in the code base. Patches are welcome as always.

keskiviikko 25. heinäkuuta 2018

Why Git is terrible in four pictures

I was asked to write a blog post on why I dislike Git in general and its UI in particular. Here is a representative sample in four images.

Recently a pull request was filed that looked like this:


As you can see there is an extra merge commit. As is customary we wanted to get rid of it to get a clean rebase based merge history. To do that you'd first get a checkout of the code and look at the log, which looks like the following.


So far, so good. Now let's do a rebase --interactive. It looks like this:


Suddenly Git has chosen to silently remove the merge commit from this list. Why? I have no idea. The commit had changes in it, so it was not pruned because it was empty. If you then exit the editor without any changes (which usually means "do not change anything") then the commit is deleted and any changes that were in it are gone:


If your latter commits built on those changes, you get yummy merge conflicts for something that is conceptual a no-op.

This is the essence of working with Git. Most of the time it works sort of ok, but every now and then it will, without any warning or reason, completely screw you over, destroy your data and leave you stranded, forced to debug your way out of the resulting mess without any help.

"Of course it breaks, you should have used --do-not-do-the-idiotic-wrong-thing-which-for-some-reason-is-the-default command line option, everyone knows that, duh!"

A common kneejerk response to these kinds of problems is that it is somehow the user's own fault and that they should have memorized every quirk in the software in order to use it correctly (or at all). I'm certain some of you out there on the Internet had already started writing a strongly worded message to let me know that. Don't bother.

Whenever you have a piece of software that silently destroys user data, the fault always, always, ALWAYS lies with the program. Even if "it only happens rarely". Even if you think "it's the user's fault". Even if you personally know how the problem could have been avoided. The flaw is ABSOLUTELY ALWAYS in the software. Never in users. Ever.

Any attempt at shifting the cause to the user, for whatever reason, is victim blaming. Don't do it.

maanantai 16. heinäkuuta 2018

How expensive is globbing for sources in large projects

A common holy war in build systems is whether you should explicitly list all sources that make up a target or use a globbing pattern. There are both technical and non-technical arguments on both sides. The latter mostly deal with reliability and flexibility vs convenience. In this post we are going to ignore them completely and instead focus on the technical parts, specifically the overhead of globbing. The measurement script used can be downloaded from this repo.

In this test we used the full checkout of Chromium source code. The tests were run under Windows, since it is noticeably slower than Linux on both file operations and process invocations. The task simulation consists of roughly three parts:

  1. Scan the source tree for all directories that contain sources
  2. Generate glob patterns for detected directories (corresponding roughly to "one target for all sources in one directory")
  3. Run the globs
This ignores a bunch of steps, such as serialising the glob results to files and calculating the delta between two glob sets. These are probably fairly fast compared to file access operations, though.

Scanning the source tree and generating the globs

There is no direct correlation between this step and a regular build system. It is mostly interesting as a comparison between file operations between a hot and a cold cache. Running the scan on a cold cache takes 2 minutes but for a warm cache about 6 seconds.

Since this step is always run first, the following tests are all operating with a hot cache.

The actual globbing

Running all globs on the Chromium source tree takes between 2 and 6 seconds. This is the absolute lowest time that can be obtained for a no-op build without daemons because all globs must be re-evaluated every time.

The rule of thumb for UI design is that everything under one second is perceived as instantaneous. This means that for these sizes globbing causes a noticeable delay. Whether this is seen as insignificant or aggravating depends on each user.

Extra bonus: C++ modules

Since we have the measurement script, let's use it for something more interesting. Modules are an upcoming C++ feature to increase build times and a ton of other coolness depending on who you ask. The current specification works by having a kind of "module export declaration" at the beginning of source files. The idea is that you first compile those to generate a sort of a module declaration file and then you can start the actual compilation that uses said files.

If you thought "waitaminute, that sounds exactly like how FORTRAN is compiled", you are correct. Because of this it has the same problem that you can't compile source files in an arbitrary order, but instead you must first somehow scan them to find out the interdependencies between source (not header) files. In practice what this means is that instead of single-phase compilation all files must be processed twice. All scan operations must be done before any compilation jobs can start because otherwise you might start to compile a file before its dependencies are fully processed.

The scanning can be done in one of two ways. Either the build system scans the sources meaning it needs to understand the syntax of source files or the compiler can be invoked in a special preprocessing mode. Note that build systems such as Ninja do not do any such operations by themselves but instead always invoke external processes to do their work.

Testing the performance impact of these two is straightforward. The first one can be done by reading the first ten lines of each source file and then throwing them away. Measuring this time gives a fairly good estimate of the file processing overhead. The second way can be measured by doing the exact same thing but also invoking the compiler with no-op command line arguments to get the process invocation overhead.

Scanning the files directly takes roughly 120 seconds. For an 8 core machine this means a 15 second delay (at minimum) before any compilation tasks can begin. This is not great but for a full build it should be tolerable.

When spawning a compiler process the same operation takes 69 minutes. This is intolerably slow and would require an order of magnitude speedup in compilation times to be worthwhile. Unlike regular compilations, dependency scanning can not be sped up with unity builds because the specification requires that the module declaration must be at the very beginning of source files (and presumably there can not be more than one in a single TU).

keskiviikko 13. kesäkuuta 2018

Easy MSI installer creator

Shipping programs on Windows platforms becomes a lot simpler (especially in corporate environments) if you can create an MSI installer. The only Free software solution for that is the WiX installer toolkit. The fairly big downside to this is that it very much tied to how Visual Studio does things with GUIDs and all that. The installer's contents and behavior is defined with an XML file whose format is both verbose and confusing.

Most Unix developers, once faced with this, will almost immediately blurt out something like "Why can't I just do DESTDIR=c:\some\path ninja install and have it make an installer out of the result?" So I created a script that does exactly that.

The basic usage is simple. First you do a staged install into some directory and create a JSON file describing the installation that would look like this:

{
    "update_guid": "YOUR-GUID-HERE",
    "version": "1.0.0",
    "product_name": "Product name here",
    "manufacturer": "Your organization's name here",
    "name": "Name of product here",
    "name_base": "myprog",
    "comments": "A comment describing the program",
    "installdir": "MyProg",
    "license_file": "License.rtf",
    "parts": [
        {"id": "MainProgram",
         "title": "Program name",
         "description": "The MyProg program",
         "absent": "disallow",
         "staged_dir": "staging"
        }
    ]
}

Running the script would then create a standalone MSI installer with the contents of the staging directory.

Multiple components in one installer

Some programs ship with multiple parts that the user can choose whether to install each part. This is supported by the script. First you must split the files in multiple staging directories, one per component and then add entries to the parts array. See the repository for an example.