Saturday, August 18, 2018

Linker symbol lookup order does not work the way you think

A common source of linking problems is circular dependencies. Suppose you have a program that looks like this:

Here program calls into function one, which is in library A. That calls into function two, which is in library B. Finally that calls into function three, which is back in library A again.

Let's assume that we use the following linker line to build the final executable:

gcc -o program prog.o liba.a libb.a

Because linkers were originally designed in the 70s, they are optimized for minimal resource usage and process their arguments left to right in a single pass. In this particular case the linker will first process the object file and then library A. It detects that function one is needed, takes that function's implementation and then throws the rest of library A away. It then processes library B, takes function two into the final program and notes that function three is also needed. Because library A was already thrown away, the linker cannot find three and errors out. The traditional fix is to specify library A twice on the command line.

This is how everyone has been told things work, and if you search the Internet you will find many pages explaining this and how to set up linker command lines correctly.

But is this what actually happens?

Let's start with Visual Studio

Suppose you were to do this in Visual Studio. What do you think would happen? There are four different possibilities:
  1. Linking fails with missing symbol three.
  2. Linking succeeds and program works.
  3. Either 1. or 2. happens, but there is not enough information to tell which.
  4. Linking succeeds but the final executable does not run.
The correct answer is 2. Visual Studio's linker is smart, keeps all specified libraries open and uses them to resolve symbols. This means that you don't have to add any library on the command line twice.

Onwards to macOS

Here we have the same question as above, but using macOS's default linker. The choices are also the same as above.

The correct answer is also 2. The macOS linker keeps all libraries available for symbol resolution, just like Visual Studio's.

What about Linux?

What happens if you do the same thing on Linux using the default GNU linker? The choices are again the same as above.

Most people would probably guess that the correct answer here is 1. But it's not. What actually happens is 3. That is, the linking can either succeed or fail depending on external circumstances.

The difference here is whether functions one and three are defined in the same source file (and thus end up in the same object file) or not. If they are in the same source file, linking succeeds; if they are in separate files, it fails. This indicates that GNU ld does not internally work at the level of individual symbols, but instead copies object files out of the AR archive wholesale whenever any of their symbols is needed.

What does this mean?

For example, it means that if you build your targets as unity builds, the symbol resolution logic changes. This is probably quite rare, but extremely confusing when it happens. You might also have a fully working build that breaks when you move a function from one source file to another. This really should not happen, but when it does, debugging it is very confusing.

The bigger issue here is that symbol resolution works differently on different platforms. Normally this should not matter, because symbol names must be unique (or they must be weak symbols, but let's not go there) or the behaviour is undefined. It does, however, place a big burden on cross platform projects and build systems, because you need very complex logic in place if you wish to deduplicate linker flags. This is a fairly common occurrence even if you don't have circular dependencies. For example, when building GStreamer with Meson some time ago, the undeduplicated linker line contained hundreds of duplicated library entries (it still does, but not nearly as many).

The best possible solution would be for GNU ld to start behaving the same way as the VS and macOS linkers. That way all major platforms would behave the same and things would get a lot simpler. In the meantime one should be able to simulate this with linker grouping flags:
  1. Go through all linker arguments and split them to libraries that use link_whole and those that don't. Throw away any existing linker grouping.
  2. Deduplicate the link_whole libraries and put them at the beginning of the link line with the requisite whole-archive arguments.
  3. Deduplicate all entries in the list of libraries that don't get linked fully.
  4. Put the result of 3 on the command line in a single linker group.
This should work and would match fairly accurately what the VS and macOS linkers already do, so cross platform projects should work out of the box.
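The steps above can be sketched in a few lines of Python. This is a hypothetical helper, not Meson's actual implementation; the link_whole marking is assumed to be supplied by the build system along with each library:

```python
def build_link_line(libs):
    """libs is a list of (archive, is_link_whole) pairs, in original order,
    with any pre-existing grouping flags already stripped out (step 1)."""
    whole, regular, seen = [], [], set()
    for lib, is_link_whole in libs:
        if lib in seen:
            continue                 # steps 2 and 3: deduplicate
        seen.add(lib)
        (whole if is_link_whole else regular).append(lib)
    line = []
    for lib in whole:                # step 2: link_whole libraries go first
        line += ["-Wl,--whole-archive", lib, "-Wl,--no-whole-archive"]
    # step 4: one group around the rest, so GNU ld rescans until stable
    return line + ["-Wl,--start-group"] + regular + ["-Wl,--end-group"]

print(build_link_line([("liba.a", False), ("libb.a", False),
                       ("liba.a", False)]))
```

The --whole-archive and --start-group spellings here are GNU ld's; other linkers use different flags for the same concepts.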

What about other platforms?

The code is here, feel free to try it out yourself.

Friday, August 17, 2018

The Internet of 200 Kilogram Things: Challenges of Managing a Fleet of Slot Machines

In a previous post we talked about Finland's Linux powered slot machines. It was mentioned that there are about 20 000 of these machines in total. It turns out that managing and maintaining all those machines is not as easy as it may first appear.

In the modern era of The Cloud, 20 thousand machines might not seem like much. Cloud management software such as Kubernetes is designed to handle fleets orders of magnitude larger without breaking a sweat, so "only" 20 thousand machines may seem like something one intern could manage in their spare time. In reality things get difficult, because slot machines pose many unique challenges that regular servers do not.

The data center

Large scale computer fleets are housed in data centers. Slot machines are not. They are scattered across Finland in supermarkets and gas stations. This means that any management solution based on central control is useless. Another way of looking at this is that the data center housing the machines is around 337 thousand square kilometers in size. It is left as an exercise to the reader to calculate the average distance between two nearest machines assuming they are distributed evenly over the surface area.
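To get a rough feel for the numbers, assume (unrealistically) that the machines form an even square grid over the whole country:

```python
import math

AREA_KM2 = 337_000          # approximate land area of Finland
MACHINES = 20_000

cell = AREA_KM2 / MACHINES  # each machine "owns" about 16.85 square kilometers
spacing = math.sqrt(cell)   # neighbor distance on an even square grid
print(round(spacing, 1))    # roughly 4.1 km between adjacent machines
```

Real machines cluster in towns, of course, which makes the remote ones even further apart than this average suggests.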

Every machine is needed

The mantra of current data center design is that every machine must be expendable. That is, any computer may break down at any time, but the end user does not notice this because all operations are hidden behind a reliable layer. Workloads can be transferred from one machine to another either in the same rack, or possibly even to the other side of the world without anyone noticing.

Slot machines have the exact opposite requirement: every machine must keep working all the time. If any machine breaks down, money is lost. Transferring the workload from a broken machine in the countryside to Frankfurt or Washington is not feasible, because it would also require moving the players to the new location. That is not very profitable, as atoms are much more expensive and slower to transfer between continents than electrons.

The reliability requirements are further increased by the machines' distributed locations. In sparsely populated areas the closest maintenance person may be more than 400 km away.

The Internet connection

Data centers nowadays have 10 Gb Ethernet connections or something even faster. In contrast it is the responsibility of the machine operator to provide a net connection to a slot machine. This means that the connections vary quite a lot. At the lowest end are locations that get poor quality 3G reception some of the time.

Remote management is also an issue. Some machines are housed in corporate networks behind ten different firewalls all administered by different IT provider organisations, some of which may be outsourced. Others are slightly less well protected but flakier. Being able to directly access any machine is the norm in data centers. Devices housed in random networks do not have this luxury.

The money problem

Slot machines deal with physical money, which makes them a prime target for criminals. The devices also cannot be physically secured: anyone must be able to walk up and touch a machine in order to play it. From a security point of view this is a challenging and unusual combination. Most companies would not leave their production servers out in the open for people to fiddle with, but for these devices that is a mandatory requirement.

The beer attack

Many machines are located in bars. That means they need to withstand the forces of angry intoxicated players. And, as we all know, drunk people are surprisingly inventive. A few years ago some people noticed that the machines have ventilation holes. They then discovered that pouring a pint of beer into those holes would short-circuit the machine, causing it to spit out all its coins.

This issue was fixed fairly quickly, because you really don't want to be in a situation where drunk people would have financial motivation to pour liquids on high voltage equipment in crowded rooms. This is not a problem one has to face in most data centers.

Update challenges

There are roughly two different ways of updating an operating system install: image based updates and package based updates. Neither of these works particularly well in slot machine usage. Games are big, so downloading full images is not feasible, especially for machines that have poor network connections. Package based updates have the major downside that they are not atomic. In desktop and server usage this is not really an issue because you can apply updates at a known good time. For remote devices this does not work because they can be powered off at any time without any warning. If this happens during an upgrade you have a broken machine requiring a physical visit from a maintenance person. As mentioned above this is slow and expensive.
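One standard way to make updates survive an ill-timed power cut, sketched here as a generic technique rather than what these machines actually use, is to ship each release as a self-contained directory and switch over with a single atomic rename:

```python
import os

def activate(version_dir, link="current"):
    """Point the 'current' symlink at version_dir atomically.

    The new tree is fully downloaded and unpacked before the switch, so a
    power cut leaves the machine either on the old version or on the new
    one, never in a half-updated state.
    """
    tmp = link + ".new"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(version_dir, tmp)
    os.replace(tmp, link)  # rename(2) is atomic on POSIX filesystems
```

On boot the machine simply runs whatever the symlink points at; an interrupted download never touches the live install. The price is double the disk space, which matters when games are big.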

Sunday, August 12, 2018

Implementing a distributed compilation cluster

Slow compilation times are a perennial problem. There have been many attempts at solving this by caching and distributing the work, such as distcc and Icecream. The main bottleneck in both of these is that some work must be done on the user's desktop machine and the result then transferred over the network. Depending on the implementation this may include fully preprocessing the source file and sending the result over the net (so it can be compiled on a worker machine without needing any system headers).

This means that the user machine can easily become the bottleneck. In order to remove this slowdown all the work would need to be done on worker machines. Thus the architecture we need would be something like this:

In this configuration the entire source tree is on a shared network drive (such as NFS). It is mounted in the same path on all build workers as well as the user's desktop machine. All workers and the desktop machine also must have an identical setup, that is, same compilers and installed dependencies. This is fairly easy to achieve with Docker or any similar container technology.

The main change needed to distribute the work is a compiler wrapper script, much like distcc's or icecc's, that sends the compilation request to the work distributor. The request consists only of a command line to execute and the path to run it in. The distributor looks up the machine with the smallest load, sends it the command, waits for the result and then returns it to the developer machine.
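The worker side of such a scheme is almost trivially small. The following is a hypothetical sketch, not the code from the repo: the wrapper ships the working directory and command line to the distributor, and the selected worker simply runs the command in that directory, relying on NFS to make the files visible:

```python
import subprocess

def handle(request):
    """Run one compile job on a worker.

    request = {"cwd": ..., "cmd": [...]}. No file transfer is needed,
    because the shared NFS mount exposes the same paths on every machine.
    """
    result = subprocess.run(request["cmd"], cwd=request["cwd"],
                            capture_output=True, text=True)
    return {"returncode": result.returncode,
            "stdout": result.stdout,
            "stderr": result.stderr}
```

The wrapper then prints the returned stdout and stderr and exits with the returned code, so the build system cannot tell that the compile ran elsewhere.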

Note that input and output files do not need to be transferred between the developer machine and the workers; NFS takes care of that automatically. This includes any changes the user has made in their local checkout that are not in revision control. The code that implements all of this (in an extremely simple, quick, dirty and unreliable way) can be found in this GitHub repo. The implementation is under 300 lines of Python.

Experimental results

Since I don't have a data center to spare I tested this on a single 8 core i7 computer. The "native OS" ran the NFS server and work distributor. The workers were two cloned Virtualbox images each having 2 cores. For testing I compiled LLVM, which is a fairly big C++ code base.

Using the wrapper is straightforward and consists of setting up the original build directory with this:

FORCE_INLINE=1 CXX='/path/to/wrapper workserver_address g++' cmake <options>

Force inline is needed so that configuration tests are run on the local machine. They write to /tmp, which is not shared, and the resulting executables might be run on a different machine than the one they were compiled on, leading to failures. This could also be solved with a shared temporary folder, but that would increase the complexity of this simple experiment.

Compiling the source just over NFS in a single machine using 2 cores took about an hour. Compiling it with two workers took about 47 minutes. This is not particularly close to the optimal time of 30 minutes so there is a fair bit of overhead in the implementation. Most of this is probably due to NFS and the fact that absolutely everything ran on the same physical machine. NFS also had coherency problems. Sometimes some process invocations could not see files created by their dependency tasks. The most common case was linker invocations, which were missing one or more object files. Restarting the build always made it pass. I tried to add sync commands as necessary but could not make it 100% reliable.

Miscellaneous things of note

In this test only compilation was parallelised. However this same approach works with every executable that is standalone, that is, it does not need to talk to any other ongoing process via IPC. Every build system that supports setting the compiler manually can be used with this scheme. It also works for parallelising tests for build systems that support invoking tests with an arbitrary runner. For example, in Meson you could do this:

meson test --wrapper='/path/to/wrapper workserver_address'

The system also works (in theory) identically on other operating systems such as macOS and Windows. Setting up the environment is even easier because most projects do not use "system dependencies" on those platforms, only the compiler. Thus on Windows you could mount an SMB share with the code on, say, D:\code on all machines and, assuming they have the same version of Visual Studio, it should just work (not actually tested).

Adding caching support is fairly easy. All machines need to have a common directory mounted, point CCACHE_DIR to that and set the wrapper command on the desktop machine to:

CXX='/path/to/wrapper workserver_address ccache g++'