Tuesday, November 19, 2019

Some intricacies of ABI stability

There is a big discussion ongoing in the C++ world about ABI stability. People want to make a release of the standard that does a big ABI break, so a lot of old cruft can be removed and made better. This is a big and arduous task, which has a lot of "fun" and interesting edge, corner and hypercorner cases. It might be interesting to look at some of the lesser known ones (this post is not exhaustive, not by a long shot). All information here is specific to Linux, but other OSs should be roughly similar.

The first surprising thing to note is that nobody really cares about ABI stability. Even the people who defend stable ABIs in the committee do not care about ABI stability as such. What they do care about is that existing programs keep on working. A stable ABI is just a tool in making that happen. For many problems it is seemingly the only tool. Nevertheless, ABI stability is not the end goal. If the same outcome can be achieved via some other mechanism, then it can be used instead. Thinking about this for a while leads us to the following idea:
Since "C++ness" is just linking against libstdc++.so, could we not create a new one, say libstdc++2.so, that has a completely different ABI (and even API), build new apps against that and keep the old one around for running old apps?
The answer to this question turns out to be yes. Even better, you can already do this today on any recent Debian-based distribution (and probably most other distros too, but I have not tested). By default the Clang C++ compiler shipped by the distros uses the GNU C++ standard library. However, you can install the libc++ standard library via system packages and use it with the -stdlib=libc++ command line argument. If you go even deeper, you find that the GNU standard library's name is libstdc++.so.6, meaning that it has already had five ABI-breaking updates.
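
For illustration, a minimal invocation might look like the following (the source file name is made up; the exact package you need to install varies by distro):

clang++ -stdlib=libc++ -o myapp myapp.cpp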

So … problem solved then? No, not really.

Problem #1: the ABI boundary

Suppose you have a shared library built against the old ABI that exports a function that looks like this:

void do_something(const std::unordered_map<int, int> &m);

If you build code with the new ABI and call this function, the two sides disagree on the bit representation of the unordered map. The caller has a pointer to a bunch of bits in the new representation whereas the callee expects bits in the old representation. This code compiles and links, but will invoke UB at runtime when called and, at best, crash your app.
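
To make the mismatch concrete, here is a minimal sketch of the kind of disagreement involved. The struct layouts below are invented for illustration and are not the real libstdc++ internals:

#include <cstddef>

// What the old-ABI library was compiled against (hypothetical layout).
struct old_unordered_map_rep {
    void*       buckets;
    std::size_t bucket_count;
    std::size_t element_count;
};

// What the new-ABI caller builds and passes in (hypothetical layout).
struct new_unordered_map_rep {
    std::size_t element_count;
    float       max_load_factor;
    void*       buckets;
    std::size_t bucket_count;
};

// do_something() interprets the memory behind the reference as the old
// layout, but the caller constructed an object with the new layout, so
// every field access reads the wrong bytes. That is undefined behaviour.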

Problem #2: the hidden symbols

This one is a bit complicated and needs some background information. Suppose we have a shared library foo that is implemented in C++ but exposes a plain C API. Internally it makes calls to the C++ standard library. We also have a main program that uses said library. So far, so good, everything works.

Let's add a second shared library called bar that is also implemented in C++ and exposes a C API. We can link the main app against both of these libraries, call them, and everything works.

Now comes the twist. Let's compile the bar library against a new C++ ABI. The result looks like this:


A project mimicking this setup can be obtained from this GitHub repo. In it the abi1 and abi2 libraries both export a function with the same name that returns an int that is either 1 or 2. Libraries foo and bar check the return value and print a message saying whether they got the value they were expecting. It should be reiterated that the use of the abi libraries is fully internal. Nothing about them leaks to the exposed interface. But when we compile and run the program that calls both libraries, we get the following output snippet:

Foo invoked the correct ABI function.
Bar invoked the wrong ABI function.

What has happened is that both libraries invoked the function from abi1, which means that in the real world bar would have crashed in the same way as in problem #1. Alternatively both libraries could have called abi2, which would have broken foo. Determining when this happens is left as an exercise to the reader.

The reason this happens is that the functions in abi1 and abi2 have the same mangled name, combined with the fact that symbol lookup is global to a process. Once any given name has been resolved, all uses anywhere in the same process point to the same entity. This happens even for non-weak symbols.
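
Here is a stripped-down sketch of that setup (the function name is mine; the repo's actual names may differ):

// abi1.cpp, compiled into the old-ABI support library
int abi_version() { return 1; }   // mangles to _Z11abi_versionv

// abi2.cpp, compiled into the new-ABI support library
int abi_version() { return 2; }   // mangles to the exact same _Z11abi_versionv

// foo links against abi1, bar links against abi2. But because the dynamic
// linker resolves the name _Z11abi_versionv once, globally, for the whole
// process, both foo and bar end up calling whichever definition happened
// to be found first.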

Can this be solved?

As far as I know, there is no known real-world solution to this problem that would scale to a full operating system (i.e. all of Debian, FreeBSD or the like). If any university professors reading this need problems for their grad students, this could be one of them. The problem itself is fairly simple to formulate: make it possible to run two different, ABI-incompatible C++ standard libraries within one process. The solution will probably require changes in the compiler, linker and runtime loader. For example, you might extend symbol resolution rules so that they are not global; instead, symbols used by, say, library bar would first be looked up in its direct dependencies (in this case only abi2) and only after that in other parts of the tree.

To get you started, here is one potential solution I came up with while writing this post. I have no idea if it actually works, but I could not come up with an obvious thing that would break. I sadly don't have the time or know-how to implement this, but hopefully someone else has.

Let's start by defining that the new ABI is tied to C++23 for simplicity. That is, code compiled with -std=c++23 uses the new ABI and links against libstdc++.so.7, whereas older standard versions use the old ABI. Then we take the Itanium ABI specification and change it so that all mangled names start with, say, _^ rather than _Z as currently. Now we are done. The different ABIs mangle to different names and thus can coexist inside the same process without problems. One would probably need to do some magic inside the standard library implementations so they don't trample on each other.
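
As a purely hypothetical illustration (the _^ name below is made up, since no such mangling exists in any real tool):

void do_something(int) {}
// Compiled as C++17 against libstdc++.so.6 this mangles, under the current
// Itanium rules, to:       _Z12do_somethingi
// Compiled as C++23 against the new libstdc++.so.7 it would instead become
// something like:          _^12do_somethingi
// The two names can never collide, so both ABIs can coexist in one process.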

The only problem this does not solve is calling a shared library with a different ABI. This can be worked around by writing small wrapper functions that expose an internal "C-like" interface and can call the external functions directly. These wrappers can live inside the same library without problems, because the two standard libraries can coexist in the same shared library just fine. There is a bit of a performance and maintenance penalty during the transition, but it will go away once all code is rebuilt with the new ABI.
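
A sketch of what such a wrapper could look like, reusing the do_something function from problem #1 (the wrapper name and the flattened parameter list are made up for illustration):

#include <cstddef>
#include <unordered_map>

// The old-ABI library function from problem #1.
void do_something(const std::unordered_map<int, int>& m);

// Compiled together with old-ABI code, but exposes only plain C types,
// so new-ABI code in the same process can call it without ever touching
// the old ABI's std::unordered_map representation.
extern "C" void do_something_wrapper(const int* keys,
                                     const int* values,
                                     std::size_t n) {
    std::unordered_map<int, int> m;
    for (std::size_t i = 0; i < n; ++i)
        m.emplace(keys[i], values[i]);
    do_something(m);
}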

Even with this, the transition is not a lightweight operation. But if you plan properly ahead and do the switch, say, once every two standard releases (six years), it should be doable.

Sunday, November 17, 2019

What is -pipe and should you use it?

Every now and then you see people using the -pipe compiler argument. This is particularly common in vintage handwritten Makefiles. Meson even uses the argument by default, but what does it actually do? The GCC man page says the following:

-pipe
    Use pipes rather than temporary files for communication
    between the various stages of compilation.  This fails
    to work on some systems where the assembler is unable to
    read from a pipe; but the GNU assembler has no trouble.

So presumably this is a compile speed optimization. What sort of an improvement does it actually provide? I tested this by compiling LLVM on my desktop machine both with and without the -pipe command line argument. Without it I got the following time:

ninja  14770,75s user 799,50s system 575% cpu 45:04,98 total

Adding the argument produced the following timing:

ninja  14874,41s user 791,95s system 584% cpu 44:41,22 total

This is an improvement of less than one percent. Given that I was using my machine for other things at the same time, the difference is most likely statistically insignificant.

This argument may have been needed in ye olden times of supporting tens of broken commercial Unixes. Nowadays the only platform where this might make a difference is Windows, given that its file system is a lot slower than Linux's. But is its pipe implementation any faster? I don't know, and I'll let other people measure that.

The "hindsight is perfect" design lesson to be learned

Looking at this now, it is fairly easy to see that this command line option should not exist. Punting to the user the responsibility of knowing whether files or pipes are faster (or even work) on any given platform is poor usability. Most people don't know that, and the performance characteristics of operating systems change over time. Instead, this should be handled inside the compiler with logic roughly like the following:

if(assembler_supports_pipes(...) &&
   pipes_are_faster_on_this_platform(...)) {
    communicate_with_pipes();
} else {
    communicate_with_files();
}

Sunday, October 13, 2019

Apple of 2019 is the Linux of 2000

Last week the laptop I use for macOS development said that there was an Xcode update available. I tried to install it, but it said that there was not enough free space available to run the installer. So I deleted a bunch of files and tried again. Still the same complaint. Then I deleted some unused VM images. Those freed up a few dozen gigabytes, so it should have made things work. I even emptied the trash can to make sure nothing lingered around. But even this did not help; I still got the same complaint.

At this point it was time to get serious and launch the terminal. And, true enough, according to df the disk had only 8 gigabytes of free space even though I had just deleted over 40 gigabytes of files from it (using rm, not the GUI, so things really should have been gone). A lot of googling and poking later I discovered that all the deleted files had gone to "reserved space" on the file system. There was no way to access those files or delete them. According to the documentation, the operating system would delete those files "on demand as more space is needed". This was not very comforting, because the system most definitely was not doing that, and you'd think that Apple's own software would get this right.

After a ton more googling I managed to find a chat buried somewhere deep in Reddit which listed the magical incantation that purges reserved space. It consisted of running tmutil from the command line and giving it a bunch of command line arguments that did not seem to make sense or have any correlation to the thing that I wanted to do. But it did work, and eventually I got Xcode updated.

After my blood pressure dropped to healthier levels I got the strangest feeling of déjà vu. This felt exactly like using Linux in the early 2000s. Things break at random for reasons you can't understand and the only way to fix it is to find terminal commands from discussion forums, type them in and hope for the best. Then it hit me.

This was not an isolated incident. The parallels are everywhere. Observe:

External monitors

Linux 2000: plugging in an external monitor will most likely not work. Fanboys are very vocal that this is the fault of monitor manufacturers for not providing modeline info.

Apple 2019: plugging in an external projector will most likely not work. Fanboys are very vocal that this is the fault of projector manufacturers for not ensuring that their HW works with every Apple model.

Software installation

Linux 2000: There is only One True Way of installing software: using distro packages. If you do anything else you are bad and you should feel bad.

Apple 2019: There is only One True Way of installing software: using the Apple store. If you do anything else you are bad and you should feel bad.

Hardware compatibility

Linux 2000: only a limited set of hardware works out of the box, even among popular devices like 3D graphics cards. Things either don't work at all, have reduced functionality, or kinda work but fail spuriously every now and then for no discernible reason.

Apple 2019: only a limited set of hardware works out of the box, even among popular devices like Android phones. Things either don't work at all, have reduced functionality, or kinda work but fail spuriously every now and then for no discernible reason.

Technical support

Linux 2000: if your problem is not google-trivial, there's nothing you can do. Asking friends for assistance does not help, because they will just type your problem description into Google and read the first hit.

Apple 2019: if your problem is not google-trivial, there's nothing you can do. Calling Apple's tech support line does not help, because they will just type your problem description into Google and read the first hit.

Laptop features

Linux 2000: it is very difficult to find a laptop with more than two USB ports.

Apple 2019: it is very difficult to find a laptop with more than two USB ports.

Advocate behaviour

Linux 2000: fanboys will let you know in no uncertain terms that their system is the best and will take over all desktop computer usage. Said fanboys are condescending elitist computer nerds.

Apple 2019: fanboys will let you know in no uncertain terms that their system is the best and will take over all desktop computer usage. Said fanboys are condescending elitist hipster latte web site designers.

Friday, September 27, 2019

A look into building C++ modules with a scanner

At CppCon there was a presentation on building C++ modules using a standalone dependency scanner executable provided by the compiler toolchain. The integration (as I understand it) would go something like this:

  1. The build system creates a Ninja file as usual
  2. It sets up a dependency so that every compilation job depends on a prescan step.
  3. The scanner goes through all source files (using compile_commands.json), determines module interdependencies and writes them out to a file in a format that Ninja understands.
  4. After the scan step, Ninja will load the file and use it to execute commands in the correct order.
This seems like an interesting approach for experimentation, but unfortunately it depends on functionality that is not yet in Ninja. It is unclear if and when it would be added, as Ninja's current maintainers are extremely conservative about adding any new code. It is quite difficult to run experiments on an approach when neither usable code nor all the required features exist in the various parts of the toolchain.

Can we do it regardless? Yes we can!

Enter self-modifying build system code

The basic approach is simple:
  1. Write a Ninja file as usual, but make all the top-level commands (or, for this test, only the all target) run a secret internal command.
  2. The command will do the scanning, and change the Ninja file on the fly, rewriting it to have the module dependency information.
  3. Invoke Ninja on the new file giving it a secret target name that runs the actual build.
  4. Build proceeds as usual.
The code that does this can be found in the vsmodtest branch in the main Meson repository. To run it you need to use Visual Studio's module implementation; the test project is in the modtest directory. It actually does work, but there are a ton of disclaimers:
  • incremental builds probably won't work
  • the resulting binary never finishes (it is running a job with exponential complexity)
  • it does not work on any other project than the demo one (but it should be fixable)
  • the dependencies are on object files rather than module BMI files due to a Ninja limitation
  • module dep info is not cached, all files are fully rescanned every time
  • the scanner is not reliable, it does the equivalent of dumb regex parsing
  • any and all things may break at any time and if they do you get to keep both pieces
All in all, nothing even close to production ready, but a fairly nice experiment for ~100 lines of Python. It is of course a hack and should not go anywhere near production, but assuming Ninja gets all the required extra functionality it could probably be made to work reliably.

Is this the way C++ module building will work?

Probably not, because there is one major use case that this approach (or indeed any content-scanning approach) does not support: code generation. Scanning assumes that all source code is available at the same time, but if you generate source code on the fly, this is not the case. There would need to be some mechanism for making Ninja invoke the scanner anew every time new source files appear, and such a mechanism does not exist as far as I know. Even if it did, there is a lot of state to transfer between Ninja and the scanner to ensure dependency scans that are both reliable and minimal.

There are alternative approaches one can take to avoid the need for scanning completely, but they have their own downsides.

Sunday, September 1, 2019

New Meson mugs, get them while they're hot!

Since there is no Meson swag available, I decided to make some. Here is the first one.


Now sadly there is no web store, and shipping ceramics is expensive, so the only way to get one is to be in the same physical space as me. Unless you live in Finland (or have friends that do and can be persuaded to get one for you), the only real way is to meet up with me at some conference. The next one I'll be going to is CppCon in two weeks. I'm going to be on vacation in New York for one week before that, so that is also an option. Supply is limited by how many mugs I can pack in my luggage and how many of those survive the airport baggage handlers.

Bonus meta logo picture


Friday, August 9, 2019

C++ modules with a batch compiler feasibility study

In our previous blog post an approach for building C++ modules using a batch compiler model was proposed. The post raised a lot of discussion on Reddit. Most of it was of the quality one would expect, but some interesting questions were raised.

How much work for compiler developers?

A fair issue that was raised is that the batch approach means adding new functionality to the compiler. This is true, but not particularly interesting as a standalone statement. The real question is how much more work it is, especially compared to the work needed to support other module implementation methods.

In the batch mode there are two main pieces of work. The first is starting compiler tasks, detecting when they freeze due to missing modules and resuming them once the modules they require have been built. The second one is calculating the minimal set of files to recompile when doing incremental builds.

The former is the trickier one, because no one has implemented it yet. The latter is a well-known subject; build systems like Make and Ninja have been doing it for 40+ years. To test the former I wrote a simple feasibility study in Python. What it does is generate 100 source files containing modules that call each other and then compile them all in the manner a batch compiler would. There is no need to scan the contents of the files; the system automatically detects the correct build order or errors out if it can not be done.

Note that this is just a feasibility experiment. There are a lot of caveats. Please go through the readme before commenting. The issue you want to raise may already be addressed there. Especially note that it only works with Visual Studio.

The code that is responsible for running the compile is roughly 60 lines of Python. It is conceptually very simple. A real implementation would need to deal with threading, locking and possibly IPC, which would take a fair bit of work.

The script does not support incremental builds. I'd estimate that getting a minimal version going would take something like 100-200 lines of Python.

I don't have any estimates on how much work this would mean on a real compiler code base.

The difference to scanning

A point raised in the Reddit discussion is that there is an alternative approach that uses richer communication between the build system and the compiler. If you go deeper you may find that the approaches are not that different after all. It's more of a question of how the pie is sliced. The scanning approach looks roughly like this:


In this case the build system and compiler need to talk to each other, somehow, via some sort of a protocol. This protocol must be implemented by all compilers and all build tools, and they must all interoperate. It must also remain binary stable and all that. There is a proposal for this protocol. The specification is already fairly long and complicated, especially considering that it supports versioning, so future versions may differ in arbitrary ways. This has an ongoing maintenance cost of unknown size. It also complicates distributed builds, because it is common for the build system and the compiler to reside on different machines or even data centres, so setting up an IPC channel between them may take a fair bit of work.

The architectural diagram of the batch compiler model looks like this:


Here the pathway for communicating module information is a compiler implementation detail. Every compiler can choose to implement it in the way that fits its internals best, and can change that implementation whenever it wishes, because it is never exposed outside the compiler. The only public-facing API that needs to be kept stable is the compiler command line, which all compilers already provide. This also permits them to ship a module implementation faster, since there is no need to fully specify the protocol and do interoperability tests. The downside is that the compiler needs to determine which sources to recompile when doing incremental builds.

Who will eventually make the decision?

I don't actually know. But probably it will come down to what the toolchain providers (GCC, Clang, Visual Studio) are willing to commit to.

It is my estimate (which is purely a guess, because I don't have first hand knowledge) that the batch system would take less effort to implement and would present a smaller ongoing maintenance burden. But compiler developers are the people who would actually know.

Wednesday, August 7, 2019

Building C++ modules, take N+1

Modules were voted into C++20 some time ago. They are meant to be a replacement for #include statements, to increase build speeds and also to isolate translation units so that, for example, macros defined in one file do not affect the contents of another file. There are three major compilers and each of them has its own prototype implementation available (GCC documentation, Clang documentation, VS documentation).

As you would expect, all of these implementations are wildly different and, in the grand C++ tradition, byzantinely complicated. None of them has a really good solution to the biggest problem of C++ modules either, namely dependency tracking. A slightly simplified but mostly accurate description of the problem goes like this:

Instead of header files, all source code is written in one file. It contains export statements that describe what functions can be called from the outside. An analogy would be that exported functions would be in a public header file and everything else would be internal and declared in an internal header file (or would be declared static or similar). The module source can not be included directly; instead, when you compile it, the compiler outputs an object file and also a module interface file. The latter is just some sort of a binary data file describing the module's interface. An import statement works by finding this file and reading it in.
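
As a minimal illustration (module file extensions and compiler flags vary between the three implementations, so treat this as a sketch rather than something that builds everywhere as-is):

// speaker.cppm -- defines a module; compiling this produces both an
// object file and a module interface file.
module;                // global module fragment for ordinary #includes
#include <string>

export module speaker;

export std::string greeting() { return "hello"; }

std::string helper() { return "internal"; }   // not exported, invisible outside

// main.cpp -- can only be compiled after speaker's interface file exists.
#include <iostream>
import speaker;

int main() {
    std::cout << greeting() << '\n';
}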

If you have file A that defines a module and file B that uses it, you need to first fully compile file A and only after the module interface file has been created can you compile file B. Traditionally C and C++ files can be compiled in parallel because everything needed to compile each file is already in the header files. With modules this is no longer the case. If you have ever compiled Fortran and this seems familiar, it's because it is basically the exact same architecture.

Herein lies the problem

The big, big problem is how to determine the order in which to build the sources. Just looking at the files is not enough; you seemingly need to know something about their contents. At least the following approaches have been toyed with:
  1. Writing the dependencies between files manually in Makefiles. Yes. Really. This has actually been put forth as a serious proposal.
  2. First scan the contents of every file, determine the interdependencies, write them out to a separate dependency file and then run the actual build based on that. This requires parsing the source files twice, and it has to be done by the compiler rather than a regex because you can define modules via macros (at least in VS currently). A naive sketch of this scanning step is shown below.
  3. When the compiler finds a module import it can not resolve, it asks the build system via IPC to generate one. Somehow.
  4. Build an IPC mechanism between the different compiler instances so they can talk to each other to determine where the modules are. This should also work between compilers that are in different data centers when doing distributed builds.
Some of these approaches are better than others but all of them fail completely when source code generators enter the picture, especially if you want to build the generator executable during the build (which is fairly common). Scanning all file contents at the start of the build is not possible in this case, because some of the source code does not yet exist. It only comes into existence as build steps are executed. This is hideously complicated to support in a build system.
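
To make the scanning idea of approach #2 concrete, here is a deliberately naive sketch of the scanning step. This is exactly the dumb-regex version that, as noted above, a real implementation can not use because modules can be declared via macros; all names here are made up:

#include <fstream>
#include <iostream>
#include <map>
#include <regex>
#include <string>
#include <vector>

int main(int argc, char** argv) {
    const std::regex export_re(R"(^\s*export\s+module\s+([\w.:]+)\s*;)");
    const std::regex import_re(R"(^\s*import\s+([\w.:]+)\s*;)");
    std::map<std::string, std::string> provider;               // module -> file
    std::map<std::string, std::vector<std::string>> imports;   // file -> modules
    for (int i = 1; i < argc; ++i) {
        std::ifstream src(argv[i]);
        std::string line;
        std::smatch m;
        while (std::getline(src, line)) {
            if (std::regex_search(line, m, export_re))
                provider[m[1]] = argv[i];
            else if (std::regex_search(line, m, import_re))
                imports[argv[i]].push_back(m[1]);
        }
    }
    // A real scanner would emit this in whatever format the build system
    // wants; here the dependency edges are just printed.
    for (const auto& [file, mods] : imports)
        for (const auto& mod : mods)
            if (provider.count(mod))
                std::cout << file << " needs " << provider[mod] << '\n';
}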

Is there a better solution?

There may well be, though I'd like to emphasize that none of the following has actually been tested and that I'm not a compiler developer. The approach itself does require some non-trivial amount of work on the compiler, but it should be less than writing a full blown IPC mechanism and distributed dataflow among the different parts of the system.

At the core of the proposed approach is the realization that not all module dependencies between files are the same. They can be split into two different types. This is demonstrated in the following diagram that has two targets: a library and an executable that uses it.


As you can see the dependencies within each target can get fairly complicated. The dependencies between targets can be just as complicated, but they have been left out of the picture to keep it simple. Note that there are no dependency cycles anywhere in the graph (this is mandated by the module specification FWICT). This gives us two different kinds of module dependencies: between-targets module dependencies and within-targets module dependencies.

The first one of these is actually fairly simple to solve. If you complete all compilations (but not the linking step) of the dependency library before starting any compilations in the executable, then all library module files that the executable could possibly need are guaranteed to exist. This is easy to implement with e.g. a Ninja pseudotarget.

The second case is the difficult one and leads to all the scanning problems and such discussed above. The proposed solution is to slightly change the way the compiler is invoked. Rather than starting one process per input file, we do something like the following:

g++ <other args> --outdir=somedir [all source files of this target]

What this means conceptually is that the compiler needs to take all the input files and compile each of them. Thus file1.cpp should end up as somedir/file1.o and so on. In addition it must deal with this target's internal module interrelations transparently behind the scenes. When run again it must detect which output files are up to date and rebuild only the outdated ones.

One possible implementation is that the compiler may launch one thread per input file (but no more than there are CPUs available). Each compilation proceeds as usual but when it encounters a module import that it can not find, it halts and waits on, say, a condition variable. Whenever a compilation job finishes writing a module, it will signal all the other tasks that a new module is available. Eventually either all jobs finish or every remaining task is deadlocked because they need a module that can't be found anywhere.
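
A toy sketch of that waiting mechanism (all names invented, and deadlock detection among the blocked tasks is omitted):

#include <condition_variable>
#include <mutex>
#include <set>
#include <string>

std::mutex module_mutex;
std::condition_variable module_ready;
std::set<std::string> available_modules;

// Called by a compilation task when it has finished writing a module
// interface file; wakes up everyone waiting for new modules.
void publish_module(const std::string& name) {
    {
        std::lock_guard<std::mutex> lock(module_mutex);
        available_modules.insert(name);
    }
    module_ready.notify_all();
}

// Called by a compilation task when it hits an import it can not resolve;
// blocks until some other task publishes that module.
void wait_for_module(const std::string& name) {
    std::unique_lock<std::mutex> lock(module_mutex);
    module_ready.wait(lock, [&] { return available_modules.count(name) != 0; });
}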

This approach is similar to the IPC mechanism described in GCC's documentation, but it is never exposed to any third party program. It is fully an internal implementation detail of the compiler and as such there are no security risks or stability requirements for the protocol.

With this approach we can handle both internal and external module dependencies reliably. There is no need to scan the sources twice or write complicated protocols between the compiler and the build system. This even works for generated sources without any extra work, which no other proposed approach seems to be able to do.

As an added bonus the resulting command line API is so simple it can even be driven with plain Make.

Extra bits

This approach also permits one to do ZapCC-style caching. Since compiler arguments for all sources within one target must be the same under this scheme (which is a good thing to do in general), imports and includes can potentially be shared between different compiler tasks. Even further, suppose you have a type that is used in most sources, like std::vector<std::string>. Normally the instantiated and compiled code would need to be written in every object file for the linker to eventually deduplicate. In this case, since we know that all outputs will go to the same target, it is enough to write the code out in one object file. This can lead to major build footprint reductions. It should also reduce the linker's memory usage, since there is a lot less debug info to manage. In most large projects linking, rather than compiling, is the slow and resource-intensive step, so making it faster is beneficial.

The module discussions have been going around in circles about either "not wanting to make the compiler a build system" or "not wanting to make the build system into a compiler". This approach does neither. The compiler is still a compiler, it just works with slightly bigger work chunks at a time. It does need to track staleness of output objects, though, which it did not need to do before.

There needs to be some sort of a load balancing system so that you don't accidentally spawn N compile jobs each of which spawns N internal work threads.

If you have a project that consists of only one executable with a gazillion files and you want to use a distributed build server then this approach is not the greatest. The solution to this is obvious: split your project into independent logical chunks. It's good design and you should do it regardless.

The biggest downside of this approach that I could come up with is that CCache will probably no longer work without a lot of effort. But if modules make compilation 5-10× faster (which is the estimate that has been given; there are no public independent measurements yet), then it could be worth it.