Wednesday, August 7, 2019

Building C++ modules, take N+1

Modules were voted in C++20 some time ago. They are meant to be a replacement for #include statements to increase build speeds and to also isolate translation units so, for example, macros defined in one file do not affect the contents of another file. There are three major different compilers and each of them has their own prototype implementation available (GCC documentation, Clang documentation, VS documentation).

As you would expect, all of these implementations are wildly different and, in the grand C++ tradition, byzantinely complicated. None of them also have a really good solution to the biggest problem of C++ modules, namely that of dependency tracking. A slightly simplified but mostly accurate description of the problem goes like this:

Instead of header files, all source code is written in one file. It contains export statements that describe what functions can be called from the outside. An analogy would be that functions declared as exported would be in a public header file and everything else would be internal and declared in an internal header file (or would be declared static or similar). The module source can not be included directly, instead when you compile the source code the compiler will output an object file and also a module interface file. The latter is just some sort of a binary data file describing the module's interface. An import statement works by finding this file and reading it in.

If you have file A that defines a module and file B that uses it, you need to first fully compile file A and only after the module interface file has been created can you compile file B. Traditionally C and C++ files can be compiled in parallel because everything needed to compile each file is already in the header files. With modules this is no longer the case. If you have ever compiled Fortran and this seems familiar, it's because it is basically the exact same architecture.

Herein lies the problem

The big, big problem is how do you determine what order you should build the sources in. Just looking at the files is not enough, you seemingly need to know something about their contents. At least the following approaches have toyed with:
  1. Writing the dependencies between files manually in Makefiles. Yes. Really. This has actually been but forth as a serious proposal.
  2. First scan the contents of every file, determine the interdependencies, write them out to a separate dependency file and then run the actual build based on that. This requires parsing the source files twice and it has to be done by the compiler rather than a regex because you can define modules via macros (at least in VS currently).
  3. When the compiler finds a module import it can not resolve, it asks the build system via IPC to generate one. Somehow.
  4. Build an IPC mechanism between the different compiler instances so they can talk to each other to determine where the modules are. This should also work between compilers that are in different data centers when doing distributed builds.
Some of these approaches are better than others but all of them fail completely when source code generators enter the picture, especially if you want to build the generator executable during the build (which is fairly common). Scanning all file contents at the start of the build is not possible in this case, because some of the source code does not yet exist. It only comes into existence as build steps are executed. This is hideously complicated to support in a build system.

Is there a better solution?

There may well be, though I'd like to emphasize that none of the following has actually been tested and that I'm not a compiler developer. The approach itself does require some non-trivial amount of work on the compiler, but it should be less than writing a full blown IPC mechanism and distributed dataflow among the different parts of the system.

At the core of the proposed approach is the realization that not all module dependencies between files are the same. They can be split into two different types. This is demonstrated in the following diagram that has two targets: a library and an executable that uses it.


As you can see the dependencies within each target can get fairly complicated. The dependencies between targets can be just as complicated, but they have been left out of the picture to keep it simple. Note that there are no dependency cycles anywhere in the graph (this is mandated by the module specification FWICT). This gives us two different kinds of module dependencies: between-targets module dependencies and within-targets module dependencies.

The first one of these is actually fairly simple to solve. If you complete all compilations (but not the linking step) of the dependency library before starting any compilations in the executable, then all library module files that the executable could possibly need are guaranteed to exist. This is easy to implement with e.g. a Ninja pseudotarget.

The second case is the difficult one and leads to all the scanning problems and such discussed above. The proposed solution is to slightly change the way the compiler is invoked. Rather than starting one process per input file, we do something like the following:

g++ <other args> --outdir=somedir [all source files of this target]

What this means conceptually is that the compiler needs to take all the input files and compile each of them. Thus file1.cpp should end up as somedir/file1.o and so on. In addition it must deal with this target's internal module interrelations transparently behind the scenes. When run again it must detect which output files are up to date and rebuild only the outdated ones.

One possible implementation is that the compiler may launch one thread per input file (but no more than there are CPUs available). Each compilation proceeds as usual but when it encounters a module import that it can not find, it halts and waits on, say, a condition variable. Whenever a compilation job finishes writing a module, it will signal all the other tasks that a new module is available. Eventually either all jobs finish or every remaining task is deadlocked because they need a module that can't be found anywhere.

This approach is similar to the IPC mechanism described on GCC's documentation but it is never exposed to any third party program. It is fully an internal implementation detail of the compiler and as such there are no security risks or stability requirements for the protocol.

With this approach we can handle both internal and external module dependencies reliably. There is no need to scan the sources twice or write complicated protocols between the compiler and the build system. This even works for generated sources without any extra work, which no other proposed approach seems to be able to do.

As an added bonus the resulting command line API is so simple it can be even be driven with plain Make.

Extra bits

This approach also permits one to do ZapCC style caching. Since compiler arguments for all sources within one target must be the same under this scheme (which is a good thing to do in general), imports and includes can be potentially shared between different compiler tasks. Even further, suppose you have a type that is used in most sources like std::vector<std::string>.  Normally the instantiated and compiled code would need to be written in every object file for the linker to eventually deduplicate. In this case, since we know that all outputs will go to the same target it is enough to write the code out in one object file. This can lead to major build footprint reductions. It should also reduce the linker's memory usage since there is a lot less debug info to manage. In most large projects linking, rather than compiling, is the slow and resource intensive step so making it faster is beneficial.

The module discussions have been going around in circles about either "not wanting to make the compiler a build system" or "not wanting to make the build system into a compiler". This approach does neither. The compiler is still a compiler, it just works with slightly bigger work chunks at a time. It does need to track staleness of output objects, though, which it did not need to do before.

There needs to be some sort of a load balancing system so that you don't accidentally spawn N compile jobs each of which spawns N internal work threads.

If you have a project that consists of only one executable with a gazillion files and you want to use a distributed build server then this approach is not the greatest. The solution to this is obvious: split your project into independent logical chunks. It's good design and you should do it regardless.

The biggest downside of this approach that I could come up with was that CCache will probably no longer work without a lot of work. But if modules make compilation 5-10× faster (which is a given estimate, there are no public independent measurements yet) then it could be worth it.

1 comment:

  1. Yeah as far as I can tell, the only way to use a build server is the IPC solution where the compiler interacts with the build system to inform it about isolated dependency clusters that can be farmed out. Alternately, just go all-out and add distcc/ccache support to g++ directly.

    ReplyDelete