Monday, October 12, 2015

Some comments on the C++ modules talk

Gabriel Dos Reis gave a presentation on C++ modules at CppCon. The video is available here. I highly recommend watching it; it's good stuff.

However, as a build system developer, one thing caught my eye. The way modules are used (at around 40 minutes into the presentation) has a nasty quirk. The basic approach is that you have a source file foo.cxx, which defines a module Foobar. To compile it you would say this:

cl -c /module foo.cxx

This causes the compiler to output foo.obj as well as Foobar.ifc, which contains the binary definition of the module. To use this you would compile a second source file like this:

cl -c baz.cpp /module:reference Foobar.ifc

This is basically the same way that Fortran does its modules, and it suffers from the same problem, one that makes life miserable for build system developers.

There are two main reasons. First, the name of the ifc file cannot be known beforehand without scanning the contents of the source files. Second, you can't know which ifc filename to pass on the second command line without scanning that source file to see which modules it imports _and_ scanning potentially every source file in your project to find out which one actually provides them.
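To make the scanning burden concrete, here is a rough Python sketch of what a build system would have to do. It assumes a simplified syntax where `module Name;` declares a module and `import Name;` uses one, and it matches those with plain regular expressions. A real scanner would need to deal with the preprocessor, comments and more, so treat the patterns and function names as illustrative only.

```python
import re

# Assumed, simplified syntax: "module Name;" declares, "import Name;" uses.
MODULE_RE = re.compile(r"^\s*module\s+([\w.]+)\s*;", re.MULTILINE)
IMPORT_RE = re.compile(r"^\s*import\s+([\w.]+)\s*;", re.MULTILINE)

def scan_sources(sources):
    """Scan every source to learn who provides and who imports what.

    `sources` maps a filename to its text (hypothetical input format).
    Returns (providers, imports): module name -> providing file, and
    filename -> list of imported module names.
    """
    providers = {}
    imports = {}
    for filename, text in sources.items():
        for name in MODULE_RE.findall(text):
            providers[name] = filename
        imports[filename] = IMPORT_RE.findall(text)
    return providers, imports

def ifc_arguments(filename, imports):
    """List the .ifc files to reference when compiling `filename`."""
    return [name + ".ifc" for name in imports[filename]]

sources = {
    "foo.cxx": "module Foobar;\nint answer() { return 42; }\n",
    "baz.cpp": "import Foobar;\nint main() { return answer(); }\n",
}
providers, imports = scan_sources(sources)
baz_args = ifc_arguments("baz.cpp", imports)
```

Note that neither the name Foobar.ifc nor the fact that foo.cxx produces it can be derived from the filenames alone; both only fall out of reading the file contents.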

Most modern build systems work in two phases. The first phase parses the build definition and determines which build steps to run and in which order. Basically it just serialises the dependency DAG to disk. The second phase loads the DAG, checks its status and takes all the steps necessary to bring the build up to date.

The first phase takes a lot more effort and is usually much slower than the second. A typical ratio for a medium project is that the first phase takes roughly ten seconds of CPU time and the second takes a fraction of a second. In contemporary C++ almost all code changes only require rerunning the second phase, whereas changing the build config (adding new targets etc.) requires rerunning the first phase as well.
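The split between the two phases can be sketched in a few lines of Python. This is a toy model, not how any particular build system is implemented: the function names, the JSON file format and the timestamp check are all assumptions made for illustration, and a real tool would also order the steps and run the commands.

```python
import json
import os
import tempfile

def configure(targets, graph_file):
    """Phase one (slow in real life): resolve the build definition and
    serialise the dependency graph. `targets` maps output -> inputs."""
    with open(graph_file, "w") as f:
        json.dump(targets, f)

def is_stale(output, inputs):
    """An output is stale if it is missing or older than any input."""
    if not os.path.exists(output):
        return True
    out_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(i) > out_mtime for i in inputs)

def stale_outputs(graph_file):
    """Phase two (fast): load the graph and report what needs rebuilding.
    No source file contents are read here, only timestamps."""
    with open(graph_file) as f:
        targets = json.load(f)
    return sorted(out for out, ins in targets.items() if is_stale(out, ins))

with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "foo.cpp")
    obj = os.path.join(d, "foo.obj")
    graph = os.path.join(d, "graph.json")
    open(src, "w").close()
    configure({obj: [src]}, graph)
    first = stale_outputs(graph)   # object file missing, so it is stale
    open(obj, "w").close()
    os.utime(src, (1000, 1000))    # pretend the source is older...
    os.utime(obj, (2000, 2000))    # ...than the freshly built object
    second = stale_outputs(graph)  # nothing left to do
```

The important property is that `configure` can list every output name up front, so the second phase never has to open a source file.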

This split works because output files and their names are fully knowable without looking at the contents of the source files. With the proposed scheme this is no longer the case. A simple (if slightly pathological) example should clarify the issue.

Suppose you have file A that defines a module and file B that uses it. You compile A first and then B. Now change the source code so that the module definition goes to B and A uses it. How would a build system know that it needs to compile B first and only then A?

The answer is that this is impossible without scanning the contents of A and B before running the compiler. This means that to get reliable builds, either all build systems need to grow a full C++ parser or all C++ compilers must grow a full build system. Neither of these is particularly desirable. Even if build systems got these parsers, they would need to reparse the source of all changed files before starting the compiler and then adjust the compiler arguments accordingly. This makes every rebuild take the slow path of phase one instead of the fast path of phase two.
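The A and B example can be played through in Python. The sketch below assumes the same simplified `module Name;` / `import Name;` syntax as in the talk's examples and uses the standard library's `graphlib` (Python 3.9+) for ordering; the file names and the `compile_order` helper are made up for illustration.

```python
import re
from graphlib import TopologicalSorter

# Assumed, simplified syntax for module declarations and imports.
MODULE_RE = re.compile(r"^\s*module\s+([\w.]+)\s*;", re.MULTILINE)
IMPORT_RE = re.compile(r"^\s*import\s+([\w.]+)\s*;", re.MULTILINE)

def compile_order(sources):
    """Derive a valid compile order purely from the file contents.

    `sources` maps filename -> source text (hypothetical input format).
    """
    providers = {}
    for filename, text in sources.items():
        for name in MODULE_RE.findall(text):
            providers[name] = filename
    # Each file depends on the files providing the modules it imports.
    graph = {
        filename: {providers[m] for m in IMPORT_RE.findall(text) if m in providers}
        for filename, text in sources.items()
    }
    return list(TopologicalSorter(graph).static_order())

before = {"A.cpp": "module M;\n", "B.cpp": "import M;\n"}
after = {"A.cpp": "import M;\n", "B.cpp": "module M;\n"}

order_before = compile_order(before)
order_after = compile_order(after)
```

The set of files and their names are identical in both cases, yet the required order flips. Only the text inside the files tells the two situations apart, which is exactly the information the fast second phase does not have.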

Potential solutions


There are a few possible solutions to this problem, none of which is perfect. The first is to require that module Foo must be defined in a source file Foo.cpp. This makes everything deterministic again but has that iffy Java feeling about it. The second option is to define the module in a "header-like" file rather than in source code. Thus a foo.h file would become foo.ifc and the compiler could pick that up automatically instead of the .h file.
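As a small illustration of the first option, here is how the file names could be computed in Python once such a naming rule exists. The rule itself (module Foo lives in Foo.cpp and yields Foo.ifc next to it) is the assumption being sketched, and the helper names are hypothetical.

```python
from pathlib import PurePosixPath

def ifc_for_source(source):
    """Under the assumed rule, Foo.cpp always produces Foo.ifc."""
    return str(PurePosixPath(source).with_suffix(".ifc"))

def source_for_module(module, source_dir="."):
    """Under the assumed rule, module Foo is always defined in Foo.cpp."""
    return str(PurePosixPath(source_dir) / (module + ".cpp"))

ifc_name = ifc_for_source("src/Foobar.cpp")
provider = source_for_module("Foobar", "src")
```

Both answers come from names alone, so the output of each compile step is known again without opening any source file.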

3 comments:

  1. Yes, this is a problem for how Microsoft has approached modules, which is the only approach Gaby spoke about. This is not necessarily how the C++ Standard will choose to implement modules.

    Clang currently offers a different approach where the compiler self-manages a module cache. This too has its issues, notably with distributed build systems, but it does illustrate that other approaches are possible.

  2. Modules are a hard problem, no doubt. But this is the only way to conquer it: having different people try different things which are then compared and refined until a final solution appears. Unfortunately it can take a loooooong time. :(

  3. > Thus a foo.h file would become foo.ifc and the compiler could pick that up automatically instead of the .h file.

    How does the build system generate the foo.ifc file from foo.h, and know that depends-on-foo.cpp cannot be built until foo.ifc is generated? Clang's "implicit modules" approach that Alex mentioned has the compiler maintain the modules cache.

    There is another approach, which has been used in production at Google with Clang: the user must maintain the dependency information explicitly in the "build config" files. Therefore, only the "first phase" is affected. Any distributed build system that supports fine-grained dependencies will be able to use this approach, since it already needs to know these dependencies in order to ship the right files to the remote machine.
