Saturday, August 10, 2019

C++ modules with a batch compiler feasibility study

In our previous blog post we proposed an approach for building C++ modules using a batch compiler model. The post raised a lot of discussion on Reddit. Most of it was of the quality one would expect, but some interesting questions were raised.

How much work for compiler developers?

A fair issue that was raised is that the batch approach means adding new functionality to the compiler. This is true, but not particularly interesting as a standalone statement. The real question is how much more work it is, especially compared to the work needed to support other module implementation methods.

In the batch mode there are two main pieces of work. The first is starting compiler tasks, detecting when they freeze due to missing modules and resuming them once the modules they require have been built. The second one is calculating the minimal set of files to recompile when doing incremental builds.

The former is the trickier one because no one has implemented it yet. The latter is a well-known problem: build systems such as Make have been solving it for 40+ years, and newer tools like Ninja do the same. To test the former I wrote a simple feasibility study in Python. It generates 100 source files containing modules that call each other and then compiles them all in the manner a batch compiler would. There is no need to scan the contents of the files; the system automatically detects the correct build order or errors out if that cannot be done.
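The scheduling idea can be condensed into a toy model. The sketch below is mine, not taken from the script, and it simplifies heavily (it assumes each source exports exactly one module and that its imports are known up front, whereas a real compiler would discover imports while parsing): a "frozen" compilation is simulated by skipping any source whose imported modules are not yet built, and a full pass with no progress means the build cannot be ordered.

```python
def batch_compile(sources):
    """sources: dict mapping module name -> set of imported module names.
    Returns the order in which modules finish building, or raises if
    the build cannot make progress (cycle or missing module)."""
    built = set()
    pending = dict(sources)
    order = []
    while pending:
        progress = False
        for name, imports in list(pending.items()):
            if imports <= built:      # all required modules are ready
                built.add(name)       # "compilation" finishes; BMI now exists
                order.append(name)
                del pending[name]
                progress = True
        if not progress:              # every remaining task is frozen
            raise RuntimeError(f"cyclic or missing imports: {sorted(pending)}")
    return order

# A tiny diamond dependency: m3 imports m1 and m2, which both import m0.
mods = {"m0": set(), "m1": {"m0"}, "m2": {"m0"}, "m3": {"m1", "m2"}}
print(batch_compile(mods))
```

A real implementation would suspend and resume actual compiler processes instead of retrying from scratch, but the ordering logic is the same.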

Note that this is just a feasibility experiment and it comes with a lot of caveats. Please go through the readme before commenting; the issue you want to raise may already be addressed there. In particular, note that it only works with Visual Studio.

The code that is responsible for running the compile is roughly 60 lines of Python. It is conceptually very simple. A real implementation would need to deal with threading, locking and possibly IPC, which would take a fair bit of work.

The script does not support incremental builds. I'd estimate that getting a minimal version going would take something like 100-200 lines of Python.

I don't have any estimates on how much work this would mean on a real compiler code base.

The difference to scanning

A point raised in the Reddit discussion was that there is an alternative approach that uses richer communication between the build system and the compiler. If you look deeper, you may find that the two approaches are not that different after all; it is more a question of how the pie is sliced. The scanning approach looks roughly like this:

[Diagram: the build system scans sources for module dependencies and communicates with the compiler over a shared protocol.]
In this case the build system and the compiler need to talk to each other via some sort of a protocol. This protocol must be implemented by all compilers and all build tools, and they must all interoperate. It must also remain stable across releases. There is a proposal for this protocol. The specification is already fairly long and complicated, especially considering that it supports versioning, so future versions may differ in arbitrary ways. This carries an ongoing maintenance cost of unknown size. It also complicates distributed builds, because it is common for the build system and the compiler to reside on different machines or even different data centres, so setting up an IPC channel between them may take a fair bit of work.
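To make the cost of such a protocol concrete, a request/response exchange might look something like the sketch below. This is purely illustrative: the method names, fields and JSON encoding are invented for exposition and bear no relation to the wire format in the actual proposal.

```python
import json

def handle_request(msg, module_cache):
    """A toy 'module lookup' responder: the compiler asks where a module's
    built interface lives; the build system answers or tells it to wait.
    All message shapes here are hypothetical."""
    req = json.loads(msg)
    if req.get("method") == "module-lookup":
        path = module_cache.get(req["module"])
        if path is None:
            return json.dumps({"status": "not-built-yet"})
        return json.dumps({"status": "ok", "bmi-path": path})
    return json.dumps({"status": "unknown-method"})

cache = {"foo.bar": "/build/cache/foo.bar.bmi"}
request = json.dumps({"method": "module-lookup", "module": "foo.bar"})
print(handle_request(request, cache))
```

Every compiler and every build tool would have to agree on the equivalent of this exchange, keep it stable, and test interoperability, which is the maintenance cost discussed above.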

The architectural diagram of the batch compiler model looks like this:

[Diagram: the build system simply invokes the compiler; module information flows inside the compiler as an implementation detail.]
Here the pathway for communicating module information is a compiler implementation detail. Every compiler can implement it in whatever way best fits its internals, and can change that implementation whenever it wishes, because it is never exposed outside the compiler. The only public-facing API that needs to be kept stable is the compiler command line, which all compilers already provide. This also permits them to ship a module implementation faster, since there is no need to fully specify a protocol and do interoperability testing. The downside is that the compiler needs to determine which sources to recompile when doing incremental builds.

Who will eventually make the decision?

I don't actually know, but it will probably come down to what the toolchain providers (GCC, Clang, Visual Studio) are willing to commit to.

It is my estimate (purely a guess, because I don't have first-hand knowledge) that the batch system would take less effort to implement and would present a smaller ongoing maintenance burden. But compiler developers are the people who would actually know.
