Saturday, December 16, 2023

On the scalability of C++ module implementations or lack thereof

This is a follow up to an earlier blog post called The road to hell is paved with good intentions and C++ modules. As it was a bit on the gloomy side so maybe I should preface this post by reiterating that adding module support to C++ is a huge task. All people who have worked on the specification and implementations in the various toolchains have had to do an incredible amount of unrewarding work. Everything said here (and in the previous post) is not to badmouth their existing work. It is merely to say that where we currently are is not a place we want to stay in, we need to move forward to get a better foundation to base decades of future work on.

How many ways do I #include thee? Let me count the ways.

In the Reddit discussion on the previous post user mathstuf (whose flair indicates that they are a CMake dev) makes the following claim:

Clang and MSVC need the transitive closure of imported modules specified as well, so while you M0 might only import M1, if it imports M2 through M9, they need to be there too (GCC doesn't need this today, but Clang 17 warns if they're not there and I believe Clang 18 will error; MSVC has always errored without them).

"Needs to be defined" means adding a compiler flag like -fmodule-file=M0=path/to/M0.mod. Let's assume this to be true. What does this imply? How many compiler flags would this add in the general case?

We don't know, because there are no large scale C++ codebases that use modules exclusively. Thus we need do Fermi estimation based on the data we do have. Let's start simple with the following C++ source file.


If this were a "module library" the contents would be import vector (we'll ignore the std part for simpllicity). On the Ubuntu machine I'm typing this using such a module would require using a command line argument -fmodule-file=vector=/usr/include/c++/13/vector.mod to match where the header is. But this is not enough. The header file itself is only 151 lines long. Inside it we find things like this:

#include <bits/stl_algobase.h>
#include <bits/allocator.h>
#include <bits/stl_construct.h>
#include <bits/stl_uninitialized.h>
#include <bits/stl_vector.h>
#include <bits/stl_bvector.h>
#include <bits/refwrap.h>
#include <bits/range_access.h>

If we assume that a "modular version" of vector would be split roughly among the same lines as the header version then each one of these includes would a import statement and a corresponding compiler arugment and so on recursively. 

How many is that in total?

Computing this is a fairly simple one-liner:

g++ -std=c++20 -E includecount.cpp|grep /usr/include/c |cut -d ' ' -f 3|sort|uniq

This yields the answer 55.

In a library split up in this way merely importing this one module means having to add 55 compiler flags to get it to compile.

This is of course simplistic and in a real world use case the library would probably be set up in a different way. But the point is that currently we do not know what that setup would be because these kinds of libraries do not exist. Even if they did exist the bigger fundamental problem here is that the number of compiler arguments for this must be at most one (or two for reasons discussed below). Anything more than that is wrong both conceptually and in practice.

Fine, but does it work in theory?

Suppose you don't use any external libraries or modules but instead hand craft your very own artisanal humongous batch code base. What would happen then? Let's assume that each source file specifies one module for simplicity.  At the bottom of the dependency chain are source files that area fully standalone. Above that are files that import one or more of the given standalone modules and so on.

In order to compute the total amount of work we need to compute the transitive closure of each file. This is unknown so again we have to estimate. A reasonable first order approximation is that each module's transitive closure encompasses half of the modules "below it". If we order all source files from "lowest level" to "highest level" we can say that for each source file i sizeof_transitive_closure(i) = i/2. Thus the total number of compiler arguments we need to specify all these transitive closures is (according to Wolfram Alpha):

Or, in other words, this is an O(N²) algorithm. Those do not scale. A quick check says that Clang's code base has almost 30 000 cpp files. That would produce approximately

If each of these was just 10 characters long, the corresponding Ninja file would be 10 gigabytes in size if it were possible to build it up front. It is not, instead you have to split it into a million horcruxes and then glue them together. 

For something Chromium all these numbers would be a lot bigger.

Back to the basics

Let's look at the problem from yet another angle. Conceptually, module files are identical to header files. We should treat them as such. There is decades of experience in making that scheme work, even for header files that are generated during build time (which is how modules work).

Instead of manually telling the compiler which file contains which module, we instead pass it a list of module search directories. It will then look up modules in those directories automatically. Just like it does for headers currently. This is simple, understandable and we even know that it scales, because this is how header inclusion has worked for almost 50 years now. Not only that, we also know that it scales to binary modules, because this is exactly how Fortran compilers do it. They compiler flag they use is -J.

The only downside is that this approach might not be easily usable in projects whose source code layout is batshit insane. Fortunately those do not need to be supported, since currently there are no such code bases. C++ modules have only been used by early adopters and test projects and those people are typically fine with some breakage in exchange for a much better experience going forward. There is no burden of backwards compatibility for modules yet. Let's improve things so we don't have it in the future either.

No comments:

Post a Comment