Note: Everything that follows is purely my personal opinion as an individual. It should not be seen as any sort of policy of the Meson build system or any other person or organization. It is also not my intention to throw anyone involved in this work under a bus. Many people have worked to the best of their abilities on C++ modules, but that does not mean we can't analyze the current situation with a critical eye.
The lead on this post is a bit pessimistic, so let's just get it out of the way.
If C++ modules cannot show a 5× compilation time speedup (preferably 10×) on multiple existing open source code bases, modules should be killed and taken out of the standard. Without this speedup, pouring any more resources into modules is just feeding the sunk cost fallacy.
That seems like a harsh thing to say for such a massive undertaking that promises to make things so much better. It is not something that you can just belt out and then mic drop yourself out. So let's examine the whole thing in unnecessarily deep detail. You might want to grab a cup of $beverage before continuing, this is going to take a while.
What do we want?
For the average developer the main visible advantages would be the following, ordered from the most important to the least.
- Much faster compilation times.
Then, little by little, build speed seems to fall by the wayside and the focus starts shifting towards "build isolation". This means avoiding bugs caused by things like macro leakage, weird namespace lookup issues and so on. Performance is still kind of there, but the numbers are a lot smaller, spoken aloud much more rarely and often omitted entirely. Now, getting rid of these sorts of bugs is fundamentally a good thing. However, it might not be the most efficient use of resources. Compiler developer time is, sadly, a zero-sum game, so we should focus their skills and effort on the things that provide the best results.
Macro leakage and other related issues are icky, but they are on average fairly rare. I have encountered a bug caused by them maybe once or twice a year. They are just not that common for the average developer. Things are probably different for people doing deep low-level metaprogramming hackery, but they are a minuscule fraction of the total developer base. Slow build times, on the other hand, are the bane of every single C++ developer's existence, every single day. They are, without question, the narrowest bottleneck for developer productivity today, and they are the main issue modules were designed to solve. Modules do not seem to be doing that nowadays.
How did we end up here in the first place?
C++ modules are a C++20 feature. If a feature takes over five years of implementation work to get even somewhat working, you might ponder how it was accepted into the standard in the first place. As I was not there when it happened, I do not really know. However, I have spoken to people who were present at the actual meetings where things were discussed and voted on. Their comments have been enlightening, to say the least.
Apparently there were people who knew about the implementation difficulty and other fundamental problems and were quite vocal that modules as specified were borderline unimplementable. They were shot down by a group of "higher up" people saying that "modules are such an important feature that we absolutely must have them in C++20".
One person who was present told me: "that happened seven years ago [there is a fair bit of lead time in ISO standards] and [in practice] we still have nothing. In another seven years, if we are very lucky, we might have something that sort of works".
The integration task from hell
What sets modules apart from almost all other features is that they require very tight integration between compilers and build systems. This means coming up with schemes for what module files actually contain, how they are named, how they are organized in big projects, and how to best divide the work between the different tools. Given that the ISO standard does not even acknowledge the fact that source code might reside in a file, none of this is in its purview. It is not in anybody's purview.
The end result of all that is that everybody has gone into their own corner, done the bits that are easiest for them and hoped for the best. To illustrate how bad things are: I have been in discussions with compiler developers about this. In those discussions various avenues were considered for how to get things actually working, but one compiler developer replied "we do not want to turn the compiler into a build system" to every single proposal, no matter what it was. The experience was not unlike talking to a brick wall. My guess is that the compiler team in question did not have the resources to change their implementation, so vetoing everything became the sensible approach for them (though not for the module world in general).
The last time I looked into adding module support to Meson, things were so mind-bogglingly terrible that you needed to create additional compiler flags during compilation, store them in temp files and pass them along to subsequent compilation commands. I wish I was kidding, but I am not. It is quite astounding that the module work started basically from Fortran modules, which are simple and work (in production, even), and ended up in its current state: a kafkaesque nightmare of complexity that does not work.
If we look at the whole thing from a project management viewpoint, the reason for this failure is fairly obvious. This is a big change across multiple isolated organizations. The only real way to get those done is to have a product owner who a) is extremely good at their job b) is tasked with and paid to get the thing done properly c) has sufficient stripes to give orders to the individual teams and d) has no problems slapping people on metaphorical wrists if they try to weasel out of doing their part.
Such a person does not exist in the modules space. It is arguable whether such a person could exist even in theory. Because of this modules can never become good, which is a reasonable bar to expect a foundational piece of technology to reach.
The design that went backwards
If there is one golden rule of software design, it is "Do not do a grand design up front". This is mirrored in the C++ committee's guideline of "standardize existing practice".
C++ modules may be the grandest up-frontest design the computing world has ever seen. There were no implementations (one might argue there still aren't, but I digress), no test code, no prototypes, nothing. Merely a strong opinion of "we need this and we need it yesterday".
For the benefit of future generations, a better way to approach the task would have gone something like this:

- First implement enough in the compiler to produce one module file and then consume it in a different compilation unit. Keep it as simple as possible. It's fine to serialize only a subset of functionality and error out if someone tries to go outside the lines.
- Then take a build system and make it run that.
- Then expand that to support a simple project, say, one that has ten source files and produces one executable. Implement features in the module file until you can compile the whole thing.
- Then measure the output. If you do not see performance increases, stop further development until you either find out why that is or you can fix your code to work better.
- Now update the API so that no part of the integration makes people's eyes bleed in horror.
- Then scale the prototype to handle a project with 100 sources. Measure again. Improve again.
- Then do two 100-source projects, one that produces a library and one that creates an executable using the library. Measure again. Improve again.
- Then do 1000 sources in 10 subprojects. Repeat.
If the gains are there, great: now you have a base implementation that has been proven to work with real world code and which can be expanded to a full implementation. If the implementation can't be made fast and clean, that is a sign that there is a fundamental design flaw somewhere. Throw your code away and either start from scratch or declare the problem too difficult and work on something else instead.
Hacking on an existing C++ compiler is really difficult and it takes months of work to even get started. If someone wants to try to work on modules but does not want to dive into compiler development, I have implemented a "module playground", which consists of a fake C++ compiler, a fake build system and a fake module scanner all in ~300 lines of Python.
The promise of import std
There is a second way of doing modules in an iterative fashion, and it is actually being pursued by C++ implementers, namely import std. This is a very good approach in several different ways. First of all, the most difficult part of modules is the way compilations must be ordered. For the standard library this is not an issue, because it has no dependencies and you can generate all of it in one go. The second thing is the fact that most of the slowness of C++ development comes from the standard library. For reference, merely doing an #include <vector> brings in 27 000 lines of code, and that is a fairly small amount compared to many other common headers.
What sort of an improvement can we expect from this on real world code bases? Implementations are still in flux, so let's estimate using information we have. The way import std is used depends on the compiler but roughly:
- Replace all #include statements for standard library headers with import std.
- Run the compiler in a special mode.
- The compiler parses the headers of the standard library and produces some sort of binary representation of them.
- The representation is written to disk.
- When compiling normally, add compiler flags that tell the compiler to load that file before processing the actual source code.
If you are thinking "wait a minute, if we remove step #1, this is exactly how precompiled headers work", you are correct. Conceptually it is pretty much the same, and I have been told (but have not verified myself) that in GCC, at least, module files are just repurposed precompiled headers with all the same limitations (e.g. you must use the same compiler flags to consume a module file as you did when you created it).
Barring a major breakthrough in compiler data structure serialization, the expected speedup should be roughly equivalent to the speedup you get from precompiled headers. Which is to say, maybe 10–20% with Visual Studio and a few percentage points on Clang and GCC. On the other hand, if such a serialization improvement had occurred, it could probably be adapted for precompiled headers, too. Until someone provides verifiable measurements proving otherwise, we must assume that this is the level of achievable improvement.
For reference, here is a Reddit thread where people report improvements in the 10-20% range.
But why 5×?
A reasonable requirement for the speedup would be "better than can be achieved using currently available tools and technologies". As an experiment I wrote a custom standard library (deliberately not API compatible with the ISO one) whose main design goal was to be fast to compile. I then took an existing library, converted it to use the new library and measured. The code compiled four times faster. In addition, the binary it produced was smaller and, unexpectedly, ran faster. Details can be found in this blog post.
Given that 4× is already achievable (though, granted, only tested on one project, not proven in general), 5× seems like a reasonable target.
But what's in it for you?
The C++ standard committee has done a lot of great (and highly underappreciated) work to improve the language. On several occasions Herb Sutter has presented new functionality with "all you have to do is to recompile your code with a new compiler and the end result runs faster and is safer". It takes a ton of work to get these kinds of results, and it is exactly where you want to be.
Modules are not there. In fact they are in the exact opposite corner.
Using modules brings with it the following disadvantages:
- Need to rewrite (possibly refactor) your code.
- Loss of portability.
  - Module binary files (with the exception of MSVC) are not portable, so you need to provide header files for libraries in any case.
- The project build setup becomes more complicated.
- Only the very newest toolchain versions work (at the time of writing, Apple's module support is listed as "partial").
And the corresponding list of advantages:

- Nothing.