According to information I have picked up somewhere (but can't properly confirm via web searches at the moment), there was a compiler in the 90s (the IBM VisualAge compiler, maybe?) that had a special caching daemon mode. The basic idea was that you would send your code to that process, and it could return cached compile results without needing to reparse and reprocess the same bits of code over and over. A sort of in-compiler Ccache, if you will. These compilers no longer seem to exist, probably because you can't just send snippets of code to be compiled; you have to send the entire body of code up to the point you want to compile. If any of it is different, for example because some headers are included in a different order, the results cannot be reused. You have to send everything over, and at that point it becomes distcc.
I was thinking about this some time ago (don't ask why, I don't know) and while this approach does not work in the general case, maybe it could be made to work for a common special case. However, I am not a compiler developer, so I have no idea whether the following idea could work or not. But maybe someone skilled in the art might want to try it, or maybe some university professor could have their students test the approach for course credit.
The basic idea is quite simple. Rather than trying to somehow cache the compiler's internal state to disk, keep that state alive in a running process, without even attempting to be general.
The steps to take:

1. Create a C++ project with a dozen source files or so. Each of those sources includes some random set of std headers and has a single function that does something simple, like returning the sum of its arguments. What they do is irrelevant; they just have to be slow to compile.
2. Create a PCH file that has all the std headers used in the source files. Compile that to a file (see the header sketch after this list).
3. Start compiling the actual sources one by one. Do not use parallelism, to emphasize the time difference.
4. When the first compilation starts, read the PCH file contents into memory in the usual way. Then fork the process. One of the processes carries on compiling as usual. The second process opens a port and waits for connections; this process is the zygote server process.
5. When subsequent compilations are run, they connect to the port opened by the zygote process, send the compilation flags over the socket and wait for the server process to finish.
6. The zygote process reads the command line arguments from the socket and then forks itself. One process starts waiting on the socket again, whereas the other compiles code according to the command line arguments it was given (see the server sketch further below).
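For the PCH step, the shared header could look something like this. The file name, the exact set of includes and the clang invocations in the comment are illustrative placeholders, not a requirement of the approach:

```c++
// pch.h -- all std headers used by any of the test sources.
// Build it once as a precompiled header, then feed it to every compilation:
//   clang++ -x c++-header pch.h -o pch.h.pch
//   clang++ -include-pch pch.h.pch -c source1.cpp
#include <algorithm>
#include <filesystem>
#include <string>
#include <unordered_map>
#include <vector>
```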
The performance boost comes from the fact that the zygote process already has the stdlib headers in memory, in the compiler's native data structures. In the optimal case loading the PCH file takes effectively zero time. What makes this work (in this test at least) is that the PCH file is identical for every compilation and it is the first thing the compiler starts processing, so the in-memory state is valid for all of them. That is the concept, at least; an actual compiler might do something else entirely, and there may be a dozen other reasons it might not work.
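To make the scheme concrete, here is a minimal sketch of the zygote loop over a Unix socket. Everything in it is hypothetical: the socket path, the helper names and the single-read flag protocol are placeholders standing in for whatever a real compiler would do:

```c++
#include <sys/socket.h>
#include <sys/un.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical stand-ins for the compiler's own logic.
void load_pch_into_memory() { /* expensive: deserialize the PCH once */ }
int  compile_with_flags(const std::string& flags) {
    std::printf("compiling with: %s\n", flags.c_str());
    return 0;
}

int run_zygote_server() {
    int srv = socket(AF_UNIX, SOCK_STREAM, 0);
    sockaddr_un addr{};
    addr.sun_family = AF_UNIX;
    std::strcpy(addr.sun_path, "/tmp/compiler-zygote.sock"); // placeholder path
    unlink(addr.sun_path);
    bind(srv, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    listen(srv, 16);
    for (;;) {
        int conn = accept(srv, nullptr, nullptr);
        char buf[4096];
        ssize_t n = read(conn, buf, sizeof(buf) - 1); // the compilation flags
        buf[n > 0 ? n : 0] = '\0';
        if (fork() == 0) {
            // Child: inherits the in-memory PCH state and does one compile.
            close(srv);
            _exit(compile_with_flags(buf));
        }
        close(conn); // parent goes straight back to accept()
        while (waitpid(-1, nullptr, WNOHANG) > 0) {} // reap finished children
    }
}

int main() {
    load_pch_into_memory();  // done once, before any fork
    if (fork() == 0)         // the first compilation proceeds as usual...
        return compile_with_flags("flags of the first translation unit");
    return run_zygote_server(); // ...while the other process becomes the zygote
}
```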
If someone tries this out, do let us know whether it actually worked.
I haven't implemented the full proposal, but for testing purposes I've tried, in clang, to `fork` after the precompiled header is loaded and to build the same file multiple times. By doing the same compilation in a loop I hope to see the benefit of `fork`. You can find my changes at the [shared_process_state-build_mode branch in my LLVM fork](https://github.com/vsapsai/llvm-project/tree/shared_process_state-build_mode). Is this what you had in mind?
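Schematically, the experiment amounts to something like the following; this is a rough paraphrase with hypothetical helper names, the actual changes live in the branch linked above:

```c++
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helpers: the real experiment hooks into clang internals.
void load_precompiled_header() { /* expensive setup, done once */ }
void compile_input_file()      { /* one compilation of the same file */ }

int main() {
    load_precompiled_header();    // pay the PCH cost a single time
    for (int i = 0; i < 10; ++i) {
        if (fork() == 0) {
            compile_input_file(); // child reuses the in-memory PCH state
            _exit(0);
        }
        wait(nullptr);            // parent waits, then forks again
    }
}
```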
As for the feasibility, the entire proposal seems to be implementable in clang. It is somewhat annoying that we handle a precompiled header *after* we've decided about the input file (if I'm not mistaken, but I didn't step through the relevant code in a debugger). That was one of the main reasons I decided to experiment on a smaller scope and not perform invasive clang surgery without knowing the benefits.
Wow, that's really cool. I did not look into the code too deeply (not really an LLVM expert) but from what I can see that is pretty much exactly how I thought it would work.
You might consider testing it with some heavy-duty C++ standard library headers like `algorithm`, or just a selection of the most common ones like `vector`, `unordered_map`, `filesystem`, etc. That should give more realistic results on how big a speedup "regular cross-platform programs" might see from this approach.
Unfortunately, a relatively big C++ file (2149 lines, lots of includes) doesn't really benefit from the `fork` approach or from a precompiled header. Though in my experience on macOS, precompiled headers aren't as useful for C++ anyway.