Since time immemorial, compilers have been run as standalone batch processes. If you have 50 files to compile, then you invoke the compiler 50 times, once on each file. Since each compilation is independent of all others, the work can be parallelised perfectly. This seems like a simple and optimal solution.
But, as is commonly the case, this is not the whole truth. When compiling code, many subtasks are common to every individual compilation, and this causes a lot of duplicated effort. Perhaps the best known case of this is C++ templates. They are parsed and code generated for each file that uses them, yielding the same code in dozens of object files. Then the linker comes along and throws all but one of the copies away. There are a bunch of other issues, which are discussed in a video from an LLVM developers' conference.
A problem of state preservation
One of the best known solutions to this problem is precompiled headers. They work roughly like this:
- Parse the contents of headers
- Dump compiler internal state to a file
- Load the file on each compiler invocation
Ideally we would like to preserve as much data as possible between compiler invocations without needing to serialise it to disk. As discussed in the video mentioned above, one solution is to have a "compiler plugin": the compiler built as a shared library that the build system loads into its own process.
Almost every build system currently works roughly like this:
- Read build definition (such as a Ninja file)
- For each compilation, spawn a new compiler process and invoke the compiler executable
- Shutdown
With a compiler plugin, the build system would instead work roughly like this:
- Read build definition
- dlopen the compiler shared library file
- For each compilation, create a new compiler object and invoke compilation using e.g. a thread pool
- Destroy compiler objects and dlclose the file
- Shutdown
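The dlopen and dlclose steps above could look roughly like the following on a POSIX system. This is only a sketch: the `CompilerPlugin` struct and the `compiler_plugin_load` and `compiler_plugin_unload` helpers are hypothetical names invented for illustration, and the symbol names are the ones from the API proposed later in this post. No existing compiler is known to export them.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>
#include <string.h>

/* Opaque types from the proposed API; their layout is private to the plugin. */
typedef struct CompilerService CompilerService;
typedef struct Compiler Compiler;
typedef struct CompilationResult CompilationResult;

/* Hypothetical table of entry points resolved from the shared library. */
typedef struct {
    void *handle;
    CompilerService *(*init_service)(void);
    Compiler *(*create_compiler)(CompilerService *);
    CompilationResult *(*compile)(Compiler *, int, const char **);
    void (*free_result)(CompilationResult *);
    void (*free_compiler)(Compiler *);
    void (*free_service)(CompilerService *);
} CompilerPlugin;

/* Look up one symbol and store it in *slot. Returns 0 on success, -1 if missing. */
static int resolve(void *handle, const char *name, void *slot) {
    void *sym = dlsym(handle, name);
    if (!sym) {
        const char *err = dlerror();
        fprintf(stderr, "missing symbol %s: %s\n", name, err ? err : "unknown error");
        return -1;
    }
    /* POSIX lets a dlsym result be used as a function pointer; copying the
       bytes sidesteps the ISO C warning about casting void* to a function. */
    memcpy(slot, &sym, sizeof sym);
    return 0;
}

/* Load the library at `path` and resolve all six entry points.
   Returns 0 on success, -1 on any failure (the table is left zeroed). */
int compiler_plugin_load(const char *path, CompilerPlugin *p) {
    assert(path != NULL && p != NULL);
    memset(p, 0, sizeof *p);
    p->handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (!p->handle) {
        const char *err = dlerror();
        fprintf(stderr, "dlopen failed: %s\n", err ? err : "unknown error");
        return -1;
    }
    if (resolve(p->handle, "compiler_init_service", &p->init_service) ||
        resolve(p->handle, "compiler_create_compiler", &p->create_compiler) ||
        resolve(p->handle, "compiler_compile", &p->compile) ||
        resolve(p->handle, "compiler_free_compilation_result", &p->free_result) ||
        resolve(p->handle, "compiler_free_compiler", &p->free_compiler) ||
        resolve(p->handle, "compiler_free_service", &p->free_service)) {
        dlclose(p->handle);
        memset(p, 0, sizeof *p);
        return -1;
    }
    return 0;
}

/* Unload the library; safe to call on a table that failed to load. */
void compiler_plugin_unload(CompilerPlugin *p) {
    assert(p != NULL);
    if (p->handle)
        dlclose(p->handle);
    memset(p, 0, sizeof *p);
}
```

A build system would call `compiler_plugin_load` once at startup, hand the resolved function pointers to its worker threads, and call `compiler_plugin_unload` only after every compiler object has been destroyed.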
The big question remaining here is which API to use. It should meet the following requirements:
- Must be ABI stable in the C sense
- Must be supportable on all compilers for all languages
- Must expose the full functionality of the compiler
- Must support an arbitrary number of compiler tasks within a single process
An API proposal for compiler invocation
On the face of it this seems like an impossible task. The API surface of a compiler is enormous and differs from compiler to compiler. However, all of them already expose a stable ABI: the command-line argument array. Exploiting this allows us to create an API supporting all of the requirements above with only six functions.
First we initialise the library:
CompilerService* compiler_init_service();
Here CompilerService is an opaque struct representing a state object. There is one of these per process and it holds (internally) all the cached state and related things. Then we create a compiler object, one per compilation task:
Compiler* compiler_create_compiler(CompilerService *service);
Now we can invoke the compilation:
CompilationResult* compiler_compile(Compiler *c, int argc, const char **argv);
This invocation matches the signature of the main function. Since we are not going through the shell/kernel we can pass an arbitrary number of arguments without needing to use response files, quote shell characters or any other nastiness. The return value contains the return code and the strings for stdout and stderr. The standalone compiler executable such as cl.exe could (in theory ;-) be implemented by just calling these functions and returning the results to the calling process.
The last things we need are the deallocation functions:
void compiler_free_compilation_result(CompilationResult *r);
void compiler_free_compiler(Compiler *c);
void compiler_free_service(CompilerService *s);
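To make the lifecycle of the six functions concrete, here is a minimal sketch with a stub implementation behind them. Everything inside the structs is an assumption made for illustration: the post only says the result carries a return code plus the stdout and stderr strings, so the field names `return_code`, `out` and `err`, the `run_one_compilation` helper and the "pretend-compiled" output are all hypothetical. The stub does no real compilation.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layouts; a real plugin would keep these private. */
typedef struct CompilerService { int placeholder; /* cached state would live here */ } CompilerService;
typedef struct Compiler { CompilerService *service; } Compiler;
typedef struct CompilationResult {
    int return_code;
    char *out; /* captured stdout text */
    char *err; /* captured stderr text */
} CompilationResult;

static char *copy_string(const char *s) {
    char *d = malloc(strlen(s) + 1);
    if (d)
        strcpy(d, s);
    return d;
}

CompilerService *compiler_init_service(void) {
    return calloc(1, sizeof(CompilerService));
}

Compiler *compiler_create_compiler(CompilerService *service) {
    assert(service != NULL);
    Compiler *c = malloc(sizeof *c);
    if (c)
        c->service = service;
    return c;
}

/* Stub: treats the last argument as the input file and pretends to compile it. */
CompilationResult *compiler_compile(Compiler *c, int argc, const char **argv) {
    assert(c != NULL);
    CompilationResult *r = calloc(1, sizeof *r);
    if (!r)
        return NULL;
    if (argc < 2) {
        r->return_code = 1;
        r->out = copy_string("");
        r->err = copy_string("error: no input files\n");
    } else {
        char buf[256];
        snprintf(buf, sizeof buf, "pretend-compiled %s\n", argv[argc - 1]);
        r->return_code = 0;
        r->out = copy_string(buf);
        r->err = copy_string("");
    }
    return r;
}

void compiler_free_compilation_result(CompilationResult *r) {
    if (!r)
        return;
    free(r->out);
    free(r->err);
    free(r);
}

void compiler_free_compiler(Compiler *c) { free(c); }
void compiler_free_service(CompilerService *s) { free(s); }

/* One compilation task: create a compiler, compile, report, clean up.
   Returns the compiler's return code, or -1 if allocation failed. */
int run_one_compilation(CompilerService *s, int argc, const char **argv) {
    Compiler *c = compiler_create_compiler(s);
    if (!c)
        return -1;
    CompilationResult *r = compiler_compile(c, argc, argv);
    int rc = -1;
    if (r) {
        rc = r->return_code;
        if (r->out)
            fputs(r->out, stdout);
        if (r->err)
            fputs(r->err, stderr);
    }
    compiler_free_compilation_result(r);
    compiler_free_compiler(c);
    return rc;
}
```

A build system would create the service once, call something like `run_one_compilation` from each worker thread, and free the service at shutdown. Note the ownership rule this sketch assumes: each result is freed before its compiler, and every compiler before the service.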
When will this be available in <my favorite compiler>?
Probably not soon; this is all slideware. There is no actual code to implement this (that I know of at least). The big problem here is that most compilers have not been written with this sort of usage in mind. They have global variables and other things hostile to usage as a shared library. Fixing all that to be thread safe and isolated is a lot of work. LLVM is probably the compiler that could most easily get this done since it has been designed to be used as a library from the beginning.
I had something working like this with https://github.com/stijnsanders/strato but it's very much a work in progress. Coming from years of work on Delphi, where linking is an afterthought, this parser adds to a "sphere" and uses 'early linking' where anything referenced must already be present in the "sphere", complete with type info.