Sunday, September 9, 2018

The compiler as a shared library

Since time immemorial, compilers have been run as standalone batch processes. If you have 50 files to compile, then you invoke the compiler 50 times, once on each file. Since each compilation is independent of all the others, the work can be parallelised perfectly. This seems like a simple and optimal solution.

But, as is commonly the case, this is not the whole truth. When compiling code there are many subtasks that are common to each individual compilation, and this causes a lot of duplicated effort. Perhaps the best known case of this is C++ templates. They are parsed and code generated for each file that uses them, yielding the same code in dozens of files. Then the linker comes along and throws all but one copy away. There are a bunch of other issues as well, which are discussed in this video from the LLVM developers' conference:

A problem of state preservation

One of the best known solutions to this problem is precompiled headers. They work roughly like this:
  1. Parse the contents of headers
  2. Dump compiler internal state to a file
  3. Load the file on each compiler invocation
This approach has two main problems. The first is that it requires someone to design and implement a full serialisation format for the compiler-internal data, which is a lot of tedious work that very few people will volunteer to do. The second is that the files need to be explicitly loaded from disk in every compilation process, which takes time, and the build system needs to tell the compiler how to get this done. The granularity is also fairly coarse.

Ideally we would like to preserve as much data between two compiler invocations as possible without needing to serialise it to disk. As discussed in the above video, one solution is to have a "compiler plugin".

Almost every build system currently works roughly like this (see the sketch after the list):
  1. Read build definition (such as a Ninja file)
  2. For each compilation, spawn a new compiler process and invoke the compiler executable
  3. Shutdown
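
As a rough illustration, a build tool in this model ends up doing something like the following sketch, spawning one compiler process per source file (the compiler name and arguments are placeholders):

#include <spawn.h>
#include <sys/wait.h>

extern char **environ;

/* Spawn one compiler process per source file and wait for each one.
   A real build tool would run these in parallel and track dependencies. */
static int compile_all(const char *compiler, const char **sources, int n) {
    for (int i = 0; i < n; ++i) {
        char *argv[] = { (char*)compiler, "-c", (char*)sources[i], NULL };
        pid_t pid;
        if (posix_spawnp(&pid, compiler, NULL, NULL, argv, environ) != 0)
            return -1;
        int status;
        if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            return -1;
    }
    return 0;
}
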
The proposed new model would go like this (no build system currently supports this, but adding it to e.g. Ninja would not be a massive undertaking; a sketch of step 2 follows the list):
  1. Read build definition
  2. dlopen the compiler shared library file
  3. For each compilation, create a new compiler object and invoke compilation using e.g. a thread pool
  4. Destroy the compiler objects and dlclose the file
  5. Shutdown
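
A minimal sketch of step 2, assuming the compiler ships as a file called libcompiler.so and exports the entry points proposed later in this post, could look like this:

#include <dlfcn.h>
#include <stdio.h>

typedef struct CompilerService CompilerService;
typedef CompilerService* (*compiler_init_service_fn)(void);

/* Load the compiler shared library and resolve its entry points by name.
   The library name and the entry point are assumptions for this sketch. */
void* load_compiler(compiler_init_service_fn *init_out) {
    void *lib = dlopen("libcompiler.so", RTLD_NOW | RTLD_LOCAL);
    if (!lib) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return NULL;
    }
    *init_out = (compiler_init_service_fn) dlsym(lib, "compiler_init_service");
    /* ... resolve the remaining functions the same way ... */
    return lib;   /* call dlclose(lib) once all compilations are done */
}
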
In this model all compilation jobs live in the same process, so they can coordinate work behind the scenes however they wish. This requires some tricky code with thread-safe caches and the like, but it is all internal to the compiler and never exposed. Even without caching, this makes a difference on platforms such as Windows, where process spawning is slow.
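
Purely as an illustration of the kind of internal machinery this implies (not any existing compiler's code), a shared cache guarded by a mutex might look something like this:

#include <pthread.h>
#include <string.h>

/* A compiler-internal cache so that concurrent compilation tasks within one
   process can reuse already-parsed headers. Illustrative only. */
typedef struct CacheEntry {
    char key[256];              /* e.g. a header path */
    void *parsed;               /* the already-parsed representation */
    struct CacheEntry *next;
} CacheEntry;

static CacheEntry *cache_head;
static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

void* cache_lookup(const char *key) {
    void *result = NULL;
    pthread_mutex_lock(&cache_lock);
    for (CacheEntry *e = cache_head; e; e = e->next) {
        if (strcmp(e->key, key) == 0) { result = e->parsed; break; }
    }
    pthread_mutex_unlock(&cache_lock);
    return result;
}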

The big question remaining here is the API to use. It should satisfy the following requirements:
  1. Must be ABI stable in the C sense
  2. Must be supportable on all compilers for all languages
  3. Must expose the full functionality of the compiler
  4. Must support an arbitrary number of compiler tasks within a single process

An API proposal for compiler invocation

On the face of it this seems like an impossible task. The API surface of a compiler is enormous and differs from compiler to compiler. However, all of them already expose a stable ABI: the command line argument arrays. Exploiting this allows us to create an API that supports all of the requirements above with only six functions.

First we initialise the library:

CompilerService* compiler_init_service();

Here CompilerService is an opaque handle to a state object. There is one of these per process and it holds (internally) all the cached state and related things. Then we create a compiler object, one per compilation task:

Compiler* compiler_create_compiler(CompilerService *service);

Now we can invoke the compilation:

CompilationResult* compiler_compile(Compiler *c, int argc, const char **argv);

This invocation matches the signature of the main function. Since we are not going through the shell or the kernel, we can pass an arbitrary number of arguments without needing to use response files, quote shell characters or any other nastiness. The return value contains the return code and the strings for stdout and stderr. A standalone compiler executable such as cl.exe could (in theory ;-) be implemented by just calling these functions and returning the results to the calling process.
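
As a sketch, such a driver might look roughly like the following. Here compiler_service.h is a hypothetical header declaring the functions above, and the field names of CompilationResult (return_code, stdout_text, stderr_text) are assumptions rather than part of the proposal:

#include <stdio.h>
#include "compiler_service.h"   /* hypothetical header declaring the API above */

int main(int argc, char **argv) {
    CompilerService *service = compiler_init_service();
    Compiler *c = compiler_create_compiler(service);
    CompilationResult *r = compiler_compile(c, argc, (const char **) argv);

    fputs(r->stdout_text, stdout);   /* assumed field names */
    fputs(r->stderr_text, stderr);
    int rc = r->return_code;

    /* Tear down with the deallocation functions introduced below. */
    compiler_free_compilation_result(r);
    compiler_free_compiler(c);
    compiler_free_service(service);
    return rc;
}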

The last things we need are the deallocation functions:

void compiler_free_compilation_result(CompilationResult *r);
void compiler_free_compiler(Compiler *c);
void compiler_free_service(CompilerService *s);
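
Putting it all together, a build tool could run many compilations inside a single process while sharing one CompilerService between them. A rough sketch with raw threads follows; a real tool would use a proper thread pool and inspect the results, and compiler_service.h is the same hypothetical header as above:

#include <pthread.h>
#include "compiler_service.h"   /* hypothetical header declaring the API above */

struct Job {
    CompilerService *service;
    int argc;
    const char **argv;
};

/* One compilation task: its own Compiler object, but a shared CompilerService
   so that cached state can be reused between tasks. */
static void* run_job(void *arg) {
    struct Job *job = arg;
    Compiler *c = compiler_create_compiler(job->service);
    CompilationResult *r = compiler_compile(c, job->argc, job->argv);
    compiler_free_compilation_result(r);
    compiler_free_compiler(c);
    return NULL;
}

int main(void) {
    const char *args1[] = { "cc", "-c", "a.c", "-o", "a.o" };
    const char *args2[] = { "cc", "-c", "b.c", "-o", "b.o" };

    CompilerService *service = compiler_init_service();
    struct Job jobs[] = { { service, 5, args1 }, { service, 5, args2 } };

    pthread_t threads[2];
    for (int i = 0; i < 2; ++i)
        pthread_create(&threads[i], NULL, run_job, &jobs[i]);
    for (int i = 0; i < 2; ++i)
        pthread_join(threads[i], NULL);

    compiler_free_service(service);
    return 0;
}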

When will this be available in <my favorite compiler>?

Probably not soon; this is all slideware. There is no actual code to implement this (that I know of at least). The big problem here is that most compilers have not been written with this sort of usage in mind. They have global variables and other things hostile to use as a shared library. Fixing all that to be thread-safe and isolated is a lot of work. LLVM is probably the compiler that could most easily get this done, since it has been designed to be used as a library from the beginning.