perjantai 9. elokuuta 2019

C++ modules with a batch compiler feasibility study

In our previous blog post an approach for building C++ modules using a batch compiler model was proposed. The post raised a lot of discussion on Reddit. Most of it was of the quality one would expect, but some interesting questions were raised.

How much work for compiler developers?

A fairly raised issue was that the batch approach means adding new functionality to the compiler. This is true, but not particulary interesting as a standalone statement. The real question is how much more work is it, especially compared to the work needed to support other module implementation methods.

In the batch mode there are two main pieces of work. The first is starting compiler tasks, detecting when they freeze due to missing modules and resuming them once the modules they require have been built. The second one is calculating the minimal set of files to recompile when doing incremental builds.

The former is the trickier one because no-one has implemented it yet. The latter is a well known subject, build systems like Make and Ninja have done it for 40+ years. To test the former I wrote a simple feasibility study in Python. What it does is generate 100 source files containing modules that call each other and then compiles them all in the manner a batch compiler would. There is no need to scan the contents of files, the system will automatically detect the correct build order or error out if it can not be done.

Note that this is just a feasibility study experiment. There are a lot of caveats. Please go through the readme before commenting. The issue you want to raise may already be addressed there. Especially note that it only works with Visual Studio.

The code that is responsible for running the compile is roughly 60 lines of Python. It is conceptually very simple. A real implementation would need to deal with threading, locking and possibly IPC, which would take a fair bit of work.

The script does not support incremental builds. I'd estimate that getting a minimal version going would take something like 100-200 lines of Python.

I don't have any estimates on how much work this would mean on a real compiler code base.

The difference to scanning

A point raised in the Reddit discussion is that there is an alternative approach that uses richer communication between the build system and the compiler. If you go deeper you may find that the approaches are not that different after all. It's more of a question of how the pie is sliced. The scanning approach looks roughly like this:

In this case the build system and compiler need to talk to each other, somehow, via some sort of a protocol. This protocol must be implemented by all compilers and all build tools and they must all interoperate. It must also remain binary stable and all that. There is a proposal for this protocol. The specification is already fairly long and complicated especially considering that it supports versioning so future versions may be different in arbitrary ways. This has an ongoing maintenance cost of unknown size. This also complicates distributed builds because it is common for the build system and the compiler to reside on different machines or even data centres so setting up an IPC channel between them may take a fair bit of work.

The architectural diagram of the batch compiler model looks like this:

Here the pathway for communicating module information is a compiler implementation detail. Every compiler can choose to implement it in the way that fits their internal compiler implementation the best. They can change its implementation whenever they wish because it is never exposed outside the compiler. The only public facing API that needs to be kept stable is the compiler command line, which all compilers already provide. This also permits them to ship a module implementation faster, since there is no need to fully specify the protocol and do interoperability tests. The downside is that the compiler needs to determine which sources to recompile when doing incremental builds.

Who will eventually make the decision?

I don't actually know. But probably it will come down to what the toolchain providers (GCC, Clang, Visual Studio) are willing to commit to.

It is my estimate (which is purely a guess, because I don't have first hand knowledge) that the batch system would take less effort to implement and would present a smaller ongoing maintenance burden. But compiler developers are the people who would actually know.

keskiviikko 7. elokuuta 2019

Building C++ modules, take N+1

Modules were voted in C++20 some time ago. They are meant to be a replacement for #include statements to increase build speeds and to also isolate translation units so, for example, macros defined in one file do not affect the contents of another file. There are three major different compilers and each of them has their own prototype implementation available (GCC documentation, Clang documentation, VS documentation).

As you would expect, all of these implementations are wildly different and, in the grand C++ tradition, byzantinely complicated. None of them also have a really good solution to the biggest problem of C++ modules, namely that of dependency tracking. A slightly simplified but mostly accurate description of the problem goes like this:

Instead of header files, all source code is written in one file. It contains export statements that describe what functions can be called from the outside. An analogy would be that functions declared as exported would be in a public header file and everything else would be internal and declared in an internal header file (or would be declared static or similar). The module source can not be included directly, instead when you compile the source code the compiler will output an object file and also a module interface file. The latter is just some sort of a binary data file describing the module's interface. An import statement works by finding this file and reading it in.

If you have file A that defines a module and file B that uses it, you need to first fully compile file A and only after the module interface file has been created can you compile file B. Traditionally C and C++ files can be compiled in parallel because everything needed to compile each file is already in the header files. With modules this is no longer the case. If you have ever compiled Fortran and this seems familiar, it's because it is basically the exact same architecture.

Herein lies the problem

The big, big problem is how do you determine what order you should build the sources in. Just looking at the files is not enough, you seemingly need to know something about their contents. At least the following approaches have toyed with:
  1. Writing the dependencies between files manually in Makefiles. Yes. Really. This has actually been but forth as a serious proposal.
  2. First scan the contents of every file, determine the interdependencies, write them out to a separate dependency file and then run the actual build based on that. This requires parsing the source files twice and it has to be done by the compiler rather than a regex because you can define modules via macros (at least in VS currently).
  3. When the compiler finds a module import it can not resolve, it asks the build system via IPC to generate one. Somehow.
  4. Build an IPC mechanism between the different compiler instances so they can talk to each other to determine where the modules are. This should also work between compilers that are in different data centers when doing distributed builds.
Some of these approaches are better than others but all of them fail completely when source code generators enter the picture, especially if you want to build the generator executable during the build (which is fairly common). Scanning all file contents at the start of the build is not possible in this case, because some of the source code does not yet exist. It only comes into existence as build steps are executed. This is hideously complicated to support in a build system.

Is there a better solution?

There may well be, though I'd like to emphasize that none of the following has actually been tested and that I'm not a compiler developer. The approach itself does require some non-trivial amount of work on the compiler, but it should be less than writing a full blown IPC mechanism and distributed dataflow among the different parts of the system.

At the core of the proposed approach is the realization that not all module dependencies between files are the same. They can be split into two different types. This is demonstrated in the following diagram that has two targets: a library and an executable that uses it.

As you can see the dependencies within each target can get fairly complicated. The dependencies between targets can be just as complicated, but they have been left out of the picture to keep it simple. Note that there are no dependency cycles anywhere in the graph (this is mandated by the module specification FWICT). This gives us two different kinds of module dependencies: between-targets module dependencies and within-targets module dependencies.

The first one of these is actually fairly simple to solve. If you complete all compilations (but not the linking step) of the dependency library before starting any compilations in the executable, then all library module files that the executable could possibly need are guaranteed to exist. This is easy to implement with e.g. a Ninja pseudotarget.

The second case is the difficult one and leads to all the scanning problems and such discussed above. The proposed solution is to slightly change the way the compiler is invoked. Rather than starting one process per input file, we do something like the following:

g++ <other args> --outdir=somedir [all source files of this target]

What this means conceptually is that the compiler needs to take all the input files and compile each of them. Thus file1.cpp should end up as somedir/file1.o and so on. In addition it must deal with this target's internal module interrelations transparently behind the scenes. When run again it must detect which output files are up to date and rebuild only the outdated ones.

One possible implementation is that the compiler may launch one thread per input file (but no more than there are CPUs available). Each compilation proceeds as usual but when it encounters a module import that it can not find, it halts and waits on, say, a condition variable. Whenever a compilation job finishes writing a module, it will signal all the other tasks that a new module is available. Eventually either all jobs finish or every remaining task is deadlocked because they need a module that can't be found anywhere.

This approach is similar to the IPC mechanism described on GCC's documentation but it is never exposed to any third party program. It is fully an internal implementation detail of the compiler and as such there are no security risks or stability requirements for the protocol.

With this approach we can handle both internal and external module dependencies reliably. There is no need to scan the sources twice or write complicated protocols between the compiler and the build system. This even works for generated sources without any extra work, which no other proposed approach seems to be able to do.

As an added bonus the resulting command line API is so simple it can be even be driven with plain Make.

Extra bits

This approach also permits one to do ZapCC style caching. Since compiler arguments for all sources within one target must be the same under this scheme (which is a good thing to do in general), imports and includes can be potentially shared between different compiler tasks. Even further, suppose you have a type that is used in most sources like std::vector<std::string>.  Normally the instantiated and compiled code would need to be written in every object file for the linker to eventually deduplicate. In this case, since we know that all outputs will go to the same target it is enough to write the code out in one object file. This can lead to major build footprint reductions. It should also reduce the linker's memory usage since there is a lot less debug info to manage. In most large projects linking, rather than compiling, is the slow and resource intensive step so making it faster is beneficial.

The module discussions have been going around in circles about either "not wanting to make the compiler a build system" or "not wanting to make the build system into a compiler". This approach does neither. The compiler is still a compiler, it just works with slightly bigger work chunks at a time. It does need to track staleness of output objects, though, which it did not need to do before.

There needs to be some sort of a load balancing system so that you don't accidentally spawn N compile jobs each of which spawns N internal work threads.

If you have a project that consists of only one executable with a gazillion files and you want to use a distributed build server then this approach is not the greatest. The solution to this is obvious: split your project into independent logical chunks. It's good design and you should do it regardless.

The biggest downside of this approach that I could come up with was that CCache will probably no longer work without a lot of work. But if modules make compilation 5-10× faster (which is a given estimate, there are no public independent measurements yet) then it could be worth it.

torstai 1. elokuuta 2019

Metafont-inspired font design using nonlinear constraint optimization and Webassembly

Modern fonts work by drawing the outline of each letter out by hand. This is a very labour-intensive operation as each letter has to be redrawn for each weight. Back in the late 70s Donald Knuth created METAFONT, which has a completely different approach that mimics the way letters are drawn by hand. In this system you specify a pen shape and a stroke path. The computer would then calculate what sort of a mark this stroke would leave on paper and draw the result. The pen shape as well as the stroke were defined as mathematical equations of type "the stroke shall begin so that its topmost part is exactly at x-height and shall end so that its bottom is at the baseline".

The advantage of this approach is that it is fully parametrizable. If you change the global x-height, then every letter is automatically recalculated. Outline fonts can't be slanted by shearing because it changes the strokes' widths. Parametric fonts do not have this issue. Different optical weights can be obtained simply by changing the size of the pen nib. You could even go as far as change the widths of individual letters during typesetting to make text justification appear even smoother.

METAFONT was never widely used for font design. The idea behind it was elegant, though, and is worth further examination, for example are there other ways of defining fonts parametrically and what sort of letter shapes would they produce. Below we describe one such attempt.

How would that look in practice?

In a nutshell you write a definition that mathematically specifies the shape:

Then the system will automatically create the "best possible" shape that fulfills those requirements:

Note that this output comes from the Python Scipy prototype version, not the Webassembly one. It calculates a slightly different shape for the letter and does not do the outline yet.

How does it work?

First the program parses the input file, which currently can specify only one stroke. Said stroke consists of one or more cubic splines one after the other. That is, the end point of spline n is the starting point of spline n+1. It then converts the constraints into mathematical form. For example if you have a constraint that a point must be on a diagonal line between (0, 0) and (1, 1), then its coordinates must be (t, t) where t is a free parameter with values in the range [0, 1]. The free parameters can be assigned any values and the resulting curve will still fulfill all the requirements.

The followup question is how do we select the free parameters to pick the most pleasing of all curves to quote Knuth's terminology. If you thought "deep learning" then you thought wrong. This is actually a well known problem called nonlinear optimization. There are many possible approaches but we have chosen to use the (draws a deep breath) Broyden–Fletcher–Goldfarb–Shanno algorithm. This is the default optimization method in Scipy. There is also a C port of the original Fortran implementation.

The last thing we need is a target function to minimize. There are again many choices (curve length, energy, etc). After some experimentation we chose the maximum absolute value of the second derivative's perpendicular component over the entire curve. It gave the smoothest end result. Having defined a target function, free variables and a solver method we can let the computer do what it does best: crunch numbers. The end result is written as SVG using TinyXML2.

There are known problems with the current evaluation algorithm. Sometimes it produces weird and non-smooth results. LBFGS seems to be sensitive to the initial estimate and there are interesting numerical problems when evaluating the target function when switching between two consecutive curves.

Try it yourself

The Webassembly version is available here. You can edit the definition text and then recompute the letter shape with the action button. Things to try include changing the width and height (variables w and h, respectively) or changing the location constraints for curve points.

The code can also be compiled and run as a native program. It is available in this Github repository.

Future work and things that need fixing

  • Firefox renders the resulting SVG files incorrectly
  • All mathy bits are hardcoded, optimization functions et al should be definable in the character file instead
  • Designing strokes by hand is not nice, some sort of a UI would be useful
  • The text format for specifying strokes needs to be designed
  • Should be able to directly specify some path as a straight line segment rather than a bezier
  • You should be able to specify more than one stroke in a character
  • Actual font file generation, possibly via FontForge
  • "Anti-drawing" or erasing existing strokes with a second stroke