Sunday, October 15, 2023

The road to hell is paved with good intentions and C++ modules

The major C++ compilers are starting to ship their modules implementations, so I figured I'd add more support for those in Meson. That resulted in this blog post. It will not be pleasant or fun. Should you choose to read it, you might want to keep your emergency kitten image reserve close at hand.

At first you run into all the usual tool roughness that you'd expect from new experimental features. For example, GCC generates dependency files that don't work with Ninja. Another is that Clang's module scanner does not support a -o command line argument for writing its results to a file; instead it dumps them to stdout. (If you are the sort of person who went "pfft, what a moron, he should learn to use shell redirection", then how would you like it if all compilers dumped their .o files to stdout?) The output of the scanner is not usable for actually building the code, mind you: it needs to be preprocessed with a different tool to convert its "high level" JSON output into "file based" dep files that Ninja understands.

Then you get into deeper problems, like the fact that currently the three major C++ compilers do not support a common file extension for C++ module sources. (Or for the output module binaries, for that matter.) Thus the build system must have magic code to inject the "compile this thing as that thing" arguments for every source. Or, in the case of CMake, the end user must manually specify the type of each source file, even when its extension already uniquely identifies it as a module. No, I am not kidding.

Then things get even worse, but first let's set a ground rule.

The basic axiom for command line tools

One could write a blog post as long as, or longer than, this one just explaining why the following requirement is necessary. I'm not going to, but instead just state it as an axiom.

For a given data processing tool, one must be able to determine all command line arguments the tool needs to do its job without needing to examine contents of its input files.

As a very simple example, suppose you have an image file format named .bob, which can contain either PNG or JPEG data, and a program that displays image files on screen. Invoking it as showimage image.bob would be good. Requiring users to invoke it as showimage image.bob --image-is-actually-a=png would be bad.

Every time you end up with a design that would violate this requirement, the correct thing to do is to change the design somehow so that the violation goes away. If you disagree with this claim, you are free to do so. Just don't send me email about it.

A brief overview of how build systems work

For systems that generate e.g. Makefiles or Ninja files, the main pipeline is roughly: parse the build definition, work out the compilation arguments, generate the build file and then hand the actual compilation over to the backend (Make or Ninja).

The important step here is generating the build file. A Ninja file is static, so when it is written we need to know a) the compilation arguments to use and b) the interdependencies between files. The latter cannot be known up front for things like Fortran or C++ modules. This is why Ninja has a feature called dyndeps. It is a way to call an external program that scans the sources to be built and then writes a simple Makefile-esque snippet describing the order in which things must be built. This is a simple and understandable step that works nicely for Fortran modules but is by itself insufficient for C++ modules.
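
For illustration (file names invented), a dyndep file saying that compiling M1 also produces a module file, and that M0's compilation must wait for that module file, could look like this:

ninja_dyndep_version = 1
build M1.o | built-modules/M1.pcm: dyndep
build M0.o: dyndep | built-modules/M1.pcm

Ninja reads this at build time and adds the extra edges to its build graph before running the corresponding compilations.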

Once more into the weeds, my friends!

As an example let's examine the Clang compiler. For compiling C++ modules it has a command line argument for specifying where it should put the output module file: -fmodule-output=somewhere/modulename.pcm. The output file name must be the same as the name of the module it exports (not strictly true, but sufficiently true for our purposes). Immediately we see the problem: when the Ninja file is first generated, we must already know which module said file exports. Or, in other words, we have to scan the contents of the file in order to know which command line arguments we need to pass when compiling it.
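
As a sketch (paths invented), a one-phase module compilation with Clang looks roughly like this:

clang++ -std=c++20 -fmodule-output=built-modules/M0.pcm -c M0.cppm -o M0.o

The .pcm path on that command line has to repeat the module name, which is only discoverable by reading M0.cppm.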

In reality even that is not sufficient. You need to process the contents of said file with its full set of compiler arguments (include paths, preprocessor defines and so on), and possibly more, because the following is perfectly valid code (or at least it compiles on Clang):

module;
#include<evil.h>
export module I_AM_A_DEFINE_GOOD_LUCK_FINDING_OUT_WHERE_I_COME_FROM;

In other words, in order to compile this source file you first need to parse it and all the sources it includes until you hit the export declaration, and then throw away the result. Simpler approaches don't work, because the only thing that can reliably parse a C++ source file is a full C++ compiler.

C++ has to support a lot of stupid things because of backwards compatibility. In this case that does not apply, because there is no existing code that would require this to keep working. Defining a module name with the preprocessor should just be made invalid (or at least ill-formed, no diagnostic required).

Even worse, now you get into problems with reconfiguration. You want to minimize the number of reconfigurations, because they are typically very slow compared to just rerunning Ninja. To be reliable, we must assume that any change in a module source file might change the name of the module it exports. That in turn might change the compiler flags needed to build the file, which means recreating the Ninja file, which can only be done by rerunning reconfigure.

Ninja does not support dynamic compiler argument generation. Which is good, because it makes things faster, simpler and more reliable. This does not stop CMake, which hacked it on top anyway. The compiler command they invoke looks like the following.

clang++ <some compiler flags> @path/to/somefile.modmap <more flags>

The @file syntax means "read the contents of the file and pretend that it contains command line arguments". The file itself is created at build time with a scanner program. Here's what its contents look like.

-x c++-module
-fmodule-output=CMakeFiles/modtest.dir/M0.pcm
-fmodule-file=M1=CMakeFiles/modtest.dir/M1.pcm
-fmodule-file=M2=CMakeFiles/modtest.dir/M2.pcm
-fmodule-file=M3=CMakeFiles/modtest.dir/M3.pcm
-fmodule-file=M4=CMakeFiles/modtest.dir/M4.pcm
-fmodule-file=M5=CMakeFiles/modtest.dir/M5.pcm
-fmodule-file=M6=CMakeFiles/modtest.dir/M6.pcm
-fmodule-file=M7=CMakeFiles/modtest.dir/M7.pcm
-fmodule-file=M8=CMakeFiles/modtest.dir/M8.pcm
-fmodule-file=M9=CMakeFiles/modtest.dir/M9.pcm

This file specifies the output file name and lists a module file for every other module in this build target (only M1 is actually used by M0, so the rest of the flags are superfluous). A file like this is created for each C++ source file, and it must remain on disk for the entire compilation. NTFS users, rejoice!

The solution that we have here is exceedingly clever. Unfortunately the word "clever" is used here in the pejorative sense, as in "aren't I super clever for managing to create this hideously complicated Rube Goldberg machine to solve a problem caused by people not communicating with each other". This is a "first beta" level solution. The one where you prove that something can be done and which you then improve to be actually good. But no, CMake has decreed module support as "done", so this is what you are going to be stuck with for the foreseeable future.

AFAICT this setup has been designed mostly by CMake devs. Which kind of makes you wonder. Why would a company that makes a lot of money consulting and training people on their product make the practical use of C++ modules really hard for competing projects to implement, thereby making it an even bigger burden to replace their software with something else? What kind of business goal could that possibly serve?

An interlude that is 100% unrelated to anything discussed above

There is a proposal for Sandia National Laboratories to fund Kitware to create a new front end language for CMake. Page 4 of the PowerPoint presentation on that page contains the cost estimates discussed below.

I have no way of knowing how large the code base in question is. The document mentions on page 10 that a port from Autotools to CMake cost $1.2 million, $500k of which was internal costs and $700k of which went to Kitware for implementing basic functionality missing in CMake. For comparison, the COCOMO estimate for Meson in its entirety is only $1.3 million. I don't know how much code they have, but I have ported several fairly hefty projects from one build system to another and none of them has come even close to $50k. The text does not say whether the $10 million includes internal costs as well as external (it probably does), but the $1 million figure seems to be purely external costs. The cost of updating existing CMake code to use the new syntax seems to be ignored (or that is my interpretation of the document, anyway).

Without knowing the true ratio of internal costs (training, recertification etc) over total costs it is hard to compare the two numbers. But just for reference, if one were to charge $100 an hour, $10 million would get you 100k hours. At 8 hours per day that means 12.5k days or 2500 weeks or 625 months or 56 years at 11 months per year. Even a million would last for over five years of full time effort. That's a lot of time to spend on converting build systems.

We now return to your regularly scheduled program.

Is there a better design?

Maybe. Let's start by listing the requirements:

  • All command lines used must be knowable a priori, that is, without scanning the contents of source files
  • Any compiler may choose to name its module files however it wants, but said mapping must be knowable from just the compiler name and version; in other words, it has to be documented
  • Changing source files must not cause, by itself, a reconfiguration, only a rescan for deps followed by a build of the changed files
  • The developer must not need to tell the build system which sources are of which module type; this must be deduced automatically without scanning the contents of source files (i.e. by using the proper file extension)
  • Module files are per-compiler and per-version and only make sense within the build tree of a single project
  • If module files are to be installed, that must be specified in a way that does not affect sources whose modules do not need installing.

Fortran modules already satisfy all of these.

We split requirements between the tools as follows.

  • It is the responsibility of the compiler to output files where told
  • It is the responsibility of the build system to invoke compilations in the correct order to satisfy module requirements
The immediate consequence of this is that the Ninja file must not contain any command line arguments with module names in them.

For now we assume a project having only one executable target. The scheme is extended to multiple targets later.

To get this working we need a special built modules subdirectory. It is the responsibility of the build system to guarantee that every source file within a given build target is given the same directory via a command line argument.

For Clang this would mean that instead of specifying -fmodule-output=moddir/M0.pcm meaning "write the module to this file" you'd say -fmodule-output-dir=moddir meaning "I don't care what the name of the module is, just write it in the given directory using your own naming scheme". That directory is also implicitly used as a module import directory so any source file can do import foo for any module foo defined in the same target without any additional command line arguments. Other targets' module import dirs can be added with a separate command line argument (not specified here).
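
To be clear, -fmodule-output-dir does not exist today; it is the flag this post wishes existed. With it, every compilation in a target could use identical module-related arguments, for example:

clang++ -std=c++20 -fmodule-output-dir=built-modules -c M0.cppm -o M0.o
clang++ -std=c++20 -fmodule-output-dir=built-modules -c M1.cppm -o M1.o

Neither command line mentions a module name, so both can be written into the Ninja file at configure time and never need to change when the sources do.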

With this setup we don't need anything else. The command line arguments are always the same, and the dependency scanner can detect when a file generates a module, determine what its output name and path are going to be, and generate Ninja dyndep info to make compilations happen in the correct order. This is, more or less, how Fortran modules work and also how Meson's WIP module builder currently works. It has the added benefit that the scanner only needs to invoke one process per build target, not one per source file. You can only do this if the build system enforces that all sources within a target have the same command line arguments. Meson does this. CMake does not.

The scheme listed above has been tested for small projects but not at any sort of scale. No fundamental blockers to large scale use are known at this time but they might exist.

Multiple targets and subprojects

If you have a module executable A that uses module library B, you obviously need to compile all module producers of B before compiling any module consumers of A. You also need to tell A where the module files of B are.

The former is easy to do with phony targets, if you are willing to tolerate that all compilations of B (but not its linking) need to happen before any compilations of A. That causes a minor build time hit, but since one of the main selling points of modules was faster build times, you should still come out ahead.
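
In Ninja terms this could be done with a phony target and order-only dependencies, roughly like this sketch (the cxx_compile rule name and file names are invented):

build B_compiled: phony B1.o B2.o
build A0.o: cxx_compile A0.cpp || B_compiled

The order-only dependency (||) makes all of B's compilations finish before A0.cpp is compiled, without marking A0.o dirty every time B's objects change.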

There are two solutions to the latter. The first one is that when you add a link dependency on B to A, you also add B's module output directory to the module input path of A. The other, and much simpler, solution is that the module output directory is shared between all build targets and subprojects within a single build tree. It could be called, for example, built-modules and sit at the root of the build tree. The only real downside is that if your project defines a module with the same name in two different source files, there would be a clash. But if that is the case, your build setup is already broken due to potential module-ODR violations, and bailing out is the sensible thing to do.

Monday, October 2, 2023

Could we make C arrays memory safe? Probably not, but let's try anyway

Preamble

This is a project that I have wanted to implement for a long time. However it has become quite clear that I don't have the time to do it. Thus you get this blog post instead. If someone wants to try to do this on their own, feel free. If you succeed, it would improve computer security by a fair bit.

Nothing in this blog post is actually new. Similar ideas have been posted in research papers such as this one. AFAICT no-one has done a full scale implementation for C.

Even if the scheme worked (it is entirely possible that it cannot be feasibly implemented), it would still not make C a memory safe language by any stretch of the imagination. There are many other sources of memory unsafety in C. But it would make it somewhat safer, which would be a good thing all things considered.

Let us begin

Suppose we have the following very simple C function:

int do_indexing(int *buf, const size_t bufsize, size_t index) {
    return buf[index];
}

Is this memory safe? Obviously not. There is no guarantee that the index variable points to a valid array entry (we assume that bufsize is always correct for simplicity). Let's do this then:

int do_indexing(int *buf, const size_t bufsize, size_t index) {
    if(index < bufsize)
      return buf[index];
    return -1;
}

Assuming bufsize is a valid size of the array, then yes, this is memory safe (size_t is unsigned and is used to simplify the checking logic, signed integers behave the same but need two checks). What we want to achieve is to make the first version emit a compiler warning (or error) whereas the latter would not. First we need to tell the compiler that bufsize defines the array size. The GCC page on function attributes does not seem to provide such a thing, so let's invent our own magic syntax for it:

int do_indexing(int *buf, 
    const size_t bufsize [[arraysize_of:buf]], 
    size_t index);

We're going to leave this out in subsequent code samples for simplicity. This is just an attribute, it does not affect ABI, code generation or anything else.

Now the compiler can in fact verify that the latter example is safe. When buf is dereferenced we know that the value of index is nonnegative and less than the size of the array.

Similarly the compiler can diagnose that the first example can lead to a buffer overflow, because the value of index can be anything.

In other words, now we have a system where the programmer has to write a formal proof that the index is valid before being allowed to use it in a dereferencing operation. This has roughly the same basic idea as the borrow checker in Rust, where the programmer needs to provide a formal proof that they are the only one holding an object before being allowed to mutate it.

Onwards to more examples

What this scheme requires is knowing each variable's domain, that is, the range of possible values it could have. Then, on every array dereference, the domain of the index must be provably contained within the valid indices of the array, i.e. [0, bufsize). Let's go through the sample function step by step.

int do_indexing(int *buf, const size_t bufsize, size_t index) {
    // domain of bufsize: unknown
    // domain of index: unknown
    if(index < bufsize)
      // domain of bufsize: not zero
      // domain of index: [0, bufsize)
      return buf[index]; // Index is formally valid.
    return -1;
}

After the check we know that bufsize is not zero (because there exists an integer smaller than it) and that index is valid.

The problem thus reduces to determining the worst possible domain a variable could hold. For example an if branch:

// Domain of index: [0, bufsize)
if(something())
    ++index;
// Domain of index: [0, bufsize+1), as the increment may or may not have happened
return array[index]; // Error, index not valid

Loops with a known amount of iterations follow from this by assuming the worst case for both bounds:

// Domain of index: [a, b]
for(int i=0; i<count; ++i) {
    if(something())
        ++index;
    else
        --index;
}
// Domain of index: [a-count, b+count]
// Detect underflow if a < count.
// Detect overflow if b > SIZE_MAX - count

A loop that runs an unknown number of times is surprisingly even simpler.

// Domain of index: [a, b)
while(keep_going()) {
    if(something())
        ++index;
    else
        --index;
}
// Domain of index: unknown

The checker does not need to be able to determine what the true domain is. It can merely throw up its hands and require the developer to add the necessary checks between this block and any subsequent array indexing operations.
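
Continuing the previous snippet, the developer would have to re-establish the domain explicitly before indexing again, along the lines of this sketch:

// Domain of index: unknown
if(index < bufsize)
    return buf[index]; // accepted, domain re-proven to be [0, bufsize)
return -1; // indexing without the check would be rejected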

If you want to get even fancier, you could create a second attribute for the index value, like so:

int do_indexing(int *buf, 
    const size_t bufsize [[arraysize_of:buf]], 
    size_t index [[arrayindex_of:buf]]
);

You would only be allowed to call this function after formally proving that the domain of index is [0, bufsize). This would give you array indexing operations that are both memory safe and fast, as the array dereferences can be done without any runtime overhead (i.e. bounds checks). In theory you could also leave out the middle argument and still have safe dereferencing operations, but a decent optimizer should be able to inline the function and optimize the argument away anyway.
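
A call site would then have to carry the proof itself, something like this sketch (the wrapper function is invented for illustration):

int lookup_or_default(int *buf, const size_t bufsize, size_t index) {
    if(index < bufsize)
        return do_indexing(buf, bufsize, index); // allowed, domain of index is [0, bufsize)
    return -1; // calling do_indexing here would be a compile time error
}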

That's it? Sadly no

Suppose the value of bufsize is not constant and you do this:

// Domain of index: [0, bufsize)
shrink_buffer_by_one(&buf, &bufsize);
return buf[index]; // NOT SAFE

If the size of the array can change, then the domain might no longer be correct. The checker would have to remember that the domain of index was proven against the value bufsize had earlier in the program, not the value it has now, and invalidate that proof when bufsize changes. A full implementation would get quite complicated quite quickly.
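
The fix, from the programmer's point of view, would again be to redo the check against the new size, as in this sketch:

// Domain of index: [0, bufsize)
shrink_buffer_by_one(&buf, &bufsize);
// Domain of index: unknown, bufsize has changed
if(index < bufsize)
    return buf[index]; // safe again, re-proven against the new bufsize
return -1;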