Monday, October 2, 2023

Could we make C arrays memory safe? Probably not, but let's try anyway

Preamble

This is a project that I have wanted to implement for a long time. However it has become quite clear that I don't have the time to do it. Thus you get this blog post instead. If someone wants to try to do this on their own, feel free. If you succeed, it would improve computer security by a fair bit.

Nothing in this blog post is actually new. Similar ideas have been posted in research papers such as this one. AFAICT no-one has done a full scale implementation for C.

Even if the scheme worked (it is entirely possible that it can not be feasibly implemented), it would still not make C a memory safe language by any stretch of the imagination. There are many other sources of memory unsafety in C. But it would make it somewhat safer, which would be a good thing all considered.

Let us begin

Suppose we have the following very simple C function:

int do_indexing(int *buf, const size_t bufsize, size_t index) {
    return buf[index];
}

Is this memory safe? Obviously not. There is no guarantee that the index variable points to a valid array entry (we assume that bufsize is always correct for simplicity). Let's do this then:

int do_indexing(int *buf, const size_t bufsize, size_t index) {
    if(index < bufsize)
      return buf[index];
    return -1;
}

Assuming bufsize is a valid size of the array, then yes, this is memory safe (size_t is unsigned and is used to simplify the checking logic, signed integers behave the same but need two checks). What we want to achieve is to make the first version emit a compiler warning (or error) whereas the latter would not. First we need to tell the compiler that bufsize defines the array size. The GCC page on function attributes does not seem to provide such a thing, so let's invent our own magic syntax for it:

int do_indexing(int *buf, 
    const size_t bufsize [[arraysize_of:buf]], 
    size_t index);

We're going to leave this out in subsequent code samples for simplicity. This is just an attribute, it does not affect ABI, code generation or anything else.

Now the compiler can in fact verify that the latter example is safe. When buf is dereferenced we know that the value of index is nonnegative and less than the size of the array.

Similarly the compiler can diagnose that the first example can lead to a buffer overflow, because the value of index can be anything.

In other words, now we have a system where the programmer has to write a formal proof that the index is valid before being allowed to use it in a dereferencing operation. This has roughly the same basic idea as the borrow checker in Rust, where the programmer needs to provide a formal proof that they are the only one holding an object before being allowed to mutate it.

Onwards to more examples

What this scheme requires is to know each variable's domain, that is, the range of possible values it could have. Then on every array dereference the domain of the index must not be wider than the domain of the array size variable. Let's go through the sample function step by step.

int do_indexing(int *buf, const size_t bufsize, size_t index) {
    // domain of bufsize: unknown
    // domain of index: unknown
    if(index < bufsize)
      // domain of bufsize: not zero
      // domain of index: [0, bufsize)
      return buf[index]; // Index is formally valid.
    return -1;
}

After the check we know that bufsize is not zero (because there exists an integer smaller than it) and that index is valid.

The problem thus reduces to determining the worst possible domain a variable could hold. For example an if branch:

// Domain of index: [0, bufsize)
if(something())
    ++index;
//Domain of index: [1, bufsize+1)
return array[index]; // Error, index not valid

Loops with a known amount of iterations follow from this by assuming the worst case for both bounds:

// Domain of index: [a, b]
for(int i=0; i<count; ++i) {
    if(something())
        ++index;
    else
        --index;
}
// Domain of index: [a-count, b+count]
// Detect underflow if a < count.
// Detect overflow if b >= MAX_SIZE_T - count

A loop that runs an unknown number of times is surprisingly even simpler.

// Domain of index: [a, b)
while(keep_going()) {
    if(something())
        ++index;
    else
        --index;
}
// Domain of index: unknown

The checker does not need to be able to determine what the true domain is. It can merely throw up its hands and require the developer to add the necessary checks between this block and any subsequent array indexing operations.

If you want to get even fancier, you could create a second pragma for the index value, like so:

int do_indexing(int *buf, 
    const size_t bufsize [[arraysize_of:buf]], 
    size_t index [[arrayindex_of:buf]]
);

You would only be allowed to call this function after formally proving that the domain of index is [0, bufsize). This would give you array indexing operations that are both memory safe and fast as it can do the array dereferences without any runtime overhead (i.e. bounds checking). In theory you could also leave out the middle argument and still have safe dereferencing operations but a decent optimizer should be able to inline the function and optimize it out.

That's it? Sadly no

Suppose the value of bufsize is not constant and you do this:

// Domain of index: [0, bufsize]
shrink_buffer_by_one(&buf, &bufsize);
return buf[index]; // NOT SAFE

If the size of the array can change, then the domain might not be correct. In this case you'd need to store that the max value is bufsize at an earlier point in the program, not the current value. A full implementation would get quite complicated quite quickly.

No comments:

Post a Comment