Monday, October 2, 2023

Could we make C arrays memory safe? Probably not, but let's try anyway

Preamble

This is a project that I have wanted to implement for a long time. However, it has become quite clear that I don't have the time to do it. Thus you get this blog post instead. If someone wants to try to do this on their own, feel free. If you succeed, it would improve computer security by a fair bit.

Nothing in this blog post is actually new. Similar ideas have been proposed in research papers such as this one. AFAICT no one has done a full-scale implementation for C.

Even if the scheme worked (it is entirely possible that it cannot be feasibly implemented), it would still not make C a memory-safe language by any stretch of the imagination. There are many other sources of memory unsafety in C. But it would make the language somewhat safer, which would be a good thing all things considered.

Let us begin

Suppose we have the following very simple C function:

int do_indexing(int *buf, const size_t bufsize, size_t index) {
    return buf[index];
}

Is this memory safe? Obviously not. There is no guarantee that the index variable points to a valid array entry (we assume that bufsize is always correct for simplicity). Let's do this then:

int do_indexing(int *buf, const size_t bufsize, size_t index) {
    if(index < bufsize)
      return buf[index];
    return -1;
}

Assuming bufsize is a valid size for the array, then yes, this is memory safe (size_t is unsigned, which keeps the check to a single comparison; signed integers behave the same but need two checks). What we want to achieve is to make the first version emit a compiler warning (or error) whereas the latter would not. First we need to tell the compiler that bufsize defines the array size. The GCC page on function attributes does not seem to provide such a thing, so let's invent our own magic syntax for it:

int do_indexing(int *buf, 
    const size_t bufsize [[arraysize_of:buf]], 
    size_t index);

We're going to leave this out in subsequent code samples for simplicity. This is just an attribute; it does not affect the ABI, code generation or anything else.

Now the compiler can in fact verify that the latter example is safe. When buf is dereferenced we know that the value of index is nonnegative and less than the size of the array.

Similarly the compiler can diagnose that the first example can lead to a buffer overflow, because the value of index can be anything.

In other words, now we have a system where the programmer has to write a formal proof that the index is valid before being allowed to use it in a dereferencing operation. This is roughly the same basic idea as the borrow checker in Rust, where the programmer needs to provide a formal proof that they hold the only reference to an object before being allowed to mutate it.
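
As noted above, a signed index would need two comparisons, because both ends of the valid range have to be established. A minimal sketch of what that might look like (the function name is made up for illustration):

#include <stddef.h>

int do_indexing_signed(int *buf, const size_t bufsize, ptrdiff_t index) {
    // Both bounds have to be proven: nonnegative and less than bufsize.
    if(index >= 0 && (size_t)index < bufsize)
        return buf[index];
    return -1;
}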

Onwards to more examples

What this scheme requires is knowing each variable's domain, that is, the range of possible values it could have. Then at every array dereference the domain of the index must lie within the valid range defined by the array size variable, i.e. [0, arraysize). Let's go through the sample function step by step.

int do_indexing(int *buf, const size_t bufsize, size_t index) {
    // domain of bufsize: unknown
    // domain of index: unknown
    if(index < bufsize)
      // domain of bufsize: not zero
      // domain of index: [0, bufsize)
      return buf[index]; // Index is formally valid.
    return -1;
}

After the check we know that bufsize is not zero (because there exists an integer smaller than it) and that index is valid.

The problem thus reduces to determining the worst possible domain a variable could have. For example, an if branch:

// Domain of index: [0, bufsize)
if(something())
    ++index;
// Domain of index: [0, bufsize+1), since the increment may not happen
return array[index]; // Error, index not valid

Loops with a known number of iterations follow from this by assuming the worst case for both bounds:

// Domain of index: [a, b]
for(int i=0; i<count; ++i) {
    if(something())
        ++index;
    else
        --index;
}
// Domain of index: [a-count, b+count]
// Detect possible underflow if a < count.
// Detect possible overflow if b > SIZE_MAX - count.

A loop that runs an unknown number of times is surprisingly even simpler.

// Domain of index: [a, b)
while(keep_going()) {
    if(something())
        ++index;
    else
        --index;
}
// Domain of index: unknown

The checker does not need to be able to determine what the true domain is. It can merely throw up its hands and require the developer to add the necessary checks between this block and any subsequent array indexing operations.
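
Something along these lines (a sketch, assuming bufsize still holds the array's size) would re-establish the domain and satisfy the checker:

// Domain of index: unknown
if(index < bufsize)
    return buf[index]; // Domain of index: [0, bufsize), formally valid again.
return -1;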

If you want to get even fancier, you could create a second attribute for the index value, like so:

int do_indexing(int *buf, 
    const size_t bufsize [[arraysize_of:buf]], 
    size_t index [[arrayindex_of:buf]]
);

You would only be allowed to call this function after formally proving that the domain of index is [0, bufsize). This would give you array indexing operations that are both memory safe and fast, as the array dereference needs no runtime overhead (i.e. bounds checking). In theory you could also leave out the middle argument and still have safe dereferencing operations, but a decent optimizer should be able to inline the function and optimize the argument away anyway.
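
A call site would then look something like this sketch (the wrapper function is made up for illustration); without the comparison, the call itself would be diagnosed:

int read_element(int *buf, size_t bufsize, size_t wanted) {
    if(wanted < bufsize)
        // Domain of wanted: [0, bufsize), so the precondition of
        // [[arrayindex_of:buf]] is formally satisfied here.
        return do_indexing(buf, bufsize, wanted);
    return -1;
}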

That's it? Sadly no

Suppose the value of bufsize is not constant and you do this:

// Domain of index: [0, bufsize)
shrink_buffer_by_one(&buf, &bufsize);
return buf[index]; // NOT SAFE

If the size of the array can change, then the domain might no longer be correct. In this case the recorded bound would have to refer to the value bufsize had at the earlier point in the program, not its current value. A full implementation would get quite complicated quite quickly.
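
One simple-minded approach (a sketch, not a full design) is for the checker to discard the old domain whenever bufsize is written to, forcing a fresh check before any further indexing:

// Domain of index: [0, bufsize)
shrink_buffer_by_one(&buf, &bufsize);
// Domain of index: unknown, the old proof was discarded when bufsize changed
if(index < bufsize)
    return buf[index]; // Valid again, checked against the new size.
return -1;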

Tuesday, September 19, 2023

Circles do not exist

Many logos, drawings and other graphical designs have the following shape in them. What is this shape?

If you thought: "Ah-ha! I'm smart and read the title of this blog post so I know that this is most definitely not a circle."

Well it is. Specifically it is a raster image of a circle that I created with the Gimp just for this use.

However, almost every "circle" you can see in printed media (and most purely digital ones too) is not, in fact, a circle. Why is this?

Since roughly the mid-80s, all "high quality" print jobs have been done either in PostScript or, nowadays almost exclusively, in PDF. The two use the same basic drawing model, which does not have a primitive for circles (or circular arcs). The only primitives they have are straight line segments, rectangles and Bézier curves. None of these can express a circle exactly. You can only approximate one, and the approximation is always slightly eccentric. The only way to create a proper circle is to have a raster image like the one above.
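
For the curious: the usual approach splits the circle into four quarter-arcs, each drawn as a single cubic Bézier whose inner control points sit roughly 0.5523 times the radius away from the arc's endpoints. Even then the curve deviates from a true circle by roughly 0.03% of the radius at its worst point. A small sketch of computing the control points for one quadrant:

#include <math.h>
#include <stdio.h>

int main(void) {
    const double r = 1.0;
    // The standard "kappa" constant: 4/3 * tan(pi/8), about 0.5522847.
    const double k = 4.0 / 3.0 * tan(acos(-1.0) / 8.0);
    // Cubic Bézier control points for the quarter-arc from (r, 0) to (0, r).
    printf("P0 = (%g, %g)\n", r, 0.0);
    printf("P1 = (%g, %g)\n", r, k * r);
    printf("P2 = (%g, %g)\n", k * r, r);
    printf("P3 = (%g, %g)\n", 0.0, r);
    return 0;
}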

Does this matter in the real world?

For printing, probably not. Almost nobody can tell the difference between a real circle and one approximated with just four Bézier segments. Furthermore, the human visual system is a bit weird: perfect circles look vertically elongated, so you have to make them slightly non-circular for people to consider them properly circular.

But suppose you want to use one of these things:

This is a laser cutter that takes its "print job" as a PDF file and uses its vector drawing commands to drive the cutting head. This means that it is impossible to use it to cut out a truly circular wheel. You'd need to put the output on a lathe and sand it down to be round so it actually functions as a wheel rather than as a vibration source.

Again one might ask whether this has any practical impact. In this case, again, probably not. But did you know that one of the use cases PDF is being considered for (and, based on Internet rumors, already being used for) is as an interchange format for CAD drawings? Now it suddenly starts to matter. If you have any component where a really accurate circular shape is vital (like pistons and their holes), suddenly all your components are slightly misshapen. Which would not be fun.

Extra bonus information

Even though it is impossible to construct a path that is perfectly circular, PDF does provide a way to draw a filled circle. Here is the relevant snippet from the PDF 2.0 spec, subsection 8.5.3.2:

If a subpath is degenerate (consists of a single-point closed path or of two or more points at the same coordinates), the S operator shall paint it only if round line caps have been specified, producing a filled circle centred at the single point.
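
In practice this means you can get a genuinely round filled dot by stroking a degenerate path with round caps and a fat line width. Something like this sketch (coordinates and width chosen arbitrarily):

1 J         % round line caps
20 w        % line width, which becomes the circle's diameter
100 100 m
100 100 l   % degenerate subpath: two coincident points
S           % per the spec, paints a filled circle of radius 10 at (100, 100)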

Who is willing to put money on the line that every PDF rendering implementation actually uses circles here rather than doing the simple thing of approximating them with Béziers?

Sunday, September 10, 2023

A logo for CapyPDF

The two most important things about any software project are its logo and mascot. Here is a proposal for both for CapyPDF.

As you can probably tell I'm not a professional artist, but you gotta start somewhere. The original idea was to have a capybara head which is wearing the PDF logo much like a bow tie around its ear. The gist of it should come across, though it did look much better inside my brain. The PDF squiggle logo is hard to mold to the desired shape.

The font is Nimbus Sans, which is one of the original PostScript Core Fonts. More precisely it is a freely licensed metrically compatible version of Helvetica. This combines open source with the history of PDF quite nicely.

Friday, September 1, 2023

CapyPDF 0.5 is out

There are no actual release notes, but a bunch of stuff got added. Code is here. Many of these were things needed by Inkscape for its new CMYK PDF exporter. More info about that can be found on DoctorMo's YouTube channel.

Wednesday, August 16, 2023

The least convenient way of defining an alpha value

PDF's drawing model is inherited from PostScript, which was originally designed in the early 80s. It works and is really nice, but the one glaring hole in it is the lack of transparency support. The original PDF spec from 1993 has no transparency support either; it was added in version 1.4, which came out a mere eight years later in 2001.

Setting drawing operation colours in PDF is fairly straightforward. You list the values you want and call the colour-setting command, like this:

1.0 0.1 0.4 rg    % Sets RGB nonstroke colour
0.2 0.8 0.0 0.2 K % Sets CMYK stroke colour

The property names for alpha in PDF are CA and ca, for stroking and nonstroking operations respectively. This would imply that you should be able to do this:

0.9 ca  % Sets nonstroke alpha? [no]
0.2 CA  % Sets stroke alpha?    [no]

But of course you can't. Instead the alpha value can only be set via a graphics state dictionary (an ExtGState), which is a top-level PDF object that you then need to create, use, link into the current page's resources and all that jazz.

/Alpha01State gs % Set graphics state to the dictionary with the given name
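
For illustration, the dictionary behind that name would be a separate object along these lines (the object number is arbitrary here), hooked up via the page's resource dictionary:

10 0 obj
<< /Type /ExtGState
   /ca 0.9   % nonstroking alpha
   /CA 0.2   % stroking alpha
>>
endobj

% In the page's /Resources:
/ExtGState << /Alpha01State 10 0 R >>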

This works fine for use cases such as compositing layers that do not themselves use transparency. In that case you render the layer to a separate XObject, create a graphics state dictionary with the desired composition parameters, apply it and then draw the layer XObject. It does not work at all in pretty much any other case.

The more observant among you might have already noticed that you need to create a separate graphics state dictionary for every transparency value used in your document. If you are, say, a vector graphics application and allow artists to adjust the transparency for each object separately using a slider (as you should), then it means that you may easily have hundreds or thousands of different alpha values in your document. Which means that you need to create hundreds or thousands of different transparency graphics state dictionaries if you want to preserve document fidelity.

Why is it implemented like this?

As is common, we don't really know (or at least I don't). The Finger of Speculation points towards the most usual of suspects: backwards compatibility.

Suppose an alpha-setting command had been added to the PDF spec. Trying to open documents containing it in older PDF renderers would then cause crashes due to the unknown command. Yes, even if the document had specified a newer minimum PDF version, because PDF renderers try to open and show documents anyway rather than popping up an error message saying "Document version is too new". OTOH adding new entries to a graphics state dictionary is backwards compatible, because a renderer simply ignores keys it does not understand. It then renders the document as if no transparency had ever been defined. The output is not correct, but it is at least mostly correct. Sadly the same trick can't be done in the command stream, because it is a structureless stream of tokens following The One True Way of RPN.

If the hypothesis above is true, then the inescapable conclusion is that no new commands can be added to the PDF command stream. Ever.

Friday, July 21, 2023

Creating a PDF/X-3 document with CapyPDF

The original motivation for creating CapyPDF was understanding how fully color-managed and print-ready PDF files are actually put together. A reasonable measure of this is being able to generate fully spec-conforming PDF/X-3 files. Most of the building blocks were already there, so this was mostly a question of exposing all that in Python and adding the missing bits.

I have now done this and it seems to work. The script itself can be found in CapyPDF's Git repo and the generated PDF document can be downloaded using this link. It looks like this in Acrobat Pro.

It is not graphically the most flashy, but it requires quite a lot of functionality:

  • The document is A4 size but it is defined on a larger canvas. After printing it needs to be cut into the final shape. This is needed as the document has a bleed.
  • To obtain this the PDF has a bleed box (the blue box) as well as a trim box (the green box). These are not painted in the document itself; Acrobat Pro just displays them for visualisation purposes.
  • All colors in graphical operations are specified in CMYK.
  • The image is a color-managed CMYK TIFF. It has an embedded ICC profile that is different from the one used in final printing. This is due to laziness, as I had this image lying around already. You don't want to do this in a real print job.
  • The heading has both a stroke and a fill.
  • The printer marks are just something I threw on the page at random.
Here is a screenshot of the document passing PDF/X-3 validation in Acrobat Pro.

Sunday, July 16, 2023

PDF transparency groups and composition

The PDF specification has the following image as an example of how to do transparent graphics composition.

This seems simple but actually requires quite a lot of functionality:

  • Specifying CMYK gradients
  • Setting the blend mode for paint operations
  • Specifying transparency group xobjects
  • Specifying layer composition parameters
I had to implement a bunch of new functionality to get this working, but here is the same image reproduced with CapyPDF. (output PDF here)

The only difference here is that out of laziness I used a simple two-color linear gradient rather than a rainbow one.

Nonportability

This is again one of those features that only Acrobat Reader can handle. I tried Okular, Ghostscript, Firefox, Chromium and Apple Preview and they all rendered the file incorrectly. There was no consistency; each one was broken in a different way.