Sunday, April 21, 2024

C is dead, long live C (APIs)

In the 80s and 90s the software development landscape was quite different from today (or so I have been told). Everything that needed performance was written in C and things that did not were written in Perl. Because computers of the time were really slow, almost everything was in C. If you needed both performance and fast development, you could write a C extension for Perl.

As C was the only game in town, anyone could use pretty much any library directly. The number of dependencies available was minuscule compared to today, but you could use all of them fairly easily. Then things changed, as they have a tendency to do. First Python took over from Perl. Then more and more languages started eroding C's dominant position. This led to a duplication of effort. For example, if you were using Java and wanted to parse XML (which was the hot new thing of its day), you needed an XML parser written in Java. Just dropping libxml into your Java source tree would not cut it (you could still use native code libraries, but most people chose not to).

The number of languages and ecosystems kept growing, and nowadays we have dozens of them. But suppose you want to provide a library that does something useful and you'd like it to be usable by as many people as possible. This is especially relevant for closed source libraries, but the same applies to open source ones. Above all, you do not want to write and maintain multiple implementations of the code in different languages. So what do you do?

Let's start by going through a list of programming languages and seeing what sort of dependencies each can use natively (i.e. the toolchain or stdlib provides the support out of the box rather than requiring an addon, code generator, IDL tool or the like):

  • C: C
  • Perl: Perl and C
  • Python: Python and C
  • C++: C++ and C
  • Rust: Rust and C
  • Java: Java and C
  • Lua: Lua and C
  • D: D, subset of C++ and C
  • Swift: Swift, Objective-C, C++ (eventually?) and C
  • PrettyMuchAnyNewLanguage: itself and C

The message is quite clear. The only thing they all have in common is C, so that is what you have to use. The alternative is maintaining one implementation per language and leaving the languages you do not explicitly support out in the cold.

So even though C as a language is (most likely) going away, C APIs are not. In fact, designing C APIs is a skill that might even see a resurgence as the language ecosystem fractures even further. Note that providing a library with a C API does not mean having to implement it in C. All languages have ways of providing libraries whose external API is compatible with C. As an extreme example, Visual Studio's C runtime libraries are nowadays written in C++.
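
For instance, a public header meant to be consumed from both C and other languages is typically wrapped in the usual extern "C" guard, so the implementation behind it can be C++ (or Rust, or anything else that can export C symbols) while the declarations remain plain C. A minimal sketch, with made-up names rather than any real library's header:

/* mylib.h */
#ifdef __cplusplus
extern "C" {
#endif

/* These declarations compile as C, but the definitions behind them
   can live in a C++ source file compiled into the same library. */
int mylib_version_major(void);
int mylib_do_something(const char *input);

#ifdef __cplusplus
}
#endif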

CapyPDF's design and things picked up along the way

One of the main design goals of CapyPDF was that it should provide a C API and be usable from any language. It should also (eventually) provide a stable API and ABI. This means that the ground truth of the library's functionality is the C header. This turns out to have design implications for the library's internals that would be difficult to retrofit after the fact.

Hide everything

Perhaps the most important declaration in a widely usable C header is this:

typedef struct _someObject SomeObject;

In C parlance this means "there is a struct type _someObject somewhere; create an alias for it called SomeObject". The caller can create pointers to structs of type SomeObject but do nothing else with them. This leads to the common "opaque struct" style of C API:

SomeObject *o = some_object_new();
some_object_do_something(o, "hello");
some_object_destroy(o);

This permits you to change the internal representation of the object while still maintaining a stable public API and ABI. Avoid exposing the internals of structs whenever possible, because once they are made public they can never be changed.
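
As a sketch of what the implementation side might look like (the names follow the example above and are not CapyPDF's actual API), the struct definition lives only in the source file, so its layout can change freely without touching the public header:

/* some_object.c */
#include "some_object.h"
#include <stdlib.h>
#include <string.h>

struct _someObject {
    char *name;     /* invisible to callers: fields can be added,      */
    int counter;    /* removed or reordered without breaking the ABI   */
};

SomeObject *some_object_new(void) {
    return calloc(1, sizeof(struct _someObject));
}

void some_object_do_something(SomeObject *o, const char *msg) {
    free(o->name);
    o->name = strdup(msg);
    ++o->counter;
}

void some_object_destroy(SomeObject *o) {
    if (!o)
        return;
    free(o->name);
    free(o);
}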

Objects exposed via pointers must never move in memory

This one is fairly obvious when you think about it. Unfortunately it means that if you want to give users access to objects that are stored in an std::vector, you can't do it with pointers, even though that is the natural way of doing things in C. Pushing more entries into the vector will eventually exceed its capacity, so the storage gets reallocated and the entries are moved to the new backing store. This invalidates all pointers.

There are several solutions to this, but the simplest one is to access those objects via type safe indices instead. They are defined like this:

typedef struct { int32_t id; } SomeObjectId;

This struct behaves "like an integer" in that you can pass it around by value like an int, but it does not implicitly convert to any other "integer" type.
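
In an API this might look roughly like the following. The names are hypothetical, chosen for illustration rather than taken from CapyPDF, and Document, DrawContext and ErrorCode stand for the library's own types:

#include <stdint.h>

typedef struct { int32_t id; } ImageId;
typedef struct { int32_t id; } FontId;

/* The library hands out indices into its internal storage instead of pointers. */
ErrorCode document_load_image(Document *doc, const char *path, ImageId *out_image);

/* Passing a FontId where an ImageId is expected is a compile error,
   which a bare int32_t parameter would happily allow. */
ErrorCode draw_context_draw_image(DrawContext *ctx, ImageId image);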

Objects must be destructible in any order

It is easy to write in the documentation that "objects of type X must be destroyed before any object Y that they use". Unfortunately, garbage collected languages do not read your docs and thus provide no guarantee whatsoever about object destruction order. When used from such languages, any object must be destructible at any time, regardless of the state of any other object.

This is the opposite of how modern languages want to work. In CapyPDF, page draw contexts in particular were originally done in an RAII style where they would submit their changes upon destruction. For an internal API this is nice and usable, but for a public C API it is not. The implicit action had to be replaced with an explicit function that adds the page and takes both object pointers (the draw context and the document) as arguments. This ensures that both must exist and be valid at the point of the call.
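
Sketched as a C API shape (the real CapyPDF function names differ; these are stand-ins):

/* Instead of the draw context committing the page in its destructor,
   the caller submits it with an explicit call. Both pointers must refer
   to live, valid objects at this point. */
ErrorCode document_add_page(Document *doc, DrawContext *ctx);

/* After submission, the two objects can be destroyed in any order,
   which is exactly what a garbage collector will do anyway. */
document_add_page(doc, ctx);
draw_context_destroy(ctx);
document_destroy(doc);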

Use transactionality whenever possible

It would be nice if all objects were immutable, but sadly that would mean you could not actually do anything. A library must provide ways for end users to create, mutate and destroy objects. When possible, try to do this with a builder object. That is, the user creates a "transactional change" that they want to make. They can call setters and such as much as they want, but these do not affect the "actual document"; all of the new state is isolated in the builder object. Once the user is finished they submit the change to the main object, where it is validated and either rejected or accepted as a whole. The builder object then becomes an empty shell that can be either reused or discarded.
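
A sketch of what this can look like from the caller's side, again with made-up names rather than CapyPDF's actual functions:

/* Accumulate the desired changes in an isolated builder object. */
PagePropertiesBuilder *builder = page_properties_builder_new();
page_properties_builder_set_mediabox(builder, 0, 0, 595, 842);
page_properties_builder_set_user_unit(builder, 1.0);

/* Nothing has touched the document so far. On submission the change set
   is validated and then applied or rejected as a whole. */
ErrorCode rc = document_set_page_properties(doc, builder);
page_properties_builder_destroy(builder);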

CapyPDF is an append only library. Once something has been "committed" it can never be taken out again. This is also something to strive towards, because removing things is a lot harder than adding them.

Prefer copying to sharing

When the library is given a piece of data, it makes a private copy of it. Otherwise it would need to coordinate the life cycle of the shared data with the caller, and that is where bugs lie. Copying costs some performance, but it makes a whole class of difficult bugs simply go away. In the case of CapyPDF the performance hit turned out not to be an issue, since most of the runtime is spent compressing the output with zlib.
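
A sketch of copy-on-input with hypothetical names; the point is that the caller's buffer can be freed or reused immediately after the call returns:

#include <stdlib.h>
#include <string.h>

typedef enum { ERR_NO_ERROR = 0, ERR_OUT_OF_MEMORY } ErrorCode;
typedef struct { char *text; } TextObject;

ErrorCode text_object_set_text(TextObject *t, const char *utf8, size_t length) {
    char *copy = malloc(length + 1);
    if (!copy)
        return ERR_OUT_OF_MEMORY;
    memcpy(copy, utf8, length);
    copy[length] = '\0';
    free(t->text);      /* drop any previous private copy */
    t->text = copy;     /* the library now owns its own data */
    return ERR_NO_ERROR;
}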

Every function call can fail, even those that can't

Every function in the library returns an error code, even those that have no way of failing, because circumstances can change in the future. Maybe some input that used to be unrestricted now needs to be validated, and you can't change the function signature because it would break the API. Thus every function returns an error code (except the one that converts an error code into an error string). Sadly this means that all "return values" must be passed via out parameters.

ErrorCode some_object_new(SomeObject **out_ptr);

This is not great, but such is life. 
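
In practice the calling convention looks something like this; error_code_to_string is a stand-in for whatever the library's actual string conversion function is called:

SomeObject *o = NULL;
ErrorCode rc = some_object_new(&o);
if (rc != 0) {   /* assuming 0 means success */
    fprintf(stderr, "Could not create object: %s\n", error_code_to_string(rc));
    return rc;
}
some_object_do_something(o, "hello");
some_object_destroy(o);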

Think of C APIs as "in-process RPC"

When designing the API of CapyPDF it was helpful to think of each call as a request to a remote endpoint somewhere out there on the Internet. This pushes you to design functions that are as high level as possible and to ignore as many implementation details as you can, almost as if the C API were a slightly cumbersome DSL.
