Wednesday, March 31, 2021

Never use environment variables for configuration

Suppose you need to create a function for adding two numbers together in plain C. How would you write it? What sort of an API would it have? One possible implementation would be this:

int add_numbers(int one, int two) {
    return one + two;
}

// to call it you'd do
int three = add_numbers(1, 2);

Seems reasonable? But what if it was implemented like this instead:

int first_argument;
int second_argument;

int add_numbers(void) {
    return first_argument + second_argument;
}

// to call it you'd do
first_argument = 1;
second_argument = 2;
int three = add_numbers();

This is, I trust you all agree, terrible. This approach is plain wrong, against all accepted coding practices and would get immediately rejected in any code review. It is left as an exercise to the reader to come up with ways in which this architecture is broken. You don't even need to look into thread safety to find correctness bugs.

And yet we have environment variables

Environment variables are exactly this: mutable global state. Envvars have some legitimate uses (such as enabling debug logging) but they should never, ever be used for configuring the core functionality of programs. Sadly they are used for this purpose a lot, and there are some people who think that this is a good thing. This causes no end of headaches due to weird corner, edge and even common cases.
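To make the parallel explicit, here is a minimal sketch (the variable name is invented for illustration) of what envvar-based configuration looks like from inside a program. It is the global-variable add_numbers all over again:

#include <cstdlib>
#include <cstring>

// Reading configuration from the environment means reading mutable
// global state, just like the second add_numbers above. Anything in
// the process (or its parent) may have set or changed the value.
bool debug_logging_enabled(void) {
    const char *value = std::getenv("SOME_APP_DEBUG"); // hypothetical name
    return value != nullptr && std::strcmp(value, "1") == 0;
}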

Persistence of state

For example, suppose you run a command line program that has some sort of persistent state.

$ SOME_ENVVAR=... some_command <args>

Then some time after that you run it again:

$ some_command <args>

The environment is now different. What should the program do? Use the old configuration that had the env var set or the new one where it is not set? Error out? Try to silently merge the different options into one? Something else?

The answer is that you, the end user, cannot know. Every program is free to do its own thing, and most do. If you have ever spent ages wondering why the exact same commands work when run from one terminal but not from another, this is probably why.
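As a sketch of the dilemma (the envvar and file names are invented for illustration), this is the decision every such program has to make for itself:

#include <cstdlib>
#include <fstream>
#include <string>

int main() {
    // Hypothetical: the previous run stored its configuration here.
    std::ifstream f("saved_config.txt");
    std::string stored;
    std::getline(f, stored);
    // getenv returns nullptr when the variable is not set.
    const char *current = std::getenv("SOME_ENVVAR");
    if (!current && !stored.empty()) {
        // The variable was set on the previous run but not on this one.
        // Use the stored value? Ignore it? Error out? Every program
        // invents its own policy, silently.
    }
    return 0;
}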

Lack of higher order primitives

An environment variable can only contain a single null-terminated stream of bytes. This is very limiting. At the very least you'd want to have arrays, but they are not supported. Surely that is not a problem, you say, you can always do in-band signaling. For example the PATH environment variable contains many directories separated by the : character. What could be simpler? Many things, it turns out.

First of all, the separator for paths is not always :. On Windows it is ;. More generally, every program is free to choose its own. A common choice is a space:

CFLAGS='-Dfoo="bar" -Dbaz' <command>

Except what if you need to pass a space character as part of the argument? Depending on the actual program, shell and the phase of the moon, you might need to do this:

ARG='-Dfoo="bar bar" -Dbaz'

or this:

ARG='-Dfoo="bar\ bar" -Dbaz'

or even this:

ARG='-Dfoo="bar\\ bar" -Dbaz'

There is no way to know which one of these is the correct form. You have to try them all and see which one works. Sometimes, as an implementation detail, the string gets expanded multiple times, so you get to quote the quote characters. Insert your favourite picture of Xzibit here.

For comparison, with JSON configuration files this entire class of problems would not exist. Every application would read the data in the same way, because JSON provides primitives to express these higher level constructs. In contrast, every time an environment variable needs to carry more information than a single untyped string, the programmer gets to create a new ad hoc data marshaling scheme, and if there's one thing that guarantees usability it's reinventing the square wheel.
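For example, a hypothetical JSON configuration for the CFLAGS case above (the key name is invented for illustration) could be:

{
    "cflags": ["-Dfoo=\"bar bar\"", "-Dbaz"]
}

Each array element is one complete argument, spaces and all, and every JSON parser reads it identically with no ad hoc escaping.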

There is a second, more insidious part to this. If a decision is made to configure something via an environment variable, the entire design goal changes. Instead of coming up with a syntax that is as good as possible for the given problem, the goal becomes producing a syntax that is easy to use when typing commands on the terminal. This reduces work in the immediate short term but increases it in the medium to long term.

Why are environment variables still used?

It's the same old trifecta of why things are bad and broken:

  1. Envvars are easy to add
  2. There are existing processes that only work via envvars
  3. "This is the way we have always done it so it must be correct!"
The first explains why even new programs add configuration options via envvars (no need to add code to the command line parser, so that's a net win right?).

The second makes it seem like envvars are a normal and reasonable thing as they are so widespread.

The third makes it all but impossible to improve things on a larger scale. Now, granted, fixing these issues would be a lot of work and the transition would unearth a lot of bugs but the end result would be more readable and reliable.

Monday, March 22, 2021

Writing a library and then using it as a Meson dependency directly from upstream Git

Meson has many ways of obtaining dependencies. The most common are pkg-config for prebuilt dependencies and the WrapDB for building upstream releases from source. A lesser-known way is that you can get dependencies [1] directly from the upstream project's Git repository. Meson will transparently download and build them for you. There do not seem to be many examples of this on the Internet, so let's see how one would both create and consume dependencies in this way.

The library

Rather than creating a throwaway library, let's instead make one that is actually useful. The C++ standard library does not have a full-featured way to split strings, meaning that every project needs to write its own. To simplify the design, we're going to steal shamelessly from Python. That is, when in doubt, try to behave as closely as possible to how Python's string splitting functions work. In addition:

  • support any data storage (i.e. all input parameters should be string views)
  • have helper functionality for mmap, so even huge files can be split without massive overhead
  • return types can be either efficient (string views to the input) or safe (strings with copied data)
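As a sketch, the public API could consist of declarations along these lines. The names and signatures are illustrative, not necessarily what the library ended up with:

#include <string>
#include <string_view>
#include <vector>

// Split on runs of whitespace, discarding empty substrings,
// like Python's str.split() with no arguments.
std::vector<std::string_view> split_ws(std::string_view input);

// Split on an explicit separator, keeping empty substrings,
// like Python's str.split(sep).
std::vector<std::string_view> split(std::string_view input,
                                    std::string_view separator);

// The same, but returning copies whose lifetime is independent
// of the input data.
std::vector<std::string> split_copy(std::string_view input,
                                    std::string_view separator);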

Once you start looking into the implementation it very quickly becomes clear why this functionality is not already in the standard library. It is quite tricky and there are many interesting things in Python's implementation that most people have never noticed. For example splitting a string via whitespace does this:

>>> ' hello world '.split()
['hello', 'world']

which is what you'd expect. But note that the only whitespace characters here are spaces. So what happens if we optimise the code and explicitly split only by space?

>>> ' hello world '.split(' ')
['', 'hello', 'world', '']

That's ... unexpected. It turns out that if you split by whitespace, Python silently removes empty substrings, and there is no way to get that behaviour when you specify your own split criterion. This seems like a thing a general solution should provide.
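A minimal sketch of a whitespace splitter with that behaviour (illustrative, not the library's actual code) could look like this:

#include <cctype>
#include <string_view>
#include <vector>

// Split on runs of whitespace, silently dropping empty substrings,
// mirroring Python's str.split() with no arguments.
std::vector<std::string_view> split_ws(std::string_view in) {
    std::vector<std::string_view> result;
    size_t i = 0;
    while (i < in.size()) {
        // Skip any leading whitespace.
        while (i < in.size() && std::isspace((unsigned char)in[i]))
            ++i;
        size_t start = i;
        // Consume one word.
        while (i < in.size() && !std::isspace((unsigned char)in[i]))
            ++i;
        if (i > start) // never emit empty substrings
            result.push_back(in.substr(start, i - start));
    }
    return result;
}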

Another common idiom in Python is to iterate over lines in a file with this:

for line in open('infile.txt'):
    ...

This seems like a thing that could be implemented by splitting the file contents on newline characters. That works for files whose line separator is \n but fails with DOS line endings of \r\n. Usually in string splitting the order of the input characters does not matter, but in this case it does: \r\n is a single logical newline, whereas \n\r is two [2]. Further, in Python the returned strings contain the line ending characters converted to \n, but this is not something we can do. Opening a DOS file should return string views into the original immutable data, yet the \r character would need to read as \n, and that could only be done by returning a modified copy rather than a view to the original data. This necessitates a behavioural difference to Python: the line ending characters are omitted from the returned lines.

This is the kind of problem that would be naturally implemented with coroutines. Unfortunately those are C++20 only, so very few people could use them, and there is not that much info online on how to write your own generators. So vectors of string_views it is, for now.
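A minimal sketch of that approach (again illustrative, not the library's actual code): each returned view covers one line, with the terminator, whether \n or \r\n, left out:

#include <string_view>
#include <vector>

// Split into lines, treating both \n and \r\n as a single logical
// newline. The returned views point into the input and do not
// include the line terminator characters.
std::vector<std::string_view> split_lines(std::string_view in) {
    std::vector<std::string_view> result;
    size_t start = 0;
    while (start < in.size()) {
        size_t nl = in.find('\n', start);
        if (nl == std::string_view::npos) {
            // Last line has no terminator at all.
            result.push_back(in.substr(start));
            break;
        }
        size_t end = nl;
        if (end > start && in[end - 1] == '\r')
            --end; // swallow the \r of a DOS \r\n pair
        result.push_back(in.substr(start, end - start));
        start = nl + 1;
    }
    return result;
}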

The implementation

The code for the library is available in this GitHub repo. For the purposes of this blog post, the interesting line is the one specifying the dependency information:

psplit_dep = declare_dependency(include_directories: '.')

This is the standard way for subprojects to set themselves up to be used. As this is a header-only library, the dependency only has an include directory.

Using it

A separate project that uses the dependency to implement the world's most bare bones CSV parser can be obtained here. The actual magic happens in the file subprojects/psplit.wrap, which looks like this:

[wrap-git]
directory = psplit
url = https://github.com/jpakkane/psplit.git
revision = head

[provide]
psplit = psplit_dep

The first section describes where the dependency can be downloaded and where it should be placed. The second section specifies that this repository provides one dependency named psplit and that its dependency information can be found in the subproject in a variable named psplit_dep.

Using it is simple:

psplit_dep = dependency('psplit')
executable('csvsplit', 'csvsplit.cpp',
    dependencies: psplit_dep)

When the main project requests the psplit dependency, Meson will try to find it, notice that a subproject provides it, and then download, configure and build the dependency automatically.

Language support

Even though we used C++ here, this works for any language supported by Meson. It even works for mixed language projects, so you can for example have a library in plain C and create bindings to it in a different language.

[1] As long as they build with Meson.

[2] Unless you are using a BBC Micro, though I suspect you won't have a C++17 compiler at your disposal in that case.

Friday, March 19, 2021

Microsoft is shipping a product built with Meson

Some time ago Microsoft announced a compatibility pack to get OpenGL and OpenCL running even on computers whose hardware does not provide native OpenGL drivers. It is basically OpenGL-over-Direct3D. Or that is at least my understanding of it; hopefully this description is sufficiently accurate not to cause audible groans from the devs who actually know what it is doing under the covers. More actual details can be found in this blog post.

An OpenGL implementation is a whole lot of work and writing one from scratch is a multi-year project. Instead of doing that, Microsoft chose the sensible approach of taking the Mesa implementation and porting it to work on Windows. Typically large corporations do this by the vendoring approach, that is, copying the source code inside their own repos, rewriting the build system and treating it as if it was their own code.

The blog post does not say it, but in this case that approach was not taken. Instead all the work was done in upstream Mesa and the end products are built with the same Meson build files [1]. This also goes for the final release that is available in the Windows Store. This is a fairly big milestone for the Meson project, as it is now provably mature enough that major players like Microsoft are willing to use it to build and ship end user products.

[1] There may, of course, be some internal patches we don't know about.

Wednesday, March 10, 2021

Mixing Rust into an existing C shared library using Meson

Many people are interested in adding Rust to their existing projects for additional safety. For example it would be convenient to use Rust for individual high-risk things like string parsing while leaving the other bits as they are. For shared libraries you'd need to be able to do this while preserving the external plain C API and ABI. Most Rust compilation is done with Cargo, but it is not particularly suited to this task due to two things.

  1. Integrating Cargo into an existing project's build system is painful, because Cargo wants to dominate the entire build process. It does not cooperate with these kinds of build setups particularly well.
  2. Using any Cargo dependency brings in tens or hundreds of dependency crates, including five different command line parsers, test frameworks and other deps that you don't care about and don't need but which take forever to compile.

It should be noted that the latter is not strictly Cargo's fault. It is possible to use it standalone without external deps. However, what seems to happen in practice is that all Cargo projects experience a dependency explosion sooner or later. Thus it would seem like there should be a less invasive way to merge Rust into an existing code base. Fortunately, with Meson there is.

The sample project

To see how this can be done, we created a simple standalone C project for adding numbers. The full source code can be found in this repository. The library consists of three functions:

adder* adder_create(int number);
int adder_add(adder *a, int number);
void adder_destroy(adder*);

To add the numbers 2 and 4 together, you'd do this:

adder *two_adder = adder_create(2);
int six = adder_add(two_adder, 4);
adder_destroy(two_adder);

As adding numbers is highly dangerous, we want to implement the adder_add function in Rust and leave the other functions untouched. The implementation in all its simplicity is the following:

#[repr(C)]
pub struct Adder {
    pub number: i32,
}

#[no_mangle]
pub extern "C" fn adder_add(a: &Adder, number: i32) -> i32 {
    a.number + number
}

The build setup

Meson has native support for building Rust. It does not require Cargo or any other tool, it invokes rustc directly. In this particular case we need to build the Rust code as a staticlib.

rl = static_library('radder', 'adder.rs',
                    rust_crate_type: 'staticlib')

In theory all you'd need to do, then, is to link this library into the main shared library, remove the adder_add implementation from the C side and you'd be done. Unfortunately it's not that simple. Because nothing in the existing code calls this function, the linker will look at it, see that it is unused and throw it away.

The common approach in these cases is to use link_whole instead of plain linking. This does not work, because rustc adds its own metadata files inside the static library. The system linker does not know how to handle those and will exit with an error. Fortunately there is a way to make this work. You can specify additional undefined symbol names to the linker. This makes it behave as if something in the existing code had called adder_add, and grabs the implementation from the static library. This can be done with an additional kwarg to the shared_library call.

link_args: '-Wl,-u,adder_add'
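Putting it together, the shared_library call might look roughly like this (a sketch based on the snippets above; the actual target and file names in the repository may differ):

clib = shared_library('adder', 'adder.c',
    link_with: rl,
    link_args: '-Wl,-u,adder_add')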

With this the goal has been reached: one function implementation is done in Rust while preserving both the API and the ABI, and the test suite passes as well. The resulting shared library file is about 1 kilobyte bigger than the plain C one (though if you build without optimizations enabled, it is a whopping 14 megabytes bigger).