Monday, March 22, 2021

Writing a library and then using it as a Meson dependency directly from upstream Git

Meson has many ways of obtaining dependencies. The most common are pkg-config for prebuilt dependencies and the WrapDB for building upstream releases from source. A lesser known way is getting dependencies [1] directly from the upstream project's Git repository. Meson will transparently download and build them for you. There do not seem to be many examples of this on the Internet, so let's see how one would both create and consume dependencies in this way.

The library

Rather than creating a throwaway library, let's instead make one that is actually useful. The C++ standard library does not have a full-featured way to split strings, meaning that every project needs to write its own. To simplify the design, we're going to steal shamelessly from Python. That is, when in doubt, try to behave as closely as possible to how Python's string splitting functions work. In addition:

  • support any data storage (i.e. all input parameters should be string views)
  • have helper functionality for mmap, so even huge files can be split without massive overhead
  • return types can be either efficient (string views to the input) or safe (strings with copied data)

Once you start looking into the implementation, it very quickly becomes clear why this functionality is not already in the standard library. It is quite tricky, and there are many interesting things in Python's implementation that most people have never noticed. For example, splitting a string on whitespace does this:

>>> ' hello world '.split()
['hello', 'world']

which is what you'd expect. But note that the only whitespace characters in the input are spaces. So what happens if we optimise the code and explicitly split only by space?

>>> ' hello world '.split(' ')
['', 'hello', 'world', '']

That's ... unexpected. It turns out that if you split by whitespace, Python silently removes empty substrings, whereas if you specify your own split criterion it keeps them. There is no way to switch either behaviour. This seems like a thing a general solution should provide.
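As a rough illustration of what such a switch could look like, here is a minimal Python sketch. The function name split_any and its keep_empty parameter are made up for this example and are not the psplit API. It splits on any character in a set of separators and lets the caller decide whether empty pieces are kept.

```python
def split_any(text, separators=" \t\n\r\x0b\x0c", keep_empty=False):
    """Split text on any single character found in separators.

    Hypothetical helper for illustration only. With keep_empty=False
    empty pieces are dropped (like str.split() with no argument); with
    keep_empty=True they are kept (like str.split(' ')).
    """
    pieces = []
    start = 0
    for i, ch in enumerate(text):
        if ch in separators:
            # Emit the piece before this separator, unless it is empty
            # and the caller asked for empty pieces to be dropped.
            if keep_empty or i > start:
                pieces.append(text[start:i])
            start = i + 1
    # Whatever follows the last separator is the final piece.
    if keep_empty or start < len(text):
        pieces.append(text[start:])
    return pieces
```

With this, split_any(' hello world ', ' ') drops the empty pieces, while passing keep_empty=True reproduces the result of ' hello world '.split(' ').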

Another common idiom in Python is to iterate over lines in a file with this:

for line in open('infile.txt'):
    ...

This seems like a thing that could be implemented by splitting the file contents with newline characters. That works for files whose line ending is \n but fails with DOS line endings of \r\n. Usually in string splitting the order of the separator characters does not matter, but in this case it does: \r\n is a single logical newline, whereas \n\r is two [2]. Further, in Python the returned strings contain the line ending characters converted to \n, but this is not something we can do. Splitting a DOS file should return string views to the original immutable data, yet converting would require the \r character to be a \n instead, which is only possible by returning a modified copy rather than a view. This necessitates a behavioural difference to Python: the line ending characters are omitted from the results.
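The rules described above can be sketched in a few lines of Python. This is only an illustration of the intended behaviour, not the library's code: the name split_lines is invented here, and treating a lone \r as a newline is an assumption drawn from the remark that \n\r counts as two newlines.

```python
def split_lines(data):
    """Return the lines of data without their line ending characters.

    Illustrative sketch: "\\r\\n" counts as one logical newline, while
    a lone "\\n" or a lone "\\r" (an assumption) each end a line.
    """
    lines = []
    start = 0
    i = 0
    n = len(data)
    while i < n:
        ch = data[i]
        if ch == "\n" or ch == "\r":
            lines.append(data[start:i])
            # Order matters: "\r" directly followed by "\n" is a single
            # line ending, so consume both characters together.
            if ch == "\r" and i + 1 < n and data[i + 1] == "\n":
                i += 1
            start = i + 1
        i += 1
    # A final line that lacks a terminating newline is still a line.
    if start < n:
        lines.append(data[start:])
    return lines
```

Because no returned piece contains a line ending, every piece is a plain sub-range of the input, which is what makes a string view return type possible in C++.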

This is the kind of problem that would be naturally implemented with coroutines. Unfortunately those are C++20 only, so very few people could use the library, and there is not much information online on how to write your own generators. So vectors of string_views it is for now.
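For comparison, this is roughly what the lazy approach looks like in a language that already has generators. The Python sketch below uses a made-up name, iter_split, and yields (start, end) offsets into the input instead of copies, which is the closest Python analogue to handing out string views one at a time.

```python
def iter_split(text, sep):
    """Lazily yield (start, end) offsets of the pieces separated by sep.

    Hypothetical helper for illustration. Nothing is copied and the
    input is only scanned as far as the consumer asks for.
    """
    if not sep:
        raise ValueError("empty separator")
    start = 0
    while True:
        end = text.find(sep, start)
        if end == -1:
            # No more separators: the rest of the input is the last piece.
            yield start, len(text)
            return
        yield start, end
        start = end + len(sep)
```

A caller that wants actual strings can slice them out on demand, for example [text[s:e] for s, e in iter_split(text, ',')], and one that only needs the first field can stop after a single next() call.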

The implementation

The code for the library is available in this GitHub repo. For the purposes of this blog post, the interesting line is this one specifying the dependency information:

psplit_dep = declare_dependency(include_directories: '.')

This is the standard way subprojects set themselves up to be used. As this is a header only library, the dependency only has an include directory.

Using it

A separate project that uses the dependency to implement the world's most bare bones CSV parser can be obtained here. The actual magic happens in the file subprojects/psplit.wrap, which looks like this:

[wrap-git]
directory = psplit
url = https://github.com/jpakkane/psplit.git
revision = head

[provide]
psplit = psplit_dep

The first section describes where the dependency can be downloaded and where it should be placed. The second section specifies that this repository provides one dependency named psplit and that its dependency information can be found in the subproject in a variable named psplit_dep.

Using it is simple:

psplit_dep = dependency('psplit')
executable('csvsplit', 'csvsplit.cpp',
    dependencies: psplit_dep)

When the main project requests the psplit dependency, Meson will try to find it, notice that a subproject provides it and then download, configure and build the dependency automatically.

Language support

Even though we used C++ here, this works for any language supported by Meson. It even works for mixed language projects, so you can for example have a library in plain C and create bindings to it in a different language.

[1] As long as they build with Meson.

[2] Unless you are using a BBC Micro, though I suspect you won't have a C++17 compiler at your disposal in that case.
