Wednesday, February 28, 2018

On the unoptimalities of language specific build systems

A fairly big recent trend has been the emergence of new programming languages that are meant to be compiled into machine code. The silent (and sometimes not so silent) goal of these languages has been to replace C and C++ as the dominant systems programming languages.

All of these languages come with their own build system and dependency management optimised for that particular language. This makes sense as having a good developer experience is important and not having 20-30 years of legacy to carry with you means you can design and develop slick systems relatively easily. But, as always, there is a downside. Perhaps the main issue comes up pretty quickly when trying to combine said code with projects in other languages.

A common approach is for the programming language in question to bundle up all its dependencies as source in a big clump. Then the advocates will say that "it's simple, just call our build system from yours and it gets built". This seems simple but it uses the weaseliest of all weasel words: just. Whenever someone tells you to "just" do something, what they are almost always doing is trying to trivialise away the hardest part of the entire operation. So it is here as well.

When could it work?

There is one case where this approach works without problems. That is when the dependency builds into a single library with a C interface and it also ships the header and a pkg-config file to use it. This case is indistinguishable from a plain C library so it will work exactly the same. The dependency can be provided as a system package, built as a dependency in a Flatpak manifest, or supplied in any other similar way.
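To make this case concrete, here is a minimal Python sketch that renders the kind of pkg-config file such a dependency would ship. The library name `foo`, the version and the install prefix are hypothetical; only the field names (Name, Description, Version, Libs, Cflags) come from the pkg-config file format.

```python
def render_pc_file(name, version, prefix="/usr/local"):
    """Return the text of a minimal pkg-config file for a C-interface library."""
    lines = [
        f"prefix={prefix}",
        # ${prefix}, ${libdir} and ${includedir} are expanded by pkg-config itself.
        "libdir=${prefix}/lib",
        "includedir=${prefix}/include",
        "",
        f"Name: {name}",
        f"Description: {name} exposed through a C interface",
        f"Version: {version}",
        f"Libs: -L${{libdir}} -l{name}",
        "Cflags: -I${includedir}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical library "foo"; the implementation language does not show up anywhere.
    print(render_pc_file("foo", "1.0"))
```

A consumer in any build system can then ask `pkg-config --cflags --libs foo` for the flags, without knowing which language or build system produced the library.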

Unfortunately this system breaks down the second you want to do anything else. The most common requirement is to build all dependencies from source in a single build step. This is necessary on any platform that does not have a concept of a "system" package manager. Many people also want to do this on Linux systems to, for example, build their project's trunk against their dependencies' trunks. This is where things fall down.

The myth of the build dir

Most people probably haven't thought about the build directory of their builds. The most common conception is that the build system just (there's that word again) compiles source code into object files and then targets and that the installation step merely copies the files out to the staging directory. This is not true in the slightest.

Build systems need to do a whole lot of stuff to make things workable directly from the build tree. Every build system does it slightly (and sometimes massively) differently. More importantly the way each build system does it is not stable. They are allowed to, and will, change the way the build tree is laid out at any time. Nothing inside the build tree is stable, not file formats, not directory layouts, nothing.

The problem with building source code with two different build systems in a single build is that eventually they need to work together. Libraries need to be linked. Sources need to be generated. Executables need to be run. That means joining two different completely unstable elements together. The simplest problem in this space is about file layouts. Every build system expects a certain layout for the files it manages. This is usually very different from other build systems. Thus in order to work, there would need to be a way for every build system to be told to adapt to a different system's file layout when run as its subtask.

This is a challenging thing to request, because it takes a lot of menial work that build systems have traditionally (or ever, actually) been unwilling to do. Guessing the subtask's layout and hoping that it does not change might work for some time, but it then breaks for the slightest of reasons. The problems only get harder from there.
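The fragility of guessing can be sketched in a few lines of Python. This is an illustration only and does not model any real build system: the directory names `out/lib` and `out/release/lib` and the file `libdep.a` are made up to stand in for an inner build system's private layout.

```python
import tempfile
from pathlib import Path

# Layout the outer build system *assumes* the inner one uses (hypothetical).
GUESSED_LIB = Path("out") / "lib" / "libdep.a"


def inner_build(build_dir, layout):
    """Stand-in for the inner build system: writes its output wherever it likes."""
    target = Path(build_dir) / layout / "libdep.a"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(b"")
    return target


def outer_link(build_dir):
    """Stand-in for the outer build system: looks only at the guessed path."""
    lib = Path(build_dir) / GUESSED_LIB
    if not lib.exists():
        raise FileNotFoundError(f"expected {GUESSED_LIB} in the inner build tree")
    return lib


def try_build(layout):
    """Run the inner and outer steps in a fresh directory; report whether linking worked."""
    with tempfile.TemporaryDirectory() as build_dir:
        inner_build(build_dir, layout)
        try:
            outer_link(build_dir)
            return True
        except FileNotFoundError:
            return False


if __name__ == "__main__":
    print(try_build("out/lib"))          # layout matches the guess
    print(try_build("out/release/lib"))  # inner tool reorganised its build tree
```

Nothing in the inner build system was "broken" by the second layout. It merely exercised its right to rearrange its own build tree, and the outer build stopped working.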

N^2 manual work algorithms are awesome!

Even if this would work (and it does not) the next problem comes from scaling up. You can only "just call" from one build system to another if someone has taken the time to make one understand the other. This is simple for two build systems: you need to write two integrations, one in each direction. But suppose we live in a world where many of the common C libraries in use today have been replaced by implementations in other languages. If you were doing cross platform mobile development then you could have C, C++, Java, D, Rust, Go and Swift in the same project.

Seven languages means seven different build systems and possibly more since C and C++ commonly have more than one dominant build system. This means reading and understanding seven different build system syntaxes and mental models. If you want to combine those freely it means writing 7 x 6 = 42 different build system integrations (one in each direction for every pair), which must, lest it be forgotten, combine the unstable innards of all of these. And then it gets worse.
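The growth can be checked with a short Python sketch. It assumes exactly one build system per language and counts one integration per direction for each pair, as in the two-system case. Leaving out the pointless integration of a build system with itself gives N x (N - 1), which still grows as N^2.

```python
from itertools import permutations


def count_integrations(build_systems):
    """Count ordered (caller, callee) pairs of distinct build systems."""
    return sum(1 for _ in permutations(build_systems, 2))


# The languages from the mobile development example, one build system each (an assumption).
LANGUAGES = ["C", "C++", "Java", "D", "Rust", "Go", "Swift"]

if __name__ == "__main__":
    for n in range(2, len(LANGUAGES) + 1):
        print(n, count_integrations(LANGUAGES[:n]))
```

Every additional build system adds an integration to and from each one already present, so the eighth would cost 14 more.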

Since every language has its own package manager and dependency downloader, you now have up to seven package managers in your project. Actually no, that is a lie.

The tangled web of lies and deceit dependencies

When talking about dependencies between projects in different languages, most people usually mean a dependency graph like this.

That is, there is one dependency of a single language and a second one of a different language that uses it. For this simple case most things are feasible. But let's see what happens when we add just one more project.

Here we have project 1 using language 1. It has a dependency on project 2 in language 2. However project 2 has an internal dependency on project 3 which is also written in language 1. The question now becomes: how should this be built?

Since languages 1 and 2 use their own build tool and package manager, the two outermost projects don't know that they are being built as part of the same project. Language 2 completely hides its dependency, as it should. The two projects need to work independently. This means that each one of them must determine its dependencies in isolation. If they download their dependencies during configuration time then for each build setup you are accessing the dependency provider twice. Doing more dependency resolutions than you have languages in your project seems suboptimal.
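A small Python sketch of the three-project chain makes the count explicit. The project and language names are placeholders taken from the description above, and the model is deliberately crude: it assumes a new, isolated resolution starts at the top-level project and again every time a dependency edge crosses into a different language.

```python
from collections import Counter

# project name -> (language, direct dependencies); placeholder names.
PROJECTS = {
    "project1": ("lang1", ["project2"]),
    "project2": ("lang2", ["project3"]),
    "project3": ("lang1", []),
}


def count_resolutions(root, projects):
    """Count isolated dependency resolutions per language.

    The tool on the far side of a language boundary cannot see what was
    already resolved on the near side, so it has to start from scratch.
    """
    counts = Counter()

    def visit(name, parent_lang):
        lang, deps = projects[name]
        if lang != parent_lang:
            counts[lang] += 1
        for dep in deps:
            visit(dep, lang)

    visit(root, None)
    return dict(counts)


if __name__ == "__main__":
    print(count_resolutions("project1", PROJECTS))
```

For this chain the model reports two resolutions for language 1 and one for language 2: three resolutions in a project with only two languages.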

The other approach to this is usually called vendoring. In this each project in a language is only used as a tarball and it embeds all its own dependencies as source code. This seems like a working solution but it's not really. Many modern languages go the NPM route where it is considered good practice to have many small dependencies. It is not uncommon for medium or even small projects to have 50+ dependencies. This leads to problems such as these:

Here project 1 depends on two different projects that are both implemented in language 1. Just like above these two projects don't know of each other because their dependency chain goes via language 2 that hides it. Both of these projects have their internal dependencies embedded so they can be built from scratch without problems.

The problem here is that due to basic popularity and probability theory, the embedded dependencies of these two projects have many of the same dependencies. The dependencies might have the same versions or they might differ. If they both end up in the toplevel executable you get, depending on your toolchain and the phase of the moon, either a working binary or the nastiest of linker bugs to fix.

Even if this yields a working program there is a big downside: compilation takes up to twice as long because you have to compile the same dependencies twice in different but isolated parts of the build tree. As a rough approximation this means that adding a dependency to a dependency graph like this goes from being O(1) more work to being O(N) more work, because dependency graphs cannot be deduplicated if there is a dependency of a different language between the two. It is left as an exercise to the reader to visualize what this would look like on a huge project such as Chromium.
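Both problems, clashing versions and duplicated compilation, can be sketched in Python. The project names, dependency names and version numbers below are invented for illustration; the only thing taken from the text is the shape of the situation, two vendored projects that cannot see each other.

```python
# Each vendored project embeds its own copies of its dependencies (hypothetical data).
VENDORED = {
    "project_a": {"strutil": "1.2", "logging": "0.9", "json": "2.0"},
    "project_b": {"strutil": "1.3", "logging": "0.9", "json": "2.0"},
}


def analyse(vendored):
    """Return (version conflicts, builds done in isolation, distinct dependency names)."""
    versions = {}
    for deps in vendored.values():
        for name, version in deps.items():
            versions.setdefault(name, set()).add(version)

    # Dependencies embedded with more than one version: candidates for linker trouble.
    conflicts = {name: sorted(v) for name, v in versions.items() if len(v) > 1}
    # Every embedded copy is compiled separately because the copies are invisible to each other.
    isolated_builds = sum(len(deps) for deps in vendored.values())
    # What a single shared dependency graph would need to build, one version per name.
    distinct_names = len(versions)
    return conflicts, isolated_builds, distinct_names


if __name__ == "__main__":
    conflicts, isolated, distinct = analyse(VENDORED)
    print("conflicting versions:", conflicts)
    print("builds when isolated:", isolated)
    print("builds if deduplicated:", distinct)
```

With this data six dependency builds happen where three would do, and `strutil` ends up in the final link in two versions.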

The simple solution

There is a simple solution to this problem and it is very popular among language zealots: reducing the number of languages to one by claiming that in the future everything will be written in their own favourite language. It does not matter what the growth rate of complexity is if it will only be evaluated for the value 1.

The reduction of programming languages to one is expected to happen any minute now, immediately after Mr Godot brings us the news on Eastasia's surrender.

The real problem

All of this boils down to the fact that language specific build systems are two opposing things at the same time. They are both a very comfortable gilded cage and an extremely isolating silo. They fertilise and promote cooperation within their own group but make things a lot harder for cooperation between groups.

One of the things we learn from history is that people who have opposed cooperation have, ultimately, lost to those who have promoted it. Maybe we should heed the teachings of history and start working towards better, more encompassing dependency management.

1 comment:

  1. Maybe the mythical solution you're looking for doesn't exist. And build systems, like all other software, are always kludges. Maybe a first step would be to admit that we need to use other build systems too and then see how to integrate them. Probably having a way to build dependencies with their own thing, have them do the install in a subdirectory and then link from there. Having "foreign" subprojects that are just like cerbero/debian/rpm recipes would be a great start.