Wednesday, August 30, 2017

Dependencies, why are they so hard to get right?

The life cycle of any programming project looks roughly like this.


We start with the plain source code. It gets processed by the build system to produce so called "build tree artifacts". These are executables, libraries and the like but they are slightly special. They are stored inside the build tree and can not usually be used directly. Every build system has its own special magic sprinkled in the outputs. The files inside a build tree can not be run directly (usually) and the file system layout can be anything. The build tree is each build system's internal implementation detail, which is usually not documented and definitely not stable. The only thing that can reliably operate on items in the build directory is the build system itself.

The final stage is the "staging directory" which usually is the system tree, as an example /usr/lib in Unix machines but can be e.g. an app bundle dir on OSX or a standalone dir that is used to generate an MSI installer package on Windows. The important step here is installation. Conceptually it scrubs all traces of the build system's internal info and make the outputs conform to the standards of the current operating system.

The different dependency types

Based on this there are three different ways to obtain dependencies.


The first and simplest one is to take the source code of your dependency, put it inside your own project and pretend it is a native part of your project. Examples of this include the SQLite amalgamation file and some header-only C++ libraries. This way of obtaining dependencies is not generally recommended or interesting so we'll ignore it for the remainder of this post.

Next we'll look into the final case. Dependencies that are installed on the system are relatively easy to use as they are guaranteed to exist before any compilation steps are undertaken and they don't change during build steps. The most important thing to note here is that these dependencies must provide their own usage information in a build system independent format that is preferably fully declarative. The most widely accepted solution here is pkg-config but there can be others, as long as it is fully build system independent.

Which leaves us the middle case: build system internal dependencies. There are many implementations of this ranging from Meson subprojects to CMake internal projects and many new languages such as D and Rust which insist on compiling all dependencies by themselves all the time. This is where things get complicated.

Since the internal state of build trees are different, it is easy to see that you can not mix two different build systems within one single build tree. Or, rather, you could but it would require one of them to be in charge and the other one to do all of the following:
  • conform to the file layout of the master project
  • conform to the file format internals of the master project (which, if you remember, are undocumented and unstable)
  • export full information about what it generates, where and how to the master project in a fully documented format
  • accept dependency information for any dependency built by the master project in a standardized format
And there's a bunch more. If you go to any build system developer and tell them to add these features to their system they will first laugh at you and tell you that it will happen absolutely never.

This is totally understandable. Pairing together the output of two wildly different unstable interfaces in a reliable way is not fun or often even possible. But it gets worse.

Lucy in the Sky with Diamond Dependency Graphs

Suppose that your dependency graph looks like this.

The main program uses two libraries libbaz and libbob. Each one of them builds with a different build system each of which has its own package manager functionality. They both depend on a common library libfoo. As an example libbob might be a language wrapper for libfoo whereas libbaz only uses it internally. It is crucially important that the combined project has one, and only one, copy of libfoo and it must be shared by both dependents. Duplicate dependencies lead, at best, into link time errors and at worst to ten hour debugging sessions of madness in production.

The question then becomes: who should build libfoo? If it is provided as a system dependency this is not an issue but for build tree dependencies things break horribly. Each package manager will most likely insist on compiling all their own dependencies (in their own special format) and plain refuse to work anything else. What if we want the main program to build libfoo instead (as it is the one in charge)? This quagmire is the main reason why certain language advocates' view of "just call into our build tool [which does not support any way of injecting external dependency information] from your build tool and things will work" ultimately unworkable.

What have we learned?

  1. Everything is terrible and broken.
  2. Every project must provide a completely build system agnostic way of declaring how it is to be used when it is provided as a system dependency.
  3. Every build system must support reading said dependency information.
  4. Mixing multiple build systems in a single build directory is madness.

No comments:

Post a Comment