Thursday, March 29, 2018

And now for something completely different

If anyone has been wondering why merge requests have not been reviewed fast enough recently, here is one of the reasons.

Thursday, March 8, 2018

Stop trying to guess display language based on keyboard layout

A very common setup among non-English speaking computer power users is to have display language in English but have a country specific keyboard layout. For example I'm using the Finnish keyboard because I need to be able to easily type our special letters ö and ä. If your computer comes from someone else (such as your employer) there is not even the possibility to have a non-Finnish keyboard layout.

All of this has worked for tens of years flawlessly. However recently I have noticed that many programs on multiple platforms seem to alter their display language based on keyboard layout, which is just plain wrong. Display language should be chosen based on the display language choice and nothing else.

I first noticed this in the output of ls, which one would imagine to have reached stability ages ago.


Here we see that ls has chosen to print months in Finnish. Why? I have no idea. This was weird on its own, but then it spread to other operating systems as well. For no reason at all the existing Gimp install switched its display language to Finnish.


Let me reiterate: no setting was changed and the version of Gimp was exactly the same. One day it just decided to change its language to Finnish.

Then the issue spread to Windows.


VLC on Windows has chosen on my behalf to show its menus in Finnish in a completely English Windows 7 install. The only things it could use for language detection are geolocation and keyboard settings and both of these are terrible ideas. The OS has a language. It is very clearly specified. All other applications obey it, VLC should too.

The real kicker here is that Gimp on Windows displays English text correctly, as does VLC on macOS.

The newest case is the new Gnome-ified Ubuntu, whose lock screen stubbornly displays dates in the wrong language. It also does not conjugate the words correctly and has that weird american month/date date ordering which is wrong for Finnish.


What is causing this?

I don't know. But whoever is behind this: please stop doing that.

Sunday, March 4, 2018

Compiling Cargo crates natively with Meson

Recently we have been having discussions about how Rust and Meson should work together, especially for mixed language projects. One thing which multiple people have told me (over a time span of several years, actually) is that Rust is Special in that everyone uses crates for everything. Thus there is no point in having any sort of Rust support, the only true way is to blindly call Cargo and have it do everything exactly the way it wants to.

This seems like a reasonable recommendation so I did what every reasonable person would do and accepted this as is.

David Addison wearing cardboard x-ray goggles with caption "This is me being completely reasonable".

But then curiosity takes hold of you and you start to wonder. Is that really the case?

Converting Cargo manifests to Meson projects

The basic setup of a Cargo project is fairly straightforward. The file format is mostly declarative and most crates are simple libraries that have a few source files and a bunch of other crates they link against. Meson has the same primitives, so could they be automatically converted.

It turns out that for simple examples you can. Here is a sample repo that downloads the Itoa crate from github, converts the Cargo build definition into a Meson project and builds it as a subproject (with Meson, not Cargo) of the main project that uses it. This prototype turned out to require 71 lines of Python.

What about dependencies other than itoa?

The script currently only works for itoa, because crates.io does not seem to provide a web API for queries and the entire site is created with JavaScript so you can't even do web scraping easily. To get this working properly the only thing you'd need is a function to get from (crate name, version) to the git repo.

What about dependencies of dependencies?

They can be easily grabbed from Cargo.toml file. Combined with the above they could be downloaded and converted in the same fashion.

What doesn't work?

A lot. Unit tests are not built nor run, but they could be added fairly easily. This would require adding compile options so the actual source could be built with the unittest flags. This is some amount of work but Meson already supports a similar feature for D so adding it should not be a huge amount of work. Similarly docs are not generated.

What about build.rs?

Cargo provides a fairly simple project model and everything more complex should be handled by writing a build.rs program that does everything else necessary. This suffers from the same disadvantages as every Turing complete build system ever has, and these scripts are not in general possible to convert automatically.

However based on documentation the common case seems to be to call into external build tools to build dependency libraries in other languages. In a build system that builds both parts at the same time it would be possible to create a better UX for this (but again would obviously not be something you can convert automatically).

Could this actually work in practice with real world projects?

It might. It might not. Ignoring the previous segment no immediate showstopper has presented itself thus far. It might in the future. But you never know until you try.

Thursday, March 1, 2018

On the unoptimalities of language specific build systems

A fairly big recent trend has been the emergence of new programming languages that are meant to be compiled into machine code. The silent (and sometimes not so silent) goal of these languages has been to replace C and C++ as the dominant systems programming language.

All of these languages come with their own build system and dependency management optimised for that particular language. This makes sense as having a good developer experience is important and not having 20-30 years of legacy to carry with you means you can design and develop slick systems relatively easily. But, as always, there is a downside. Perhaps the main issue comes up pretty quickly when trying to combine said code with projects in other languages.

A common approach is for the programming language in question to bundle up all its dependencies as source in a big clump. Then the advocates will say that "it's simple, just call our build system from yours and it gets built". This seems simple but it uses the weasieliest of all weasel words: just. Whenever someone tells you to "just" do something, what they almost always do is trying to trivialise away the hardest part of the entire operation. So it is here as well.

When could it work?

There is one case where this approach works without problems. That is when the dependency builds into a single library with a C interface and it also ships the header and a pkg-config file to use it. This case is indistinguishable from a plain C library so it will work exactly the same. The dependency can be provided as a system package or built as a dependency in a Flatpak manifest or any other similar issue.

Unfortunately this system breaks down the second you want to do anything else. The most common requirement is to build all dependencies from source in a single build step. This is necessary on any platform that does not have a concept of "system" package manager. Many people also want to do this on Linux systems to, for example, build their project's trunk against their dependencies' trunks. This is where things fall down.

The myth of the build dir

Most people probably haven't thought about the build directory of their builds. The most common conception is that the build system just (there's that word again) compiles source code into object files and then targets and that the installation step merely copies the files out to the staging directory. This is not true in the slightest.

Build systems need to do a whole lot of stuff to make things workable directly from the build tree. Every build system does it slightly (and sometimes massively) differently. More importantly the way each build system does it is not stable. They are allowed to, and will, change the way the build tree is laid out at any time. Nothing inside the build tree is stable, not file formats, not directory layouts, nothing.

The problem with building source code with two different build systems in a single build is that eventually they need to work together. Libraries need to be linked. Sources need to be generated. Executables need to be run. That means joining two different completely unstable elements together. The simplest problem in this space is about file layouts. Every build system expects a certain layout for the files it manages. This is usually very different from other build systems. Thus in order to work, there would need to be a way for every build system to be told to adapt to a different system's file layout when run as its subtask.

This is a challenging place to be requesting, because it takes a lot menial work that build systems have traditionally (ever, actually) been unwilling to do. Guessing the subtask's layout and hoping that it does not change might work for any amount of time and then breaks for the slightest of reasons. The problems only get harder from there.

N^2 manual work algorithms are awesome!

Even if this would work (and it does not) the next problem comes from scaling up. You can only "just call" from one build system to another if someone has taken the time to make one understand the other. This is simple for two build systems: you need to write two integrations, one in each direction. But suppose we live in a world where many of the common C libraries in use today have been replaced by implementations in another languages. If you were doing cross platform mobile development then you could have C, C++, Java, D, Rust, Go and Swift in the same project.

Seven languages means seven different build systems and possibly more since C and C++ commonly have more than one dominant build system. This means reading and understanding seven different build system syntaxes and mental models. If you want to combine those freely it means writing 7 x 7 = 49 different build system integrations who must, lest it be forgotten, combine the unstable innards of all of these. And then it gets worse.

Since every language has its own package manager and dependency downloader, you now have up to seven package managers in your project. Actually no, that is a lie.

The tangled web of lies and deceit dependencies

When talking about dependencies between projects in different languages, most people usually mean a dependency graph like this.

That is, there is one dependency of a single language and a second one of a different language that uses it. For this simple case most things are feasible. But let's see what happens when we add just one more project.

Here we have project 1 using language 1. It has a dependency to project 2 in language 2. However project 2 has an internal dependency on project 3 which is also written in language 1. The question now becomes: how should this be built?

Since languages 1 and 2 use their own build tool and language manager, the two edgemost projects don't know that they are being built as part of the same project. Language 2 completely hides its dependency, as it should. The two projects need to work independently. This means that each one of them must determine its dependencies in isolation. If they download their dependencies during configuration time then for each build setup you are accessing the dependency provider twice. Doing more dependency resolutions than you have languages in your project seems suboptimal.

The other approach to this is usually called vendoring. In this each project in a language is only used as a tarball and it embeds all its own dependencies as source code. This seems like a working solution but it's not really. Many modern languages go the NPM route where it is considered good practice to have many small dependencies. It is not uncommon for medium or even small projects to have 50+ dependencies. This leads to problems such as these:


Here project 1 depends on two different projects that are both implemented in language 1. Just like above these two projects don't know of each other because their dependency chain goes via language 2 that hides it. Both of these projects have their internal dependencies embedded so they can be built from scratch without problems.

The problem here is that due to basic popularity and probability theory, the embedded dependencies of these two projects have many of the same dependencies. The dependencies might have the same versions or they might differ. If they both end up in the toplevel executable you get, depending on your toolchain and the phase of the moon, either a working binary or the nastiest of linker bugs to fix.

Even if this yields a working program there is a big downside: compilation time takes up to twice as long because you have to compile the same dependencies twice in different but isolated parts of the build tree. As a rough approximation this means that adding a dependency to a dependency graph like this goes from being a O(1) more work to being O(N) more work because dependency graphs can not be deduplicated if there is a dependency of a different language between the two. It is left as an exercise to the reader to visualize what this would look like on a huge project such as Chromium.

The simple solution

There is a simple solution to this problem and it is very popular among language zealots: reducing the number of languages to one by claiming that in the future everything will be written in their own favourite language. It does not matter what the growth rate of complexity is if it will only be evaluated for the value 1.

The reduction of programming languages to one is expected to happen any minute now, immediately after mr Godot brings us the news on Eastasia's surrender.

The real problem

All of this boils down to the fact that language specific build systems are two opposing things at the same time. They are both a very comfortable gilded cage and an extremely isolating silo. They fertilise and promote cooperation within their own group but make things a lot harder for cooperation between groups.

One of the things we learn from history is that people who have opposed cooperation have, ultimately, lost to those who have promoted it. Maybe we should heed the teachings of history and start working towards better, more encompassing dependency management.