Tuesday, December 31, 2019

How about not stabbing ourselves in the leg with a rusty fork?

Corporations are funny things. Many things no reasonable person would do on their own are done every day in thousands of business conglomerates around the world. With pride even. Let us consider as an arbitrary example a corporation where every day is started by employees stabbing themselves in the leg with a rusty fork. This is (I hope) not actually done for real, but there could be a company out there where this is the daily routine.

If you think that such a thing could possibly never happen, congratulations on having never worked in a big corporation. Stick with that if you can!

When faced with this kind of pointless and harmful routine, one might suggest not doing it any more or replacing it with some other, more useful procedure. This does not succeed, of course, but that is not the point. The reasons you get back are the interesting thing, because they will tell you what kind of manager and coworkers you are dealing with. Here are some possible options; can you think of more?

The survivor fallacist

This is a multi-billion dollar company. If stabbing oneself in the leg was bad, as you seem to claim, we could not have succeeded.

The minimum energy spender

It would take too much work to get this changed. Just bite the bullet and do it every morning. You're better off this way.

The blame shifter

This is mandated by our head office, we can't do anything about this even if we wanted to.

The metric optimizer

Our next year's bonus metric will measure the number of leg stabbings reduced that year. We must get as many of them in this year as we possibly can.

The traditionalist

We have always done this. We must always do it.

The cornered animal

How dare you! Do you have any idea how much work it is to get pre-rusted forks? They are all made of stainless steel nowadays. Your derogatory insinuations are a slap in the face of all people working to keep this system running!

The folklorist

This is a commonly accepted best practice in software companies, thus we should do it also.

The brainwashee

This is actually a great invention. Getting a nice jolt of adrenaline first thing in the morning really wakes you up and gives you focus for the entire day. Try it for a month or two! You'll see.

The control freak messiah

This procedure was put in place by the founder/CEO. You do not challenge his choices if you know what is good for you.

The team spiritist

If you don't stab yourself in the leg, you are setting a very bad example that demoralizes everybody else who does their part diligently.

And finally the (sadly) most common one

Our product is special.

Saturday, December 28, 2019

What can clang-format teach us about the human condition?

Most people who do programming have taken part in at least one code formatting war. Usually these come about when companies want to standardise their code bases and thus want everything formatted according to a single style. Style wars, much like real wars, are not pleasant places to be in. They cause havoc and destruction, make reasonable people into life-long sworn enemies and halt work on anything useful.

In a typical style argument statements like the following are often thrown about:

  • Indentation should be done with tabs, because everyone can set tab width to whatever they want in their editor.
  • The opening brace must be on the same line as its preceding clause. This saves vertical space and thus makes the code more readable.
  • The opening brace must be on its own line. This makes code blocks stand out better and thus makes the code more readable.
  • When laying things like arguments vertically, the separating comma must be at the beginning of the line rather than the end. In this way when you add or remove an entry, the diff is always only one line.
  • In a declaration like int *bob, the asterisk must be next to the variable name, because that is what it binds to.
  • In a declaration like int* bob, the asterisk must be next to the type name, because "pointerness" is logically a feature of the type, not the variable.
  • Class variables must begin with m_ so they stand out better.
  • Class variables must not be separated with a prefix. The syntax highlighter will already draw them in a different color and if you have so many variables in your methods that you can't immediately tell which is which, your methods are too big and must be split.
  • Et cetera, et cetera, ad infinitum, ad nauseam.
It is unknown when code formatting wars first began. Given that FORTRAN was the first real programming language with an actual syntax and was first released in 1957, the answer probably is "way, way before that". A reasonable guess would be the first or second design meeting on the syntax. Fighting over code style kept raging for almost sixty years after that. The arguments were the same, the discussion was the same, no progress was ever made. Then clang-format was introduced and suddenly everything changed.

This was surprising, because automatic code formatters had existed for decades and clang-format was "the same, just slightly better". Yet it made this problem mostly go away. Why?

Enter the human element

With all existing formatters it was fairly easy to find code where it failed. C macros were especially treacherous in this regard. This meant that either one needed to manually add (and update) comments that disabled formatting for some blocks or the formatter was run only every now and then by hand and the result had to be inspected and fixed by hand after the fact. With clang-format this manual work went effectively to zero. You could just run it at any time, even automatically before every commit. In a weird kind of backwards way once we had the correct solution, we could finally understand what the real problem was.
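As an aside, "run it automatically before every commit" really can be just a few lines of glue. Here is a minimal sketch of a Git pre-commit hook in Python; it assumes git and clang-format are on the PATH and the file type filtering is deliberately simplistic, so treat it as an illustration rather than a robust tool.

#!/usr/bin/env python3
# Hypothetical pre-commit hook: reformat staged C/C++ files with clang-format.
import subprocess

def staged_cxx_files():
    out = subprocess.check_output(
        ['git', 'diff', '--cached', '--name-only', '--diff-filter=ACM'],
        text=True)
    return [f for f in out.splitlines()
            if f.endswith(('.c', '.h', '.cpp', '.hpp'))]

def main():
    files = staged_cxx_files()
    if not files:
        return
    # Rewrite the files in place according to the project's .clang-format.
    subprocess.check_call(['clang-format', '-i'] + files)
    # Re-stage them so the commit contains the canonical formatting.
    subprocess.check_call(['git', 'add'] + files)

if __name__ == '__main__':
    main()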

Every programmer writes code in their own way. Maybe they put braces on the same line, maybe not. Maybe they indent with spaces, maybe not. The details don't matter; the real point is that writing code in this style is effortless. It just flows from your brain to the screen. Coding in any other style means spending brain energy on either typing in some non-natural style or fixing the code afterwards. This is manual and tedious work, just the kind that programmers hate with a passion. Thus when the threat of an externally mandated code style appears, the following internal monologue takes place:
If they choose a style different than mine, then I will forever have to write in a style that is unnatural to me. This is tedious. However if I spend some energy now and convince everyone else to use my style, then I can keep on doing what I have been doing thus far. All I have to do is to factually explain why my chosen style is the best, and since other programmers are rational they will understand my point, agree and adopt my chosen style.
Unfortunately everyone else participating in the debate has the exact same idea and things end in a stalemate almost immediately. The sunk-cost fallacy ensures that once a person has publicly committed to a style choice, they will never budge from it.

Note the massive dichotomy at play here. The real reason people have for any style choice is "this is what I have gotten used to" but when they debate their chosen style they always, always use reasoning that aims to be objective and scientific. At this point you might want to pause and reread the sample arguments listed above. They follow this reasoning exactly and most claim to improve some real world objective metric such as readability. They are also all lies. These are all post-rationalising arguments, invented after the fact to make the opinion you already have sound as good as possible. They are not the real reason. They have never been the real reason. It is unlikely they will ever be the real reason. But the debate is carried on as if they were the real reason. This is why it will never end.

The lengths people are willing to go to in their post-rationalising arguments are nothing short of astounding. In this video on indenting with tabs vs spaces, many tab advocates say that indenting with tabs is better because "you only need to press tab once rather than press space multiple times". Every single programmer's text editor since 1985 (possibly 1975 and potentially even 1965) has had the feature where pressing the tab key does the logically equivalent indent with spaces. Using this as an argument only shows that you have not done even the most minimal of thinking on the issue, but instead have already made up your mind and don't want to even consider changing it.

This is why code style discussions never go anywhere. They are not about bringing people together to find the best possible choice. They are about trying to make other people submit to your will by repeatedly bashing them on the head with your style guide. This does not work because the average programmer's head is both thicker and more durable than the average style guide.

Tuesday, December 17, 2019

Notes on tax issues on selling digital goods internationally

Note: This blog post should not be considered tax, legal or any other sort of advice. There are no guarantees of any kind, even that any of the information below is correct. Consult a qualified professional before embarking on any international business ventures.

Some tax requirements for selling pretty much anything internationally (that I found out by googling and looking up random government web sites)

Nowadays it is easy to start a web store and sell products such as digital downloads to any country in the world. You might think that simply paying appropriate taxes in your own country would be enough. It's not. There are cases where you need to pay taxes or fees to other countries as well, specifically the ones you sell your products to. Surprisingly this can be the case even for very small amounts of money.

There seem to be three main cases: USA, the EU and individual countries. Let's go through them in increasing order of difficulty.

Individual countries

Most countries have a requirement that if you sell digital goods to them you need to register in said country, collect the appropriate amount of tax on your sales and then report and pay it. Most countries have a lower limit under which you don't need to do anything. This is usually on the order of 10 000 to 100 000 euros per year, which small scale operations won't ever reach. Unfortunately in some countries this limit is zero. That is, if your sales are even one euro, you need to register and do the full bureaucratic dance. These countries include Albania, Russia, South Korea and India among others. Lists of limits per country can be found online. Be careful when reading them, though, as web pages can get out of date quickly.

For small businesses the only realistic choice is to geoblock countries where the tax limit is zero. Dealing with the hassles is just not worth it. This is fairly easy, as most payment providers have good geoblocking tools.

VAT in the European Union

In the EU you can do the same registration to each country as for individual countries discussed above. However there is also a new, simplified system for digital services called VAT MOSS. The idea there is that you don't need to register to each country, instead you can report VAT purchases to your own tax authorities and they take care of the rest. This is highly convenient, because you can then sell to every EU member state but only have to deal with the bureaucracy of one of them.

There is a similar thing for non-EU companies, but I have not looked at how it works in detail for obvious reasons. Just note that the registration limit for the EU is also zero, meaning that you must register if you sell anything at all to the EU. Sadly this means that for some people geoblocking all of the EU is an entirely reasonable thing to do.

Sales tax in the USA

The good news is that the USA does not have a federal sales tax. The bad news is that each state has its own laws on sales taxes. Whether or not you need to pay sales taxes in a given state depends on whether you have a "nexus" in the state. This used to mean something like an office. However then buying stuff over the Internet happened and now having a nexus simply means selling more than a given threshold's worth of goods or services to people in the state. Lists of these limits per state can be found online as well.

This is where things get unpleasant for small players. The limit for Kansas is zero, meaning any sales to Kansas means you have to register, gather sales tax and pay it to the state authorities. Other states have reasonable limits such as 100 000 dollars, but the obligation is also triggered if you have more than 200 sales events in total regardless of their value. This is a lot easier to trigger by accident.

Unfortunately payment processors don't seem to provide state-based geoblocking. Thus if you enable sales to the USA and that leads to even one sale in Kansas, you just got hit by a bunch of legal requirements. Dealing with all of these is not really feasible for small operations. On the other hand blocking all of USA means losing a fairly large chunk of your revenue.

To keep things from being too simple, there are web pages that claim that having an "economic nexus" is actually defined differently for digital products. Based on one such page, Kansas does not have a sales tax at all for purely digital products, so you could sell an arbitrary amount of products there without needing to pay any sales tax. Which one of these is correct? I don't actually know. I have spent the entire day reading up on international tax laws and now my head hurts and I just want to close the computer and have a drink.

What does this mean for tipping services like Patreon, crowdfunding et al?

Again: I don't really know. However a case can reasonably be made (at least by a tax collector who wants to get your money) that paying through one of those platforms constitutes a "sale of services" or something similar and is thus subject to a sales tax. For example Patreon's documentation page states that they take care of paying VAT for EU customers but that customers in the USA need to take care of their tax responsibilities themselves. Presumably this also applies to all other countries.

Thus it may be the case that everyone who is running a tipping service and has taken any money from countries or states with a zero limit on sales taxes has unexpectedly been burdened with legal responsibilities to tens of different tax offices around the world.

In any case all of this means that things like gathering funds for open source development are a lot more complicated than they appear at first glance.

Sunday, December 8, 2019

There are (at least) three distinct dependency types

Using dependencies is one of the main problems in software development today. It has become even more complicated with the recent emergence of new programming languages and the need to combine them with existing programs. Most discussion about it has been informal and high level, so let's see if we can make it more disciplined and how different dependency approaches work.

What do we mean when we say "work"?

In this post we are going to use the word "work" in a very specific way. A dependency approach is said to work if and only if we can take two separate code projects where one uses the other and use them together without needing to write special case code. That is, we should be able to snap the two projects together like Lego. If this can be done to arbitrary projects with a success rate of more than 95%, then the approach can be said to work.

It should be especially noted that "I tried this with two trivial helloworld projects and it worked for me" does not fulfill the requirements of working. Sadly this line of reasoning is used all too often in online dependency discussions, but it is not a response that holds any weight. Any approach that has not been tested with at least tens (preferably hundreds) of packages does not have enough real world usage experience to be taken seriously.

The phases of a project

Every project has three distinctive phases on its way from source code to a final executable.
  1. Original source in the source directory
  2. Compiled build artifacts in the build directory
  3. Installed build artifacts on the system
In the typical build workflow step 1 happens after you have done a Git checkout or equivalent. Step 2 happens after you have successfully built the code with ninja all. Step 3 happens after a ninja install.

The dependency classes

Each of these phases has a corresponding way to use dependencies. The first and last ones are simple so let's examine those first.

The first one is the simplest. In a source-only world you just copy the dependency's source inside your own project, rewrite the build definition files and use it as if it was an integral part of your own code base. The monorepos used by Google, Facebook et al are done in this fashion. The main downside is that importing and updating dependencies is a lot of work.

The third approach is the traditional Linux distro approach. Each project is built in isolation and installed either on the system or in a custom prefix. The dependencies provide a pkg-config file which defines both the dependency and how it should be used. This approach is easy to use and scales really well, the main downside being that you usually end up with multiple versions of some dependency libraries on the same file system, which means that they will eventually get mixed up and crash in spectacular but confusing ways.
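To give an idea of how little machinery the installed case needs, here is a small Python sketch that asks pkg-config for a dependency's compile and link flags. The package name zlib is only an example; any installed .pc file works the same way.

# Sketch of consuming an installed dependency via its pkg-config file.
import subprocess

def pkg_config_flags(package):
    cflags = subprocess.check_output(
        ['pkg-config', '--cflags', package], text=True).split()
    libs = subprocess.check_output(
        ['pkg-config', '--libs', package], text=True).split()
    return cflags, libs

if __name__ == '__main__':
    cflags, libs = pkg_config_flags('zlib')
    print('compile flags:', cflags)
    print('link flags:', libs)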

The second approach

A common thing people want to do is to mix two different languages and build systems in the same build directory. That is, to build multiple different programming languages with their own build systems intermixed so that one uses the built artifacts of the other directly from the build dir.

This turns out to be much, much, much more difficult than the other two. But why is that?

Approach #3 works because each project is clearly separated and the installed formats are simple, unambiguous and well established. This is not the case for build directories. Most people don't know this, but binaries in build directories are not the same as the installed ones. Every build system conjures its own special magic and does things slightly differently. The unwritten contract has been that the build directory is each build system's internal implementation detail. They can do with it whatever they want, just as long as after install they provide the output in the standard form.

Mixing the contents of two build systems' build directories is not something that "just happens". Making one "just call" the other does not work simply because of the N^2 problem. For example, currently you'd probably want to support C and C++ with Autotools, CMake and Meson, D with Dub, Rust with Cargo, Swift with SwiftPM, Java with Maven (?) and C# with MSBuild. That is already up to 8*7 = 56 integrations to write and maintain.

The traditional way out is to define a data interchange protocol to declare build-dir dependencies. This has to be at least as rich in semantics as pkg-config, because that is what it is: a pkg-config for build dirs. On top of that you need to formalise all the other things about setup and layout that pkg-config gets for free by convention, and then you need to make every build system adhere to all of it. This seems like a tall order and no-one's really working on it as far as I know.

What can we do?

If build directories can't be mixed and system installation does not work due to the potential of library mixups, is there anything that we can do? It turns out not only that we can, but that there is already a potential solution (or at least an approach for one): Flatpak.

The basic idea behind Flatpak is that it defines a standalone file system for each application that looks like a traditional Linux system's root file system. Dependencies are built and installed there as if one was installing them to the system prefix. What makes this special is that the filesystem separation is enforced by the kernel. Within each application's file system only one version of any library is visible. It is impossible to accidentally use the wrong version. This is what traditional techniques such as rpath and LD_LIBRARY_PATH have always tried to achieve, but have never been able to do reliably. With kernel functionality this becomes possible, even easy.

What sets Flatpak apart from existing app container technologies such as iOS and Android apps, UWP and so on is its practicality. Other techs are all about defining new, incompatible worlds that are extremely limited and invasive (for example spawning new processes is often prohibited). Flatpak is not. It is about making the app environment look as much as possible like the enclosing system. In fact it goes to great lengths to make this work transparently and it succeeds admirably. There is not a single developer on earth who would tolerate doing their own development inside, say, an iOS app. It is just too limited. By contrast developing inside Flatpak is not only possible and convenient, but something people already do today.

The possible solution, then, is to shift the dependency consumption from option 2 to option 3 as much as possible. It has only one real new requirement: each programming language must have a build system agnostic way of providing prebuilt libraries. Preferably this should be pkg-config but any similar neutral format will do. (For those exclaiming "we can't do that, we don't have a stable ABI", do not worry. Within the Flatpak world there is only one toolchain, system changes cause a full rebuild.)

With this the problem is now solved. All one needs to do is to write a Flatpak builder manifest that builds and installs the dependencies in the correct order. In this way we can mix and match languages and build systems in arbitrary combinations and things will just work. We know it will, because this is basically how Debian, Fedora and all other distros are already put together.
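As a rough illustration of what that could look like, here is a Python snippet that writes out a manifest with two modules built in order. The app id, module names, URLs and runtime version are made up, and the exact schema is defined by flatpak-builder, so take the field names as an approximation.

# Sketch: generate a flatpak-builder manifest with dependencies built in order.
# All names and URLs are invented for illustration.
import json

manifest = {
    'app-id': 'com.example.MyApp',
    'runtime': 'org.freedesktop.Platform',
    'runtime-version': '19.08',
    'sdk': 'org.freedesktop.Sdk',
    'command': 'myapp',
    # Modules are built and installed into the app prefix in list order,
    # which is how the dependency ordering is expressed.
    'modules': [
        {'name': 'libhelper',
         'buildsystem': 'cmake',
         'sources': [{'type': 'git', 'url': 'https://example.com/libhelper.git'}]},
        {'name': 'myapp',
         'buildsystem': 'meson',
         'sources': [{'type': 'git', 'url': 'https://example.com/myapp.git'}]},
    ],
}

with open('com.example.MyApp.json', 'w') as f:
    json.dump(manifest, f, indent=4)

Running flatpak-builder with a build directory and this manifest would then build libhelper and myapp in that order inside the sandbox.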

Monday, November 25, 2019

Process invocation will forever be broken

Invoking new processes is, at its core, a straightforward operation. Pretty much everything you need to know to understand it can be seen in the main declaration of the helloworld program:

#include<stdio.h>

int main(int argc, char **argv) {
    printf("Hello, world.\n");
    return 0;
}

The only (direct) information passed to the program is an array of strings containing its (command line) arguments. Thus it seems like an obvious conclusion that there is a corresponding function that takes an executable to run and an array of strings with the arguments. This turns out to be the case, and it is what the exec family of functions do. An example would be execve.

This function only exists on posixy operating systems; it is not available on Windows. The native way to start processes on Windows is the CreateProcess function. It does not take an array of strings; instead it takes a string:

BOOL CreateProcessA(
  LPCSTR lpApplicationName,
  LPSTR  lpCommandLine,
  ...

The operating system then internally splits the string into individual components using an algorithm that is not at all simple or understandable and whose details most people don't even know.

Side note: why does Windows behave this way?

I don't know for sure. But we can formulate a reasonable theory by looking in the past. Before Windows existed there was DOS, and it also had a way of invoking processes. This was done by using interrupts, in this case function 4bh in interrupt 21h. Browsing through online documentation we can find a relevant snippet:

Action: Loads a program for execution under the control of an existing program. By means of altering the INT 22h to 24h vectors, the calling prograrn [sic] can ensure that, on termination of the called program, control returns to itself.
On entry: AH = 4Bh
AL = 0: Load and execute a program
AL = 3: Load an overlay
DS.DX = segment:offset of the ASCIIZ pathname
ES:BX = Segment:offset of the parameter block
Parameter block bytes:
0-1: Segment pointer to envimmnemnt [sic] block
2-3: Offset of command tail
4-5: Segment of command tail

Here we see that the command is split in the same way as in the corresponding Win32 API call, into a command to execute and a single string that contains the arguments (the command tail, though some sources say that this should be the full command line). This interrupt handler could take an array of strings instead, but does not. Most likely this is because it was the easiest thing to implement in real mode x86 assembly.

When Windows 1.0 appeared, its coders probably either used the DOS calls directly or copied the same code inside Windows' code base for simplicity and backwards compatibility. When the Win32 API was created they probably did the exact same thing. After all, you need the single string version for backwards compatibility anyway, so just copying the old behaviour is the fast and simple thing to do.

Why is this behaviour bad?

There are two main use cases for invoking processes: human invocations and programmatic invocations. The former happens when human beings type shell commands and pipelines interactively. The latter happens when programs invoke other programs. For the former case a string is the natural representation for the command, but this is not the case for the latter. The native representation there is an array of strings, especially for cross platform code because string splitting rules are different on different platforms. Implementing shell-based process invocation on top of an interface that takes an array of strings is straightforward, but the opposite is not.
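To make the asymmetry concrete, here is a small Python sketch (the commands are arbitrary examples): going from a shell-style string to an array is one well-defined library call, whereas flattening an array into a single string has no platform-independent answer.

# The array form needs no parsing or quoting at all.
import shlex
import subprocess

subprocess.run(['grep', '-r', 'hello world', 'some dir/with spaces'])

# Shell-style invocation implemented on top of the array interface:
# split once, using well-defined POSIX rules, then pass the array along.
cmd_string = "grep -r 'hello world' 'some dir/with spaces'"
subprocess.run(shlex.split(cmd_string))

# The opposite direction is the problematic one: turning an array into a
# single string requires quoting rules, and those differ between /bin/sh
# and cmd.exe, so there is no single correct way to do it.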

Often command lines are not invoked directly but are instead passed from one program to another, stored to files, passed over networks and so on. It is not uncommon to pass a full command line as a command line argument to a different "wrapper" command and so on. An array of strings is trivial to pass through arbitrarily deep and nested scenarios without data loss. Plain strings not so much. Many, many, many programs do command string splitting completely wrong. They might split it on spaces because it worksforme on this machine and implementing a full string splitter is a lot of work (thousands of lines of very tricky C at the very least). Some programs don't quote their outputs properly. Some don't unquote their inputs properly. Some do quoting unreliably. Sometimes you need to know in advance how many layers of unquoting your string will go through so you can prequote it sufficiently beforehand (because you can't fix any of the intermediate blobs). Basically every time you pass commands as strings between systems, you get a parsing/quoting problem and a possibility for shell code injection. At the very least the string should carry with it information on whether it is a unix shell command line or a cmd.exe command line. But it doesn't, and can't.

Because of this almost all applications that deal with command invocation kick the can down the road and use strings rather than arrays, even though the latter is the "correct" solution. For example this is what the Ninja build system does. If you go through the rationale for this it is actually understandable and makes sense. The sad downside is that everyone using Ninja (or any such tool) has to do command quoting and parsing manually and then ninja-quote their quoted command lines.

This is the crux of the problem. Because process invocation is broken on Windows, every single program that deals with cross platform command invocation has to deal with commands as strings rather than an array of strings. This leads to every program using commands as strings, because that is the easy and compatible thing to do (not to mention it gives you the opportunity to close bugs with "your quoting is wrong, wontfix"). This leads to a weird kind of quantum entanglement where having things broken on one platform breaks things on a completely unrelated platform.

Can this be fixed?

Conceptually the fix is simple: add a new function, say, CreateProcessCmdArray to Win32 API. It is identical to plain CreateProcess except that it takes an array of strings rather than a shell command string. The latter can be implemented by running Windows' internal string splitter algorithm and calling the former. Seems doable, and with perfect backwards compatibility even? Sadly, there is a hitch.

It has been brought to my attention via unofficial channels [1] that this will never happen. The people at Microsoft who manage the Win32 API have decreed this part of the API frozen. No new functionality will ever be added to it. The future of Windows is WinRT or UWP or whatever it is called this week.

UWP is conceptually similar to Apple's iOS application bundles. There is only one process which is fully isolated from the rest of the system. Any functionality that needs process isolation (and not just threads) must be put in its own "service" that the app can then communicate with using RPC. This turned out to be a stupid limitation for a desktop OS with hundreds of thousands of preexisting apps, because it would require every Win32 app using multiple processes to be rewritten to fit this new model. Eventually Microsoft caved under app vendor pressure and added the functionality to invoke processes into UWP (with limitations though). At this point they had a chance to do a proper from-scratch redesign for process invocation with the full wealth of knowledge we have obtained since the original design was written around 1982 or so. So can you guess whether they:
  1. Created a proper process invocation function that takes an array of strings?
  2. Exposed CreateProcess unaltered to UWP apps?
You guessed correctly.

Bonus chapter: msvcrt's execve functions

Some of you might have thought waitaminute, the Visual Studio C runtime does ship with functions that take string arrays so this entire blog post is pointless whining. This is true, it does provide said functions. Here is a pseudo-Python implementation for one of them. It is left as an exercise to the reader to determine why it does not help with this particular problem:

def spawn(cmd_array):
    cmd_string = ' '.join(cmd_array)
    CreateProcess(..., cmd_string, ...)

[1] That is to say, everything from here on may be completely wrong. Caveat lector. Do not quote me on this.

Tuesday, November 19, 2019

Some intricacies of ABI stability

There is a big discussion ongoing in the C++ world about ABI stability. People want to make a release of the standard that does a big ABI break, so a lot of old cruft can be removed and made better. This is a big and arduous task, which has a lot of "fun" and interesting edge, corner and hypercorner cases. It might be interesting to look at some of the lesser known ones (this post is not exhaustive, not by a long shot). All information here is specific to Linux, but other OSs should be roughly similar.

The first surprising thing to note is that nobody really cares about ABI stability. Even the people who defend stable ABIs in the committee do not care about ABI stability as such. What they do care about is that existing programs keep on working. A stable ABI is just a tool in making that happen. For many problems it is seemingly the only tool. Nevertheless, ABI stability is not the end goal. If the same outcome can be achieved via some other mechanism, then it can be used instead. Thinking about this for a while leads us to the following idea:
Since "C++ness" is just linking against libstdc++.so, could we not create a new one, say libstdc++2.so, that has a completely different ABI (and even API), build new apps against that and keep the old one around for running old apps?
The answer to this question turns out to be yes. Even better, you can already do this today on any recent Debian based distribution (and probably most other distros too, but I have not tested). By default the Clang C++ compiler shipped by the distros uses the GNU C++ standard library. However you can install the libc++ stdlib via system packages and use it with the -stdlib=libc++ command line argument. If you go even deeper, you find that the GNU standard library's name is libstdc++.so.6, meaning that it has already had five ABI breaking updates.

So … problem solved then? No, not really.

Problem #1: the ABI boundary

Suppose you have a shared library built against the old ABI that exports a function that looks like this:

void do_something(const std::unordered_map<int, int> &m);

If you build code with the new ABI and call this function, the bit representation of the unordered map causes problems. The caller has a pointer to a bunch of bits in the new representation whereas the callee expects bits in the old representation. This code compiles and links but will invoke UB at runtime when called and, at best, crash your app.

Problem #2: the hidden symbols

This one is a bit complicated and needs some background information. Suppose we have a shared library foo that is implemented in C++ but exposes a plain C API. Internally it makes calls to the C++ standard library. We also have a main program that uses said library. So far, so good, everything works.

Let's add a second shared library called bar that is also implemented in C++ and exposes a C API. We can link the main app against both these libraries and call them and everything works.

Now comes the twist. Let's compile the bar library against a new C++ ABI. The result looks like this:


A project mimicking this setup can be obtained from this Github repo. In it the abi1 and abi2 libraries both export a function with the same name that returns an int that is either 1 or 2. Libraries foo and bar check the return value and print a message saying whether they got the value they were expecting. It should be reiterated that the use of the abi libraries is fully internal. Nothing about them leaks to the exposed interface. But when we compile and run the program that calls both libraries, we get the following output snippet:

Foo invoked the correct ABI function.
Bar invoked the wrong ABI function.

What has happened is that both libraries invoked the function from abi1, which means that in the real world bar would have crashed in the same way as in problem #1. Alternatively both libraries could have called abi2, which would have broken foo. Determining when this happens is left as an exercise to the reader.

The reason this happens is that the functions in abi1 and abi2 have the same mangled name and the fact that symbol lookup is global to a process. Once any given name is determined, all usages anywhere in the same process will point to the same entity. This will happen even for non-weak symbols.

Can this be solved?

As far as I know, there is no known real-world solution to this problem that would scale to a full operating system (i.e. all of Debian, FreeBSD or the like). If there are any university professors reading this needing problems for your grad students, this could be one of them. The problem itself is fairly simple to formulate: make it possible to run two different, ABI incompatible C++ standard libraries within one process. The solution will probably require changes in the compiler, linker and runtime loader. For example, you might extend symbol resolution rules so that they are not global, but instead symbols from, say, library bar would first be looked up in its direct descendants (in this case only abi2) and only after that in other parts of the tree.

To get you started, here is one potential solution I came up with while writing this post. I have no idea if it actually works, but I could not come up with an obvious thing that would break. I sadly don't have the time or know-how to implement this, but hopefully someone else has.

Let's start by defining that the new ABI is tied to C++23 for simplicity. That is, code compiled with -std=c++23 uses the new ABI and links against libstdc++.so.7, whereas older standard versions use the old ABI. Then we take the Itanium ABI specification and change it so that all mangled names start with, say, _^ rather than _Z as currently. Now we are done. The different ABIs mangle to different names and thus can coexist inside the same process without problems. One would probably need to do some magic inside the standard library implementations so they don't trample on each other.

The only problem this does not solve is calling a shared library with a different ABI. This can be worked around by writing small wrapper functions that expose an internal "C-like" interface and can call external functions directly. These can be linked inside the same library without problems because the two standard libraries can be linked in the same shared library just fine. There is a bit of a performance and maintenance penalty during the transition, but it will go away once all code is rebuilt with the new ABI.

Even with this, the transition is not a light weight operation. But if you plan properly ahead and do the switch, say, once every two standard releases (six years), it should be doable.

Monday, November 18, 2019

What is -pipe and should you use it?

Every now and then you see people using the -pipe compiler argument. This is particularly common on vintage handwritten makefiles. Meson even uses the argument by default, but what does it actually do? GCC manpages say the following:

-pipe
    Use pipes rather than temporary files for communication
    between the various stages of compilation.  This fails
    to work on some systems where the assembler is unable to
    read from a pipe; but the GNU assembler has no trouble.

So presumably this is a compile speed optimization. What sort of an improvement does it actually provide? I tested this by compiling LLVM on my desktop machine both with and without the -pipe command line argument. Without it I got the following time:

ninja  14770,75s user 799,50s system 575% cpu 45:04,98 total

Adding the argument produced the following timing:

ninja  14874,41s user 791,95s system 584% cpu 44:41,22 total

This is an improvement of less than one percent. Given that I was using my machine for other things at the same time, the difference is most likely statistically insignificant.

This argument may have been needed in the ye olden times of supporting tens of broken commercial unixes. Nowadays the only platform where this might make a difference is Windows, given that its file system is a lot slower than Linux's. But is its pipe implementation any faster? I don't know, and I'll let other people measure that.

The "hindsight is perfect" design lesson to be learned

Looking at this now, it is fairly easy to see that this command line option should not exist. Punting the responsibility of knowing whether files or pipes are faster (or even work) on any given platform to the user is poor usability. Most people don't know that and performance characteristics of operating systems change over time. Instead this should be handled inside the compiler with logic roughly like the following:

if(assembler_supports_pipes(...) &&
   pipes_are_faster_on_this_platform(...)) {
    communicate_with_pipes();
} else {
    communicate_with_files();
}

Monday, October 14, 2019

Apple of 2019 is the Linux of 2000

Last week the laptop I use for macOS development said that there is an XCode update available. I tried to install it but it said that there is not enough free space available to run the installer. So I deleted a bunch of files and tried again. Still the same complaint. Then I deleted some unused VM images. Those would free a few dozen gigabytes, so it should make things work. I even emptied the trash can to make sure nothing lingered around. But even this did not help, I still got the same complaint.

At this point it was time to get serious and launch the terminal. And, true enough, according to df the disk had only 8 gigabytes of free space even though I had just deleted over 40 gigabytes of files from it (using rm, not the GUI, so things really should have been gone). A lot of googling and poking later I discovered that all the deleted files had gone to "reserved space" on the file system. There was no way to access those files or delete them. According to documentation the operating system would delete those files "on demand as more space is needed". This was not very comforting because the system most definitely was not doing that and you'd think that Apple's own software would get this right.

After a ton more googling I managed to find a chat buried somewhere deep in Reddit which listed the magical incantation that purges reserved space. It consisted of running tmutil from the command line and giving it a bunch of command line arguments that did not seem to make sense or have any correlation to the thing that I wanted to do. But it did work and eventually I got XCode updated.

After my blood pressure dropped to healthier levels I got the strangest feeling of déjà vu. This felt exactly like using Linux in the early 2000s. Things break at random for reasons you can't understand and the only way to fix it is to find terminal commands from discussion forums, type them in and hope for the best. Then it hit me.

This was not an isolated incident. The parallels are everywhere. Observe:

External monitors

Linux 2000: plugging an external monitor will most likely not work. Fanboys are very vocal that this is the fault of monitor manufacturers for not providing modeline info.

Apple 2019: plugging an external projector will most likely not work. Fanboys are very vocal that this is the fault of projector manufacturers for not ensuring that their HW works with every Apple model.

Software installation

Linux 2000: There is only One True Way of installing software: using distro packages. If you do anything else you are bad and you should feel bad.

Apple 2019: There is only One True Way of installing software: using the Apple store. If you do anything else you are bad and you should feel bad.

Hardware compatibility

Linux 2000: only a limited number of hardware works out of the box, even for popular devices like 3D graphics cards. Things either don't work at all, have reduced functionality, or kinda work but fail spuriously every now and then for no discernible reason.

Apple 2019: only a limited number of hardware works out of the box, even for popular devices like Android phones. Things either don't work at all, have reduced functionality, or kinda work but fail spuriously every now and then for no discernible reason.

Technical support

Linux 2000: if your problem is not google-trivial, there's nothing you can do. Asking friends for assistance does not help, because they will just type your problem description into Google and read the first hit.

Apple 2019: if your problem is not google-trivial, there's nothing you can do. Calling Apple's tech support line does not help, because they will just type your problem description into Google and read the first hit.

Laptop features

Linux 2000: it is very difficult to find a laptop with more than two USB ports.

Apple 2019: it is very difficult to find a laptop with more than two USB ports.

Advocate behaviour

Linux 2000: fanboys will let you know in no uncertain terms that their system is the best and will take over all desktop computer usage. Said fanboys are condescending elitist computer nerds.

Apple 2019: fanboys will let you know in no uncertain terms that their system is the best and will take over all desktop computer usage. Said fanboys are condescending elitist hipster latte web site designers.

Friday, September 27, 2019

A look into building C++ modules with a scanner

At CppCon there was a presentation on building C++ modules using a standalone dependency scanner executable provided by the compiler toolchain. The integration (as I understand it) would go something like this:

  1. The build system creates a Ninja file as usual
  2. It sets up a dependency so that every compilation job depends on a prescan step.
  3. The scanner goes through all source files (using compilation_commands.json), determines module interdependencies and writes this out to a file in a format that Ninja understands.
  4. After the scan step, Ninja will load the file and use it to execute commands in the correct order.
This seems like an interesting approach for experimentation, but unfortunately it depends on functionality that is not yet in Ninja. It is unclear if and when these would be added to Ninja, as its current maintainers are extremely conservative in adding any new code. It is quite difficult to run experiments on approaches that have neither usable code nor all the required features in various parts of the toolchain.

Can we do it regardless? Yes we can!

Enter self-modifying build system code

The basic approach is simple
  1. Write a Ninja file as usual, but make all the top level commands (or, for this test, only the all target) run a secret internal command.
  2. The command will do the scanning, and change the Ninja file on the fly, rewriting it to have the module dependency information.
  3. Invoke Ninja on the new file giving it a secret target name that runs the actual build.
  4. Build proceeds as usual.
The code that does this can be found in the vsmodtest branch in the main Meson repository. To run it you need to use Visual Studio's module implementation, the test project is in the modtest directory. It actually does work, but there are a ton of disclaimers:
  • incremental builds probably won't work
  • the resulting binary never finishes (it is running a job with exponential complexity)
  • it does not work on any other project than the demo one (but it should be fixable)
  • the dependencies are on object files rather than module BMI files due to a Ninja limitation
  • module dep info is not cached, all files are fully rescanned every time
  • the scanner is not reliable, it does the equivalent of dumb regex parsing
  • any and all things may break at any time and if they do you get to keep both pieces
All in all nothing even close to production ready but a fairly nice experiment for ~100 lines of Python. This is of course a hack and should not go anywhere near production, but assuming Ninja gets all the required extra functionality it probably could be made to work reliably.
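To give a flavour of what the secret internal command does, here is a greatly simplified sketch. The module regexes, the compile command and the target names are invented for illustration and do not match the actual code in the vsmodtest branch.

# Simplified sketch of the scan-and-rewrite step. Regexes, compiler flags and
# target names are approximations, not the real Meson implementation.
import re
import subprocess

IMPORT_RE = re.compile(r'^\s*import\s+(\w+)\s*;')
EXPORT_RE = re.compile(r'^\s*export\s+module\s+(\w+)\s*;')

def scan(sources):
    provides, imports = {}, {}
    for src in sources:
        imports[src] = []
        with open(src) as f:
            for line in f:
                m = EXPORT_RE.match(line)
                if m:
                    provides[m.group(1)] = src
                    continue
                m = IMPORT_RE.match(line)
                if m:
                    imports[src].append(m.group(1))
    return provides, imports

def write_rewritten_ninja(sources, outfile):
    provides, imports = scan(sources)
    objs = []
    with open(outfile, 'w') as f:
        f.write('rule cxx\n  command = cl /nologo /c $in /Fo$out\n')
        for src in sources:
            obj = src + '.obj'
            objs.append(obj)
            # Extra order-only dependencies on the objects of imported modules,
            # so the providing translation units get compiled first.
            deps = ' '.join(provides[m] + '.obj'
                            for m in imports[src] if m in provides)
            line = 'build %s: cxx %s' % (obj, src)
            if deps:
                line += ' || ' + deps
            f.write(line + '\n')
        f.write('build secret-real-target: phony ' + ' '.join(objs) + '\n')

sources = ['a.cpp', 'b.cpp']  # in reality extracted from the original Ninja file
write_rewritten_ninja(sources, 'build-mods.ninja')
subprocess.check_call(['ninja', '-f', 'build-mods.ninja', 'secret-real-target'])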

Is this the way C++ module building will work?

Probably not, because there is one major use case that this approach (or indeed any content scanning approach) does not support: code generation. Scanning assumes that all source code is available at the same time but if you generate source code on the fly, this is not the case. There would need to be some mechanism of making Ninja invoke the scanner anew every time source files appear and such a mechanism does not exist as far as I know. Even if it does there is a lot of state to transfer between Ninja and the scanner to ensure both reliable and minimal dependency scans.

There are alternative approaches one can take to avoid the need for scanning completely, but they have their own downsides.

Sunday, September 1, 2019

New Meson mugs, get them while they're hot!

Since there is no Meson swag available, I decided to make some. Here is the first one.


Now sadly there is no web store and shipping ceramics is expensive, so the only way to get them is to be in the same physical space as me. Unless you live in Finland (or have friends that do and can be persuaded to get one for you) the only real way is to meet up with me in some conference. The next one I'll be going to is CppCon in two weeks. I'm going to be on vacation in New York for one week before that so that is also an option. Supply is limited by how many I can pack in my luggage and survive airport baggage handlers.

Bonus meta logo picture


Saturday, August 10, 2019

C++ modules with a batch compiler feasibility study

In our previous blog post an approach for building C++ modules using a batch compiler model was proposed. The post raised a lot of discussion on Reddit. Most of it was of the quality one would expect, but some interesting questions were raised.

How much work for compiler developers?

One fair issue that was raised is that the batch approach means adding new functionality to the compiler. This is true, but not particularly interesting as a standalone statement. The real question is how much more work it is, especially compared to the work needed to support other module implementation methods.

In the batch mode there are two main pieces of work. The first is starting compiler tasks, detecting when they freeze due to missing modules and resuming them once the modules they require have been built. The second one is calculating the minimal set of files to recompile when doing incremental builds.

The former is the trickier one because no-one has implemented it yet. The latter is a well known subject, build systems like Make and Ninja have done it for 40+ years. To test the former I wrote a simple feasibility study in Python. What it does is generate 100 source files containing modules that call each other and then compiles them all in the manner a batch compiler would. There is no need to scan the contents of files, the system will automatically detect the correct build order or error out if it can not be done.

Note that this is just a feasibility study experiment. There are a lot of caveats. Please go through the readme before commenting. The issue you want to raise may already be addressed there. Especially note that it only works with Visual Studio.

The code that is responsible for running the compile is roughly 60 lines of Python. It is conceptually very simple. A real implementation would need to deal with threading, locking and possibly IPC, which would take a fair bit of work.

The script does not support incremental builds. I'd estimate that getting a minimal version going would take something like 100-200 lines of Python.

I don't have any estimates on how much work this would mean on a real compiler code base.

The difference to scanning

A point raised in the Reddit discussion is that there is an alternative approach that uses richer communication between the build system and the compiler. If you go deeper you may find that the approaches are not that different after all. It's more of a question of how the pie is sliced. The scanning approach looks roughly like this:


In this case the build system and compiler need to talk to each other, somehow, via some sort of a protocol. This protocol must be implemented by all compilers and all build tools and they must all interoperate. It must also remain binary stable and all that. There is a proposal for this protocol. The specification is already fairly long and complicated especially considering that it supports versioning so future versions may be different in arbitrary ways. This has an ongoing maintenance cost of unknown size. This also complicates distributed builds because it is common for the build system and the compiler to reside on different machines or even data centres so setting up an IPC channel between them may take a fair bit of work.

The architectural diagram of the batch compiler model looks like this:


Here the pathway for communicating module information is a compiler implementation detail. Every compiler can choose to implement it in the way that fits their internal compiler implementation the best. They can change its implementation whenever they wish because it is never exposed outside the compiler. The only public facing API that needs to be kept stable is the compiler command line, which all compilers already provide. This also permits them to ship a module implementation faster, since there is no need to fully specify the protocol and do interoperability tests. The downside is that the compiler needs to determine which sources to recompile when doing incremental builds.

Who will eventually make the decision?

I don't actually know. But probably it will come down to what the toolchain providers (GCC, Clang, Visual Studio) are willing to commit to.

It is my estimate (which is purely a guess, because I don't have first hand knowledge) that the batch system would take less effort to implement and would present a smaller ongoing maintenance burden. But compiler developers are the people who would actually know.

Wednesday, August 7, 2019

Building C++ modules, take N+1

Modules were voted in C++20 some time ago. They are meant to be a replacement for #include statements to increase build speeds and to also isolate translation units so, for example, macros defined in one file do not affect the contents of another file. There are three major different compilers and each of them has their own prototype implementation available (GCC documentation, Clang documentation, VS documentation).

As you would expect, all of these implementations are wildly different and, in the grand C++ tradition, byzantinely complicated. None of them also have a really good solution to the biggest problem of C++ modules, namely that of dependency tracking. A slightly simplified but mostly accurate description of the problem goes like this:

Instead of header files, all source code is written in one file. It contains export statements that describe what functions can be called from the outside. An analogy would be that functions declared as exported would be in a public header file and everything else would be internal and declared in an internal header file (or would be declared static or similar). The module source can not be included directly, instead when you compile the source code the compiler will output an object file and also a module interface file. The latter is just some sort of a binary data file describing the module's interface. An import statement works by finding this file and reading it in.

If you have file A that defines a module and file B that uses it, you need to first fully compile file A and only after the module interface file has been created can you compile file B. Traditionally C and C++ files can be compiled in parallel because everything needed to compile each file is already in the header files. With modules this is no longer the case. If you have ever compiled Fortran and this seems familiar, it's because it is basically the exact same architecture.

Herein lies the problem

The big, big problem is how do you determine what order you should build the sources in. Just looking at the files is not enough, you seemingly need to know something about their contents. At least the following approaches have been toyed with:
  1. Writing the dependencies between files manually in Makefiles. Yes. Really. This has actually been put forth as a serious proposal.
  2. First scan the contents of every file, determine the interdependencies, write them out to a separate dependency file and then run the actual build based on that. This requires parsing the source files twice and it has to be done by the compiler rather than a regex because you can define modules via macros (at least in VS currently).
  3. When the compiler finds a module import it can not resolve, it asks the build system via IPC to generate one. Somehow.
  4. Build an IPC mechanism between the different compiler instances so they can talk to each other to determine where the modules are. This should also work between compilers that are in different data centers when doing distributed builds.
Some of these approaches are better than others but all of them fail completely when source code generators enter the picture, especially if you want to build the generator executable during the build (which is fairly common). Scanning all file contents at the start of the build is not possible in this case, because some of the source code does not yet exist. It only comes into existence as build steps are executed. This is hideously complicated to support in a build system.

Is there a better solution?

There may well be, though I'd like to emphasize that none of the following has actually been tested and that I'm not a compiler developer. The approach itself does require some non-trivial amount of work on the compiler, but it should be less than writing a full blown IPC mechanism and distributed dataflow among the different parts of the system.

At the core of the proposed approach is the realization that not all module dependencies between files are the same. They can be split into two different types. This is demonstrated in the following diagram that has two targets: a library and an executable that uses it.


As you can see the dependencies within each target can get fairly complicated. The dependencies between targets can be just as complicated, but they have been left out of the picture to keep it simple. Note that there are no dependency cycles anywhere in the graph (this is mandated by the module specification FWICT). This gives us two different kinds of module dependencies: between-targets module dependencies and within-targets module dependencies.

The first one of these is actually fairly simple to solve. If you complete all compilations (but not the linking step) of the dependency library before starting any compilations in the executable, then all library module files that the executable could possibly need are guaranteed to exist. This is easy to implement with e.g. a Ninja pseudotarget.

The second case is the difficult one and leads to all the scanning problems and such discussed above. The proposed solution is to slightly change the way the compiler is invoked. Rather than starting one process per input file, we do something like the following:

g++ <other args> --outdir=somedir [all source files of this target]

What this means conceptually is that the compiler needs to take all the input files and compile each of them. Thus file1.cpp should end up as somedir/file1.o and so on. In addition it must deal with this target's internal module interrelations transparently behind the scenes. When run again it must detect which output files are up to date and rebuild only the outdated ones.

One possible implementation is that the compiler may launch one thread per input file (but no more than there are CPUs available). Each compilation proceeds as usual but when it encounters a module import that it can not find, it halts and waits on, say, a condition variable. Whenever a compilation job finishes writing a module, it will signal all the other tasks that a new module is available. Eventually either all jobs finish or every remaining task is deadlocked because they need a module that can't be found anywhere.
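A minimal sketch of that internal bookkeeping might look something like the following. This is purely illustrative and not taken from any real compiler; the class and method names are invented:

#include <condition_variable>
#include <mutex>
#include <set>
#include <string>

struct ModuleRegistry {
    std::mutex m;
    std::condition_variable cv;
    std::set<std::string> ready; // module interface files written so far

    // Called by a compilation thread when it finishes writing a module.
    void publish(const std::string &module_name) {
        {
            std::lock_guard<std::mutex> lock(m);
            ready.insert(module_name);
        }
        cv.notify_all();
    }

    // Called by a compilation thread that hits an import it can not resolve.
    // Blocks until some other thread publishes the module. A real
    // implementation would also need to detect the deadlock case where the
    // module never appears.
    void wait_for(const std::string &module_name) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return ready.count(module_name) > 0; });
    }
};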

This approach is similar to the IPC mechanism described in GCC's documentation, but it is never exposed to any third party program. It is fully an internal implementation detail of the compiler and as such there are no security risks or stability requirements for the protocol.

With this approach we can handle both internal and external module dependencies reliably. There is no need to scan the sources twice or write complicated protocols between the compiler and the build system. This even works for generated sources without any extra work, which no other proposed approach seems to be able to do.

As an added bonus the resulting command line API is so simple that it can even be driven with plain Make.

Extra bits

This approach also permits one to do ZapCC style caching. Since compiler arguments for all sources within one target must be the same under this scheme (which is a good thing to do in general), imports and includes can potentially be shared between different compiler tasks. Even further, suppose you have a type that is used in most sources, such as std::vector<std::string>. Normally the instantiated and compiled code would need to be written into every object file for the linker to eventually deduplicate. In this case, since we know that all outputs go to the same target, it is enough to write the code out in one object file. This can lead to major build footprint reductions. It should also reduce the linker's memory usage since there is a lot less debug info to manage. In most large projects linking, rather than compiling, is the slow and resource-intensive step, so making it faster is beneficial.

The module discussions have been going around in circles about either "not wanting to make the compiler a build system" or "not wanting to make the build system into a compiler". This approach does neither. The compiler is still a compiler; it just works on slightly bigger chunks of work at a time. It does need to track the staleness of output objects, though, which it did not need to do before.

There needs to be some sort of a load balancing system so that you don't accidentally spawn N compile jobs each of which spawns N internal work threads.

If you have a project that consists of only one executable with a gazillion files and you want to use a distributed build server then this approach is not the greatest. The solution to this is obvious: split your project into independent logical chunks. It's good design and you should do it regardless.

The biggest downside of this approach that I could come up with is that CCache will probably no longer work without significant effort. But if modules make compilation 5-10× faster (a figure that has been quoted; there are no public independent measurements yet) then it could be worth it.

Thursday, August 1, 2019

Metafont-inspired font design using nonlinear constraint optimization and Webassembly

Modern fonts are made by drawing the outline of each letter by hand. This is a very labour-intensive operation, as each letter has to be redrawn for each weight. Back in the late 70s Donald Knuth created METAFONT, which takes a completely different approach that mimics the way letters are drawn by hand. In this system you specify a pen shape and a stroke path. The computer then calculates what sort of a mark this stroke would leave on paper and draws the result. The pen shape as well as the stroke are defined as mathematical equations of the form "the stroke shall begin so that its topmost part is exactly at x-height and shall end so that its bottom is at the baseline".

The advantage of this approach is that it is fully parametrizable. If you change the global x-height, then every letter is automatically recalculated. Outline fonts can't be slanted by shearing because it changes the strokes' widths. Parametric fonts do not have this issue. Different optical weights can be obtained simply by changing the size of the pen nib. You could even go as far as changing the widths of individual letters during typesetting to make text justification appear even smoother.

METAFONT was never widely used for font design. The idea behind it was elegant, though, and is worth further examination: for example, are there other ways of defining fonts parametrically, and what sort of letter shapes would they produce? Below we describe one such attempt.

How would that look in practice?

In a nutshell you write a definition that mathematically specifies the shape:


Then the system will automatically create the "best possible" shape that fulfills those requirements:


Note that this output comes from the Python Scipy prototype version, not the Webassembly one. It calculates a slightly different shape for the letter and does not do the outline yet.

How does it work?

First the program parses the input file, which currently can specify only one stroke. Said stroke consists of one or more cubic splines one after the other. That is, the end point of spline n is the starting point of spline n+1. It then converts the constraints into mathematical form. For example if you have a constraint that a point must be on a diagonal line between (0, 0) and (1, 1), then its coordinates must be (t, t) where t is a free parameter with values in the range [0, 1]. The free parameters can be assigned any values and the resulting curve will still fulfill all the requirements.
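As a toy illustration of how a constraint turns into free parameters (this is not the project's actual code, just a sketch of the idea), consider a point constrained to lie on the segment between two fixed endpoints:

struct Point { double x, y; };

// A point constrained to the segment from a to b is fully described by a
// single free parameter t in [0, 1].
Point on_segment(Point a, Point b, double t) {
    return Point{a.x + t * (b.x - a.x), a.y + t * (b.y - a.y)};
}

// The diagonal example from the text: a = (0, 0), b = (1, 1) yields (t, t).
// The optimizer can then pick any t without violating the constraint.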

The followup question is how we select the free parameters to pick the "most pleasing" of all curves, to quote Knuth's terminology. If you thought "deep learning" then you thought wrong. This is actually a well-known problem called nonlinear optimization. There are many possible approaches, but we have chosen to use the (draws a deep breath) Broyden–Fletcher–Goldfarb–Shanno algorithm. This is the default optimization method in Scipy. There is also a C port of the original Fortran implementation.

The last thing we need is a target function to minimize. There are again many choices (curve length, energy, etc). After some experimentation we chose the maximum absolute value of the second derivative's perpendicular component over the entire curve. It gave the smoothest end result. Having defined a target function, free variables and a solver method we can let the computer do what it does best: crunch numbers. The end result is written as SVG using TinyXML2.
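To make the target function more concrete, here is a rough sketch (again, not the project's actual code) of how one could approximate the maximum perpendicular component of the second derivative over a single cubic Bézier segment by sampling:

#include <algorithm>
#include <cmath>

struct Vec2 { double x, y; };

static Vec2 sub(Vec2 a, Vec2 b) { return {a.x - b.x, a.y - b.y}; }
static Vec2 scale(Vec2 a, double s) { return {a.x * s, a.y * s}; }
static double dot(Vec2 a, Vec2 b) { return a.x * b.x + a.y * b.y; }
static double norm(Vec2 a) { return std::sqrt(dot(a, a)); }

// First derivative of a cubic Bézier with control points p0..p3.
static Vec2 bezier_d1(Vec2 p0, Vec2 p1, Vec2 p2, Vec2 p3, double t) {
    const double u = 1.0 - t;
    Vec2 a = scale(sub(p1, p0), 3 * u * u);
    Vec2 b = scale(sub(p2, p1), 6 * u * t);
    Vec2 c = scale(sub(p3, p2), 3 * t * t);
    return {a.x + b.x + c.x, a.y + b.y + c.y};
}

// Second derivative of the same curve.
static Vec2 bezier_d2(Vec2 p0, Vec2 p1, Vec2 p2, Vec2 p3, double t) {
    const double u = 1.0 - t;
    Vec2 a = scale({p2.x - 2 * p1.x + p0.x, p2.y - 2 * p1.y + p0.y}, 6 * u);
    Vec2 b = scale({p3.x - 2 * p2.x + p1.x, p3.y - 2 * p2.y + p1.y}, 6 * t);
    return {a.x + b.x, a.y + b.y};
}

// Sampled approximation of max |component of B'' perpendicular to B'|.
double max_perpendicular_d2(Vec2 p0, Vec2 p1, Vec2 p2, Vec2 p3, int samples = 256) {
    double worst = 0.0;
    for (int i = 0; i <= samples; ++i) {
        const double t = double(i) / samples;
        const Vec2 d1 = bezier_d1(p0, p1, p2, p3, t);
        const Vec2 d2 = bezier_d2(p0, p1, p2, p3, t);
        const double len = norm(d1);
        if (len == 0.0)
            continue;
        const Vec2 tangent = scale(d1, 1.0 / len);
        const Vec2 perp = sub(d2, scale(tangent, dot(d2, tangent)));
        worst = std::max(worst, norm(perp));
    }
    return worst;
}

The real target function additionally has to handle several consecutive spline segments; a quantity like this is what gets handed to the optimizer to minimize.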

There are known problems with the current evaluation algorithm. Sometimes it produces weird, non-smooth results. LBFGS seems to be sensitive to the initial estimate, and there are interesting numerical problems when the evaluation of the target function switches between two consecutive curves.

Try it yourself

The Webassembly version is available here. You can edit the definition text and then recompute the letter shape with the action button. Things to try include changing the width and height (variables w and h, respectively) or changing the location constraints for curve points.

The code can also be compiled and run as a native program. It is available in this Github repository.

Future work and things that need fixing

  • Firefox renders the resulting SVG files incorrectly
  • All mathy bits are hardcoded, optimization functions et al should be definable in the character file instead
  • Designing strokes by hand is not nice, some sort of a UI would be useful
  • The text format for specifying strokes needs to be designed
  • Should be able to directly specify some path as a straight line segment rather than a bezier
  • You should be able to specify more than one stroke in a character
  • Actual font file generation, possibly via FontForge
  • "Anti-drawing" or erasing existing strokes with a second stroke

Tuesday, July 16, 2019

A personal story about 10× development

During the last few days there has been an ongoing Twitter storm about 10× developers. And like all the ones before it (and all the future ones that will inevitably happen) the debate immediately devolved into name calling and all the other things you'd expect from Twitter fights. This blog post is not about that. Instead it is about a personal experience with productivity that I got to see from much closer up than I would have liked.

Some years ago I was working for company X on product Y. All in all it was quite a nice experience. We had a small team working on a code base that was pretty good. It had nice tests, not too many bugs, and when issues did arise they were usually easy to fix. Eventually the project was deemed good enough and we were transferred to work on different projects.

I have no idea what our "industry standard performance multiplier" was when we worked on that project, but for the sake of argument let's call it 1×.

The project I got transferred to was the thing of nightmares. It was a C++ project and all the bad things that have ever been said about C++ were true about that code base. There was not much code but it was utterly incomprehensible. There were massively deep inheritance hierarchies, compilation speed was measured in minutes for even the most trivial changes, and so on. It was managed by an architecture astronaut who, as architecture astronauts are wont to do, had rewritten existing mature libraries as header-only template libraries that were buggy and untested (one could even say untestable).

Thus overnight I went from being a 1× down to being a 0.1× or possibly even a 0.01× developer. Simply trying to understand what a specific function was supposed to do took hours. There was, naturally, a big product launch coming up so we needed to get things finished quickly. All in all it was a stressful, frustrating and unpleasant situation to be in. And that was not the worst of it.

After a few weeks my manager wanted to talk to me in private. He was very concerned about the fact that I had not achieved any visible progress for a while. So I explained to him in detail all the problems in the current project. I even demonstrated how compiling a simple helloworld-level program with the frameworks we had to use took tens of seconds on the beefiest i7 desktop machine I had available. He did not seem to grasp any of that, as his only response was "but you used to be so productive in your previous project". Shortly thereafter the same boss started giving me not-at-all-thinly-veiled accusations that I was just slacking off and that this could lead to serious reprimands.

This story does not have a happy ending. The project eventually failed (due to completely different reasons, though), money was squandered and almost everyone involved got fired. In the aftermath I seriously considered getting out of the software engineering business altogether. The entire experience had been so miserable that becoming a 0× developer was seriously tempting.

Is there something we can learn from this?

The "×ness" of any developer does not exist in a vacuum but depends on many organizational things. The most obvious one is tooling. If you have a CI where tests take 30 minutes to run or your developers have underpowered old laptops, everyone's performance goes down. In fact, the overall health of the code base probably has a bigger effect on developer productivity than all developers' skills combined.

But even more important than technical issues are the practices that promote healthy team dynamics. These include blameless postmortems, openness to ideas from everyone, permission to try new things even if they may fail, stern weeding out of jerk behaviour and, ultimately, trust.

If you work on getting all of these into your working environment, you may find yourself with a 10× team. And if you do, the entire concept of a single 10× developer becomes meaningless.

Sunday, July 14, 2019

Initializing all local variables with Clang-Tidy

A common source of all kinds of bugs is using variables without properly initializing them. Out of all security problems this one is the simplest to fix: just convert all declarations of the form int x; to int x = 0;. The main reason for not doing that is laziness; manually going through existing code bases and adding initialization statements is boring and nobody wants to do it.

Fortunately nowadays we don't have to. Clang-tidy provides a nice toolkit for writing source code refactoring tools for C and C++. As an exercise I wrote a checker to do this. It has been submitted upstream and is undergoing code review. Implementing it was fairly straightforward. There were only two major problems. The first one was that the existing documentation consists mostly of reference material: there are no easy-to-follow tutorials, only Doxygen pages. But if you dig around on the net and work on it a bit, you can get it working.

The second, and bigger, obstacle is that doing anything in the LLVM code base is sloooow. Everything in LLVM and Clang is linked into single, huge, monolithic libraries which take forever to link. Because of reasons I started doing this work on my secondary machine, which is a 4-core i5 with 16 gigs of RAM. I had to limit simultaneous linker jobs to 2 because otherwise linking would just crash spectacularly with an out-of-memory error. Presumably it is impossible to compile the code base on a machine that has only 8 gigs of RAM. It seems that if you want to do any real development on LLVM you need a spare data center to run the compilations, which is unfortunate.

There is an evil hack to work around this, though. Set the CMake build type to Debug and then change CMAKE_CXX_FLAGS_DEBUG and CMAKE_C_FLAGS_DEBUG from -g to -Og. This makes the compilation faster and reduces memory usage to a fraction of the original. The downside is that there is no debug information, but it turns out to not be necessary when writing simple Clang-Tidy checkers.
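For concreteness, the configure step would look roughly like this (the source path is a placeholder for wherever your LLVM checkout lives; the cache variables themselves are standard CMake ones):

cmake -DCMAKE_BUILD_TYPE=Debug -DCMAKE_CXX_FLAGS_DEBUG=-Og -DCMAKE_C_FLAGS_DEBUG=-Og <path to llvm source>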

Once all that is done the actual checker is almost trivial. This is the part that looks up all local variables without initial values:

void InitLocalVariablesCheck::registerMatchers(MatchFinder *Finder) {
  Finder->addMatcher(
      varDecl(unless(hasInitializer(anything()))).bind("vardecl"), this);
}

Then you determine what the initial value should be based on the type of the variable and add a warning and a fixit:

diag(location, "variable %0 is not initialized")
    << MatchedDecl;
diag(location, "insert initial value", DiagnosticIDs::Note)
    << FixItHint::CreateInsertion(
           location.getLocWithOffset(VarName.size()),
           InitializationString);

All in all this amounts to about 100 lines of code plus tests.

But what about performance?

The other major reason not to initialize variables is that it "may cause a runtime performance degradation of unknown magnitude". That is true, but with this tooling the degradation is no longer unknown. You can run the tool and then measure the results. This is trivial for all code bases that have performance benchmarks.
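For example, once the checker is available in your clang-tidy binary, a run with automatic fixes would look roughly like the following. The check name is a placeholder, since the final upstream name is still subject to review, and somefile.cpp plus the flags after the double dash stand in for your own sources and compile options:

clang-tidy -checks=-*,<name-of-the-new-check> -fix somefile.cpp -- -Iinclude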

Saturday, June 8, 2019

Tweaking the parallel Zip file writer

A few years ago I wrote a command line program to compress and decompress Zip files in parallel. It turned out to work pretty well, but it had one design flaw that kept annoying me.

What is the problem?

Decompressing Zip files in parallel is almost trivial. Each file can be decompressed in parallel without affecting any other decompression task. Fire up N processing tasks and decompress files until finished. Compressing Zip files is more difficult to parallelize. Each file can be compressed separately, but the problem comes from writing the output file.

The output file must be written one file at a time. So if one compressed file is being written to the output file then other compression tasks must wait until it finishes before their output can be written to the result file. This data can not be kept in memory because it is common to have output files that are larger than available memory.

The original solution (and thus the design flaw alluded to) was to have each compressor write its output to a temporary file. The writer would then read the data from the temporary file, write it to the final result file and delete the temporary file.

This works but means that the data gets written to the file system twice. It may also require up to 2× disk space. The worst case happens when you compress only one very big file. On desktop machines this is not such a problem, but on something like a Raspberry Pi the disk is an SD card, which is very slow. You only want to write it once. SD cards also wear out when written to, which is another reason to avoid writes.

The new approach

An optimal solution would have all of these properties:
  1. Uses all CPU cores 100% of the time (except at the end when there are fewer tasks than cores).
  2. Writes data to the file system only once.
  3. Handles files of arbitrary size (much bigger than available RAM).
  4. Has bounded memory consumption.
It turns out that not all of these are achievable at the same time. Or at least I could not come up with a way. After watching some Numberphile videos I felt like writing a formal proof but quickly gave up on the idea. Roughly speaking since you can't reliably estimate when the tasks finish and how large the resulting files will be, it does not seem possible to choose an optimal strategy for writing the results out to disk.

The new architecture I came up with looks like this:


Rather than writing its result to a temporary file, each compressor writes it to a byte queue with a fixed maximum size. This was chosen to be either 10 or 100 megabytes, which means that in practice most files will fit the buffer. The queue can be in one of three states: not full, full or finished. The difference between the last two is that a full queue is one where the compression task still has data to compress but it can't proceed until the queue is emptied.
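A rough sketch of such a queue (this is illustrative, not the program's actual code) could look like this:

#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <string>
#include <utility>

class ByteQueue {
public:
    explicit ByteQueue(std::size_t max_size) : max_size_(max_size) {}

    // Called by a compressor task; blocks while the queue is full.
    void push(const char *data, std::size_t size) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [&] { return buf_.size() < max_size_; });
        buf_.append(data, size); // may overshoot by one chunk, fine for a sketch
    }

    // Called by a compressor task when it has no more data to produce.
    void mark_finished() {
        std::lock_guard<std::mutex> lock(m_);
        finished_ = true;
    }

    // Called by the single writer; returns the buffered bytes and whether the
    // producer has finished. Draining also unblocks a producer stuck in push().
    std::pair<std::string, bool> drain() {
        std::lock_guard<std::mutex> lock(m_);
        std::string out;
        out.swap(buf_);
        cv_.notify_all();
        return {std::move(out), finished_};
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::string buf_;
    const std::size_t max_size_;
    bool finished_ = false;
};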

The behaviour is now straightforward. First launch compressor tasks just as when decompressing. The file writer part then goes through all the queues. If it finds a finished queue it writes it to disk and launches a new task. If it finds a full queue it does the same, but it must write out the whole stream, meaning it is blocked until the current file has been fully compressed. If the compression takes too long, all other compression tasks will finish (or get full) but new ones can't be launched, leading to CPU underutilization.

Is it possible to do better?

Yes, but only as a special case. Btrfs supports copying data from one file to another in O(1) time taking only an extra O(1) space. Thus you could write all data to temp files, copy the data to the final file and delete the temp files.
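On Linux the underlying mechanism is a reflink, which can be requested with the FICLONE ioctl. Here is a hedged sketch of the basic call (cloning into the middle of an existing archive would need FICLONERANGE instead, and the helper below is an illustration rather than code from the actual program):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>
#include <cstdio>

// Clone the contents of src into dst in O(1) time on filesystems that
// support reflinks, such as Btrfs.
bool clone_file(const char *src, const char *dst) {
    int in = open(src, O_RDONLY);
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    const bool ok = in >= 0 && out >= 0 && ioctl(out, FICLONE, in) == 0;
    if (!ok)
        std::perror("cloning failed (the filesystem may not support reflinks)");
    if (in >= 0) close(in);
    if (out >= 0) close(out);
    return ok;
}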

Sunday, June 2, 2019

Looking at why the Meson crowdfunding campaign failed

The crowdfunding campaign to create a full manual for the Meson build system ended yesterday. It did not reach its 10 000€ goal so the book will not be produced and instead all contributed money will be returned. I'd like to thank everyone who participated. A special thanks goes out to Centricular for their bronze corporate sponsorship (which, interestingly, was almost 50% of the total money raised).

Nevertheless the fact remains that this project was a failure and a fairly major one at that since it did not reach even one third of its target. This can not be helped, but maybe we can salvage some pieces of useful information from the ruins.

Some statistics

There were a total of 42 contributors to the campaign. Indiegogo says that a total of 596 people visited the project while it was live. Thus roughly 7% of all people who came to the site participated. It is harder to know how many people saw information about the campaign without coming to the site. Estimating from the blog's readership, Twitter reach and other sources puts the number at around 5000 globally (with a fairly large margin of error). This would indicate a conversion rate of about 1% of all the people who saw any information about the campaign. In reality the percentage is lower, since many of the contributors were people who did not really need convincing. Thus the conversion rate is probably closer to 0.5% or even lower.

The project was set up so that 300 contributors would have been enough to make the project a success. Given the number of people using Meson (estimated to be in the tens of thousands) this seemed like a reasonable goal. Turns out that it wasn't. Given these conversion numbers you'd need to reach 30 000 – 60 000 people in order to succeed. For a small project with zero advertising budget this seems like a hard thing to achieve.

On the Internet everything drowns

Twitter, LinkedIn, Facebook and the like are not very good channels for spreading information. They are firehoses where any one post has an active time of maybe one second if you are lucky. And if you are not, the platforms' algorithms will hide your post because they deem it "uninteresting".  Sadly filtering seems to be mandatory, because not having it makes the firehose even more unreadable. The only hope you have is that someone popular writes about your project. In practice this can only be achieved via personal connections.

Reddit-like aggregation sites are not much better, because you have basically two choices: either post on a popular subreddit or on an unpopular one. In the first case your post probably won't even make it to the front page; all it takes is a few downvotes because the post is "not interesting" or "does not belong here". A post that is not on the front page might as well not exist; no-one will read it. Posting on an unpopular subreddit is no better. Your post is there, but it will reach maybe 10 people and out of those maybe 1 will click on the link.

News sites are great for getting the information out, but they suffer from the same popularity problem as everything else. A distilled (and only slightly snarky) explanation is that news sites write mainly about two things:
  1. Things they have already written about (i.e. have deemed popular)
  2. Things other news sites write about (i.e. that other people have deemed popular)
This is not really the fault of news sites. They are doing their best on a very difficult job. This is just how the world and popularity work. Things that are popular get more popular because of their current popularity alone. Things that are not popular are unlikely to ever become popular because of their current unpopularity alone.

Unexpected requirements

One of the goals of this campaign (or experiment, really) was to see if selling manuals would be a sustainable way to compensate FOSS project developers and maintainers for their work. If it worked, it would be a good way to provide compensation, because there are already established legal practices for selling books across the world. Transferring money in other ways (donations etc.) is difficult and there may be legal obstacles.

Based on this one experiment this does not seem to be a feasible approach. Interestingly, multiple people let me know that they would not be participating because the end result would not be released under a free license. Presumably the same people do not complain to bookstore clerks that "I will only buy this Harry Potter book if, immediately after my purchase, the full book is released for free on the Internet". But for some reason there is a hidden assumption that because a person has released something under a free license, they must publish everything else they do under free licenses as well.

These additional limitations make this model of charging for docs really hard to pull off. There is no possibility of steady, long term money flow because once a book is out under a free license it becomes unsellable. People will just download the free PDF instead. A completely different model (or implementation of the attempted model) seems to be needed.

So what happens next?

I don't really know. Maybe the book can get published through an actual publisher. Maybe not. Maybe I'll just take a break from the whole thing and get back to it later. But to end on some kind of a positive note I have extracted one chapter from the book and have posted it here in PDF form for everyone to read. Enjoy.