Friday, March 19, 2021

Microsoft is shipping a product built with Meson

Some time ago Microsoft announced a compatibility pack to get OpenGL and OpenCL running even on computers whose hardware does not provide native OpenGL drivers. It is basically OpenGL-over-Direct3D. Or that is at least my understanding of it. Hopefully this description is sufficiently accurate to not cause audible groans from the devs who actually know what it is doing under the covers. More details can be found in this blog post.

An OpenGL implementation is a whole lot of work and writing one from scratch is a multi-year project. Instead of doing that, Microsoft chose the sensible approach of taking the Mesa implementation and porting it to work on Windows. Typically large corporations do this by the vendoring approach, that is, copying the source code inside their own repos, rewriting the build system and treating it as if it was their own code.

The blog post does not say it, but in this case that approach was not taken. Instead all work was done in upstream Mesa and the end products are built with the same Meson build files [1]. This also goes for the final release that is available in the Windows Store. This is a fairly big milestone for the Meson project as it is now provably mature enough that major players like Microsoft are willing to use it to build and ship end user products.

[1] There may, of course, be some internal patches we don't know about.

Wednesday, March 10, 2021

Mixing Rust into an existing C shared library using Meson

Many people are interested in adding Rust to their existing projects for additional safety. For example it would be convenient to use Rust for individual high-risk things like string parsing while leaving the other bits as they are. For shared libraries you'd need to be able to do this while preserving the external plain C API and ABI. Most Rust compilation is done with Cargo, but it is not particularly suited to this task due to two things.

  1. Integrating Cargo into an existing project's build system is painful, because Cargo wants to dominate the entire build process. It does not cooperate with these kinds of build setups particularly well.
  2. Using any Cargo dependency brings in tens or hundreds of dependency crates including five different command line parsers, test frameworks and other deps that you don't care about and don't need but which take forever to compile.
It should be noted that the latter is not strictly Cargo's fault. It is possible to use it standalone without external deps. However what seems to happen in practice is all Cargo projects experience a dependency explosion sooner or later. Thus it would seem like there should be a less invasive way to merge Rust into an existing code base. Fortunately with Meson, there is.

The sample project

To see how this can be done, we created a simple standalone C project for adding numbers. The full source code can be found in this repository. The library consists of three functions:

adder* adder_create(int number);
int adder_add(adder *a, int number);
void adder_destroy(adder*);

To add the numbers 2 and 4 together, you'd do this:

adder *two_adder = adder_create(2);
int six = adder_add(two_adder, 4);
adder_destroy(two_adder);

As adding numbers is highly dangerous, we want to implement the adder_add function in Rust and leave the other functions untouched. The implementation in all its simplicity is the following:

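// Must have the same layout as the adder struct on the C side.
#[repr(C)]
pub struct Adder {
    pub number: i32,
}

// Exported under its plain C name so existing callers keep working.
#[no_mangle]
pub extern "C" fn adder_add(a: &Adder, number: i32) -> i32 {
    a.number + number
}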

The build setup

Meson has native support for building Rust. It does not require Cargo or any other tool, it invokes rustc directly. In this particular case we need to build the Rust code as a staticlib.

rl = static_library('radder', 'adder.rs',
                    rust_crate_type: 'staticlib')

In theory all you'd need to do, then, is to link this library into the main shared library, remove the adder_add implementation from the C side and you'd be done. Unfortunately it's not that simple. Because nothing in the existing code calls this function, the linker will look at it, see that it is unused and throw it away.

The common approach in these cases is to use link_whole instead of plain linking. This does not work, because rustc adds its own metadata files inside the static library. The system linker does not know how to handle those and will exit with an error. Fortunately there is a way to make this work. You can specify additional undefined symbol names to the linker. This makes it behave as if something in the existing code had called adder_add, and grabs the implementation from the static library. This can be done with an additional kwarg to the shared_library call.

link_args: '-Wl,-u,adder_add'

With this the goal has been reached: one function implementation is done with Rust while preserving both the API and the ABI and the test suite passes as well. The resulting shared library file is about 1 kilobyte bigger than the plain C one (though if you build without optimizations enabled, it is a whopping 14 megabytes bigger).
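For reference, here is a minimal sketch of what the C side of such a library might look like. It assumes the adder struct holds nothing but a single int, which is the layout the #[repr(C)] Adder struct on the Rust side mirrors. The real repository may well lay things out differently. The adder_add shown here is only a C stand-in for the function that moved to Rust.

```c
#include <stdlib.h>

/* Assumed layout: one int, matching the #[repr(C)] Adder struct in Rust. */
typedef struct adder {
    int number;
} adder;

/* Allocates an adder holding the given number. Returns NULL on failure. */
adder *adder_create(int number) {
    adder *a = malloc(sizeof *a);
    if (a != NULL) {
        a->number = number;
    }
    return a;
}

/* C stand-in for the function that the Rust implementation replaces. */
int adder_add(adder *a, int number) {
    return a->number + number;
}

/* Releases an adder created with adder_create. */
void adder_destroy(adder *a) {
    free(a);
}
```

Note that Rust's i32 and C's int only agree on platforms where int is 32 bits wide, which covers the usual desktop targets.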

Wednesday, February 24, 2021

Millennium prize problems but for Linux

There is a longstanding tradition in mathematics to create a list of hard unsolved problems to drive people to work on solving them. Examples include Hilbert's problems and the Millennium Prize problems. Wouldn't it be nice if we had the same for Linux? A bunch of hard problems with sexy names that would drive development forward? Sadly there is no easy source for tens of millions of euros in prize money, not to mention it would be very hard to distribute as this work would, by necessity, be spread over a large group of people.

Thus it seems unlikely for this to work in practice, but that does not prevent us from stealing a different trick from the mathematicians' toolbox and pondering how it would work in theory. In this case the list of problems will probably never exist, but let's assume that it does. What would it contain if it did exist? Here's one example I came up with. It is left as an exercise to the reader to work out what prompted me to write this post.

The Memory Depletion Smoothness Property

When running the following sequence of steps:
  1. Check out the full source code for LLVM + Clang
  2. Configure it to compile Clang and Clang-tools-extra, use the Ninja backend and RelWithDebInfo build type, leave all other settings to their default values
  3. Start watching a video file with VLC or a browser
  4. Start compilation by running nice -19 ninja
The outcome must be that the video playback works without skipping a single frame or audio sample.

What happens currently?

When Clang starts linking, each linker process takes up to 10 gigabytes of RAM. This leads to memory exhaustion, flushing active memory to swap and eventually crashing the linker processes. Before that happens, however, every other app freezes completely and the entire desktop remains frozen until things get swapped back into memory, which can take tens of seconds. Interestingly all browser tabs are killed before linker processes start failing. This happens both with Firefox and Chromium.

What should happen instead?

The system handles the problematic case in a better way. The linker processes will still die as there is not enough memory to run them all but the UI should never noticeably freeze. For extra points the same should happen even if you run Ninja without nice.

The wrong solution

A knee-jerk reaction many people have is something along the lines of "you can solve this by limiting the number of linker processes by doing X". That is not the answer. It solves the symptoms but not the underlying cause, which is that bad input causes the scheduler to do the wrong thing. There are many other ways of triggering the same issue, for example by copying large files around. A proper solution would fix all of those in one go.

Saturday, February 6, 2021

Why most programming language performance comparisons are most likely wrong

For as long as programming languages have existed, people have fought over which one of them is the fastest. These debates have ranged from serious scientific research to many a heated late night bar discussion. Rather than getting into this argument, let's look at the problem at a higher level, namely how would you compare the performance of two different programming languages. The only really meaningful approach is to do it empirically, that is, implementing a bunch of test programs in both programming languages, benchmarking them and then declaring the winner.

This is hard. Really hard. Insanely hard in some cases and very laborious in any case. Even though the problem seems straightforward, there are a ton of error sources that can trip up the unaware (and even many very-much-aware) performance tester.

Equivalent implementations?

In order to make the two implementations comparable they should be "of equal quality". That is, they should have been implemented by people with roughly the same amount of domain knowledge as well as programming skills in their chosen language. This is difficult to organise. If the implementations are written by different people, they may approach the problem with different algorithms making the relative performance not a question of programming languages per se, but of the programming approaches chosen by each programmer.

Even if both implementations are written by the same person using the same algorithm, there are still problems. Typically people are better at some programming languages than others. Thus they tend to provide faster implementations in their favourite language. This causes bias, because the performance is not a measure of the programming languages themselves, but rather the individual programmer. These sorts of tests can be useful in finding usability and productivity differences, but not so much for performance.

Thus you might want to evaluate existing programs written by many expert programmers. This is a good approach, but sometimes even seasoned researchers get it wrong. There is a paper that tries to compare different programming languages for performance and power efficiency using this approach. In their test results one particular program's C implementation was 30% faster than the same program written in C++. This single measurement casts a big shadow over the entire paper. If we took the original C source, changed all the sources' file extension from .c to .cpp and recompiled, the end result should have the same performance within a few percentage points. Thus we have to conclude that one of the following is happening (in decreasing order of probability):
  1. The C++ version is suboptimally coded.
  2. The testing methodology has a noticeable flaw.
  3. The compiler used has a major performance regression for C++ as opposed to C.
Or, in other words, the performance difference comes from somewhere other than the choice of programming language.

The difficulty of measurement

A big question is how does one actually measure the performance of any given program. A common approach is to run the test multiple times in a row and then do something like the following:
  • Handle outliers by dropping the points at extreme ends (that is, the slowest and fastest measurements)
  • Calculate the mean and/or median for the remaining data points
  • Compare the result between different programs, the one with the fastest time wins
Those who remember their high school statistics lessons might calculate standard deviation as well. This approach seems sound and rigorous, but it contains several sources of systematic error. The first of these is quite surprising and has to do with noise in measurements.

Most basic statistical tools assume that the error is normally distributed with an average value of zero. If you are measuring something like temperature or speed this is a reasonable assumption. For this case it is not. A program's measured time consists of the "true" time spent solving the problem and overhead that comes from things like OS interruptions, disk accesses and so on. If we assume that the noise is gaussian with a zero average then what it means is that the physical machine has random processes that make the program run faster than it would if the machine was completely noise free. This is, of course, impossible. The noise is strongly non-gaussian simply because it can never have a negative value.

In fact, the measurement that is the closest to the platonic ideal answer is the fastest one. It has the least amount of noise interference from the system. That is the very same measurement that was discarded in the first step when outliers were cleaned out. Sometimes doing established and reasonable things makes things worse.
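To make the difference concrete, here is a small sketch in C. The function names are made up for illustration. One function reports the fastest measurement, the other does the "drop the extremes and average the rest" procedure described above.

```c
#include <stddef.h>

/* Returns the smallest measurement, that is, the run with the least
   noise added on top of the "true" time. Requires n >= 1. */
double fastest_time(const double *times, size_t n) {
    double best = times[0];
    for (size_t i = 1; i < n; i++) {
        if (times[i] < best) {
            best = times[i];
        }
    }
    return best;
}

/* Drops the single fastest and the single slowest measurement and
   returns the mean of the rest. Requires n >= 3. */
double trimmed_mean(const double *times, size_t n) {
    double lo = times[0];
    double hi = times[0];
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (times[i] < lo) {
            lo = times[i];
        }
        if (times[i] > hi) {
            hi = times[i];
        }
        sum += times[i];
    }
    return (sum - lo - hi) / (double)(n - 2);
}
```

Every value that survives the trimming is at least as large as the fastest one, so the trimmed mean can never come out below the fastest time. The gap between the two numbers is a rough indication of how much system noise ended up in the "cleaned" result.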

Statistics even harder

Putting that aside, let's assume we have measurements for the two programs, which do look "sufficiently gaussian". Numerical analysis shows that language #1 takes 10 seconds to run whereas language #2 takes 9 seconds. A 10% difference is notable and thus we can conclude that language #2 is faster. Right?

Well, no. Suppose the actual measurement data look like this:


Is the one on the right faster or not? Maybe? Probably? Could be? Answering this question properly requires going all the way to university level statistics. First one formulates a null hypothesis, that is, that the two programs have no performance difference. Then one calculates the probability that both of these measurements have come from the same probability distribution. If the probability for this is small (typically below 5%), then the null hypothesis is rejected and we can conclude with reasonable confidence that one program is indeed faster than the other. This method is known as Student's t-test and it is used commonly in heavy duty statistics. Note that some implementations of the test assume Gaussian data and if your data has some other shape, the results you get might not be reliable.
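As a rough illustration of the arithmetic involved, here is a sketch in C that computes the square of the pooled-variance Student's t statistic for two sets of measurements. The function names are my own and this is not a replacement for a real statistics package. It returns t squared rather than t because the two-sided test only cares about the magnitude, and comparing against the squared critical value gives the same answer without needing a square root.

```c
#include <stddef.h>

/* Arithmetic mean of n values. */
static double sample_mean(const double *v, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += v[i];
    }
    return sum / (double)n;
}

/* Sum of squared deviations from the given mean. */
static double sum_sq_dev(const double *v, size_t n, double mean) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = v[i] - mean;
        sum += d * d;
    }
    return sum;
}

/* Returns the square of Student's t statistic, using pooled variance.
   Assumes both samples are roughly Gaussian with similar variance.
   Requires na >= 2, nb >= 2 and a non-zero pooled variance.
   The degrees of freedom are na + nb - 2. */
double student_t_squared(const double *a, size_t na,
                         const double *b, size_t nb) {
    double mean_a = sample_mean(a, na);
    double mean_b = sample_mean(b, nb);
    double pooled_var = (sum_sq_dev(a, na, mean_a) + sum_sq_dev(b, nb, mean_b))
                        / (double)(na + nb - 2);
    double diff = mean_a - mean_b;
    return diff * diff
           / (pooled_var * (1.0 / (double)na + 1.0 / (double)nb));
}
```

With five measurements per program there are 8 degrees of freedom, for which the two-sided 5% critical value of t is 2.306, or about 5.32 when squared. A result above that rejects the null hypothesis, a result below it does not. Two samples whose averages are 10 and 9 seconds can land on either side of that line depending on how widely the individual measurements are spread.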

This works for one program, but a rigorous test has many different programs. There are statistical methods for evaluating those, but they get even more complicated. Looking up how they work is left as an exercise to the reader.

All computers' alignment is Chaotic Neutral

Statistics are hard, but fortunately computers are simple because they are deterministic, reliable and logical. For example if you have a program and you modify it by adding a single NOP instruction somewhere in the stream, the biggest impact it could possibly have is one extra instruction cycle, which is so vanishingly small as to be unmeasurable. If you do go out and measure it, though, the results might confuse and befuddle you. Not only can this one minor change make the program 10% slower (or possibly even more), it can even make it 10% faster. Yes. Doing pointless extra work can make your program faster.

If this is the first time you encounter this issue you probably won't believe it. Some fraction might already have gone to Twitter to post their opinion about this "stupid and wrong" article written by someone who is "clearly incompetent and an idiot". That's ok, I forgive you. Human nature and all that. You'll grow out of it eventually. The phenomenon is actually real and can be verified with measurements. How is it possible that having the CPU do extra work could make the program faster?

The answer is that it doesn't. The actual instruction is irrelevant. What actually matters is code alignment. As code shifts around in memory, its performance characteristics change. If a hot loop gets split by a cache boundary it slows down. Unsplitting it speeds it up. The NOP does not need to be inside the loop for this to happen, simply moving the entire code block up or down can cause this difference. Suppose you measure two programs in the most rigorous statistical way possible. If the performance delta between the two is under something like 10%, you can not reasonably say one is faster than the other unless you use a measurement method that eliminates alignment effects.

It's about the machine, not the language

As programs get faster and faster optimisation undergoes an interesting phase transition. Once performance gets to a certain level the system is no longer about what the compiler and CPU can do to run the developer's program as fast as possible. Instead it becomes about how the programmer can utilize the CPU's functionality as efficiently as possible. These include things like arranging your data into a layout that the processor can crunch with minimal effort and so on. In effect this means replacing language based primitives with hardware based primitives. In some circles optimization works weirdly backwards in that the programmer knows exactly what SIMD instructions they want a given loop to be optimized into and then fiddles around with the code until it does. At this point the functionality of the programming language itself is immaterial.

This is the main reason why languages like C and Fortran are still at the top of many performance benchmarks, but the techniques are not limited to those languages. Years ago I worked on a fairly large Java application that had been very thoroughly optimized. Its internals consisted of integer arrays. There were no classes or even Integer objects in the fast path, it was basically a recreation of C inside Java. It could have been implemented in pretty much any programming language. The performance differences between them would have mostly come down to each compiler's optimizer. They produce wildly different performance outcomes even when using the same programming language, let alone different ones. Once this happens it's not really reasonable to claim that any one programming language is clearly faster than others since they all get reduced to glorified inline assemblers.
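A common example of arranging data for the processor is switching from an array of structs to a struct of arrays. The sketch below is in C and uses a made-up particle type purely for illustration. Both functions compute the same sum, but the second one reads a single contiguous array of floats, which is the kind of layout that caches and SIMD units tend to prefer. Whether it is actually faster on a given machine is something you would have to measure.

```c
#include <stddef.h>

/* Array-of-structs: the fields of each particle are interleaved in
   memory, so summing the masses skips over x, y and z every time. */
struct particle {
    float x, y, z, mass;
};

float total_mass_aos(const struct particle *p, size_t n) {
    float total = 0.0f;
    for (size_t i = 0; i < n; i++) {
        total += p[i].mass;
    }
    return total;
}

/* Struct-of-arrays: each field lives in its own contiguous array, so
   summing the masses is a plain linear walk over floats. */
struct particles_soa {
    const float *x;
    const float *y;
    const float *z;
    const float *mass;
    size_t count;
};

float total_mass_soa(const struct particles_soa *p) {
    float total = 0.0f;
    for (size_t i = 0; i < p->count; i++) {
        total += p->mass[i];
    }
    return total;
}
```

Nothing here is specific to C. The same layout can be written with plain arrays in Java or most other languages, which is the point being made above.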

References

Most of the points discussed here have been scraped from other sources and presentations, which the interested reader is encouraged to look up. These include the many optimization talks by Andrei Alexandrescu as well as the highly informative CppCon keynote by Emery Berger. Despite its venue the latter is fully programming language agnostic so you can watch it even if you are the sort of person who breaks out in hives whenever C++ is mentioned.

Monday, February 1, 2021

Using a gamepad to control a painting application

One of the hardest things in drawing and painting is controlling the individual strokes. Not only do you have to control the location but also the pressure, tilt and rotation of the pen or brush. This means mastering five or six degrees of freedom at the same time with extreme precision. Doing it well requires years of practice. Modern painting applications and tools like drawing tablets emulate this experience quite well, but the beauty of computers is that we can do even more.

As an experiment I wrote a test application that separates tilt and pressure from drawing. In this approach one hand draws the shape as before, but tilt and pressure are controlled with the other hand by using a regular gamepad controller. Here's what it looks like (in case your aggregator strips embedded YouTube players, here is the direct link).

The idea itself is not new, there are discussions about it in e.g. Krita's web forum. Nonetheless it was a fun weekend hack (creating the video actually took longer than writing the app). After playing around with the app for a while this seems like a useful feature for an actual painting application. It is not super ergonomic though, but that may just be an issue with the Logitech gamepad I had. Something like the Wii remote would probably feel smoother, but I don't have one to test.

The code is available here for those who want to try it out.

Wednesday, January 6, 2021

Quick review of Lenovo Yoga 9i laptop

Some time ago I pondered on getting a new laptop. Eventually I bought a Lenovo Yoga 9i, which ticked pretty much all the boxes. I also considered a Dell 9310 but decided against it for two reasons. Firstly, several reviews say that the keyboard feels bad with too shallow a movement. Secondly, Dell's web site for Finland does not actually sell computers to individuals, only corporations, and their retailers did not have any of the new models available.

The hardware

It's really nice. Almost everything you need is there, such as USB A and C, touch screen, pen, 16GB of RAM, Tiger Lake CPU, Xe graphics and so on. The only real missing things are a microSD card slot and an HDMI port. The trackpad is nice, with multitouch working flawlessly in e.g. Firefox. You can only do right click by clicking on the right edge rather than clicking with two fingers, but that's probably a software limitation (of Windows?). The all glass trackpad surface is a bit of a fingerprint magnet, though.

There are two choices for the screen, either FullHD or 4k. I took the latter because once you have experienced retina, you'll never go back. This reduces battery life, but even the 4k version gets 4-8 hours of battery life, which is more than I need. The screen itself is really, really nice apart from the fact that it is extremely glossy, almost like a mirror. Colors are very vibrant (to the point of being almost too saturated in some videos) and bright. Merely looking at the OS desktop background and app icons feels nice because the image is so sharp and calm. As a negative point just looking at YouTube videos makes the fan spin up.

The touchscreen and pen work as expected, though pen input is broken in Windows Krita by default. You need to change the input protocol from the default to the other option (whose actual name I don't remember).

When it comes to laptop keyboards, I'm very picky. I really like the 2015-era MBPro and Thinkpad keyboards. This keyboard is not either of those two but it is very good. The key travel is slightly shallower and the resistance is crisper. It feels pleasant to type on.

Linux support

This is ... not good. Fedora live USBs do not even boot, and an Ubuntu 20.10 live USB has a lot of broken stuff, but surprisingly wifi works nicely. Things that are broken include:
  • Touchscreen
  • 3D acceleration (it uses LLVM softpipe instead)
  • Trackpad
  • Pen
The trackpad bug is strange. Clicking works, but motion does not unless you push it with a very, very, very specific amount of pressure that is incredibly close to the strength needed to activate the click. Once click activates, motion breaks again. In practice it is unusable.

All of these are probably due to the bleeding-edgeness of the hardware and will probably be fixed in the future. For the time being, though, it is not really usable as a Linux laptop.

In conclusion

This is the best laptop I have ever owned. It may even be the best one I have ever used.

Monday, December 28, 2020

Some things a potential Git replacement probably needs to provide

Recently there has been renewed interest in revision control systems. This is great as improvements to tools are always welcome. Git is, sadly, extremely entrenched and trying to replace it will be an uphill battle. This is not due to technical but social issues. What this means is that approaches like "basically Git, but with a mathematically proven model for X" are not going to fly. While having this extra feature is great in theory, in practice it is not sufficient. The sheer amount of work needed to switch a revision control system and the ongoing burden of using a niche, nonstandard system is just too much. People will keep using their existing system.

What would it take, then, to create a system that is compelling enough to make the change? In cases like these you typically need a "big design thing" that makes the new system 10× better in some way and which the old system can not do. Alternatively the new system needs to have many small things that are better but then the total improvement needs to be something like 20× because the human brain perceives things nonlinearly. I have no idea what this "major feature" would be, but below is a list of random things that a potential replacement system should probably handle.

Better server integration

One of Git's design principles was that everyone should have all the history all the time so that every checkout is fully independent. This is a good feature to have and one that should be supported by any replacement system. However it is not how revision control systems are commonly used. 99% of the time developers are working on some sort of a centralised server, be it GitLab, GitHub or a corporation's internal revision control server. The user interface should be designed so that this common case is as smooth as possible.

As an example let's look at keeping a feature branch up to date. In Git you have to rebase your branch and then force push it. If your branch had any changes you don't have in your current checkout (because they were done on a different OS, for example), they are now gone. In practice you can't have more than one person working on a feature branch because of this (unless you use merges, which you should not do). This should be more reliable. The system should store, somehow, that a rebase has happened and offer to fix out-of-date checkouts automatically. Once the feature branch gets to trunk, it is ok to throw this information away. But not before that.

Another thing one could do is that repository maintainers could mandate things like "pull requests must not contain merges from trunk to the feature branch" and the system would then automatically prohibit these. Telling people to remove merges from their pull requests and to use rebase instead is something I have to do over and over again. It would be nice to be able to prohibit the creation of said merges rather than manually detecting and fixing things afterwards.

Keep rebasing as a first class feature

One of the reasons Git won was that it embraced rebasing. Competing systems like Bzr and Mercurial did not and advocated merges instead. It turns out that people really want their linear history and that rebasing is a great way to achieve that. It also helps code review as fixes can be done in the original commits rather than new commits afterwards. The counterargument to this is that rebasing loses history. This is true, but on the other hand it also means that your commit history does not get littered with messages like "Some more typo fixes #3, lol." In practice people seem to strongly prefer losing some history to keeping that kind of clutter.

Make it scalable

Git does not scale. The fact that Git-LFS exists is proof enough. Git only scales in the original, narrow design spec of "must be scalable for a process that only deals in plain text source files where the main collaboration method is sending patches over email" and even then it does not do it particularly well. If you try to do anything else, Git just falls over. This is one of the main reasons why game developers and the like use other revision control systems. The final art assets for a single level in a modern game can be many, many times bigger than the entire development history of the Linux kernel.

A replacement system should handle huge repos like these effortlessly. By default a checkout should only download those files that are needed, not the entire development history. If you need to do something like bisection, then files missing from your local cache (and only those) should be downloaded transparently during checkout operations. There should be a command to download the entire history, of course, but it should not be done by default.

Further, it should be possible to do only partial checkouts. People working on low level code should be able to get just their bits and not have to download hundreds of gigs of textures and videos they don't need to do their work.

Support file locking

This is the one feature all coders hate: the ability to lock a file in trunk so that no-one else can edit it. It is disruptive, annoying and just plain wrong. It is also necessary. Practice has shown that artists at large either can not or will not use revision control systems. There are many studios where the revision control system for artists is a shared network drive, with file names like character_model_v3_final_realfinal_approved.mdl. It "works for them" and trying to mandate a more process heavy revision control system can easily lead to an open revolt.

Converting these people means providing them with a better work flow. Something like this:
  1. They open their proprietary tool, be it Photoshop, Final Cut Pro or whatever.
  2. Click on GUI item to open a new resource.
  3. A window pops up where they can browse the files directly from the server as if they were local.
  4. They open a file.
  5. They edit it.
  6. They save it. Changes go directly in trunk.
  7. They close the file.
There might be a review step as well, but it should be automatic. Merge requests should be filed and kept up to date without the need to create a branch or to even know that such a thing exists. Anything else will not work. Specifically doing any sort of conflict resolution does not work, even if it were the "right" thing to do. The only way around this (that we know of) is to provide file locking. Obviously it should be possible to limit locking to binary files only.

Provide all functionality via a C API

The above means that you need to be able to deeply integrate the revision control system with existing artist tools. This means plugins written in native code using a stable plain C API. The system can still be implemented in whatever SuperDuperLanguage you want, but its one true entry point must be a C API. It should be full-featured enough that the official command line client should be implementable using only functions in the public C API.
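To give an idea of the shape such an API could take, here is a toy sketch in C of a file locking interface. Everything in it is hypothetical: the vcs_ prefix, the status codes and the fixed-size in-memory lock table are all invented for illustration. A real system would talk to the server instead of a local array, but the plain C surface with an opaque-ish repo handle and integer status codes is the kind of thing a plugin for an artist tool could bind to.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical limits for this toy in-memory lock table. */
enum { VCS_MAX_LOCKS = 16, VCS_MAX_NAME = 256 };

typedef enum {
    VCS_OK = 0,
    VCS_ERR_LOCKED,    /* someone else holds the lock */
    VCS_ERR_NOT_OWNER, /* caller does not hold a lock on this path */
    VCS_ERR_LIMIT      /* lock table is full or a name is too long */
} vcs_status;

typedef struct {
    char path[VCS_MAX_NAME];
    char owner[VCS_MAX_NAME];
    int in_use;
} vcs_lock;

/* Zero-initialise before use, e.g. vcs_repo repo = {0}; */
typedef struct {
    vcs_lock locks[VCS_MAX_LOCKS];
} vcs_repo;

/* Returns the active lock entry for path, or NULL if it is not locked. */
static vcs_lock *find_lock(vcs_repo *repo, const char *path) {
    for (size_t i = 0; i < VCS_MAX_LOCKS; i++) {
        if (repo->locks[i].in_use && strcmp(repo->locks[i].path, path) == 0) {
            return &repo->locks[i];
        }
    }
    return NULL;
}

/* Locks path for owner. Locking a file you already hold succeeds. */
vcs_status vcs_lock_file(vcs_repo *repo, const char *path, const char *owner) {
    if (strlen(path) >= VCS_MAX_NAME || strlen(owner) >= VCS_MAX_NAME) {
        return VCS_ERR_LIMIT;
    }
    vcs_lock *existing = find_lock(repo, path);
    if (existing != NULL) {
        return strcmp(existing->owner, owner) == 0 ? VCS_OK : VCS_ERR_LOCKED;
    }
    for (size_t i = 0; i < VCS_MAX_LOCKS; i++) {
        vcs_lock *slot = &repo->locks[i];
        if (!slot->in_use) {
            /* Lengths were checked above, so the copies cannot overflow. */
            strcpy(slot->path, path);
            strcpy(slot->owner, owner);
            slot->in_use = 1;
            return VCS_OK;
        }
    }
    return VCS_ERR_LIMIT;
}

/* Releases the lock on path, but only if owner is the one holding it. */
vcs_status vcs_unlock_file(vcs_repo *repo, const char *path, const char *owner) {
    vcs_lock *existing = find_lock(repo, path);
    if (existing == NULL || strcmp(existing->owner, owner) != 0) {
        return VCS_ERR_NOT_OWNER;
    }
    existing->in_use = 0;
    return VCS_OK;
}
```

The design choice worth noting is that nothing in the interface exposes how the system is implemented. Only C types cross the boundary, so the backend could be written in any language as long as it exports these symbols.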

Provide transparent Git support

Even if a project would want to move to something else, the sad truth is that for the time being the majority of contributors only know Git. They don't want to learn a whole new tool just to contribute to the project. Thus the server should serve its data in two different formats: once in its native format and once as a regular Git endpoint. Anyone with a Git client should be able to check out the code and not even know that the actual backend is not Git. They should be able to even submit merge requests, though they might need to jump through some minor hoops for that. This allows you to do incremental upgrades, which is the only feasible way to get changes like these done.