Wednesday, May 31, 2017

Gee, optimization sure is hard

In a recent Reddit discussion the following piece of code was presented:

for (unsigned c = 0; c < arraySize; ++c) {
    if (data[c] >= 128)
        sum += data[c];
}

This code snippet seems to be optimal but is it? There is a hard to predict branch and a data dependency, both of which can cause slowdowns. To see if this can be implemented faster I created a test repo with a bunch of alternative implementations. Let's see how they stack up.

The implementations

The simplest is the simple loop as described above.

A version using a lookup table has a helper table of 256 entries. Values larger or equal to 128 have identity value and values smaller than 128 have the value zero. The body of the loop then becomes result += lut[buf[i]], which is branch free.

The bit fiddling approach goes through the values one by one and first calculates a mask value which is b >> 7. This is an arithmetic right shift whose outcome is either all zeros or all ones depending whether the value of b is less than 128. Then we add the value to the result by ANDing it with the mask. The core of this loop is result += buf[i] & (((int8_t)buf[i]) >> 7).

The partitioning approach divides the data array with std::partition and then only the part with values we want are added.

The zeroing approach goes over the array and sets all values that are less than 128 to zero. Then it goes over it again and adds all entries unconditionally. This goes over the data twice (and mutates it) but there are no data dependencies.

The bucket approach has an array of 256 entries. The loop goes over the data and increments the count for each input byte like this: ++counts[buf[i]]. The result is then obtained by going over entries 128-255 and evaluating result += i*counts[i].

So which one is the fastest?

Well that depends. A lot. On pretty much everything.

On x86_64 lookup table and bucket are the fastest, but only when using -O2. And GCC. For some reason Clang can't optimize the bucket version and it is consistently slower. On O3 bit fiddling becomes faster.

On a Raspberry Pi and -O2, the fastest are bit fiddling and bucket. But when using -O3, the simple loop is the fastest by a noticeable margin.

There does not seem to be a version that is consistently the fastest. However the versions that mutate their input array (partition and zeroing) are consistently the slowest.

What should I do to get teh fastest codez for my program?

This is a difficult question and depends on many things. The most straightforward solution is to write the simplest code you possibly can. It is easiest to understand and modify and is also the simplest for the compilers to optimize for the largest number of targets.

After that find out if you have a bottleneck. Measure. Measure again. Write an optimized version if you really need it. Measure. Measure with many different compilers and platforms. Keep measuring and fixing issues until the project reaches end of life.

Monday, May 15, 2017

Emulating the Rust borrow checker with C++ part II: the borrowining

The most perceptive among you might have noticed that the last blog post did not actually do any borrowing, only single owner semantics with moves. Let's fix that. But first a caveat.

As far as I can tell, it is not possible to emulate Rust's borrow checker fully in compile time with C++. You can get pretty close (for some features at least) but there is some runtime overhead and violations are detected only at runtime, not compile time. See the end of this post for a description why that is. Still, it's better than nothing.

At the core is a resource whose ownership we want to track. We either want to allow either one accessor that can alter the object or many read-only accessors. We put the item inside a class that looks like this:

template<typename T>
class Owner final {
 private:
    T i;
    int roBorrows = 0;
    bool rwBorrow = false;
<rest of class omitted for brevity>

The holder object does not give out pointers or references to the underlying object. Instead you can only request either a read only or read-write (or mutable) reference to it. The read only function looks like this (the rw version is almost identical):

template<typename T>
RoBorrow<T> Owner<T>::borrowReadOnly() {
    if(rwBorrow) {
        throw std::runtime_error("Tried to create read only borrow when a rw borrow exists.");
    } else {
        return ::RoBorrow<T>(this); // increments borrow count
    }
}

Creating read only borrows only succeeds if there are no rw borrows. This code throws an exception but it could also call std::terminate or something else. A rw borrow would fail in a similar way if there are any ro borrows.

The interesting piece is the implementation of the proxy object RoBorrow<T>. It is the kind of move-only type that was described in part I. When it goes out of scope its destructor decrements the owner's borrow count:

~RoBorrow() { if (o) { o->decrementRoBorrows();} }

The magic lies in the conversion operator that is slightly different than the original version:

operator const T&() const { return owner->i; }

The conversion operator only gives out a const reference to the underlying object, which means only const operations can be invoked on it. Obviously this does not work for e.g. raw file descriptors because a "const int" does not protect the stat of the kernel fd object.  In order to call non-const functions you first need to get an RwBorrow whose conversion operator gives a constless reference, and Owner will only provide one if there are no other borrows outstanding.

When using this mechanism the following code fails (at runtime):

 auto b1 = owner.borrowReadOnly();
 auto b2 = owner.borrowReadWrite();
 
But this code works:

{
    auto b1 = owner.borrowReadOnly();
    auto b2 = owner.borrowReadOnly();
}
{
    auto b3 = owner.borrowReadWrite();
}
{
    auto b4 = owner.borrowReadOnly();
}

because the borrows go out of scope and are destroyed decrementing borrow counts to zero.

Why can't this be done at compile time?

I'm not a TMP specialist so maybe it can but as far as I know it is not possible. I'd love to be proven wrong on this one. The issue is due to limitations of constexpr which can be distilled into this dummy function:

template<typename T>
void constexpr foo(boolean b) {
    T x;
    constexpr int refcount = 1;
    if constexpr(b) {
        // do something
        --refcount;
    } else {
        // do something else
        --refcount;
    }
    static_assert(refcount == 0);
}

The static assert is obviously true no matter which code path is executed but the compiler can't prove that. First of all the if clause does not compile because its argument is not a compile-time constant. Second of all, the refcount decrements do not compile because constexpr variables can not be mutated during evaluation. You can create a new variable with the value x-1 but it would go out of scope when the subblocks end and there is no phi-node -like concept to get the value out to the outer scope. Finally, destructors can not be constexpr so even if we could mutate the count variable, it would not be done automatically (even though in theory the compiler has all the information it needs to do that).

Saturday, May 13, 2017

Emulating the Rust borrow checker with C++ move-only types

Perhaps the greatest thing to come out of C++11 is the notion of move semantics. Originally it was designed to make it efficient to return big objects like matrices from functions. In move operations the destination "steals the guts" of the source object rather than making a copy of it. Usually this means taking hold of some pointer to a reserved memory block and assigning the source's pointer to nullptr.

This mechanism is not reserved to pointers and can be used for any data type. As an example let's write a move-only integer class. A typical use for this would be to store file descriptors to ensure that they are closed. The code is straightforward, let's go through it line by line:

class MoveOnlyInt final {
 private:
    int i;

    void release() { /* call to release function here */ }

This is basic boilerplate. The release function would typically be close, but can be anything.

 public:
    explicit MoveOnlyInt(int i) : i(i) {}
    ~MoveOnlyInt() { release(); }

We can construct this from a regular integer and destroying this object means calling the release function. Simple. Now to the actual meat and bones.

    MoveOnlyInt() = delete;
    MoveOnlyInt(const MoveOnlyInt &) = delete;
    MoveOnlyInt& operator=(const MoveOnlyInt &) = delete;

These declarations specify that this object can not be copied (which is what C++ does by default). Thus creating two objects that hold the same underlying integer is not possible unless you explicitly create two objects with the same bare integer.

    MoveOnlyInt(MoveOnlyInt &&other) { i = other.i; other.i = -1; }
    MoveOnlyInt& operator=(MoveOnlyInt &&other) { release(); i = other.i; other.i = -1; return *this; }

Here we define the move operators. A move means releasing the currently held integer (if it exists) and grabbing the source's integer instead. This is all we need to have a move only type. The last thing to add is a small helper function.

    operator int() const { return i; }
};

This means that this object is convertible to a plain integer. In practice this means that if you have a function like this:

void do_something(int x);

then you can call it like this:

MoveOnlyInt x(42);
do_something(x);

You could achieve the same by creating a get() function that returns the underlying integer but this is nicer and more usable. This is not quite as useful for integers but extremely nice when using plain C libraries with opaque structs. Then your move only wrapper class can be used directly when calling plain C functions.

What does this allow us to do?

All sorts of things, the object behaves much in the same way as Rust's movable types (but is not 100% identical). You can for example return it from a function, which transfers ownership in a compiler enforced way:

MoveOnlyInt returnObject() {
    MoveOnlyInt retval(42);
    return retval;
}

If you try to pass an object as an argument like this:

int byValue(MoveOnlyInt mo);
...
byValue(mo);

a regular class would get copied but for a move-only type you get a compiler error:

./prog.cpp:39:13: error: call to deleted constructor of 'MoveOnlyInt'
    byValue(mo);
            ^~
../prog.cpp:13:5: note: 'MoveOnlyInt' has been explicitly marked deleted here
    MoveOnlyInt(const MoveOnlyInt &) = delete;
    ^
../prog.cpp:22:28: note: passing argument to parameter here
int byValue(MoveOnlyInt mo) {

Instead you have explicitly tell the compiler to move the object:

byValue(std::move(mo));

Some of you might have spotted a potential issue. Since a MoveOnly object can be converted to an int and there is a constructor that takes an int, that could create two objects for the same underlying integer. Like this:

MoveOnlyInt mo(42);
MoveOnlyInt other(mo);

The compiler output looks like the following:

../prog.cpp:37:14: error: call to deleted constructor of 'MoveOnlyInt'
    MoveOnlyInt other(mo);
             ^     ~~
../prog.cpp:13:5: note: 'MoveOnlyInt' has been explicitly marked deleted here
    MoveOnlyInt(const MoveOnlyInt &) = delete;

The compiler prevents creating invalid objects.

The wrapper object has zero memory overhead compared to a plain int and code generation changes are minimal. The interested reader is encouraged to play around with the compiler explorer to learn what those are.

Is this just as good as native Rust then?

No. Rust's borrow checker does more and is stricter. For example in C++ you can use a moved-from object, which may yield a null pointer dereference if your underlying type was a pointer. You won't get a use-after-free error, though. On the other hand this class is 18 lines of code that can be applied to any existing C++ code base immediately whereas Rust is a whole new programming language, framework and ecosystem.

Anyone is also free to do the wrong thing and take a copy of the integer value without telling anyone but this issue remains in every other language as well, even in unsafe Rust.


Tuesday, May 9, 2017

A list of common Meson antipatterns

In the vein of a blog post I wrote years ago, here are some Meson antipatterns that should be avoided in build definitions.

Adding -g, -Wall etc manually

The basic design principle of Meson is that common operations and features should work out of the box without any user intervention. That means that debug information should be enabled for debug builds (which are the default). Similarly basic warning flags should always be on, because anyone coding without -Wall is not doing software development, but practicing astrology.

Many people seem to be burned by existing build systems and add these flags manually as the first thing they do. There is no need, we do it for you. Similarly instead of manually adding flags such as -Wextra, look into changing the option warning_level instead.

Adding these flags manually is also bad for portability because there are compilers that do not support them.

Building full paths to files in the source tree

This pattern looks something like this:

myfiles = files('@0@/@1@'.format(meson.current_source_dir(), 'file.c', [other files here])

This makes sense if you are used to the Make's "everything is a string" semantics. Meson is not like that, it tries to take care of mundane things for you. The simpler version of the above is this:

myfiles = files('file.c', [other files here])

This will verify the existance of the file when called and will store the full location. The output can be used in any other subdirectory in any target and Meson will evaluate the correct path for you. There may be some corners where file objects are not accepted where they should. Please file bugs if you encounter one of these.

The former syntax works currently but is already an error in current Git master.

Using target_machine in cross compilation

Meson supports the full canadian cross environment. It is very powerful but unfortunately quite confusing until you wrap your brain around it. Many people will choose to use target_machine instead of host_machine for their cross compilation, even though that is incorrect and leads to interesting bugs. Fortunately there is a simple rule of thumb which is correct 99% of the time:
  • build_machine is the desktop or laptop computer you are using to do your development
  • host_machine is the low powered IoT device, ARM board or equivalent that will run your program
Unless you really know what you are doing and are building one of the very few packages that need to care (Binutils, GCC, Clang, some others), ignore target_machine. Pretend it does not even exist and everything will become simpler and more correct.

Manually setting up dependencies between targets

This happens often when defining custom targets or run targets that operate on the output of other custom targets. In Meson outputs carry dependency information with them. If you use the output of one target as an input to another, Meson will automatically set up all the dependencies. A consistent antipattern that can be seen in projects in the wild is going through strange contortions to build a string representing the path of a target output and then adding the dependency object to the target's extra dependency keyword argument. That is only meant for programs that do not take all their inputs as arguments because, for example, they do file globbing internally.

Like with files there may be cases where using the output object directly fails. These are all bugs, please report them.

Monday, May 1, 2017

The Infinity Horse

The early 1900s saw the popularisation of the automobile. Everyone was jumping on the proverbial bandwagon of this new transport technology. It was cool and exciting so people wanted to be part of it, even if they might not have had full technical understanding of the domain.

One of those movers was a company in the business of raising, selling, renting and feeding horses. The managers understood that horses were a dead-end business and that the future was in motorized vehicles. Thus they started putting a lot of money and effort into creating their own car.

Engineers were hired, meetings were had and factories were being prepped for production. After a few months of work the main obstacles were overcome and the plan to build and launch a car brand were looking good. That is when the CEO scheduled a surprise meeting with the chief engineer of the project.

"We had a top management meetin last week and what we came up with was spectacular! We are going to completely revolutionize the way people think about cars!"

"Really," said the surprised engineer.

"Yes! What is the most annoying thing about cars?" he asked and without giving the engineer a chance to reply continued "Running out of fuel. That never happens with horses, so this is really a step backwards."

The engineer felt like pointing out that there is such a thing as a horse that has run out of fuel. It's called a dead horse and the difference between that and a car was that adding fuel to a dead horse does not make it come back alive. But he chose not to mention this issue, but instead wanted to hear this new master plan.

"So what we are going to do - and this is going to be a world first - is to create a car that never runs out of fuel. We shall call it The Infinity Horse. Marketing department is already creating a campaign for this. I have seen some of the sketches and they are just out of this world!"

"Errrm, never runs out of fuel?" the engineer said with a puzzled look on his face.

"Never!"

"That's not really possible, how are you going to do that?"

"That's your job isn't it? Just make it happen, it can't be that hard."

A very unpleasant two weeks followed. The engineer did everything in his power to try to explain that the aim was impossible, which did not work, and trying to suggest alternative solutions, which did not work either. Nothing short of perfection was acceptable. As an example making the gas tank bigger was not an option because while it made the problem occur less often it was not enough.

"The Vision is The Infinity Horse. Unless we have this feature we have nothing," the CEO repeated every time issues were raised.

This drove our engineer to despair. He was being crushed between the unbreakable laws of physics and the immovable opinion of the CEO. Grasping at straws he started going through automotive parts catalogues. There he saw a new product: a warning light that activated when the car was running out of fuel. And then it came to him.

He took the warning light and modified it slightly. Instead of turning on a light it shut down the engine. This would fulfill the requirement. The car would never run out of fuel, because it could be easily proven that there was still gas in the tank. Said fuel could not be used for driving but it was there. The only thing this modification did was to reduce the maximum range making the car worse by every conceivable objective metric.

The CEO loved it. The Infinity Horse had been achieved and it was perfect! If this system caused people to get stranded, well, that's their own fault for not doing due diligence and adding enough fuel to the tank before embarking on their journey.