Wednesday, May 31, 2017

Gee, optimization sure is hard

In a recent Reddit discussion the following piece of code was presented:

for (unsigned c = 0; c < arraySize; ++c) {
    if (data[c] >= 128)
        sum += data[c];
}

This code snippet seems optimal, but is it? It contains a hard-to-predict branch and a data dependency, both of which can cause slowdowns. To see whether it can be implemented faster, I created a test repo with a bunch of alternative implementations. Let's see how they stack up.

The implementations

The first is the simple loop described above.

A version using a lookup table has a helper table of 256 entries. Entries for values 128 and above map to themselves (the identity), while entries for values below 128 are zero. The body of the loop then becomes result += lut[buf[i]], which is branch free.
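
As a rough sketch (the names lut_sum, buf and bufsize are mine for illustration, not necessarily what the test repo uses), it might look like this:

#include <cstddef>
#include <cstdint>

uint64_t lut_sum(const uint8_t *buf, size_t bufsize) {
    uint8_t lut[256];
    for (int i = 0; i < 256; ++i)
        lut[i] = (i >= 128) ? (uint8_t)i : 0; // identity above the threshold, zero below
    uint64_t result = 0;
    for (size_t i = 0; i < bufsize; ++i)
        result += lut[buf[i]]; // branch free accumulation
    return result;
}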

The bit fiddling approach goes through the values one by one and first calculates a mask value, b >> 7. This is an arithmetic right shift whose outcome is all ones when b, reinterpreted as a signed byte, is negative (that is, when its unsigned value is 128 or more) and all zeros otherwise. The value is then ANDed with the mask before being added to the result, so values below 128 contribute nothing. The core of this loop is result += buf[i] & (((int8_t)buf[i]) >> 7).
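
A minimal sketch, again with illustrative names:

#include <cstddef>
#include <cstdint>

uint64_t bitfiddle_sum(const uint8_t *buf, size_t bufsize) {
    uint64_t result = 0;
    for (size_t i = 0; i < bufsize; ++i) {
        // All bits one when the signed byte is negative (unsigned value >= 128),
        // all bits zero otherwise.
        int mask = ((int8_t)buf[i]) >> 7;
        result += buf[i] & mask;
    }
    return result;
}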

The partitioning approach divides the data array with std::partition, after which only the part containing the values we want is summed.
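
A possible sketch, assuming the data lives in a std::vector<uint8_t>:

#include <algorithm>
#include <cstdint>
#include <vector>

uint64_t partition_sum(std::vector<uint8_t> &buf) {
    // Move values < 128 to the front; everything >= 128 ends up in [mid, end).
    // Note that this reorders the input array.
    auto mid = std::partition(buf.begin(), buf.end(),
                              [](uint8_t b) { return b < 128; });
    uint64_t result = 0;
    for (auto it = mid; it != buf.end(); ++it)
        result += *it;
    return result;
}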

The zeroing approach goes over the array and sets every value less than 128 to zero. Then it goes over the array again and adds all entries unconditionally. This traverses the data twice (and mutates it), but there are no data dependencies.
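
A sketch of what this could look like:

#include <cstddef>
#include <cstdint>

uint64_t zeroing_sum(uint8_t *buf, size_t bufsize) {
    for (size_t i = 0; i < bufsize; ++i)
        if (buf[i] < 128)
            buf[i] = 0; // first pass: clear the values we do not want
    uint64_t result = 0;
    for (size_t i = 0; i < bufsize; ++i)
        result += buf[i]; // second pass: add everything unconditionally
    return result;
}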

The bucket approach has an array of 256 entries. The loop goes over the data and increments the count for each input byte like this: ++counts[buf[i]]. The result is then obtained by going over entries 128-255 and evaluating result += i*counts[i].
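
A sketch, with the same assumed names as above:

#include <cstddef>
#include <cstdint>

uint64_t bucket_sum(const uint8_t *buf, size_t bufsize) {
    uint64_t counts[256] = {0};
    for (size_t i = 0; i < bufsize; ++i)
        ++counts[buf[i]]; // histogram of byte values
    uint64_t result = 0;
    for (int i = 128; i < 256; ++i)
        result += (uint64_t)i * counts[i]; // only the entries we care about
    return result;
}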

So which one is the fastest?

Well that depends. A lot. On pretty much everything.

On x86_64 the lookup table and bucket versions are the fastest, but only when using -O2. And GCC. For some reason Clang can't optimize the bucket version, and it is consistently slower. With -O3, bit fiddling becomes the fastest.

On a Raspberry Pi with -O2, the fastest are bit fiddling and bucket. But when using -O3, the simple loop is the fastest by a noticeable margin.

There does not seem to be a version that is consistently the fastest. However, the versions that mutate their input array (partitioning and zeroing) are consistently the slowest.

What should I do to get teh fastest codez for my program?

This is a difficult question, and the answer depends on many things. The most straightforward solution is to write the simplest code you possibly can. It is the easiest to understand and modify, and it is also the easiest for compilers to optimize well across the largest number of targets.

After that, find out whether you have a bottleneck. Measure. Measure again. Write an optimized version if you really need it. Measure. Measure with many different compilers and platforms. Keep measuring and fixing issues until the project reaches its end of life.
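
As a minimal illustration of the measuring itself (the helper elapsed_ms is hypothetical, not from the repo), something like this gives a first rough number, though a proper benchmark framework handles warmup and repetitions for you:

#include <chrono>

template <typename Func>
double elapsed_ms(Func f) {
    auto start = std::chrono::steady_clock::now();
    f(); // e.g. elapsed_ms([&]{ result = lut_sum(buf, bufsize); })
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}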

2 comments:

  1. It depends on the data too. If the data contained only the value 255, for example, branch prediction would make the simple loop very fast.

  2. Simplicity is rarely - if ever - a bottleneck. Reliability and such things are initially more important, and if parts of your program aren't performing well, maybe you missed a more obvious gotcha. Before attempting to hand-optimize my code I'd rather reconsider the workload, because oftentimes you're just missing the trees for the forest and doing far more work at once than is actually necessary.
