Thursday, June 8, 2017

Optimizing code: even the simplest things are unbelievably complex

In the previous post we looked at optimizing this simple piece of code, which sums all the bytes in buf whose value is at least 128.

uint64_t result = 0;
for(size_t i=0; i<bufsize; i++) {
  if(buf[i] >= 128) {
    result += buf[i];
  }
}

We shall now do more measurements with real-world hardware and compilers. The algorithms we use are the following:

  • simple: the above loop as-is
  • lookup: create a lookup table where entries less than 128 have value zero and the rest have the same value as the index
  • bit fiddling: convert the if into a branchless bitmask operation
  • partition: run std::partition on the data and only add the first half
  • zeroing: go over the data and set values not matching to zero, then add all
  • bucket: keep an array of 256 entries (one per possible byte value) and count the number of times each value appears
  • multiply: convert if to a multiplication by 0 or 1, then add
  • parallel add: add several chars in a single packed 64-bit addition
Those interested in the actual implementations should look them up in the repo.
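To give a flavor, here are minimal sketches of the lookup and bit fiddling variants, reconstructed from the descriptions above; the actual code in the repo may differ in details.

#include <cstdint>
#include <cstddef>

// Lookup: entries below 128 map to zero, the rest map to their own value.
uint64_t sum_lookup(const uint8_t *buf, size_t bufsize) {
  uint8_t lut[256];
  for(size_t v=0; v<256; v++) {
    lut[v] = v < 128 ? 0 : (uint8_t)v;
  }
  uint64_t result = 0;
  for(size_t i=0; i<bufsize; i++) {
    result += lut[buf[i]];
  }
  return result;
}

// Bit fiddling: derive an all-ones or all-zeros mask from the top bit
// instead of branching.
uint64_t sum_bitmask(const uint8_t *buf, size_t bufsize) {
  uint64_t result = 0;
  for(size_t i=0; i<bufsize; i++) {
    uint8_t mask = (uint8_t)(-(buf[i] >> 7)); // 0xFF if buf[i] >= 128, else 0x00
    result += buf[i] & mask;
  }
  return result;
}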

The hardware used is the following:

  • Raspberry Pi, Raspbian, GCC 4.9.2, Clang 3.5.0
  • Ubuntu zesty, GCC 6.3.0, Clang 4.0.0
  • MacBook Pro i5, Xcode 8
  • Windows 10, Visual Studio 2017, run in VirtualBox
The test suite runs all available compilers with a selection of build types (such as release and debugoptimized) and CPU feature flags (SSE, AVX, Neon and so on), and measures the times taken.

The results


Let's start by looking at the simplest build setup.

This seems quite reasonable. Parallel addition is the fastest, most of the others are roughly equal, and the two algorithms that reorder the input array are the slowest. For comparison, the Raspberry Pi looks like this:

Everything is much flatter, as one would expect. Since everything is going smoothly, let's look at the first measurement again, except this time we sort the input data before evaluating. One would expect the simple loop to become faster because the branch predictor has an easier task, partitioning to become faster, and nothing to become noticeably slower.

Well ... ummm ... one out of three ain't bad, I guess. At this point I should probably confess that I don't have a proper grasp on why these things are happening, so any speculation that follows might be completely wrong. The slowdown of bucket is the easier of the two to explain: since the input is sorted, consecutive iterations of the loop write to the same memory address, which leads to contention. When the data was random, each iteration wrote to a random location, which led to fewer collisions.
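For reference, here is a sketch of the bucket approach, assuming one counter per possible byte value (again a reconstruction rather than the repo's exact code); with sorted input the increment keeps landing on the same counter:

#include <cstdint>
#include <cstddef>

uint64_t sum_bucket(const uint8_t *buf, size_t bufsize) {
  uint64_t counts[256] = {0};
  for(size_t i=0; i<bufsize; i++) {
    counts[buf[i]]++; // sorted input hits the same counter over and over
  }
  uint64_t result = 0;
  for(size_t v=128; v<256; v++) {
    result += v * counts[v];
  }
  return result;
}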

The reason the simple loop does not get faster may be that both branches of the if clause get evaluated in any case, whether by the processor or because the compiler emits branchless code, so better branch prediction does not matter.

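As an illustration (a reconstruction, not code from the repo), a ternary like the one below can be compiled to a conditional move or even vectorized outright, so both alternatives are computed and one is selected without branching:

#include <cstdint>
#include <cstddef>

uint64_t sum_ternary(const uint8_t *buf, size_t bufsize) {
  uint64_t result = 0;
  for(size_t i=0; i<bufsize; i++) {
    // May compile to a conditional move (cmov on x86), in which case
    // the branch predictor never enters the picture.
    result += buf[i] >= 128 ? buf[i] : 0;
  }
  return result;
}
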
Visual Studio, on the other hand, does this: bucket is slower for sorted data as above, but the simple loop is an order of magnitude slower on unsorted data. Ideas on what could be the cause of that are welcome.

What is the fastest combination?

The fastest combination for each hardware platform is the following:
  • Ubuntu: bit fiddle, g++, release build, -msse, unsorted
  • Raspi: bit fiddle, g++, release build, -mfpu=neon, sorted
  • OSX: simple loop, Clang++, debugoptimized build, -msse4.2, sorted
  • VS2017: lookup table, debugoptimized build, unsorted
This is basically random. No single algorithm is consistently the fastest, and every one of them is noticeably slower than the others under some circumstances. Even weirder, things that you would expect to be straightforwardly true are not. Here are some things to scratch your head over:
  • AVX instructions are never the fastest, and on an i7 the fastest is plain old SSE (for the record MMX was not tested)
  • With Clang, enabling Neon instructions makes everything a lot slower
  • On the Raspberry Pi doing a read only table lookup using Neon is slower than with regular instructions
  • On an i7 multiplication is sometimes faster than arithmetic shifting (see the sketch after this list)
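To make that last point concrete, here is a sketch of the multiply variant, which replaces the branch with a 0-or-1 factor (a reconstruction from the description above, not necessarily the repo's exact code):

#include <cstdint>
#include <cstddef>

uint64_t sum_multiply(const uint8_t *buf, size_t bufsize) {
  uint64_t result = 0;
  for(size_t i=0; i<bufsize; i++) {
    // buf[i] >> 7 is 1 for values >= 128 and 0 otherwise.
    result += (uint64_t)buf[i] * (buf[i] >> 7);
  }
  return result;
}

Whether this beats the shift-and-mask form depends on the compiler and the CPU, which is presumably part of why the results vary so much.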

6 comments:

  1. Another possible implementation - bucketing the sum:

    uint64_t result[2];
    result[0] = result[1] = 0;
    for(size_t i=0; i<bufsize; i++) {
      result[buf[i]/128]++;
    }
    return result[1];

    1. That counts how many items have a value over 127, whereas it should calculate the sum of those elements. The loop body should probably be something like this:

      result[buf[i]/128] += buf[i];

  2. debugoptimized build <-- Why not /O2 ?

    1. That's what it does. See https://github.com/mesonbuild/meson/blob/master/mesonbuild/compilers.py#L114

  3. I'm looking at parallel_add_lookup. You take a pointer to uint8_t, cast it to a pointer to uint64_t and then dereference that pointer. Isn't that undefined behavior?

    See https://blog.regehr.org/archives/1307, in particular under Chunking Optimizations Are Broken.

    1. It could be. I don't know enough about the specifications to say anything conclusive. That is why we test that the output is correct.

      The blog post says that char types (which uint8_t may or may not be) are allowed to be cast to other types. The char pointer _may_ have come from a uint64_t, so converting it back may be legal. But, as said, I really don't know for sure.
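      One way to sidestep the question entirely would be to do the chunk loads with memcpy, which is well defined and which compilers typically turn into a single load anyway. A sketch (not the code currently in the repo):

      #include <cstdint>
      #include <cstddef>
      #include <cstring>

      // Load a 64-bit chunk from a byte buffer without type punning.
      uint64_t load_chunk(const uint8_t *p) {
        uint64_t chunk;
        memcpy(&chunk, p, sizeof(chunk)); // well defined, compiles to one load
        return chunk;
      }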
