// Sum every byte whose value is at least 128.
uint64_t result = 0;
for(size_t i=0; i<bufsize; i++) {
    if(buf[i] >= 128) {
        result += buf[i];
    }
}
We shall now do more measurements with real-world hardware and compilers. The algorithms we use are the following (rough sketches of a few of them follow the list):
- simple: the above loop as-is
- lookup: create a lookup table where entries less than 128 have value zero and the rest have the same value as the index
- bit fiddling: convert the if into a branchless bitmask operation
- partition: run std::partition on the data and add only the group whose elements satisfy the condition
- zeroing: go over the data and set non-matching values to zero, then add everything
- bucket: keep an array of 256 entries and count the number of times each value appears
- multiply: convert the if into a multiplication by 0 or 1, then add
- parallel add: add several chars in a single packed 64-bit addition
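As rough sketches (assuming buf is an array of uint8_t and bufsize its length; the actual benchmark code may differ in details), the lookup, bit fiddling and multiply variants could look like this:

// lookup: values below 128 map to zero, the rest map to themselves.
uint8_t lut[256];
for(int v=0; v<256; v++)
    lut[v] = v < 128 ? 0 : (uint8_t)v;
uint64_t result = 0;
for(size_t i=0; i<bufsize; i++)
    result += lut[buf[i]];

// bit fiddling: a mask that is all ones when buf[i] >= 128, else zero.
uint64_t result2 = 0;
for(size_t i=0; i<bufsize; i++) {
    uint64_t mask = -(uint64_t)(buf[i] >> 7);
    result2 += buf[i] & mask;
}

// multiply: buf[i] >> 7 is 1 when buf[i] >= 128, otherwise 0.
uint64_t result3 = 0;
for(size_t i=0; i<bufsize; i++)
    result3 += buf[i] * (uint64_t)(buf[i] >> 7);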
The hardware used is the following:
- Raspberry Pi, Raspbian, GCC 4.9.2, Clang 3.5.0
- Ubuntu zesty, GCC 6.3.0, Clang 4.0.0
- MacBook Pro i5, XCode 8
- Windows 10, Visual Studio 2017, run in VirtualBox
The results
Let's start by looking at the simplest build setup.
This seems quite reasonable. Parallel addition is the fastest, the others are roughly equally fast, and the two algorithms that reorder the input array are the slowest. For comparison, the Raspberry Pi looks like this:
Well ... ummm ... one out of three ain't bad, I guess. At this point I should probably confess that I don't have a proper grasp on why these things are happening, so any speculation that follows might be completely wrong. The reason bucket slows down is the easier of the two to explain. Since the input is sorted, consecutive iterations of the loop attempt to write to the same memory address, which leads to contention. When the data was random, each iteration wrote to a random location, which led to fewer collisions.
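For reference, the core of the bucket version is roughly the following (a sketch; the benchmarked code may differ in details):

uint64_t counts[256] = {0};
for(size_t i=0; i<bufsize; i++) {
    // With sorted input, long runs of identical bytes make
    // consecutive iterations write to the same counts[] slot.
    counts[buf[i]]++;
}
uint64_t result = 0;
for(size_t v=128; v<256; v++) {
    result += v * counts[v];
}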
The reason the simple loop does not get faster may be that the processor evaluates both sides of the if clause in any case, so better branch prediction does not matter. Visual Studio, on the other hand, does this:
Bucket is slower for sorted input, as above, but the simple loop is an order of magnitude slower on unsorted data. Ideas on what could cause that are welcome.
What is the fastest combination?
The fastest combination for each hardware platform is the following.
- Ubuntu: bit fiddle, g++, release build, -msse, unsorted
- Raspi: bit fiddle, g++, release build, -mfpu=neon, sorted
- OSX: simple loop, Clang++, debugoptimized build, -msse4.2, sorted
- VS2017: lookup, debugoptimized build, unsorted
Some other observations:
- AVX instructions are never the fastest, and on an i7 the fastest is plain old SSE (for the record, MMX was not tested)
- With Clang, enabling Neon instructions makes everything a lot slower
- On the Raspberry Pi doing a read only table lookup using Neon is slower than with regular instructions
- On an i7 multiplication is sometimes faster than arithmetic shifting
Another possible implementation - bucketing the sum:
uint64_t result[2];
result[0] = result[1] = 0;
for(size_t i=0; i<bufsize; i++) {
    result[buf[i]/128]++;
}
return result[1];
That counts how many items have a value over 127; it should calculate the sum of those elements instead. It should probably be something like this:
result[buf[i]/128] += buf[i];
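Put together, a corrected version might look like this (a sketch, assuming the same buf and bufsize as in the post):

uint64_t result[2] = {0, 0};
for(size_t i=0; i<bufsize; i++) {
    // Values 128-255 land in slot 1, the rest in slot 0.
    result[buf[i]/128] += buf[i];
}
return result[1];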
debugoptimized build <-- Why not /O2 ?
That's what it does. See https://github.com/mesonbuild/meson/blob/master/mesonbuild/compilers.py#L114
I'm looking at parallel_add_lookup. You take a pointer to uint8_t, cast it to a pointer to uint64_t and then dereference that pointer. Isn't that undefined behavior?
See https://blog.regehr.org/archives/1307, in particular under Chunking Optimizations Are Broken.
It could be. I don't know enough about the specifications to say anything conclusive. That is why we test that the output is correct.
The blog post says that char types (which uint8_t may or may not be) are allowed to be cast to other types. The char pointer _may_ have come from a uint64_t, so converting it back may be legal. But, as said, I really don't know for sure.
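For what it's worth, a common way to sidestep the aliasing question entirely is to read each chunk through memcpy, which compilers typically optimize into a single load. A minimal sketch (the helper name load64 is made up here, not from the benchmark code):

#include <cstdint>
#include <cstring>

// Read eight bytes as a uint64_t without casting the pointer.
// The memcpy is usually compiled down to one plain load.
inline uint64_t load64(const uint8_t *p) {
    uint64_t v;
    std::memcpy(&v, p, sizeof(v));
    return v;
}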