Test setup
We generate code that calls a few other functions, which call other functions and so on, until we reach the bottommost function. In that function we either report a failure (by throwing an exception or returning an error, depending on the variant) or return success. This gives us two parameters to vary: the depth of the call hierarchy and the percentage of calls that produce an error.
The code itself is simple. The C++ version looks like this:
int func0() {
    int num = random() % 5;
    if(num == 0) {
        return func1();
    }
    if(num == 1) {
        return func2();
    }
    if(num == 2) {
        return func3();
    }
    if(num == 3) {
        return func4();
    }
    return func5();
}
We generate a random number and based on it call a function deeper in the hierarchy. This means that every run walks from the top function to the final function along a random path. We use the C random number generator and seed it with a known value so that the C and C++ versions follow the same call chain, even though they are different binaries. Note that on average each function calls a function whose number is three higher than its own, so if we generate 100 functions, the call stack is on average about 100/3 ≈ 33 entries deep when the error is generated.
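The post does not show the bottommost function, but per the description above it either throws or returns success depending on the configured error rate. A minimal sketch of what the generated C++ leaf might look like; ERROR_RATE, the exception type and the function name are illustrative assumptions, not taken from the actual generator:

#include <cstdlib>
#include <stdexcept>

/* Hypothetical bottommost function of the C++ variant. It fails with a
 * probability of ERROR_RATE percent by throwing, otherwise it returns
 * success. ERROR_RATE, the exception type and the function name are
 * illustrative assumptions, not taken from the actual generator. */
const int ERROR_RATE = 2;

int func_leaf() {
    if(random() % 100 < ERROR_RATE) {
        throw std::runtime_error("simulated error");
    }
    return 0;
}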
The plain C version that uses error objects is almost identical:
int func0(struct Error **error) {
    int num = random() % 5;
    int res;
    if(num == 0) {
        res = func1(error);
    } else if(num == 1) {
        res = func2(error);
    } else if(num == 2) {
        res = func3(error);
    } else if(num == 3) {
        res = func4(error);
    } else {
        res = func5(error);
    }
    /* This is a no-op in this specific code but is here to simulate
     * real code that would check the error and based on that
     * do something. We must have a branch on the error condition,
     * because that is what real world code would have as well.
     */
    if(*error) {
        return -1;
    }
    return res;
}
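The Error struct itself is not shown in the post. A minimal sketch of what the error-object counterpart of the leaf might look like; the struct layout, the heap allocation and the failure rate are illustrative assumptions, the real definition lives in the repository linked below:

#include <stdlib.h>
#include <string.h>

struct Error {
    int code;
    char message[64];
};

/* Hypothetical bottommost function of the plain C variant. On failure it
 * allocates an Error object for the caller to inspect and free, otherwise
 * it leaves *error untouched and returns success. The struct layout, the
 * heap allocation and the 2% failure rate are illustrative assumptions. */
int func_leaf(struct Error **error) {
    if(random() % 100 < 2) {
        struct Error *e = (struct Error *) malloc(sizeof(struct Error));
        e->code = 1;
        strcpy(e->message, "simulated error");
        *error = e;
        return -1;
    }
    return 0;
}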
The code that generates these test programs and runs the measurements can be found in this GitHub repo.
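The post does not reproduce the measurement driver, but a timing loop along these lines would do the job. The iteration count, the use of std::chrono and the stand-in func0 are assumptions for illustration; the real harness in the repository links against the generated call chain instead:

#include <chrono>
#include <cstdio>
#include <cstdlib>

/* Stand-in for the topmost generated function. In the real benchmark this
 * is the generated call chain shown above; here it simply fails about 2%
 * of the time so that the example compiles on its own. */
int func0() {
    if(random() % 100 < 2) {
        throw 1;
    }
    return 0;
}

int main() {
    srandom(42); /* the same known seed is used for the C and C++ binaries */
    long failures = 0;
    auto start = std::chrono::steady_clock::now();
    for(long i = 0; i < 1000000; i++) {
        try {
            func0();
        } catch(...) {
            failures++;
        }
    }
    auto end = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(end - start).count();
    printf("%ld failures in %.2f ms\n", failures, ms);
    return 0;
}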
Results
We do not care so much about how long the executables take to run, only which one of them runs faster. Thus we generate output results that look like this (Ubuntu 16.10, GCC 6.2, Intel i7-3770):
CCeCCCCCCC
CCCCCCccec
eEcCCCeECE
cceEEeECCe
EEEcEEEeCE
EEEEEeccEE
EEEEEEEEEE
EEEEEEEEEE
EEEEEEEEEE
EEEEEEEEEE
In this diagram C means that plain C error codes are faster and E means that exceptions are faster for that particular pair of parameters. For easier reading, all cases where error codes win have been given a red background color. Each row represents a different number of functions. The top row has 50 functions (a call depth of roughly 16) and each row adds 50, so the bottommost row has 500 functions.
The columns represent different error rates. The leftmost column has 0% errors, the next one has 1% and so on. A capital letter means that the runtime difference was bigger than 10%. Each executable was executed five times and the fastest run was taken in each case. Full results look like the following:
GCC 6.2       Clang 3.8.1   Clang 4.0 trunk
CCeCCCCCCC    EecCcCcCcC    ECcECcEeCC
CCCCCCccec    CceEEeEcEC    EEeEECEcCC
eEcCCCeECE    EEEEEECEeE    EEEECECeCC
cceEEeECCe    EcEeEECEEE    EEEEceeEeC
EEEcEEEeCE    EEEeEEEEEe    EEEEEEEEEE
EEEEEeccEE    EEEEEEEEEE    EEEEEEEEEE
EEEEEEEEEE    EEEEEEEEEE    EEEEEEEEEE
EEEEEEEEEE    EEEEEEEEEE    EEEEEEEEEE
EEEEEEEEEE    EEEEEEEEEE    EEEEEEEEEE
EEEEEEEEEE    EEEEEEEEEE    EEEEEEEEEE
Immediately we see that once the stack depth grows above a certain size (here 200/3 ≈ 66), exceptions are always faster. This is not very interesting, because call stacks are usually not this deep (enterprise Java notwithstanding). For lower depths there is a lot of noise, especially for GCC. However, we can see that for small error rates exceptions are sometimes faster, especially with Clang. Because of the noise we redid the measurements several times. This changed the individual measurements but produced the same overall shape: at small error rates exceptions seem to be faster, but as the error rate grows error codes take over.
On a Raspberry Pi the results look like this:
GCC 4.9.2
eeccCCCCCC
EeeeccccCC
Eeeeeecccc
Eeeeeeeccc
EEeeeeeeec
EEEEEeeeee
EEEEEEEEee
EEEEEEEEEE
EEEEEEEEEE
EEEEEEEEEE
Here the results are a lot clearer and more consistent, probably due to the simpler architecture of the ARM processor. Exceptions are faster except for small call depths combined with large error rates. With 50 functions, error objects only become noticeably faster at a 4% error rate.
Finally, let's do the same measurements on an i5-6600K.
GCC 4.8.4      GCC 5.4.0     Clang 3.{4, 6, 8}
ECCCCCCCCCC    eeccCCCCCC    EeeccCCCCC
EecCCCCCCCC    EEEeeeeccc    EEEEEeeeee
EEecCCCCCCC    EEEEEEeeee    EEEEEEEEEE
EEEeecCCCCC    EEEEEEEEEe    EEEEEEEEEE
EEEEeecccCC    EEEEEEEEEE    EEEEEEEEEE
EEEEEeeeccc    EEEEEEEEEE    EEEEEEEEEE
EEEEEEeeeec    EEEEEEEEEE    EEEEEEEEEE
EEEEEEEEeee    EEEEEEEEEE    EEEEEEEEEE
EEEEEEEEEee    EEEEEEEEEE    EEEEEEEEEE
EEEEEEEEEEe    EEEEEEEEEE    EEEEEEEEEE
Here we see that the compiler used can make a massive difference. GCC 4.8.4 favors error codes, but 5.4.0 is the opposite, as are all versions of Clang (which produce identical results). On this platform exceptions seem to be even more performant than on the Pi.
The top left corner seems to be the most interesting part of this entire diagram, as it represents shallow call chains with few errors. This is similar to most real world code used in, for example, XML or JSON parsing. Let's do some more measurements focusing on this area.
i5                          Raspberry Pi
GCC 5.4.0     Clang 3.8     GCC 4.9.2     Clang 3.5.0
cCCCC         eCCCC         eCCCC         eCCCC
ecCCC         EeCCC         ecCCC         ecCCC
ecCCC         EeccC         eccCC         eccCC
eccCC         EEecc         eecCC         eecCC
eeccC         EEEee         eeccC         eeccC
In these measurements the number of functions goes from 10 in the top row to 50 in the bottom row in increments of 10, and the columns go from a 0% error rate on the left to 4% on the right in increments of 1. Here we see that exceptions keep their performance advantage when there are no errors, especially with Clang, but error codes pull ahead as soon as the error rate goes up even a little.
Conclusions
Whenever a discussion on C++ exceptions comes up, there is That Guy who chimes in with "C++ exceptions are slow, don't use them". At this point the discussion usually breaks down into fighting over whether exceptions are fast or not. These discussions are ultimately pointless, because asking whether exceptions are "fast" is the wrong question.
The proper question to ask is "under which circumstances are exceptions faster". As we have seen here, the answer is surprisingly complex. It depends on many factors, including the platform, the compiler, the code size and the error rate. Blanket statements about the performance of X compared to Y are usually oversimplified, misleading and often just plain wrong. Such is the case here as well.
(Incidentally, if someone can explain why the i7 has those weird performance fluctuations, please post in the comments.)
Thanks to Juhani Simola for running the i5 measurements.
Comments
One possible suspect for varying performance is Intel's Turbo Boost, which varies the clock speed based on the load across cores. The i5 I used for the tests doesn't drop its clock speed from maximum boost even under an mprime torture test.
For completeness' sake, I also ran the tests on a Macbook with i5-3230M (same generation as Jussi's i7 but a bit lower rung, questionable cooling) and something that is called "gcc" but has version string Apple LLVM version 8.0.0 (clang-800.0.42.1).
And sure, these are the patterns that showed up:
EEeEeCCCCc
EEEEEEEEEE
EEEEEEEEEE
EEEEEEEEEE
EEEEEEEEEE
EEEEEEEEEE
EEEEEEEEEE
EEEEEEEEEE
EEEEEEEEEE
EEEEEEEEEE
cCCCC
EccCC
ceccC
EccCc
eEEeE
Even weirder, Intel Power Gadget shows some kind of inverse turbo boost: the core is stable at 3 GHz while the build is running and drops to somewhere between 2.8 and 2.9 for the single-core benchmark.
Did you build with -fno-exceptions and -fno-rtti?
Also, is there a difference when using a switch rather than the ifs? IIRC, switch has some optimisations applied, but they're likely invoked by the repeated ifs you're using here too.
Finally, you should probably disable hyper-threading and turbo boost in the BIOS to get consistent results.
The C++ benchmark tests exceptions, so disabling them would be a bit counterproductive. The C benchmark is compiled with a C compiler, which has no exceptions or RTTI by definition.
Clang produces the exact same machine code for equivalent code with an equivalent switch, so I would not expect any performance differences.
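For reference, the switch-based dispatch being discussed would look something like this; a sketch of the equivalent transformation, not code from the repository:

#include <cstdlib>

int func1(); /* the deeper generated functions, as in the post's examples */
int func2();
int func3();
int func4();
int func5();

/* The same dispatch as func0 above, written with a switch instead of a
 * chain of ifs. */
int func0(void) {
    switch(random() % 5) {
    case 0:
        return func1();
    case 1:
        return func2();
    case 2:
        return func3();
    case 3:
        return func4();
    default:
        return func5();
    }
}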
I've made a version of this that passes int pointers for errors instead of a struct, which is idiomatic of C code.
The code can be found here: https://github.com/julian-goldsmith/excspeed, and I've also made a pull request on the Git repo.
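For context, a sketch of the int-pointer convention that comment describes; the names and the failure rate are illustrative assumptions, see the linked excspeed repository for the actual code:

#include <stdlib.h>

/* The callee signals failure by setting *error to a nonzero code and
 * returning -1; no heap allocation is involved. The function name and
 * the 2% failure rate are illustrative assumptions. */
int func_leaf(int *error) {
    if(random() % 100 < 2) {
        *error = 1;
        return -1;
    }
    return 0;
}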