Nibble Stew: Aiming for C++ sorting speed with a plain C API

Saturday, November 11, 2017

Aiming for C++ sorting speed with a plain C API

A well known performance measurement result is that C++ standard library's std::sort function is a lot faster than C library's equivalent qsort. Most people, when they first hear of this, very strongly claim that this is not possible, C is just as fast (if not faster) than C++, this is a measurement error, the sorting algorithms used are different and so on. Then they run the experiment themselves and find that C++ is indeed faster.

The reason for this has nothing to do with how the sorting function is implemented but everything to do with the API. The C API for sorting, as described in the man pages looks like this:

void qsort(void *base, size_t nmemb, size_t size,
int (*compar)(const void *, const void *));

The interesting point here is the last argument, which is a function pointer to a comparison function. Because of this, the sort implementation can not inline this function call but must instead always call the comparator function, which translates to an indirect jump.

In C++ the sort function is a template. Because of this the comparison function can be inlined in the implementation. This turns out to have a massive performance difference (details below). The only way to emulate this in plain C would be to ship the sort function as a preprocessor monster thingy that people could then use in their own code. This leads to awful and hard to maintain code, so this is usually not done. It would be nice to be able to provide a similar fast sort performance with a stable plain C API but due to the way shared libraries work, it's just not possible.

So let's do it by cheating.

If we know that a compiler is available during program execution, we can implement a hybrid solution that achieves this. Basically we emulate how JIT compilers work. All we need is something like this:

sorter opt_func = build_sorter("int",
"(const int a, const int b) { return a < b; }");
(*opt_func)(num_array, NUM_INTS);

Here build_sorter is a function that takes as arguments the type of the item being sorted, and a sorting function as source code. Then the function calls into the external compiler to create an optimised sorting function and returns that via a function pointer that can be called to do the actual sort.

Full source code is available in this Github repo. Performance measurements for 100 million integers are as follows.

C++ is almost twice as fast as plain C. On the fly code generation is only 0.3 seconds slower than C++, which is the amount of time it takes to compile the optimised version. Codegen uses the C++ sorting function internally, so this result is expected.

Thus we find that it is possible to provide C++ level performance with a plain C API, but it requires the ability to generate code at runtime.

Extra bonus

During testing it was discovered that for whatever reason the C++ compiler (I used GCC) is not able to inline free functions as well as lambdas. That is, this declaration:

std::sort(begin, end, sorting_function);

generates slower code than this one:

std::sort(begin, end, [](sorting_lambda_here));

even though the contents of both comparison functions is exactly the same (basically return a < b). This is the reason the source code for the sort function above is missing the function preamble.

3 comments:

Mar CanetNovember 11, 2017 at 10:31 PM
Do there is some comparison like this comparing same sort to python, javascript performance?
ReplyDelete
Replies
Alexander E. PatrakovNovember 13, 2017 at 9:43 AM
I think that the sentence "The only way to emulate this in plain C would be to ship the sort function as a preprocessor monster thingy that people could then use in their own code" is wrong. You don't need a preprocessor macro to inline the comparator. Defining the sort() function as a static inline in some header would be enough. GCC will then inline the sort() function, and then see that it doesn't really need to call the comparator through a pointer, because the value of the pointer is known. And then another round of inilining would take place and inline the comparator.
ReplyDelete
Replies