The reason for this has nothing to do with how the sorting function is implemented but everything to do with the API. The C API for sorting, as described in the man pages looks like this:
void qsort(void *base, size_t nmemb, size_t size,
int (*compar)(const void *, const void *));
The interesting point here is the last argument, which is a function pointer to a comparison function. Because of this, the sort implementation can not inline this function call but must instead always call the comparator function, which translates to an indirect jump.
In C++ the sort function is a template. Because of this the comparison function can be inlined in the implementation. This turns out to have a massive performance difference (details below). The only way to emulate this in plain C would be to ship the sort function as a preprocessor monster thingy that people could then use in their own code. This leads to awful and hard to maintain code, so this is usually not done. It would be nice to be able to provide a similar fast sort performance with a stable plain C API but due to the way shared libraries work, it's just not possible.
So let's do it by cheating.
If we know that a compiler is available during program execution, we can implement a hybrid solution that achieves this. Basically we emulate how JIT compilers work. All we need is something like this:
sorter opt_func = build_sorter("int",
"(const int a, const int b) { return a < b; }");
(*opt_func)(num_array, NUM_INTS);
Here build_sorter is a function that takes as arguments the type of the item being sorted, and a sorting function as source code. Then the function calls into the external compiler to create an optimised sorting function and returns that via a function pointer that can be called to do the actual sort.
Full source code is available in this Github repo. Performance measurements for 100 million integers are as follows.
C++ is almost twice as fast as plain C. On the fly code generation is only 0.3 seconds slower than C++, which is the amount of time it takes to compile the optimised version. Codegen uses the C++ sorting function internally, so this result is expected.
Thus we find that it is possible to provide C++ level performance with a plain C API, but it requires the ability to generate code at runtime.
Extra bonus
During testing it was discovered that for whatever reason the C++ compiler (I used GCC) is not able to inline free functions as well as lambdas. That is, this declaration:
std::sort(begin, end, sorting_function);
generates slower code than this one:
std::sort(begin, end, [](sorting_lambda_here));
even though the contents of both comparison functions is exactly the same (basically return a < b). This is the reason the source code for the sort function above is missing the function preamble.