Monday, March 23, 2026

Everything old is new again: memory optimization

At this point in history, AI sociopaths have purchased all the world's RAM in order to run their copyright infringement factories at full blast. Thus the amount of memory in consumer computers and phones seems to be going down. After decades of not having to care about memory usage, reducing it has very much become a thing.

Relevant questions to this state of things include a) is it really worth it and b) what sort of improvements are even possible. The answers to these depend on the task and data set at hand. Let's examine one such case. It might be a bit contrived, unrepresentative and unfair, but on the other hand it's the one I already had available.

Suppose you have to write script that opens a text file, parses it as UTF-8, splits it into words according to white space, counts the number of time each word appears and prints the words and counts in decreasing order (most common first).

The Python baseline

This sounds like a job for Python. Indeed, an implementation takes fewer than 30 lines of code. Its memory consumption on a small text file [update: repo's readme, which is 1.3k] looks like this.

Peak memory consumption is 1.3 MB. At this point you might want to stop reading and make a guess on how much memory a native code version of the same functionality would use.

The native version

A fully native C++ version using Pystd requires 60 lines of code to implement the same thing. If you ignore the boilerplate, the core functionality fits in 20 lines. The steps needed are straightforward:

  1. Mmap the input file to memory.
  2. Validate that it is utf-8
  3. Convert raw data into a utf-8 view
  4. Split the view into words lazily
  5. Compute the result into a hash table whose keys are string views, not strings

The main advantage of this is that there are no string objects. The only dynamic memory allocations are for the hash table and the final vector used for sorting and printing. All text operations use string views , which are basically just a pointer + size.

In code this looks like the following:

Its memory usage looks like this.

Peak consumption is ~100 kB in this implementation. It uses only 7.7% of the amount of memory required by the Python version.

Isn't this a bit unfair towards Python?

In a way it is. The Python runtime has a hefty startup cost but in return you get a lot of functionality for free. But if you don't need said functionality, things start looking very different.

But we can make this comparison even more unfair towards Python. If you look at the memory consumption graph you'll quite easily see that 70 kB is used by the C++ runtime. It reserves a bunch of memory up front so that it can do stack unwinding and exception handling even when the process is out of memory. It should be possible to build this code without exception support in which case the total memory usage would be a mere 21 kB. Such version would yield a 98.4% reduction in memory usage.

Thursday, March 19, 2026

Simple sort implementations vs production quality ones

One of the most optimized algorithms in any standard library is sorting. It is used everywhere so it must be fast. Thousands upon thousands of developer hours have been sunk into inventing new algorithms and making sort implementations faster. Pystd has a different design philosophy where fast compilation times and readability of the implementation have higher priority than absolute performance. Perf still very much matters, it has to be fast, but not at the cost of 10x compilation time.

This leads to the natural question of how much slower such an implementation would be compared to a production quality one. Could it even be faster? (Spoilers: no) The only way to find out is to run performance benchmarks on actual code.

To keep things simple there is only one test set, sorting 10'000'000 consecutive 64 bit integers that have been shuffled to a random order which is the same for all algorithms. This is not an exhaustive test by any means but you have to start somewhere. All tests used GCC 15.2 using -O2 optimization. Pystd code was not thoroughly hand optimized, I only fixed (some of the) obvious hotspots.

Stable sort

Pystd uses mergesort for stable sorting. The way the C++ standard specifies stable sort means that most implementations probably use it as well. I did not dive in the code to find out. Pystd's merge sort implementation consists of ~220 lines of code. It can be read on this page.

Stdlibc++ can do the sort in 0.9 seconds whereas Pystd takes .94 seconds. Getting to within 5% with such a simple implementation is actually quite astonishing. Even when considering all the usual caveats where it might completely fall over with a different input data distribution and all that.

Regular sort

Both stdlibc++ and Pystd use introsort. Pystd's implementation has ~150 lines of code but it also uses heapsort, which has a further 100 lines of code). Code for introsort is here, and heapsort is here.

Stdlibc++ gets the sort done in 0.76 seconds whereas Pystd takes 0.82 seconds. This makes it approximately 8% slower. It's not great, but getting within 10% with a few evening's work is still a pretty good result. Especially since, and I'm speculating here, std::sort has seen a lot more optimization work than std::stable_sort because it is used more.

For heavy duty number crunching this would be way too slow. But for moderate data set sizes the performance difference might be insignificant for many use cases.

Note that all of these are faster (note: did not measure) than libc's qsort because it requires an indirect function call on every comparison i.e. the comparison method can not be inlined.

Compiling the stdlib version takes 0.68 seconds while the Pystd version takes 0.33 seconds. Without optimizations the times are 0.55 and 0.21 seconds, respectively.

Where does the time go?

Valgrind will tell you that quite easily.

This picture shows quite clearly why big O notation can be misleading. Both quicksort (the inner loop of introsort) and heapsort have "the same" average time complexity but every call to heapsort takes approximately 4.5 times as long.