What is the problem?
Decompressing Zip files in parallel is almost trivial. Each file can be decompressed in parallel without affecting any other decompression task. Fire up N processing tasks and decompress files until finished. Compressing Zip files is more difficult to parallelize. Each file can be compressed separately, but the problem comes from writing the output file.
The output file must be written one entry at a time: while one compressed file is being written to it, every other compression task has to wait its turn before its output can go into the result file. The compressed data cannot simply be held in memory in the meantime, because it is common for output files to be larger than available memory.
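For comparison, the decompression side needs no coordination between workers at all. A minimal sketch of that pattern, with hypothetical ZipEntry and decompress_entry placeholders standing in for the real code, could look like this:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

struct ZipEntry {          // hypothetical: name, offsets and sizes of one archive entry
    std::string filename;
};

void decompress_entry(const ZipEntry &entry) {
    // hypothetical: inflate this entry and write it to its own output file
    (void)entry;
}

// Use every core: each worker repeatedly claims the next unprocessed entry
// until none are left. No worker ever has to wait for another.
void decompress_all(const std::vector<ZipEntry> &entries) {
    std::atomic<std::size_t> next{0};
    const unsigned cores = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < cores; ++i) {
        workers.emplace_back([&] {
            for (std::size_t j = next++; j < entries.size(); j = next++)
                decompress_entry(entries[j]);
        });
    }
    for (auto &w : workers)
        w.join();
}
```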
The original solution (and thus the design flaw alluded to) was to have each compressor write its output to a temporary file. The writer would then read the data back from that file, append it to the final result file and delete the temporary file.
This works, but it means the data gets written to the file system twice, and it may require up to 2× the disk space; the worst case is compressing a single very large file. On a desktop machine this is not much of a problem, but on something like a Raspberry Pi the disk is an SD card, which is very slow, so you really want to write the data only once. SD cards also wear out with every write, which is another reason to avoid redundant writes.
The new approach
An optimal solution would have all of these properties:
- Uses all CPU cores 100% of the time (except at the end when there are fewer tasks than cores).
- Writes data to the file system only once.
- Handles files of arbitrary size (much bigger than available RAM).
- Has bounded memory consumption.
The new architecture I came up with looks like this:
Rather than writing its result to a temporary file, each compressor writes it to a byte queue with a fixed maximum size. This was chosen to be either 10 or 100 megabytes, which means that in practice most files fit the buffer. The queue can be in one of three states: not full, full or finished. The difference between the last two is that a full queue belongs to a compression task that still has data to produce but cannot proceed until the writer drains the queue, whereas a finished queue's task has compressed its entire file.
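A rough sketch of such a queue, assuming a plain mutex-and-condition-variable implementation (the class and method names here are illustrative, not taken from the actual code):

```cpp
#include <algorithm>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

class ByteQueue {
public:
    enum class State { NotFull, Full, Finished };

    explicit ByteQueue(std::size_t max_size) : max_size_(max_size) {}

    // Compressor side: append output, blocking whenever the buffer is at capacity.
    void push(const uint8_t *data, std::size_t size) {
        std::unique_lock<std::mutex> lock(mutex_);
        while (size > 0) {
            space_available_.wait(lock, [&] { return buf_.size() < max_size_; });
            const std::size_t chunk = std::min(size, max_size_ - buf_.size());
            buf_.insert(buf_.end(), data, data + chunk);
            data += chunk;
            size -= chunk;
        }
    }

    // Compressor side: signal that no more data is coming.
    void mark_finished() {
        std::lock_guard<std::mutex> lock(mutex_);
        finished_ = true;
    }

    // Writer side: which of the three states is the queue in?
    State state() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (finished_) return State::Finished;
        return buf_.size() >= max_size_ ? State::Full : State::NotFull;
    }

    // Writer side: take everything currently buffered, unblocking the compressor.
    std::vector<uint8_t> pop_all() {
        std::lock_guard<std::mutex> lock(mutex_);
        std::vector<uint8_t> out;
        out.swap(buf_);
        space_available_.notify_one();
        return out;
    }

private:
    const std::size_t max_size_;
    std::vector<uint8_t> buf_;
    bool finished_ = false;
    std::mutex mutex_;
    std::condition_variable space_available_;
};
```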
The behaviour is now straightforward. First launch compressor tasks just as when decompressing. The file writer then goes through all the queues. If it finds a finished queue, it writes its contents to disk and launches a new task. If it finds a full queue it does the same, but it must write out that entire stream, meaning it is blocked until the current file has been fully compressed. If that compression takes too long, all the other compression tasks will finish (or fill their queues) but new ones cannot be launched, leading to CPU underutilization.
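One pass of that writer scan could look roughly like this, building on the ByteQueue sketch above. CompressionTask and launch_next_task are hypothetical placeholders, and a real writer would repeat the scan (and block instead of polling) until every input file has been written:

```cpp
#include <cstdint>
#include <cstdio>
#include <memory>
#include <vector>

struct CompressionTask {
    ByteQueue queue{100 * 1024 * 1024};  // fixed cap, as described above
    // ... compressor thread handle, source file name, local header data, etc.
};

// Hypothetical helper: start compressing the next input file, if any remain.
bool launch_next_task(std::vector<std::unique_ptr<CompressionTask>> &tasks);

// Append a chunk of compressed bytes to the archive.
void write_bytes(std::FILE *out, const std::vector<uint8_t> &bytes) {
    std::fwrite(bytes.data(), 1, bytes.size(), out);
}

void scan_queues_once(std::FILE *out,
                      std::vector<std::unique_ptr<CompressionTask>> &tasks) {
    for (std::size_t i = 0; i < tasks.size(); ++i) {
        CompressionTask &task = *tasks[i];
        switch (task.queue.state()) {
        case ByteQueue::State::Finished:
            // Whole file compressed and buffered: one quick write, then refill the pool.
            write_bytes(out, task.queue.pop_all());
            launch_next_task(tasks);
            break;
        case ByteQueue::State::Full:
            // The entry did not fit in its buffer. Once we start writing it we are
            // committed: drain this one stream until its compressor finishes, even
            // though every other queue has to wait. (A real implementation would
            // sleep on a condition variable here rather than poll.)
            do {
                write_bytes(out, task.queue.pop_all());
            } while (task.queue.state() != ByteQueue::State::Finished);
            write_bytes(out, task.queue.pop_all());  // flush the final tail
            launch_next_task(tasks);
            break;
        case ByteQueue::State::NotFull:
            break;  // still compressing, check again on the next pass
        }
    }
}
```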
Is it possible to do better?
Yes, but only as a special case. Btrfs supports cloning data from one file to another in O(1) time using only O(1) extra space. Thus you could write all the data to temp files, clone it into the final file and delete the temp files.
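As a hedged illustration (not something the tool necessarily does): on Linux one way to ask for this is copy_file_range(2), which Btrfs can satisfy by sharing extents instead of rewriting the data, and which degrades to an ordinary in-kernel copy on other file systems. Splicing a temp file onto the end of the archive could then look like this:

```cpp
// Sketch only: appends the whole contents of tmp_fd to archive_fd with
// copy_file_range(2) (glibc 2.27+). On Btrfs the kernel may fulfil this by
// reflinking extents so no data is rewritten; elsewhere it is a normal copy.
// tmp_fd must be freshly opened (offset 0) and archive_fd must be positioned
// at its end and NOT opened with O_APPEND, which copy_file_range rejects.
#include <sys/stat.h>
#include <unistd.h>

int append_via_clone(int tmp_fd, int archive_fd) {
    struct stat st;
    if (fstat(tmp_fd, &st) != 0)
        return -1;
    off_t remaining = st.st_size;
    while (remaining > 0) {
        ssize_t copied = copy_file_range(tmp_fd, nullptr, archive_fd, nullptr,
                                         static_cast<size_t>(remaining), 0);
        if (copied <= 0)
            return -1;  // error or unexpected EOF; a real tool would fall back to read()/write()
        remaining -= copied;
    }
    return 0;
}
```

The explicit clone interface (the FICLONERANGE ioctl) would make the O(1) behaviour explicit, but it generally requires block-aligned offsets, which the middle of a zip archive will rarely have.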