What is the problem?
Decompressing Zip files in parallel is almost trivial. Each file can be decompressed in parallel without affecting any other decompression task. Fire up N processing tasks and decompress files until finished. Compressing Zip files is more difficult to parallelize. Each file can be compressed separately, but the problem comes from writing the output file.
The output file must be written one entry at a time: while one compressed file is being written to it, every other compression task has to wait its turn before its output can go into the result file. The compressed data cannot simply be held in memory in the meantime, because it is common for output files to be larger than available memory.
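For comparison, the decompression side needs no coordination between workers at all. A minimal sketch of that pattern, with hypothetical ZipEntry and decompress_entry placeholders standing in for the real code, could look like this:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

struct ZipEntry {          // hypothetical: name, offsets and sizes of one archive entry
    std::string filename;
};

void decompress_entry(const ZipEntry &entry) {
    // hypothetical: inflate this entry and write it to its own output file
    (void)entry;
}

// Use every core: each worker repeatedly claims the next unprocessed entry
// until none are left. No worker ever has to wait for another.
void decompress_all(const std::vector<ZipEntry> &entries) {
    std::atomic<std::size_t> next{0};
    const unsigned cores = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < cores; ++i) {
        workers.emplace_back([&] {
            for (std::size_t j = next++; j < entries.size(); j = next++)
                decompress_entry(entries[j]);
        });
    }
    for (auto &w : workers)
        w.join();
}
```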
The original solution (and thus the design flaw alluded to) was to have each compressor write its output to a temporary file. The writer would then read the data back from that file, append it to the final result file and delete the temporary file.
This works, but it means the data gets written to the file system twice, and it may require up to 2× the disk space; the worst case is compressing a single very large file. On a desktop machine this is not much of a problem, but on something like a Raspberry Pi the disk is an SD card, which is very slow, so you really want to write the data only once. SD cards also wear out with every write, which is another reason to avoid redundant writes.
The new approach
An optimal solution would have all of these properties:
- Uses all CPU cores 100% of the time (except at the end when there are fewer tasks than cores).
- Writes data to the file system only once.
- Handles files of arbitrary size (much bigger than available RAM).
- Has bounded memory consumption.
The new architecture I came up with looks like this:
Rather than writing its result to a temporary file, each compressor writes it to a byte queue with a fixed maximum size. This was chosen to be either 10 or 100 megabytes, which means that in practice most files fit the buffer. The queue can be in one of three states: not full, full or finished. The difference between the last two is that a full queue belongs to a compression task that still has data to produce but cannot proceed until the writer drains the queue, whereas a finished queue's task has compressed its entire file.
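A rough sketch of such a queue, assuming a plain mutex-and-condition-variable implementation (the class and method names here are illustrative, not taken from the actual code):

```cpp
#include <algorithm>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

class ByteQueue {
public:
    enum class State { NotFull, Full, Finished };

    explicit ByteQueue(std::size_t max_size) : max_size_(max_size) {}

    // Compressor side: append output, blocking whenever the buffer is at capacity.
    void push(const uint8_t *data, std::size_t size) {
        std::unique_lock<std::mutex> lock(mutex_);
        while (size > 0) {
            space_available_.wait(lock, [&] { return buf_.size() < max_size_; });
            const std::size_t chunk = std::min(size, max_size_ - buf_.size());
            buf_.insert(buf_.end(), data, data + chunk);
            data += chunk;
            size -= chunk;
        }
    }

    // Compressor side: signal that no more data is coming.
    void mark_finished() {
        std::lock_guard<std::mutex> lock(mutex_);
        finished_ = true;
    }

    // Writer side: which of the three states is the queue in?
    State state() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (finished_) return State::Finished;
        return buf_.size() >= max_size_ ? State::Full : State::NotFull;
    }

    // Writer side: take everything currently buffered, unblocking the compressor.
    std::vector<uint8_t> pop_all() {
        std::lock_guard<std::mutex> lock(mutex_);
        std::vector<uint8_t> out;
        out.swap(buf_);
        space_available_.notify_one();
        return out;
    }

private:
    const std::size_t max_size_;
    std::vector<uint8_t> buf_;
    bool finished_ = false;
    std::mutex mutex_;
    std::condition_variable space_available_;
};
```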
The behaviour is now straightforward. First launch compressor tasks just as when decompressing. The file writer then goes through all the queues. If it finds a finished queue, it writes its contents to disk and launches a new task. If it finds a full queue it does the same, but it must write out that entire stream, meaning it is blocked until the current file has been fully compressed. If that compression takes too long, all the other compression tasks will finish (or fill their queues) but new ones cannot be launched, leading to CPU underutilization.
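One pass of that writer scan could look roughly like this, building on the ByteQueue sketch above. CompressionTask and launch_next_task are hypothetical placeholders, and a real writer would repeat the scan (and block instead of polling) until every input file has been written:

```cpp
#include <cstdint>
#include <cstdio>
#include <memory>
#include <vector>

struct CompressionTask {
    ByteQueue queue{100 * 1024 * 1024};  // fixed cap, as described above
    // ... compressor thread handle, source file name, local header data, etc.
};

// Hypothetical helper: start compressing the next input file, if any remain.
bool launch_next_task(std::vector<std::unique_ptr<CompressionTask>> &tasks);

// Append a chunk of compressed bytes to the archive.
void write_bytes(std::FILE *out, const std::vector<uint8_t> &bytes) {
    std::fwrite(bytes.data(), 1, bytes.size(), out);
}

void scan_queues_once(std::FILE *out,
                      std::vector<std::unique_ptr<CompressionTask>> &tasks) {
    for (std::size_t i = 0; i < tasks.size(); ++i) {
        CompressionTask &task = *tasks[i];
        switch (task.queue.state()) {
        case ByteQueue::State::Finished:
            // Whole file compressed and buffered: one quick write, then refill the pool.
            write_bytes(out, task.queue.pop_all());
            launch_next_task(tasks);
            break;
        case ByteQueue::State::Full:
            // The entry did not fit in its buffer. Once we start writing it we are
            // committed: drain this one stream until its compressor finishes, even
            // though every other queue has to wait. (A real implementation would
            // sleep on a condition variable here rather than poll.)
            do {
                write_bytes(out, task.queue.pop_all());
            } while (task.queue.state() != ByteQueue::State::Finished);
            write_bytes(out, task.queue.pop_all());  // flush the final tail
            launch_next_task(tasks);
            break;
        case ByteQueue::State::NotFull:
            break;  // still compressing, check again on the next pass
        }
    }
}
```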
Is it possible to do better?
Yes, but only as a special case. Btrfs supports cloning data from one file to another in O(1) time using only O(1) extra space. Thus you could write all the data to temp files, clone it into the final file and delete the temp files.
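As a hedged illustration (not something the tool necessarily does): on Linux one way to ask for this is copy_file_range(2), which Btrfs can satisfy by sharing extents instead of rewriting the data, and which degrades to an ordinary in-kernel copy on other file systems. Splicing a temp file onto the end of the archive could then look like this:

```cpp
// Sketch only: appends the whole contents of tmp_fd to archive_fd with
// copy_file_range(2) (glibc 2.27+). On Btrfs the kernel may fulfil this by
// reflinking extents so no data is rewritten; elsewhere it is a normal copy.
// tmp_fd must be freshly opened (offset 0) and archive_fd must be positioned
// at its end and NOT opened with O_APPEND, which copy_file_range rejects.
#include <sys/stat.h>
#include <unistd.h>

int append_via_clone(int tmp_fd, int archive_fd) {
    struct stat st;
    if (fstat(tmp_fd, &st) != 0)
        return -1;
    off_t remaining = st.st_size;
    while (remaining > 0) {
        ssize_t copied = copy_file_range(tmp_fd, nullptr, archive_fd, nullptr,
                                         static_cast<size_t>(remaining), 0);
        if (copied <= 0)
            return -1;  // error or unexpected EOF; a real tool would fall back to read()/write()
        remaining -= copied;
    }
    return 0;
}
```

The explicit clone interface (the FICLONERANGE ioctl) would make the O(1) behaviour explicit, but it generally requires block-aligned offsets, which the middle of a zip archive will rarely have.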