Compression makes this even worse. Accessing byte n requires unpacking every byte from the beginning of the file. This is unfortunate in itself but it has an even bigger downside. Compression and decompression are inherently serial operations. They can not be run in parallel. This was not really an issue in the seventies, but nowadays even a Raspberry Pi has four cores. Using only one seems wasteful.
There are some attempts to work around this, such as pigz, but they just chop the full file into constant sized blocks and compress them in parallel. Even in this case parallel decompression is not possible.
Why is tar used then?
In addition to inertia, tar has one major feature on its side: it compresses really well. Modern compressors like lzma (and even zlib) like having a lot of data to achieve high compression ratios. Tar clumps everything into one file, which is good for compressors. Let's examine how much. For testing we took Linux kernel 4.9 source tree. We first recompressed it with xz using the same encoder settings as for the other systems. The unpacked source is 762MB and the compressed size is 92 megabytes, which is our baseline.
The obvious comparison to tar + xz is a zip file using LZMA compression. There is an implementation called Parzip that supports parallel LZMA compression out of the box. When run it produces a zip file that is 164 megabytes in size. It is easy to see why tar + xz is so popular. The main point of compression is to save space and having a file that is almost twice the size is unacceptable.
One inefficiency of pkzip is that the metadata format is both messy and not compressed. Creating a new file format that stores the index as a single compressed lump at the end of the file is straightforward. The code for this (and the rest of the examples listed here) can be found in the jpak Github repo. Doing this creates a file that is 158 megabytes in size. This is better, but the compression ratio is still unusably bad.
If we wish to preserve perfect O(1) random file access to the compressed data this is probably the best we can do. But if loosen the requirements a bit we can get further. An archive basically consists of a sequence of files of different size. If we start from the top and pick files until their total uncompressed size reaches some predetermined limit, concatenate them into a single clump and compress that we can give the compressor large swatches of data at a time and in addition can compress the different clumps in parallel. To access a random file we need to decompress only the clump it is in, not the entire file from the beginning. This makes the operation take O(clumpsize) time.
If we do this using a clump size of 1 MB (uncompressed) the result is 102 MB file. This is a big improvement but still lags behind tar + xz. Increasing the clump size to 10MB produces a 93 MB file. Upping the size to 100 MB (which means that the file will consist of 8 clumps) produces a 91 MB file. This is smaller than what xz produces.
Wait, what? How can it be smaller than tar+xz?
The key here lies in the tar file format. It interleaves file data and metadata. Compressors do not really like this, because these two items usually have different statistics. Putting data that is similar next to each other improves compression ratios. Jpak stores the file metadata in a structure-of-arrays formation. That is, each metadata is a structure with properties A, B, C and so on. Jpak first writes the property A of all entries followed by all properties B and so on. This layout compresses better than tar's intermixed layout. Or that's the working theory at the moment, there has not been a thorough analysis.
It should be noted that the actual compression method is the same in every case: LZMA as provided by liblzma. The only difference is in how the data is laid out in the file. Changing the layout allows you to do the compression and decompression in parallel, get a random access index to your data and achieve as good or even better compression ratios with the downside being slightly slower access to individual files.
The small print
The current implementation does not actually do the operations in parallel (though Parzip does). The code has not been tested very thoroughly, so it might fail in many interesting ways. Do not use it to store any data you care about. The current implementation stores user and group metadata only as numbers rather than in strings, so it is not exactly equivalent to tar (though user/group strings are usually the same for most files so they compress really well). Other metadata bits may be missing as well.



