Monday, January 2, 2017

Beating the compression performance of xz, or has the time come to dump tar?

Tar is one of the basic tools of Unix. It has been in use, essentially unchanged, since 1979. It does one thing and it does it quite well. At heart it is a stream format for describing files: a sequence of consecutive metadata-and-content pairs. This is a simple and reliable format, but it has a big downside: individual entries cannot be accessed without processing the entire file from the beginning. Wikipedia has further details.
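
To make the metadata-and-content pairing concrete, here is a sketch of the POSIX ustar header that starts every entry in a tar stream (the field list is the standard one; the struct name is just illustrative). Each entry is one such 512-byte header followed by the file's contents, padded up to a multiple of 512 bytes.

    // POSIX ustar header, exactly 512 bytes. All fields are ASCII text;
    // numeric values are stored as octal strings.
    struct UstarHeader {
        char name[100];     // file name
        char mode[8];       // permission bits, octal
        char uid[8];        // owner id, octal
        char gid[8];        // group id, octal
        char size[12];      // file size in bytes, octal
        char mtime[12];     // modification time, octal
        char chksum[8];     // header checksum
        char typeflag;      // '0' regular file, '5' directory, ...
        char linkname[100]; // link target
        char magic[6];      // "ustar\0"
        char version[2];    // "00"
        char uname[32];     // owner name
        char gname[32];     // group name
        char devmajor[8];
        char devminor[8];
        char prefix[155];   // path prefix for long names
        char padding[12];   // pad the record out to 512 bytes
    };
    static_assert(sizeof(UstarHeader) == 512, "tar records are 512 bytes");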

Compression makes this even worse. Accessing byte n requires unpacking every byte from the beginning of the file. This is unfortunate in itself, but it has an even bigger downside: compression and decompression are inherently serial operations that cannot be run in parallel. This was not really an issue in the seventies, but nowadays even a Raspberry Pi has four cores, and using only one of them seems wasteful.

There are some attempts to work around this, such as pigz, but they just chop the full file into fixed-size blocks and compress them in parallel. Even then, parallel decompression is not possible.

Why is tar used then?

In addition to inertia, tar has one major feature on its side: it compresses really well. Modern compressors like LZMA (and even zlib) need a lot of data to achieve high compression ratios, and tar lumps everything into one file, which is exactly what they want. Let's examine how much this matters. For testing we took the Linux kernel 4.9 source tree and first recompressed it with xz, using the same encoder settings as for the other systems. The unpacked source is 762 MB and the compressed size is 92 MB, which is our baseline.

The obvious comparison to tar + xz is a zip file using LZMA compression. There is an implementation called Parzip that supports parallel LZMA compression out of the box. When run it produces a zip file that is 164 MB in size. It is easy to see why tar + xz is so popular: the main point of compression is to save space, and a file that is almost twice the size is unacceptable.

One inefficiency of pkzip is that its metadata format is both messy and uncompressed. Creating a new file format that stores the index as a single compressed lump at the end of the file is straightforward. The code for this (and the rest of the examples listed here) can be found in the jpak GitHub repo. Doing this produces a file that is 158 MB in size. This is better, but the compression ratio is still unusably bad.
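
As an illustration, such a format could be laid out as all the file contents first, then one compressed index blob, then a small fixed-size footer that tells the reader where the index lives. The names and fields below are hypothetical, not jpak's exact on-disk format.

    // Hypothetical archive layout, front to back:
    //   [ entry 0 data ][ entry 1 data ] ... [ entry N-1 data ]
    //   [ LZMA-compressed index: name, offset, size and metadata of every entry ]
    //   [ fixed-size footer ]
    #include <cstdint>

    struct Footer {
        uint64_t index_offset; // where the compressed index starts
        uint64_t index_size;   // compressed size of the index
        char     magic[8];     // format identifier
    };
    // A reader seeks to the footer, decompresses the index once and can then
    // jump straight to any entry's data: O(1) random access.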

If we wish to preserve perfect O(1) random access to the compressed data, this is probably the best we can do. But if we loosen the requirements a bit we can get further. An archive basically consists of a sequence of files of different sizes. If we start from the top and pick files until their total uncompressed size reaches some predetermined limit, concatenate them into a single clump and compress that, we can give the compressor large swathes of data at a time and, in addition, compress the different clumps in parallel. To access a random file we only need to decompress the clump it is in, not the entire archive from the beginning, which makes the operation take O(clump size) time.
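
Here is a minimal sketch of that clumping step, assuming a flat list of files with known uncompressed sizes (the names are illustrative, not the actual jpak code):

    #include <cstdint>
    #include <string>
    #include <vector>

    struct FileEntry {
        std::string path;
        uint64_t size; // uncompressed size in bytes
    };

    struct Clump {
        std::vector<FileEntry> files;
        uint64_t total_size = 0;
    };

    // Walk the file list in order, starting a new clump whenever adding the
    // next file would push the current clump past the limit.
    std::vector<Clump> build_clumps(const std::vector<FileEntry>& files,
                                    uint64_t clump_limit) {
        std::vector<Clump> clumps(1);
        for (const auto& f : files) {
            if (clumps.back().total_size > 0 &&
                clumps.back().total_size + f.size > clump_limit) {
                clumps.emplace_back();
            }
            clumps.back().files.push_back(f);
            clumps.back().total_size += f.size;
        }
        return clumps;
    }

Each clump is then concatenated and handed to liblzma as one stream, and because the clumps are independent of each other they can be compressed and decompressed in separate threads.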

If we do this using a clump size of 1 MB (uncompressed), the result is a 102 MB file. This is a big improvement but still lags behind tar + xz. Increasing the clump size to 10 MB produces a 93 MB file. Upping it to 100 MB (which means the archive consists of 8 clumps) produces a 91 MB file. This is smaller than what tar + xz produces.


Wait, what? How can it be smaller than tar + xz?

The key lies in the tar file format, which interleaves file data and metadata. Compressors do not really like this, because the two kinds of data usually have different statistics, and putting similar data next to each other improves compression ratios. Jpak instead stores the file metadata in a structure-of-arrays layout. That is, each metadata entry is a structure with properties A, B, C and so on; jpak writes property A of every entry first, then property B of every entry, and so on. This layout compresses better than tar's intermixed one. Or at least that is the working theory at the moment; there has not been a thorough analysis.
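
As a sketch of the idea, serializing the metadata column by column instead of entry by entry looks roughly like this (the field names are illustrative, not jpak's actual schema):

    #include <cstdint>
    #include <ostream>
    #include <string>
    #include <vector>

    struct Metadata {
        std::string name;
        uint64_t size;
        uint64_t mtime;
        uint32_t mode;
    };

    // Write all names, then all sizes, then all mtimes, then all modes, so the
    // compressor sees long runs of similar values instead of tar's interleaved
    // header/data stream. The whole buffer is compressed afterwards.
    void write_metadata_soa(std::ostream& out, const std::vector<Metadata>& entries) {
        for (const auto& e : entries)
            out << e.name << '\0';
        for (const auto& e : entries)
            out.write(reinterpret_cast<const char*>(&e.size), sizeof(e.size));
        for (const auto& e : entries)
            out.write(reinterpret_cast<const char*>(&e.mtime), sizeof(e.mtime));
        for (const auto& e : entries)
            out.write(reinterpret_cast<const char*>(&e.mode), sizeof(e.mode));
    }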

It should be noted that the actual compression method is the same in every case: LZMA as provided by liblzma. The only difference is how the data is laid out in the file. Changing the layout lets you compress and decompress in parallel, gives you a random-access index to your data, and achieves as good or even better compression ratios, with the downside of slightly slower access to individual files.

The small print

The current implementation does not actually do the operations in parallel (though Parzip does). The code has not been tested very thoroughly, so it might fail in many interesting ways. Do not use it to store any data you care about. The current implementation stores user and group metadata only as numbers rather than as strings, so it is not exactly equivalent to tar (though user/group strings are usually the same for most files, so they compress really well). Other metadata bits may be missing as well.

18 comments:

  1. I agree that tar is sub-optimal, but if your goal is to beat tar.xz you can get a much more significant benefit from replacing xz. Brotli and LZHAM both provide comparable ratios and compression speed, but with much faster decompression.

  2. The point is not which compression algorithm to use, nor maximal compression, but rather to demonstrate the overhead that comes from the data layout. I just used LZMA because it was simple, easily available and in common use. A comparison of tar.br to jpak + br or tar.lzham to jpak + lzham would probably yield similar results (I have not done them so I can't say for sure).

    And just using br or lzham would still not permit random access or parallel decompression.

  3. The only thing I'd like to have is a SINGLE tool that compresses. As it is now, many tools on Windows (and also on Linux) first decompress the xz/gz/whatever into a tar, and then I need to open/decompress the tar too.

    Can this next-gen compressor do both tasks at once?

  4. your testing methodology is so broken that it ends up being funny.

    instead of considering yourself a genius and doing everything by yourself, consult those who are specialists in this area

    go to "http://encode.ru/forums/2-Data-Compression" forum

    Replies
    1. Thank you for your positive, encouraging and highly informative comment. You have made the Internet proud.

  5. The man page of xz has two parts you really should read: "--block-size" and "--threads". xz uses LZMA2, which is capable of multi-threading.

    For truly random access on compressed files, look at squashfs. It also supports using LZMA via a parameter. I get a 120 MB squashfs for Linux 4.9.

    For the data layout part you are right, tar metadata layout is not optimal for compression. Your solution can also be massively improved! Look at standard design techniques for compression. For example, if you only have one uid and gid, you don't need to store it per file at all. If you have two uids, you need only one bit per file to store it. Encode every value as a difference to a similar (or at least the previous) value. If you want to be serious about this, please do some research.

    Much of your code seems to be boilerplate (e.g. file.cpp). Consider using libraries. This may be a pain in C++, but nobody said you must use C++. I use the Rust language for my new stuff. It can provide C-compatible bindings, has no garbage collector, and has a very nice dependency management system (Cargo) that makes it easy to depend on small libraries ("crates") like "byteorder", which provides endianness-aware reading and writing of numbers but does not include a kitchen sink.

  6. Hi Jussi, have you looked at my project pixz? https://github.com/vasi/pixz

    * pixz does xz compression in parallel, like a bunch of other projects do.
    * pixz also *decompresses* in parallel, which I believe no other tool supports.
    * pixz supports random access inside tarballs, by maintaining an index of where each file lives. Yet it's fully backwards-compatible with other xz and tarball tools.
    * pixz uses fixed-size blocks of data (similar to JPAK), so it retains a good compression ratio even as it allows random access.
    * pixz still has metadata interleaved with file data. Putting all the metadata together in JPAK is a good idea—was it inspired by squashfs?
    * pixz supports streaming operation for basic compression/decompression, like traditional Unix tools. I think JPAK does not.
    * pixz is already available in repositories for major distributions (Debian and derivatives, Fedora, OpenSuSE, MacPorts, homebrew).

    I hope pixz does enough of what you need!

    Replies
    1. I did not have any needs as such. This was just an experiment on various aspects of compression. Metadata alignment did not come from squashfs. I already had a zip compressor and was trying to see if reordering data would make it compress better. It did.

  7. Let's go back a bit.

    tar was used for tape backups (Tape ARchive), and you were streaming those onto the data tape. It was a simple format back then, mainly because compression wasn't a consideration and everything was still small.

    Later on in the 80's, you needed compression because you had phone line modems. Thus we got ARC and LZH and eventually ZIP and the Deflate algorithm (a la gzip).

    Of course, you had tar.z, tar.Z, and eventually tar.gz. As new algorithms came around (BWT/bzip2, LZMA/xz) you got tar being compressed with those too. Good for distribution source archives, which is the most common case.

    But I doubt you want that. You need to update archives without rewriting the entire archive. ZIP is no good because it compresses per file, and TAR/XZ's strength/weakness is compressing over the entire archive.

    Thus the compromise of "sub-archives" that are compressed over... which is similar to what Microsoft's CAB format does.

  8. This is probably a really stupid question (I'm not very experienced with compression or archival formats), but why do you store the metadata at the end of the file rather than at the beginning?

    Replies
    1. After a few seconds of reflection, I may have answered my own question - is it so that you can efficiently add new files to an archive? So scrub the metadata from the end of the file, add a new file, and then write some revised metadata after the new file?

    2. This allows you to write the file in a single write-only pass. If the index were at the beginning you'd need to seek back and forth (and you'd need to reserve space for it before writing the compressed files etc).

  9. About .tar.*, we still need it to keep file attributes like permissions.

    Replies
    1. jpak stores those just like tar, as do zip and a bunch of other archive formats.

  10. There is some criticism of the xz format and utility regarding longevity and correct behavior:

    http://www.nongnu.org/lzip/xz_inadequate.html

    Perhaps the associated libraries offered by the author of lzip might be of some use to you.

  11. Generally speaking most modern compression algorithms give roughly the same compression, and with regard to the number of cores you can use at once, it is up to you to decide how many to use. However, 7-zip is free and open source. The 7z format supports encryption with the AES algorithm with a 256-bit key. If the archive exceeds a given size, 7-zip will split it into multiple files automatically, such as integration_serviceLog.zip.001, integration_serviceLog.zip.002, etc. (Way back when, PK Zip used this to span zip files across multiple floppy disks.) You'll need all the files to be present to unzip them. The 7z format also provides the option to encrypt the filenames of a 7z archive.

  12. One of the two hardest things in CS is naming things. “Tarball” is a better name than, say, “Parzip file” or (cringe) “Parzipball.” QED.
