Comments on Nibble Stew: "Beating the compression performance of xz, or has the time come to dump tar?" (blog by Jussi)

ldn (2020-08-30):
One of the two hardest things in CS is naming things. "Tarball" is a better name than, say, "Parzip file" or (cringe) "Parzipball." QED.

lingmaaki (2019-04-08):
Generally speaking, most modern compression algorithms give roughly the same compression, and as for the number of cores used at once, it is up to you to decide how many to use. However, 7-zip (http://net-informations.com/q/mis/7zip.html) is free and open source. The 7z format supports encryption with the AES algorithm and a 256-bit key. If an archive exceeds the chosen volume size, 7-zip will split it into multiple files automatically, such as integration_serviceLog.zip.001, integration_serviceLog.zip.002, etc. (Way back when, PK Zip used this to span zip files across multiple floppy disks.) You'll need all the files to be present to unzip them.
The 7z format also provides the option to encrypt the filenames of an archive.

Anonymous (2017-01-03):
There is some criticism of the xz format and utility regarding longevity and correct behavior:

http://www.nongnu.org/lzip/xz_inadequate.html

Perhaps the associated libraries offered by the author of lzip might be of some use to you.

Jussi (2017-01-03):
jpa stores those just like tar, as does zip and a bunch of other archive formats.

Michael (2017-01-03):
About .tar.*, we still need it to keep file attributes like permissions.

Jussi (2017-01-03):
This allows you to write the file in a single write-only pass.
If the index were at the beginning, you'd need to seek back and forth (and you'd need to reserve space for it before writing the compressed files, etc.).

Rob Hoelz (2017-01-03):
After a few seconds of reflection, I may have answered my own question. Is it so that you can efficiently add new files to an archive? So scrub the metadata from the end of the file, add a new file, and then write some revised metadata after the new file?

Rob Hoelz (2017-01-03):
This is probably a really stupid question (I'm not very experienced with compression or archival formats), but why do you store the metadata at the end of the file rather than at the beginning?

Jussi (2017-01-02):
I did not have any needs as such. This was just an experiment on various aspects of compression. Metadata alignment did not come from squashfs. I already had a zip compressor and was trying to see if reordering the data would make it compress better. It did.
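The trailing-index layout discussed in these comments can be sketched in a few lines. This is an illustrative sketch only: the field names, the JSON index, the use of zlib, and the 8-byte offset trailer are assumptions made for the example, not the actual jpak format.

```python
import json
import struct
import zlib

def write_archive(path, files):
    """Write {name: bytes} to an archive in one forward-only pass."""
    index = []
    with open(path, "wb") as out:
        for name, data in files.items():
            blob = zlib.compress(data)
            index.append({"name": name, "offset": out.tell(), "size": len(blob)})
            out.write(blob)
        index_offset = out.tell()                   # index goes after all file data
        out.write(json.dumps(index).encode("utf-8"))
        out.write(struct.pack("<Q", index_offset))  # fixed-size trailer

def read_member(path, wanted):
    """Random access: read the trailer, then the index, then one member."""
    with open(path, "rb") as f:
        f.seek(-8, 2)                       # trailer is the last 8 bytes
        (index_offset,) = struct.unpack("<Q", f.read(8))
        f.seek(index_offset)
        index = json.loads(f.read()[:-8])   # strip the trailer bytes
        entry = next(e for e in index if e["name"] == wanted)
        f.seek(entry["offset"])
        return zlib.decompress(f.read(entry["size"]))
```

Note that the writer never rewinds: offsets are recorded as data is emitted, and the index is simply appended at the end, which is the single-pass property described above.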
WolfSkunk RedWolf (2017-01-02):
Let's go back a bit.

tar was used for tape backups (Tape ARchive), and you were streaming those onto the data tape. It was a simple format back then, mainly because compression wasn't a consideration and everything was still small.

Later on, in the '80s, you needed compression because you had phone-line modems. Thus we got ARC and LZH, and eventually ZIP and the Deflate algorithm (a la gzip).

Of course, you had tar.z, tar.Z, and eventually tar.gz. As new algorithms came around (BWT/bzip2, LZMA/xz), tar got compressed by those too. Good for distribution source archives, which is the most common case.

But I doubt you want that. You have a need to update archives without rewriting the entire archive. ZIP is no good because it compresses per file, and tar.xz's strength/weakness is compressing over the entire archive.

Thus the compromise of "sub-archives" that are compressed over... which is similar to what Microsoft's CAB format does.

vasi (2017-01-02):
Hi Jussi, have you looked at my project pixz? https://github.com/vasi/pixz

* pixz does xz compression in parallel, like a bunch of other projects do.
* pixz also *decompresses* in parallel, which I believe no other tool supports.
* pixz supports random access inside tarballs by maintaining an index of where each file lives. Yet it's fully backwards compatible with other xz and tarball tools.
* pixz uses fixed-size blocks of data (similar to JPAK), so it retains a good compression ratio even as it allows random access.
* pixz still has metadata interleaved with file data.
  Putting all the metadata together in JPAK is a good idea; was it inspired by squashfs?
* pixz supports streaming operation for basic compression/decompression, like traditional Unix tools. I think JPAK does not.
* pixz is already available in the repositories of major distributions (Debian and derivatives, Fedora, openSUSE, MacPorts, Homebrew).

I hope pixz does enough of what you need!

Anonymous (2017-01-02):
The man page of xz has two parts you really should read: "--block-size" and "--threads". xz uses LZMA2, which is capable of multi-threading.

For truly random access on compressed files, look at squashfs. It also supports using LZMA via a parameter. I get a 120 MB squashfs for Linux 4.9.

For the data-layout part you are right: tar's metadata layout is not optimal for compression. Your solution can also be massively improved! Look at standard design techniques for compression. For example, if you only have one uid and gid, you don't need to store them per file at all. If you have two uids, you need only one bit per file to store them. Encode every value as a difference from a similar (or at least the previous) value. If you want to be serious about this, please do some research.

Much of your code seems to be boilerplate (e.g. file.cpp). Consider using libraries. This may be a pain in C++, but nobody said you must use C++. I use the Rust language for my new stuff.
It can provide C-compatible bindings and has no garbage collector, but it has a very nice dependency-management system (Cargo), which makes it easy to depend on small libraries ("crates") such as "byteorder", which provides endianness-aware reading and writing of numbers without including a kitchen sink.

Jussi (2017-01-02):
Thank you for your positive, encouraging and highly informative comment. You have made the Internet proud.

Jussi (2017-01-02):
Yes, just like zip.

MeaningWell (2017-01-02):
Your testing methodology is so broken that it ends up being funny.

Instead of considering yourself a genius and doing everything by yourself, consult those who are specialists in this area.

Go to the "http://encode.ru/forums/2-Data-Compression" forum.

Anonymous (2017-01-02):
The only thing I'd like to have is a SINGLE tool that compresses.
As it is now, many tools on Windows (and also on Linux) first decompress the xz/gz/whatever into a tar, and then I need to open/decompress the tar too.

Can this next-gen compressor do both tasks at once?

Jussi (2017-01-02):
The point is not which compression algorithm to use. The point was not maximal compression, but rather to demonstrate the overhead that comes from the data layout. I just used LZMA because it was simple, easily available and in common use. A comparison of tar.br to jpak + br, or tar.lzham to jpak + lzham, would probably yield similar results (I have not done them, so I can't say for sure).

And just using br or lzham would still not permit random access or parallel decompression.

Evan Nemerson (2017-01-02):
I agree that tar is sub-optimal, but if your goal is to beat tar.xz you can get a much more significant benefit from replacing xz. Brotli and LZHAM both provide comparable ratios and compression speed, but with much faster decompression.
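As a footnote to the metadata-encoding suggestion a few comments up (encode every value as a difference from the previous one), here is a minimal sketch of the idea. The helper names are illustrative and not part of jpak or any tool mentioned in this thread.

```python
def delta_encode(values):
    """Store each value as the difference from its predecessor."""
    out, prev = [], 0
    for v in values:
        out.append(v - prev)
        prev = v
    return out

def delta_decode(deltas):
    """Invert delta_encode by running a cumulative sum."""
    out, prev = [], 0
    for d in deltas:
        prev += d
        out.append(prev)
    return out

# A column of uids where almost every file has the same owner turns
# into a run of zeros, which any general-purpose compressor handles well.
uids = [1000] * 5 + [0] + [1000] * 4
assert delta_encode(uids) == [1000, 0, 0, 0, 0, -1000, 1000, 0, 0, 0]
assert delta_decode(delta_encode(uids)) == uids
```

The point of the transform is not to shrink the data by itself but to expose redundancy (long runs of identical owners become runs of zeros) that the downstream compressor can exploit.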