Wednesday, January 4, 2017

Clarifications to xz compression article

My previous article on beating the compression performance of xz went a bit viral. Comments on various discussion forums seemed to bring up the same, slightly erroneous, points so here are a few clarifications.

Random access to files is a useless feature

For most use cases this is probably true. But that was not the main point of the experiment. The point was to enable parallel compression without making the final size much bigger. Even more important was to enable parallel decompression. Very few file formats support that without losing compression performance. As a classical example PKZIP supports parallel decompression but its archives are a lot bigger than packed tar files.

Tarballs are just fine and need no changes

Many people were of the opinion that tarballs are "good enough for source distribution". This is probably correct. Unfortunately when people hear the word "tar" they immediately think of source tarballs. That is only a small fraction of tar usage in the world. Many file formats have tar files (or equivalent) hidden inside them. Thus all operations on said files are serial and can use only one CPU at a time.

As an example, the deb file format consists of a bunch of metadata and an embedded tar file. RPM is the same, except that it has a cpio archive rather than tar. This means that installing updates is an inherently serial operation. Update packages can only be installed one at a time, and because they are in a tarlike format, only one CPU can be used for processing them. This is particularly slow on platforms such as the Raspberry Pi that have underpowered multicore CPUs.

When running workloads such as a CI builder, there are multiple steps that are serial when using tar-like file formats. As an example pbuilder has the following steps:
  • unpack base image <- serial
  • install updates <- serial
  • install dependencies <- serial
  • unpack source <- serial
  • compile <- parallel
  • compress final result <- serial
If you are building Libreoffice, Firefox or Chromium, the compilation step dominates. For smaller packages the bottleneck may be elsewhere. As an extreme example, the Debian package build of Meson installs circa 1.5 GB of dependency packages but the actual build consists of a Python distutils invocation that takes all of one second (running the test suite does take a while, though).

This is not suitable for production due to X

It was never meant to. It was merely an experiment to see if a better file format were possible.

Pack/unpack operations are bound by disk, not by CPU

This is true on some platforms but not all of them. LZMA compression is almost always CPU bound even on the fastest CPUs and so is decompression if you have a fast SSD. Disks are getting faster all the time, CPUs are not, so CPUs will become even more of a bottleneck in the future.

What we have is good enough, the benefits are not worth an overhaul and even if they were inertia would prevent uptake

Maybe. But experiments are fun to do regardless.

1 comment:

  1. Some notes about disk performance: Around 5-10 years ago most desktop users bought budget 5400-7200 rpm disks. They had specs like:
    100 MB/s max sustained read performance
    50-100 IOPS

    Now (Samsung 960 EVO ~130 EUR)
    3200 MB/s
    330 000 IOPS

    Disk bandwidth has improved over 3000% and IOPS over 300000%. We should definitely expect more and rapid improvements in the near future (e.g. Optane disks). So, it's pretty obvious that things have become CPU/memory bound again. Many users also encrypt their file system, which affects the disk performance. For instance, my old i7 3770k system (2012) can only encrypt/decrypt up to 2000/2000 MB/s (multi-core parallel AES XTS 256b). If you combine that with transparent disk compression, even the 600 MB/s SATA disks might be fast enough to congest the CPU. LZ4/LZO are pretty much the only available algorithms that seem transparent enough not to slow everything down. For instance, squashfs/xz is clearly slowing down disk performance on most machines. The only place it's improving perf is PXE/NFS boot over 100 Mbps ethernet.

    Haswell and later generations provide somewhat better encryption perf due to new AVX instructions, but overall the situation is quite bad if you don't parallelize. Currently it seems that CPUs are pretty stuck with 1-3% annual perf upgrades with single threaded programs. Most improvements come from larger (turbo) clock frequencies, not from improved IPC. I find it really unlikely that we could even get 200% speedup with the current hardware development style during the next *100* years. Like your graphs show, we could achieve 10000% NOW with better algorithms. Heck, we could even switch back to simpler, in-order cores to make room for a larger core count, yet still achieve better performance than incremental CPU gate-level optimizations in 100 years. This is clearly the way to go.

    Given that most people really do expect constant perf improvements in all technology and are willing to spend lots of money on new, faster hardware with each new year, I don't quite understand the criticism here. Compared to new hardware, software upgrades are 'free'. You just install an update and instantly get better performance. Why waste money on bogus hardware upgrades? Better algorithms are clearly now the lowest hanging fruit with lots of potential. It's actually quite interesting that somebody spends time thinking about parallel compression. I've seen widely used CPU benchmarks where the compression tests typically yield worse results with server class many-core hardware (vs 2-4 turbo core gaming CPUs).