Most CI systems I have seen have been stateless. That is, they start by getting a fresh Docker container (or building one from scratch), doing a Git checkout, building the thing and then throwing everything away. This is simple and mathematically pure, but really slow. This approach is further driven by the fact that in cloud computing CPU time and network transfers are cheap but storage is expensive (or at least it is possible to get almost infinite CI build time for open source projects but not persistent storage). This is probably because persistent state makes life harder for the cloud vendor: they need to take care of things like backups, and they can no longer dispatch the task to any machine on the planet but only to the one that already has the required state, and so on.
How much could you reduce resource usage (or, if you prefer, improve CI build speed) by giving up on statelessness? Let's find out by running some tests. To get a reasonably large code base I used LLVM. I did not actually use any cloud or Docker in the tests, but I simulated them on a local media PC. I used 16 cores to compile and 4 to link (any more would saturate the disk). Only the build itself was measured; LLVM's tests were not run.
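For reference, one way to get a 16 compile / 4 link split is LLVM's own CMake job-pool options. The sketch below only assembles the configure command. The directory names are made up, and the post does not say that this is how the limits were actually set, so treat it as an assumption.

```python
import shlex


def llvm_configure_command(source_dir, build_dir, compile_jobs=16, link_jobs=4):
    """Return a CMake invocation that caps parallel compile and link jobs."""
    return [
        "cmake",
        "-G", "Ninja",  # the two job-pool options below only apply to the Ninja generator
        "-S", f"{source_dir}/llvm",
        "-B", build_dir,
        f"-DLLVM_PARALLEL_COMPILE_JOBS={compile_jobs}",
        f"-DLLVM_PARALLEL_LINK_JOBS={link_jobs}",
    ]


# "llvm-project" and "build" are hypothetical directory names.
print(shlex.join(llvm_configure_command("llvm-project", "build")))
```

Limiting link jobs separately matters because linking is mostly disk and memory bound, which matches the observation that more than 4 parallel links saturated the disk.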
Baseline
Creating a Docker container with all the build deps takes a few minutes. Alternatively you can prebuild it, but then you need to download a 1 GB image.
Doing a full Git checkout would be wasteful. There are basically three different ways of doing a partial checkout: shallow clone, blobless and treeless. They take the following amount of time and space:
- shallow: 1m, 259 MB
- blobless: 2m 20s, 961 MB
- treeless: 1m 46s, 473 MB
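The three variants above map to standard git clone options. Here is a minimal sketch that builds the command for each one. The depth of 1 for the shallow clone and the destination directory are assumptions, as the exact invocations used for the measurements are not given.

```python
import shlex

# Flags for each partial clone flavour.
CLONE_FLAGS = {
    "shallow": ["--depth", "1"],        # only the latest commit (depth 1 is an assumption)
    "blobless": ["--filter=blob:none"],  # full history, file contents fetched on demand
    "treeless": ["--filter=tree:0"],     # commits only, trees and blobs fetched on demand
}


def partial_clone_command(kind, url, dest):
    """Return the git command line for the requested kind of partial clone."""
    return ["git", "clone", *CLONE_FLAGS[kind], url, dest]


for kind in CLONE_FLAGS:
    # "llvm-src" is a hypothetical destination directory.
    cmd = partial_clone_command(kind, "https://github.com/llvm/llvm-project.git", "llvm-src")
    print(shlex.join(cmd))
```

For a throwaway CI checkout the shallow clone is the obvious pick, since the build only needs the files of one commit.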
With CCache
Using CCache in Docker is mostly a question of bind mounting a persistent directory in the container's cache directory. A from-scratch build with an up to date CCache takes 9m 30s.
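As a rough sketch, the bind mount could look like the following. The host path, the image name and the build script are all hypothetical. Instead of relying on the default cache location inside the container, this version points CCache at the mount explicitly with the CCACHE_DIR environment variable.

```python
import shlex


def ccache_docker_args(host_cache_dir, container_cache_dir="/ccache"):
    """Docker flags that bind mount a persistent CCache dir and point CCache at it."""
    return [
        "-v", f"{host_cache_dir}:{container_cache_dir}",
        "-e", f"CCACHE_DIR={container_cache_dir}",
    ]


def cached_build_command(image, host_cache_dir, build_script):
    """Return a docker run command line for a from-scratch build with a warm cache."""
    return ["docker", "run", "--rm", *ccache_docker_args(host_cache_dir), image, *build_script]


# All three arguments are made-up examples.
print(shlex.join(cached_build_command("llvm-builder", "/srv/ci/ccache", ["./build.sh"])))
```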
With stashed Git repo
Just like the CCache dir, the Git checkout can also be persisted outside the container. Doing a git pull on an existing full checkout takes only a few seconds. You can even mount the repo dir read-only to ensure that no state leaks from one build invocation to another.
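A sketch of those two pieces: updating the persisted checkout on the host and producing a read-only mount for it. The paths are hypothetical, and using --ff-only is my own choice to make the update fail loudly if the CI checkout has somehow diverged.

```python
import shlex


def update_checkout_command(repo_dir):
    """Return the git command that brings the persisted checkout up to date."""
    return ["git", "-C", repo_dir, "pull", "--ff-only"]


def readonly_repo_mount(repo_dir, container_path="/src"):
    """Docker flags that mount the checkout read-only inside the container."""
    return ["-v", f"{repo_dir}:{container_path}:ro"]


# "/srv/ci/llvm-project" is a hypothetical host path.
print(shlex.join(update_checkout_command("/srv/ci/llvm-project")))
print(shlex.join(readonly_repo_mount("/srv/ci/llvm-project")))
```

Note that a read-only source tree only works with an out-of-source build, because the build can no longer write anything next to the sources.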
With Danger Zone
One main thing a CI build ensures is that the code keeps on building when compiled from scratch. It is quite possible to have a bug in your build setup that manifests itself so that the build succeeds if a build directory has already been set up, but fails if you try to set it up from scratch. This was especially common back in ye olden times when people used both to write Makefiles by hand and to think that doing so was a good idea.
Nowadays build systems are much more reliable and this is not such a common issue (though it can definitely still occur). So what if you were willing to give up full from-scratch checks on merge requests? You could, for example, still have a daily build that validates that use case. For some organizations this would not be acceptable, but for others it might be a reasonable tradeoff. After all, why should a CI build take noticeably longer than an incremental build on the developer's own machine? If anything it should be faster, since servers are a lot beefier than developer laptops. So let's try it.
The implementation for this is the same as for CCache: you just persist the build directory as well. To run the build you do a Git update, mount the repo, the build dir and optionally the CCache dir into the container and go.
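Put together, the container invocation could be sketched like this. The image name, host paths and container mount points are all assumptions, and so is the use of Ninja as the build tool.

```python
import shlex


def incremental_build_command(image, repo_dir, build_dir, ccache_dir=None):
    """Return a docker run command line that reuses a persisted build directory."""
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{repo_dir}:/src:ro",   # sources, read-only
        "-v", f"{build_dir}:/build",   # persisted build dir, read-write
    ]
    if ccache_dir is not None:
        # Optional: also persist the compiler cache.
        cmd += ["-v", f"{ccache_dir}:/ccache", "-e", "CCACHE_DIR=/ccache"]
    cmd += [image, "ninja", "-C", "/build"]
    return cmd


# Hypothetical image name and host paths.
print(shlex.join(incremental_build_command(
    "llvm-builder", "/srv/ci/llvm-project", "/srv/ci/build", "/srv/ci/ccache")))
```

The mount points inside the container should stay the same from run to run, because a configured build directory records absolute paths to the source and build trees.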
I tested this by doing a git pull on the repo and then doing a rebuild. There were a couple of new commits, so this should be representative of real-world workloads. An incremental build took 8m 30s whereas a from-scratch rebuild using a fully up-to-date cache took 10m 30s.
Conclusions
The amounts of wall clock time used by the three main approaches were:
- Fully stateless
  - Image building: 2m
  - Git checkout: 1m
  - Build: 42m
  - Total: 45m
- Cached from-scratch
  - Image building: 0m (assuming it is not "apt-get update"d for every build)
  - Git checkout: 0m
  - Build: 10m 30s
  - Total: 10m 30s
- Fully cached
  - Image building: 0m
  - Git checkout: 0m
  - Build: 8m 30s
  - Total: 8m 30s
The amounts of data transferred for each build were:
- Fully stateless
  - Image: 1 GB
  - Checkout: 300 MB
- Cached from-scratch
  - Image: 0
  - Checkout: O(changes since last pull), typically a few kB
- Fully cached
  - Image: 0
  - Checkout: O(changes since last pull)
The final 2 minute improvement might not seem like that much, but on the other hand do you really want your developers to spend 2 minutes twiddling their thumbs for every merge request they create or update? I sure don't. Waiting for CI to finish is one of the most annoying things in software development.
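To put the wall clock numbers in relative terms, here is a small calculation using only the timings listed above (with the few seconds of git pull rounded down to zero, as in the list):

```python
def seconds(minutes, secs=0):
    """Convert a minutes/seconds pair to seconds."""
    return minutes * 60 + secs


# Phase timings as reported in the conclusions.
timings = {
    "Fully stateless": {"image": seconds(2), "checkout": seconds(1), "build": seconds(42)},
    "Cached from-scratch": {"image": 0, "checkout": 0, "build": seconds(10, 30)},
    "Fully cached": {"image": 0, "checkout": 0, "build": seconds(8, 30)},
}

totals = {name: sum(phases.values()) for name, phases in timings.items()}
baseline = totals["Fully stateless"]

for name, total in totals.items():
    print(f"{name}: {total // 60}m {total % 60:02d}s, speedup {baseline / total:.1f}x")

saved = totals["Cached from-scratch"] - totals["Fully cached"]
share = saved / totals["Cached from-scratch"]
print(f"Persisting the build dir saves {saved}s ({share:.0%} of the cached from-scratch time)")
```

This works out to roughly 4.3x for the cached from-scratch build and 5.3x for the fully cached one, with the last step cutting about 19% off the cached from-scratch time.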
 
Interesting article, thanks for this.
Another sad part about all this is that in particular open source projects that don’t pay directly for CI time are very wasteful with it.
Oftentimes CI is split into multiple jobs, which as you mentioned have a very high per-job overhead.
I would also be interested in how something like sccache would perform, which I guess aims for a middle ground between local disk and remote caching.
SCCache would work, but it still requires transferring a lot of data. In the LLVM case a build generates ~4 GB of object files in the cache.