Tuesday, July 7, 2020

What if? Revision control systems did not have merge

A fun design exercise is to take an established system or process and introduce some major change into it, such as adding a completely new constraint. Then take this new state of things, run with it and see what happens. In this case let's see how one might design a revision control system where merging is prohibited. Or, formulated in a slightly different way:
What if merging is to revision control systems as multiple inheritance is to software design?

What is merging used for?

First we need to understand what merging is used for so that we can develop some sort of a system that achieves the same results via some other mechanism. There are many reasons to use merges, but the most popular ones include the following.

An isolated workspace for big changes

Most changes are simple and consist of only one commit. Sometimes, however, it is necessary to make big changes with intermediate steps, such as doing major refactoring operations. These are almost always done in a branch and then brought in to trunk. This is especially convenient if multiple people work on the change.

Trunk commits are always clean

Bringing big changes in via merges means that trunk is always clean and buildable. More importantly, bisection works reliably since all commits in trunk are known good. This is typically enforced via a gating CI. Merging also lets big changes have intermediate steps that are useful but broken in some way, so they would not pass CI on their own. This is not common, but happens often enough to be useful.

An alternative to merging is squashing the branch into a single commit. This is suboptimal because it destroys information. For example, git blame style functionality breaks, as every change in the branch now points to a single commit made by a single person (or possibly a bot).

Fix tracking

There are several systems that do automatic tracking of bug fixes to releases. The way this is done is that a fix is written in its own branch. The bug tracking system can then easily see when the fix gets to the various release branches by seeing when the bugfix branch has been merged to them.

A more linear design

In practice many (possibly even most) projects already behave like this. They keep their histories linear by rebasing, squashing and cherry picking, never merging. This works but has the downsides mentioned above. If one spends some time thinking about this problem, the fundamental disconnect becomes fairly clear. A "linear" revision control system has only one type of change, the commit, whereas "real world" problems have two different types: logical changes and the individual commits that make up a logical change. This structure is implicit in the graph of merge-based systems, but what if we made it explicit? Thus if we have a commit graph that looks like this:



the linear version could look like this:


The two commits from the right branch have become one logical commit in the flat version. If the revision control system has a native understanding of these kinds of physical and logical commits, all the problematic cases listed above could be made to work transparently. For example, bisection would start by treating each logical commit as a single change. Only after it has proven that the error was introduced inside one particular logical commit would bisection look inside it.
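The two-level bisection described above could be sketched roughly as follows. This is only an illustration, not an existing tool: the names PhysicalCommit, LogicalCommit, first_bad and bisect_history are all made up for this example, and the sketch assumes that once the bug appears it stays present in every later commit.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


@dataclass
class PhysicalCommit:
    commit_id: str  # hypothetical checksum-style ID


@dataclass
class LogicalCommit:
    logical_id: str               # hypothetical ID of the logical change
    parts: List[PhysicalCommit]   # the physical commits it consists of


def first_bad(items: Sequence[T], is_bad: Callable[[T], bool]) -> int:
    """Binary search for the first bad item.

    Assumes the last item is bad and that every item after the
    first bad one is bad as well.
    """
    lo, hi = 0, len(items) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(items[mid]):
            hi = mid
        else:
            lo = mid + 1
    return lo


def bisect_history(
    history: Sequence[LogicalCommit],
    is_bad: Callable[[PhysicalCommit], bool],
) -> Tuple[LogicalCommit, PhysicalCommit]:
    # Outer pass: a logical commit is tested as one unit, that is,
    # by the tree state after its last physical commit.
    outer = first_bad(history, lambda lc: is_bad(lc.parts[-1]))
    culprit = history[outer]
    # Inner pass: only now look at the individual physical commits.
    inner = first_bad(culprit.parts, is_bad)
    return culprit, culprit.parts[inner]


history = [
    LogicalCommit("L1", [PhysicalCommit("a")]),
    LogicalCommit("L2", [PhysicalCommit("b"), PhysicalCommit("c"), PhysicalCommit("d")]),
    LogicalCommit("L3", [PhysicalCommit("e")]),
]
bad = {"c", "d", "e"}  # the bug was introduced in physical commit "c"
logical, physical = bisect_history(history, lambda c: c.commit_id in bad)
print(logical.logical_id, physical.commit_id)  # prints: L2 c
```

Note that the inner pass is best effort. Intermediate commits inside a logical commit are allowed to be broken, so a real tool would need some way to skip states that can not be built or tested.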

This, by itself, does not fix bug tracking. As there are no merges you can't know which branches have which fixes. This can be solved by giving each change (both physical and logical) a logical ID which remains the same over rebase and edit operations, as opposed to the checksum-based commit ID which changes every time the commit is edited. This changes the tracking question from "which release branches have merged this fix branch" to "which release branches have a commit with this given logical ID", which is a fairly simple problem to solve.
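To show why the logical ID lookup is simple, here is a minimal sketch. The Commit class, the branches_containing function and all the IDs below are invented for this example; a real system would presumably keep an index rather than scan every branch.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Commit:
    commit_id: str   # checksum-based, changes on every rebase or edit
    logical_id: str  # hypothetical stable ID that survives those operations


def branches_containing(
    branches: Dict[str, List[Commit]], logical_id: str
) -> List[str]:
    """Return the names of all branches that carry the given logical change."""
    return sorted(
        name
        for name, commits in branches.items()
        if any(c.logical_id == logical_id for c in commits)
    )


branches = {
    "trunk": [Commit("a1f3", "fix-123"), Commit("b2d4", "feat-7")],
    # The same fix cherry-picked to a release branch: new checksum, same logical ID.
    "release-1.0": [Commit("9c0e", "fix-123")],
    "release-2.0": [Commit("77aa", "feat-7")],
}
print(branches_containing(branches, "fix-123"))  # prints: ['release-1.0', 'trunk']
```

The checksum of the fix differs between trunk and release-1.0, but the lookup does not care because it only compares logical IDs.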

This approach is not new. LibreOffice has tooling on top of Git that does roughly the same thing as discussed here. It is implemented as freeform text in commit messages with all the advantages and disadvantages that brings.

One obvious question that comes up is whether you could have logical commits inside logical commits. This seems like an obvious can of worms. On one hand it would be mathematically symmetrical and all that, but on the other hand it has the potential to devolve into full Inception, which you usually want to avoid. You'd probably want to start by prohibiting that and potentially permitting it later once you have more usage experience and user feedback.

Could this actually work?

Maybe. But the real question is probably "could a system like this replace Git" because that is what people are using. This is trickier. A key question would be whether you can automatically convert existing Git repos to the new format with no or minimal loss of history. Simple merges could maybe be converted in this way, but in practice things are a lot more difficult due to things like octopus merges. If the conversion cannot be done, then the expected market share is roughly 0%.
