If you enjoy this article, I have published a book explaining the internals of Git through implementation: Building Git.
—
Previously in this series, we explored the Myers diff algorithm, learned a variation of it that uses linear space, saw how to use a diff algorithm to build a three-way merge, and most recently examined why merges fail and what can be done about it. In particular, we looked at an example of how our choice of diff algorithm can lead to merge conflicts, including conflicts that can mislead the reader about the changes they’re merging.
We have been using the following source code as our example:
void Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
{
if (!Chunk_bounds_check(src, src_start, n)) return;
if (!Chunk_bounds_check(dst, dst_start, n)) return;
memcpy(dst->data + dst_start, src->data + src_start, n);
}
int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
{
if (chunk == NULL) return 0;
return start <= chunk->length && n <= chunk->length - start;
}
We made an edit to swap the order of the functions in this file. Using the original Myers algorithm, we get the following diff. It is easy for a reader to understand, it cleanly separates the regions of change, and it matches up many meaningful unique lines between the old and new versions, which tends to lead to fewer merge conflicts.
+int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
+{
+ if (chunk == NULL) return 0;
+
+ return start <= chunk->length && n <= chunk->length - start;
+}
+
void Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
{
if (!Chunk_bounds_check(src, src_start, n)) return;
if (!Chunk_bounds_check(dst, dst_start, n)) return;
memcpy(dst->data + dst_start, src->data + src_start, n);
}
-
-int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
-{
- if (chunk == NULL) return 0;
-
- return start <= chunk->length && n <= chunk->length - start;
-}
However, the linear-space version of Myers that Git actually uses does not handle this change well. It produces an unclear diff that is hard for a reader to understand and does not cleanly separate the changes from each other. The lines it matches up are blank lines and closing braces rather than unique lines of code, and so this change causes merge conflicts when it is merged with other changes to the same functions.
-void Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
+int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
{
- if (!Chunk_bounds_check(src, src_start, n)) return;
- if (!Chunk_bounds_check(dst, dst_start, n)) return;
+ if (chunk == NULL) return 0;
- memcpy(dst->data + dst_start, src->data + src_start, n);
+ return start <= chunk->length && n <= chunk->length - start;
}
-int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
+void Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
{
- if (chunk == NULL) return 0;
+ if (!Chunk_bounds_check(src, src_start, n)) return;
+ if (!Chunk_bounds_check(dst, dst_start, n)) return;
- return start <= chunk->length && n <= chunk->length - start;
+ memcpy(dst->data + dst_start, src->data + src_start, n);
}
To round out this series, I’d like to explore an entirely different diff algorithm, one that’s designed very differently from Myers and which Git gives you the option of using, precisely because it behaves better on certain inputs than Git’s default choice of linear-space Myers. The algorithm is called patience diff, and it was invented by BitTorrent creator Bram Cohen; there is a brief explanation of it on his blog. As usual, we’ll work through an example and then look at implementing the algorithm in code.
The first thing to note is that patience diff is not a diff algorithm in and of itself. It is really a method of matching up lines in two versions of a document in order to break them into smaller pieces, before using an actual diff algorithm like Myers on those pieces. The rationale is that Myers can sometimes pick uninteresting lines to match up, such as all the blank lines and braces in the example above, which can lead to anything from slightly annoying mismatches to useless conflict-causing diffs. To improve on this, patience diff begins by doing a full scan of both versions to match up the unique lines they have in common, which are more likely to be meaningful content than boilerplate.
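As a rough sketch of that first step, here is a small Python helper (the naming is mine, not Git’s actual code) that finds the lines occurring exactly once in both versions, along with their positions on each side:

```python
from collections import Counter

def unique_common_lines(a, b):
    # Count how often each line appears in each version, then keep
    # only the lines that occur exactly once in both, recording the
    # position of each such line on each side.
    count_a, count_b = Counter(a), Counter(b)
    pos_a = {line: i for i, line in enumerate(a)}
    pos_b = {line: j for j, line in enumerate(b)}
    return [(line, pos_a[line], pos_b[line])
            for line in a
            if count_a[line] == 1 and count_b[line] == 1]

# The sentence example used below, one word per "line":
old = "this is incorrect and so is this".split()
new = "this is good and correct and so is this".split()
print(unique_common_lines(old, new))  # [('so', 4, 6)]
```

Only the word so survives this filter: every other word is either missing from one side or repeated somewhere.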
Let’s look at an example. Say we’re diffing the two sentence fragments
this is incorrect and so is this
and
this is good and correct and so is this
The intuitive reading of this change is that the word incorrect has been replaced with good and correct. Here are the two sentences written one word per line, with numbers:
1 this 1 this
2 is 2 is
3 incorrect 3 good
4 and 4 and
5 so 5 correct
6 is 6 and
7 this 7 so
8 is
9 this
Patience diff splits this problem up by trying to match up unique lines, that is, lines that occur exactly once in both versions. The only line that occurs exactly once in both these sentences is the word so, and so we pair up line 5 on the left with line 7 on the right.
1 this
2 is
1 this 3 good
2 is 4 and
3 incorrect 5 correct
4 and 6 and
5 so <---------------> 7 so
6 is 8 is
7 this 9 this
We then apply the same process to the sub-sections of each version created by slicing them up at the places we’ve found matches, which I’ll mark with a *. That is, we compare lines 1–4 on the left with 1–6 on the right, and 6–7 on the left with 8–9 on the right. Within those regions we find some more unique pairings; lines that were not unique when scanning the entire document can be unique when we only consider a sub-section.
1 this <---------------> 1 this
2 is <---------------> 2 is
3 good
3 incorrect 4 and
4 and 5 correct
6 and
--------------------------------------------------------------------
* 5 so 7 so
--------------------------------------------------------------------
6 is <---------------> 8 is
7 this <---------------> 9 this
If we record these matches and use them to further divide up the documents, we’re left with only one non-empty section to compare: lines 3–4 on the left and 3–6 on the right.
* 1 this 1 this
--------------------------------------------------------------------
* 2 is 2 is
--------------------------------------------------------------------
3 good
3 incorrect 4 and
4 and 5 correct
6 and
--------------------------------------------------------------------
* 5 so 7 so
--------------------------------------------------------------------
* 6 is 8 is
--------------------------------------------------------------------
* 7 this 9 this
Within this region, there are no lines that occur exactly once on each side, so patience diff has run out of matches it can find and we fall back to Myers to calculate the diff of this region. Putting everything back together again, we get the diff we wanted:
this
is
- incorrect
+ good
+ and
+ correct
and
so
is
this
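To make the overall shape of the process concrete, here is a hedged sketch of it in Python. The names are mine; the fallback is a deliberately naive stand-in for a real diff algorithm like Myers (it simply deletes all the old lines and inserts the new ones); and a simple greedy filter stands in for the longest-increasing-subsequence step described later in this article:

```python
from collections import Counter

def fallback_diff(a, b):
    # Deliberately naive stand-in for a real diff algorithm such as
    # Myers: delete everything in a, insert everything in b.
    return [("-", line) for line in a] + [("+", line) for line in b]

def patience_diff(a, b):
    # 1. Pair up the lines that occur exactly once in both versions.
    count_a, count_b = Counter(a), Counter(b)
    pos_b = {line: j for j, line in enumerate(b)}
    pairs = [(i, pos_b[line]) for i, line in enumerate(a)
             if count_a[line] == 1 and count_b[line] == 1]

    # 2. Discard crossing pairs. A greedy filter is used here for
    #    brevity; the real algorithm computes the longest increasing
    #    subsequence of the right-hand positions.
    matches, last_j = [], -1
    for i, j in pairs:
        if j > last_j:
            matches.append((i, j))
            last_j = j

    # 3. If nothing matched, fall back to an ordinary diff.
    if not matches:
        return fallback_diff(a, b)

    # 4. Otherwise, recurse on the regions between matched lines.
    result, prev_i, prev_j = [], 0, 0
    for i, j in matches:
        result += patience_diff(a[prev_i:i], b[prev_j:j])
        result.append(("=", a[i]))
        prev_i, prev_j = i + 1, j + 1
    result += patience_diff(a[prev_i:], b[prev_j:])
    return result

old = "this is incorrect and so is this".split()
new = "this is good and correct and so is this".split()
for op, line in patience_diff(old, new):
    print(op, line)
```

On the sentence example this recovers the matched this, is, so, is, this lines and hands only the middle region to the fallback. Because the fallback here is naive, the unmatched and in that region gets deleted and reinserted; with a real Myers fallback the region comes out as in the diff just shown.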
This is arguably an improvement on the original Myers algorithm, which produces this diff showing incorrect being replaced with good, and the words correct and being inserted separately.
this
is
- incorrect
+ good
and
+ correct
+ and
so
is
this
Now let’s introduce a complication. Imagine we want the diff of these two lists of musicians:
1 David Axelrod 1 The Slits
2 Electric Prunes 2 Gil Scott Heron
3 Gil Scott Heron 3 David Axelrod
4 The Slits 4 Electric Prunes
5 Faust 5 Faust
6 The Sonics 6 The Sonics
7 The Sonics 7 The Sonics
Most of the lines here occur exactly once in each version, so we can pair them up. Note that the final two lines don’t match up because they’re not unique.
1 <---------> 3
2 <---------> 4
3 <---------> 2
4 <---------> 1
5 <---------> 5
However, some of these pairings cross: if you drew lines between the two lists above, the lines would literally cross. That means we can’t use all the pairings to align the two documents with each other and split them into sub-sections. We’d just like to use as many of the pairings as we can, and to do this, we need to pick the longest subsequence from the above pairing list such that the number on the right hand side is always increasing.
For example, if we pick the pairings
1 <---------> 3
2 <---------> 4
5 <---------> 5
then we can split the documents up as follows:
1 The Slits
2 Gil Scott Heron
--------------------------------------------------------------------
1 David Axelrod <---------> 3 David Axelrod
2 Electric Prunes <---------> 4 Electric Prunes
--------------------------------------------------------------------
3 Gil Scott Heron
4 The Slits
--------------------------------------------------------------------
5 Faust <---------> 5 Faust
--------------------------------------------------------------------
6 The Sonics 6 The Sonics
7 The Sonics 7 The Sonics
Indeed, 3,4,5 is the longest increasing subsequence of the line numbers on the right. There are many algorithms for finding such sequences, and patience diff is so called because it uses patience sorting to do it. Let’s take a detour to find out how that works.
Imagine we have all the ranks from one suit in a deck of cards, in some random order, as shown below.
9 4 6 Q 8 7 A 5 10 J 3 2 K
We wish to find the longest increasing subsequence in this list, that is, the longest sequence of cards we can pluck from the list such that the cards in the subsequence appear in the same order as above and are of increasing rank. For example, 4 6 7 is a possible increasing subsequence, as is 8 10 J K. The sequence 7 A 5 is not an increasing subsequence because, although those cards appear in that order above, A is not a greater rank than 7 (aces are low in this example). Take a couple of minutes to try to find the longest increasing subsequence by hand.
Patience sorting works by analogy to the card game patience, also known as solitaire, where cards must be sorted into stacks of descending rank. Let’s run through this algorithm on our example.
First, we shift the 9 off the front of the list and place it in a new stack on its own.
4 6 Q 8 7 A 5 10 J 3 2 K
---------------------------------------------
9
Once we have our first stack established, the algorithm proceeds as follows. For each card, find the leftmost stack whose uppermost card’s rank is greater than the current card, and place the card on that stack. In our example, the next card is 4, and the first stack has a 9 uppermost, so we can place the 4 on this first stack.
6 Q 8 7 A 5 10 J 3 2 K
-----------------------------------------
4
9
The next card is a 6, which is greater than 4, so we cannot place it on the first stack and must start a new one to the right of it. Whenever we place a card on any stack other than the leftmost one, we record a pointer from the card we just placed to the card that’s uppermost on the stack immediately to the left. In this case, that means we record a pointer from 6 to 4, since 4 is on top of the first stack.
Q 8 7 A 5 10 J 3 2 K 6 -> 4
-------------------------------------
4
9 6
Next we have a Q, which is greater than all the cards stacked so far. So again we place it to the right of all the others and record a pointer from the Q to the 6 at the top of the stack to its left.
8 7 A 5 10 J 3 2 K 6 -> 4
--------------------------------- Q -> 6
4
9 6 Q
Next we have the 8, which is greater than 4 and 6 but less than Q, so we place it on the Q and record a pointer from 8 to 6.
7 A 5 10 J 3 2 K 6 -> 4
----------------------------- Q -> 6
8 -> 6
4 8
9 6 Q
The same argument applies to the 7.
A 5 10 J 3 2 K 6 -> 4
------------------------- Q -> 6
8 -> 6
7 7 -> 6
4 8
9 6 Q
The A is lower than all the other ranks, so we must place it on the first stack, and therefore we can’t record a pointer from it.
5 10 J 3 2 K 6 -> 4
--------------------- Q -> 6
8 -> 6
A 7 7 -> 6
4 8
9 6 Q
Now we place the 5 on the 6, and record a pointer from 5 to A.
10 J 3 2 K 6 -> 4
----------------- Q -> 6
8 -> 6
A 7 7 -> 6
4 5 8 5 -> A
9 6 Q
The 10 is now a greater rank than any of the cards on top of the stacks, so we start a new stack and record a pointer to the 7 on top of the previous one.
J 3 2 K 6 -> 4
------------- Q -> 6
8 -> 6
A 7 7 -> 6
4 5 8 5 -> A
9 6 Q 10 10 -> 7
The J again is a greater rank than anything on the stacks, so we begin a new one.
3 2 K 6 -> 4
--------- Q -> 6
8 -> 6
A 7 7 -> 6
4 5 8 5 -> A
9 6 Q 10 J 10 -> 7
J -> 10
The 3 can be placed atop the 5, so we record a pointer from 3 to A.
2 K 6 -> 4
----- Q -> 6
8 -> 6
A 3 7 7 -> 6
4 5 8 5 -> A
9 6 Q 10 J 10 -> 7
J -> 10
3 -> A
The same goes for the 2.
K 6 -> 4
- Q -> 6
8 -> 6
2 7 -> 6
A 3 7 5 -> A
4 5 8 10 -> 7
9 6 Q 10 J J -> 10
3 -> A
2 -> A
And finally, the K is a greater rank than all the others, so it must be placed on a new stack, creating a pointer to the previous one.
6 -> 4
Q -> 6
8 -> 6
2 7 -> 6
A 3 7 5 -> A
4 5 8 10 -> 7
9 6 Q 10 J K J -> 10
3 -> A
2 -> A
K -> J
Having exhausted the list, we now use the pointers we generated to find the longest increasing subsequence. We start with the card on top of the final stack, the K. We look up the pointer we stored for it, which is K -> J. Then we follow this pointer chain back to the start: J points to 10, 10 points to 7, and so on. Doing this gives us the chain:
K -> J -> 10 -> 7 -> 6 -> 4
If we reverse this, we get 4 6 7 10 J K, which is the longest increasing subsequence of the original shuffled list.
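The walkthrough above translates quite directly into code. Below is a sketch in Python (again, my own naming rather than any particular implementation). It keeps only the top card of each stack, because that is all the placement rule ever inspects; a binary search finds the leftmost stack whose top is greater than the current value, and recorded back-pointers let us recover the subsequence at the end:

```python
from bisect import bisect_right

def longest_increasing_subsequence(seq):
    # tops[s] is the value on top of stack s; top_idx[s] is that
    # value's index in seq; prev[i] is the index of the card that was
    # uppermost on the stack to the left when seq[i] was placed.
    tops, top_idx, prev = [], [], {}
    for i, x in enumerate(seq):
        # Leftmost stack whose top card is greater than x. The stack
        # tops are always in ascending order, so binary search works.
        s = bisect_right(tops, x)
        if s > 0:
            prev[i] = top_idx[s - 1]  # pointer to the left neighbour
        if s == len(tops):
            tops.append(x)            # start a new rightmost stack
            top_idx.append(i)
        else:
            tops[s] = x               # place x on top of stack s
            top_idx[s] = i
    if not tops:
        return []
    # Walk the pointer chain back from the top of the final stack.
    chain, i = [], top_idx[-1]
    while i is not None:
        chain.append(seq[i])
        i = prev.get(i)
    return chain[::-1]

# The card example, with A = 1, J = 11, Q = 12, K = 13:
cards = [9, 4, 6, 12, 8, 7, 1, 5, 10, 11, 3, 2, 13]
print(longest_increasing_subsequence(cards))
# [4, 6, 7, 10, 11, 13], i.e. 4 6 7 10 J K

# The right-hand line numbers from the musicians example:
print(longest_increasing_subsequence([3, 4, 2, 1, 5]))  # [3, 4, 5]
```

The buried cards never need to be stored: they influence the answer only through the pointers recorded when they were on top.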
Patience diff uses this technique on the list of pairings of unique lines. With the pairings sorted by the number on the left, it finds the longest increasing subsequence of the numbers on the right. This gives us a list of pairings where both line numbers are always increasing, and we can use those to align the two documents and break them up.
Although patience diff takes its name from this algorithm for finding the longest increasing subsequence, that’s not really the interesting thing about it. Any other method could be used here and it would work about as well. The important thing is that it finds unique matching lines; it proactively seeks out lines that might represent interesting content, whereas Myers doesn’t use any heuristics about the text’s content and only considers one line at a time as it makes decisions. By analysing the entire text before dividing it up, patience diff may take a little longer but it can avoid making the types of mistakes we saw in our initial example.
In the next article, we’ll look at implementing this algorithm in code.