The patience diff algorithm

If you enjoy this article, I have published a book explaining the internals of Git through implementation: Building Git.

Previously in this series, we explored the Myers diff algorithm, we learned a variation of it that uses linear space, then we saw how to use a diff algorithm to build three-way merge, and most recently we examined why merges fail and what can be done about it. In particular, we looked at an example of how our choice of diff algorithm can lead to merge conflicts, including conflicts that can mislead the reader about the changes they’re merging.

We have been using the following source code as our example:

void Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
{
    if (!Chunk_bounds_check(src, src_start, n)) return;
    if (!Chunk_bounds_check(dst, dst_start, n)) return;

    memcpy(dst->data + dst_start, src->data + src_start, n);
}

int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
{
    if (chunk == NULL) return 0;

    return start <= chunk->length && n <= chunk->length - start;
}

We made an edit to swap the order of the functions in this file. Using the original Myers algorithm, we get the following diff, which is easy for a reader to understand, cleanly separates the regions of change, and also matches up a lot of meaningful unique lines between the old and new versions, which tends to lead to fewer merge conflicts.

+int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
+{
+    if (chunk == NULL) return 0;
+
+    return start <= chunk->length && n <= chunk->length - start;
+}
+
 void Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
 {
     if (!Chunk_bounds_check(src, src_start, n)) return;
     if (!Chunk_bounds_check(dst, dst_start, n)) return;

     memcpy(dst->data + dst_start, src->data + src_start, n);
 }
-
-int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
-{
-    if (chunk == NULL) return 0;
-
-    return start <= chunk->length && n <= chunk->length - start;
-}

However, the linear-space version of Myers that Git actually uses does not handle this change well, and produces an unclear diff that’s hard for a reader to understand, does not cleanly separate changes from each other, and the lines it matches up are the blank lines and closing braces rather than unique lines of code, and so it leads to merge conflicts when this change is merged with other changes to the same functions.

-void Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
+int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
 {
-    if (!Chunk_bounds_check(src, src_start, n)) return;
-    if (!Chunk_bounds_check(dst, dst_start, n)) return;
+    if (chunk == NULL) return 0;

-    memcpy(dst->data + dst_start, src->data + src_start, n);
+    return start <= chunk->length && n <= chunk->length - start;
 }

-int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
+void Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
 {
-    if (chunk == NULL) return 0;
+    if (!Chunk_bounds_check(src, src_start, n)) return;
+    if (!Chunk_bounds_check(dst, dst_start, n)) return;

-    return start <= chunk->length && n <= chunk->length - start;
+    memcpy(dst->data + dst_start, src->data + src_start, n);
 }

To round out this series, I’d like to explore an entirely different diff algorithm, one that’s designed very differently from Myers and which Git gives you the option of using, precisely because it behaves better on certain inputs than Git’s default choice of linear-space Myers. The algorithm is called patience diff, and it was invented by BitTorrent creator Bram Cohen; there is a brief explanation of it on his blog. As usual, we’ll work through an example and then look at implementing the algorithm in code.

The first thing to note is that patience diff is not a diff algorithm in and of itself. What it really is is a method of matching up lines in two versions of a document in order to break them into smaller pieces, before using an actual diff algorithm like Myers on those pieces. The rationale for doing this is that Myers can sometimes pick uninteresting lines to match up, such as all the blank lines and braces in the above example, which can lead to anything from slightly annoying mismatches to useless conflict-causing diffs as above. To improve on this, patience diff begins by doing a full scan of both versions to match up the unique lines they have in common, which are more likely to be meaningful content and not boilerplate.

Let’s look at an example. Say we’re diffing the two sentence fragments

this is incorrect and so is this

and

this is good and correct and so is this

The intuitive reading of this change is that the word incorrect has been replaced with good and correct. Here are the two sentences written one word per line, with numbers:

    1   this                            1   this
    2   is                              2   is
    3   incorrect                       3   good
    4   and                             4   and
    5   so                              5   correct
    6   is                              6   and
    7   this                            7   so
                                        8   is
                                        9   this

Patience diff splits this problem up by trying to match up unique lines, that is lines that occur exactly once in both versions. The only line that occurs exactly once in both these sentences is the word so, and so we pair up line 5 on the left with line 7 on the right.

                                        1   this
                                        2   is
    1   this                            3   good
    2   is                              4   and
    3   incorrect                       5   correct
    4   and                             6   and

    5   so          <--------------->   7   so

    6   is                              8   is
    7   this                            9   this

We then apply the same process to the sub-sections of each version created by slicing them up at the places we’ve found matches, which I’ll mark with a *. That is, we compare lines 1–4 on the left with 1–6 on the right, and 6–7 on the left with 8–9 on the right. Within those regions we find some more unique pairings; lines that were not unique when scanning the entire document can be unique when we only consider a sub-section.

    1   this        <--------------->   1   this
    2   is          <--------------->   2   is

                                        3   good
    3   incorrect                       4   and
    4   and                             5   correct
                                        6   and
--------------------------------------------------------------------
*   5   so                              7   so
--------------------------------------------------------------------
    6   is          <--------------->   8   is
    7   this        <--------------->   9   this

If we record these matches and use them to further divide up the documents, we’re left with only one non-empty section to compare: lines 3–4 on the left and 3–6 on the right.

*   1   this                            1   this
--------------------------------------------------------------------
*   2   is                              2   is
--------------------------------------------------------------------
                                        3   good
    3   incorrect                       4   and
    4   and                             5   correct
                                        6   and
--------------------------------------------------------------------
*   5   so                              7   so
--------------------------------------------------------------------
*   6   is                              8   is
--------------------------------------------------------------------
*   7   this                            9   this

Within this region, there are no lines that occur exactly once on each side, so patience diff has run out of matches it can find and we fall back to Myers to calculate the diff of this region. Putting everything back together again, we get the diff we wanted:

  this
  is
- incorrect
+ good
+ and
+ correct
  and
  so
  is
  this

This is arguably an improvement on the original Myers algorithm, which produces this diff showing incorrect being replaced with correct and the words correct and being inserted separately.

  this
  is
- incorrect
+ good
  and
+ correct
+ and
  so
  is
  this

Now let’s introduce a complication. Imagine we want the diff of these two lists of musicians:

    1   David Axelrod                   1   The Slits
    2   Electric Prunes                 2   Gil Scott Heron
    3   Gil Scott Heron                 3   David Axelrod
    4   The Slits                       4   Electric Prunes
    5   Faust                           5   Faust
    6   The Sonics                      6   The Sonics
    7   The Sonics                      7   The Sonics

Most of the lines here are unique to each side and we can pair them up. Note that the final two lines don’t match up because they’re not unique.

                      1   <--------->   3
                      2   <--------->   4
                      3   <--------->   2
                      4   <--------->   1
                      5   <--------->   5

However, some of these pairings cross: if you drew lines between the two lists above, the lines would literally cross. That means we can’t use all the pairings to align the two documents with each other and split them into sub-sections. We’d just like to use as many of the pairings as we can, and to do this, we need to pick the longest subsequence from the above pairing list such that the number on the right hand side is always increasing.

For example, if we pick the pairings

                      1   <--------->   3
                      2   <--------->   4
                      5   <--------->   5

then we can split the documents up as follows:

                                        1   The Slits
                                        2   Gil Scott Heron
--------------------------------------------------------------------
    1   David Axelrod     <--------->   3   David Axelrod
    2   Electric Prunes   <--------->   4   Electric Prunes
--------------------------------------------------------------------
    3   Gil Scott Heron
    4   The Slits
--------------------------------------------------------------------
    5   Faust             <--------->   5   Faust
--------------------------------------------------------------------
    6   The Sonics                      6   The Sonics
    7   The Sonics                      7   The Sonics

Indeed, 3,4,5 is the longest increasing subsequence of the line numbers on the right. There are many algorithms for finding such sequences, and patience diff is so-called because it uses patience sorting to do it. Let’s take a detour to find out how that works.

Imagine we have all the ranks from one suit in a deck of cards, in some random order, as shown below.

9   4   6   Q   8   7   A   5   10  J   3   2   K

We wish to find the longest increasing subsequence in this list, that is the longest sequence of cards we can pluck from this list such that the cards in the subsequence are in the same order that they appear in above, and are in increasing rank. For example, 4 6 7 is a possible increasing subsequence, as is 8 10 J K. The sequence 7 A 5 is not an increasing subsequence because although those cards appear in that order above, A is not a greater rank than 7 (aces are low in this example). Take a couple of minutes to try to find longest increasing subsequence by hand.

Patience sorting works by analogy to the card game patience, also known as solitaire, where cards must be sorted into stacks of descending rank. Let’s run through this algorithm on our example.

First, we shift the 9 off the front of the list, and place it in a new stack on its own.

4   6   Q   8   7   A   5   10  J   3   2   K
---------------------------------------------

9

Once we have our first stack established, the algorithm proceeds as follows. For each card, find the leftmost stack whose uppermost card’s rank is greater than the current card, and place the card on that stack. In our example, our next card is 4, and the first stack has a 9 uppermost, so we can place the 4 on this first stack.

6   Q   8   7   A   5   10  J   3   2   K
-----------------------------------------

4
9

The next card is a 6, which is greater than 4, so we cannot place it on the first stack and we must start a new one to the right of the first. Whenever we place a card on any stack other than the leftmost one, we record a pointer from the card we just placed to the card that’s uppermost on the stack immediately to the left. In this case this means we record a pointer from 6 to 4, since 4 is on top of the first stack.

Q   8   7   A   5   10  J   3   2   K       6 -> 4
-------------------------------------

4
9   6

Next we have a Q, which is greater than all the cards stacked so far. So again we place this to the right of all the others and record a pointer from the Q to the 6 at the top of the stack to its left.

8   7   A   5   10  J   3   2   K           6 -> 4
---------------------------------           Q -> 6

4
9   6   Q

Next we have the 8, which is greater than 4 and 6 but less than Q, so we place it on the Q and record a pointer from 8 to 6.

7   A   5   10  J   3   2   K               6 -> 4
-----------------------------               Q -> 6
                                            8 -> 6
4       8
9   6   Q

The same argument applies to the 7.

A   5   10  J   3   2   K                   6 -> 4
-------------------------                   Q -> 6
                                            8 -> 6
        7                                   7 -> 6
4       8
9   6   Q

The A is lower than all the other ranks, so we must place it on the first stack, and therefore we can’t record a pointer from it.

5   10  J   3   2   K                       6 -> 4
---------------------                       Q -> 6
                                            8 -> 6
A       7                                   7 -> 6
4       8
9   6   Q

Now we place the 5 on the 6, and record a pointer from 5 to A.

10  J   3   2   K                           6 -> 4
-----------------                           Q -> 6
                                            8 -> 6
A       7                                   7 -> 6
4   5   8                                   5 -> A
9   6   Q

The 10 is now a greater rank than any of the cards on the top of the stacks, so we start a new stack and record a pointer to the previous one.

J   3   2   K                               6 -> 4
-------------                               Q -> 6
                                            8 -> 6
A       7                                   7 -> 6
4   5   8                                   5 -> A
9   6   Q  10                              10 -> 7

The J again is a greater rank than anything on the stacks, so we begin a new one.

3   2   K                                   6 -> 4
---------                                   Q -> 6
                                            8 -> 6
A       7                                   7 -> 6
4   5   8                                   5 -> A
9   6   Q  10   J                          10 -> 7
                                            J -> 10

The 3 can be placed atop the 5, so we record a pointer from 3 to A.

2   K                                       6 -> 4
-----                                       Q -> 6
                                            8 -> 6
A   3   7                                   7 -> 6
4   5   8                                   5 -> A
9   6   Q  10   J                          10 -> 7
                                            J -> 10
                                            3 -> A

The same goes for the 2.

K                                           6 -> 4
-                                           Q -> 6
                                            8 -> 6
    2                                       7 -> 6
A   3   7                                   5 -> A
4   5   8                                  10 -> 7
9   6   Q  10   J                           J -> 10
                                            3 -> A
                                            2 -> A

And finally, the K is a greater rank than all the others, so must be placed on a new stack, creating a pointer to the previous one.

                                            6 -> 4
                                            Q -> 6
                                            8 -> 6
    2                                       7 -> 6
A   3   7                                   5 -> A
4   5   8                                  10 -> 7
9   6   Q  10   J   K                       J -> 10
                                            3 -> A
                                            2 -> A
                                            K -> J

Having exhausted the list, we now use the list of pointers we generated to find the longest increasing subsequence. We start with the card on top of the final stack, the K. We look up the pointer we stored for it, which is K -> J. Then we follow this pointer chain back to the start: J points to 10, 10 points to 7, and so on. Doing this gives us the chain:

K -> J -> 10 -> 7 -> 6 -> 4

If we reverse this, we get 4 6 7 10 J K, which is the longest increasing subsequence of the original shuffled list.

Patience diff uses this technique on the list of pairings of unique lines. With the pairings sorted by the number on the left, it finds the longest increasing subsequence of the numbers on the right. This gives us a list of pairings where both line numbers are always increasing, and we can use those to align the two documents and break them up.

Although patience diff takes its name from this algorithm for finding the longest increasing subsequence, that’s not really the interesting thing about it. Any other method could be used here and it would work about as well. The important thing is that it finds unique matching lines; it proactively seeks out lines that might represent interesting content, whereas Myers doesn’t use any heuristics about the text’s content and only considers one line at a time as it makes decisions. By analysing the entire text before dividing it up, patience diff may take a little longer but it can avoid making the types of mistakes we saw in our initial example.

In the next article, we’ll look at implementing this algorithm in code.