If you enjoy this article, I’m working on a book explaining the internals of Git through implementation: Building Git.
—
The interesting thing about diff algorithms is that they’re a mix of computer science and human factors. Judged by the length of their edit sequences, there are many equally good diffs between two files, and choosing between them requires an algorithm that can best match people’s intuition about how their code has changed.
As luck would have it, as soon as I had finished up my previous articles on Myers diffs and gone back to make progress on another project, I stumbled into a case where Git produced a confusing diff for a file I’d just changed, and I had to know why. Here’s the portion of code I had been working on. It’s a couple of C functions that copy bytes from one buffer to another, checking the sizes of the requested regions to make sure they’re within the buffer:
size_t Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
{
    if (!Chunk_bounds_check(src, src_start, n)) return 0;
    if (!Chunk_bounds_check(dst, dst_start, n)) return 0;

    memcpy(dst->data + dst_start, src->data + src_start, n);

    return n;
}

int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
{
    if (chunk == NULL) return 0;

    size_t length = chunk->length;

    return start <= length && n <= length - start;
}
The change I made involved swapping the order of these two functions, that is:
int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
{
    if (chunk == NULL) return 0;

    size_t length = chunk->length;

    return start <= length && n <= length - start;
}

size_t Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
{
    if (!Chunk_bounds_check(src, src_start, n)) return 0;
    if (!Chunk_bounds_check(dst, dst_start, n)) return 0;

    memcpy(dst->data + dst_start, src->data + src_start, n);

    return n;
}
Looking at this change, one would expect a diff for it to show the shorter function Chunk_bounds_check being moved to appear before Chunk_copy, i.e.:
+ int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
+ {
+     if (chunk == NULL) return 0;
+
+     size_t length = chunk->length;
+
+     return start <= length && n <= length - start;
+ }
+
  size_t Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
  {
      if (!Chunk_bounds_check(src, src_start, n)) return 0;
      if (!Chunk_bounds_check(dst, dst_start, n)) return 0;

      memcpy(dst->data + dst_start, src->data + src_start, n);

      return n;
  }
-
- int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
- {
-     if (chunk == NULL) return 0;
-
-     size_t length = chunk->length;
-
-     return start <= length && n <= length - start;
- }
And indeed, if you try running these two code snippets through our previous Myers implementation, this is exactly what you get. However, when you ask Git to compare these versions, here’s what happens:
- size_t Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
+ int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
  {
-     if (!Chunk_bounds_check(src, src_start, n)) return 0;
-     if (!Chunk_bounds_check(dst, dst_start, n)) return 0;
+     if (chunk == NULL) return 0;

-     memcpy(dst->data + dst_start, src->data + src_start, n);
+     size_t length = chunk->length;

-     return n;
+     return start <= length && n <= length - start;
  }

- int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
+ size_t Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
  {
-     if (chunk == NULL) return 0;
+     if (!Chunk_bounds_check(src, src_start, n)) return 0;
+     if (!Chunk_bounds_check(dst, dst_start, n)) return 0;

-     size_t length = chunk->length;
+     memcpy(dst->data + dst_start, src->data + src_start, n);

-     return start <= length && n <= length - start;
+     return n;
  }
This diff is correct; that is, it is a legal and minimal edit sequence that transforms the first version into the second, and it contains exactly as many changes as the “expected” diff. However, it is a poor-quality diff in that it fails to convey the high-level meaning of the change to a programmer. It presents the change as a series of arbitrarily interleaved deletions and insertions, whereas the expected diff clearly shows the two functions switching places.
So, why is it that Git produces this diff rather than the one we’d expect? We can begin to answer this by drawing out the edit graph for this change, just as we did when we first investigated the Myers algorithm.
[Figure: the 19×19 edit graph for the two versions of the file, with the old version along the x-axis, the new version along the y-axis, and diagonal edges wherever a line appears in both versions.]
You’ll notice this graph contains two major diagonals: one of length 9 beginning at (0,9) and ending at (9,18), and one of length 8 beginning at (10,0) and ending at (18,8). These diagonals represent the two functions in the code that are not modified by the change; they simply trade places. Apart from these, there are many scattered short diagonals that correspond to braces and blank lines that appear in both versions. These types of lines appear multiple times in the text, whereas all the actual lines of code are unique and are matched just once between versions.
Now, the path corresponding to the expected diff is this one, where we first travel from (0,0) to (0,9), inserting the Chunk_bounds_check function, then from (0,9) to (9,18) to display the Chunk_copy function unmodified, then from (9,18) to (18,18) to delete the Chunk_bounds_check from its previous location.
[Figure: the expected path through the edit graph: downward from (0,0) to (0,9), diagonally from (0,9) to (9,18), then rightward from (9,18) to (18,18).]
This path contains 9 diagonals, and recall that the aim of the Myers algorithm is to maximise the number of diagonals taken – that is, to find the longest common subsequence – so as to make as few changes as possible. However, it turns out that this path is not unique in containing 9 diagonals; for example, here is another path containing the same number:
[Figure: another path through the edit graph, also containing 9 diagonals, zig-zagging through the short diagonals formed by braces and blank lines.]
If you compare this with the diffs above, you’ll see that this is the path taken by Git. (There are many other possible paths through these diagonals that you can get if you vary the order of rightward and downward moves; this is just the one Git happens to take.) Because the number of matches between braces and blank lines – tokens that happen to be the same but aren’t significant lines of program text – is equal to the number of lines in the entire matched Chunk_copy function, both paths are equally legitimate in terms of minimising the number of edits. However, the path Git takes does not produce a useful diff for a human reader, and it will produce more merge conflicts than the expected diff because it doesn’t do as good a job of separating sections of changes from each other.
But before we can look at how to fix these problems, we need to understand what causes them. Given that the Myers algorithm we’ve seen so far does not choose this path, why does Git choose it?
The answer is that Git doesn’t use the algorithm we implemented, it uses a variation of it. The original Myers algorithm takes a quadratic amount of space to calculate the edit sequence. The central trick in the Myers algorithm is that rather than requiring a general-purpose graph search algorithm, which would explore and store all possible branches, we can find the length of the longest common subsequence using only a single array of size proportional to N + M, the sum of the lengths of the two strings. However, if we want not just the length but the actual edit sequence, this requires storing a copy of the array as we move through the graph so we can backtrack at the end. In the worst case, it will take N + M moves to traverse the graph, and so the algorithm requires space proportional to (N + M)².
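To make that concrete, here is a sketch of the length-only forward scan, which needs just one array indexed by the diagonal number k. This is my own Python, not the implementation from the earlier articles, and the function name is made up for illustration:

```python
def shortest_edit_distance(a, b):
    """Length of the shortest edit script (deletions plus insertions)
    turning sequence a into sequence b, using a single mapping of
    furthest-x values indexed by the diagonal k = x - y."""
    n, m = len(a), len(b)
    v = {1: 0}                        # sentinel entry so the d = 0 round works
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            # Choose between moving downward (an insertion) and
            # rightward (a deletion) from the furthest previous points.
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]
            else:
                x = v[k - 1] + 1
            y = x - k
            # Follow the free diagonal moves: lines that match.
            while x < n and y < m and a[x] == b[y]:
                x, y = x + 1, y + 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return n + m
```

Run over the two versions of the file above, this should return 18: nine deletions plus nine insertions, which both the expected diff and Git’s diff achieve. To recover the actual path rather than just its length, the original algorithm has to snapshot v at every value of d, and that is what costs quadratic space.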
Requiring quadratic space is generally not a good property for an algorithm to have, since it rapidly becomes unfeasible to run the algorithm in memory as the input size increases. But fortunately, in the same paper where the original algorithm appears, Myers presents a modified version of it that runs in linear space, that is, space proportional to the sum of the string lengths. It is explained briefly in section 4b, but I’ll give a worked example for our case here.
Whereas the original algorithm deterministically walks from the top-left to the bottom-right of the graph in a single pass, the linear-space version uses a divide-and-conquer approach. The edit sequence is made up of a series of what Myers calls snakes: a single rightward or downward step followed by zero or more diagonal ones. The linear-space version works by finding the middle snake of a possible edit path – a snake that crosses the halfway distance from top-left to bottom-right – and using its endpoints to divide the original graph into two smaller regions. It then works recursively on these regions until they’re so small that no further work is required.
Here’s an example: suppose that you were somehow able to determine that there exists a snake from (8,7) to (11,9), halfway between (0,0) and (18,18) in our example graph. (We will examine how to find this middle snake later.) This divides the original problem of getting from (0,0) to (18,18) into two smaller problems: getting from (0,0) to (8,7), and from (11,9) to (18,18). We can visualise this split like so:
[Figure: the edit graph divided by the middle snake from (8,7) to (11,9) into the two regions (0,0)–(8,7) and (11,9)–(18,18), with the points removed from the problem marked @.]
I have marked with a @ symbol those points that have been removed from the problem and are not part of any remaining subregion of the graph.
Having split the problem in two like this, we can apply the same technique to the subregions. In the (0,0)–(8,7) region there is a middle snake from (4,2) to (5,4), splitting the box into (0,0)–(4,2) and (5,4)–(8,7). In the (11,9)–(18,18) region the middle snake is from (13,13) to (15,14), splitting the box into (11,9)–(13,13) and (15,14)–(18,18).
[Figure: the graph after splitting each region again, leaving the four boxes (0,0)–(4,2), (5,4)–(8,7), (11,9)–(13,13) and (15,14)–(18,18).]
Applying the process yet again to these four boxes gives the following splits:
- (0,0)–(4,2) → (0,0)–(1,1), (3,2)–(4,2)
- (5,4)–(8,7) → (5,4)–(6,5), (8,6)–(8,7)
- (11,9)–(13,13) → (11,9)–(13,11), (13,11)–(13,13)
- (15,14)–(18,18) → (15,14)–(16,16), (17,16)–(18,18)
Notice that some of the resulting boxes have zero area, and some have corner points in common. This is perfectly normal; for example, in the space from (11,9) to (13,13) below there are two resulting subregions: a 2-by-2 box, (11,9)–(13,11), and a 0-by-2 box, (13,11)–(13,13).
[Figure: the graph after the third round of splits, showing the eight smaller boxes listed above.]
One final pass and you can see there are very few points left to visit, and all of them are part of zero-area boxes where the path from one end to the other is obvious. What we’re left with is the path followed by Git’s diff.
[Figure: the final pass; the few remaining points trace out exactly the path taken by Git’s diff.]
This is a neat trick, and it’s actually the reason this variant of the algorithm can discover multiple paths through the graph, as we’ll see shortly. But to get it to work, we need to know how to find those middle snakes. This seems like a chicken-and-egg problem: in order to make a working diff algorithm, we first need something that can find optimal edit paths, which is just what the original Myers algorithm does!
In fact, this is exactly what happens. To find the middle snake, we run the original Myers algorithm both forward from the top-left, and backward from the bottom-right, until we find a point at which values on the same diagonal meet each other. That probably sounds a bit cryptic, so let’s look at a small example.
Say we’re working on the (0,0)–(4,2) box. Let’s reproduce it here, labelled with its (x, y) values, and also with each diagonal labelled with its value of k. Recall that k = x − y and it either increases or decreases by 1 each time we make a move. Similarly, when working on the backward scan, we will label diagonals with a value c which is zero at the bottomright corner, increases with each step upward, and decreases with each step leftward. The relationship between k and c is:
k = c + delta, where delta = width − height, the difference between the box’s dimensions.
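This relationship is easy to check mechanically. Here is a tiny sanity check of my own (not part of any diff implementation) over every point of a 4-by-2 box like the one we are about to scan:

```python
# Sanity check: k = c + delta at every point of a box of width N = 4
# and height M = 2, like the (0,0)-(4,2) box below.
N, M = 4, 2
delta = N - M
for x in range(N + 1):
    for y in range(M + 1):
        k = x - y                  # forward diagonal, zero at the top-left
        c = (M - y) - (N - x)      # backward diagonal: zero at the
                                   # bottom-right corner, +1 per step up,
                                   # -1 per step left
        assert k == c + delta
print("k = c + delta holds everywhere")
```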
[Figure: the (0,0)–(4,2) box drawn twice: once with its diagonals labelled with their k-values for the forward scan, and once with their c-values, which are zero at the bottom-right corner, for the backward scan.]
We run the original algorithm forward from (0,0) and backward from (4,2). Let’s begin exploring the graph, marking our progress as we go. Just like our previous graph-walking explorations, these diagrams place d, the number of moves taken, on the horizontal axis, and k, the number of the current diagonal, on the vertical axis, and record (x, y) points as we walk the graph. But unlike before, we also have the vertical axis labelled with c, which I’ve displayed on the right-hand side, and we’ll include the points in the backward scan beginning from this side. Points on the same horizontal line in the forward and backward scan are on the same diagonal in the edit graph, and the forward and backward scans will overlap when a point in the forward scan has a greater x value than a point on the same diagonal in the backward scan.
We begin by marking our starting positions at opposite ends of a grid. (0,0) appears at (d, k) = (0,0), and (4,2) at (d, c) = (0,0).
[Figure: the starting state of the trace in (d, k) space: (0,0) recorded at (d, k) = (0,0) on the left, and (4,2) at (d, c) = (0,0) on the right.]
Next we explore the graph to one level deep; that is, we move to (d, k) = (1,−1) and (1,1), and similarly for (d, c).
Moving downward from (0,0) lands us at (0,1), and moving rightward at (1,0). Likewise, moving leftward from (4,2) takes us to (3,2), and moving upward to (4,1). So far, no diagonal moves have taken place.
[Figure: the trace after one move in each direction: the forward scan has reached (0,1) and (1,0); the backward scan has reached (3,2) and (4,1).]
Just moving this far, there’s the opportunity for the forward and backward parts to meet: the points (1,0) and (3,2) are on the same k diagonal. However, (1,0) is not equal to, or to the right of, (3,2), so these paths have not met yet.
Let’s take one more move. We can move downward from (0,1) to reach (0,2), and from (1,0) to (2,2), taking a diagonal move. We can move rightward from (1,0) to (2,0). We can move leftward from (3,2) to (1,1), again taking the same diagonal but in the opposite direction, and upward from (3,2) to (3,1). We can also move upward from (4,1) to (4,0).
[Figure: the trace after two moves: the forward scan has reached (0,2), (2,2) and (2,0); the backward scan has reached (1,1), (3,1) and (4,0).]
Now, we’ve found an overlap: the point (2,2) was found at (d, k) = (2,0), and the point (1,1) at (d, c) = (2,−2), which, given the relationship between k and c, means they live on the same diagonal. And (2,2) on the forward path in (x, y) space is to the right of (1,1) on the backward path, and so these paths have overlapped. This means we’ve found our middle snake.
When delta is even, the optimal path length will also be even, and so we will find an overlap after taking both a forward and a backward move on each iteration of d. But when it is odd, we will find the overlap after just taking a forward move on the final iteration, after several turns of both forward and backward moves. And so by convention, when delta is odd, we consider the forward move we just made to be the middle snake, whereas when it’s even, we use the last backward move.
In this case, delta is even and so the middle snake is the last backward move, from (3,2) to (1,1), or, placing the points in ascending order, (1,1) to (3,2). Therefore we can split our original box into (0,0)–(1,1) and (3,2)–(4,2).
[Figure: the (0,0)–(4,2) box split by the middle snake from (1,1) to (3,2) into the boxes (0,0)–(1,1) and (3,2)–(4,2).]
Notice that, although I’ve drawn the lines of motion on the traces above, that’s purely to help visualise what’s going on. Since we’re only doing the forward scan of the original algorithm and we’re not going to backtrack at the end to calculate the full path, we don’t need to keep copies of the history of the walk as we go; we only need to keep the current state. That’s what allows this variant to work in linear space: we only need to allocate two arrays of size width + height + 1, half the size of the array in the original version, since we’re only going to scan halfway. We use one array for the forward scan and one for the backward scan.
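The whole procedure, the two opposing scans plus the recursive splitting, can be condensed into a short sketch. This is my own Python, not the implementation from the earlier articles and not Git’s C; the names find_middle_snake and myers_diff are made up for illustration, and ties between overlapping snakes are broken arbitrarily rather than scored as Git does:

```python
def find_middle_snake(a, b):
    """Find a middle snake for non-empty sequences a and b by scanning
    forward from the top-left and backward from the bottom-right until
    the scans overlap on some diagonal.  Returns (sx, sy, ex, ey), the
    endpoints of the snake's diagonal run, with a[sx:ex] == b[sy:ey]."""
    n, m = len(a), len(b)
    delta = n - m
    vf = {1: 0}   # forward scan: furthest x reached on each diagonal k
    vb = {1: 0}   # backward scan, run on the reversed sequences
    for d in range((n + m + 1) // 2 + 1):
        # One more round of forward moves; the range of diagonals is
        # clipped so we never step outside the edit graph.
        for k in range(-(d - 2 * max(0, d - m)), d - 2 * max(0, d - n) + 1, 2):
            if k == -d or (k != d and vf[k - 1] < vf[k + 1]):
                x = vf[k + 1]          # downward move: an insertion
            else:
                x = vf[k - 1] + 1      # rightward move: a deletion
            y = x - k
            sx, sy = x, y
            while x < n and y < m and a[x] == b[y]:
                x, y = x + 1, y + 1    # free diagonal moves
            vf[k] = x
            kb = delta - k             # same diagonal, backward numbering
            if (delta % 2 == 1 and -(d - 1) <= kb <= d - 1
                    and x + vb.get(kb, -1) >= n):
                return sx, sy, x, y    # odd delta: overlap on a forward move
        # One more round of backward moves: the same scan, mirrored.
        for k in range(-(d - 2 * max(0, d - m)), d - 2 * max(0, d - n) + 1, 2):
            if k == -d or (k != d and vb[k - 1] < vb[k + 1]):
                x = vb[k + 1]
            else:
                x = vb[k - 1] + 1
            y = x - k
            sx, sy = x, y
            while x < n and y < m and a[n - 1 - x] == b[m - 1 - y]:
                x, y = x + 1, y + 1
            vb[k] = x
            kf = delta - k
            if (delta % 2 == 0 and -d <= kf <= d
                    and x + vf.get(kf, -1) >= n):
                # Even delta: overlap on a backward move; convert the
                # snake back into top-left-origin coordinates.
                return n - x, m - y, n - sx, m - sy

def myers_diff(a, b):
    """Linear-space diff: a list of ('-', line), ('+', line) and
    ('=', line) operations transforming a into b."""
    out = []
    def walk(x0, y0, x1, y1):
        if x0 == x1:                   # nothing left to delete
            out.extend(('+', b[j]) for j in range(y0, y1))
        elif y0 == y1:                 # nothing left to insert
            out.extend(('-', a[i]) for i in range(x0, x1))
        else:
            sx, sy, ex, ey = find_middle_snake(a[x0:x1], b[y0:y1])
            walk(x0, y0, x0 + sx, y0 + sy)
            out.extend(('=', a[i]) for i in range(x0 + sx, x0 + ex))
            walk(x0 + ex, y0 + ey, x1, y1)
    walk(0, 0, len(a), len(b))
    return out
```

For the two file versions above this should produce a minimal script of eighteen deletions and insertions; exactly which of the equally minimal scripts you get depends on how ties between overlapping snakes are broken, which is what the rest of this article explores.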
Let’s do one more example; this will be in the middle of the grid and also shows up a situation where the algorithm can make choices between equally good paths. We’ll take the (11,9)–(13,13) box.
[Figure: the (11,9)–(13,13) box drawn twice, labelled with its k-values for the forward scan and its c-values for the backward scan.]
Just as before, we mark our starting points in appropriate positions in (d, k) space.
[Figure: the starting state of the trace for this box: (11,9) recorded at (d, k) = (0,0) and (13,13) at (d, c) = (0,0).]
We follow the graph for one move. These moves are all a single step, except for the move downward from (11,9), which can take a diagonal to (12,11). So far, no overlaps are visible.
[Figure: the trace after one move: the forward scan has reached (12,9) and (12,11); the backward scan has reached (13,12) and (12,13).]
Now one more move. After this round, there are two overlaps: on the k=−2 diagonal both paths have found (12,12), and on the k=0 diagonal both have found (13,11). We now need to make a choice about which move we’re going to consider our middle snake.
[Figure: the trace after two moves, showing the two overlaps: both scans have found (12,12) on the k = −2 diagonal, and (13,11) on the k = 0 diagonal.]
We know from the eventual outcome that Git picks (13,11) as the midpoint here, but why does it do that? To find out, I went spelunking in xdiffi.c in the Git source tree. This is actually part of LibXDiff, originally developed by Davide Libenzi and then imported and modified by the Git authors.
In that file you will find this section that is triggered after a potential middle snake has been found. The variable names are a bit obtuse, but a few are worth pointing out:

- ec is the edit cost, the quantity the Myers paper and our code calls d
- d is the value we’ve referred to as k, the number of the diagonal you’re inspecting
- fmin, fmax and fmid are the minimum, maximum and middle values of k for the current depth
- dd is the absolute distance of the current point from the k = 0 diagonal (or c = 0 in the backward direction)
- i1 and i2 are the horizontal and vertical coordinates in our graphs respectively
- off1 and off2 are the coordinates of the top-left of the current box
- lim1 and lim2 are the coordinates of the bottom-right of the current box
- kvdf and kvdb are the arrays storing the forward and backward state of the scan
- v is the score given to a snake to determine how interesting it is
Looking at the code, we see that it calculates v = (i1 − off1) + (i2 − off2) − dd if the middle snake is in the forward direction, and v = (lim1 − i1) + (lim2 − i2) − dd if it’s in the backward direction. That is, it’s the sum of the snake’s relative coordinates within the box, minus its distance from the central diagonal. It then compares this value to the others it found and uses it to pick the best snake.
There is one further subtlety here: whereas the algorithm in Myers’ paper says that we should only consider backward moves as candidates here (because delta is even), Git seems to consider snakes in both directions in the code section linked above, starting with forward moves. In this example, that means comparing the forward moves ending at (12,12) and (13,11), or (1,3) and (2,2) in relative coordinates from the top-left. (12,12) lies on the k = −2 diagonal whereas (13,11) is on k = 0. Let’s calculate v for both of them:
- (12,12): v = 1 + 3 − 2 = 2
- (13,11): v = 2 + 2 − 0 = 4
So, (13,11) gets the better score and wins, and we pick the (12,11) to (13,11) move as the middle snake. This splits the box into (11,9)–(12,11) and (13,11)–(13,13).
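To double-check the arithmetic, here is a small throwaway helper of my own that paraphrases the v calculation using the variable names listed above; it is not Git’s actual code:

```python
def forward_snake_score(i1, i2, off1, off2):
    """Score a forward snake ending at (i1, i2) inside the box whose
    top-left corner is (off1, off2): the sum of the snake's relative
    coordinates, minus its distance from the box's central diagonal."""
    rel1, rel2 = i1 - off1, i2 - off2
    dd = abs(rel1 - rel2)          # distance from the k = 0 diagonal
    return rel1 + rel2 - dd

# The two candidate endpoints in the (11,9)-(13,13) box:
print(forward_snake_score(12, 12, 11, 9))   # (12,12) on k = -2: prints 2
print(forward_snake_score(13, 11, 11, 9))   # (13,11) on k = 0: prints 4
```

As in the calculation above, (13,11) scores 4 against (12,12)’s 2, so the snake ending at (13,11) wins.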
[Figure: the (11,9)–(13,13) box split by the middle snake from (12,11) to (13,11) into the boxes (11,9)–(12,11) and (13,11)–(13,13).]
You might notice this is not identical to the split chosen in my original trace of the recursive procedure. This doesn’t matter, partly because we end up picking the same overall path here anyway, but also because there are endless small variations in how one can pick a snake. I’ve elided some details of how Git does this, and the 18-by-18 trace above is based on my own implementation, where I made some slightly different implementation choices to force the same result.
So, this process shows why Git can choose a different path from the obvious one: rather than searching for a single point, the bottom-right, as the original algorithm does, this algorithm may find multiple possible candidates for a midpoint, and faces the problem (or opportunity!) of choosing between them. The factors governing this choice are ripe for exploration, and indeed we will dig in further when we implement this algorithm in the following article.