If you enjoy this article, I’m working on a book explaining the internals of Git through implementation: Building Git.
—
This is the final installment in my series on the Myers diff algorithm, the default diff algorithm used by Git. In part 1 we established a model for the algorithm as finding the shortest path through a graph representing all the possible edit sequences to convert one string into another. In part 2 we changed the representation of this graph walk into a different coordinate space that allows a few optimisations, and from there we got to a working implementation to find the smallest number of edits required:
def shortest_edit(a, b)
n, m = a.size, b.size
max = n + m
v = Array.new(2 * max + 1)
v[1] = 0
(0 .. max).step do d
(d .. d).step(2) do k
if k == d or (k != d and v[k  1] < v[k + 1])
x = v[k + 1]
else
x = v[k  1] + 1
end
y = x  k
while x < n and y < m and a[x] == b[y]
x, y = x + 1, y + 1
end
v[k] = x
return d if x >= n and y >= m
end
end
end
The final step to complete the algorithm requires us to backtrack. The graph search lets us explore all possible edit paths until we find the smallest path necessary to reach the end. Once we’ve reached the end, we can look at the data we’ve recorded in reverse to figure out which single path led to the result.
Let’s look at our example again. Recall that the shortest_edit
function
records the best value of x we can reach at each (d, k)
position:
 0 1 2 3 4 5
+

4  7

3  5

2  3 7

1  1 5 7

0  0 2 5

1  0 4 5

2  2 4

3  3
We know the final position is at (x, y) = (7,6), so k = 1, and we found it after d = 5 steps. From (d, k) = (5,1), we can track backward to either (4,0) or (4,2):
 0 1 2 3 4 5
+

4  7

3  5

2  3 ( 7 )

1  1 5 [ 7 ]

0  0 2 ( 5 )

1  0 4 5

2  2 4

3  3
We see that (4,2) contains the higher x value, and so we must have reached (5,1) via a downward move from there. This tells us that the (x, y) position before (7,6) was (7,5).
 0 1 2 3 4 5
+

4  7

3  5

2  3 7
 \
1  1 5 7

0  0 2 5

1  0 4 5

2  2 4

3  3
Via a similar argument, to move back from (d, k) = (4,2) we must have come from (3,1) or (3,3). (3,3) does not have a greater x than (3,1), so we know this was a rightward move from (3,1), or from (x, y) = (5,4) to (7,5).
 0 1 2 3 4 5
+

4  7

3  5

2  3 7
 / \
1  1 5 7

0  0 2 5

1  0 4 5

2  2 4

3  3
Before that, we have a choice between (d, k) = (2,0) and (2,2), and (2,2) has the higher x value.
 0 1 2 3 4 5
+

4  7

3  5

2  3 7
 \ / \
1  1 5 7

0  0 2 5

1  0 4 5

2  2 4

3  3
After this, we’re on the edge of the grid and we can only have made rightward moves.
 0 1 2 3 4 5
+

4  7

3  5

2  3 7
 / \ / \
1  1 5 7
 /
0  0 2 5

1  0 4 5

2  2 4

3  3
We’ve walked all the way back to the start, and now we know that the (x, y) positions of each move are:
(0,0) > (1,0) > (3,1) > (5,4) > (7,5) > (7,6)
These positions are enough to figure out the diagonal moves between each position: we decrement both x and y until one of them is equal to the values in the previous position, and then we end up a single downward or rightward move away from that position. For example, to get from (5,4) back to (3,1), we observe that 5 > 3 and 4 > 1, and so we can step diagonally back one step to (4,3). This still has neither coordinate equal to (3,1) so we take another step back to (3,2). Now, we’ve reached the same x value as (3,1), so we must make an upward step from (3,2) to (3,1) to complete the move.
To perform this backtracking, we need to make a small adjustment to
shortest_edit
. Rather than just recording the latest x value for each
k, we need to keep a trace of the array before each turn of the outer
loop.
We add the variable trace
to store these snapshots of v
, and push a copy of
v
into it on each loop. Rather than just returning the number of moves
required, we now return this trace to the caller.
v = Array.new(2 * max + 1)
v[1] = 0
trace = []
(0 .. max).step do d
trace << v.clone
(d .. d).step(2) do k
# calculate the next move...
return trace if x >= n and y >= m
At the end of the function, trace
will contain enough information to let us
reconstruct the path that leads to the final position. It essentially contains
the best x value that we found at each (d, k), as we saw
earlier. Each copy of v
that we stored corresponds to a column from our
search history.
Now we can implement the backtracking in code. We can write a function that
takes a trace produced by the shortest_edit
function, and the target final
(x, y) position, and works out the backtrace. For each move, it
yields the x and y values before and after the move.
The function iterates over the trace in reverse order. each_with_index
generates a list that pairs each copy of v
in the trace up with its
corresponding d
value, and then reverse_each
iterates over that list
backwards.
def backtrack(trace, x, y)
trace.each_with_index.reverse_each do v, d
At each step of the trace, we calculate the k
value, and then determine
what the previous k
would have been, using the same logic as in the
shortest_edit
function:
k = x  y
if k == d or (k != d and v[k  1] < v[k + 1])
prev_k = k + 1
else
prev_k = k  1
end
From that previous k
value, we can retrieve the previous value of x
from the
trace, and use these k
and x
values to calculate the previous y
.
prev_x = v[prev_k]
prev_y = prev_x  prev_k
Then we begin yielding moves back to the caller. If the current x
and y
values are both greater than the previous ones, we know we can make a diagonal
move, so we yield a move from x1, y1
to x, y
and then decrement x
and
y
as long as this condition holds.
while x > prev_x and y > prev_y
yield x  1, y  1, x, y
x, y = x  1, y  1
end
Finally, we yield a move from the previous x
and y
from the trace, to the
position we reached after following diagonals. This should be a single downward
or rightward step. If d
is zero, there is no previous position to move back
to, so we skip this step in that case. After yielding this move, we set x
and
y
to their values in the previous round, and continue the loop.
yield prev_x, prev_y, x, y if d > 0
x, y = prev_x, prev_y
end
end
Running this on our example inputs, this function yields the following pairs of positions, which you should verify correspond to the full path through the edit graph, in reverse:
(7, 5) > (7, 6)
(6, 4) > (7, 5)
(5, 4) > (6, 4)
(4, 3) > (5, 4)
(3, 2) > (4, 3)
(3, 1) > (3, 2)
(2, 0) > (3, 1)
(1, 0) > (2, 0)
(0, 0) > (1, 0)
To glue everything together, we need a function that takes two texts to diff,
passes them as lists of lines into shortest_edit
to get a trace, and then runs
that trace through backtrack
to generate a diff: a sequence of lines marked as
either deletions, insertions, or equal lines.
As backtrack
yields each pair of positions in reverse, this function adds a
diff line to the front of a list. If the x values are the same between
positions, we know it’s a downward move, or an insertion; if the y values
are the same then it’s a deletion, otherwise the lines are equal. To build the
diff, we pop a line off the end of each file as appropriate, popping a line off
both files if the lines are equal. That [a.pop, b.pop].last
trick just pops a
line off both files and returns the value of one of them, which is fine since
both lines should be equal.
def generate_diff(a, b)
a, b = a.lines, b.lines
trace = shortest_edit(a, b)
diff = []
backtrack(trace, a.size, b.size) do prev_x, prev_y, x, y
if x == prev_x
diff.unshift(Diff::Line.new(:ins, nil, b.size, b.pop))
elsif y == prev_y
diff.unshift(Diff::Line.new(:del, a.size, nil, a.pop))
else
diff.unshift(Diff::Line.new(:eql, a.size, b.size, [a.pop, b.pop].last))
end
end
diff
end
The Diff::Line
class used in this function is a simple structure that records
the type of diff line (deletion, insertion, or equal), the line number from each
file as appropriate, and the text of the line.
module Diff
Line = Struct.new(:type, :old_line, :new_line, :text)
end
So generate_diff
produces a list of Diff::Line
objects, which we can then
use for various purposes. The simplest application of this is to print it to the
terminal, colouring the lines red and green as appropriate if the stream it’s
writing to is a TTY (e.g. the program’s standard output if being written to a
terminal).
module Diff
TAGS = {:eql => " ", :del => "", :ins => "+"}
COLOURS = {
:del => "\e[31m",
:ins => "\e[32m",
:default => "\e[39m"
}
LINE_WIDTH = 4
def self.print(diff, out = $stdout)
diff.each do line
col = out.tty? ? COLOURS[line.type] : ""
reset = out.tty? ? COLOURS[:default] : ""
tag = TAGS[line.type]
old_line = line.old_line.to_s.rjust(LINE_WIDTH, " ")
new_line = line.new_line.to_s.rjust(LINE_WIDTH, " ")
text = line.text.rstrip
out.puts "#{col}#{tag} #{old_line} #{new_line} #{text}#{reset}"
end
end
end
Here’s the result of running Diff.print(generate_diff(a, b))
on our two
example strings, just as we expect:
 1 A
 2 B
3 1 C
+ 2 B
4 3 A
5 4 B
 6 B
7 5 A
+ 6 C
You can make further refinements to this, for example to only show regions that
have changed, keeping a certain amount of unchanged context around them,
formatting them as git diff
does, and so on. Although git diff
doesn’t show
numbers for each line, it does include a header before each change section that
includes the offsets of that section, and the line numbers are needed to
calculate that. The offsets help git apply
find the right place to apply each
change.
Git does not just use diffs to show information to the user via git diff
and
git show
. It also diffs internally to implement other operations, such as git
merge
. Retaining the line numbers is important for the merge algorithm to
correctly incorporate multiple changes within a file, and I’ll be looking at
that in a future article.