If you enjoy this article, I have published a book explaining the internals of Git through implementation: Building Git.
—
In the previous article we learned how the patience diff algorithm works, including how it divides up two versions of a file by trying to find unique matching lines they have in common. Now, let’s turn this into working code. Like linear-space Myers, this will be a recursive algorithm: we match up unique lines in the complete documents to break them into smaller pieces, then do the same thing with those pieces, and so on until we find pieces that do not have any unique matching lines.
To represent the section of each file we’re working on at each turn, let’s
create a structure that stores the start and end line offsets (which I’ll call
low
and high
) in each version we’re comparing (which I’ll call a
and b
).
This structure also has a few helper functions for getting these line offsets as
a range, and for checking whether the slice contains any lines.
Slice = Struct.new(:a_low, :a_high, :b_low, :b_high) do
def a_range
a_low ... a_high
end
def b_range
b_low ... b_high
end
def not_empty?
a_low < a_high and b_low < b_high
end
end
The Patience
class will have the usual boilerplate we’ve seen for other diff
implementations, except this class takes an additional argument called
fallback
, which specifies which other diff algorithm to use when we can’t find
any matching lines. The Patience.diff
method takes two documents (arrays of
Diff::Line
objects), creates an instance of the Patience
class to hold them,
and then calls its diff
method with a Slice
representing the entirety of
both documents.
class Patience
def self.diff(a, b, fallback: MyersLinear)
slice = Slice.new(0, a.size, 0, b.size)
new(fallback, a, b).diff(slice)
end
def initialize(fallback, a, b)
@fallback = fallback
@a, @b = a, b
end
def diff(slice)
# TODO
end
end
We’ll come back to that diff
method once we’ve put all the necessary building
blocks in place. The first thing we need to build the algorithm is a way to find
all the pairs of unique matching lines. We won’t filter them by taking the
longest increasing subsequence yet, we just want to find all the lines that
appear exactly one in each document and record where they are.
To do this, we’ll create a hash table that maps lines of text to an array
containing four elements: the number of times the line appears in @a
(within
the current slice), the same count for @b
, the line number of the first
occurrence of the line in @a
and the same for @b
. The following expression
does that; it says that if a line of text doesn’t exist in the table, then
assign it a new record with zero occurrence counts and no line numbers.
counts = Hash.new { |h, text| h[text] = [0, 0, nil, nil] }
After this, we scan over the lines in @a
and @b
that fall within slice
and
update the counts
table for each line we see, incrementing the counts and
storing the first line number at which we see each line. Then we filter the
table using counts.select!
to select only those entries whose occurrence
counts are both 1
, and finally we map the remaining entries to a list of
Match
objects containing the line numbers from each document. Here’s the full
function:
def unique_matching_lines(slice)
counts = Hash.new { |h, text| h[text] = [0, 0, nil, nil] }
slice.a_range.each do |n|
text = @a[n].text
counts[text][0] += 1
counts[text][2] ||= n
end
slice.b_range.each do |n|
text = @b[n].text
counts[text][1] += 1
counts[text][3] ||= n
end
counts.select! { |text, count| count.take(2) == [1, 1] }
counts.map do |text, (n, m, a_line, b_line)|
Match.new(a_line, b_line)
end
end
Match
is a structure that just stores a line number from each document. It
also has prev
and next
fields, which we’ll use to form chains of these
objects when we sort them using the patience algorithm.
Match = Struct.new(:a_line, :b_line) do
attr_accessor :prev, :next
end
Because Ruby’s Hash
class maintains insertion order and we scan over the lines
in @a
first, unique_matching_lines
returns Match
objects ordered by their
a_line
value. What we now need to do is select the longest subsequence we can
from this list where the b_line
value is always increasing. For example, given
this list:
[
#<struct Patience::Match a_line=1, b_line=3>,
#<struct Patience::Match a_line=2, b_line=4>,
#<struct Patience::Match a_line=3, b_line=2>,
#<struct Patience::Match a_line=4, b_line=1>,
#<struct Patience::Match a_line=5, b_line=5>
]
We should select these objects:
[
#<struct Patience::Match a_line=1, b_line=3>,
#<struct Patience::Match a_line=2, b_line=4>,
#<struct Patience::Match a_line=5, b_line=5>
]
To do this, we’ll implement the patience sorting algorithm on the b_line
value
of the match objects. Let’s remind ourselves of the information we needed to
store using a step from our previous example:
J 3 2 K 6 -> 4
------------- Q -> 6
8 -> 6
A 7 7 -> 6
4 5 8 5 -> A
9 6 Q 10 10 -> 7
We have the array of items we’re sorting, an array of stacks, and a set of
pointers from one item to another. We can store the pointers on the Match
objects themselves; that’s what the prev
field is for. To store the stacks, we
actually only need a flat array: once an item has been placed on top of a stack,
the rest of the stack is never referred to again. So we could actually store the
above stack state as the flat array [A, 5, 7, 10]
. Also, notice that the cards
on the tops of the stacks are always sorted in ascending order. This means we
can find which stack to put the next card on by using a binary search rather
than scanning linearly from left to right.
The following patience_sort
function takes an array of Match
objects and
returns a linked list representing the longest increasing subsequence on
b_line
value. It does this by iterating over matches
and assigning each
match to a place in the stacks
array. binary_search
finds which position to
insert the next match at, returning the zero-based index of the stack to the
left of where the match should be placed. If this number is zero or higher, we
record the prev
pointer on the current match
to point at that stack, and
then insert the current match
in the stack to the right. After this loop
completes, we follow the prev
pointers backwards from the rightmost match in
stacks
, assigning the next
pointers to form a forwards chain from the lowest
to the highest item, and we return the item at the start of this chain.
def patience_sort(matches)
stacks = []
matches.each do |match|
i = binary_search(stacks, match)
match.prev = stacks[i] if i >= 0
stacks[i + 1] = match
end
match = stacks.last
return nil unless match
while match.prev
match.prev.next = match
match = match.prev
end
match
end
The binary_search
function takes the list of stacks and the next match, and
finds the position the match should be placed at. A binary search works by
assuming the list is sorted, and then repeatedly narrowing the range of possible
positions a value could be at by half each time. The initial range is just the
entire list we’re searching. On each iteration, we pick the position in the
middle of the range, and if the thing we’re searching for is higher than the
value at that position, that mid-point becomes the new start of the range,
otherwise it becomes the new end.
def binary_search(stacks, match)
low, high = -1, stacks.size
while low + 1 < high
mid = (low + high) / 2
if stacks[mid].b_line < match.b_line
low = mid
else
high = mid
end
end
low
end
Having defined methods for finding the unique matching lines and returning their
longest subsequence, we can now build the main diff
method that uses this
information to recursively split up the documents. This method will take a
Slice
and calculate the list of unique matching lines. If no matches are found
(i.e. the linked list is nil
) then we return the result of running the
fallback algorithm over the slice.
Otherwise, we enter a loop. We start two line counters at the beginning of the
current slice in each document. In the loop, we find the position of the next
match, and create a new slice from our current line counters to this match’s
position. We recursively calculate the diff over this slice and add it to our
list of lines. If there is no next match, then the new slice is taken to the
end of the current slice, and we return from the loop at this point. Otherwise,
we emit a Diff::Edit
representing an unchanged line and update our line
counters to begin after the current match’s position. This is the only actual
output this algorithm generates itself, all actual changes are detected by the
fallback algorithm.
Here’s the full method:
def diff(slice)
match = patience_sort(unique_matching_lines(slice))
return fallback_diff(slice) unless match
lines = []
a_line, b_line = slice.a_low, slice.b_low
loop do
a_next, b_next = match ?
[match.a_line, match.b_line] :
[slice.a_high, slice.b_high]
subslice = Slice.new(a_line, a_next, b_line, b_next)
lines += diff(subslice)
return lines unless match
lines << Diff::Edit.new(:eql, @a[match.a_line], @b[match.b_line])
a_line, b_line = match.a_line + 1, match.b_line + 1
match = match.next
end
end
The fallback_diff
method calls the @fallback
algorithm with the current
slice. This is where preserving line numbers on Diff::Line
objects is useful;
otherwise we’d need to convert the output of this call to offset the resulting
line numbers based on the current slice.
def fallback_diff(slice)
@fallback.diff(@a[slice.a_range], @b[slice.b_range])
end
In Bram Cohen’s blog post about this algorithm, he mentions two steps we’ve not covered here:
- Match the first lines of both if they’re identical, then match the second, third, etc. until a pair doesn’t match.
- Match the last lines of both if they’re identical, then match the next to last, second to last, etc. until a pair doesn’t match.
We’ve not implemented that here, and you may have noticed that using the
patience algorithm in Git does not in fact match up the trailing }
lines in
our C examples. That’s because Git doesn’t perform these steps first, it
performs them after calculating the matching lines but before recursing into
each slice. After each new slice is constructed, we trim the head and tail of it
by consuming any leading and trailing lines that match. Let’s modify our diff
method to include that.
subslice = Slice.new(a_line, a_next, b_line, b_next)
head, tail = [], []
match_head(subslice) { |edit| head = head + [edit] }
match_tail(subslice) { |edit| tail = [edit] + tail }
lines += head + diff(subslice) + tail
return lines unless match
The match_head
and match_tail
methods take the sub-slice and collect any
matching lines that appear at the start or end of the slice. Remember that a
slice between unique matching lines may still contain lines that match up: they
just weren’t unique across the whole slice under consideration, or weren’t
included in the longest subsequence. These methods adjust the offsets of the
sub-slice for any lines they consume, and return lists of Diff::Edit
objects
for matching lines.
def match_head(slice)
while slice.not_empty? and @a[slice.a_low].text == @b[slice.b_low].text
yield Diff::Edit.new(:eql, @a[slice.a_low], @b[slice.b_low])
slice.a_low += 1
slice.b_low += 1
end
end
def match_tail(slice)
while slice.not_empty? and @a[slice.a_high - 1].text == @b[slice.b_high - 1].text
slice.a_high -= 1
slice.b_high -= 1
yield Diff::Edit.new(:eql, @a[slice.a_high], @b[slice.b_high])
end
end
All the pieces are now in place, and if we call Patience.diff
with our two
example C files, we obtain the diff we wanted:
+ 1 int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
+ 2 {
+ 3 if (chunk == NULL) return 0;
+ 4
+ 5 return start <= chunk->length && n <= chunk->length - start;
+ 6 }
+ 7
1 8 void Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
2 9 {
3 10 if (!Chunk_bounds_check(src, src_start, n)) return;
4 11 if (!Chunk_bounds_check(dst, dst_start, n)) return;
5 12
6 13 memcpy(dst->data + dst_start, src->data + src_start, n);
7 14 }
- 8
- 9 int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
- 10 {
- 11 if (chunk == NULL) return 0;
- 12
- 13 return start <= chunk->length && n <= chunk->length - start;
- 14 }
Git also performs some work to gather up multiple unique matching lines that appear next to each other without recursing. This is done as an optimisation rather than being a correctness concern; the recursion in this case would produce an empty slice, and so invoke the fallback algorithm, which will return an empty diff. This is not wrong, but it is wasteful, and so extra code can be added to avoid it if desired.
There are a few other edge cases Git handles directly rather than relying on the algorithm dealing with empty input lists. If the slice contains no lines for either input then the diff can be easily calculated as either all lines in the old version being deleted or all lines in the new version being added. If there are no unique matching lines the same trick can be applied. Again, these are optimisations and I encourage you to try adding them to this implementation.
An important point that explains why we need the patience algorithm and not just
the original Myers implementation is that patience does not require quadratic
space. The list of Match
objects requires linear space (it’s proportional to
the lengths of the shortest input string), and the patience sort’s stacks also
require linear space since they store at most each Match
object once. Besides,
the Myers algorithm, considering as it does only one line at a time, can still
make the same poor decisions as its linear-space variance on some inputs. By
putting more up-front analysis into the total content of both documents,
patience trades off execution time for a better chance of a diff matching the
user’s expectations.