# Implementing patience diff

If you enjoy this article, I’m working on a book explaining the internals of Git through implementation: Building Git.

In the previous article we learned how the patience diff algorithm works, including how it divides up two versions of a file by trying to find unique matching lines they have in common. Now, let’s turn this into working code. Like linear-space Myers, this will be a recursive algorithm: we match up unique lines in the complete documents to break them into smaller pieces, then do the same thing with those pieces, and so on until we find pieces that do not have any unique matching lines.

To represent the section of each file we’re working on at each turn, let’s create a structure that stores the start and end line offsets (which I’ll call `low` and `high`) in each version we’re comparing (which I’ll call `a` and `b`). This structure also has a few helper functions for getting these line offsets as a range, and for checking whether the slice contains any lines.

``````  Slice = Struct.new(:a_low, :a_high, :b_low, :b_high) do
def a_range
a_low ... a_high
end

def b_range
b_low ... b_high
end

def not_empty?
a_low < a_high and b_low < b_high
end
end
``````

The `Patience` class will have the usual boilerplate we’ve seen for other diff implementations, except this class takes an additional argument called `fallback`, which specifies which other diff algorithm to use when we can’t find any matching lines. The `Patience.diff` method takes two documents (arrays of `Diff::Line` objects), creates an instance of the `Patience` class to hold them, and then calls its `diff` method with a `Slice` representing the entirety of both documents.

``````class Patience
def self.diff(a, b, fallback: MyersLinear)
slice = Slice.new(0, a.size, 0, b.size)
new(fallback, a, b).diff(slice)
end

def initialize(fallback, a, b)
@fallback = fallback
@a, @b    = a, b
end

def diff(slice)
# TODO
end
end
``````

We’ll come back to that `diff` method once we’ve put all the necessary building blocks in place. The first thing we need to build the algorithm is a way to find all the pairs of unique matching lines. We won’t filter them by taking the longest increasing subsequence yet, we just want to find all the lines that appear exactly one in each document and record where they are.

To do this, we’ll create a hash table that maps lines of text to an array containing four elements: the number of times the line appears in `@a` (within the current slice), the same count for `@b`, the line number of the first occurrence of the line in `@a` and the same for `@b`. The following expression does that; it says that if a line of text doesn’t exist in the table, then assign it a new record with zero occurrence counts and no line numbers.

``````    counts = Hash.new { |h, text| h[text] = [0, 0, nil, nil] }
``````

After this, we scan over the lines in `@a` and `@b` that fall within `slice` and update the `counts` table for each line we see, incrementing the counts and storing the first line number at which we see each line. Then we filter the table using `counts.select!` to select only those entries whose occurrence counts are both `1`, and finally we map the remaining entries to a list of `Match` objects containing the line numbers from each document. Here’s the full function:

``````  def unique_matching_lines(slice)
counts = Hash.new { |h, text| h[text] = [0, 0, nil, nil] }

slice.a_range.each do |n|
text = @a[n].text
counts[text][0]  += 1
counts[text][2] ||= n
end

slice.b_range.each do |n|
text = @b[n].text
counts[text][1]  += 1
counts[text][3] ||= n
end

counts.select! { |text, count| count.take(2) == [1, 1] }

counts.map do |text, (n, m, a_line, b_line)|
Match.new(a_line, b_line)
end
end
``````

`Match` is a structure that just stores a line number from each document. It also has `prev` and `next` fields, which we’ll use to form chains of these objects when we sort them using the patience algorithm.

``````  Match = Struct.new(:a_line, :b_line) do
attr_accessor :prev, :next
end
``````

Because Ruby’s `Hash` class maintains insertion order and we scan over the lines in `@a` first, `unique_matching_lines` returns `Match` objects ordered by their `a_line` value. What we now need to do is select the longest subsequence we can from this list where the `b_line` value is always increasing. For example, given this list:

``````[
#<struct Patience::Match a_line=1, b_line=3>,
#<struct Patience::Match a_line=2, b_line=4>,
#<struct Patience::Match a_line=3, b_line=2>,
#<struct Patience::Match a_line=4, b_line=1>,
#<struct Patience::Match a_line=5, b_line=5>
]
``````

We should select these objects:

``````[
#<struct Patience::Match a_line=1, b_line=3>,
#<struct Patience::Match a_line=2, b_line=4>,
#<struct Patience::Match a_line=5, b_line=5>
]
``````

To do this, we’ll implement the patience sorting algorithm on the `b_line` value of the match objects. Let’s remind ourselves of the information we needed to store using a step from our previous example:

``````J   3   2   K                               6 -> 4
-------------                               Q -> 6
8 -> 6
A       7                                   7 -> 6
4   5   8                                   5 -> A
9   6   Q  10                              10 -> 7
``````

We have the array of items we’re sorting, an array of stacks, and a set of pointers from one item to another. We can store the pointers on the `Match` objects themselves; that’s what the `prev` field is for. To store the stacks, we actually only need a flat array: once an item has been placed on top of a stack, the rest of the stack is never referred to again. So we could actually store the above stack state as the flat array `[A, 5, 7, 10]`. Also, notice that the cards on the tops of the stacks are always sorted in ascending order. This means we can find which stack to put the next card on by using a binary search rather than scanning linearly from left to right.

The following `patience_sort` function takes an array of `Match` objects and returns a linked list representing the longest increasing subsequence on `b_line` value. It does this by iterating over `matches` and assigning each match to a place in the `stacks` array. `binary_search` finds which position to insert the next match at, returning the zero-based index of the stack to the left of where the match should be placed. If this number is zero or higher, we record the `prev` pointer on the current `match` to point at that stack, and then insert the current `match` in the stack to the right. After this loop completes, we follow the `prev` pointers backwards from the rightmost match in `stacks`, assigning the `next` pointers to form a forwards chain from the lowest to the highest item, and we return the item at the start of this chain.

``````  def patience_sort(matches)
stacks = []

matches.each do |match|
i = binary_search(stacks, match)
match.prev = stacks[i] if i >= 0
stacks[i + 1] = match
end

match = stacks.last
return nil unless match

while match.prev
match.prev.next = match
match = match.prev
end

match
end
``````

The `binary_search` function takes the list of stacks and the next match, and finds the position the match should be placed at. A binary search works by assuming the list is sorted, and then repeatedly narrowing the range of possible positions a value could be at by half each time. The initial range is just the entire list we’re searching. On each iteration, we pick the position in the middle of the range, and if the thing we’re searching for is higher than the value at that position, that mid-point becomes the new start of the range, otherwise it becomes the new end.

``````  def binary_search(stacks, match)
low, high = -1, stacks.size

while low + 1 < high
mid = (low + high) / 2
if stacks[mid].b_line < match.b_line
low = mid
else
high = mid
end
end

low
end
``````

Having defined methods for finding the unique matching lines and returning their longest subsequence, we can now build the main `diff` method that uses this information to recursively split up the documents. This method will take a `Slice` and calculate the list of unique matching lines. If no matches are found (i.e. the linked list is `nil`) then we return the result of running the fallback algorithm over the slice.

Otherwise, we enter a loop. We start two line counters at the beginning of the current slice in each document. In the loop, we find the position of the next match, and create a new slice from our current line counters to this match’s position. We recursively calculate the diff over this slice and add it to our list of lines. If there is no next match, then the new slice is taken to the end of the current slice, and we return from the loop at this point. Otherwise, we emit a `Diff::Edit` representing an unchanged line and update our line counters to begin after the current match’s position. This is the only actual output this algorithm generates itself, all actual changes are detected by the fallback algorithm.

Here’s the full method:

``````  def diff(slice)
match = patience_sort(unique_matching_lines(slice))
return fallback_diff(slice) unless match

lines = []
a_line, b_line = slice.a_low, slice.b_low

loop do
a_next, b_next = match ?
[match.a_line, match.b_line] :
[slice.a_high, slice.b_high]

subslice = Slice.new(a_line, a_next, b_line, b_next)
lines   += diff(subslice)

return lines unless match

lines << Diff::Edit.new(:eql, @a[match.a_line], @b[match.b_line])

a_line, b_line = match.a_line + 1, match.b_line + 1
match = match.next
end
end
``````

The `fallback_diff` method calls the `@fallback` algorithm with the current slice. This is where preserving line numbers on `Diff::Line` objects is useful; otherwise we’d need to convert the output of this call to offset the resulting line numbers based on the current slice.

``````  def fallback_diff(slice)
@fallback.diff(@a[slice.a_range], @b[slice.b_range])
end
``````

1. Match the first lines of both if they’re identical, then match the second, third, etc. until a pair doesn’t match.
2. Match the last lines of both if they’re identical, then match the next to last, second to last, etc. until a pair doesn’t match.

We’ve not implemented that here, and you may have noticed that using the patience algorithm in Git does not in fact match up the trailing `}` lines in our C examples. That’s because Git doesn’t perform these steps first, it performs them after calculating the matching lines but before recursing into each slice. After each new slice is constructed, we trim the head and tail of it by consuming any leading and trailing lines that match. Let’s modify our `diff` method to include that.

``````      subslice   = Slice.new(a_line, a_next, b_line, b_next)

match_tail(subslice) { |edit| tail =  + tail }

lines += head + diff(subslice) + tail
return lines unless match
``````

The `match_head` and `match_tail` methods take the sub-slice and collect any matching lines that appear at the start or end of the slice. Remember that a slice between unique matching lines may still contain lines that match up: they just weren’t unique across the whole slice under consideration, or weren’t included in the longest subsequence. These methods adjust the offsets of the sub-slice for any lines they consume, and return lists of `Diff::Edit` objects for matching lines.

``````  def match_head(slice)
while slice.not_empty? and @a[slice.a_low].text == @b[slice.b_low].text
yield Diff::Edit.new(:eql, @a[slice.a_low], @b[slice.b_low])
slice.a_low += 1
slice.b_low += 1
end
end

def match_tail(slice)
while slice.not_empty? and @a[slice.a_high - 1].text == @b[slice.b_high - 1].text
slice.a_high -= 1
slice.b_high -= 1
yield Diff::Edit.new(:eql, @a[slice.a_high], @b[slice.b_high])
end
end
``````

All the pieces are now in place, and if we call `Patience.diff` with our two example C files, we obtain the diff we wanted:

``````+         1    int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
+         2    {
+         3        if (chunk == NULL) return 0;
+         4
+         5        return start <= chunk->length && n <= chunk->length - start;
+         6    }
+         7
1    8    void Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
2    9    {
3   10        if (!Chunk_bounds_check(src, src_start, n)) return;
4   11        if (!Chunk_bounds_check(dst, dst_start, n)) return;
5   12
6   13        memcpy(dst->data + dst_start, src->data + src_start, n);
7   14    }
-    8
-    9         int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
-   10         {
-   11             if (chunk == NULL) return 0;
-   12
-   13             return start <= chunk->length && n <= chunk->length - start;
-   14         }
``````

Git also performs some work to gather up multiple unique matching lines that appear next to each other without recursing. This is done as an optimisation rather than being a correctness concern; the recursion in this case would produce an empty slice, and so invoke the fallback algorithm, which will return an empty diff. This is not wrong, but it is wasteful, and so extra code can be added to avoid it if desired.

There are a few other edge cases Git handles directly rather than relying on the algorithm dealing with empty input lists. If the slice contains no lines for either input then the diff can be easily calculated as either all lines in the old version being deleted or all lines in the new version being added. If there are no unique matching lines the same trick can be applied. Again, these are optimisations and I encourage you to try adding them to this implementation.

An important point that explains why we need the patience algorithm and not just the original Myers implementation is that patience does not require quadratic space. The list of `Match` objects requires linear space (it’s proportional to the lengths of the shortest input string), and the patience sort’s stacks also require linear space since they store at most each `Match` object once. Besides, the Myers algorithm, considering as it does only one line at a time, can still make the same poor decisions as its linear-space variance on some inputs. By putting more up-front analysis into the total content of both documents, patience trades off execution time for a better chance of a diff matching the user’s expectations.

If you’ve enjoyed this article, you might enjoy my recently published book JavaScript Testing Recipes. It’s full of simple techniques for writing modular, maintainable JavaScript apps in the browser and on the server.