Implementing patience diff

In the previous article we learned how the patience diff algorithm works, including how it divides up two versions of a file by trying to find unique matching lines they have in common. Now, let’s turn this into working code. Like linear-space Myers, this will be a recursive algorithm: we match up unique lines in the complete documents to break them into smaller pieces, then do the same thing with those pieces, and so on until we find pieces that do not have any unique matching lines.

To represent the section of each file we’re working on at each turn, let’s create a structure that stores the start and end line offsets (which I’ll call low and high) in each version we’re comparing (which I’ll call a and b). This structure also has a few helper functions for getting these line offsets as a range, and for checking whether the slice contains any lines.

  Slice = Struct.new(:a_low, :a_high, :b_low, :b_high) do
    def a_range
      a_low ... a_high
    end

    def b_range
      b_low ... b_high
    end

    def not_empty?
      a_low < a_high and b_low < b_high
    end
  end

The Patience class will have the usual boilerplate we’ve seen for other diff implementations, except this class takes an additional argument called fallback, which specifies which other diff algorithm to use when we can’t find any matching lines. The Patience.diff method takes two documents (arrays of Diff::Line objects), creates an instance of the Patience class to hold them, and then calls its diff method with a Slice representing the entirety of both documents.

class Patience
  def self.diff(a, b, fallback: MyersLinear)
    slice = Slice.new(0, a.size, 0, b.size)
    new(fallback, a, b).diff(slice)
  end

  def initialize(fallback, a, b)
    @fallback = fallback
    @a, @b    = a, b
  end

  def diff(slice) 
    # TODO
  end
end

We’ll come back to that diff method once we’ve put all the necessary building blocks in place. The first thing we need to build the algorithm is a way to find all the pairs of unique matching lines. We won’t filter them by taking the longest increasing subsequence yet, we just want to find all the lines that appear exactly one in each document and record where they are.

To do this, we’ll create a hash table that maps lines of text to an array containing four elements: the number of times the line appears in @a (within the current slice), the same count for @b, the line number of the first occurrence of the line in @a and the same for @b. The following expression does that; it says that if a line of text doesn’t exist in the table, then assign it a new record with zero occurrence counts and no line numbers.

    counts = Hash.new { |h, text| h[text] = [0, 0, nil, nil] }

After this, we scan over the lines in @a and @b that fall within slice and update the counts table for each line we see, incrementing the counts and storing the first line number at which we see each line. Then we filter the table using counts.select! to select only those entries whose occurrence counts are both 1, and finally we map the remaining entries to a list of Match objects containing the line numbers from each document. Here’s the full function:

  def unique_matching_lines(slice)
    counts = Hash.new { |h, text| h[text] = [0, 0, nil, nil] }

    slice.a_range.each do |n|
      text = @a[n].text
      counts[text][0]  += 1
      counts[text][2] ||= n
    end

    slice.b_range.each do |n|
      text = @b[n].text
      counts[text][1]  += 1
      counts[text][3] ||= n
    end

    counts.select! { |text, count| count.take(2) == [1, 1] }

    counts.map do |text, (n, m, a_line, b_line)|
      Match.new(a_line, b_line)
    end
  end

Match is a structure that just stores a line number from each document. It also has prev and next fields, which we’ll use to form chains of these objects when we sort them using the patience algorithm.

  Match = Struct.new(:a_line, :b_line) do
    attr_accessor :prev, :next
  end

Because Ruby’s Hash class maintains insertion order and we scan over the lines in @a first, unique_matching_lines returns Match objects ordered by their a_line value. What we now need to do is select the longest subsequence we can from this list where the b_line value is always increasing. For example, given this list:

[
  #<struct Patience::Match a_line=1, b_line=3>,
  #<struct Patience::Match a_line=2, b_line=4>,
  #<struct Patience::Match a_line=3, b_line=2>,
  #<struct Patience::Match a_line=4, b_line=1>,
  #<struct Patience::Match a_line=5, b_line=5>
]

We should select these objects:

[
  #<struct Patience::Match a_line=1, b_line=3>,
  #<struct Patience::Match a_line=2, b_line=4>,
  #<struct Patience::Match a_line=5, b_line=5>
]

To do this, we’ll implement the patience sorting algorithm on the b_line value of the match objects. Let’s remind ourselves of the information we needed to store using a step from our previous example:

J   3   2   K                               6 -> 4
-------------                               Q -> 6
                                            8 -> 6
A       7                                   7 -> 6
4   5   8                                   5 -> A
9   6   Q  10                              10 -> 7

We have the array of items we’re sorting, an array of stacks, and a set of pointers from one item to another. We can store the pointers on the Match objects themselves; that’s what the prev field is for. To store the stacks, we actually only need a flat array: once an item has been placed on top of a stack, the rest of the stack is never referred to again. So we could actually store the above stack state as the flat array [A, 5, 7, 10]. Also, notice that the cards on the tops of the stacks are always sorted in ascending order. This means we can find which stack to put the next card on by using a binary search rather than scanning linearly from left to right.

The following patience_sort function takes an array of Match objects and returns a linked list representing the longest increasing subsequence on b_line value. It does this by iterating over matches and assigning each match to a place in the stacks array. binary_search finds which position to insert the next match at, returning the zero-based index of the stack to the left of where the match should be placed. If this number is zero or higher, we record the prev pointer on the current match to point at that stack, and then insert the current match in the stack to the right. After this loop completes, we follow the prev pointers backwards from the rightmost match in stacks, assigning the next pointers to form a forwards chain from the lowest to the highest item, and we return the item at the start of this chain.

  def patience_sort(matches)
    stacks = []

    matches.each do |match|
      i = binary_search(stacks, match)
      match.prev = stacks[i] if i >= 0
      stacks[i + 1] = match
    end

    match = stacks.last
    return nil unless match

    while match.prev
      match.prev.next = match
      match = match.prev
    end

    match
  end

The binary_search function takes the list of stacks and the next match, and finds the position the match should be placed at. A binary search works by assuming the list is sorted, and then repeatedly narrowing the range of possible positions a value could be at by half each time. The initial range is just the entire list we’re searching. On each iteration, we pick the position in the middle of the range, and if the thing we’re searching for is higher than the value at that position, that mid-point becomes the new start of the range, otherwise it becomes the new end.

  def binary_search(stacks, match)
    low, high = -1, stacks.size

    while low + 1 < high
      mid = (low + high) / 2
      if stacks[mid].b_line < match.b_line
        low = mid
      else
        high = mid
      end
    end

    low
  end

Having defined methods for finding the unique matching lines and returning their longest subsequence, we can now build the main diff method that uses this information to recursively split up the documents. This method will take a Slice and calculate the list of unique matching lines. If no matches are found (i.e. the linked list is nil) then we return the result of running the fallback algorithm over the slice.

Otherwise, we enter a loop. We start two line counters at the beginning of the current slice in each document. In the loop, we find the position of the next match, and create a new slice from our current line counters to this match’s position. We recursively calculate the diff over this slice and add it to our list of lines. If there is no next match, then the new slice is taken to the end of the current slice, and we return from the loop at this point. Otherwise, we emit a Diff::Edit representing an unchanged line and update our line counters to begin after the current match’s position. This is the only actual output this algorithm generates itself, all actual changes are detected by the fallback algorithm.

Here’s the full method:

  def diff(slice)
    match = patience_sort(unique_matching_lines(slice))
    return fallback_diff(slice) unless match

    lines = []
    a_line, b_line = slice.a_low, slice.b_low

    loop do
      a_next, b_next = match ?
                       [match.a_line, match.b_line] :
                       [slice.a_high, slice.b_high]

      subslice = Slice.new(a_line, a_next, b_line, b_next)
      lines   += diff(subslice)

      return lines unless match

      lines << Diff::Edit.new(:eql, @a[match.a_line], @b[match.b_line])

      a_line, b_line = match.a_line + 1, match.b_line + 1
      match = match.next
    end
  end

The fallback_diff method calls the @fallback algorithm with the current slice. This is where preserving line numbers on Diff::Line objects is useful; otherwise we’d need to convert the output of this call to offset the resulting line numbers based on the current slice.

  def fallback_diff(slice)
    @fallback.diff(@a[slice.a_range], @b[slice.b_range])
  end

In Bram Cohen’s blog post about this algorithm, he mentions two steps we’ve not covered here:

  1. Match the first lines of both if they’re identical, then match the second, third, etc. until a pair doesn’t match.
  2. Match the last lines of both if they’re identical, then match the next to last, second to last, etc. until a pair doesn’t match.

We’ve not implemented that here, and you may have noticed that using the patience algorithm in Git does not in fact match up the trailing } lines in our C examples. That’s because Git doesn’t perform these steps first, it performs them after calculating the matching lines but before recursing into each slice. After each new slice is constructed, we trim the head and tail of it by consuming any leading and trailing lines that match. Let’s modify our diff method to include that.

      subslice   = Slice.new(a_line, a_next, b_line, b_next)
      head, tail = [], []

      match_head(subslice) { |edit| head = head + [edit] }
      match_tail(subslice) { |edit| tail = [edit] + tail }

      lines += head + diff(subslice) + tail
      return lines unless match

The match_head and match_tail methods take the sub-slice and collect any matching lines that appear at the start or end of the slice. Remember that a slice between unique matching lines may still contain lines that match up: they just weren’t unique across the whole slice under consideration, or weren’t included in the longest subsequence. These methods adjust the offsets of the sub-slice for any lines they consume, and return lists of Diff::Edit objects for matching lines.

  def match_head(slice)
    while slice.not_empty? and @a[slice.a_low].text == @b[slice.b_low].text
      yield Diff::Edit.new(:eql, @a[slice.a_low], @b[slice.b_low])
      slice.a_low += 1
      slice.b_low += 1
    end
  end

  def match_tail(slice)
    while slice.not_empty? and @a[slice.a_high - 1].text == @b[slice.b_high - 1].text
      slice.a_high -= 1
      slice.b_high -= 1
      yield Diff::Edit.new(:eql, @a[slice.a_high], @b[slice.b_high])
    end
  end

All the pieces are now in place, and if we call Patience.diff with our two example C files, we obtain the diff we wanted:

+         1    int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
+         2    {
+         3        if (chunk == NULL) return 0;
+         4
+         5        return start <= chunk->length && n <= chunk->length - start;
+         6    }
+         7
     1    8    void Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
     2    9    {
     3   10        if (!Chunk_bounds_check(src, src_start, n)) return;
     4   11        if (!Chunk_bounds_check(dst, dst_start, n)) return;
     5   12
     6   13        memcpy(dst->data + dst_start, src->data + src_start, n);
     7   14    }
-    8
-    9         int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
-   10         {
-   11             if (chunk == NULL) return 0;
-   12
-   13             return start <= chunk->length && n <= chunk->length - start;
-   14         }

Git also performs some work to gather up multiple unique matching lines that appear next to each other without recursing. This is done as an optimisation rather than being a correctness concern; the recursion in this case would produce an empty slice, and so invoke the fallback algorithm, which will return an empty diff. This is not wrong, but it is wasteful, and so extra code can be added to avoid it if desired.

There are a few other edge cases Git handles directly rather than relying on the algorithm dealing with empty input lists. If the slice contains no lines for either input then the diff can be easily calculated as either all lines in the old version being deleted or all lines in the new version being added. If there are no unique matching lines the same trick can be applied. Again, these are optimisations and I encourage you to try adding them to this implementation.

An important point that explains why we need the patience algorithm and not just the original Myers implementation is that patience does not require quadratic space. The list of Match objects requires linear space (it’s proportional to the lengths of the shortest input string), and the patience sort’s stacks also require linear space since they store at most each Match object once. Besides, the Myers algorithm, considering as it does only one line at a time, can still make the same poor decisions as its linear-space variance on some inputs. By putting more up-front analysis into the total content of both documents, patience trades off execution time for a better chance of a diff matching the user’s expectations.

If you’ve enjoyed this article, you might enjoy my recently published book JavaScript Testing Recipes. It’s full of simple techniques for writing modular, maintainable JavaScript apps in the browser and on the server.