Myers diff in linear space: implementation

If you enjoy this article, I’m working on a book explaining the internals of Git through implementation: Building Git.

In the last article we explored a linear space variant of the Myers diff algorithm, which is the version of the algorithm actually used by Git. We saw how it achieves its space-saving trick and how it can lead to multiple possible edit paths since it has to make choices between multiple possible optimal edits.

Now I’d like to step through an implementation so you can see this working in code. We’d like a function MyersLinear.diff(a, b) to take two strings (or other sequences of values) and return an array of Diff::Line objects that record the line number and content of each line and whether it’s a deletion, insertion, or unchanged line.

Let’s start off by creating a class. Many of the functions here will need to refer to a and b and so rather than pass those around everywhere, I’m just going to store them off as instance variables on this object.

class MyersLinear
  def self.diff(a, b)
    new(a, b).diff
  end

  def initialize(a, b)
    a = a.lines if a.is_a?(String)
    b = b.lines if b.is_a?(String)

    @a, @b = a, b
  end

  def diff
    # TODO
  end
end

Now, recall that this algorithm works by recursively splitting the edit graph into smaller and smaller sub-regions. As we implement this recursive process, we’re going to need a way to represent the region we’re currently working on. We could pass the co-ordinates or width and height around, but I’d like to encapsulate this idea as a single value. The following Box class represents a region by its top-left and bottom-right co-ordinates and provides some convenience methods that will clarify the subsequent code. In Ruby, Struct just creates a class whose constructor assigns its arguments to the given names.

Box = Struct.new(:left, :top, :right, :bottom) do
  def width
    right - left
  end

  def height
    bottom - top
  end

  def size
    width + height
  end

  def delta
    width - height
  end
end

Our implementation will work in two phases: first, we run the algorithm described in the previous article, that is the recursive procedure for finding the series of snakes that form the midpoint of each region. Then, we will use this series to reconstruct the line-by-line diff.

Let’s introduce a method in the MyersLinear class for the first phase; this is the core of the recursive algorithm. Given any set of co-ordinates left, top, right and bottom, we find the middle snake within that box. If we don’t find a snake, it means this region is already as small as possible (it’s a single point) and we return nil to indicate this yielded no new moves. But if we do find a snake, we take its start and end points and construct two new regions from the top-left of our original box to the start of the snake, and from the end of the snake to the bottom-right of our box, and recursively find the edit path for each of them. Finally, we combine the edit paths for these two regions into a single list and return it.

  def find_path(left, top, right, bottom)
    box   = Box.new(left, top, right, bottom)
    snake = midpoint(box)

    return nil unless snake

    start, finish = snake

    head = find_path(box.left, box.top, start[0], start[1])
    tail = find_path(finish[0], finish[1], box.right, box.bottom)

    (head || [start]) + (tail || [finish])
  end

The midpoint method will return a representation of a snake, which is a pair of points for the start and end of the snake, for example [[8, 7], [11, 9]]. Each recursive call to find_path will either return a list of points or nil, and when it does return nil we use one of the snake’s points in its place. For example, when head is nil, that means the region from the top-left of the box to the start of the middle snake is just a single point, so we can use the start of the snake to represent it. In this way, the method ends up building a list of the ends of all the middle snakes it found as it divided the graph into smaller boxes.

Having defined the basic outline of the algorithm, we now need to implement the method for actually finding the midpoint for each region. Our midpoint method shown below will take a Box structure and try to return a pair of points representing the middle snake, in the form [[left, top], [right, bottom]]. To do this, it runs the original version of the algorithm for finding the shortest edit, but it runs it simultaneously in both directions, one beginning at the top-left and one at the bottom-right.

Therefore, rather than a single v array for storing the best position for each diagonal, we need two: vf for the forward direction and vb for the backward one. As before, I’m seeding the forward-scan with box.left, i.e. the x co-ordinate of the start of the search. However, for the backward scan I’m starting with box.bottom, rather than box.right. When we scan forward, we’re looking to maximise x, so as to prioritise deletions over insertions. But when we scan in the reverse direction, we’d like the opposite: we want prioritise insertions, so that those appear last in the final output. So, I’m basing the backward scan around minimising y rather than x, and have therefore used box.bottom as my starting value. This is an entirely arbitrary decision, and you can just as well use x for both scans and experiment to see what edit paths you end up with.

With the starting arrays set up, we then iterate d as before, but on each iteration we fill in a row of values in both the forwards and the backwards direction. I’ve implemented this as two method calls, forwards and backward, that perform the scan and yield any potential snakes they found. To do this, they need to take both vf and vb so they can detect whether they’ve found any overlapping paths.

Each snake yielded by these methods is an array of three values: a start point, an end point, and a score. Recall from the previous article that Git uses some heuristics to award scores to potential snakes and selects the one with the best score. We’re doing the same here by taking the max_by of these snakes by their last value, and then returning the first two values in the best snake – its start and end points – as the result.

  def midpoint(box)
    return nil if box.size == 0

    max = (box.size / 2.0).ceil

    vf    = Array.new(2 * max + 1)
    vf[1] = box.left
    vb    = Array.new(2 * max + 1)
    vb[1] = box.bottom

    (0 .. max).step do |d|
      snakes = []

      forwards(box, vf, vb, d) { |s| snakes << s }
      backward(box, vf, vb, d) { |s| snakes << s }

      next if snakes.empty?

      snake = snakes.max_by { |snake| snake.last }
      return snake.take(2)
    end
  end

The forwards and backward methods themselves will look quite familiar from the original Myers algorithm. To refresh your memory, here’s our original shortest_edit method:

  def shortest_edit(a, b)
    n, m = a.size, b.size
    max  = n + m

    v    = Array.new(2 * max + 1)
    v[1] = 0

    (0 .. max).step do |d|
      (-d .. d).step(2) do |k|
        if k == -d or (k != d and v[k - 1] < v[k + 1])
          x = v[k + 1]
        else
          x = v[k - 1] + 1
        end

        y = x - k

        while x < n and y < m and a[x] == b[y]
          x, y = x + 1, y + 1
        end

        v[k] = x

        return d if x >= n and y >= m
      end
    end
  end

Our new methods (see below) are just performing the inner loop over k, but there are a few modifications we’ve had to make. The first is that, because we’re trying to return the start and end points of the snake, not just the length of the edit sequence, we need to store the previous value of x before we make our next move. I’ve called this value px and it’s the same as x if we take a downward step, but one less than x if we step rightward.

The second change is that rather than simply saying y = x - k, we need to adjust our x and y values relative to the box’s top-left corner, so this becomes y = box.top + (x - box.left) - k. Then we must also calculate py, which will be one less than y if we moved downward (i.e. x did not change) and equal to y otherwise.

Then, as before, we follow any possible diagonal steps, taking care not to overshoot box.right or box.bottom, and we store off the final value of x. The final change is that at the end of this method, we check for overlaps between the forward and backward scans, here comparing our current y value to the value found in the corresponding c diagonal in the vb array. If our y value is greater than or equal to the corresponding vb[c] value, the scans have overlapped and we can yield the move we just took back to the caller as a potential middle snake. As Git does, I’ve calculated a score for the snake based on the relative x and y offsets within the box, and the absolute value of k representing how far we’ve strayed from the centre.

  def forwards(box, vf, vb, d)
    (-d .. d).step(2) do |k|
      c = k - box.delta

      if k == -d or (k != d and vf[k - 1] < vf[k + 1])
        px = x = vf[k + 1]
      else
        px = vf[k - 1]
        x  = px + 1
      end

      y  = box.top + (x - box.left) - k
      py = (d == 0 || x != px) ? y : y - 1

      while x < box.right and y < box.bottom and @a[x] == @b[y]
        x, y = x + 1, y + 1
      end

      vf[k] = x

      if box.delta.odd? and c.between?(-(d - 1), d - 1) and y >= vb[c]
        score = (x - box.left) + (y - box.top) - k.abs
        yield [[px, py], [x, y], score]
      end
    end
  end

The backward method is much the same, except that it tries to minimise y rather than maximise x, and values in the vb array are indexed by c rather than k.

  def backward(box, vf, vb, d)
    (-d .. d).step(2) do |c|
      k = c + box.delta

      if c == -d or (c != d and vb[c - 1] > vb[c + 1])
        py = y = vb[c + 1]
      else
        py = vb[c - 1]
        y  = py - 1
      end

      x  = box.left + (y - box.top) + k
      px = (d == 0 || y != py) ? x : x + 1

      while x > box.left and y > box.top and @a[x - 1] == @b[y - 1]
        x, y = x - 1, y - 1
      end

      vb[c] = y

      if box.delta.even? and k.between?(-d, d) and x <= vf[k]
        score = (box.right - x) + (box.bottom - y) - c.abs
        yield [[x, y], [px, py], score]
      end
    end
  end

We’ve now seen all the methods that find the shortest edit path: find_path uses midpoint to find the middle snake, which in turn uses forwards and backward methods to scan the current box. find_path then recurses into the two smaller boxes defined by the middle snake. When we run this on the C code in the previous article, it generates this sequence of points:

[ [0, 0],   [1, 0],   [1, 1],   [3, 2],   [4, 2],
  [5, 4],   [6, 4],   [6, 5],   [8, 6],   [8, 7],
  [11, 9],  [11, 10], [13, 11], [13, 12], [13, 13],
  [15, 14], [16, 16], [17, 16], [18, 18] ]

Notice that not every point in this sequence is a single step from the one before it. For example, while the difference between [1, 0] and [1, 1] is a single downward step, the gap between [1, 1] and [3, 2], or between [8, 7] and [11, 9], is composed of multiple steps. To construct the complete diff, we need to work out all the single steps that make up the jumps between these points.

Take that snake we found at the first iteration of the algorithm, from [8, 7] to [11, 9]. We know a snake is composed of a single rightward and downward step followed by zero or more diagonal ones. Because of snakes found by scanning backward, they can also be a sequence of diagonal steps followed by a single rightward or downward one (leftward or upward in reverse direction). So, this snake could correspond to one of these sets of steps:

        o---o   o   o               o   o   o   o
              \                       \
        o   o   o   o       or      o   o   o   o
                  \                       \
        o   o   o   o               o   o   o---o

We know this move includes a rightward step rather than a downward one, because the change in the x co-ordinate from 8 to 11 is greater than the change in y from 7 to 9. A greater change in y than x would indicate a downward move.

As well as deciding which direction this step is in, we also need to decide whether it appears at the beginning or the end of the snake, that is, do we have the sequence on the left in the above figure, or the one on the right? To figure this out we can compare the lines at the starting point of the snake. If line 8 in the old version is equal to line 7 in the new one, then the first step is a diagonal and the rightward step comes at the end. Otherwise, it comes at the start.

Putting all this togther, we can write a method the uses the output of find_path to to reconstruct the complete diff. The walk_snakes method below takes each consecutive pair of points in the find_path output and derives the complete path between them. For each pair of points, it first tries to walk a diagonal from the first point by checking the lines in the @a and @b strings. From there, it compares the change in x to the change in y; if it’s negative then the y change is greater and we yield a downward step, but if it’s positive then we yield a rightward step. Note that this code explicitly does nothing if the changes in x and y are equal. Some snakes will consist only of diagonals so there is no rightward or downward step to yield. Finally, we try to walk diagonals again to handle the case where the rightward/downward move happens first.

  def walk_snakes(&block)
    path = find_path(0, 0, @a.size, @b.size)
    return unless path

    path.each_cons(2) do |(x1, y1), (x2, y2)|
      x1, y1 = walk_diagonal(x1, y1, x2, y2, &block)

      case x2 - x1 <=> y2 - y1
      when -1
        yield x1, y1, x1, y1 + 1
        y1 += 1
      when 1
        yield x1, y1, x1 + 1, y1
        x1 += 1
      end

      walk_diagonal(x1, y1, x2, y2, &block)
    end
  end

  def walk_diagonal(x1, y1, x2, y2, &block)
    while x1 < x2 and y1 < y2 and @a[x1] == @b[y1]
      yield x1, y1, x1 + 1, y1 + 1
      x1, y1 = x1 + 1, y1 + 1
    end
    [x1, y1]
  end

Now we can loop back to where we started and implement the diff method. All this does is call walk_snakes and consumes the pairs of points yielded by that method. For each pair of points, if x is unchanged then we have a downward move, which corresponds to an insertion; if y is unchanged then we have a deletion; and otherwise we have a diagonal move and the lines are equal.

  def diff
    diff = []

    walk_snakes do |x1, y1, x2, y2|
      if x1 == x2
        diff << Diff::Line.new(:ins, nil, y2, @b[y1])
      elsif y1 == y2
        diff << Diff::Line.new(:del, x2, nil, @a[x1])
      else
        diff << Diff::Line.new(:eql, x2, y2, @a[x1])
      end
    end

    diff
  end

Putting everything together, when we run MyersLinear.diff(a, b) where a and b are the two versions of the C code in the last article, we indeed get the diff that Git produces:

-    1         size_t Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
+         1    int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
     2    2    {
-    3             if (!Chunk_bounds_check(src, src_start, n)) return 0;
-    4             if (!Chunk_bounds_check(dst, dst_start, n)) return 0;
+         3        if (chunk == NULL) return 0;
     5    4
-    6             memcpy(dst->data + dst_start, src->data + src_start, n);
+         5        size_t length = chunk->length;
     7    6
-    8             return n;
+         7        return start <= length && n <= length - start;
     9    8    }
    10    9
-   11         int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
+        10    size_t Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
    12   11    {
-   13             if (chunk == NULL) return 0;
+        12        if (!Chunk_bounds_check(src, src_start, n)) return 0;
+        13        if (!Chunk_bounds_check(dst, dst_start, n)) return 0;
    14   14
-   15             size_t length = chunk->length;
+        15        memcpy(dst->data + dst_start, src->data + src_start, n);
    16   16
-   17             return start <= length && n <= length - start;
+        17        return n;
    18   18    }

The ability of this algorithm to make choices about which snake to pick means you can experiment with it. Try adjusting its scoring methods to see how that changes the diffs you get. Try making the backward method work using x rather than y co-ordinates. Allow snakes to be selected from either direction, regardless of whether delta is even or not. There are endless tweaks you can make to this algorithm, and getting it “right” is a question of observing how it works on real-world inputs. Why not have a look at Git’s source code to see how it does things, and try out your own ideas for improving its output.

If you’ve enjoyed this article, you might enjoy my recently published book JavaScript Testing Recipes. It’s full of simple techniques for writing modular, maintainable JavaScript apps in the browser and on the server.