Merging with diff3

If you enjoy this article, I’m working on a book explaining the internals of Git through implementation: Building Git.

Over the course of the last few articles, we’ve investigated how Git calculates diffs between different versions of a file, first looking at the Myers diff algorithm and then its linear space variant. As a programmer, you are probably most familiar with diffs as a way for you to see what’s changed; you see them whenever you run the git diff or git show commands, or when you use GitHub’s compare view or read a pull request. But diffs aren’t only for human consumption, although their design is strongly influenced by human expectations. They also form an ingredient in other computations that Git has to perform, the most common one being merging.

Merging happens whenever two or more concurrent branches have modified the same file, and we want to reconcile all those changes into a single final version. Say Alice and Bob have both forked from the master branch of a project, where a version of some file exists; we will call this version v1. Alice modifies the file on their branch, producing version v2 of the file, and Bob makes a different change to their copy, producing yet another version v3.

                     -------o [alice]
                   /       v2
                 /
      [master] o
              v1 \
                   \
                     -------o [bob]
                           v3

We now want to merge the branches back together, reconciling both Alice’s and Bob’s changes to produce a final version, v4.

                     -------o------
                   /       v2       \
                 /                    \
      [master] o                        o v4
              v1 \                    /
                   \                /
                     -------o------
                           v3

Merging is the process that takes v2 and v3 and combines them to find this final version. When you use Git, you might not think much about how this process works, because most of the time it does just what you expect. Multiple people can change different sections of the same file and all their edits are incorporated. But a moment’s thought suggests something more complex is going on; it’s not sufficient to say something like “Alice added something on line 3 and Bob deleted line 5”, because Alice adding something on line 3 moves all the lines following it. Deleting whatever is now at line 5 after applying that change would not remove the content Bob intended, in fact it might even delete something Alice added!

This problem reveals the core problem with merging: we don’t know what order to apply changes in. If we apply the operation “delete line 5” first, then the meaning of “insert something at line 3” is probably unchanged, whereas if we perform the insertion first, then “delete line 5” probably needs adjusting somehow to take the changed line offsets into account, and it may no longer make sense as an operation any more. Merge conflicts arise precisely because of this partial ordering of operations. Nevertheless, this model of changes is commonly used in applications like real-time collaborative editing, and goes by the name operational transformation. There are many different protocols for it that depend on the type of data being modified and the interaction model for collaboration.

Some version control systems use this model, and their primary means of storing commits represents them as diffs or patches. But Git doesn’t use that model; it stores commits as complete snapshots of your content, rather than storing the differences between versions. If it doesn’t store changes, how can it merge them? Well, another way to merge changes is to take the two versions you want to merge, and compare them to the last version they had in common, effectively reconstructing the set of changes from the texts themselves. These diffs are then used to merge the two versions together relative to their common ancestor.

To perform this task, Git and many other version control systems use the diff3 algorithm. diff3 is a Unix utility originally created in 1988 by Randy Smith, and described formally in a paper by Sanjeev Khanna, Keshav Kunal and Benjamin C. Pierce. The algorithm takes as input three texts: the two versions to be merged, and the original version they both derive from, and produces a single merged version, which may contain conflicts if some sections have been edited by both parties. The use of three files means this process is referred to as a three-way merge.

Let’s walk through an example to see how it works. This is adapted from the one given in section 2 of the Khanna, Kunal and Pierce paper. Say Alice and Bob are part of a team developing a recipe book. One of the recipes is for a fish soup and includes the following list of ingredients:

                          1. celery
                          2. garlic
                          3. onions
                          4. salmon
                          5. tomatoes
                          6. wine

Now, Alice and Bob are both going through the book and making edits. They each get their own copy of the original version and make some changes to it. Alice takes garlic and onions and moves them to appear after tomatoes in the list, while Bob moves salmon to appear second.

         Alice               Original            Bob

      1. celery           1. celery           1. celery
      2. salmon           2. garlic           2. salmon
      3. tomatoes         3. onions           3. garlic
      4. garlic           4. salmon           4. onions
      5. onions           5. tomatoes         5. tomatoes
      6. wine             6. wine             6. wine

To merge the changes made by Alice and Bob into a single document, the first thing diff3 does is calculate a diff between the original copy and Alice’s version, and between the original and Bob’s. The algorithm used to calculate the diff can make a big difference to the result of the merge, as we’ll see later, but for now let’s assume the diffs come out as follows. Here is Alice’s:

         Alice               Original

      1. celery           1. celery
-                         2. garlic                           
-                         3. onions                           
      2. salmon           4. salmon
      3. tomatoes         5. tomatoes
+     4. garlic                                               
+     5. onions                                               
      6. wine             6. wine

And here is Bob’s:

                             Original            Bob

                          1. celery           1. celery
+                                             2. salmon       
                          2. garlic           3. garlic
                          3. onions           4. onions
-                         4. salmon                           
                          5. tomatoes         5. tomatoes
                          6. wine             6. wine

Once we have these diffs, the next thing we do is find any lines that are unchanged in both diffs. If you look at the diffs above you’ll see that they both have matching line content for celery (line 1 in all versions), tomatoes (line 3 in Alice’s version, line 5 in the others), and wine (line 6 in all versions). We align the lines of the documents into chunks, so that these unchanged lines are matched up:

         Alice               Original            Bob

      1. celery           1. celery           1. celery         A
      -----------------------------------------------------------
                          2. garlic           2. salmon         B
      2. salmon           3. onions           3. garlic
                          4. salmon           4. onions
      -----------------------------------------------------------
      3. tomatoes         5. tomatoes         5. tomatoes       C
      -----------------------------------------------------------
      4. garlic                                                 D
      5. onions
      -----------------------------------------------------------
      6. wine             6. wine             6. wine           E

Having done this, the merge result is generated by scanning through these chunks. For chunks where all three versions agree – chunks A, C and E above – we can just output the original text since neither Alice nor Bob changed it. For chunks like B where both Alice and Bob differ from the original, we have a conflict; the merge algorithm, having no understanding of the meaning of the text, cannot decide how to resolve this and the conflict is emitted for the user to resolve by hand. In contrast, in chunk D Bob’s version is equal to the original (in this case it is empty) while Alice’s differs. In chunks like this where only one new version differs from the original, we pick the version that changed.

To summarise, for each block we emit the following:

  • A: all equal; emit celery
  • B: Both differ; emit all versions as a conflict
  • C: all equal; emit tomatoes
  • D: Alice differs; emit garlic, onions
  • E: all equal: emit wine

When Git emits these results as into the target file, conflicts are denoted by a <<<<<<< line, followed by Alice’s version, then ======= followed by Bob’s version, then a >>>>>>> line:

      celery
      <<<<<<< Alice
      salmon
      =======
      salmon
      garlic
      onions
      >>>>>>> Bob
      tomatoes
      garlic
      onions
      wine

If you set merge.conflictStyle = diff3 in your Git config, or use the standalone diff3 program, you’ll also see the original version of a conflicted chunk, denoted by the ||||||| line:

      celery
      <<<<<<< Alice
      salmon
      ||||||| Original
      garlic
      onions
      salmon
      =======
      salmon
      garlic
      onions
      >>>>>>> Bob
      tomatoes
      garlic
      onions
      wine

Now that we understand at a high level how the algorithm works, let’s look at implementing it. I’m going to create a class for doing this, since the algorithm uses several bits of state that I’d rather store as instance variables, rather than passing them between functions. We’ll start by creating a Diff3.merge method that takes o, a and b; the original version and the two copies to be merged. It also takes an optional argument called differ that specifies the diff algorithm to use, this will default to our linear Myers diff implementation.

Just like our previous diff implementations, we’re going to write this so it can take arbitrary lists of values, not just strings, but if it receives strings it splits them into lines before starting. After creating a new Diff3 object we call merge on it, which is going to set up some initial state, then generate the output chunks, and then return a new Merge object containing those chunks. We’ll see the Merge class in more detail later.

class Diff3
  def self.merge(o, a, b, differ: MyersLinear)
    o = o.lines if o.is_a?(String)
    a = a.lines if a.is_a?(String)
    b = b.lines if b.is_a?(String)

    new(o, a, b, differ).merge
  end

  def initialize(o, a, b, differ)
    @o, @a, @b, @differ = o, a, b, differ
  end

  def merge
    setup
    generate_chunks
    Merge.new(@chunks)
  end
end

First, let’s look at the setup method, which initialises a few bits of state needed to run the algorithm. It creates an empty list called @chunks that we will append new chunks to as we discover them. Then it sets a @line variable for each document o, a and b to record how far we have scanned through all the documents; this begins at zero for all of them. Finally, we create two “match sets” that record which lines in each version are equal to the original, and identifies those matches by line number.

  def setup
    @chunks = []
    @line_o = @line_a = @line_b = 0

    @match_a = match_set(@a)
    @match_b = match_set(@b)
  end

  def match_set(file)
    matches = {}

    @differ.diff(@o, file).each do |line|
      matches[line.old_line] = line.new_line if line.type == :eql
    end

    matches
  end

For our ingredient list example above, the match sets will look like this: they map line numbers in the original version to the line numbers that match them in the a and b copies. We’ll use these to match up lines that are equal in all versions as we scan through.

    @match_a = { 1 => 1, 4 => 2, 5 => 3, 6 => 6 }
    @match_b = { 1 => 1, 2 => 3, 3 => 4, 5 => 5, 6 => 6 }

Now we come to the main body of the algorithm. The generate_chunks method loops through a sequence of steps for finding and recording matching and differing chunks until it’s reached the end of all the documents. Each turn of the loop performs the following steps:

  • Find the start of the next non-matching chunk, returning i as the number of lines away from our current position that is. (If we’re in a matching chunk, then the end of that chunk will be the same number of lines away in all three documents.)
  • If i is 1, then we’re already in a non-matching chunk and we need to find the start of the next matching one. We try to find the start of the next matching chunk and record o, a and b as the line offset in each document of the start of that chunk.
    • If we found a match and a and b are therefore set, we emit a chunk up to the offsets we just found.
    • Otherwise, there are no further matches and we emit the remainder of all documents as the final chunk.
  • If i exists and is not 1, then we’ve found the start of the next non-match and we can emit a chunk up to i steps from our current line offsets.
  • If i does not exist, then we’ve searched to the end of all three documents and we emit the remainder of all of them as the final chunk.

Here is an implementation that draws that process out, leaving the implementation of finding the next match or mismatch, and emitting chunks, to other methods.

  def generate_chunks
    loop do
      i = find_next_mismatch

      if i == 1
        o, a, b = find_next_match

        if a and b
          emit_chunk(o, a, b)
        else
          emit_final_chunk
          return
        end

      elsif i
        emit_chunk(@line_o + i, @line_a + i, @line_b + i)

      else
        emit_final_chunk
        return
      end
    end
  end

Now we just need the methods for finding the next match or mismatch, and for emitting chunks.

To find the start of the next mismatch, we start a counter i at 1, and step through each document line-by-line from our current position. If i has not counted past the end of all the documents, and if the current line in all the documents match, we increment i and keep looping. Once the loop stops, if i is still within the bounds of any of the documents then we return it, otherwise we return nothing.

The match? method works by using the match sets we created at the start of the process. Rather than scanning through and comparing the actual text of the documents for equality, we use these match sets as an index. We know that line (@line_a + i) in document @a matches line (@line_o + i) in the original, if @match_a contains a mapping from the original line number to that in @a.

  def find_next_mismatch
    i = 1
    while in_bounds?(i) and
          match?(@match_a, @line_a, i) and
          match?(@match_b, @line_b, i)
      i += 1
    end
    in_bounds?(i) ? i : nil
  end

  def in_bounds?(i)
    @line_o + i <= @o.size or
    @line_a + i <= @a.size or
    @line_b + i <= @b.size
  end

  def match?(matches, offset, i)
    matches[@line_o + i] == offset + i
  end

For example, when we run this code at the beginning of our recipe example, @line_o, @line_a and @line_b are all 0. We run the loop with i = 1, and checking the match sets we see that @match_a[1] == 1 and @match_b[1] == 1, so we try again with i = 2. On this turn, the value of @match_b[2] is 3 and @match_a doesn’t even have an entry for 2, so we’ve found the start of a non-matching chunk. This value of i means the next non-matching chunk begins on line 2 relative to our current position, so we can emit one line from each document as a matching chunk. The emit_chunk(@line_o + i, @line_a + i, @line_b + i) call in generate_chunks accomplishes this.

Finding the start of the next match is a little simpler. We start a counter o at one more than our current @line_o offset, and we increment it until either it exceeds the size of @o or until both match sets have an entry for that line number, indicating that both diffs leave that line unchanged. We then return the final value of o, and the corresponding line numbers from each match set. If we didn’t find any matches, these latter two values will be nil.

  def find_next_match
    o = @line_o + 1
    until o > @o.size or (@match_a.has_key?(o) and @match_b.has_key?(o))
      o += 1
    end
    [o, @match_a[o], @match_b[o]]
  end

Returning to our example, after emitting the first matching chunk, @line_o, @line_a and @line_b are all 1, and recall that the match sets are as follows, the line numbers in @o appearing on the left-hand side of the arrows:

    @match_a = { 1 => 1, 4 => 2, 5 => 3, 6 => 6 }
    @match_b = { 1 => 1, 2 => 3, 3 => 4, 5 => 5, 6 => 6 }

We start with o = 2. @match_b has an entry for 2 but @match_a does not, so we try o = 3. Again, @match_b has an entry but @match_a does not, so we try o = 4. This time, @match_a has an entry but @match_b does not. Finally we try o = 5 and see that both match sets contain an entry: @match_a[5] = 3 and @match_b[5] = 5. So, we return [5, 3, 5] in this instance and generate_chunks calls emit_chunk(5, 3, 5).

The methods for emitting chunks are what construct the output of the merge algorithm, and keep the line offsets up to date, moving them to the end of the chunk we just emitted in each document. emit_chunk creates a chunk from the current line offsets up to a given index in each document, and then sets the line offsets to the end of these ranges. We subtract one from all the offsets we’re given because the diffs and the finder methods use 1-indexed line numbers, but our @line variables are 0-indexed offsets from the start of each document. emit_final_chunk just emits whatever is left from all the documents from their current line offsets onwards.

  def emit_chunk(o, a, b)
    write_chunk(
      @o[@line_o ... o - 1],
      @a[@line_a ... a - 1],
      @b[@line_b ... b - 1])

    @line_o, @line_a, @line_b = o - 1, a - 1, b - 1
  end

  def emit_final_chunk
    write_chunk(
      @o[@line_o .. -1],
      @a[@line_a .. -1],
      @b[@line_b .. -1])
  end

Finally, we reach the end of our chain of methods with write_chunk. This takes a set of lines from each document and emits the appropriate kind of chunk depending on their contents. If all three sets are equal, then we emit a Clean chunk object containing the original version; if Alice’s chunk is equal to the original then we emit Bob’s and vice versa; and if neither is equal then we emit a Conflict chunk containing all three versions.

  def write_chunk(o, a, b)
    if o == a and o == b
      @chunks << Clean.new(o)
    elsif o == a
      @chunks << Clean.new(b)
    elsif o == b
      @chunks << Clean.new(a)
    else
      @chunks << Conflict.new(o, a, b)
    end
  end

Once generate_chunks finishes looping, the merge method returns Merge.new(@chunks), creating a data structure representing the result of the whole process. That data structure comes with a to_s method for turning the whole thing into a string that can be written back to a file, but keeping it as a structure lets us inspect it, particularly to check whether the merge was clean or not. While the to_s methods assume the original inputs were a series of lines of text, exposing the merge result as a data structure allows us to use this algorithm with other types of input.

We can also use these structures to generate the output text with added data that isn’t necessary for the merge algorithm itself, for example the file or commit names of the two versions being merged.

  Clean = Struct.new(:content) do
    def to_s(*)
      content.join("")
    end
  end

  Conflict = Struct.new(:o, :a, :b) do
    def to_s(a_name = "", b_name = "")
      text = ""
      separator(text, "<", a_name)
      a.each { |line| text << line }
      separator(text, "=")
      b.each { |line| text << line }
      separator(text, ">", b_name)
      text
    end

    def separator(text, char, name = "")
      text << char * 7 + " " + name + "\n"
    end
  end

  Merge = Struct.new(:chunks) do
    def clean?
      chunks.grep(Conflict).empty?
    end

    def to_s(a: "", b: "")
      chunks.map { |chunk| chunk.to_s(a, b) }.join("")
    end
  end

We now have a complete implementation that we can use to merge data from different sources, and print out the result:

o = File.read("original.txt")
a = File.read("alice.txt")
b = File.read("bob.txt")

merged = Diff3.merge(o, a, b)

puts merged.to_s(a: "alice.txt", b: "bob.txt")

# =>  celery
#     <<<<<<< alice.txt
#     salmon
#     =======
#     salmon
#     garlic
#     onions
#     >>>>>>> bob.txt
#     tomatoes
#     garlic
#     onions
#     wine

While the diff3 algorithm is relatively simple, it is highly sensitive to the output of the underlying diff algorithm. The diffs calculated should not produce different outcomes when the merge is clean, but they can lead to changes in the conflicts you get, in some cases leading to conflicts that are deeply misleading and surprising to the user. We’ll look at how this happens and what can be done about it in future articles.

If you’ve enjoyed this article, you might enjoy my recently published book JavaScript Testing Recipes. It’s full of simple techniques for writing modular, maintainable JavaScript apps in the browser and on the server.