Why merges fail and what can be done about it

If you enjoy this article, I have published a book explaining the internals of Git through implementation: Building Git.

—

In Merging with diff3 we saw how diff algorithms are used to compare versions of a file when performing a three-way merge, and in Myers diff in linear space we learned how different diff algorithms can produce wildly different results. Not only are these different results confusing to human readers, they also affect how well merging works. Before we discuss one final diff algorithm, I’d like to show some ways in which diff output can affect the success of a merge.

In the last article, we considered merging Alice and Bob’s modified versions of a shopping list:

         Alice               Original            Bob

celery           1. celery           1. celery
salmon           2. garlic           2. salmon
tomatoes         3. onions           3. garlic
garlic           4. salmon           4. onions
onions           5. tomatoes         5. tomatoes
wine             6. wine             6. wine

Using our linear-space Myers implementation to perform the diff, the diff3 algorithm produced this result with a single conflicted region:

      celery
      <<<<<<< alice.txt
      salmon
      =======
      salmon
      garlic
      onions
      >>>>>>> bob.txt
      tomatoes
      garlic
      onions
      wine

Now, this outcome was based on taking the following diff between the original version and Alice’s changed copy:

         Alice               Original

      1. celery           1. celery
-                         2. garlic                           
-                         3. onions                           
      2. salmon           4. salmon
      3. tomatoes         5. tomatoes
+     4. garlic                                               
+     5. onions                                               
      6. wine             6. wine

However, this is not the only minimal diff between these two versions. An equally valid diff, rather than moving garlic and onions, moves salmon and tomatoes:

         Alice               Original

      1. celery           1. celery
+     2. salmon                                               
+     3. tomatoes                                             
      4. garlic           2. garlic
      5. onions           3. onions
-                         4. salmon                           
-                         5. tomatoes                         
      6. wine             6. wine

If we take this diff, then matching the regions of the files up produces these groups:

         Alice               Original            Bob

      1. celery           1. celery           1. celery         A
      -----------------------------------------------------------
      2. salmon                               2. salmon         B
      3. tomatoes
      -----------------------------------------------------------
      4. garlic           2. garlic           3. garlic         C
      5. onions           3. onions           4. onions
      -----------------------------------------------------------
                          4. salmon           5. tomatoes       D
                          5. tomatoes
      -----------------------------------------------------------
      6. wine             6. wine             6. wine           E

Notice that this grouping includes two conflicted regions: in region B, both Alice and Bob insert new items at the same point in the original version. In region D, Alice removes salmon and tomatoes while Bob is missing only salmon. So the diff3 output based on this alternative diff for Alice will be:

      celery
      <<<<<<< alice.txt
      salmon
      tomatoes
      ======= 
      salmon
      >>>>>>> bob.txt
      garlic
      onions
      <<<<<<< alice.txt
      ======= 
      tomatoes
      >>>>>>> bob.txt
      wine

This is one small and contrived example of how a seemingly innocuous change in a diff can introduce new merge conflicts. In real-world situations, a bad diff can lead to really misleading conflicts that are hard to resolve correctly. In our discussion of linear-space Myers diff, we saw one particularly bad example of some C code that produced a confusing diff when two functions traded places in the source code. Here’s the original again:

// original.c

void Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
{
    if (!Chunk_bounds_check(src, src_start, n)) return;
    if (!Chunk_bounds_check(dst, dst_start, n)) return;

    memcpy(dst->data + dst_start, src->data + src_start, n);
}

int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
{
    if (chunk == NULL) return 0;

    return start <= chunk->length && n <= chunk->length - start;
}

And this time, let’s say Alice swaps the order of the functions, and makes no other changes, producing this version:

// alice.c

int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
{
    if (chunk == NULL) return 0;

    return start <= chunk->length && n <= chunk->length - start;
}

void Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
{
    if (!Chunk_bounds_check(src, src_start, n)) return;
    if (!Chunk_bounds_check(dst, dst_start, n)) return;

    memcpy(dst->data + dst_start, src->data + src_start, n);
}

Meanwhile, Bob takes the original version and makes one small change, adding a comment before the memcpy call in the Chunk_copy function.

// bob.c

void Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
{
    if (!Chunk_bounds_check(src, src_start, n)) return;
    if (!Chunk_bounds_check(dst, dst_start, n)) return;

    // copy the bytes
    memcpy(dst->data + dst_start, src->data + src_start, n);
}

int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
{
    if (chunk == NULL) return 0;

    return start <= chunk->length && n <= chunk->length - start;
}

Using the linear-space Myers algorithm, the diff between the original and Alice’s version, rather than clearly showing the functions trading places, is the following (this is the diff Git produces by default):

-void Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
+int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
 {
-    if (!Chunk_bounds_check(src, src_start, n)) return;
-    if (!Chunk_bounds_check(dst, dst_start, n)) return;
+    if (chunk == NULL) return 0;

-    memcpy(dst->data + dst_start, src->data + src_start, n);
+    return start <= chunk->length && n <= chunk->length - start;
 }

-int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
+void Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
 {
-    if (chunk == NULL) return 0;
+    if (!Chunk_bounds_check(src, src_start, n)) return;
+    if (!Chunk_bounds_check(dst, dst_start, n)) return;

-    return start <= chunk->length && n <= chunk->length - start;
+    memcpy(dst->data + dst_start, src->data + src_start, n);
 }

The important thing about this diff from the point of view of the diff3 merge algorithm is that it has a very fragmented view of which regions of the file remain unchanged. Rather than showing one function not changing while the other is deleted and inserted somewhere else, this diff shows many isolated lines matching up, and most of those matches are common boilerplate like blank lines and braces rather than meaningfully distinct pieces of code.

This means that diff3 will struggle to meaningfully match up content in Alice and Bob’s versions, so conflicts are much more likely, and Bob’s extra comment will be positioned relative to the matching blank lines and braces, rather than relative to distinct program statements.

Indeed, when we merge these two versions using linear-space Myers as the diff algorithm, this is the result:

int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
{
    if (chunk == NULL) return 0;

<<<<<<< alice.c
    return start <= chunk->length && n <= chunk->length - start;
=======
    // copy the bytes
    memcpy(dst->data + dst_start, src->data + src_start, n);
>>>>>>> bob.c
}

void Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
{
    if (!Chunk_bounds_check(src, src_start, n)) return;
    if (!Chunk_bounds_check(dst, dst_start, n)) return;

    memcpy(dst->data + dst_start, src->data + src_start, n);
}

(Again, this is what Git will give you if you make it perform this merge without telling it to use an alternative diff algorithm.)

Although the conflict here is surprisingly small given the messy diff above, the conflict is incredibly misleading: it suggests that the memcpy call and the comment Bob added appear in the Chunk_bounds_check function, not the Chunk_copy function as is the case. If this merge is being resolved by somebody who is not familiar with this code, they could easily keep Bob’s addition in its suggested place, introducing a bug into the program.

This conflict happens because diff3 has matched up the blank lines and braces from Alice’s diff with the blank lines and braces in Bob’s copy. Bob’s added comment appears after one opening brace and one blank line in his version, while after one matching opening brace and one blank line in the diff, Alice has the change:

-     memcpy(dst->data + dst_start, src->data + src_start, n);
+     return start <= chunk->length && n <= chunk->length - start;

Therefore, Bob’s comment is considered to conflict with this change of Alice’s. A better outcome would be to just keep Alice’s version here – the return start <= ... line – and add Bob’s comment to Chunk_copy, with the functions having traded places:

int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
{
    if (chunk == NULL) return 0;

    return start <= chunk->length && n <= chunk->length - start;
}

void Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
{
    if (!Chunk_bounds_check(src, src_start, n)) return;
    if (!Chunk_bounds_check(dst, dst_start, n)) return;

    // copy the bytes
    memcpy(dst->data + dst_start, src->data + src_start, n);
}

This is perfectly possible if we obtain a diff for Alice that leaves more distinct lines of code unchanged, for example the diff given by our original Myers implementation that more clearly shows one function moving while the other does not change:

+int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
+{
+    if (chunk == NULL) return 0;
+
+    return start <= chunk->length && n <= chunk->length - start;
+}
+
 void Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
 {
     if (!Chunk_bounds_check(src, src_start, n)) return;
     if (!Chunk_bounds_check(dst, dst_start, n)) return;

     memcpy(dst->data + dst_start, src->data + src_start, n);
 }
-
-int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
-{
-    if (chunk == NULL) return 0;
-
-    return start <= chunk->length && n <= chunk->length - start;
-}

With this diff, much of the body of the Chunk_copy function will be matched up with the corresponding lines in Bob’s version, and Bob’s comment can easily be inserted at the correct place.

If you weren’t aware of how the merge process works, you might see the conflicted merge that places Bob’s comment inside the Chunk_bounds_check function and be tempted to ask why something so obviously wrong could happen. Hopefully the above makes it clear: diff and merge algorithms don’t have any understanding of C or of any other programming language or format, and so they don’t recognise Bob’s comment as being inside any particular function. They don’t even know what comments and functions are! All they are doing is matching up equal lines of text, and the lines that the diff denotes as equal determine the outcome.

But then you might ask: why organise things around lines? Why not some other unit like characters or words, or why not have tools that do understand programming languages and can perform diffs and merges on parse trees rather than flat text? Well, it all comes down to trade-offs. Let’s look at merging based on words first. We can take a sentence, and create two versions of it with more words added, and merge those versions just fine:

def file(words)
  Merge::File.new("", Diff.lines(words))
end

o = file(%w[the fox jumps over the dog])
a = file(%w[the quick brown fox jumps over the dog])
b = file(%w[the fox jumps over the lazy dog])

merged = Diff3.merge(o, a, b)

puts merged.chunks.flat_map(&:content).join(" ")
# => "the quick brown fox jumps over the lazy dog"

This works because the two inserted phrases are clearly separated by a list of words that haven’t changed, so they can be matched up between versions in such a way that the two new words can both be added at the right place. But, it makes a big assumption about what a word is. In some languages, edit-distance is a single variable name, while in others it is an expression meaning “edit minus distance”. The concept of word is too language-dependent for a general-purpose tool like Git to make a reasonable choice that will work for all users. Given this ambiguity, you could diff at the character level, but that’s more likely to result in unconflicted merges whose result is meaningless code.

Going in the opposite direction and diffing on parse trees rather than simple text units has a similar problem of generality: rather than one diff program that anyone can use, every language out there would need its own diff tools, and they would all need to interoperate since most non-trivial projects contain code in multiple languages. This would clearly be a lot more work than having a single version control system for everybody, and the added complexity may result in more bugs and confusion than it’s worth, especially when you consider that programming languages change over time and we’d need a way to tell which version we were parsing and a new parser for each version. It’s not impossible, but it would be more expensive to build and maintain than a general-purpose tool.

And so, while diffing based on lines is by no means perfect, especially when dealing with formatting changes like changing indentation, it’s good enough considering the cost of implementing anything significantly better. In most programming languages, a line of code is a reasonable unit to talk about adding or removing, and so diffing on lines will work reasonably well for most use cases. It works well enough while being simple to implement, making it a good trade-off for a general-purpose tool.

So, if we can’t change the basis unit of text that we calculate diffs on, what else can we do to make merges more likely to succeed? The answer is to use better diff algorithms. Let’s take a look at how you can do this in Git, using our C example from above. We’ll make a new directory, start a Git repository in it, and commit the file original.c from above:

$ mkdir src
$ cd src
$ git init
# put code in file.c
$ git add file.c
$ git commit -m "first commit"

Now, let’s start a new branch for Alice’s changes. Create this branch, edit file.c to change the order of the two functions, then commit:

$ git checkout -b alice
# edit file.c
$ git commit -am "swap functions"

If you run git show at this point, you’ll see the messy diff that matches blank lines and braces, rather than the one that clearly shows one function moving. However, if you instead run this:

$ git show --diff-algorithm=patience

Then you’ll see a much better diff, where Chunk_bounds_check is clearly removed from after Chunk_copy and inserted before it.

Now, fork another branch off of master and add a comment before memcpy, just as Bob did:

$ git checkout master
$ git checkout -b bob
# edit file.c
$ git commit -am "add comment"

Then switch back to the alice branch and try to merge:

$ git checkout alice
Switched to branch 'alice'

$ git merge bob
Auto-merging file.c
CONFLICT (content): Merge conflict in file.c
Automatic merge failed; fix conflicts and then commit the result.

Because Git’s default diff algorithm generates a messy diff for Alice’s changes, the merge conflicts just as we saw above. Looking at the file, we see the exact same merge conflict that we generated using our own implementations:

<<<<<<< HEAD
    return start <= chunk->length && n <= chunk->length - start;
=======
    // copy the bytes
    memcpy(dst->data + dst_start, src->data + src_start, n);
>>>>>>> bob

Finally, let’s throw this merge away and try again with a different algorithm:

$ git reset --hard
$ git merge -Xdiff-algorithm=patience bob
Auto-merging file.c
Merge made by the 'recursive' strategy.
 file.c | 1 +
 1 file changed, 1 insertion(+)

By changing the diff algorithm, we’ve made Git perform the merge without conflicts. That doesn’t necessarily mean the code will be correct – merges can be unconflicted but still result in broken code – but it’s certainly better than the misleading conflict we got from the messy diff.

In the next and final article in this series on Git diffs, we’ll take a look at the Patience algorithm and find out why it produces better diff and merge results.

The If Works

by James Coglan

Why merges fail and what can be done about it