If you enjoy this article, I have published a book explaining the internals of Git through implementation: Building Git.
—
In Merging with diff3 we saw how diff algorithms are used to compare versions of a file when performing a three-way merge, and in Myers diff in linear space we learned how different diff algorithms can produce wildly different results. Not only are these different results confusing to human readers, they also affect how well merging works. Before we discuss one final diff algorithm, I’d like to show some ways in which diff output can affect the success of a merge.
In the last article, we considered merging Alice and Bob’s modified versions of a shopping list:
Alice Original Bob
1. celery 1. celery 1. celery
2. salmon 2. garlic 2. salmon
3. tomatoes 3. onions 3. garlic
4. garlic 4. salmon 4. onions
5. onions 5. tomatoes 5. tomatoes
6. wine 6. wine 6. wine
Using our linear-space Myers implementation to perform the diff, the diff3 algorithm produced this result with a single conflicted region:
celery
<<<<<<< alice.txt
salmon
=======
salmon
garlic
onions
>>>>>>> bob.txt
tomatoes
garlic
onions
wine
Now, this outcome was based on taking the following diff between the original version and Alice’s changed copy:
Alice Original
1. celery 1. celery
- 2. garlic
- 3. onions
2. salmon 4. salmon
3. tomatoes 5. tomatoes
+ 4. garlic
+ 5. onions
6. wine 6. wine
However, this is not the only minimal diff between these two versions. An
equally valid diff, rather than moving garlic
and onions
, moves salmon
and
tomatoes
:
Alice Original
1. celery 1. celery
+ 2. salmon
+ 3. tomatoes
4. garlic 2. garlic
5. onions 3. onions
- 4. salmon
- 5. tomatoes
6. wine 6. wine
If we take this diff, then matching the regions of the files up produces these groups:
Alice Original Bob
1. celery 1. celery 1. celery A
-----------------------------------------------------------
2. salmon 2. salmon B
3. tomatoes
-----------------------------------------------------------
4. garlic 2. garlic 3. garlic C
5. onions 3. onions 4. onions
-----------------------------------------------------------
4. salmon 5. tomatoes D
5. tomatoes
-----------------------------------------------------------
6. wine 6. wine 6. wine E
Notice that this grouping includes two conflicted regions: in region B, both
Alice and Bob insert new items at the same point in the original version. In
region D, Alice removes salmon
and tomatoes
while Bob is missing only
salmon
. So the diff3 output based on this alternative diff for Alice will be:
celery
<<<<<<< alice.txt
salmon
tomatoes
=======
salmon
>>>>>>> bob.txt
garlic
onions
<<<<<<< alice.txt
=======
tomatoes
>>>>>>> bob.txt
wine
This is one small and contrived example of how a seemingly innocuous change in a diff can introduce new merge conflicts. In real-world situations, a bad diff can lead to really misleading conflicts that are hard to resolve correctly. In our discussion of linear-space Myers diff, we saw one particularly bad example of some C code that produced a confusing diff when two functions traded places in the source code. Here’s the original again:
// original.c
void Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
{
if (!Chunk_bounds_check(src, src_start, n)) return;
if (!Chunk_bounds_check(dst, dst_start, n)) return;
memcpy(dst->data + dst_start, src->data + src_start, n);
}
int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
{
if (chunk == NULL) return 0;
return start <= chunk->length && n <= chunk->length - start;
}
And this time, let’s say Alice swaps the order of the functions, and makes no other changes, producing this version:
// alice.c
int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
{
if (chunk == NULL) return 0;
return start <= chunk->length && n <= chunk->length - start;
}
void Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
{
if (!Chunk_bounds_check(src, src_start, n)) return;
if (!Chunk_bounds_check(dst, dst_start, n)) return;
memcpy(dst->data + dst_start, src->data + src_start, n);
}
Meanwhile, Bob takes the original version and makes one small change, adding a
comment before the memcpy
call in the Chunk_copy
function.
// bob.c
void Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
{
if (!Chunk_bounds_check(src, src_start, n)) return;
if (!Chunk_bounds_check(dst, dst_start, n)) return;
// copy the bytes
memcpy(dst->data + dst_start, src->data + src_start, n);
}
int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
{
if (chunk == NULL) return 0;
return start <= chunk->length && n <= chunk->length - start;
}
Using the linear-space Myers algorithm, the diff between the original and Alice’s version, rather than clearly showing the functions trading places, is the following (this is the diff Git produces by default):
-void Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
+int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
{
- if (!Chunk_bounds_check(src, src_start, n)) return;
- if (!Chunk_bounds_check(dst, dst_start, n)) return;
+ if (chunk == NULL) return 0;
- memcpy(dst->data + dst_start, src->data + src_start, n);
+ return start <= chunk->length && n <= chunk->length - start;
}
-int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
+void Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
{
- if (chunk == NULL) return 0;
+ if (!Chunk_bounds_check(src, src_start, n)) return;
+ if (!Chunk_bounds_check(dst, dst_start, n)) return;
- return start <= chunk->length && n <= chunk->length - start;
+ memcpy(dst->data + dst_start, src->data + src_start, n);
}
The important thing about this diff from the point of view of the diff3 merge algorithm is that it has a very fragmented view of which regions of the file remain unchanged. Rather than showing one function not changing while the other is deleted and inserted somewhere else, this diff shows many isolated lines matching up, and most of those matches are common boilerplate like blank lines and braces rather than meaningfully distinct pieces of code.
This means that diff3 will struggle to meaningfully match up content in Alice and Bob’s versions, so conflicts are much more likely, and Bob’s extra comment will be positioned relative to the matching blank lines and braces, rather than relative to distinct program statements.
Indeed, when we merge these two versions using linear-space Myers as the diff algorithm, this is the result:
int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
{
if (chunk == NULL) return 0;
<<<<<<< alice.c
return start <= chunk->length && n <= chunk->length - start;
=======
// copy the bytes
memcpy(dst->data + dst_start, src->data + src_start, n);
>>>>>>> bob.c
}
void Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
{
if (!Chunk_bounds_check(src, src_start, n)) return;
if (!Chunk_bounds_check(dst, dst_start, n)) return;
memcpy(dst->data + dst_start, src->data + src_start, n);
}
(Again, this is what Git will give you if you make it perform this merge without telling it to use an alternative diff algorithm.)
Although the conflict here is surprisingly small given the messy diff above, the
conflict is incredibly misleading: it suggests that the memcpy
call and the
comment Bob added appear in the Chunk_bounds_check
function, not the
Chunk_copy
function as is the case. If this merge is being resolved by
somebody who is not familiar with this code, they could easily keep Bob’s
addition in its suggested place, introducing a bug into the program.
This conflict happens because diff3 has matched up the blank lines and braces from Alice’s diff with the blank lines and braces in Bob’s copy. Bob’s added comment appears after one opening brace and one blank line in his version, while after one matching opening brace and one blank line in the diff, Alice has the change:
- memcpy(dst->data + dst_start, src->data + src_start, n);
+ return start <= chunk->length && n <= chunk->length - start;
Therefore, Bob’s comment is considered to conflict with this change of Alice’s.
A better outcome would be to just keep Alice’s version here – the return start
<= ...
line – and add Bob’s comment to Chunk_copy
, with the functions having
traded places:
int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
{
if (chunk == NULL) return 0;
return start <= chunk->length && n <= chunk->length - start;
}
void Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
{
if (!Chunk_bounds_check(src, src_start, n)) return;
if (!Chunk_bounds_check(dst, dst_start, n)) return;
// copy the bytes
memcpy(dst->data + dst_start, src->data + src_start, n);
}
This is perfectly possible if we obtain a diff for Alice that leaves more distinct lines of code unchanged, for example the diff given by our original Myers implementation that more clearly shows one function moving while the other does not change:
+int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
+{
+ if (chunk == NULL) return 0;
+
+ return start <= chunk->length && n <= chunk->length - start;
+}
+
void Chunk_copy(Chunk *src, size_t src_start, Chunk *dst, size_t dst_start, size_t n)
{
if (!Chunk_bounds_check(src, src_start, n)) return;
if (!Chunk_bounds_check(dst, dst_start, n)) return;
memcpy(dst->data + dst_start, src->data + src_start, n);
}
-
-int Chunk_bounds_check(Chunk *chunk, size_t start, size_t n)
-{
- if (chunk == NULL) return 0;
-
- return start <= chunk->length && n <= chunk->length - start;
-}
With this diff, much of the body of the Chunk_copy
function will be matched up
with the corresponding lines in Bob’s version, and Bob’s comment can easily be
inserted at the correct place.
If you weren’t aware of how the merge process works, you might see the
conflicted merge that places Bob’s comment inside the Chunk_bounds_check
function and be tempted to ask why something so obviously wrong could happen.
Hopefully the above makes it clear: diff and merge algorithms don’t have any
understanding of C or of any other programming language or format, and so they
don’t recognise Bob’s comment as being inside any particular function. They
don’t even know what comments and functions are! All they are doing is matching
up equal lines of text, and the lines that the diff denotes as equal determine
the outcome.
But then you might ask: why organise things around lines? Why not some other unit like characters or words, or why not have tools that do understand programming languages and can perform diffs and merges on parse trees rather than flat text? Well, it all comes down to trade-offs. Let’s look at merging based on words first. We can take a sentence, and create two versions of it with more words added, and merge those versions just fine:
def file(words)
Merge::File.new("", Diff.lines(words))
end
o = file(%w[the fox jumps over the dog])
a = file(%w[the quick brown fox jumps over the dog])
b = file(%w[the fox jumps over the lazy dog])
merged = Diff3.merge(o, a, b)
puts merged.chunks.flat_map(&:content).join(" ")
# => "the quick brown fox jumps over the lazy dog"
This works because the two inserted phrases are clearly separated by a list of
words that haven’t changed, so they can be matched up between versions in such a
way that the two new words can both be added at the right place. But, it makes a
big assumption about what a word is. In some languages, edit-distance
is a
single variable name, while in others it is an expression meaning “edit
minus
distance
”. The concept of word is too language-dependent for a
general-purpose tool like Git to make a reasonable choice that will work for all
users. Given this ambiguity, you could diff at the character level, but that’s
more likely to result in unconflicted merges whose result is meaningless code.
Going in the opposite direction and diffing on parse trees rather than simple text units has a similar problem of generality: rather than one diff program that anyone can use, every language out there would need its own diff tools, and they would all need to interoperate since most non-trivial projects contain code in multiple languages. This would clearly be a lot more work than having a single version control system for everybody, and the added complexity may result in more bugs and confusion than it’s worth, especially when you consider that programming languages change over time and we’d need a way to tell which version we were parsing and a new parser for each version. It’s not impossible, but it would be more expensive to build and maintain than a general-purpose tool.
And so, while diffing based on lines is by no means perfect, especially when dealing with formatting changes like changing indentation, it’s good enough considering the cost of implementing anything significantly better. In most programming languages, a line of code is a reasonable unit to talk about adding or removing, and so diffing on lines will work reasonably well for most use cases. It works well enough while being simple to implement, making it a good trade-off for a general-purpose tool.
So, if we can’t change the basis unit of text that we calculate diffs on, what
else can we do to make merges more likely to succeed? The answer is to use
better diff algorithms. Let’s take a look at how you can do this in Git, using
our C example from above. We’ll make a new directory, start a Git repository in
it, and commit the file original.c
from above:
$ mkdir src
$ cd src
$ git init
# put code in file.c
$ git add file.c
$ git commit -m "first commit"
Now, let’s start a new branch for Alice’s changes. Create this branch, edit
file.c
to change the order of the two functions, then commit:
$ git checkout -b alice
# edit file.c
$ git commit -am "swap functions"
If you run git show
at this point, you’ll see the messy diff that matches
blank lines and braces, rather than the one that clearly shows one function
moving. However, if you instead run this:
$ git show --diff-algorithm=patience
Then you’ll see a much better diff, where Chunk_bounds_check
is clearly
removed from after Chunk_copy
and inserted before it.
Now, fork another branch off of master
and add a comment before memcpy
, just
as Bob did:
$ git checkout master
$ git checkout -b bob
# edit file.c
$ git commit -am "add comment"
Then switch back to the alice
branch and try to merge:
$ git checkout alice
Switched to branch 'alice'
$ git merge bob
Auto-merging file.c
CONFLICT (content): Merge conflict in file.c
Automatic merge failed; fix conflicts and then commit the result.
Because Git’s default diff algorithm generates a messy diff for Alice’s changes, the merge conflicts just as we saw above. Looking at the file, we see the exact same merge conflict that we generated using our own implementations:
<<<<<<< HEAD
return start <= chunk->length && n <= chunk->length - start;
=======
// copy the bytes
memcpy(dst->data + dst_start, src->data + src_start, n);
>>>>>>> bob
Finally, let’s throw this merge away and try again with a different algorithm:
$ git reset --hard
$ git merge -Xdiff-algorithm=patience bob
Auto-merging file.c
Merge made by the 'recursive' strategy.
file.c | 1 +
1 file changed, 1 insertion(+)
By changing the diff algorithm, we’ve made Git perform the merge without conflicts. That doesn’t necessarily mean the code will be correct – merges can be unconflicted but still result in broken code – but it’s certainly better than the misleading conflict we got from the messy diff.
In the next and final article in this series on Git diffs, we’ll take a look at the Patience algorithm and find out why it produces better diff and merge results.