If you enjoy this article, I have published a book explaining the internals of Git through implementation: Building Git.
—
Over the course of the last few articles, we’ve investigated how Git calculates
diffs between different versions of a file, first looking at the Myers diff
algorithm and then its linear space variant. As a programmer, you are
probably most familiar with diffs as a way for you to see what’s changed; you
see them whenever you run the git diff
or git show
commands, or when you use
GitHub’s compare view or read a pull request. But diffs aren’t only for human
consumption, although their design is strongly influenced by human expectations.
They also form an ingredient in other computations that Git has to perform, the
most common one being merging.
Merging happens whenever two or more concurrent branches have modified the same
file, and we want to reconcile all those changes into a single final version.
Say Alice and Bob have both forked from the master branch of a project, where a
version of some file exists; we will call this version v1
. Alice modifies the
file on their branch, producing version v2
of the file, and Bob makes a
different change to their copy, producing yet another version v3
.
-------o [alice]
/ v2
/
[master] o
v1 \
\
-------o [bob]
v3
We now want to merge the branches back together, reconciling both Alice’s and
Bob’s changes to produce a final version, v4
.
-------o------
/ v2 \
/ \
[master] o o v4
v1 \ /
\ /
-------o------
v3
Merging is the process that takes v2
and v3
and combines them to find this
final version. When you use Git, you might not think much about how this process
works, because most of the time it does just what you expect. Multiple people
can change different sections of the same file and all their edits are
incorporated. But a moment’s thought suggests something more complex is going
on; it’s not sufficient to say something like “Alice added something on line 3
and Bob deleted line 5”, because Alice adding something on line 3 moves all the
lines following it. Deleting whatever is now at line 5 after applying that
change would not remove the content Bob intended, in fact it might even delete
something Alice added!
This problem reveals the core problem with merging: we don’t know what order to apply changes in. If we apply the operation “delete line 5” first, then the meaning of “insert something at line 3” is probably unchanged, whereas if we perform the insertion first, then “delete line 5” probably needs adjusting somehow to take the changed line offsets into account, and it may no longer make sense as an operation any more. Merge conflicts arise because the commit graph is partially ordered; when commits occur on concurrent branches we don’t know which order they should be interpreted in. Nevertheless, this model of changes is commonly used in applications like real-time collaborative editing, and goes by the name operational transformation. There are many different protocols for it that depend on the type of data being modified and the interaction model for collaboration.
Some version control systems use this model, and their primary means of storing commits represents them as diffs or patches. But Git doesn’t use that model; it stores commits as complete snapshots of your content, rather than storing the differences between versions. If it doesn’t store changes, how can it merge them? Well, another way to merge changes is to take the two versions you want to merge, and compare them to the last version they had in common, effectively reconstructing the set of changes from the texts themselves. These diffs are then used to merge the two versions together relative to their common ancestor.
To perform this task, Git and many other version control systems use the diff3
algorithm. diff3
is a Unix utility originally created in 1988 by Randy
Smith, and described formally in a paper by Sanjeev Khanna, Keshav Kunal
and Benjamin C. Pierce. The algorithm takes as input three texts: the two
versions to be merged, and the original version they both derive from, and
produces a single merged version, which may contain conflicts if some sections
have been edited by both parties. The use of three files means this process is
referred to as a three-way merge.
Let’s walk through an example to see how it works. This is adapted from the one given in section 2 of the Khanna, Kunal and Pierce paper. Say Alice and Bob are part of a team developing a recipe book. One of the recipes is for a fish soup and includes the following list of ingredients:
1. celery
2. garlic
3. onions
4. salmon
5. tomatoes
6. wine
Now, Alice and Bob are both going through the book and making edits. They each
get their own copy of the original version and make some changes to it. Alice
takes garlic
and onions
and moves them to appear after tomatoes
in the
list, while Bob moves salmon
to appear second.
Alice Original Bob
1. celery 1. celery 1. celery
2. salmon 2. garlic 2. salmon
3. tomatoes 3. onions 3. garlic
4. garlic 4. salmon 4. onions
5. onions 5. tomatoes 5. tomatoes
6. wine 6. wine 6. wine
To merge the changes made by Alice and Bob into a single document, the first thing diff3 does is calculate a diff between the original copy and Alice’s version, and between the original and Bob’s. The algorithm used to calculate the diff can make a big difference to the result of the merge, as we’ll see later, but for now let’s assume the diffs come out as follows. Here is Alice’s:
Alice Original
1. celery 1. celery
- 2. garlic
- 3. onions
2. salmon 4. salmon
3. tomatoes 5. tomatoes
+ 4. garlic
+ 5. onions
6. wine 6. wine
And here is Bob’s:
Original Bob
1. celery 1. celery
+ 2. salmon
2. garlic 3. garlic
3. onions 4. onions
- 4. salmon
5. tomatoes 5. tomatoes
6. wine 6. wine
Once we have these diffs, the next thing we do is find any lines that are
unchanged in both diffs. If you look at the diffs above you’ll see that they
both have matching line content for celery
(line 1 in all versions),
tomatoes
(line 3 in Alice’s version, line 5 in the others), and wine
(line 6
in all versions). We align the lines of the documents into chunks, so that these
unchanged lines are matched up:
Alice Original Bob
1. celery 1. celery 1. celery A
-----------------------------------------------------------
2. garlic 2. salmon B
2. salmon 3. onions 3. garlic
4. salmon 4. onions
-----------------------------------------------------------
3. tomatoes 5. tomatoes 5. tomatoes C
-----------------------------------------------------------
4. garlic D
5. onions
-----------------------------------------------------------
6. wine 6. wine 6. wine E
Having done this, the merge result is generated by scanning through these chunks. For chunks where all three versions agree – chunks A, C and E above – we can just output the original text since neither Alice nor Bob changed it. For chunks like B where both Alice and Bob differ from the original, we have a conflict; the merge algorithm, having no understanding of the meaning of the text, cannot decide how to resolve this and the conflict is emitted for the user to resolve by hand. In contrast, in chunk D Bob’s version is equal to the original (in this case it is empty) while Alice’s differs. In chunks like this where only one new version differs from the original, we pick the version that changed.
To summarise, for each block we emit the following:
- A: all equal; emit
celery
- B: Both differ; emit all versions as a conflict
- C: all equal; emit
tomatoes
- D: Alice differs; emit
garlic
,onions
- E: all equal: emit
wine
When Git emits these results as into the target file, conflicts are denoted by a
<<<<<<<
line, followed by Alice’s version, then =======
followed by Bob’s
version, then a >>>>>>>
line:
celery
<<<<<<< Alice
salmon
=======
salmon
garlic
onions
>>>>>>> Bob
tomatoes
garlic
onions
wine
If you set merge.conflictStyle = diff3
in your Git config, or use the
standalone diff3
program, you’ll also see the original version of a conflicted
chunk, denoted by the |||||||
line:
celery
<<<<<<< Alice
salmon
||||||| Original
garlic
onions
salmon
=======
salmon
garlic
onions
>>>>>>> Bob
tomatoes
garlic
onions
wine
Now that we understand at a high level how the algorithm works, let’s look at
implementing it. I’m going to create a class for doing this, since the
algorithm uses several bits of state that I’d rather store as instance
variables, rather than passing them between functions. We’ll start by creating a
Diff3.merge
method that takes o
, a
and b
; the original version and the
two copies to be merged. It also takes an optional argument called differ
that
specifies the diff algorithm to use, this will default to our linear Myers diff
implementation.
This class will take three Merge::File
objects, each of which has a name
and
a list of Line
objects. We wrap the files up like this because if we need to
display a merge conflict, it’s useful to display the filename (or commit ID or
message) of each changed version in the final output. Let’s create that
structure and a helper function for creating one from a file path.
module Merge
File = Struct.new(:name, :lines)
def self.file(path)
lines = Diff.lines(::File.read(path))
File.new(path, lines)
end
end
After creating a new Diff3
object we call merge
on it, which is going to set
up some initial state, then generate the output chunks, and then return a new
Merge
object containing those chunks. We’ll see the Merge
class in more
detail later.
class Diff3
def self.merge(o, a, b, differ: MyersLinear)
new(o, a, b, differ).merge
end
def initialize(o, a, b, differ)
@o, @a, @b, @differ = o, a, b, differ
end
def merge
setup
generate_chunks
Merge.new(@o, @a, @b, @chunks)
end
end
First, let’s look at the setup
method, which initialises a few bits of state
needed to run the algorithm. It creates an empty list called @chunks
that we
will append new chunks to as we discover them. Then it sets a @line
variable
for each document o
, a
and b
to record how far we have scanned through all
the documents; this begins at zero for all of them. Finally, we create two
“match sets” that record which lines in each version are equal to the original,
and identifies those matches by line number.
def setup
@chunks = []
@line_o = @line_a = @line_b = 0
@match_a = match_set(@a)
@match_b = match_set(@b)
end
def match_set(file)
matches = {}
@differ.diff(@o.lines, file.lines).each do |edit|
matches[edit.old_line.number] = edit.new_line.number if edit.type == :eql
end
matches
end
For our ingredient list example above, the match sets will look like this: they
map line numbers in the original version to the line numbers that match them in
the a
and b
copies. We’ll use these to match up lines that are equal in all
versions as we scan through.
@match_a = { 1 => 1, 4 => 2, 5 => 3, 6 => 6 }
@match_b = { 1 => 1, 2 => 3, 3 => 4, 5 => 5, 6 => 6 }
Now we come to the main body of the algorithm. The generate_chunks
method
loops through a sequence of steps for finding and recording matching and
differing chunks until it’s reached the end of all the documents. Each turn of
the loop performs the following steps:
- Find the start of the next non-matching chunk, returning
i
as the number of lines away from our current position that is. (If we’re in a matching chunk, then the end of that chunk will be the same number of lines away in all three documents.) - If
i
is 1, then we’re already in a non-matching chunk and we need to find the start of the next matching one. We try to find the start of the next matching chunk and recordo
,a
andb
as the line offset in each document of the start of that chunk.- If we found a match and
a
andb
are therefore set, we emit a chunk up to the offsets we just found. - Otherwise, there are no further matches and we emit the remainder of all documents as the final chunk.
- If we found a match and
- If
i
exists and is not 1, then we’ve found the start of the next non-match and we can emit a chunk up toi
steps from our current line offsets. - If
i
does not exist, then we’ve searched to the end of all three documents and we emit the remainder of all of them as the final chunk.
Here is an implementation that draws that process out, leaving the implementation of finding the next match or mismatch, and emitting chunks, to other methods.
def generate_chunks
loop do
i = find_next_mismatch
if i == 1
o, a, b = find_next_match
if a and b
emit_chunk(o, a, b)
else
emit_final_chunk
return
end
elsif i
emit_chunk(@line_o + i, @line_a + i, @line_b + i)
else
emit_final_chunk
return
end
end
end
Now we just need the methods for finding the next match or mismatch, and for emitting chunks.
To find the start of the next mismatch, we start a counter i
at 1, and step
through each document line-by-line from our current position. If i
has not
counted past the end of all the documents, and if the current line in all the
documents match, we increment i
and keep looping. Once the loop stops, if i
is still within the bounds of any of the documents then we return it, otherwise
we return nothing.
The match?
method works by using the match sets we created at the start of the
process. Rather than scanning through and comparing the actual text of the
documents for equality, we use these match sets as an index. We know that line
(@line_a + i)
in document @a
matches line (@line_o + i)
in the original,
if @match_a
contains a mapping from the original line number to that in @a
.
def find_next_mismatch
i = 1
while in_bounds?(i) and
match?(@match_a, @line_a, i) and
match?(@match_b, @line_b, i)
i += 1
end
in_bounds?(i) ? i : nil
end
def in_bounds?(i)
@line_o + i <= @o.lines.size or
@line_a + i <= @a.lines.size or
@line_b + i <= @b.lines.size
end
def match?(matches, offset, i)
matches[@line_o + i] == offset + i
end
For example, when we run this code at the beginning of our recipe example,
@line_o
, @line_a
and @line_b
are all 0
. We run the loop with i = 1
,
and checking the match sets we see that @match_a[1] == 1
and @match_b[1] ==
1
, so we try again with i = 2
. On this turn, the value of @match_b[2]
is
3
and @match_a
doesn’t even have an entry for 2
, so we’ve found the start
of a non-matching chunk. This value of i
means the next non-matching chunk
begins on line 2 relative to our current position, so we can emit one line from
each document as a matching chunk. The emit_chunk(@line_o + i, @line_a + i,
@line_b + i)
call in generate_chunks
accomplishes this.
Finding the start of the next match is a little simpler. We start a counter o
at one more than our current @line_o
offset, and we increment it until either
it exceeds the size of @o
or until both match sets have an entry for that line
number, indicating that both diffs leave that line unchanged. We then return the
final value of o
, and the corresponding line numbers from each match set. If
we didn’t find any matches, these latter two values will be nil
.
def find_next_match
o = @line_o + 1
until o > @o.lines.size or (@match_a.has_key?(o) and @match_b.has_key?(o))
o += 1
end
[o, @match_a[o], @match_b[o]]
end
Returning to our example, after emitting the first matching chunk, @line_o
,
@line_a
and @line_b
are all 1
, and recall that the match sets are as
follows, the line numbers in @o
appearing on the left-hand side of the arrows:
@match_a = { 1 => 1, 4 => 2, 5 => 3, 6 => 6 }
@match_b = { 1 => 1, 2 => 3, 3 => 4, 5 => 5, 6 => 6 }
We start with o = 2
. @match_b
has an entry for 2
but @match_a
does not,
so we try o = 3
. Again, @match_b
has an entry but @match_a
does not, so we
try o = 4
. This time, @match_a
has an entry but @match_b
does not. Finally
we try o = 5
and see that both match sets contain an entry: @match_a[5] = 3
and @match_b[5] = 5
. So, we return [5, 3, 5]
in this instance and
generate_chunks
calls emit_chunk(5, 3, 5)
.
The methods for emitting chunks are what construct the output of the merge
algorithm, and keep the line offsets up to date, moving them to the end of the
chunk we just emitted in each document. emit_chunk
creates a chunk from the
current line offsets up to a given index in each document, and then sets the
line offsets to the end of these ranges. We subtract one from all the offsets
we’re given because the diffs and the finder methods use 1-indexed line numbers,
but our @line
variables are 0-indexed offsets from the start of each document.
emit_final_chunk
just emits whatever is left from all the documents from their
current line offsets onwards.
def emit_chunk(o, a, b)
write_chunk(
@line_o ... o - 1,
@line_a ... a - 1,
@line_b ... b - 1)
@line_o, @line_a, @line_b = o - 1, a - 1, b - 1
end
def emit_final_chunk
write_chunk(
@line_o .. -1,
@line_a .. -1,
@line_b .. -1)
end
Finally, we reach the end of our chain of methods with write_chunk
. This takes
a set of lines from each document and emits the appropriate kind of chunk
depending on their contents. If all three sets are equal, then we emit a Clean
chunk object containing the original version; if Alice’s chunk is equal to the
original then we emit Bob’s and vice versa; and if neither is equal then we emit
a Conflict
chunk containing all three versions.
def write_chunk(o_range, a_range, b_range)
o = @o.lines[o_range].map(&:text)
a = @a.lines[a_range].map(&:text)
b = @b.lines[b_range].map(&:text)
if o == a and o == b
@chunks << Clean.new(o)
elsif o == a
@chunks << Clean.new(b)
elsif o == b
@chunks << Clean.new(a)
else
@chunks << Conflict.new(o, a, b)
end
end
Once generate_chunks
finishes looping, the merge
method returns
Merge.new(@o, @a, @b, @chunks)
, creating a data structure representing the
result of the whole process. That data structure comes with a to_s
method for
turning the whole thing into a string that can be written back to a file, but
keeping it as a structure lets us inspect it, particularly to check whether the
merge was clean or not. While the to_s
methods assume the original inputs were
a series of lines of text, exposing the merge result as a data structure allows
us to use this algorithm with other types of input.
We can also use these structures to generate the output text with added data that isn’t necessary for the merge algorithm itself, for example the file or commit names of the two versions being merged.
Clean = Struct.new(:content) do
def to_s(*)
content.join("")
end
end
Conflict = Struct.new(:o, :a, :b) do
def to_s(a_name = "", b_name = "")
text = ""
separator(text, "<", a_name)
a.each { |line| text << line }
separator(text, "=")
b.each { |line| text << line }
separator(text, ">", b_name)
text
end
def separator(text, char, name = "")
text << char * 7 + " " + name + "\n"
end
end
Merge = Struct.new(:o, :a, :b, :chunks) do
def clean?
chunks.grep(Conflict).empty?
end
def to_s
chunks.map { |chunk| chunk.to_s(a.name, b.name) }.join("")
end
end
We now have a complete implementation that we can use to merge data from different sources, and print out the result:
o = Merge.file("original.txt")
a = Merge.file("alice.txt")
b = Merge.file("bob.txt")
merged = Diff3.merge(o, a, b)
puts merged
# => celery
# <<<<<<< alice.txt
# salmon
# =======
# salmon
# garlic
# onions
# >>>>>>> bob.txt
# tomatoes
# garlic
# onions
# wine
While the diff3
algorithm is relatively simple, it is highly sensitive to the
output of the underlying diff algorithm. The diffs calculated should not produce
different outcomes when the merge is clean, but they can lead to changes in the
conflicts you get, in some cases leading to conflicts that are deeply misleading
and surprising to the user. We’ll look at how this happens and what can be done
about it in future articles.