Quick binary diffs with XDelta

Last month I decided to revive my long-neglected MP3 collection after relying on Spotify for many years. This meant using Apple Music for the first time since it was called iTunes, and in the course of ripping a few hundred CDs I found a great many bugs. One particularly fun one was that sometimes, the application will just stop accepting new artwork; you will drag an image into the file info dialog as usual but the window then displays a placeholder image instead of the one you just tried to save. The only way I have found to stop this happening is to completely restart the computer.

In the course of investigating the bug, I wondered whether the app might be genuinely adding the artwork to the MP3 files, but failing to display the artwork in the UI. After all, it has several other bugs where the file metadata is correct but the UI displays them incorrectly, for example splitting one album into multiple objects in the library browser, or displaying tracks out of numerical order.

There’s another reason I wanted to check if the artwork was actually being saved into the files. Apple Music has an option to fetch album art for you, but aside from the fact that it’s very bad at finding the correct artwork, it doesn’t actually burn it into the file, it caches it somewhere else on disk. So, if you move the files to another device you lose the artwork. You need to burn the art into the MP3 files if you want to reliably keep it.

Now I don’t know anything about the MP3 file format so I wasn’t sure how to check whether a given file actually contains the expected image data. I assumed that reading the entire JPG file into a string and looking for that string inside the MP3 might not work, because it might not keep a literal copy of the entire image file – it might discard some headers, for example.

My work on Building Git taught me about a very useful compression algorithm called XDelta, which Git uses to compress objects it stores in packfiles and sends over the network. It works by building an index from a source string, and then compresses a target string to produce a list of instructions for rebuilding the target from the source.

index = Pack::XDelta.create_index('the quick brown fox jumps over the lazy dog')

index.compress('the quick brown fox writes ruby')
# => [
#      #<struct Pack::Delta::Copy offset=0, size=20>,
#      #<struct Pack::Delta::Insert data="writes ruby">
#    ]

Here we can recover our target string by copying the first 20 bytes of the source string, and then appending the new text "writes ruby". When serialized, the Copy instructions are encoded as just a few bytes each, so this massively shrinks the size of files that have long strings of bytes in common.

XDelta isn’t a diff algorithm per se, but it is very good at identifying parts of strings that are the same. It’s also much faster than typical diff algorithms when comparing files at the level of individual bytes, rather than line-by-line, so it works great for large binary files.

I wondered whether this could help me determine whether an MP3 file contained the intended image data. I took two copies of the file, one before I tried adding artwork to it, and one after. I also stored a copy of the artwork I was trying to import.

cover = File.read('cover.jpg')
no_art = File.read('no-art.mp3')
with_art = File.read('with-art.mp3')

Now, if we build an index using the “after” file, which we expect to contain the artwork, we can use it to compress both the cover image and the “before” file, showing us which parts of those files are the same as the final MP3.

index = Pack::XDelta.create_index(with_art)

compressed_cover = index.compress(cover)
pp compressed_cover

compressed_audio = index.compress(no_art)
pp compressed_audio

For the cover image, this prints a single Copy instruction, showing that we can reconstruct cover.jpg by just copying one big range of bytes out of the final MP3 file:

[#<struct Pack::Delta::Copy offset=551, size=361841>]

This is surprisingly simple – the MP3 just contains a verbatim copy of the entire cover image file. When I tested files after Apple Music stopped accepting new artwork, the output would look more like this:

[#<struct Pack::Delta::Insert
  data=
   "\xFF\xD8\xFF\xE0\x00\x10JFIF\x00\x01\x01\x01\x00H\x00H\x00\x00\xFF\xDB...">,
 #<struct Pack::Delta::Insert
  data=
   "\x1E\x1E\x1E\x1E\x1E\x1E\x1E\x1E\x1E\x1E\x1E\x1E\x1E\x1E\x1E\x1E\x1E\x1E...">,
 ...
]

Seeing a lot of Insert instructions means the target file contains a lot of data that’s not present in the source file, so this quickly told me that the application was genuinely not saving the artwork, as opposed to saving it but not displaying it.

It’s also worth having a look at the compressed audio – the compression of the file without art based on the one with art. This contains a few more instructions but shows that most of the data in the “before” file is present verbatim in the “after” file. The only novel data is the ID3 header at the very beginning of the file.

[#<struct Pack::Delta::Insert data="ID3\x02\x00\x00\x00\x00\x11\"">,
 #<struct Pack::Delta::Copy offset=10, size=529>,
 #<struct Pack::Delta::Copy offset=362400, size=1681>,
 #<struct Pack::Delta::Copy offset=372632, size=15576294>]

We can check a few more things by printing the sizes of all the files:

p \
  cover: cover.bytesize,
  no_art: no_art.bytesize,
  with_art: with_art.bytesize

This gives us:

{:cover=>361841, :no_art=>15578514, :with_art=>15948926}

Notice that the filesize of with-art.mp3 is 15,948,926 bytes, which equals the value you get by adding the offset and size values from the final Copy instruction: 372,632 + 15,576,294 = 15,948,926. This tells us that the final 15,576,294 bytes of both files – the majority of their content – are identical; the compression instructions read right up to the end of the source file. Also, this song is 6:29 long, and encoded at 320kbps, so we’d expect the audio data to be around 15,560,000 bytes in length, so it looks very likely that this long range of identical data is the audio.

Finally let’s convert the offset/size values from the Copy instructions into start/end addresses hex:

compressed_audio.grep(Pack::Delta::Copy).each do |copy|
  p \
    start: copy.offset.to_s(16),
    end: (copy.offset + copy.size - 1).to_s(16)
end

This prints the following:

{:start=>"a", :end=>"21a"}
{:start=>"587a0", :end=>"58e30"}
{:start=>"5af98", :end=>"f35c7d"}

If we look at with-art.mp3 using hexdump we discover that the first slice from a to 21a contains the song’s metadata. The other slices begin at 587a0 and 5af98, and hexdump displays the following around those addresses:

00058790  a6 ab e2 33 b7 93 ff d9  00 00 00 00 00 00 00 00  |...3............|
000587a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
0005af90  00 00 00 00 00 00 00 00  ff fb e2 00 00 00 07 44  |...............D|
0005afa0  90 3a 81 28 63 70 d7 31  e7 60 18 cc 6e 5c 3a 40  |.:.(cp.1.`..n\:@|

So the first slice is just copying a long range of null bytes out of the source file, not particularly interesting. The second slice is the audio data, and hexdump shows that this address is where non-null data starts showing up again in the source file.

This is just a quick example of how you can use XDelta to compare files when you don’t know much about their internal structure. By revealing similarities and showing you which portions of a file are the same between different versions, you can learn quite a bit about how the files are structured and spot inconsistencies easily. You can also implicitly check which parts of the source file are not copied into the target, and I have previously used this to diagnose a bug where a database was truncating stored files. It’s a very useful tool to have on hand when dealing with binary data.

Building Git recently passed 2,000 copies sold and is on sale for 50% off until the end of January. Use voucher code TWENTYFOUR at checkout to redeem the discount.