In 2017, I wrote a series of articles about diff and merge algorithms, and promised that one day they would form part of a book about the Git version control system. In 2018 I didn’t write anything at all on this blog, and that’s because I was writing that book.
Just over a month ago, Building Git was published, and so far has sold over 450 copies. I’ve done a couple of interviews about it (The Yak Shave, Tech Done Right), and one question you get asked a lot if you write a book is: why did you write it? So in this article I want to expand a little on what the book’s about, why I wrote it, and who I think it’s for.
The quick answer to why I wrote it is, well, for the hell of it. For the same reason I wrote those diff articles, I was interested in digging in and learning as much as I could, out of sheer curiosity, and I hope as much comes across to readers. It was always supposed to be a side project alongside my day job, not something I’d be working on full time. But that’s not a very useful answer, and it’s worth reflecting on why I was interested in doing it and what motivated me to finish it.
First, there’s the surface reason: Git is notoriously confusing to new users, and they are frequently instructed to learn how it works inside in order to understand the user interface. While this is true, it’s a slightly unsatisfying answer: this is really a design problem in Git rather than a shortcoming in its users. There are already numerous excellent write-ups of how Git works at various levels of abstraction: Pro Git gives a good account, Gitlet is an accessible and concise demonstration of these concepts in code, and many other third-party implementations exist. Why write another one?
That brings us to the second-level motivation I had: writing an implementation of a complex program like Git exposes you to a very broad range of computer science topics, from abstract mathematical ideas like models of concurrent editing, to the details of the Unix filesystem API. Most technical books are very narrowly focussed and designed to be easily keyword-marketed: you want to learn this year’s hot framework, you can quickly find half a dozen books on it, you pick one, work through it, put it on your CV, rinse, repeat. There’s nothing wrong with this per se: people need to learn things for their jobs and the tech book market does a reasonable job of serving this need.
But there’s a kind of learning that this model does not lend itself to, and that’s seeing how all this disparate stuff fits together. Most books necessarily draw on other topics in order to build useful programs, but for reasons of scope they must assume the reader already knows these incidental topics. If you don’t know them, it can be unclear where to go find out more.
If you take a degree in a subject, someone has developed a curriculum to guide you through the field and integrate the various topics you learn along the way. As a self-directed learner, it’s much harder to find that story among the mountain of targeted books available, and so I wanted to write something that was more of an extended project. Not something you approach with a checklist of things to learn, but something that takes you on a journey through a lot of topics you didn’t even know existed, and shows you how they fit together into a big picture. Something that goes through the process of building a reasonably large program rather than showing you some toy examples and leaving the rest up to your imagination. I wanted to bridge the gap between the scope of typical educational examples, and the sort of system people work on in production.
A secondary effect of the book’s broad scope is how it changes the narrative. Most book examples are of a size where you can show the reader the entire thing at once, and then explain how it works. What this leaves out is the process of getting to the end design, and this is a topic that developers tend to struggle with a great deal, especially in legacy systems and refactoring. They know the end state they want a system to arrive at, but they find it hard to make the journey there incrementally and to be comfortable with the idea of deploying it gradually. That leads to the One Big Rewrite model, where you hack for six months and then do an incredibly risky roll-out.
Jit, the codebase that Building Git describes, is about 6,000 lines of Ruby code. I believe it’s impossible to describe such a codebase in a linear fashion where you show the end state of the project and attempt to explain it. A lot of it only makes sense if you go on the journey to get there, building up each piece of functionality in small increments, and refactoring when necessary. Looking only at the current state of a codebase leaves out a lot of information that you can only get from its history, and that’s why version control logs are so valuable. As well as the content in the book, Jit’s commit logs contain over 30,000 words of text, including a lot of things I couldn’t fit into the book – think of it as the extended footnotes. For example, in one commit I use some type theory to derive an abstraction that unifies two apparently different structures, so that a single class can work with both of them. I find it hard to explain why such abstractions exist without going through the process that led to them, and I think there’s a gap in technical literature to explore this outside of books specifically dedicated to the process of changing code.
Finally, there’s the big picture stuff. Why did I choose Ruby? Well, you have to choose something, and I know Ruby, and that’s about all the justification I can offer. It’s not that I didn’t evaluate other languages, but for me Ruby led to the least amount of incidental complexity in the early material in terms of installation, project structure, build tooling and so on, and it has a rich enough standard library that you don’t need any third-party code to do this project. I wanted the book to be able to cross language barriers and not be a Ruby book, and so I’m really glad people are following it in other languages. So far I know of people doing it in C++, Clojure, Elixir, Go, Haskell, Java, Node.js, Rust, and Swift. Having said that, I do believe Ruby lowered the language barrier more than I personally could have done in other languages, and doing this project reminded me of why I like it so much. That doesn’t mean it’s The Best Language; it was just the best language for me, for this project, for now.
But the choice of Ruby is also significant for cultural reasons. I’ve spent most of my career so far in web development, and Ruby is primarily known for its role in that space. It’s dismissed as a language you cannot write programs like Git in – it’s too slow, it’s too “high level” – and web developers are dismissed as not being “real programmers” by other parts of the tech ecosystem. What is a “real programmer”? I don’t know, I just know that people who work in my sector are often told they aren’t one. Real programmers know C. Real programmers work in systems programming. Real programmers do open source. Real programmers can tell you every member of every good data structure from 1962 to 1978. But they don’t write web apps, oh no.
There’s a huge section of the tech ecosystem that’s constantly told they’re not smart enough to be here and that their work doesn’t matter. I spent a decade hearing C was beyond mere mortals, that you must be a genius to go anywhere near low-level code, or algorithms, or distributed systems. The inventor of Git is notorious for pushing this narrative! But the truth is, anyone with enough brains and patience to learn how to do any kind of computing is “smart enough” to learn things like this. The thing that makes any kind of programming hard is gigantic functions that do seven different tangentially related things and hide important concepts so that their file formats end up with half a dozen different ways to encode an integer. It’s programs like that where you can’t actually see the system design in the code; if your codebase looks like that it’s going to be difficult in any language.
So, I wrote this book for the not-real programmers, the people told they’re not hardcore enough, that programs like Git are written by brain geniuses and that mere mortals cannot understand them. I was inspired by Gary Bernhardt’s From Scratch videos, and by Julia Evans’s zines, which demystify everyday software tools. I want you to feel like there’s nothing about computers you can’t ultimately figure out, and it has been so rewarding to see people publishing their first commits with their own Git clones, full of excitement at the world they’ve just created.