How to fix bugs in software the hard way, or, why open source software is so damn helpful

If nothing else, this week has taught me a few things about bug fixing. As I’ve written about before, my IncludeByDefault plugin (or, more accurately, the project I’m using it for) exposed a bug or two in Rails. Revision 17 is the result of a very messy process trying to chase a bug up and down the bowels of ActiveRecord. To make sure I never do this again, I’ll let you know how I arrived at such a tangled messy situation.

IncludeByDefault works its mind-numbingly simple magic by intercepting ActiveRecord::Base.find_every using alias_method_chain. My feeling is that, as far as possible, your methods should not care about the innards of the methods they’re replacing, that way the original method’s author can alter it and you still get that change in your code. That’s why I like to intercept rather than overload.

So then I’m coding away with my new plugin, when suddenly I decide I want to do this:

Country.find(id).news_stories.find(:all, :include => :countries)

which is what we call a many-to-many scoped find with cascaded eager loading. Rails (at time of writing) will choke on this and raise ActiveRecord::StatementInvalid, because the SQL it generates contains duplicate table aliases. So, I figured that, while I’m intercepting find_every, I’ll catch that exception and try to do something about it. This is the crux of why solving the bug was so tricky: I started solving it in completely the wrong place.

find_every is quite a bit higher up the call stack than the method that first throws the exception (ActiveRecord::Associations::ClassMethods.select_limited_ids_list). Even still, your options for dealing it with are fairly similar whichever one you intercept – the arguments you have at your disposal are basically the :options hash from find. You could try and do something with the JoinDependency class at this stage as that’s where the errant JOIN statements come from, but that led me round in circles.

I decided that what I’d do was convert any troublesome :include options to :joins and write out tons of SQL by hand to rename the tables used. That’s basically fine, except for two things: for one, Rails won’t let you use :joins in this context, so I had to hack that in by completely overwriting ActiveRecord::Base.add_joins!. Second, removing anything from the :include array means that JoinDependency won’t write out column aliases for the tables you want to include, so now you need an extra method to write out all your column aliases. If you don’t do this, the data simply won’t be pulled from the database. Cue a wrapper for column_aliases and a new instance variable to pass around information about your new :joins.

The next problem is that the method using_limitable_reflections? will forget that you were not using limitable reflections, because you’ve removed all HABTM associations from the :include array. Cue another wrapper for that to get it to remember what you were doing before you generated loads of bad SQL. If you don’t do this and you have duplicate many-to-many links in the DB, you’ll get back result sets that are too small because uniqueness is enforced in Ruby rather than SQL. (I’ve posted this on Rails Trac.)

So you’ve written out your JOIN fragments and all your column aliases and managed to get the uniqueness condition to load properly, so let’s send your SQL off to the database. One small problem: JoinDependency has no idea about your JOIN statements and won’t load the data from them into objects, thus completely negating the point of eager loading anything. Several times during this fixing process, it dropped the associations on the floor entirely so I coudn’t even retrieve them, let along eager load anything.

The solution to this is, unfortunately, another method overwrite, this time of ActiveRecord::Associations::ClassMethods.find_with_associations. Thing is, by this stage, you’ve chased the bug round so many different methods by intercepting that you’re convinced there’s no other path to fixing it, so another overwrite will have to do. This one passes information about your join tables to JoinDependency#instantiate so it can do something useful with them and load them back in as if they’d come from an :include, thus completing the eager loading process.

So I got it all working, and thought: why don’t I try and fix the framework itself rather than trying to intercept all over the place? And that’s exactly what I did. Once you’ve decided you’re going to modify a system rather than intercept its workings, you don’t have any issues with overwriting anything. If you make a change and the unit tests pass, all is good. The irony is that fixing this in the framework itself only required me to modify one existing method: add_joins!, which my so-called workaround did anyway. The fix works by extracting the table names from every join you try to add, and re-aliasing them if it spots any duplicates. You can find the patch in this patch, and in an up-to-date copy of IncludeByDefault.

Let this be a lesson to you: if you need to fix a bug, make sure you start from the most sensible place: the code that directly introduces the bug into your system. Trying to work around it will only give you headaches. This of course assumes you can modify your system – if Rails weren’t open source, this would literally have been impossible to fix.