How to fix bugs in software the hard way, or, why open source software is so damn helpful

If nothing else, this week has taught me a few things about bug fixing. As I’ve written about before, my IncludeByDefault plugin (or, more accurately, the project I’m using it for) exposed a bug or two in Rails. Revision 17 is the result of a very messy process trying to chase a bug up and down the bowels of ActiveRecord. To make sure I never do this again, I’ll let you know how I arrived at such a tangled messy situation.

IncludeByDefault works its mind-numbingly simple magic by intercepting ActiveRecord::Base.find_every using alias_method_chain. My feeling is that, as far as possible, your methods should not care about the innards of the methods they’re replacing, that way the original method’s author can alter it and you still get that change in your code. That’s why I like to intercept rather than overload.

So then I’m coding away with my new plugin, when suddenly I decide I want to do this:

Country.find(id).news_stories.find(:all, :include => :countries)

which is what we call a many-to-many scoped find with cascaded eager loading. Rails (at time of writing) will choke on this and raise ActiveRecord::StatementInvalid, because the SQL it generates contains duplicate table aliases. So, I figured that, while I’m intercepting find_every, I’ll catch that exception and try to do something about it. This is the crux of why solving the bug was so tricky: I started solving it in completely the wrong place.

find_every is quite a bit higher up the call stack than the method that first throws the exception (ActiveRecord::Associations::ClassMethods.select_limited_ids_list). Even still, your options for dealing it with are fairly similar whichever one you intercept – the arguments you have at your disposal are basically the :options hash from find. You could try and do something with the JoinDependency class at this stage as that’s where the errant JOIN statements come from, but that led me round in circles.

I decided that what I’d do was convert any troublesome :include options to :joins and write out tons of SQL by hand to rename the tables used. That’s basically fine, except for two things: for one, Rails won’t let you use :joins in this context, so I had to hack that in by completely overwriting ActiveRecord::Base.add_joins!. Second, removing anything from the :include array means that JoinDependency won’t write out column aliases for the tables you want to include, so now you need an extra method to write out all your column aliases. If you don’t do this, the data simply won’t be pulled from the database. Cue a wrapper for column_aliases and a new instance variable to pass around information about your new :joins.

The next problem is that the method using_limitable_reflections? will forget that you were not using limitable reflections, because you’ve removed all HABTM associations from the :include array. Cue another wrapper for that to get it to remember what you were doing before you generated loads of bad SQL. If you don’t do this and you have duplicate many-to-many links in the DB, you’ll get back result sets that are too small because uniqueness is enforced in Ruby rather than SQL. (I’ve posted this on Rails Trac.)

So you’ve written out your JOIN fragments and all your column aliases and managed to get the uniqueness condition to load properly, so let’s send your SQL off to the database. One small problem: JoinDependency has no idea about your JOIN statements and won’t load the data from them into objects, thus completely negating the point of eager loading anything. Several times during this fixing process, it dropped the associations on the floor entirely so I coudn’t even retrieve them, let along eager load anything.

The solution to this is, unfortunately, another method overwrite, this time of ActiveRecord::Associations::ClassMethods.find_with_associations. Thing is, by this stage, you’ve chased the bug round so many different methods by intercepting that you’re convinced there’s no other path to fixing it, so another overwrite will have to do. This one passes information about your join tables to JoinDependency#instantiate so it can do something useful with them and load them back in as if they’d come from an :include, thus completing the eager loading process.

So I got it all working, and thought: why don’t I try and fix the framework itself rather than trying to intercept all over the place? And that’s exactly what I did. Once you’ve decided you’re going to modify a system rather than intercept its workings, you don’t have any issues with overwriting anything. If you make a change and the unit tests pass, all is good. The irony is that fixing this in the framework itself only required me to modify one existing method: add_joins!, which my so-called workaround did anyway. The fix works by extracting the table names from every join you try to add, and re-aliasing them if it spots any duplicates. You can find the patch in this patch, and in an up-to-date copy of IncludeByDefault.

Let this be a lesson to you: if you need to fix a bug, make sure you start from the most sensible place: the code that directly introduces the bug into your system. Trying to work around it will only give you headaches. This of course assumes you can modify your system – if Rails weren’t open source, this would literally have been impossible to fix.

IncludeByDefault progress

IncludeByDefault, as mentioned in my last post, hit some snags with ActiveRecord generating duplicate table aliases when doing cascaded includes, e.g.

Tag.find(8).posts.find(:all, :include => :tags)

So, I set out to work around it, only to run into further problems. I went with option C: let find operations get all the way to the database, and then catch StatementInvalid if it is thrown and convert problematic :includes to :joins fragments by hand.

But, turns out Rails won’t let you use the :joins option on finds scoped on HABTM associations (like the one above), meaning I had to hack support for that in order to get my workaround for the duplicate naming bug to work. Even then, eager loading refuses to work properly – asking for the tags of any of the returned posts will go and fetch them from the database, even though the find operation included JOIN fragments to load the tags.

The other complication is that, if you have duplicate many-to-many links in your database, Rails returns incorrect result sets if you use :limit and :include, as long as the :include contains no has_many or has_and_belongs_to_many associations. Which means that converting troublesome :includes (those that raise StatementInvalid in the example up top, and are to-many associations) to :joins triggers this bug. I had to hack in a way for ActiveRecord to remember to select unique records properly, but this still only works as long as you :include suitable associations to begin with.

I have been round and round in circles intercepting various bits of the ActiveRecord call stack trying to get rid of this, but it’s whack-a-mole: fixing the eager loading thing sets off the unique records problem, or makes the call stack retry the database calls too many times, or drops the association on the floor so you can’t even read it never mind eager load it. I’ve got the plugin to a state where:

  • If your :include doesn’t cause any SQL problems, it runs just fine.
  • If it raises StatementInvalid, we’ll rewrite it. The association causing the problem probably won’t get eager loaded, but your query will at least run.
  • You will avoid the duplicate links problem as long as your :include includes a to-many association.

All these apply whether the :include is added using include_by_default or explicitly in your find call – the plugin will catch the exceptions either way. If anyone knows a way to get all this to work without actually including large parts of rewritten Rails (so far there’s one actual method overwrite – the rest are wrappers that catch exceptions) then I’d love to hear about it.

Including in circles

Not long had I been using my new plugin when I discovered it made this happen when trying to eager-load on a many-to-many association:

SELECT DISTINCT news_stories.id
FROM news_stories
LEFT OUTER JOIN countries_news_stories
    ON countries_news_stories.news_id = news_stories.id
LEFT OUTER JOIN countries
    ON countries.id = countries_news_stories.country_id
INNER JOIN countries_news_stories
    ON news_stories.id = countries_news_stories.news_id
WHERE (countries_news_stories.country_id = 30)

which is what happens if you ask for

Country.find(30).news_stories.find(:all, :include => :countries)

The SELECT DISTINCT and INNER JOIN ... WHERE parts are what fetch the news stories for country #30. The LEFT OUTER JOIN parts are the eager loading of all the countries for each news story. Isn’t SQL fun.

This seems like a perfectly reasonable thing to want to do: I know each news story is only linked to a handful of countries at most, so eager loading them isn’t going to grind the database to a halt. The problem is that Rails constructs an ambiguous statement – I’m joining the countries_news_stories table, with the same alias, twice. MySQL doesn’t know what to do with that, and nor should it. That statement will create duplicate column names in the result set, which is why we’re not allowed to run it without aliasing the table. That is, giving it a different name each time we use it by saying JOIN countries_news_stories AS table_1 or some such. This may have something to do with this Rails bug, though I can’t be sure until I’ve written some tests.

Assuming I’m not going to see a copy of Rails with the bugs fixed for some time, and I want to develop against a stable version, what can I do? As far as I can tell, my options are as follows.

A. Leave my plugin as it is, and if people want to do silly things like eager load anything but a belongs_to association, then it’s up to them to deal with it. Generally, eager loading has_(and_belongs_to_)many associations is risky because you don’t know how big a list of records you’ll be trying to fetch and it could bog down the database. I don’t really like this option, as I’d quite like to eager-load in this particular instance.

B. Before any find operation hits the database, rewrite any HABTM :includes as :joins statements with numeric table aliases. This has the disadvantage that, because you’re screwing around with table names, the result set columns won’t get mapped to the methods on the returned objects and you’ll end up re-fetching the data from the database anyway. That’s what the Rails docs say, but I’ve found that doing this does cut down database time in my case. I have a page of news articles for a particular country, each of which has to display a list of all the countries it’s linked to (see the query up top). Eager loading the countries for each story cuts down on database queries but doesn’t get rid of all of them. The other downside is that it’s a terrible hack involving hand-writing a mass of SQL and inserting table aliases as required.

C. Let finds hit the database without doing anything, then catch the StatementInvalid exception thrown if an invalid statement is generated and try to rewrite the query using plan B. I assmue it’s impossible to test for invalid statements before they hit the database. The whole point is that the statement in question is ambiguous and cannot be parsed properly by the database. I don’t imagine that trying to parse it in Ruby is going to be any more successful. This option would have the upside that if no exceptions are thrown, the result set is cached properly and your method calls won’t generate extra database hits. The downside is that handling ambiguous statements now involves two trips to the database. I have no idea whether the pros outweight the cons with this technique.

I’m going to write some unit tests for Rails to see if this is a bug in the framework rather than a peculiarity of my application, then I’ll try out plan C on my project tomorrow. I’ll let you know how it went after I’ve made the necessary changes to IncludeByDefault.

New Rails plugin: IncludeByDefault

It’s true, I’m a plugin writing machine. Seriously though, this one’s tiny. I took all of five minutes to write. What it does is, it lets you specify a default value for the :include option on ActiveRecord::Base.find, so you can automate eager loading of associations. I’ll use an example I’m comfortable with:

  class BlogEntry < < ActiveRecord::Base
    has_many :photos
    has_many :comments
    include_by_default :photos, :comments
  end

With that in place, the photos and comments will automatically be loaded in the same database query as the entry you ask for. You can still use the :include option with find – it will override the defaults you specify. Install as usual:

script/plugin install
    git://github.com/jcoglan/include_by_default