Saturday, September 02, 2006

 

Chasing a Race Condition

For a long time, we've had an unsolved mystery: every once in a great while, some combination of permissions, participants, and roles just vanishes from somebody's site. I always chalked it up to something I did wrong in our AuthzGroupProvider, but now that our traffic is way up, the problem happens more often and we have much better information about what's going on.

Jeff started keeping a log of every query to MySQL. When the problem reared its ugly head again, we scanned the logs for anything that deletes from any of the SAKAI_REALM tables. It was easy to find: DbAuthzGroupService.save(AuthzGroupEdit edit) deletes every participant, role, and permission for a given site before building them all back up again from scratch.

We think the problem is insufficient isolation. That is, one thread gets a dirty read of the SAKAI_REALM tables while the save() operation in another thread is still in the middle of reconstituting the data. The problem is exacerbated on MySQL because all the delete statements use a sub-select, something that performs very poorly there, so the window in which the data is missing stays open longer. I'm betting the reason the institutions using Oracle don't have a problem is that a) the isolation is right, and b) the queries run really fast.

I'll have more on this after we've played around with a few exploratory scenarios.
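To make the suspected interleaving concrete, here is a minimal single-threaded sketch in Java. It is not Sakai code: the class RealmRaceSketch, its methods, and the grant strings are all made up for illustration, and an in-memory map stands in for the SAKAI_REALM tables. The midSave callback plays the part of a second thread that happens to read in the middle of a save.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for the SAKAI_REALM tables (illustration only).
public class RealmRaceSketch {
    private final Map<String, List<String>> grantsBySite = new ConcurrentHashMap<>();

    // What a concurrent reader would see at this instant.
    public List<String> read(String site) {
        return new ArrayList<>(grantsBySite.getOrDefault(site, Collections.emptyList()));
    }

    // Delete-then-rebuild with no isolation: each step is visible as soon as it happens.
    // midSave simulates another thread running between the delete and the re-inserts.
    public void saveDeleteThenRebuild(String site, List<String> grants, Runnable midSave) {
        grantsBySite.put(site, new ArrayList<>());  // like the DELETE statements
        midSave.run();                              // reader sneaks in here
        grantsBySite.get(site).addAll(grants);      // like the re-INSERTs
    }

    // Build the new state off to the side and publish it in one step,
    // which is roughly what a committed transaction would give a reader.
    public void saveAtomically(String site, List<String> grants, Runnable midSave) {
        List<String> rebuilt = new ArrayList<>(grants);
        midSave.run();                              // reader still sees the old state
        grantsBySite.put(site, rebuilt);            // single visible change
    }

    public static void main(String[] args) {
        RealmRaceSketch store = new RealmRaceSketch();
        List<String> grants = Arrays.asList("maintain:site.upd", "access:content.read");
        store.saveAtomically("site-1", grants, () -> {});  // initial state

        List<String> seenUnsafe = new ArrayList<>();
        store.saveDeleteThenRebuild("site-1", grants,
                () -> seenUnsafe.addAll(store.read("site-1")));
        System.out.println("mid-save read, delete-then-rebuild: " + seenUnsafe);

        List<String> seenSafe = new ArrayList<>();
        store.saveAtomically("site-1", grants,
                () -> seenSafe.addAll(store.read("site-1")));
        System.out.println("mid-save read, atomic publish: " + seenSafe);
    }
}
```

In the first case the simulated reader sees an empty list, which is how permissions would appear to vanish; in the second it sees the previous grants. Whether the real fix is a transaction, a different isolation level, or something else is exactly what we still have to find out.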

Comments:
Zach,

Any news on this? Or a jira?
 
So glad you asked. It's SAK-6224.

Our current status is that we're planning a way to convert our tables to InnoDB before we play around with isolation levels. We think we can do it with a minimum of disruption, but we'll have to wait and see.
 
Ah! Interesting, because we are at the very same point (converting all our tables to InnoDB)!
 