For a long time, we've had an unsolved mystery: every once in a great while, some combination of permissions, participants, and roles will just vanish from somebody's site. I always chalked it up to something I did wrong in our AuthzGroupProvider, but now that our traffic is way up, the problem is worse and we have much better information about what's going on.
Jeff started keeping a log of every query to MySQL. When the problem reared its ugly head again, we started scanning the logs for anything that does a delete from any of the SAKAI_REALM tables. It was easy to find: DbAuthzGroupService.save(AuthzGroupEdit edit) deletes every participant, role and permission for a given site before building them all back up again from scratch.
We think the problem is insufficient isolation. That is, one thread gets a dirty read of the SAKAI_REALM tables while the save() operation in another thread has already deleted the old rows but is still in the middle of reconstituting the data.
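To make the suspected race concrete, here is a minimal sketch of a delete-then-rebuild save being read halfway through. It is not Sakai code: it uses Python with SQLite as a stand-in for MySQL, and the realm_grant table, the save() helper, and the observe callback are all hypothetical names made up for illustration. The only point is to show what a second connection can see when the delete and the re-insert are not wrapped in one transaction, compared with when they are.

```python
import os
import sqlite3
import tempfile

# Hypothetical stand-in for the SAKAI_REALM tables: one row per (site, user, role).
path = os.path.join(tempfile.mkdtemp(), "realm_demo.db")

def connect():
    # isolation_level=None puts sqlite3 in autocommit mode, so a transaction
    # exists only where the code issues BEGIN/COMMIT explicitly.
    return sqlite3.connect(path, isolation_level=None)

writer = connect()  # plays the thread running save()
reader = connect()  # plays another thread checking permissions

writer.execute("CREATE TABLE realm_grant (site TEXT, user TEXT, role TEXT)")
GRANTS = [("site1", "alice", "maintain"), ("site1", "bob", "access")]
writer.executemany("INSERT INTO realm_grant VALUES (?, ?, ?)", GRANTS)

def count_grants(conn):
    # fetchall() exhausts the cursor so the reader's lock is released promptly.
    rows = conn.execute(
        "SELECT COUNT(*) FROM realm_grant WHERE site = 'site1'"
    ).fetchall()
    return rows[0][0]

def save(conn, in_transaction, observe):
    """Delete every grant for the site, then rebuild it from scratch.

    observe() is called between the delete and the re-insert to simulate
    another thread reading while the save is half done.
    """
    if in_transaction:
        conn.execute("BEGIN")
    conn.execute("DELETE FROM realm_grant WHERE site = 'site1'")
    seen = observe()
    conn.executemany("INSERT INTO realm_grant VALUES (?, ?, ?)", GRANTS)
    if in_transaction:
        conn.execute("COMMIT")
    return seen

seen_without_txn = save(writer, False, lambda: count_grants(reader))
seen_with_txn = save(writer, True, lambda: count_grants(reader))
print(seen_without_txn, seen_with_txn)
```

In this sketch the reader finds no grants at all when the save is not transactional, and still finds both original grants when it is. How closely that maps onto our MySQL setup depends on the storage engine and on whether save() actually runs inside a single transaction, which we haven't confirmed yet.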
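This problem is exacerbated in MySQL by the fact that all the delete statements use a sub-select, something that performs very poorly on MySQL. The slower the deletes run, the longer the window in which another thread can see the half-rebuilt data.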
I'm betting the reason the institutions using Oracle don't have a problem is that a) the isolation is right, and b) the queries run really fast.
I'll have more on this after we've played around with a few exploratory scenarios.