Chatty bots and minimizing disruptions in continuous integration

Tuesday, December 7th, 2010

Those who use IRC are probably familiar with bots. Essentially, a bot is a client which is not a human. This time I’m talking about a specific kind of bot; let’s call them reporting bots. Their purpose is to alert the channel about recent happenings in (near) real time. Open source project channels usually have at least one bot that reports every new commit and every bug report filed.

The channel #mediawiki-i18n has reporting bots too. We have one CIA bot reporting every i18n-related commit to any of our supported projects. I have to mention that the ability to have our own ruleset for picking and formatting commits is just awesome. There is also another bot, rakkaus (“love” in Finnish).

Its purpose is to report issues with the site. To accomplish this we pipe the output of error_log, which contains PHP warnings, database errors and MediaWiki exceptions, to the bot. It worked mostly fine, except that the bot would flood everyone when the log was growing fast. A few days ago it went too far. We had a database error (a deadlock), which was reported by the bot… including the database query… which happened to contain a few hundred kilobytes of serialized and compressed data, in other words binary garbage. Guess how happy we were when we saw the channel full of that?

Okay, something had to be done, and so I did it. I wrote a short PHP script which:

  • Reads new data every 10 seconds
  • Takes the last line, truncates it to a suitable length, forwards only that snippet, and reports how many lines of the log were skipped

And now everything is nice again :) The script is not yet in SVN, but I will commit it later.
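
To illustrate the two steps above, here is a minimal sketch in Python. It is not the actual PHP script, just one way the idea could look. The names summarize and follow, the 200-character limit and the replacement of non-printable characters with "?" are all assumptions made for the example; only the 10-second interval comes from the description above.

```python
import time

MAX_LEN = 200       # assumed; the post only says "suitable length"
POLL_SECONDS = 10   # polling interval mentioned in the post


def summarize(chunk, max_len=MAX_LEN):
    """Return one short message for a chunk of new log text, or None if empty."""
    lines = [line for line in chunk.splitlines() if line.strip()]
    if not lines:
        return None
    last = lines[-1]
    # Assumption: mask non-printable characters so binary data never
    # reaches the channel. The original script may not do this.
    last = "".join(ch if ch.isprintable() else "?" for ch in last)
    if len(last) > max_len:
        last = last[:max_len] + "..."
    skipped = len(lines) - 1
    if skipped:
        last += f" [+{skipped} lines skipped]"
    return last


def follow(path, emit=print, poll_seconds=POLL_SECONDS):
    """Poll the log forever, emitting at most one message per interval."""
    with open(path, "r", errors="replace") as log:
        log.seek(0, 2)  # jump to the end so only new data is reported
        while True:
            time.sleep(poll_seconds)
            message = summarize(log.read())
            if message is not None:
                emit(message)
```

With this shape, a burst of a thousand log lines within one interval produces a single short message in the channel, and emit can be swapped for whatever hands text to the bot.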

By the way, this bot is half of the reason why we might complain to you within a few minutes of you committing code which breaks something in MediaWiki. Fortunately MediaWiki has taken steps to prevent committing code which doesn’t even compile, so we can skip some of the useless mistakes caused by carelessness.

Because we care about our users, we want to minimize any disruptions. The measures we have taken are:

  • Even though we update code often, we can roll back easily. With small updates it is easy to identify the cause of a problem, and chances are it gets fixed very fast too.
  • I personally do code review, trying to spot most issues before they reach us.
