Memory optimisations

Yesterday (or rather in the midnight hours) I finally committed a patch to MediaWiki’s message cache. Betawiki uses MediaWiki in a way that puts heavy pressure on the message cache. While normal MediaWiki installations have maybe dozens or a few hundred customisations to MediaWiki interface messages (pages in the MediaWiki namespace), Betawiki has hundreds of thousands of messages in hundreds of languages.

The number of messages that need to be cached efficiently is really in a different order of magnitude. Normally those messages take maybe a few hundred kilobytes in PHP’s serialised format, stored in the database or in the memory cache. In Betawiki all messages together would take about 23 megabytes! It is clear that loading and handling such a big blob is not going to work, especially when it is needed on every page request and has to be updated on every change to the messages.

Some time ago we started to hit the memory limit we have set for PHP requests. I made some hacks to the code to reduce the burden, but those were only hacks. Before this patch we basically stored only the customisations used for Betawiki itself and skipped message cache updates entirely, so the cache was only refreshed after a timeout.

This was far from an ideal solution. The message cache was caching all the other messages individually. This is of course a waste of memory, and more importantly fragmentation increased a lot and the number of requests to the memory cache (we use APC in Betawiki) sky-rocketed to thousands per second.

What made me hesitant to commit this patch was that I needed to update code paths we don’t use in Betawiki, and which thus wouldn’t get much real testing. At the time of writing, it seems to be live on the servers of the Wikimedia Foundation and has not been reverted or received any comments so far, so it probably isn’t totally broken or unacceptable :).

What the new patch actually does is add a new configuration option which, when set to true, splits the cache into smaller caches that each contain the messages for one language only. This greatly reduces memory consumption, as only a couple of languages need to be loaded in normal use. A full localisation of MediaWiki and all supported extensions takes from 500 to 800 kilobytes, depending on the script. The default setting for the new configuration option is false, which should result in behaviour identical to the old version. I also added more comments and standardised the names of the per-language memory cache keys.
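
To illustrate the idea, here is a minimal sketch in Python rather than PHP. It is not the actual MediaWiki code: the class, the `build_backend` helper, the `split` flag and key names such as `messages:fi` are all made up for this example, and a plain dictionary stands in for APC.

```python
from typing import Dict, Optional, Tuple

Blob = Dict[str, str]


def build_backend(messages: Dict[Tuple[str, str], str], split: bool) -> Dict[str, Blob]:
    """Store {(lang, name): text} either as one big blob or as one blob per language."""
    backend: Dict[str, Blob] = {}
    for (lang, name), text in messages.items():
        if split:
            # Hypothetical per-language key, e.g. "messages:fi".
            backend.setdefault(f"messages:{lang}", {})[name] = text
        else:
            # Old behaviour in this sketch: everything lives under a single key.
            backend.setdefault("messages", {})[f"{name}/{lang}"] = text
    return backend


class SplitMessageCache:
    """Toy message cache that loads a blob only the first time it is needed."""

    def __init__(self, backend: Dict[str, Blob], split: bool = False) -> None:
        self.backend = backend      # stands in for the shared memory cache
        self.split = split          # default False mirrors the old behaviour
        self.loaded: Dict[str, Blob] = {}  # blobs held by this request

    def get(self, lang: str, name: str) -> Optional[str]:
        key = f"messages:{lang}" if self.split else "messages"
        if key not in self.loaded:
            self.loaded[key] = self.backend.get(key, {})
        blob = self.loaded[key]
        return blob.get(name) if self.split else blob.get(f"{name}/{lang}")


if __name__ == "__main__":
    msgs = {("fi", "mainpage"): "Etusivu", ("en", "mainpage"): "Main Page"}
    cache = SplitMessageCache(build_backend(msgs, split=True), split=True)
    print(cache.get("fi", "mainpage"), list(cache.loaded))
```

With `split=True` a request that only needs Finnish pulls in the Finnish blob and nothing else, whereas with `split=False` the same request loads every message in every language.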

This will not solve all memory use problems in Betawiki, but it is a big step towards keeping it running efficiently and with as few hacks as possible. Custom hacks are bad because they add a maintenance burden and prevent others from creating a similar setup easily.

Of course the number of messages will only grow in the future. To tackle this I plan to move non-MediaWiki-related messages to another namespace, so that the message cache will not handle them at all.

Betawiki status report

It is raining again, or at least it would be nice if it did. Betawiki has made some nice progress this spring. Aside from the general growth in translators, page requests, translations and languages, the community itself has evolved.

They have created a newsletter that is sent out once a month at most. Some translators have started to suggest enhancements to the messages, for example when a message is missing plural handling or has wording that is hard to translate. Some of our project pages are also being translated, even though the process to do so is a bit awkward.

As a platform we have adopted one new external project, Word2MediaWiki plus, which converts Word documents to wikitext. A new extension named Babel, which users can use to indicate what languages they speak and how fluently, is in development, and Betawiki has helped by providing translations and by acting as a test platform. The Babel extension is developed by MinuteElectron. Let’s hope it will soon be ready for use in Wikimedia projects.

The first external project in Betawiki, FreeCol, also got some revitalisation. I have agreed with Michael Burschik that I will commit language updates from Betawiki once or twice a week. As always, a faster integration cycle helps in testing the translations and the messages themselves before release. Well, not everything is great and perfect yet. FreeCol development is mostly active in the trunk branch, while the translations in Betawiki are for the 0.7.x branch. We support branched translations for MediaWiki and it should be possible to do the same for FreeCol. There are some hoops to jump through, so it hasn’t been done yet, but it should be quite easy. Also, we currently can’t generate statistics for FreeCol, but that will be fixed too.

I’m quite happy with how the different work tasks are spreading out. It’s no longer a one-man project, and normal things go forward even if I’m not there every day. It leaves me more time to actually make the project better rather than just run it. :)

Using MediaWiki’s interface in your own language

Today I fixed bug 13463. It is relevant to people who use MediaWiki with an interface language that is different from the wiki’s default language. When a person logged in to MediaWiki, the first page, the one saying that the login was successful, was shown in the default language.

It has apparently been like this for years, so I wonder why it only recently came up. I remember fixing a similar issue with changing the interface language in preferences a few years back. Maybe people are not using their native language as the interface language as often as they could. It may be that they are multilingual and don’t care what language the interface is in.

Of course there is also a real reason not to use a custom interface language. Interface messages can be customised, and they often are. All these customisations are “lost” when another language is chosen. Is this a problem? Can we do something about it?

MediaWiki has a feature where some interface messages are always displayed in the content language. It is a good thing for important and often customised messages like the one containing copyright information. The bad thing is that this list is somewhat arbitrary and it is not always clear what belongs on it. It is also possible to remove messages from this list using a configuration variable. Adding is not possible.

Now, what if we just added all customised messages to this list and forced them to be shown in the content language? Users would always see the customisations, but we would also lose a bit of localisation support. This may be acceptable on some wikis, but on large multilingual wikis it is not optimal. We could go one step further and translate these customisations to other languages. But to do that we need a translation infrastructure. Special:Allmessages isn’t usable for that.

One solution could be to use the Translate extension. It has all the features needed to easily group and translate messages. As I see it, it would require two steps:

  • Automatic or manual creation of message groups of customised messages
  • Changing MediaWiki to use a different message loading order for these messages (skipping the translations in message files)
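
The second step could look roughly like the Python sketch below. It assumes a simplified lookup order that is only my illustration, not MediaWiki’s real one, and the function name `resolve`, its parameters and the `prefer_custom` set are all hypothetical. Both `custom` (on-wiki customisations) and `files` (translations shipped in message files) are plain dictionaries keyed by `(language, message name)`.

```python
from typing import Dict, FrozenSet, Optional, Tuple

Messages = Dict[Tuple[str, str], str]


def resolve(
    name: str,
    user_lang: str,
    content_lang: str,
    custom: Messages,
    files: Messages,
    prefer_custom: FrozenSet[str] = frozenset(),
) -> Optional[str]:
    """Pick the text for a message under the simplified order assumed here."""
    # 1. A customisation translated into the user's language always wins.
    if (user_lang, name) in custom:
        return custom[(user_lang, name)]

    # 2. Normally the message-file translation comes next. For messages in
    #    prefer_custom that the wiki has customised, skip it so that the
    #    customisation is not hidden by the stock translation.
    skip_files = name in prefer_custom and (content_lang, name) in custom
    if not skip_files and (user_lang, name) in files:
        return files[(user_lang, name)]

    # 3. Fall back to the customisation in the content language.
    if (content_lang, name) in custom:
        return custom[(content_lang, name)]

    # 4. Last resort: the stock text in the content language.
    return files.get((content_lang, name))


if __name__ == "__main__":
    files = {("en", "copyright"): "Default EN", ("fi", "copyright"): "Oletus FI"}
    custom = {("en", "copyright"): "Custom EN"}
    print(resolve("copyright", "fi", "en", custom, files))
    print(resolve("copyright", "fi", "en", custom, files, frozenset({"copyright"})))
```

In this sketch a Finnish user sees the stock Finnish text by default, the wiki’s English customisation once the message is listed in `prefer_custom`, and a Finnish customisation as soon as somebody has translated one.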

Is this needed? Would it be just a nice toy or a useful feature?

I have a summer job

So, I was one of the five lucky winners who were chosen for Kesäkoodi (Summercode Finland). This means that I will be improving the Translate extension we use on Betawiki, and some of the i18n support in MediaWiki. Of course I will be active in the spring too, but the big features are coming in the summer. More about that later.

I also moved this blog to a new host and updated WordPress. In this short time I’ve already developed a hate-hate relationship with it. Where, for example, is the “delete all N spam comments” button for when N is big? Anyway, maybe I’ll get over it.

I’ll probably blog something about MediaWiki here too, if this works and I’m not too lazy.

Do you want your text plain or with furries

Bug 8521 was filed today. It is about an old problem: text escaping in MediaWiki. We dump some of our user interface strings (messages) to the html output without escaping. Some of the messages are parsed as wikitext; the rest are escaped and output as plain text.

Debugging view

There are various reasons why we try to get rid of the unescaped output. One reason is that any sysop can edit those messages to contain anything, ranging from invalid html to code exploiting some security hole in the browser. The latter isn’t that much of a concern, because sysops should be trusted enough not to do evil things, and there is always common.js where they can do it anyway. Invalid html, however, is bad. If we ever get to the point where we output xhtml or xml, the code must be 100% valid, or the site doesn’t work at all. We don’t want a sysop to break the whole site accidentally with no easy way to repair it.

Unfortunately fixing unescaped output isn’t that straightforward. The two main problems are backward compatibility and performance. Many people are actively using complex html markup in some messages and consider it a feature. I think we removed the worst one: we provided [[mediawiki:edittools]] to use instead of [[mediawiki:summary]], although the latter is still unescaped. But there are many left. And from time to time, when we “demote” a message from html to wikitext or plain text where someone uses html, the user goes nuts and breeds a squad of terrorists. And if we add a wikitext message which is parsed by the parser on every page view, one of our fellow developers commits suicide, which we don’t want to happen :).

Actually, there is a third type of message (not counting unescaped), where we only parse brace-thingies, like magic words. Now imagine being a translator or a sysop customising some UI message. Which format should you use in the particular message that you are currently modifying? I don’t have an answer to that question, sorry.
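
To make the difference between the formats concrete, here is a toy sketch in Python. It is not how MediaWiki does it: the function `render` and the format names `raw`, `plain`, `braces` and `wikitext` are invented for this example, the “parser” only understands `'''bold'''`, and the only brace-thingy it knows is `{{SITENAME}}`. Only `html.escape` and `re.sub` are real library calls.

```python
import html
import re


def render(text: str, fmt: str, sitename: str = "Example Wiki") -> str:
    """Render one message source in one of four toy output formats."""
    if fmt == "raw":
        # Unescaped: whatever the sysop typed goes straight into the page.
        return text
    if fmt == "plain":
        # Escaped plain text: markup characters become entities.
        return html.escape(text)

    # The remaining formats expand brace-thingies first.
    expanded = text.replace("{{SITENAME}}", sitename)
    if fmt == "braces":
        return html.escape(expanded)
    if fmt == "wikitext":
        # Stand-in for the real parser: escape html (keeping apostrophes),
        # then turn '''bold''' into <b>bold</b>.
        escaped = html.escape(expanded, quote=False)
        return re.sub(r"'''(.+?)'''", r"<b>\1</b>", escaped)
    raise ValueError(f"unknown format: {fmt}")


if __name__ == "__main__":
    message = "Welcome to '''{{SITENAME}}''' <b>now</b>"
    for fmt in ("raw", "plain", "braces", "wikitext"):
        print(f"{fmt:9} {render(message, fmt)}")
```

The same source text gives four different results, which is exactly why a translator or sysop needs to know which format a given message uses before editing it.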

It’s been a while since I last wrote longer pieces of English, if you didn’t notice already :)