Niklas Laxström

Doing stuff with language and translation.


The special Special Pages of extensions

The first phase of my Summercode Finland project is almost ready. Support for native Gettext projects is in the testing phase, and Xliff support is waiting for comments about which parts of the standard should be supported. In other words, there haven’t been many changes to file format support lately. This week I fixed some bugs found during Gettext testing which actually affected all groups, regardless of file format. For some reason, every time I look at my code I find places to improve and clean it up. I cleaned up the command line maintenance scripts and sprinkled in a few headers for copyright and so on. In the process I managed to introduce a handful of new bugs, but that always happens when I code :).

But let’s talk about the post title. It means that the names of special pages shown in your browser’s address bar are no longer sacred, but can be translated like almost everything else. Now that Firefox 3 has been released, many current browsers even display them nicely, and not in some unfriendly percent encoding like %D0%97%D0%B0%D0%B3%D0%BB%D0%B0%D0%B2%D0%BD%D0%B0%D1%8F_%D1%81%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D0%B0 instead of Заглавная_страница.

Actually, we have supported this for a long time already, but only for MediaWiki itself and not for special pages provided by MediaWiki extensions. Special pages can have multiple aliases, and all of them can be used to access the page, which means that they need some special handling. All of the complexity (yeah right… one do-while loop) is fortunately hidden behind a variable.

To make your extension support translation of special page aliases, you only need to add one line of code and create one file:

$wgExtensionAliasesFiles['YourExtension'] = $dir . 'YourExtension.i18n.alias.php';

And that file should look something like this:

<?php
/**
 * Aliases for special pages of the YourExtension extension.
 */

$aliases = array();

/** English
 * @author YourName
 */
$aliases['en'] = array(
	'YourSpecialPage'          => array( 'YourSpecialPage' ),
);

At least the first alias, YourSpecialPage, should be the same as the key you used when declaring your special page with $wgSpecialPages. Note that WordPress likes to mangle quotes, so it is not safe to copy-paste verbatim from the above.
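
For illustration, aliases for other languages go into the same file; the Finnish names below are made up for this example:

/** Finnish (Suomi)
 * @author YourName
 */
$aliases['fi'] = array(
	// The first alias is the preferred, localised name shown in the
	// address bar; the others keep working as alternatives.
	'YourSpecialPage' => array( 'Erikoissivusi', 'YourSpecialPage' ),
);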

All this was committed today, so there may still be some changes, as always with brand new code. And the good news does not stop there: I already rewrote Special:Magic of the Translate extension to support translating these! It already has two extensions defined: Translate and Configure. The number of supported extensions will probably grow soon.

Project progress

I had to spend some time maintaining Betawiki, so progress has been a little slow for the past week. Aside from that, I’ve been working on many things.

I have set up a test project for Gettext: a Wesnoth campaign. It is not shown to all, just to the few of us testers who are going to translate it using Betawiki. It has already helped to find some bugs and to simplify the code, and the edit view got support for displaying information extracted from the pot file.

To make a project available for translation, it is not enough to just add it to the list. That part is easy to do: just checking out the files and about twenty lines of code. But to really support a project, we need to work closely with the development team and with the existing translation communities around it. It would be easy if we could just get everyone to use Betawiki immediately, but often some people don’t want to use the web interface for one reason or another. We need to set up rules about which languages are translated where, to avoid conflicts; map our language codes to what the project uses; and set up some kind of integration process so that translations actually get to the upstream, and upstream changes propagate back to us in a proper way.

But back to the project. I have been reading the Xliff format specification. It is fortunately quite short and clear, and has nice examples. Xliff supports all kinds of nice features, and I have been trying to decide which of them we need to support. I wrote a simple implementation that can export translations as a minimalistic Xliff file. It was actually pretty easy to do, under 100 lines of code. It would be really good to get someone who uses programs that accept Xliff files to comment on which features would be useful to implement. In any case, I will implement a parser too this week, so that we can get those translations back as well :).
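
To give an idea of the scale, a minimal exporter could look roughly like the sketch below. It uses PHP’s XMLWriter; the function name and the message array layout are made up for this post and do not match the actual Translate extension code.

// Sketch only: export key => array( 'definition' => ..., 'translation' => ... )
// pairs as a minimalistic Xliff 1.2 document.
function exportAsXliff( array $messages, $sourceLang, $targetLang ) {
	$w = new XMLWriter();
	$w->openMemory();
	$w->startDocument( '1.0', 'UTF-8' );
	$w->startElement( 'xliff' );
	$w->writeAttribute( 'version', '1.2' );
	$w->writeAttribute( 'xmlns', 'urn:oasis:names:tc:xliff:document:1.2' );
	$w->startElement( 'file' );
	$w->writeAttribute( 'source-language', $sourceLang );
	$w->writeAttribute( 'target-language', $targetLang );
	$w->writeAttribute( 'datatype', 'plaintext' );
	$w->writeAttribute( 'original', 'messages' );
	$w->startElement( 'body' );
	foreach ( $messages as $key => $m ) {
		// Each message becomes one trans-unit with a source and a target.
		$w->startElement( 'trans-unit' );
		$w->writeAttribute( 'id', $key );
		$w->writeElement( 'source', $m['definition'] );
		$w->writeElement( 'target', $m['translation'] );
		$w->endElement();
	}
	$w->endElement(); // body
	$w->endElement(); // file
	$w->endElement(); // xliff
	return $w->outputMemory();
}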

If the test project doesn’t bring any big surprises, I will start preparing to tackle the next task in the project schedule.

An unproductive start to the week

Well, maybe unproductive is a bit of an overstatement, but considering that I didn’t advance much in my summer project, it wasn’t very productive either.

Anyway, on Monday the internet connection was down for a good number of hours. I got fed up with it and started cooking! I don’t usually cook for myself, so I’m not very good at it. It was tasty however, which gave me courage to do it more often. I spent the rest of the day playing Sid Meier’s Alpha Centauri with the expansion disc. I love that game!

Oh, and FreeCol 0.7.4 was released yesterday (Monday). It didn’t go as well as I hoped. I was unable to commit the latest changes made after Sunday before the release, because the connection was broken. I hope there wasn’t too much effort put into it after Sunday. Now that 0.7.4 is released and the branch is officially dead, we finally have to migrate to the 0.8 branch. Most of the preparations have already been done. I wrote a script that tries to guess key mappings and other changes, so I have the list. In a few days we will rename all FreeCol messages: moving FreeCol to its own namespace, removing the prefix and renaming old keys to new names. The keys have to be fixed in the files too. Expect a short downtime during which it is not possible to translate FreeCol while we make these changes.

And then Tuesday. On Monday Tim Starling committed a change to MediaWiki code that moved files around. (Tim is btw my summer project mentor, but this is unrelated.) There was a short breakage when Siebrand tried to update the code normally, and it was quickly reverted. Today I committed most of our local changes again: a fix to comments; special page normalisation, try 2; a rewrite of Special:RecentChanges to add a few hooks; and so on. I don’t know how many bugs I introduced yet again, but let’s hope not too many. Thanks to Ialex, who already fixed a few.

After that I put Betawiki into maintenance mode and started updating and merging changes. It really didn’t help that my shitty local Internet Service Provider had 35% packet loss while I was doing it… ssh was irritatingly slow. I managed to do it in less than 10 minutes, and now we are back up and running, with fewer local changes.

The rest of the time was spent eating (the same food as yesterday) and fighting with the paperwork for an application for a new place… I have to get away from the dorm before my head explodes. In the evening I probably have to read up on the XLIFF format to implement support for it, or test the Gettext implementation, or write some documentation for it, or something else… who knows.

To balance that out, here are some nice features (or something like that) we got last week:

  1. Possibility to count fuzzy messages in group statistics
  2. Possibility to hide languages that have no translations at all
  3. First pieces of Gettext plural support
  4. Ability to blacklist some authors from the credits (mainly for bots or those of us who do maintenance-like work)
  5. FreeCol got lots of optional messages
  6. Improvements to the RecentChanges filters, implemented the previous week and finally committed to the svn repository today.

Gettext peculiarities

All kinds of weird things show up when looking at different po files. While implementing support for the standard way of doing plurals, I found that, for example, Wesnoth has comments inside the definitions themselves, which is quite unexpected, when the po format itself has designated syntax for many kinds of comments! Just look at this snippet taken from the Gettext manual:

     #  translator-comments
     #. extracted-comments
     #: reference...
     #, flag...
     #| msgid previous-untranslated-string-singular
     #| msgid_plural previous-untranslated-string-plural

I’m not sure what to do with those… It doesn’t hurt to leave them as they are… but that is counter-intuitive to translators… and handling them would make the code more complex… decisions.

Anyway, the plural support seems to kinda work. At least it parses nicely, but I still need to make sure the special handling doesn’t break message comparison (for import) and that plurals get exported properly. Also, KDE3 uses a format of its own for plurals, but I doubt it is worth making that a special case too.
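
For reference, this is what standard Gettext plurals look like in a po file (a made-up example with a Finnish translation):

     msgid "Deleted %d file"
     msgid_plural "Deleted %d files"
     msgstr[0] "Poistettiin %d tiedosto"
     msgstr[1] "Poistettiin %d tiedostoa"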

Aside from improving Gettext support, I had to fix the somewhat broken author export. It didn’t always list all authors, and while I was fixing it, I implemented a blacklist to filter out bots so that they do not appear in every author list.

I also changed the code to support having messages in different namespaces. Now that we support multiple formats, and the number of external projects is probably growing, it was needed: the message cache of MediaWiki was not designed to handle hundreds of thousands of messages. Having one namespace per project also helps filtering and reduces key conflicts. Unfortunately it broke our recentchanges hacks, which means I had to rewrite those, now in a more proper way.

Gettext support is coming

I have managed to translate one string with the extension, export a file including the existing translations, and have it work. This means that I’ve implemented a parser and an exporter for po files.

Most of the time went into refactoring the old code to support other file formats easily. Exporting is as easy as possible:

$langs = array( 'fi', 'de' );
$target = '/tmp';
$group = MessageGroups::getGroup( 'out-freecol' );
$writer = $group->getWriter();
$writer->fileExport( $langs, $target );

The above code exports FreeCol translations to the /tmp directory. Those files can then be committed to the vcs. Each project (group) has a preferred output format. It is also possible to export to other supported file formats by creating a writer manually.

As Gettext messages don’t have ids, I had to create them. Currently it is just hash + snippet, which produces page titles like MediaWiki:33e5da6ddaa6edf2f1cdf8c235813747e40fc326-Disable Ethernet Wake-On-Lan w/fi. It is not pretty, but good enough for now.
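
A minimal sketch of how such a title could be derived; the helper name is made up, and the real code is organised differently:

// Hypothetical helper: build a page title for a Gettext message from
// the SHA-1 hash of its definition plus a short human-readable snippet.
function makeGettextTitle( $definition, $code ) {
	$hash = sha1( $definition );             // 40 hexadecimal characters
	$snippet = substr( $definition, 0, 30 ); // keeps titles recognisable
	return "MediaWiki:$hash-$snippet/$code";
}
// makeGettextTitle( $definition, 'fi' ) then gives a title much like
// the example above.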

Anyway, the support is “coming”, not ready. Still to do are, for example, Gettext plural support, better formatting of authors, importing external changes, and so on.

Two paradigms: Gettext and MediaWiki

Gettext starts from quite a different perspective on i18n. In particular, it differs in who should do the ugly work: Gettext tries to hide the i18n from developers, while the system we are building in Betawiki aims to minimise the work translators have to do. These two aims produce systems that are different, and it needs some thought how to combine them. Fortunately the two aims aren’t entirely incompatible. I have to say that hiding i18n from the developers has its good and bad sides, but I’m not going to judge whether it has more good or more bad.

Paradigms aside, the main difference seems to boil down to tracking changes to messages. Betawiki does it, and it is easy, because every message is identified by a unique name. Gettext doesn’t really track changes; it just prefills the translations for new and changed messages by guesswork.

We use MediaWiki pages, which have a concept of a unique name, so obviously I need to generate some kind of unique name for each message in a Gettext file. Maybe a hash of the contents and the context, which is the Gettext definition of uniqueness. Not pretty, but as the developers aren’t forced to name the messages, there is probably no way to get meaningful names.

That should be a start at least. I have been fiddling with the code, trying to separate file format support into classes of its own, but I’m not yet happy with it.

I also hope I can figure out some clever trick to track message changes from .po files, to keep more history in the wiki. It may be that I could use Gettext’s guesswork algorithm to some extent, but it may also be that it is not worth it, due to the nature of Gettext. In any case we have the history for changes in the translations.

Memory optimisations

Yesterday (or in the midnight hours) I finally committed a patch to MediaWiki’s message cache. Betawiki uses MediaWiki in a way that puts heavy pressure on the message cache. While normal MediaWiki installations have maybe dozens or a few hundred customisations to MediaWiki interface messages (pages in the MediaWiki namespace), Betawiki has hundreds of thousands of messages in hundreds of languages.

The amount of messages that needs to be cached effectively is of a different order of magnitude. Normally those messages take maybe a few hundred kilobytes in PHP’s serialised format, stored in the database or in a memory cache. In Betawiki all messages together would take about 23 megabytes! It is clear that loading and handling such a big blob is not going to work, especially when it is needed on every page request and needs to be updated on every change to the messages.

Some time ago we started to hit the memory limit we have set for PHP requests. I made some hacks to the code to reduce the burden, but those were only hacks. Before this patch we basically stored only the customisations used by Betawiki itself and skipped message cache updates entirely, so the cache would only be updated after a timeout.

This was far from an ideal solution. The message cache was caching all the other messages individually. This is of course a waste of memory; more importantly, fragmentation increased a lot, and requests to the memory cache (we use APC in Betawiki) sky-rocketed to thousands per second.

What made me hesitant to commit this patch was that I needed to update code paths we don’t use in Betawiki, which thus wouldn’t get much real testing. At the time of writing, it seems to be live on the servers of the Wikimedia Foundation and has not been reverted or drawn any comments so far, so it probably isn’t totally broken or unacceptable :).

What the new patch actually does is add a new configuration option which, when set to true, splits the cache into smaller caches that contain messages for one language only. This greatly reduces the memory consumption, as only a couple of languages need to be loaded in normal use. A full localisation of MediaWiki and all supported extensions takes from 500 to 800 kilobytes, depending on the script. The default setting of the new configuration option is false, which should result in behaviour identical to the old version. I also added more comments and standardised the names of the per-language memory cache keys.
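
Conceptually the change is simple. This sketch shows the idea (it is not the actual patch code) using MediaWiki’s $wgMemc object and the wfMemcKey() helper:

// Sketch only. Before: all messages in one huge blob under a single
// cache key (about 23 megabytes in Betawiki's case).
$wgMemc->set( wfMemcKey( 'messages' ), $allMessages );

// After: one cache entry per language, a few hundred kilobytes each,
// loaded only when that language is actually used.
foreach ( $messagesByLanguage as $code => $messages ) {
	$wgMemc->set( wfMemcKey( 'messages', $code ), $messages );
}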

This will not solve all memory use problems in Betawiki, but it is a big step towards keeping it running efficiently, with as few hacks as possible. Custom hacks are bad because they add maintenance burden and prevent others from creating a similar setup easily.

Of course the amount of messages will only grow in the future. To tackle this I plan to move non-MediaWiki related messages to another namespace, so that the message cache will not handle them at all.

Betawiki status report

It is raining again, or at least it would be nice if it were. Betawiki has made some nice progress in the spring. Aside from the general growth in translators, page requests, translations and languages, the community itself has evolved.

The community has created a newsletter that is sent out at most once a month. Some translators have started to suggest enhancements to the messages, for example when one is missing plural handling or has bad wording that is hard to translate. Also, some of our project pages are being translated, even though the process for doing so is a bit awkward.

As a platform we have adopted one new external project, Word2MediaWiki plus, which converts Word documents to wikitext. A new extension named Babel (used by users to indicate which languages they speak and how fluently) is in development, and Betawiki has helped by providing translations and by acting as a test platform. The Babel extension is developed by MinuteElectron. Let’s hope it will soon be ready for use in Wikimedia projects.

Also, the first external project in Betawiki, FreeCol, got some revitalisation. I have agreed with Michael Burschik that I will commit language updates from Betawiki once or twice a week. As always, a faster integration cycle helps in testing the translations and the messages themselves before a release. Well, not everything is great and perfect yet. FreeCol development is mostly active in the trunk branch, while the translations in Betawiki are for the 0.7.x branch. We support branched translations for MediaWiki, and it should be possible to do the same for FreeCol. There are some hoops to jump through, so it hasn’t been done yet, but it should be quite easy. Also, we currently can’t generate statistics for FreeCol, but that will be fixed too.

I’m quite happy with how the different work tasks are spreading out. It’s not a one-man project anymore, and normal things go forward even if I’m not there every day. That leaves me more time to actually make it better, instead of just running the whole project. :)

VR – has anything changed?

Well, no, nothing has changed – except the prices: they have gone up, but the quality has not. Once again I have to conclude: if you have to choose between the train and the bus, take the bus.

I arrived at the railway station at about 12.05 to board the M train. The M train scheduled to depart at 12.03 arrived late; I boarded it unsuspectingly, and it left at 12.09. The trip went normally all the way to Ilmala, where everything started to go wrong.

After we had stood at Ilmala for a moment, we got an announcement that there was congestion at Huopalahti and that we would have to wait for a moment. Fine, except that it gave the impression we would soon be moving again. We waited some more and got a second announcement that we were waiting for a signal allowing us to proceed. After about 20 minutes of waiting came a third announcement: there was some equipment fault at Huopalahti. Around this time VR’s website said that trains might be delayed by 15 minutes(!). Finally, after about three quarters of an hour of waiting, we got to continue the journey; a fifteen-minute train ride had stretched to well over an hour.

Let’s analyse this a bit:

  1. Something was already wrong before the train left the railway station, because it arrived late.
  2. The train left the railway station even though someone should have known that problems lay ahead.
  3. The train drove all the way to Ilmala, from where most passengers do not know how to get away easily.
  4. The announcements first spoke of a short wait, so the train crew did not know about the problems.

Problems:

  1. VR’s communication clearly does not work, not even among the staff, because the train crew were as lost as the passengers.
  2. What little information there was, was overly optimistic and even plain wrong, both on the train and on the website.

How could things have been done better – how could the damage have been minimised? If the information had flowed, the train could have stayed either at the railway station or at Pasila, and the passengers could easily have switched to another mode of transport. Alternatively, the train could have returned from Ilmala to Pasila, or they could have told us right away that the fault was being repaired with no estimate of the duration – and, in the best case, told us about replacement connections. But no; luckily, at least we were not stuck in the middle of the track with no way to get off the train.

It now seems that I will be skipping this “environmentally friendly and energy-efficient mode of transport” in the future. The saddest thing is that this has gone on long enough, and nothing has happened. The customer suffers, and VR faces no sanctions even when the trains stand still. Verkkohesari runs a news item, people bash VR for a day, and everything continues as before. For comparison, one can look at how the trams stand still.

Using MediaWiki’s interface in your own language

Today I fixed bug 13463. It is relevant to people who use MediaWiki with an interface language that is different from the wiki’s default language. When a person logged in to MediaWiki, the first page, the one saying the login was successful, was shown in the default language.

It has apparently been like this for years, so I wonder why it only recently came up. I remember fixing a similar issue with changing the interface language in the preferences a few years back. Maybe people are not using their native language as the interface language as often as they could. It may be that they are multilingual and don’t care what language the interface is in.

Of course there is also a real reason not to use a custom interface language. Interface messages can be customised, and they often are. All these customisations are “lost” when another language is chosen. Is this a problem? Can we do something about it?

MediaWiki has a feature whereby some interface messages are always displayed in the content language. That is a good thing for important and often-customised messages, like the one containing the copyright information. The bad thing is that this list is somewhat arbitrary, and it is not always clear what belongs on it. It is also possible to remove messages from this list using a configuration variable; adding new ones is not possible.
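
If I recall correctly, the configuration variable in question is $wgForceUIMsgAsContentMsg; the example below is adapted from its documentation in DefaultSettings.php:

// Let 'mainpage' and 'portal-url' vary by user language instead of
// always using the content language version, as on multilingual wikis
// such as Wikimedia Commons.
$wgForceUIMsgAsContentMsg = array( 'mainpage', 'portal-url' );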

Now, what if we just added all customised messages to this list and forced them to be shown in the content language? Users would always see the customisations, but we would also lose a bit of the localisation support. This may be acceptable on some wikis, but on large multilingual wikis it is not optimal. We could go one step further and translate these customisations into other languages. But to do that we need a translation infrastructure, and Special:Allmessages isn’t usable for that.

One solution could be to use the Translate extension. It has all the features needed to easily group and translate messages. As I see it, it would require two steps:

  • Automatic or manual creation of message groups for customised messages
  • Changing MediaWiki to use a different message loading order for these messages (skipping the translations in message files)

Is this needed? Would it be just a nice toy, or a useful feature?