Category Archives: MediaWiki

The Translate extension for MediaWiki has documentation

The Translate extension for MediaWiki is no longer just a hack for translatewiki.net. Actually it hasn’t been that for a long time, but recently other projects have started using it. That means lots of things, like supporting stable releases of MediaWiki instead of just development versions.

Today’s topic is documentation. I have been amending our existing documentation with Siebrand. Previously there was only some documentation on how to install the Translate extension. Now we have sections for the page translation feature, the configuration of the extension, message group configuration and command line scripts. All these have been collected into our documentation index page along with links to other resources. One of those other resources is code documentation generated with Doxygen. That should really help anyone who is interested in developing for the Translate extension – yes, we are looking for help!

Naturally documentation is a moving target and it will be improved continuously, like the code itself. While we have documentation for developers and those who want to install and configure the Translate extension, we are still lacking great user documentation in many areas. Even though the saying goes that good software does not need separate documentation, that does not mean we shouldn’t have any. It is important to show everyone what can be done with the Translate extension and to get them either interested or have them use the software (more) efficiently as an end user.

GSoC wrap-up – Translate extension

GSoC is almost over now. Lots of cool things have happened, but unfortunately you may not be aware of them, because I have neglected to blog about them. That is definitely a regression compared to last year and something to keep in mind in the future. I managed to do almost all tasks from the project plan with priority higher than 4, with some rough edges here and there. Next I will pick some highlights from the completed tasks.

Improved usability

This year there were many usability-related tasks to improve the translation workflow. The improvements done by the Wikimedia Usability project nicely complement my work for the benefit of our less technically oriented audience. The most important improvement is probably the buzzword-compatible ajax-editing. No longer do translators need to open a new browser tab for each message they want to edit; instead they get a floating dialog inside the current page (implemented using jQuery dialog). This means they never need to leave the list of messages any more, as it always stays in the background. It also makes it easier to do quick edits to message documentation or other languages, because you just get a new dialog, and once you have finished editing in it, you are back in the previous message.

Ajax edit interface

Other features include a user preference to choose additional languages to show when translating. The feature itself is not new, but now users can customise the list of languages.

Languages can be selected from the dropdown and using the button or typing in the language codes directly.

To date we have not really taken advantage of achievements in language technology. Now we have taken the first steps towards it by implementing a simple translation memory. It is a very simple setup, where we use tmserver from Translate Toolkit and fill it from time to time with existing translations from translatewiki.net. Tmserver uses the well-known Levenshtein algorithm to give suggestions. It isn’t very good, nor anything compared to state-of-the-art systems, but the suggestions have already been useful, as the translators themselves have told us. There are many ways to improve the suggestions, from better algorithms to using a larger set of translations as source data and preprocessing the source data (text alignment, case and punctuation normalisation). I’m looking forward to them.
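To give an idea of how Levenshtein-based suggestions work, here is a minimal sketch. It is not how tmserver is implemented; the function name suggestTranslations, the similarity formula and the sample data are all made up for illustration.

```php
<?php
// Hypothetical helper: rank stored translations by how close their source
// text is to the text being translated. Not tmserver's actual code.
function suggestTranslations( string $source, array $memory, int $limit = 3 ): array {
	$scored = array();
	foreach ( $memory as $knownSource => $translation ) {
		// levenshtein() counts single-byte edits, so it is only approximate
		// for multibyte text (and PHP older than 8.0 limits it to 255 bytes).
		$distance = levenshtein( $source, $knownSource );
		$longest = max( strlen( $source ), strlen( $knownSource ), 1 );
		$scored[] = array(
			'source' => $knownSource,
			'translation' => $translation,
			// Assumed scoring: 1.0 means identical, 0.0 means nothing in common.
			'similarity' => 1 - $distance / $longest,
		);
	}
	// Best matches first.
	usort( $scored, function ( $a, $b ) {
		return $b['similarity'] <=> $a['similarity'];
	} );
	return array_slice( $scored, 0, $limit );
}

// Made-up translation memory: English source => Finnish translation.
$memory = array(
	'Save page' => 'Tallenna sivu',
	'Show preview' => 'Esikatsele',
	'Delete page' => 'Poista sivu',
);
$suggestions = suggestTranslations( 'Save pages', $memory, 2 );
```

Here the best suggestion for "Save pages" is the stored translation of "Save page", because the two source texts are only one edit apart.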

Other changes

There were many improvements to the lesser used features. Special features (magic words, special page aliases, namespaces) can now be exported using a script. No more time wasted in copy-pasting. In addition it is now possible to localise magic words for extensions. It is up to translation teams to decide whether they want to do this understandably controversial thing.

In message checks there were at times false positives, which caused confusion among translators. Now there is a flexible system to suppress those warnings.

Gettext-style plurals are now supported better, but none of our Gettext projects is using them yet. Related to that, there is now a special page to import offline translations. We can now give trusted translators or users the permission to import offline translations, taking that work away from server admins. It supports download from a URL, files uploaded to the wiki and local file uploads.

The offline importer actually uses the same engine that I developed for another feature: web-based message group management. It is now possible for project admins to import external changes, and fuzzy other translations if necessary, using their browser. It is much easier than doing those steps manually on the command line, but there are still some practical problems to solve. One major piece still missing is integration with version control systems, so command line access is still needed to do svn up or similar for other systems. It is somewhat related to the other problem, which is the limited execution time for web requests. The code is currently wise enough to check after every action whether we are near the limit, and to stop further processing and give the user the ability to continue from that point. We can’t increase the execution time without limit, but there might be hope, for example by doing multiple requests with ajax to spare the user from clicking the continue button many times.
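The check-and-continue idea can be sketched roughly like this. The function name processWithBudget, the shape of the return value and the use of a plain list of actions are my assumptions for the example, not the code in the extension.

```php
<?php
// Hypothetical sketch: run queued actions until a time budget is used up,
// then report where to continue. $actions must be a plain list (0, 1, 2, ...).
function processWithBudget( array $actions, int $start, float $budgetSeconds, callable $handler ): array {
	$began = hrtime( true ); // monotonic clock in nanoseconds (PHP 7.3+)
	$total = count( $actions );
	for ( $i = $start; $i < $total; $i++ ) {
		$handler( $actions[$i] );
		// After every action, check whether the budget has run out.
		$elapsed = ( hrtime( true ) - $began ) / 1e9;
		if ( $i + 1 < $total && $elapsed >= $budgetSeconds ) {
			// Stop here; the caller can continue from 'next' in a new request.
			return array( 'done' => false, 'next' => $i + 1 );
		}
	}
	return array( 'done' => true, 'next' => $total );
}

// Usage with a generous budget: everything is handled in one go.
$handled = array();
$status = processWithBudget(
	array( 'import A', 'import B', 'fuzzy C' ),
	0,
	3600.0,
	function ( $action ) use ( &$handled ) {
		$handled[] = $action;
	}
);
```

A continue button, or an ajax loop, would simply call the same function again with the returned 'next' value as the new starting point.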

The future

There is always something to be done or something that can be improved. I will focus on improving the new web interface and group management, which are still quite immature. Ajax-editing works, but without proper polishing it is still missing the cool factor. And as if that weren’t enough, Siebrand has collected a wish list for me. I will try my best to fulfil each request in the time I have, which is limited especially now that the study year starts again.

It will be interesting to see where we are next year. We are not alone any more and while other platforms are developing I want to keep translatewiki.net special – to give a face to internationalisation and localisation instead of being just a dumping ground for translations.

GSoC status report – Translate extension

Last year I participated in Summercode Finland. During that I added many new features to the Translate extension, to allow the biggest user of the extension, translatewiki.net, to grow bigger. And now translatewiki.net is indeed bigger. This year the project plan contains many tasks which aim to make the user experience more pleasant for both the translators and the project admins. In addition there is a pile of bug fixes and i18n improvements to MediaWiki. I will tell more about those features when I finish them.

The first coding week is now in the past. The big task for that week turned out to be more difficult than estimated. It was about making certain things faster, mostly the generation of translation statistics. The cause of the slowness was fuzzy messages, which are messages that have a translation, but the translation needs updating or reviewing. Information about fuzziness was stored as a text string in the message content itself. Now it is mirrored to another table, where it can be queried without loading the translation and checking for the existence of the fuzzy string. Thanks to everyone who helped with that.
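The difference between the two approaches can be sketched as follows. This is a simplified illustration with arrays standing in for database tables; the marker text, the function names and the sample messages are assumptions made for the example.

```php
<?php
// Assumption: the marker text. The post only says fuzziness is stored as
// a string inside the message content.
const FUZZY_MARKER = '!!FUZZY!!';

// Old approach: every translation has to be loaded and searched.
function countFuzzyByScanning( array $translations ): int {
	$count = 0;
	foreach ( $translations as $text ) {
		if ( strpos( $text, FUZZY_MARKER ) !== false ) {
			$count++;
		}
	}
	return $count;
}

// New approach: keep a separate index (standing in for the extra table)
// that is updated whenever a translation is saved.
function updateFuzzyIndex( array &$index, string $messageKey, string $text ): void {
	if ( strpos( $text, FUZZY_MARKER ) !== false ) {
		$index[$messageKey] = true;
	} else {
		unset( $index[$messageKey] );
	}
}

// Made-up translations, one of them fuzzy.
$translations = array(
	'save/fi' => 'Tallenna',
	'preview/fi' => FUZZY_MARKER . 'Esikatselu',
	'delete/fi' => 'Poista',
);
$index = array();
foreach ( $translations as $key => $text ) {
	updateFuzzyIndex( $index, $key, $text );
}
// Statistics can now use count( $index ) without touching any message text.
```

The cost of the check moves from statistics time, where it was paid for every message, to save time, where it is paid once per edit.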

Fortunately I managed to do some other tasks too. Siebrand is likely to be happy that he can export translations of MediaWiki’s namespaces, magic words and special page aliases with one command on the command line. That is, instead of using a web browser to request an export of each of those features for each language individually and pasting them into the translation files. It should save some precious time for better use.

Stay tuned for the next status report! It may take a week or two, as I am planning a little holiday trip to Sottunga in Åland and I don’t expect to be connected very often.

Drawing i18ned text in images.

A picture is worth a thousand words, but drawing a word can be harder than one expects.

Usually it is a good idea to avoid text in images, for multiple reasons. Foremost, text in images makes localisation hard: it requires tools, some skill in image manipulation and manual work. Another benefit of avoiding it is the need to store only one copy of the image.

In some cases it is unavoidable to use text in images. In other cases… it is just used for lesser reasons. In this post I will not talk about layout issues, like limited space and inflexibility in image size. In Betawiki we have hundreds of languages, many of which use poorly supported scripts.

The PHP GD library provides two functions to draw text. imagestring can only draw text in Latin-2, so we can forget it immediately. The other one is imagettftext, which since PHP 5.2.0 accepts UTF-8. Great, now we can pass all the translations we have to it. The next problem is choosing a suitable font, since imagettftext specifically needs the path to one in its parameters. As we know, there is no font that covers all scripts, and it is too much work to manually map language codes to fonts and require everyone using the code to install just those fonts.

The only way to automatically choose a proper font for a language (script) code is fontconfig. I have written a wrapper, which calls the command line utilities of fontconfig to fetch the most suitable font. This does not solve the missing font problem, but if there is a suitable font in the system and fontconfig knows about it, it will be used. And yet, there are still problems, like wrong rotation for Japanese.
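A wrapper of this kind could look roughly like the sketch below. It is not the actual wrapper; the function names and the language code check are made up, and it assumes a Unix-like system where fontconfig’s fc-match utility (with its -f format option) is installed.

```php
<?php
// Hypothetical: build the fc-match command that asks fontconfig for the
// file of the best matching font for a language code.
function buildFontQuery( string $languageCode ): string {
	// Assumed sanity check, so that only code-like strings reach the shell.
	if ( !preg_match( '/^[a-z]{2,3}(-[a-z0-9]+)*$/i', $languageCode ) ) {
		throw new InvalidArgumentException( "Unexpected language code: $languageCode" );
	}
	return 'fc-match -f ' . escapeshellarg( '%{file}' )
		. ' ' . escapeshellarg( ":lang=$languageCode" );
}

// Hypothetical: return the path to a font file, or null if none was found.
// Note that fc-match returns its best guess even when no installed font
// really covers the language.
function findFontForLanguage( string $languageCode ): ?string {
	$output = shell_exec( buildFontQuery( $languageCode ) );
	$path = is_string( $output ) ? trim( $output ) : '';
	return ( $path !== '' && is_file( $path ) ) ? $path : null;
}
```

The path returned by findFontForLanguage( 'ja' ) could then be passed as the font file parameter of imagettftext.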

The big question: is there any better way to do this?

Status update: Statistics etc.

My progress on implementing nice statistics has been an on-off trip. Both MediaWiki and FreeCol are going to make releases soon. And then there are all kinds of bugs here and there that I feel obligated to fix. During the weekend I managed to fix a very bad memory leak, where one of our scripts was using all the memory of the server; after the fix it stays at a quite stable 30M. I really want to thank milian from #geshi for the help with using xdebug and for his nice tools to identify the cause.

Gettext and Xliff: Nothing much here. Still haven’t tested msgmerge, so it is to be seen how well it works.

Other features: Special page alias translation got a really big boost. Suddenly the number of supported extensions has grown to 23, and we have already “produced” hundreds of translations in many languages. Message formatting checks got small improvements, and now that the leak is fixed, we can update those regularly too.

So let’s go ahead to the stuff I was meant to do: stats. Thanks to a friend who suggested using PHPlot, I have managed to make pretty good progress on this despite all the other stuff going on. I think I’m going to explain my progress by using a few examples and eye candy. Click the images to show full size versions if they are scaled.

First we have a graph showing the number of translation edits per day in Betawiki.

All translation edits in MediaWiki

It is also possible to compare projects:

Edits to MediaWiki and FreeCol compared

And then we have graphs in our portals:

Finnish translation edits

Or if you want to compare how your worst (best?) rival is doing much better than your language:

Comparison of Finnish and Swedish activity

Or do it only for one project:

Comparison of Finnish and Swedish activity for mobile broadband configuration assistant

We also have graphs in our project pages.

As you can see, the labels could use some polishing. There is no GUI for generating these, but it is easy if one knows the configuration parameters. It is possible to include them in pages with the special page inclusion syntax: {{Special:TranslationStats/language=xx;days=nn;group=id}} The size can also be changed with width and height parameters.
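To make the parameter format concrete, here is a small sketch of how such a string could be split into options. It is not the parser used by the special page; the function name parseStatsParams, the default values and the group id core are made up for the example.

```php
<?php
// Hypothetical parser for the part after "Special:TranslationStats/".
function parseStatsParams( string $subpage ): array {
	// Illustrative defaults, not the real ones.
	$params = array(
		'language' => '',
		'days' => 30,
		'group' => '',
		'width' => 600,
		'height' => 400,
	);
	foreach ( explode( ';', $subpage ) as $pair ) {
		$parts = explode( '=', $pair, 2 );
		if ( count( $parts ) !== 2 ) {
			continue; // not a name=value pair
		}
		list( $name, $value ) = array_map( 'trim', $parts );
		if ( !array_key_exists( $name, $params ) ) {
			continue; // unknown parameter
		}
		// Keep numeric options as integers.
		$params[$name] = is_int( $params[$name] ) ? (int)$value : $value;
	}
	return $params;
}

$params = parseStatsParams( 'language=fi;days=60;group=core;width=800' );
```

Parameters that are left out, like height here, simply keep their default value.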

Every graph is visually about the same. I kind of like it, but YMMV. If this feature turns out to be very popular, I have to figure out how to do more aggressive caching. The data is fetched from the Betawiki recent changes table. It means that external changes are not counted, which is one more reason to use Betawiki.

Localisation of images

In the midst of fixing bugs I remembered an old feature request for localising images. One image may be worth a thousand words, but what if those words are in a foreign language? Now it is possible to replace anglocentric images in the user interface with localised ones. I use this opportunity to add some images to my pretty boring blog entries :)

So here is the current default toolbar in MediaWiki’s edit view:

Here is the same when using Arabic as the user interface language:

And one more example, which is for Belarusian (Taraškievica orthography):

The special Special Pages of extensions

The first phase of my Summercode Finland project is almost ready. Support for native Gettext projects is in the testing phase, and Xliff support is waiting for comments about which parts of the standard should be supported. In other words, there haven’t been many changes to file format support lately. This week I fixed some bugs found in Gettext testing, which actually affected all groups regardless of the file format. For some reason, every time I look at my code I find places to improve and clean up. I cleaned up the command line maintenance scripts and sprinkled in a few headers for copyright and so on. In the process I managed to introduce a handful of new bugs, but that always happens when I code :).

But let’s talk about the post title. It means the names of special pages shown in your browser’s address bar are no longer sacred but can be translated like almost everything else. Now that Firefox 3 has been released, many current browsers even display them nicely as Заглавная_страница, and not in some unfriendly percent encoding like %D0%97%D0%B0%D0%B3%D0%BB%D0%B0%D0%B2%D0%BD%D0%B0%D1%8F_%D1%81%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D0%B0.
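For the curious, the two forms are the same title. A quick way to see it, using PHP’s standard rawurlencode and rawurldecode functions on a UTF-8 string:

```php
<?php
// The page title as readers would like to see it (UTF-8 source file assumed).
$title = 'Заглавная_страница';

// What actually travels in the URL: every UTF-8 byte as %XX.
$encoded = rawurlencode( $title );

// And back again, which is what a friendly browser shows.
$decoded = rawurldecode( $encoded );

echo $encoded, "\n", $decoded, "\n";
```

Each Cyrillic letter takes two bytes in UTF-8, which is why the encoded form is so much longer than the readable one.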

Actually, we have supported this for a long time already, but only for MediaWiki itself and not for special pages provided by MediaWiki extensions. Special pages can have multiple aliases, and all of those can be used to access the page, which means that they need some special handling. All of the complexity (yeah right… one do-while loop) is fortunately hidden behind a variable.

To make your extension support translating of special page aliases, you only need to add one line of code and create one file.

$dir = dirname( __FILE__ ) . '/'; // directory of your extension, with trailing slash
$wgExtensionAliasesFiles['YourExtension'] = $dir . 'YourExtension.i18n.alias.php';

And that file should look something like this:

<?php
/**
* Aliases for special pages of YourExtension extension.
*/

$aliases = array();

/** English
* @author YourName
*/
$aliases['en'] = array(
	'YourSpecialPage'          => array( 'YourSpecialPage' ),
);

At least the first instance of YourSpecialPage should be the same as the key you used for declaring your special page with $wgSpecialPages. Note that WordPress likes to mangle quotes, so it is not safe to copy-paste verbatim from the above.
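To show what the aliases array is good for, here is a rough sketch of resolving a requested name back to the key of the special page. This is not MediaWiki’s code; the function name resolveSpecialPage, the fallback to English and the Finnish alias names are assumptions for illustration.

```php
<?php
// Hypothetical resolver; $aliases has the same shape as in the alias file.
function resolveSpecialPage( array $aliases, string $lang, string $name ): ?string {
	// Assumption: try the requested language first, then fall back to English.
	foreach ( array_unique( array( $lang, 'en' ) ) as $code ) {
		foreach ( $aliases[$code] ?? array() as $canonical => $names ) {
			// Any of the aliases leads to the same special page.
			if ( in_array( $name, $names, true ) ) {
				return $canonical;
			}
		}
	}
	return null;
}

$aliases = array();
$aliases['en'] = array(
	'YourSpecialPage' => array( 'YourSpecialPage' ),
);
// Made-up Finnish aliases: two names for the same page.
$aliases['fi'] = array(
	'YourSpecialPage' => array( 'Toimintosivusi', 'SinunToimintosivusi' ),
);

$page = resolveSpecialPage( $aliases, 'fi', 'SinunToimintosivusi' );
```

With this data both Finnish names and the English name resolve to YourSpecialPage.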

All this was committed today, so there may be some changes still, as always with brand new code. And the good news does not stop there. I already rewrote the Special:Magic page of the Translate extension to support translating these! It already has two extensions defined: Translate and Configure. The number of supported extensions will probably grow soon.

Memory optimisations

Yesterday (or in the midnight hours) I finally committed a patch to MediaWiki’s message cache. Betawiki uses MediaWiki in a way that puts heavy pressure on the message cache. While normal MediaWiki installations have maybe dozens or a few hundred customisations to MediaWiki interface messages (pages in the MediaWiki namespace), Betawiki has hundreds of thousands of messages in hundreds of languages.

The number of messages that need to be cached effectively is really of a different order of magnitude. Normally those messages take maybe a few hundred kilobytes in PHP’s serialised format, stored in the database or in the memory cache. In Betawiki all messages together would take about 23 megabytes! It is clear that loading and handling such a big blob is not going to work, especially when it is needed on every page request and needs to be updated on every change to the messages.

Some time ago we started to hit the memory limit we have set for PHP requests. I made some hacks to the code to reduce the burden, but those were only hacks. Before this patch we basically stored only the customisations used by Betawiki itself and skipped message cache updates entirely, so the cache was only refreshed after a timeout.

This was far from an ideal solution. The message cache was caching all the other messages individually. This is of course a waste of memory, and more importantly fragmentation increased a lot and requests to the memory cache (we use APC in Betawiki) sky-rocketed to thousands per second.

What made me hesitant to commit this patch was that I needed to update code paths we don’t use in Betawiki, and which thus wouldn’t get much real testing. At the time of writing, it seems to be live on the servers of the Wikimedia Foundation and has not been reverted or received any comments so far, so it probably isn’t totally broken or unacceptable :).

What the new patch actually does is add a new configuration option, which when set to true will split the cache into smaller caches that each contain the messages for one language only. This greatly reduces memory consumption, as only a couple of languages need to be loaded in normal use. Full localisation of MediaWiki and all supported extensions takes from 500 to 800 kilobytes, depending on the script. The default setting for the new configuration option is false, which should result in behaviour identical to the old version. I also added more comments and standardised the names of the per-language memory cache keys.
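The idea of the split can be sketched like this. It is a simplified illustration, not the patch itself: the class name SplitMessageCache, the key format messages:xx and the plain array standing in for the memory cache are all assumptions.

```php
<?php
// Hypothetical sketch of a message cache that is split per language.
class SplitMessageCache {
	/** @var array Stand-in for the memory cache: cache key => messages */
	private $store;
	/** @var array Languages loaded during this request: code => messages */
	private $loaded = array();

	public function __construct( array $store ) {
		$this->store = $store;
	}

	// Assumed key format; the real key names are not shown in this post.
	public static function cacheKey( string $lang ): string {
		return "messages:$lang";
	}

	public function get( string $key, string $lang ): ?string {
		if ( !isset( $this->loaded[$lang] ) ) {
			// Only the requested language is fetched, not one big blob.
			$this->loaded[$lang] = $this->store[self::cacheKey( $lang )] ?? array();
		}
		return $this->loaded[$lang][$key] ?? null;
	}

	public function loadedLanguages(): array {
		return array_keys( $this->loaded );
	}
}

// Made-up cache contents for three languages.
$cache = new SplitMessageCache( array(
	'messages:fi' => array( 'save' => 'Tallenna' ),
	'messages:sv' => array( 'save' => 'Spara' ),
	'messages:ja' => array( 'save' => '保存' ),
) );
$text = $cache->get( 'save', 'fi' );
```

A request that only needs Finnish messages never touches the Swedish or Japanese entries, which is where the memory saving comes from.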

This will not solve all memory use problems in Betawiki, but it is a big step towards keeping it running efficiently, and with as few hacks as possible. Custom hacks are bad because they add maintenance burden and prevent others from creating a similar setup easily.

Of course the number of messages will only grow in the future. To tackle this I have planned to move non-MediaWiki related messages to another namespace, so that the message cache will not handle them at all.

Using MediaWiki’s interface in your own language

Today I fixed bug 13463. It is relevant to people who use MediaWiki with an interface language that is different from the wiki’s default language. When a person logged in to MediaWiki, the first page, saying that the login was successful, was shown in the default language.

It has apparently been like this for years, so I wonder why it only recently came up. I remember fixing a similar issue with changing the interface language in preferences a few years back. Maybe people are not using their native language as the interface language as often as they could. It may be that they are multilingual and don’t care what language the interface is in.

Of course there is also a real reason not to use a custom interface language. Interface messages can be customised, and they often are. All these customisations are “lost” when another language is chosen. Is this a problem? Can we do something about it?

MediaWiki has a feature where some interface messages are always displayed in the content language. It is a good thing for important and often customised messages like the one containing copyright information. The bad thing is that this list is somewhat arbitrary and it is not always clear what belongs on the list. It is also possible to remove messages from this list using a configuration variable. Adding is not possible.

Now, what if we just added all customised messages to this list and forced them to be shown in the content language? Users would always see customisations, but we would also lose a bit in localisation support. This may be acceptable on some wikis, but on large multilingual wikis this is not optimal. We could go one step further and translate these customisations to other languages. But to do that we need a translation infrastructure. Special:Allmessages isn’t usable for that.

One solution could be to use the Translate extension. It has all the needed features to easily group and translate messages. As I see it, it would require two steps:

  • Automatic or manual creation of message groups of customised messages
  • Change MediaWiki to use different message loading order for these messages (skip the translations in message files)

Is this needed? Would it be just a nice toy or useful feature?

I have a summer job

So, I was one of the five lucky winners who were chosen for Kesäkoodi (Summercode Finland). This means that I will be improving the Translate extension we use on Betawiki, and some i18n support in MediaWiki. Of course I will be active in the spring too, but the big features are coming in the summer. More about that later.

I also moved this blog to a new host, and updated WordPress. In this short time I’ve already developed a hate-hate relationship with it. Where, for example, is the button to delete all N spam comments when N is big? Anyway, maybe I will get over it.

I’ll probably blog something about MediaWiki here too, if this works and I’m not too lazy.