MediaWiki Language Extension Bundle 2012.12

MediaWiki Language Extension Bundle 2012.12 was released just before Christmas. It is compatible with MediaWiki versions 1.20 and 1.21alpha. Downloads and installation instructions can be found at https://www.mediawiki.org/wiki/MLEB. Announcements of new releases will be posted to the mediawiki-i18n mailing list.

Here are the highlights:

cldr

English name for Azerbaijani (arz) was added.
A bug that caused local names for be-tarask not to be used was fixed.
Translations for be-tarask were updated.

Translate

Lots of development is ongoing on the translation user interface redesign project conducted by the WMF Language Engineering Team. The new message list and translation editor (pictured) are in alpha stage, but interested users can activate them by using the URL parameter tux=1 while on Special:Translate. Likewise, tux=0 brings back the old interface.

$wgTranslateAC and $wgTranslateEC were removed. If you were still using these, switch over to the TranslatePostInitGroups hook or $wgTranslateCC.

Bundled Solarium library was removed. Install it manually or use the MediaWiki Solarium extensions.

Sneak peek from the new translation UX: the new group selector (top) and part of the revamped translation editor.

Other noteworthy changes in Translate:

ApiQueryMessageGroups module has lots of new functionality.
There is a new ApiQueryLanguageStats module.
(bug 39761) Special:TranslationStats counts for edits includes also reviews
GettextFFS: Handle empty but existing msgctxt properly
New hook: TranslateSupportedLanguages

Universal Language Selector

Fixed a display issue in the Modern skin.
(bug 42382) Indicate context in input settings/more languages

MediaWiki Language Extension Bundle launches!

The Wikimedia Language Engineering team is pleased to announce the first release of the MediaWiki Language Extension Bundle. The bundle is a collection of a few selected MediaWiki extensions needed by any wiki which desires to be multilingual.

This first bundle release (2012.11) is compatible with MediaWiki 1.19, 1.20 and 1.21alpha.

Get it from https://www.mediawiki.org/wiki/MLEB

The Universal Language Selector is a must-have, because it provides essential functionality for any user regardless of the number of languages they speak: language selection, font support for displaying scripts badly supported by operating systems, and input methods for typing languages that don’t use the Latin (a-z) alphabet.

Maintaining multilingual content in a wiki is a mess without the Translate extension, which is used by Wikimedia, KDE and translatewiki.net, where hundreds of pieces of documentation and interface translations are updated every day; with Localisation Update your users will always have the latest translations freshly out of the oven. The Clean Changes extension keeps your recent changes page uncluttered from translation activity and other distractions.

Don’t miss the chance to practice your rusty language skills and use the Babel extension to mark the languages you speak and to find other speakers of the same language in your wiki. And finally the cldr extension is a database of language and country translations.

We are aiming to make a new release every month, so that you can easily stay on the cutting edge with the constantly improving language support. The bundle comes with clear installation and upgrade instructions. It is tested against MediaWiki release versions, so you can avoid most of the temporary breaks that would happen if you were using the latest development versions instead.

Because this is our first release, there can be some rough edges. Please give us lots of feedback so that we can improve for the next release.

Performance tuning translatewiki.net

One of the biggest advantages of desktop translation tools is that they don’t have delays rendering the interface – at least not on such a scale as websites have. In translatewiki.net it is crucial that our pages load very fast. In certain places we can and do use intelligent preloading to remove the delays; in other places we have to employ complex caching algorithms to reach that target. I regularly monitor the automatically collected profiling information to avoid regressions and to pick low-hanging fruit from time to time.

In the last sprint my main task was to convert the way we handle the translation of MediaWiki extensions in translatewiki.net to use the same processes and interfaces as pretty much everything else. MediaWiki and MediaWiki extensions were the first things supported in translatewiki.net and now they are among the last things to get modernized to take advantage of better interfaces built on the years of experience supporting various kinds of products.

The only user-visible change is improved performance. The new interfaces are more efficient and enable more optimizations, which allows us to deliver faster page views and scale to more messages. It will also simplify the work of translatewiki.net staff, as they won’t need to follow two different processes, especially once we also update the MediaWiki translation code.

Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.

As a developer I’m proud that the new code is unit tested. The culmination, however, was a change which removed hundreds of lines of old code: in fact, the above quote applies to software development too.

For those interested in details, the biggest performance boosts were achieved by avoiding the need to parse the translation files in many places – the list of message keys and their values is stored in intermediate cache files in CDB format. In addition there were many smaller performance optimizations, like no longer using a MediaWiki method to construct link elements that consumed 20 kilobytes of memory for each link. When there are thousands of links, it adds up and is excessive for producing a few hundred bytes of output. I switched to a lower-level method (memory usage: from 175 to 12 MB).
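To illustrate the caching idea, here is a minimal Python sketch. It is not the Translate extension's code: the `key=value` file format, the function names and the JSON cache file (standing in for CDB) are all assumptions made for the example. The point is only that the expensive parse runs once, and later lookups read the intermediate cache as long as the source file has not changed.

```python
import json
import os


def parse_messages(path):
    """Toy parser for 'key=value' lines; stands in for a real file format parser."""
    messages = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            messages[key.strip()] = value.strip()
    return messages


def load_messages(path, cache_path):
    """Return the messages for `path`, reparsing only when the source file changed."""
    source_mtime = os.path.getmtime(path)
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as handle:
            cached = json.load(handle)
        if cached.get("mtime") == source_mtime:
            return cached["messages"]  # cache hit: no parsing needed
    messages = parse_messages(path)
    with open(cache_path, "w", encoding="utf-8") as handle:
        json.dump({"mtime": source_mtime, "messages": messages}, handle)
    return messages
```

A real constant database such as CDB also allows looking up a single key without reading the whole file, which this JSON stand-in does not.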

Some low-hanging fruit might not be as easy to pick as it seems at first. (Photo CC-BY-SA by Asit K. Ghosh.)

At the time of writing I still have some more fixes pending further testing and cleanup. For example, to access any single message group, all of them have to be loaded. They are cached as serialized PHP objects, but loading them takes 20 milliseconds and 10 megabytes of memory. I’m working on making it possible to load cached message groups individually.

The website anyone can translate

Translatewiki.net has started using Puppet. Puppet is a tool designed to manage the configuration of servers. Like Wikimedia’s, our configuration is public and stored in the translatewiki.net git repository, where anyone can submit patches. I don’t expect a flood of them coming in anytime soon; my motivations for this were different. If you remember, some months back I had to learn some Puppet to write the Solr configuration for Wikimedia deployment. Now I wanted to learn more and gather more experience in using Puppet. It will also greatly help if we ever need to reinstall the translatewiki.net server from scratch (which is quite likely to happen soon). As a bonus it gives transparency and something I can refer people to when they ask how some particular thing is done in translatewiki.net. As time permits, I will be moving more configuration to Puppet.

Mitä isot edellä, sitä pienet perässä. (The Internet suggests the closest translation is Monkey see, monkey do.)

I also added the translatewiki.net repository to Ohloh. If you use translatewiki.net as localisation platform, feel free to add it to your stacks by clicking “I use this”, or to embed its widgets in your website. Ohloh also gives some cool stats:

In a Nutshell, translatewiki.net…

…has had 739 commits made by 20 contributors representing 3,288 lines of code
…has a young, but established codebase maintained by a large development team with stable year-over-year commits

Together with the introduction of Puppet, I also switched the webserver of translatewiki.net from lighttpd to nginx. The biggest reason for this is that https was broken for Google Chrome users, but in general nginx feels faster and more robust and the way PHP is used with it is much simpler (php-fpm instead of spawn-fcgi). The Wikimedia operations team is supposedly going to test nginx soon, so we will see whether the tide also goes that way.

Muir Woods has one tree – plural issues in MediaWiki

While I was having fun with the rest of the Wikimedia I18n team in San Francisco, a stream of plural related bug reports started coming in. The cause is that we have recently scrapped the custom plural rules in MediaWiki in favor of using plural rules from the CLDR database. A temporary fix has been applied to mitigate the reported issues.

The problem manifestation is pretty simple: in some languages in some contexts the message was always one something. For example the category page would say This category has one page regardless of how many pages there were in it. At first I was baffled. After all we had written unit tests for all languages in MediaWiki and they reported no regressions. It turns out we had ignored one particular set of languages: those which don’t always use plurals and had no plural rules defined in MediaWiki. The problems started when those languages used plural syntax even though they weren’t supposed to. When plural rules are not defined for a language in MediaWiki, it uses the plural rules as defined for the English language: 1 book, 2 books. In CLDR, however, some languages have been defined to not use any plural rules at all.

We could blame the translators for using plural syntax when it is not supported, or we could blame the CLDR for having no plural rules for languages which do use plurals in some cases. It is not that simple, however. The typical example is a language which doesn’t have distinct plural forms (like some words in English: 1 fish, 2 fish; but for all nouns), but does use plural quantifiers if the number is not present: one fish, many fish.

As a compromise I have proposed an extension to the plural syntax to allow specifying the output when the number is 0 or 1 regardless of the usual plural rules for that language. Let’s take a real example:

Accepted by {{PLURAL:$1|you|$1 users including you}}.

This works fine in English, because the first form is always for number 1. In Belarusian it doesn’t work, because the first form is used for number 1, but also for numbers 21, 31, 41 etc. It could be solved by the following syntax:

{{PLURAL:$1|1=you|$1 users including you}}.

The slightly confusing part here is that now the second form is actually the singular form. This is more evident in the imaginary Belarusian translation:

{{PLURAL:$1|1=you|one|few|many|other}}

“you” is used for number 1, “one” for 21, 31, 41 etc. but not 1, and the remaining forms as they usually are.

The explicit zero form (0=something) can also be useful for English and many other languages to have a different wording – something which is now usually done with separate messages.
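The proposed semantics can be sketched in a few lines of Python. This is an illustration of the idea only, not MediaWiki's parser: the function names are made up, and the Belarusian rule is written from the CLDR plural rules for whole numbers. Explicit forms such as `1=you` are checked first; only the remaining forms take part in the usual plural selection.

```python
def english_form_index(n):
    """Index of the regular form for English: singular for 1, plural otherwise."""
    return 0 if n == 1 else 1


def belarusian_form_index(n):
    """Index of the regular form (one, few, many) for a whole number in Belarusian."""
    if n % 10 == 1 and n % 100 != 11:
        return 0  # one: 1, 21, 31, ...
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return 1  # few: 2-4, 22-24, ...
    return 2      # many: 0, 5-20, 25-30, ...


def choose_plural(n, forms, form_index):
    """Pick a form, letting explicit 'N=text' forms override the plural rules."""
    regular = []
    for form in forms:
        number, separator, text = form.partition("=")
        if separator and number.isdigit():
            if int(number) == n:
                return text  # explicit form wins regardless of the language rules
        else:
            regular.append(form)
    if not regular:
        return ""
    # Fall back to the last regular form when fewer forms than categories are given.
    return regular[min(form_index(n), len(regular) - 1)]


forms = ["1=you", "one", "few", "many", "other"]
print([choose_plural(n, forms, belarusian_form_index) for n in (1, 21, 3, 5)])
```

With these forms, 1 selects the explicit form while 21 still selects the regular "one" form, which is exactly the distinction the plain syntax cannot express.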

The message used above is from the Translate extension. Unfortunately we cannot start using this syntax until we have dropped backwards compatibility with the last MediaWiki version not supporting this syntax, i.e. 1.20, which would be around when MediaWiki 1.22 is released. We are seriously considering backporting this functionality, but we also need to add support for the same syntax in JavaScript first.

During further testing we also found issues in Hebrew plural rules. The position of dual was changed and we didn’t notice it because the unit tests were wrong. This resulted in problems like the login page saying Remember my login for two days. It is a good reminder of how bugs in i18n can cause potentially severe issues.

Niklas in Muir Woods. Testing new counting methods? (Photo by Pau Giner.)

Efficient translation: Translation memory enabled on all Wikimedia wikis

I am pleased to announce that a long development project has been released and taken into production. We now have translation memory services enabled on Wikimedia projects (since August 28, in our last sprint).

The translation editor on Wikimania 2013 wiki shows a suggestion from Wikimania 2012 wiki

Users translating for Wikimania 2013 are provided with suggestions from 2012 (right arrow); a click is enough to copy it to the text area (down arrow). See also on Meta, in English interface.

Translation memory is a feature which provides likely translations for a text based on previous translations of similar texts: translators use them to speed up their work and to increase consistency (more in Wikipedia).

If you have translated at translatewiki.net or userbase.kde.org, you may have already noticed it. The translation memory on Wikimedia wikis has been filled with existing translations made with the Translate extension in WMF projects including Meta, mediawiki.org and Wikimania wikis.

Translators from all Wikimedia projects using the Translate extension can now work more efficiently, sharing their work and experience across the boundaries of wikis. Translators on Wikimania 2013 wiki can now find translations already provided for the previous year (see screenshot) and be quicker without sacrificing quality and consistency. Translators of technical documentation on mediawiki.org can benefit from the translation of Wikimedia terminology on Meta-Wiki and vice versa.

Technical challenges

A translation memory service has been in use at translatewiki.net for years, and the process of getting it enabled on Wikimedia was started about a year ago.

Naturally WMF operations is a very different thing from the small shared server translatewiki.net runs on. Yet, there were many unexpected turns that caused delay. The phases here are named retroactively.

Phase1

Originally we used the tmserver component from the Translate Toolkit. It had its own problems: it was hard to set up, it was an external dependency and the SQLite database engine it used was problematic for updates – it failed if there were multiple processes accessing it at the same time. Sometimes the included standalone webserver got stuck and the other option, WSGI, didn’t play nicely with our lighttpd web server.

I did lots of research with Siebrand trying to find other open source translation memories, but failed to find anything that had any active or recent development.

Phase2

The next step was the standalone version. To avoid external dependencies, to make it usable in the WMF infrastructure, and not to require separate services, I started porting the tmserver algorithm from Python to PHP. At the same time I was able to take advantage of MediaWiki’s database abstraction code, which in theory should make it work on SQLite, MySQL and PostgreSQL. At the moment, however, only MySQL is tested and in use at translatewiki.net.

Performance of this new system was mostly the same, though it’s a constant fight to keep the Levenshtein algorithm, used for ranking in the core, from getting exponentially slow. The major new feature was the support for shared databases, so that multiple wikis can use the translations made in other wikis for suggestions. A lot of time was spent on this, and also on making the initial bootstrap efficient with the use of multiple threads.
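For readers unfamiliar with this kind of ranking, here is a rough Python sketch of Levenshtein-based suggestions. It is not the tmserver or Translate code; the function names, the 0.75 threshold and the length pre-filter are assumptions chosen for the example. The pre-filter shows one common way to keep the cost down: candidates whose length differs too much can never reach the threshold, so the expensive comparison is skipped for them.

```python
def levenshtein(a, b):
    """Edit distance using the classic two-row dynamic programming table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def suggest(source, memory, threshold=0.75):
    """Return (score, translation) pairs for stored sources similar to `source`."""
    results = []
    for stored_source, translation in memory:
        longest = max(len(source), len(stored_source))
        if longest == 0:
            continue
        # Cheap pre-filter: a big length difference already rules out a good score.
        if abs(len(source) - len(stored_source)) / longest > 1 - threshold:
            continue
        score = 1 - levenshtein(source, stored_source) / longest
        if score >= threshold:
            results.append((score, translation))
    return sorted(results, reverse=True)
```

Each comparison costs time proportional to the product of the two string lengths, so long messages and large memories make any reduction in the number of candidates worthwhile.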

Phase3

When we thought everything was ready for deployment on Wikimedia wikis, we waited for feedback from ops and finally we got a simple, yet unwanted reply: “Full-text search with MySQL cannot be used in the WMF cluster (because it depends on the problematic MyISAM storage engine)”. Yay. Back to the drawing board.
Since everything at Wikimedia is using a heavily modified Apache Lucene for full text search, the same was obviously suggested as a solution. So started the development of phase3; if the past predicts anything, this will have been the final rewrite.

I decided not to touch Wikimedia’s version of Lucene, as I already had lots of experience with it from playing with it for my Master’s thesis (English summary on my blog), and decided to use standard Lucene with a Solr frontend. Solr simplified many things and the development was swift using the PHP Solarium library.

In fact, the most difficult “feature” to develop was the Puppet configuration for Jetty and Solr, and testing it on WMF Labs. So I learned to write Puppet configuration files from scratch and did it mostly myself. Oren Bochman helped a lot with the Labs testing phase. The last hurdle was backporting recent packages of Solr and its dependency Jetty for the Ubuntu that Wikimedia was using on Labs and in production. Fortunately I got quick help from ops, so I didn’t have to also learn how to make Ubuntu packages.

So somewhat ironically, we went from separate services to standalone and again to a separate service. The first phase is long forgotten, but the standalone and Solr versions complement each other. The former is enabled by default for anyone using the Translate extension, the latter provides superior scalability and hopefully in the future even better suggestions.

The fact is that Levenshtein-based ranking is not the state of the art for translation memories[1] and does not compare to the state-of-the-art i18n we are doing with MediaWiki and translatewiki.net.

On to the next adventure!

[1] Paper abstract (full text behind paywall; DOI:10.1007/3-540-39965-8_14).

Language validation in MediaWiki

Validating language codes like en or fi or chr might seem to be an easy task at first. You would expect this problem is already solved in MediaWiki, but that is far from the truth.

In fact, we are not even handling language codes, but language tags as defined by IETF. The linked standard brings together many standards, like the two and three letter language codes from ISO 639 standards, script names and region names, and more. This means that we have to handle language tags like pt-BR, sr-Latn and be-x-old. Also in the mix, of course, are invalid tags like de-formal and tokipona, and deprecated language codes like bat-smg (better: sgs).

The language tags are case insensitive, but there is preferred casing for different parts. MediaWiki has wfBCP47() which handles the “pretty-formatting”.
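As an illustration of what such pretty-formatting involves, here is a small Python sketch of the conventional casing: lowercase language, title-case four-letter script, uppercase two-letter region, and everything after a singleton such as x left in lowercase. The function name is hypothetical and this is a simplified approximation, not a port of wfBCP47().

```python
def format_language_tag(tag):
    """Apply conventional language tag casing, e.g. 'PT-br' becomes 'pt-BR'."""
    formatted = []
    after_singleton = False
    for index, part in enumerate(tag.lower().split("-")):
        if index > 0 and not after_singleton:
            if len(part) == 1:
                # A singleton such as 'x' starts a section that stays lowercase.
                after_singleton = True
            elif len(part) == 2:
                part = part.upper()       # region, e.g. BR
            elif len(part) == 4:
                part = part.capitalize()  # script, e.g. Latn
        formatted.append(part)
    return "-".join(formatted)
```

For example, `format_language_tag("SR-LATN")` gives `sr-Latn`, while `be-x-old` is left as it is.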

Let me list the language tag validation functions that already exist…

Language::isValidCode() – Contrary to its name, this function only checks that the language tag doesn’t contain certain characters which are not valid in page names or unsafe in HTML. Recently we had some issues with XSS exploits when code expected language codes to be HTML safe.
Language::isValidBuiltinCode() – This is slightly stricter: it only accepts language tags which consist of letters a-z, numbers 0-9 and hyphens.

…and what I think should exist – these will probably be implemented very soon:

Language::isKnownLanguageTag() – Checks that the language tag is known to MediaWiki. This basically means that we know the name of the language in English or in another language. Sources of known language codes are the built-in Names.php, the codes optionally added through the CLDR extension and the list of language names in English (pending merge).
Language::isSupportedLanguageTag() – Checks whether any localisation is available for that language tag in MediaWiki (MessagesXx.php exists).

I can also imagine a use case for:

Language::isWellFormedLanguageTag() – Checks whether the language tag is well formed. Like isKnownLanguageTag but less tight and more flexible. Would accept nonsense like fi-Cyrl-JA-x-foo, which is semantically meaningless but valid according to the rules.
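To make the difference between the strictness levels concrete, here is a hedged Python sketch of two of the checks. The function names are hypothetical and neither is MediaWiki code. The first mirrors the description of isValidBuiltinCode() above, assuming lowercase letters only. The second is a simplified well-formedness check based on the IETF language tag syntax; it leaves out extensions and grandfathered tags.

```python
import re

# Assumption: built-in codes are lowercase, as in 'pt-br' or 'be-x-old'.
BUILTIN_CODE = re.compile(r"[a-z0-9-]+")

WELL_FORMED = re.compile(
    r"""
    (?:[a-z]{2,3}(?:-[a-z]{3}){0,3}|[a-z]{4,8})  # language, optional extended parts
    (?:-[a-z]{4})?                               # optional script, e.g. Cyrl
    (?:-(?:[a-z]{2}|[0-9]{3}))?                  # optional region, e.g. BR or 419
    (?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*     # optional variants
    (?:-x(?:-[a-z0-9]{1,8})+)?                   # optional private-use part
    """,
    re.VERBOSE | re.IGNORECASE,
)


def looks_like_builtin_code(tag):
    """True if the tag consists only of a-z, 0-9 and hyphens."""
    return BUILTIN_CODE.fullmatch(tag) is not None


def is_well_formed_language_tag(tag):
    """True if the tag follows the (simplified) language tag syntax."""
    return WELL_FORMED.fullmatch(tag) is not None
```

Under this sketch fi-Cyrl-JA-x-foo is well formed, as are de-formal and bat-smg: well-formedness says nothing about whether the parts are registered or make sense together, which is what a "known" or "supported" check would add.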

Wikimania videos: the next billion users on Wikipedia and beyond

Wikimedia DC has started publishing the Wikimania videos on YouTube. They are not split by presentation, only by track, but here are some about localisation and internationalisation.

My Wikimania presentation (see my previous post), Translating the wiki way (starts at 28:05; watch on YouTube):

Amir’s Supporting languages, all of them and Siebrand’s A Tale of Language Support and Ask the Language Support People (watch on YouTube):

Santhosh’s Read and Write in your language has not been published yet and nobody seems to know if it will, or if it has been recorded at all.

Alolita’s The next billion users on Wikipedia with Open Source Webfonts and Amir’s The software localization paradox (watch on YouTube):

See also the category on Wikimania wiki for abstracts and slides for these presentations.

My presentations at Akademy and Wikimania

In July I gave two presentations: one at Akademy 2012 in Tallinn, and one at Wikimania 2012.

Short summary of my Akademy presentation (slides): If you are translating content in MediaWiki and you are not using the Translate extension, you are doing it wrong. Statistics, translation and proofreading interface – you get them all with Translate. Because Translate keeps track of changes to pages, you can spend your time translating instead of trying to figure out what needs translating or updating.

Also, have a look at UserBase: it has now been updated to include the latest features and fixes of the Translate extension, like the ability to group translatable pages into larger groups.

Akademy presenation by Niklas and Claus: click for video. Yes, there’s a a typo.

Short summary of my Wikimania presentation (slides; video not yet available): Stop wasting translators’ time.
Forget signing up to e-mail lists, forget sending files back and forth. Use translation platforms that move files from and to the version control system transparently to the translator.
If you have sentences split into multiple messages, you are doing it wrong. If your i18n framework doesn’t have support for plural, gender and grammar dependent translations, you are doing it wrong. If you are not documenting your interface messages for translators, you are doing it wrong.

Niklas maybe having fun at Library of Congress. Photo tychay, CC-BY-NC-ND

Translation sprint for KDE in Finnish

On our sprint website we’re translating the upcoming KDE SC 4.9 release into Finnish. If you know Finnish, you only have to register to start translating: please join us!
We have a simple goal: translate 10,000 new messages and have all the changes proofread and accepted. In two weeks we have translated more than 3,000 messages and the majority of them have been proofread and accepted. We still have about three weeks to go, so your help is needed to increase the output to reach the goal of 10,000 new translations. As a secondary activity we are also proofreading the existing translations and discussing and harmonizing the terminology. For example, should filter be suodin or suodatin?

Keep reading if you are interested in how we organized the sprint from a technical perspective.

This is the second translation sprint I’m organizing with the Translate extension. The first one was in March, when we translated GNOME 3.4 into Finnish, and this time we are translating KDE 4.9 into Finnish. I can say that the Translate extension fits this purpose pretty well:

You can set up everything in a few hours.
There are minimal barriers to start using it (we do require registration).
It is suitable for novice translators, because they get feedback when other people proofread and correct their translations.

It is not without its issues either, but I see this as a great opportunity to make the MediaWiki Translate extension even better and have it support a variety of use cases. Let me describe some.

Bugs. There are always some bugs. This time I found a regression in the workflow states feature where the recent changes weren’t backwards compatible with the old configuration format. That was quickly fixed and I also submitted fixes for a few minor issues, which were not encountered before. All in all I have 7 local patches, mostly small behaviour changes like the formatting of message keys or showing the message context field to translators. Most of those can be cleaned up and submitted for merging.

Scalability. For a long time I had the impression that the Translate extension scales up pretty well. After all we have thousands of message groups and 50k messages translated into hundreds of languages at translatewiki.net. How naive I was. All of KDE as we use it (stable and trunk branches merged; including playground and extragear, calligra and other related stuff) contains 200k messages. It turns out that our import tools choke when you try to feed them 350k new messages at once (this includes Finnish translations). As a workaround I had to limit the number of messages that are processed at once and iterate over the whole process multiple times. This is where the bulk of my time was spent. Of course I also ran out of disk space in the middle of the import. It takes about 1G of space, but currently I have only a tiny 10G disk on the server.

Search. The most requested feature is better search. Currently it is not possible to limit the search to a message group nor to see the translation when searching source texts, or the source text when searching for translations. Also it takes a few clicks before you can edit the message from the search results. Building a good search backend is currently on the backlog of the Wikimedia Localisation team, but it is not yet scheduled for any sprint.
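The batching workaround mentioned under Scalability can be sketched roughly like this in Python. The actual import tools are PHP scripts in the Translate extension; the function names and the batch size here are made up for the illustration. The idea is simply to never hand the importer more than a fixed number of messages at a time.

```python
from itertools import islice


def batches(items, size):
    """Yield lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def import_in_batches(messages, import_batch, size=5000):
    """Feed messages to `import_batch` chunk by chunk and return the total imported."""
    total = 0
    for chunk in batches(messages, size):
        import_batch(chunk)  # only `size` messages are held by the importer at once
        total += len(chunk)
    return total
```

Because `batches` works on any iterable, the messages can be read lazily from disk instead of being loaded into memory up front.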

Stay tuned for the results of the KDE Finnish translation sprint.
