New language stuff for developers and users

MLEB 2013.01 has been released by Amir Aharoni. Lots of development has been happening in Translate due to the work on the new translation interfaces. If you are a developer, please also check out the latest new and changed Web APIs and give us a shout in #mediawiki-i18n on freenode if you see something obviously wrong or missing. Also included in this release are bug fixes for the Universal Language Selector, while the other included extensions didn't see many changes.

Some months ago I wrote about Language tag validation in MediaWiki. A nice person named Siebrand Mazeland decided to improve the situation. As of now we have three new methods developed by the Wikimedia Language engineering team:

  • isSupportedLanguage
  • isWellFormedLanguageTag
  • isKnownLanguageTag

Unless these methods are backported to MediaWiki 1.19 and 1.20, it will take a while before they are used in extensions, but eventually we should see faster and more readable code.
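To illustrate, extension code could use them roughly like this once they are available (a sketch with a hypothetical helper function, not code from MediaWiki itself):

// Sketch: pick a safe interface language from a user-supplied tag.
// pickInterfaceLanguage() is hypothetical; the Language:: methods are the new ones listed above.
function pickInterfaceLanguage( $tag ) {
	if ( !Language::isWellFormedLanguageTag( $tag ) ) {
		return 'en'; // syntactically broken tag, fall back
	}
	if ( Language::isSupportedLanguage( $tag ) ) {
		return $tag; // MediaWiki ships a localisation for this tag
	}
	// Known (it has a name we can show) but not localised: still usable for labelling.
	return Language::isKnownLanguageTag( $tag ) ? $tag : 'en';
}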

How I debug performance issues in MediaWiki

The earlier post does not describe how I usually work on performance improvements. Usually it starts with one of the less innocent-looking messages from our IRC bot rakkaus, which relays PHP error messages to the IRC channel. An example:

[01-Nov-2012 20:16:25 UTC] PHP Fatal error: Maximum execution time of 30 seconds exceeded in /www/translatewiki.net/w/extensions/Translate/ttmserver/TTMServer.php on line 100

After this I have to match the timestamp against our webserver access log and see whether I can reproduce the issue by loading the same URL. PHP is very unhelpful in this regard: fatal errors provide neither the request URL nor a stack trace. Sometimes the culprit is a command-line script, like the job runner initiated via cron; for those cases I've implemented simple logging of all maintenance script executions, but they are still annoying to debug. Once I am able to reproduce the issue in the production environment, I try to reproduce it in my development environment as well. Oh boy, it is fun when that is not possible. If I can, however, I usually start by looking at the per-request profiling included in the page source, with output like this:

0.0558 8.5M Connected to database 0 at localhost
0.0562 8.5M Query sandwiki (14) (slave): SELECT /* SqlBagOStuff::getMulti Nike */ keyname,value,exptime FROM `bw_objectcache` WHERE keyname = 'sw:messages:fi'

Here we see that it takes 56 milliseconds before MediaWiki even connects to the database, and that the first thing it does is load messages for the current user language. What usually follows is old-style debugging where I add echo and var_dump statements until I understand what is happening and what is inefficient. After that, the creative phase begins: finding a way to make it faster. Usually there is some sort of bug in the code that causes it to do unnecessary work. Rarely is the bad performance actually caused by slow algorithms. This kind of makes sense: the datasets we process are usually small, and when they are bigger, the code handling them is usually written efficiently in the first place.
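For reference, the kind of per-request debug output shown above can be produced on a development wiki with settings roughly like the following (my guess at the relevant switches; the post itself doesn't name them):

// In LocalSettings.php of a development wiki – assumed settings, adjust to taste.
$wgDebugComments   = true; // append the debug log to the rendered page as an HTML comment
$wgDebugDumpSql    = true; // include SQL queries in the debug log
$wgDebugTimestamps = true; // prefix each line with elapsed time and memory usage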

I love performance tuning, but I have to be prudent about picking the right things to optimize, because it is also a great time sink, and as a busy person I am entitled to only a few time sinks at a time.

MediaWiki Language Extension Bundle 2012.12

MediaWiki Language Extension Bundle 2012.12 was released just before Christmas. It is compatible with MediaWiki versions 1.20 and 1.21alpha. Downloads and installation instructions can be found at https://www.mediawiki.org/wiki/MLEB. Announcements of new releases will be posted to the mediawiki-i18n mailing list.

Here are the highlights:

cldr

  • English name for Azerbaijani (arz) was added.
  • A bug that caused local names for be-tarask not to be used was fixed.
  • Translations for be-tarask were updated.

Translate

Lots of development is ongoing in the translation user interface redesign project conducted by the WMF Language Engineering Team. The new message list and translation editor (pictured) are in alpha stage, but interested users can activate them with the URL parameter tux=1 on Special:Translate; tux=0 brings back the old interface.

$wgTranslateAC and $wgTranslateEC were removed. If you were still using these, switch over to the TranslatePostInitGroups hook or $wgTranslateCC.
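For anyone migrating, hook-based registration looks roughly like this (a sketch only: the group construction is elided, and the exact hook signature may differ between Translate versions):

// Register message groups at init time instead of through the removed
// $wgTranslateAC / $wgTranslateEC globals.
$wgHooks['TranslatePostInitGroups'][] = function ( array &$groups ) {
	// Build your MessageGroup object as before and register it by id:
	// $groups['my-group-id'] = $myGroup;
	return true;
};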

The bundled Solarium library was removed. Install it manually or use the MediaWiki Solarium extension.

Sneak peek from the new translation UX: the new group selector (top) and part of the revamped translation editor.

Other noteworthy changes in Translate:

  • The ApiQueryMessageGroups module has lots of new functionality.
  • There is a new ApiQueryLanguageStats module.
  • (bug 39761) The edit counts on Special:TranslationStats now also include reviews.
  • GettextFFS: empty but existing msgctxt entries are now handled properly.
  • New hook: TranslateSupportedLanguages.

Universal Language Selector

  • Fixed a display issue in the Modern skin.
  • (bug 42382) Indicate context in input settings/more languages.

MediaWiki Language Extension Bundle launches!

The Wikimedia Language Engineering team is pleased to announce the first release of the MediaWiki Language Extension Bundle. The bundle is a collection of a few selected MediaWiki extensions needed by any wiki which desires to be multilingual.
This first bundle release (2012.11) is compatible with MediaWiki 1.19, 1.20 and 1.21alpha.
The Universal Language Selector is a must-have, because it provides essential functionality for every user regardless of the number of languages they speak: language selection, font support for displaying scripts poorly supported by operating systems, and input methods for typing in languages that don't use the Latin (a-z) alphabet.
Maintaining multilingual content in a wiki is a mess without the Translate extension, which is used by Wikimedia, KDE and translatewiki.net, where hundreds of pieces of documentation and interface translations are updated every day; with Localisation Update your users will always have the latest translations fresh out of the oven. The Clean Changes extension keeps your recent changes page free of clutter from translation activity and other distractions.
Don't miss the chance to practice your rusty language skills: use the Babel extension to mark the languages you speak and to find other speakers of the same languages in your wiki. And finally, the cldr extension is a database of language and country name translations.
We are aiming to make a new release every month, so that you can easily stay on the cutting edge of the constantly improving language support. The bundle comes with clear installation and upgrade instructions. It is tested against MediaWiki release versions, so you can avoid most of the temporary breakage that would happen if you were using the latest development versions instead.
Because this is our first release, there may be some rough edges. Please give us plenty of feedback so that we can improve the next release.

Performance tuning translatewiki.net

One of the biggest advantages of desktop translation tools is that they don't have delays rendering the interface – at least not on the scale that websites do. At translatewiki.net it is crucial that our pages load very fast. In certain places we can and do use intelligent preloading to remove the delays; in other places we have to employ complex caching algorithms to reach that target. I regularly monitor the automatically collected profiling information to avoid regressions and to pick low-hanging fruit from time to time.

In the last sprint my main task was to convert the way we handle the translation of MediaWiki extensions at translatewiki.net to use the same processes and interfaces as pretty much everything else. MediaWiki and MediaWiki extensions were the first things supported at translatewiki.net, and now they are among the last to be modernized to take advantage of better interfaces built on years of experience supporting various kinds of products.

The only user-visible change is improved performance. The new interfaces are more efficient and enable more optimizations, which allows us to deliver faster page views and scale to more messages. It will also simplify the work of the translatewiki.net staff, as they won't need to follow two different processes, especially once we also update the MediaWiki translation code.

Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.

As a developer I’m proud that the new code is unit tested. The culmination, however, was a change which removed hundreds of lines of old code: in fact, the above quote applies to software development too.

For those interested in the details, the biggest performance boosts were achieved by avoiding the need to parse the translation files in many places – the list of message keys and their values is stored in intermediate cache files in CDB format. In addition there were many smaller optimizations, like not using a MediaWiki method to construct a link element that consumed 20 kilobytes of memory for each link. When there are thousands of links, that adds up and is excessive for producing a few hundred bytes of output. I switched to a more low-level method (memory usage dropped from 175 to 12 MB).
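To illustrate the idea behind the CDB caches (a minimal sketch using PHP's dba extension with its cdb handlers, assuming PHP is built with them – not the actual translatewiki.net code):

// Build a CDB cache file once, after parsing a translation file...
$writer = dba_open( '/tmp/messages-fi.cdb', 'n', 'cdb_make' );
dba_insert( 'mainpage', 'Etusivu', $writer );
dba_insert( 'search', 'Haku', $writer );
dba_close( $writer );

// ...and later look up individual keys without re-parsing anything.
$reader = dba_open( '/tmp/messages-fi.cdb', 'r', 'cdb' );
echo dba_fetch( 'mainpage', $reader ); // Etusivu
dba_close( $reader );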

Some low-hanging fruit might not be as easy to pick as it seems at first. (Photo CC-BY-SA by Asit K. Ghosh.)

At the time of writing I still have some more fixes pending further testing and cleanup. For example, to access any single message group, all of them currently have to be loaded. They are cached as serialized PHP objects, but loading them takes 20 milliseconds and 10 megabytes of memory. I'm working on making it possible to load cached message groups individually.

The website anyone can translate

Translatewiki.net has started using Puppet. Puppet is a tool designed to manage the configuration of servers. Like Wikimedia's, our configuration is public and stored in the translatewiki.net git repository, where anyone can submit patches. I don't expect a flood of them coming in anytime soon; my motivations for this were different. If you remember, some months back I had to learn some Puppet to write the Solr configuration for the Wikimedia deployment. Now I wanted to learn more and gather more experience with Puppet. It will also greatly help if we ever need to reinstall the translatewiki.net server from scratch (which is quite likely to happen soon). As a bonus it gives transparency, and something I can refer people to when they ask how some particular thing is done at translatewiki.net. As time permits, I will be moving more configuration to Puppet.

Mitä isot edellä, sitä pienet perässä. (The Internet suggests the closest translation is "Monkey see, monkey do.")

I also added the translatewiki.net repository to Ohloh. If you use translatewiki.net as a localisation platform, feel free to add it to your stacks by clicking "I use this", or to embed its widgets in your website. Ohloh also gives some cool stats.

Together with the introduction of Puppet, I also switched the webserver of translatewiki.net from lighttpd to nginx. The biggest reason for this is that HTTPS was broken for Google Chrome users, but in general nginx feels faster and more robust, and the way PHP is used with it is much simpler (php-fpm instead of spawn-fcgi). The Wikimedia operations team is supposedly going to test nginx soon, so we will see whether the tide also goes that way.

Muir Woods has one tree – plural issues in MediaWiki

While I was having fun with the rest of the Wikimedia I18n team in San Francisco, a stream of plural-related bug reports started coming in. The cause: we had recently scrapped the custom plural rules in MediaWiki in favor of the plural rules from the CLDR database. A temporary fix has been applied to mitigate the reported issues.

The problem manifests itself pretty simply: in some languages, in some contexts, the message always said one of something. For example the category page would say This category has one page regardless of how many pages there were in it. At first I was baffled. After all, we had written unit tests for all languages in MediaWiki and they reported no regressions. It turns out we had ignored one particular set of languages: those which don't always use plurals and had no plural rules defined in MediaWiki. The problems started when those languages used plural syntax even though they weren't supposed to. When plural rules are not defined for a language, it falls back to the plural rules of English: 1 book, 2 books. In CLDR, however, some languages are defined as not using any plural rules at all.

We could blame the translators for using plural syntax where it is not supported, or we could blame CLDR for having no plural rules for languages which do use plurals in some cases. It is not that simple, however. The typical example is a language which doesn't have distinct plural forms (like some words in English: 1 fish, 2 fish – but for all nouns), but does use plural quantifiers when the number is not present: one fish, many fish.

As a compromise I have proposed an extension to the plural syntax to allow specifying the output when the number is 0 or 1 regardless of the usual plural rules for that language. Let’s take a real example:

Accepted by {{PLURAL:$1|you|$1 users including you}}.

This works fine in English, because the first form is always for number 1. In Belarusian it doesn’t work, because the first form is used for number 1, but also for numbers 21, 31, 41 etc. It could be solved by the following syntax:

{{PLURAL:$1|1=you|$1 users including you}}.

The slightly confusing part here is that now the second form is actually the singular form. This is more evident in the imaginary Belarusian translation:

{{PLURAL:$1|1=you|one|few|many|other}}

"you" is used for number 1, “one" for 21, 31, 41 but not 1, and the remaining forms as they usually are.

The explicit zero form (0=something) can also be useful in English and many other languages for using a different wording when there is nothing to count – for example {{PLURAL:$1|0=This category is empty|This category has one page|This category has $1 pages}} – something which is now usually done with separate messages.

The message used above is from the Translate extension. Unfortunately we cannot start using this syntax until we have dropped backwards compatibility with the last MediaWiki version not supporting it, i.e. 1.20, which will be around the time MediaWiki 1.22 is released. We are seriously considering backporting this functionality, but we also need to add support for the same syntax in JavaScript first.

During further testing we also found issues in the Hebrew plural rules. The position of the dual form had changed and we didn't notice, because the unit tests were wrong. This resulted in problems like the login page saying Remember my login for two days. It is a good reminder of how bugs in i18n can cause potentially severe issues.

Niklas in Muir Woods. Testing new counting methods? (Photo by Pau Giner.)

Finnish translation sprint 2012-06 KDE – results

English summary: Using the MediaWiki Translate extension, the KDE SC 4.9 release was collaboratively translated into Finnish over midsummer. The goals of the translation sprint were to produce 10,000 new proofread translations and to unify the translations of many common terms. The goals were mostly met. Because of problems in counting the number of new translations, we had to change the measure to also include updated translations. We made 10,643 new or updated translations, and over 8,250 translations were proofread. I believe the combined effort makes a visible difference, though in absolute terms there is enough work left for dozens more sprints.


In June and July a KDE translation sprint was held, in which everything in KDE was translated, but especially the KDE SC 4.9 release, of which we got all the most important parts done. Other KDE programs were translated into Finnish as well, and existing translations were improved. Particular progress was made in translating the KDE websites: previously only part of the messages shared by the KDE sites had been translated, but now the whole Join the Game site is in Finnish and the translation of other sites has been started.

Another important achievement was improving the quality of the translations by proofreading the messages of the most important and most visible KDE programs, unifying the terminology in use and going through common weaknesses. Some participants were also inspired to fix things that had bothered them for a long time but had simply never gotten done before. The IRC channel became more active too, and discussion about the translations emerged.

We did, however, fall short of our original goal of 10,000 new translations: there were 6,315 completely new ones. That number is not entirely accurate, because it does not include fuzzy messages. Fuzzy messages contain a pre-filled translation that may either need a little polishing or be completely wrong. A better number is obtained by also counting improved translations, which gives 10,643 new or improved translations. Proofreading was not quite as popular, but still over 8,250 translations were proofread.

Many larger wholes were also left unfinished, sometimes only just short of completion. The unification of terminology remained unfinished as well, although it was only a secondary goal and admittedly too big a chunk. Despite that, several terms were unified – the workload is simply very large and too big to be done in one go.

Proofreading translations with Translate really worked, especially compared to the current practice, where translators e-mail entire Gettext po files around and not even the mailing list meant for quality control is in use. The basic functionality of the tool was in good shape: hardly anything got in the way of translating. Some slight slowness was noticeable, though, caused mainly by the translation memory feature.

We got 7–8 new people involved, a few of whom made dozens of translations and the rest clearly more. All the technical problems that came up, from scaling issues on the statistics page to small fixes in exporting translations to po files, were resolved well.

It is still an open question how to get more people involved. We also need to think about how the tool's imports and exports relate to changes made directly in SVN before the tool could be adopted alongside the current practice of e-mailing po files around. Exports cannot be made automatically, however, because the usage rules of KDE's SVN accounts prevent it.

In the future we will consider more permanent use of the tool. The biggest problem is bringing lokalisointi.org back to life and connecting the terminology hosted there to the translation platform.

I want to thank all the participants for their translation work and for their feedback on how the platform worked. Special thanks to Lasse Liehu for compiling the first draft of this summary.

Efficient translation: Translation memory enabled on all Wikimedia wikis

I am pleased to announce that a long-running development project has been released and taken into production. We now have translation memory services enabled on Wikimedia projects (since August 28, in our last sprint).

Users translating for Wikimania 2013 are provided with suggestions from 2012 (right arrow); a click is enough to copy the suggestion to the text area (down arrow). See it also on Meta, with the English interface.

Translation memory is a feature which provides likely translations for a text based on previous translations of similar texts: translators use them to speed up their work and to increase consistency (more on Wikipedia).

If you have translated at translatewiki.net or usebase.kde.org, you may have already noticed it. The translation memory on Wikimedia wikis has been filled with existing translations made with the Translate extension in WMF projects including Meta, mediawiki.org and Wikimania wikis.

Translators from all Wikimedia projects using the Translate extension can now work more efficiently, sharing their work and experience across the boundaries of wikis. Translators on Wikimania 2013 wiki can now find translations already provided for the previous year (see screenshot) and be quicker without sacrificing quality and consistency. Translators of technical documentation on mediawiki.org can benefit from the translation of Wikimedia terminology on Meta-Wiki and vice versa.

Technical challenges

A translation memory service has been in use at translatewiki.net for years, and the process of getting it enabled on Wikimedia was started about a year ago.

Naturally WMF operations is a very different thing from the small shared server translatewiki.net runs on. Yet, there were many unexpected turns that caused delay. The phases here are named retroactively.

Phase1

Originally we used the tmserver component from the Translate Toolkit. It had its own problems: it was hard to set up, it was an external dependency, and the SQLite database engine it used was problematic for updates – it failed if multiple processes accessed it at the same time. Sometimes the included standalone webserver got stuck, and the other option, WSGI, didn't play nicely with our lighttpd web server.

I did lots of research with Siebrand trying to find other open source translation memories, but failed to find anything that had any active or recent development.

Phase2

The next step was the standalone version. To avoid external dependencies, to make it usable in the WMF infrastructure, and not to require separate services, I started porting the tmserver algorithm from Python to PHP. At the same time I was able to take advantage of MediaWiki’s database abstraction code, which in theory should make it work on SQLite, MySQL and PostgreSQL. At the moment, however, only MySQL is tested and in use at translatewiki.net.

Performance of this new system was mostly the same, though it's a constant fight to keep the Levenshtein algorithm, used for ranking at its core, from getting prohibitively slow. The major new feature was support for shared databases, so that multiple wikis can use the translations made on other wikis as suggestions. A lot of time was spent on this, and also on making the initial bootstrap efficient with the use of multiple threads.
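For illustration, the core of a Levenshtein-based ranking looks roughly like this (a simplified sketch, not the actual TTMServer code):

// Rank stored source strings by edit distance to the query and keep only close matches.
function rankSuggestions( $query, array $sources, $threshold = 0.6 ) {
	$results = array();
	foreach ( $sources as $id => $text ) {
		$longest = max( strlen( $query ), strlen( $text ) );
		if ( $longest === 0 ) {
			continue;
		}
		// PHP's built-in levenshtein() is O(n*m) and limited to 255 characters per string.
		$distance = levenshtein( $query, $text );
		$quality = 1 - $distance / $longest;
		if ( $quality >= $threshold ) {
			$results[$id] = $quality;
		}
	}
	arsort( $results ); // best matches first
	return $results;
}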

Phase3

When we thought everything was ready for deployment on Wikimedia wikis, we waited for feedback from ops and finally we got a simple, yet unwanted reply: “Full-text search with MySQL cannot be used in the WMF cluster (because it depends on the problematic MyISAM storage engine)”. Yay. Back to the drawing board.
Since everything at Wikimedia is using a heavily modified Apache Lucene for full text search, the same was obviously suggested as a solution. So started the development of phase3; if the past predicts anything, this will have been the final rewrite.

I decided not to touch Wikimedia's version of Lucene, as I already had plenty of experience with it from playing with it for my Master's thesis (English summary on my blog), and decided to use standard Lucene with a Solr frontend. Solr simplified many things, and development was swift using the PHP Solarium library.
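For readers who haven't used Solarium, querying a Solr index with it looks roughly like this (generic Solarium 3.x-style usage, not code from the extension; the endpoint details and field names are made up):

require 'vendor/autoload.php';

$client = new Solarium\Client( array(
	'endpoint' => array(
		'local' => array( 'host' => '127.0.0.1', 'port' => 8983, 'path' => '/solr' ),
	),
) );

$query = $client->createSelect();        // build a select (search) query
$query->setQuery( 'text:"translation memory"' );
$query->setRows( 5 );                    // only fetch the top suggestions

$result = $client->select( $query );     // execute against the Solr endpoint
echo $result->getNumFound(), " matches\n";
foreach ( $result as $document ) {
	echo $document->id, "\n";
}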

In fact, the most difficult "feature" to develop was the Puppet configuration for Jetty and Solr, and testing it on WMF Labs. So I learned to write Puppet configuration files from scratch and did it mostly by myself. Oren Bochman helped a lot with the Labs testing phase. The last hurdle was backporting recent packages of Solr and its dependency Jetty for the Ubuntu version that Wikimedia was using on Labs and in production. Luckily I got quick help from ops, so I didn't also have to learn how to make Ubuntu packages.

So, somewhat ironically, we went from a separate service to standalone and back to a separate service again. The first phase is long forgotten, but the standalone and Solr versions complement each other: the former is enabled by default for anyone using the Translate extension, while the latter provides superior scalability and, hopefully, in the future even better suggestions.

The fact is that Levenshtein-based ranking is not the state of the art for translation memories[1] and does not compare to the state-of-the-art i18n work we are doing with MediaWiki and translatewiki.net.

On to the next adventure!

[1] Paper abstract (full text behind paywall; DOI:10.1007/3-540-39965-8_14).

Language validation in MediaWiki

Validating language codes like en, fi or chr might seem like an easy task at first. You would expect this problem to be already solved in MediaWiki, but that is far from the truth.

In fact, we are not even handling language codes, but language tags as defined by the IETF. That standard (BCP 47) brings together many other standards, like the two- and three-letter language codes from the ISO 639 standards, script names, region names and more. This means that we have to handle language tags like pt-BR, sr-Latn and be-x-old, and of course in the mix are invalid tags like de-formal and tokipona, and deprecated language codes like bat-smg (better: sgs).

Language tags are case-insensitive, but there is a preferred casing for the different parts. MediaWiki has wfBCP47(), which handles the pretty-formatting.

Let me list the language tag validation functions that already exist…

  • Language::isValidCode() – Contrary to its name, this function only checks that the language tag doesn't contain certain characters which are invalid in page names or unsafe in HTML. Recently we had some issues with XSS exploits when code expected language codes to be HTML-safe.
  • Language::isValidBuiltinCode() – This is slightly stricter: it only accepts language tags consisting of the letters a-z, the numbers 0-9 and hyphens.
…and what I think should exist – these will probably be implemented very soon (a small behaviour sketch follows the list):
  • Language::isKnownLanguageTag() – Checks that the language tag is known to MediaWiki. This basically means that we know the name of the language in English or in another language. Sources of known language codes are the built-in Names.php, the codes optionally added through the CLDR extension and the list of language names in English (pending merge).
  • Language::isSupportedLanguageTag() – Checks whether any localisation is available for that language tag in MediaWiki (MessagesXx.php exists).
I can also imagine a use case for:
  • Language::isWellFormedLanguageTag() – Checks whether the language tag is well formed. Like isKnownLanguageTag, but less tight and more flexible. Would accept nonsense like fi-Cyrl-JA-x-foo that semantically makes no sense but is valid according to the rules.
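To make the distinctions concrete, here is how I would expect these checks to behave once the proposed ones exist (a sketch based on the descriptions above, not tested code):

// Existing checks:
Language::isValidCode( 'de-formal' );        // true – only rejects characters unsafe in titles/HTML
Language::isValidBuiltinCode( 'pt-br' );     // true – lowercase letters, digits and hyphens only

// Proposed checks:
Language::isKnownLanguageTag( 'tokipona' );  // false – no known language name for it
Language::isSupportedLanguageTag( 'fi' );    // true – MessagesFi.php exists
Language::isWellFormedLanguageTag( 'fi-Cyrl-JA-x-foo' ); // true – nonsense, but syntactically valid

// And the pretty-formatter mentioned above:
wfBCP47( 'sr-latn' );                        // 'sr-Latn' – preferred casing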