Category Archives: MediaWiki

Muir Woods has one tree – plural issues in MediaWiki

While I was having fun with the rest of the Wikimedia I18n team in San Francisco, a stream of plural-related bug reports started coming in. The cause is that we recently scrapped the custom plural rules in MediaWiki in favor of using the plural rules from the CLDR database. A temporary fix has been applied to mitigate the reported issues.

The problem manifested itself in a pretty simple way: in some languages, in some contexts, the message always said one of something. For example the category page would say "This category has one page" regardless of how many pages there were in it. At first I was baffled. After all, we had written unit tests for all languages in MediaWiki and they reported no regressions. It turned out we had ignored one particular set of languages: those which don't always use plurals and had no plural rules defined in MediaWiki. The problems started when those languages used plural syntax even though they weren't supposed to. When plural rules are not defined for a language, it falls back to the plural rules defined for English: 1 book, 2 books. In CLDR, however, some languages are defined as not using any plural rules at all.

We could blame the translators for using plural syntax where it is not supported, or we could blame CLDR for having no plural rules for languages which do use plurals in some cases. It is not that simple, however. The typical example is a language which doesn't have distinct plural forms (like some words in English: 1 fish, 2 fish; but for all nouns), yet does use plural quantifiers when the number is not present: one fish, many fish.

As a compromise I have proposed an extension to the plural syntax to allow specifying the output when the number is 0 or 1 regardless of the usual plural rules for that language. Let’s take a real example:

Accepted by {{PLURAL:$1|you|$1 users including you}}.

This works fine in English, because the first form is always used for the number 1. In Belarusian it doesn't work, because the first form is used for the number 1, but also for 21, 31, 41 and so on. It could be solved with the following syntax:

{{PLURAL:$1|1=you|$1 users including you}}.

The slightly confusing part here is that the second form is now actually the singular form. This is more evident in an imaginary Belarusian translation:

{{PLURAL:$1|1=you|one|few|many|other}}

"you" is used for number 1, “one" for 21, 31, 41 but not 1, and the remaining forms as they usually are.

The explicit zero form (0=something) can also be useful for English and many other languages to have a different wording – something which is now usually done with separate messages.
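
For instance, a message taking advantage of this could be written roughly as follows (an invented example, not an actual MediaWiki message):

This category contains {{PLURAL:$1|0=no pages|one page|$1 pages}}.

Here the 0= form would be used only when $1 is exactly zero, while the remaining forms would follow the language's normal plural rules.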

The "Accepted by" message above is from the Translate extension. Unfortunately we cannot start using this syntax until we have dropped backwards compatibility with the last MediaWiki version that does not support it, i.e. 1.20, which will be around when MediaWiki 1.22 is released. We are seriously considering backporting this functionality, but we also need to add support for the same syntax in JavaScript first.

During further testing we also found issues in the Hebrew plural rules. The position of the dual form had been changed and we didn't notice it, because the unit tests were wrong. This resulted in problems like the login page saying "Remember my login for two days". It is a useful reminder of how i18n bugs can cause potentially severe issues.

Niklas in Muir Woods. Testing new counting methods? (Photo by Pau Giner.)

Finnish translation sprint 2012-06 KDE – results

English summary: Using the MediaWiki Translate extension, the KDE SC 4.9 release was collaboratively translated into Finnish around midsummer. The goals of the translation sprint were to produce 10,000 new proofread translations and to unify the translations of many common terms. The goals were mostly met. Due to problems in counting the number of new translations, we had to change the measure to also include updated translations. We made 10,643 new or updated translations and over 8,250 translations were proofread. I believe the combined effort makes a visible difference, though in absolute terms there is enough work left for tens of sprints more.


In June and July we held a KDE translation event, in which everything in KDE was translated, but especially the KDE SC 4.9 release, for which we got all the most important parts done. Other KDE programs were translated into Finnish as well and existing translations were improved. We made particular progress in translating the KDE websites: previously only part of the messages shared by KDE's sites had been translated, but now the whole Join the Game is in Finnish and translation of other sites has been started.

Another important achievement was improving the quality of the translations: by proofreading the messages of the most important and most visible KDE programs, by unifying the terms used and by going through common weaknesses. Some participants were also inspired to fix things that had bothered them for a long time but had simply never been dealt with before. The IRC channel became more active too, and discussion about translations emerged.

We did, however, fall short of our original goal of 10,000 new translations: there were 6,315 completely new ones. That figure is not entirely accurate, since it does not include fuzzy messages. Fuzzy messages contain a pre-filled translation which may either need a small touch-up or be completely wrong. A better figure is obtained by also counting improved translations, which gives 10,643 new or improved translations. Proofreading was not quite as popular, but still over 8,250 translations were proofread.

Many larger modules were also left unfinished, sometimes only just short of completion. The unification of terminology remained unfinished as well, even though it was only a secondary goal and known to be too big a chunk. Nevertheless several terms were unified; the workload is simply very large and too much to handle in one go.

Proofreading translations with Translate really worked, especially compared to the current practice where translators e-mail whole Gettext po files back and forth and not even the mailing list intended for quality control is in use. The basic functionality of the tool was in order: hardly anything got in the way of translating. Some slight slowness was noticeable, though, mainly caused by the translation memory feature.

We got 7–8 new people involved, a few of whom made tens of translations and the rest clearly more. All the technical problems we ran into, from scalability problems on the statistics page to small fixes in exporting translations to po files, were resolved well.

We are still pondering how to get more people involved. We also need to think about how the tool's imports and exports relate to changes made directly in SVN before it could be taken into use alongside the current practice of e-mailing po files. Exporting cannot be done automatically, however, because the usage rules of KDE's SVN accounts forbid it.

In the future we will consider more permanent use of the tool. The biggest problem is reviving lokalisointi.org and connecting the terms hosted there to the translation platform.

 

I want to thank all the participants for the translation work and for the feedback on how the platform worked. Special thanks to Lasse Liehu for compiling the draft version of this summary.

Efficient translation: Translation memory enabled on all Wikimedia wikis

I am pleased to announce that a long development project has been released and taken into production. We now have translation memory services enabled on Wikimedia projects (since August 28, in our last sprint).

The translation editor on Wikimania 2013 wiki shows a suggestion from Wikimania 2012 wiki

Users translating for Wikimania 2013 are provided with suggestions from 2012 (right arrow); a click is enough to copy it to the text area (down arrow). See also on Meta, in English interface.

Translation memory is a feature which provides likely translations for a text based on previous translations of similar texts: translators use them to speed up their work and to increase consistency (more in Wikipedia).

If you have translated at translatewiki.net or usebase.kde.org, you may have already noticed it. The translation memory on Wikimedia wikis has been filled with existing translations made with the Translate extension in WMF projects including Meta, mediawiki.org and Wikimania wikis.

Translators from all Wikimedia projects using the Translate extension can now work more efficiently, sharing their work and experience across the boundaries of wikis. Translators on Wikimania 2013 wiki can now find translations already provided for the previous year (see screenshot) and be quicker without sacrificing quality and consistency. Translators of technical documentation on mediawiki.org can benefit from the translation of Wikimedia terminology on Meta-Wiki and vice versa.

Technical challenges

A translation memory service has been in use at translatewiki.net for years, and the process of getting it enabled on Wikimedia was started about a year ago.

Naturally the WMF production environment is a very different thing from the small shared server that translatewiki.net runs on. Yet there were many unexpected turns that caused delays. The phases here are named retroactively.

Phase 1

Originally we used the tmserver component from the Translate Toolkit. It had its own problems: it was hard to set up, it was an external dependency, and the SQLite database engine it used was problematic for updates – it failed if multiple processes accessed it at the same time. Sometimes the included standalone web server got stuck, and the other option, WSGI, didn't play nicely with our lighttpd web server.

I did lots of research with Siebrand trying to find other open source translation memories, but failed to find anything that had any active or recent development.

Phase 2

The next step was the standalone version. To avoid external dependencies, to make it usable in the WMF infrastructure, and not to require separate services, I started porting the tmserver algorithm from Python to PHP. At the same time I was able to take advantage of MediaWiki’s database abstraction code, which in theory should make it work on SQLite, MySQL and PostgreSQL. At the moment, however, only MySQL is tested and in use at translatewiki.net.

Performance of this new system was mostly the same, though it is a constant fight to keep the Levenshtein algorithm, used for ranking at its core, from getting exponentially slow. The major new feature was support for shared databases, so that multiple wikis can use the translations made on other wikis as suggestions. A lot of time was spent on this, and also on making the initial bootstrap efficient by using multiple threads.
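
To illustrate the kind of ranking involved, here is a minimal sketch of Levenshtein-based suggestion ranking. The function name is made up, and PHP's built-in levenshtein() is byte-based and only suited to fairly short strings, so treat this as an outline rather than the actual Translate implementation; it mainly shows why a cheap length prefilter matters before running the expensive edit distance.

// Sketch: rank translation memory candidates by Levenshtein similarity.
// $text is the string to be translated, $candidates is a list of source
// strings that already have translations. Returns candidates scoring at
// least $threshold, best first.
function rankSuggestions( $text, array $candidates, $threshold = 0.75 ) {
    $results = array();
    $textLen = strlen( $text );
    foreach ( $candidates as $candidate ) {
        $maxLen = max( $textLen, strlen( $candidate ) );
        if ( $maxLen === 0 ) {
            continue;
        }
        // Cheap prefilter: if the lengths differ too much, the similarity
        // can never reach the threshold, so skip the slow edit distance.
        if ( min( $textLen, strlen( $candidate ) ) / $maxLen < $threshold ) {
            continue;
        }
        $similarity = 1 - levenshtein( $text, $candidate ) / $maxLen;
        if ( $similarity >= $threshold ) {
            $results[$candidate] = $similarity;
        }
    }
    arsort( $results );
    return $results;
}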

Phase 3

When we thought everything was ready for deployment on Wikimedia wikis, we waited for feedback from ops and finally we got a simple, yet unwanted reply: “Full-text search with MySQL cannot be used in the WMF cluster (because it depends on the problematic MyISAM storage engine)”. Yay. Back to the drawing board.
Since everything at Wikimedia uses a heavily modified Apache Lucene for full-text search, the same was obviously suggested as a solution. So started the development of phase 3; if the past predicts anything, this will have been the final rewrite.

I decided not to touch Wikimedia's version of Lucene, as I already had plenty of experience with it from playing with it for my Master's thesis (English summary on my blog), and decided to use standard Lucene with a Solr frontend instead. Solr simplified many things, and development was swift using the PHP Solarium library.
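
To give a flavour of what the Solarium side looks like, here is a rough sketch of querying a Solr index of existing translations. The endpoint settings and the field names (source_text, language, translation) are invented for this example, and the calls assume a reasonably recent Solarium version, so this is an outline rather than the actual code.

// Sketch only: ask Solr for stored translations whose source text matches
// a phrase, limited to one language. The schema and field names here are
// hypothetical.
$client = new Solarium\Client( array(
    'endpoint' => array(
        'localhost' => array( 'host' => '127.0.0.1', 'port' => 8983, 'path' => '/solr' ),
    ),
) );

$query = $client->createSelect();
$query->setQuery( 'source_text:"Remember my login"' );
$query->createFilterQuery( 'lang' )->setQuery( 'language:fi' );
$query->setRows( 5 );

$resultset = $client->select( $query );
foreach ( $resultset as $document ) {
    echo $document->translation, "\n";
}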

In fact, the most difficult "feature" to develop was the Puppet configuration for Jetty and Solr, and testing it on WMF Labs. So I learned to write Puppet configuration files from scratch and did it mostly myself. Oren Bochman helped a lot with the Labs testing phase. The last hurdle was backporting recent packages of Solr and its dependency Jetty for the Ubuntu version that Wikimedia was using on Labs and in production. Luckily I got quick help from ops, so I didn't also have to learn how to make Ubuntu packages.

So somewhat ironically, we went from separate services to standalone and again to a separate service. The first phase is long forgotten, but the standalone and Solr versions complement each other. The former is enabled by default for anyone using the Translate extension, the latter provides superior scalability and hopefully in the future even better suggestions.

The fact is that Levenshtein-based ranking is not the state of the art for translation memories [1] and does not compare to the state-of-the-art i18n we are doing with MediaWiki and translatewiki.net.

On to the next adventure!

[1] Paper abstract (full text behind paywall; DOI:10.1007/3-540-39965-8_14).

Language validation in MediaWiki

Validating language codes like en, fi or chr might seem like an easy task at first. You would expect this problem to be solved already in MediaWiki, but that is far from the truth.

In fact, we are not even handling language codes, but language tags as defined by the IETF. The linked standard brings together many other standards: the two- and three-letter language codes from the ISO 639 standards, script names, region names and more. This means that we have to handle language tags like pt-BR, sr-Latn and be-x-old, and of course in the mix are invalid tags like de-formal and tokipona, and deprecated language codes like bat-smg (better: sgs).

Language tags are case insensitive, but there is a preferred casing for the different parts. MediaWiki has wfBCP47(), which handles this "pretty-formatting".

Let me list the language tag validation functions that already exist…

  • Language::isValidCode() – Contrary to its name, this function only checks that the language tag doesn't contain certain characters which are invalid in page names or unsafe in HTML. Recently we had some issues with XSS exploits because code expected language codes to be HTML safe.
  • Language::isValidBuiltinCode() – This is slightly stricter: it only accepts language tags which consist of the letters a–z, the numbers 0–9 and hyphens.
…and what I think should exist – these will probably be implemented very soon:
  • Language::isKnownLanguageTag() – Checks that the language tag is known to MediaWiki. This basically means that we know the name of the language in English or in another language. Sources of known language codes are the built-in Names.php, the codes optionally added through the CLDR extension and the list of language names in English (pending merge).
  • Language::isSupportedLanguageTag() – Checks whether any localisation is available for that language tag in MediaWiki (MessagesXx.php exists).
I can also imagine a use case for:
  • Language::isWellFormedLanguageTag() – Checks whether the language tag is well formed. Like isKnownLanguageTag but less tight and more flexible. It would accept nonsense like fi-Cyrl-JA-x-foo, which makes no sense semantically but is valid according to the rules.
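
To make the difference concrete, here is a minimal sketch of what such a well-formedness check could look like, using a deliberately simplified regular expression instead of the full BCP 47 grammar. The function body is an illustration only, not the actual MediaWiki implementation.

// Sketch: accept language[-script][-region][-x-private] only. Real BCP 47
// also has extended language subtags, variants and extensions, which this
// simplification ignores.
function isWellFormedLanguageTag( $tag ) {
    $pattern = '/^[a-z]{2,8}'            // primary language subtag
        . '(-[a-z]{4})?'                 // optional script, e.g. Latn
        . '(-(?:[a-z]{2}|[0-9]{3}))?'    // optional region, e.g. BR or 419
        . '(-x(-[a-z0-9]{1,8})+)?$/i';   // optional private use, e.g. x-foo
    return (bool)preg_match( $pattern, $tag );
}

// "pt-BR", "sr-Latn" and the nonsensical but valid "fi-Cyrl-JA-x-foo" would
// pass; "de-formal" and "bat-smg" happen to be rejected by this simplified
// pattern, and "tokipona" would pass because it looks like an unknown
// eight-letter language code.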

Wikimania videos: the next billion users on Wikipedia and beyond

Wikimedia DC has started publishing the Wikimania videos on YouTube. They are not split by presentation, only by track, but here are some about localisation and internationalisation.

My Wikimania presentation (see my previous post), Translating the wiki way (starts at 28:05; watch on YouTube):

Amir’s Supporting languages, all of them and Siebrand’s A Tale of Language Support and Ask the Language Support People (watch on YouTube):

Santhosh's Read and Write in your language has not been published yet, and nobody seems to know whether it will be, or whether it was recorded at all.

Alolita’s The next billion users on Wikipedia with Open Source Webfonts and Amir’s The software localization paradox (watch on YouTube):

See also the category on Wikimania wiki for abstracts and slides for these presentations.

My presentations at Akademy and Wikimania

In July I gave two presentations: one at Akademy 2012 in Tallinn, and one at Wikimania 2012.

Short summary of my Akademy presentation (slides): If you are translating content in MediaWiki and you are not using the Translate extension, you are doing it wrong. Statistics, a translation and proofreading interface – you get them all with Translate. Because Translate keeps track of changes to pages, you can spend your time translating instead of trying to figure out what needs translating or updating.

Also, have a look at UserBase: it has now been updated to include the latest features and fixes of the Translate extension, like the ability to group translatable pages into larger groups.

Akademy presenation by Niklas and Claus: click for video. Yes, there's a typo.

Short summary of my Wikimania presentation (slides; video not yet available): Stop wasting translators’ time.
Forget signing up to e-mail lists, forget sending files back and forth. Use translation platforms that move files from and to the version control system transparently to the translator.
If you have sentences split into multiple messages, you are doing it wrong. If your i18n framework doesn’t have support for plural, gender and grammar dependent translations, you are doing it wrong. If you are not documenting your interface messages for translators, you are doing it wrong.

Niklas maybe having fun at the Library of Congress. Photo by tychay, CC-BY-NC-ND.

Translation sprint for KDE in Finnish

On our sprint website we're translating the upcoming KDE SC 4.9 release into Finnish. If you know Finnish, you only have to register to start translating: please join us!
We have a simple goal: translate 10,000 new messages and have all the changes proofread and accepted. In two weeks we have translated more than 3,000 messages, and the majority of them have been proofread and accepted. We still have about three weeks to go, so your help is needed to increase the output and reach the goal of 10,000 new translations. As a secondary activity we are also proofreading the existing translations and discussing and harmonizing the terminology: for example, should filter be suodin or suodatin?

Keep reading if you are interested in how we organized the sprint from a technical perspective.

This is the second translation sprint I'm organizing with the Translate extension. The first one was in March, when we translated Gnome 3.4 into Finnish, and this time we are translating KDE 4.9 into Finnish. I can say that the Translate extension fits this purpose pretty well:

  • You can set up everything in a few hours.
  • There are minimal barriers to start using it (we do require registration).
  • It is suitable for novice translators, because they get feedback when other people proofread and correct their translations.

It is not without its issues either, but I see this as a great opportunity to make the MediaWiki Translate extension even better and have it support a variety of use cases. Let me describe some.

Bugs. There are always some bugs. This time I found a regression in the workflow states feature where the recent changes weren’t backwards compatible with the old configuration format. That was quickly fixed and I also submitted fixes for a few minor issues, which were not encountered before. All in all I have 7 local patches, mostly small behaviour changes like the formatting of message keys or showing the message context field to translators. Most of those can be cleaned up and submitted for merging.

Scalability. I had the impression for a long time that the Translate extension scales up pretty well. After all, we have thousands of message groups and 50k messages translated into hundreds of languages at translatewiki.net. How naive I was. All of KDE as we use it (stable and trunk branches merged; including playground and extragear, calligra and other related stuff) contains 200k messages. It turns out that our import tools choke when you try to feed them 350k new messages at once (this number includes the Finnish translations). As a workaround I had to limit the number of messages that are processed at once and iterate over the whole process multiple times. This is where the bulk of my time was spent. Of course I also ran out of disk space in the middle of the import. It takes about 1 GB of space, but currently I have only a tiny 10 GB disk on the server.
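
Roughly, the workaround amounts to something like the following sketch, where loadMessagesFromExport() and importBatch() are hypothetical helpers standing in for the real import tooling, and wfWaitForSlaves() is MediaWiki's way of waiting for database replication to catch up.

// Sketch: instead of feeding all messages to the importer at once, process
// them in fixed-size batches so that memory use stays bounded. The helper
// functions are invented for this example.
$messages = loadMessagesFromExport( '/path/to/kde-export' );

foreach ( array_chunk( $messages, 1000, true ) as $batch ) {
    importBatch( $batch );
    // Let the database catch up before feeding in the next slice.
    wfWaitForSlaves();
}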

Search. The most requested feature is better search. Currently it is not possible to limit the search to a message group, nor to see the translation when searching source texts, or the source text when searching translations. It also takes a few clicks before you can edit a message found in the search results. Building a good search backend is currently on the backlog of the Wikimedia Localisation team, but it is not yet scheduled for any sprint.

Stay tuned for the results of the KDE Finnish translation sprint.

Report from the Multilingual Web Workshop

I attended the W3C workshop on the multilingual web with Gerard Meijssen for the Wikimedia Localisation team. Aside from the long list of new things you learn at every conference, this time I was surprised by the number of links that appeared between things I already knew. For example, META-SHARE was mentioned multiple times in different contexts.

Presentations. The workshop was split into two days. The first day was packed with short presentations from participants. Some observations:

  • Keynote about the semantic web and how it can help us reach a multilingual web.
  • Microsoft presented their translation toolkit. It didn’t seem to include translation management at all: “You can then send the empty translation file by email”. Also in the example application mph was not localised to km/h.
  • There was a poster presentation about open source language guessers. We do tag the language used in Wikipedia pages, but still most of the guessers didn’t get it right. To me this says that there is training data out there, but nobody bothers to use it.
  • New language related features (bidirectional text, ruby, translate-flag) in HTML5 ignited lots of discussion: they were welcomed but people wanted to do more.
  • XSL-FO is still years ahead of CSS by having the direction-neutral start, end, before and after keywords. That is one of the few features I like in that language.
  • Some WTF-moments: “unicode languages”, using flags for languages and locales and one of the best practices for bidirectional text was to “avoid using it”.

Open linked data. There is a big demand for all kinds of linguistic data. One of the discussion groups on the second day was about open linked data. It was emphasized that open data means that the data is in a standard format, not tied to one application. But for me an explicit open license is more important, since it allows converting a proprietary format into other formats and *distributing* them.

Open linked data. Links are another side of open linked data. Links were said to be as important as the data itself, something easy to agree with. What would for example Wikipedia be without links? The number of links is increasing, but currently the links are clustered into centers. Links are crucial to discover what data is actually available, but projects like META-SHARE do their part too. For me this compares closely to the UNIX philosophy of having each tool do one thing and do it well.
An example of this idea is in the Bank of Finnish Terminology in Arts and Sciences. Contributors are encouraged to write short definitions for terms, while long explanations are better suited to be included in Wikipedia. We are also using Semantic MediaWiki to increase the links inside the data itself.

Open linked data. A type of linked open data I would like to see is translation memory data. This is also something that open source and open content projects, including Wikimedia and translatewiki.net, can contribute, since we have lots of translations that can be used to build translation memories and parallel corpora. Have you ever wanted to compare the same text in 50+ languages? We have it. I also see nice post-processing possibilities to increase the usefulness of the data by doing sentence or even word level alignment; we only have paragraph alignment for now.

Updates on translation review feature of Translate extension

About three months ago I blogged about the translation review feature that we developed for the Translate extension. It is time to have a look at how it has been received. Thanks to Siebrand Mazeland we can now draw graphs for review and reviewer activity. This feature came just in time for the Gnome 3.4 Finnish Translation Sprint that I'm organizing. If you look at its main page, you can see graphs for translation and review activity. The activity isn't exactly over the top, so if you speak or can translate into Finnish, please join and help us.

I’m aware of three places using this feature: translatewiki.net, Wikimedia Foundation and the translation sprint mentioned above. In translatewiki.net the review ability is not as open as I originally envisioned it to be: only experienced translators can get it by request.  Only about 2% of over 3500 registered translators currently have the review right in translatewiki.net. For the other two places, everyone who can translate can also review.

When looking at the graphs for translatewiki.net, we can see without doubt that translation review activity is not yet anywhere near the translation activity, and we should keep in mind that there is also a huge backlog of previous translations that should be reviewed. We don't even see a steady growth in review activity (around the turn of the year we had a translation sprint which temporarily raised translation and review activity above normal levels). We don't have graphs for Wikimedia projects yet, but judging from the logs the review feature seems to be in relatively more active use there. I would personally like to see all new translations from now on be reviewed by at least one other user.

The next step would be to add a review level column to the Special:LanguageStats and Special:MessageGroupStats pages. That would need some idea of how to convey both quantity and coverage. For example, a hundred translators reviewing the same message doesn't mean that the review coverage is good. Perhaps we should just start with coverage and bring in quantity later. This could be a nice small project for someone who wants to help develop the Translate extension with help from us.
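
As a sketch of that distinction (a hypothetical helper, not anything in the Translate extension), coverage and quantity for a message group could be computed along these lines:

// Sketch: $reviewsPerMessage maps message key => number of review actions.
// Coverage is the share of messages with at least one review; quantity is
// the total number of review actions, which can be large even when coverage
// is poor (everyone reviewing the same message).
function reviewStats( array $reviewsPerMessage ) {
    $total = count( $reviewsPerMessage );
    $reviewed = count( array_filter( $reviewsPerMessage ) );
    return array(
        'coverage' => $total ? $reviewed / $total : 0,
        'quantity' => array_sum( $reviewsPerMessage ),
    );
}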

My take-away from Open Advice

I told my friend Nemo that I had been reading the recently published Open Advice book, and he basically forced me to write a review of it. This isn't really a review, but rather something the book made me think about. When I started reading the book I expected to get some simple tips on how I could do things better, or new things I could do. Well, I didn't get those, but I got something else.

The book consists of many short stories about open source from different starting points, each written by a different author. It was nice to notice that among the writers there were many whom I've met, or at least whose name and work I knew. Most of the stories didn't tell me anything new, and the section about translation was annoyingly short on content. The book is still worth reading, especially since each story is short, which makes it easy to pick up.

When I read the following passage in Markus Krötzsch's Out of the lab, into the Wild, I started thinking.

When a certain density of users is reached, support starts to happen from user to user. This is always a magical moment for a project, and a sure sign that it is on a good path.

I have been developing the Translate extension (and by extension translatewiki.net too) for many years now, but apart from seeing it being used more and more, I haven’t really stopped to think what it means for a software project to grow up and be successful. So I made up some milestones:

  1. You write something for yourself
  2. Other people find it useful and start using it
  3. The users of your software provide peer-to-peer help
  4. Other developers are able to take over maintenance and development of the software

Now we have something we can measure. I started writing Translate over five years ago. Some years later there were already tens of translators using it. This year the Translate extension is used in many Wikimedia projects as well as in KDE UserBase in addition to translatewiki.net. Lots of new people need to learn how to use the Translate extension from a management point of view, and more and more often they get an answer not from me but from someone else or by reading the documentation.

So what about step 4? Until very recently, Translate has been my world and my world only, apart from some patch contributions. But I have now taken it as my personal goal to change this. And what a lucky person I am! The Wikimedia Localisation Team – which I am a member of – has the development of the Translate extension as one of its major goals. Even better, we are an agile team, which means that each and every developer in the team should be able to do any development task in the team. To achieve this we divide tasks among team members so that nobody works only on their own favourite project. In addition, we explicitly reserve time for knowledge transfer, which happens through code review, proofreading documentation one of us has written, explicit sessions where a team member covers a topic they know well, and pair programming. This has already been going on for some months and it is not going to stop.

In addition to schooling the other developers in our team, I also plan to keep expanding the documentation, adding more tutorials and organizing tasks suitable for new developers, so that it is easy for interested volunteer developers to start contributing to Translate. Because in the end knowledge is useless if the developer has no reason to develop, and the best reason to develop is to scratch your own itch. I believe those developers are to be found among the users of the Translate extension who have a slightly different and new use case which needs development work.

I haven’t yet finished my plans on the fifth step (world domination), so stay tuned for coming blog posts.