Tag Archives: translatewiki.net

Insertables in Translate make translating easier

Insertables are a new tool to easily copy some text from the source language to your translation with one click.

Have you ever translated anything with the Translate extension? Did it contain markup like this?

[http://very.long.url/here link description]
{{GENDER:$1|he|she}} posted $2 on $3

If so, then you know what this is about. Have you ever translated anything with the Translate extension while using a tablet or another device without a physical keyboard? If so, then you likely know why this interesting.

When you translate text written in wiki markup, or software interface strings, you will encounter the examples above, and many more parts which you need to copy verbatim while translating. These parts contain special characters like braces, dollar signs, brackets, pipes and so on. These characters are cumbersome to type on non-English keyboards, where they have been moved to more difficult to reach key combinations in favour of local characters – if they exist in the layout at all. If they don’t exist in the keyboard layout, you need to switch keyboard layouts just to type few characters and then switch it back.

Does this sound cumbersome? Many translators in fact do not do that, but instead they copy and paste the text from the source text. On tablets however, copy and paste itself is a cumbersome thing. Insertables are a solution to this usability issue.

We can automatically identify a part of the translatable text which has the following properties: it should not be changed and it is difficult to type. We can then present these parts of strings as buttons near the translation. Clicking or pressing that button inserts the text into the translation. These buttons complement the insert source text button and are optional to use, like all translation helpers we provide.

Happy translator using the new feature

Happy translator using the new feature

As of now, we only detect a few types of these insertables: plural, grammar magic words, and variables in MediaWiki style ($1). Read more on Translate documentation for how to contribute more insertables.

FOSDEM talk reflections 1/3: I18n in the WEB, Mozilla i18n and L20n

FOSDEM 2013 t-shirt

FOSDEM 2013 was attended by several Wikimedians.

Now that I’ve slept over the presentations I attended at FOSDEM, it’s a good time to think about what I heard and how it related to what I am doing. It is also a good time before I forget what I heard. I didn’t get to talk to that many people this year, mostly running from one talk to another.

There will be three parts to the series of these blog posts. I will start with i18n related topics and then other presentations roughly in the order I saw them (headers link to abstracts).  There will also be a follow-up post on the gettext format detailing the good and bad sides from today’s point of view. Stay tuned!

An Integrated Localization Environment

Mozilla keeps pushing new i18n stuff, though the general feeling of this and other related talks is that they either have not defined what is the issue they are fixing, or they have defined in a way that is completely different from what we are working on.

While we are trying to make it as easy as possible for translators to translate (in a technical sense, they already have enough of complexity due to language itself), the ILE proposed in this talk is essentially a IDE (integrated development environment) – a glorified text editor that programmers often use for programming. It has features like highlighting syntax via colors and automatic completion for translation file syntax.

But do translators really care about particular syntax of translation in a file, or are they in fact more happy if they do not need to care about files and version control systems at all, while at the same time having access to aids like translation memories and change tracking in an interface created by UX designers, as we have in translatewiki.net?

“It helps to see the messages above and below to understand the context”
You can see the related messages close to each other in almost any translation tool, even though showing related messages next to each other is not a replacement for proper documentation of context for each message.

“I don’t see how form based translation tools would cope with more complex localisation file formats like L20n”
I don’t think the solution to facilitating proper localisation is to turn the localisation itself into programming. The cases where more complex logic is needed are actually relatively few and I think it is worthwhile to keep the common case as simple as possible while supporting also the more complex cases in a standardized, data driven way, like using the CLDR.


Mozilla presenters at FOSDEM

Mozilla keeps pushing new i18n stuff: who is the user they are designing new tools for?

This talk was an update to the similar presentation on L20n last year. What I said on the previous post about turning localisation into programming applies here too.

It is nice that you specify grammatical gender for things, but this format does not really solve the problem that many variables actually come from user input, for which we cannot specify this information.

It is nice that you can make custom plural rules, but in almost all cases the standard set of plural rules that comes from standards like CLDR is enough.

It is nice that you can mix gender and plural and even many plural in one message using nested hashes (arrays in PHP), but it is not nice at all that you have to translate the message N*M*O times as the number of variables increase. I firmly believe that inline syntax like {{GENDER:$1|he|she}} eats {{PLURAL:$2|apple|apples}} is superior in this regard.

If we strip the plural, gender, time formatting etc. support from L20n, we actually just get a complex file format for storing things, something which we already have many variants of. The aforementioned features are usually provided by the i18n library (or definitely should be; unfortunately this is not always the case) so what they have done is actually moving the complexity of language from i18n libraries and software developers to translators. Aiming at “keep common case simple, but support complex cases where needed”, I don’t think this is as presented a good trade-off between simplicity and flexibility.

webL10n: client-side i18n / l10n library

This talk was about adapting some nice parts of L20n to .properties format. The result is somewhat more complex than plain .properties and not as flexible as L20n. Even having gender and plural in the same message is problematic in this format.

I’d like to highlight two ideas in webl10n. Sidenote: Why call it l10n when it is actually an i18n library for developers, similar to jquery.i18n.

The first idea is that you can have html like this:

<div l10n-data-id=retro>
<div>Please <a href="login/">log in</a></div>

And the translators see this:

retro = <div>Please <a>log in</a></div>

The translation, when displayed, is properly merged to the original html so that the classes and link targets are preserved. I don’t know what happens if the translation is outdated and the structure is changed, but I guess we just should not use outdated translations with this system. When escaping is handled properly, this is a very nice way to handle what we call lego messages, where the text of the link is in a separate message, because due to escaping we can’t have link and link text in the same message.

Another idea is that if you have HTML like this:

<input type="search" placeholder="Search messages" title="Message search box">

You can turn it to this.

<input type="search" data-l10n-id="searchbox">

And translators will see this (using .properties format here)

searchbox.placeholder=Search messages
searchbox.title=Message search box

This simplifies the html the developers need to write.

Finally, take a look also at Pau’s Design talks at FOSDEM 2013.

Performance tuning translatewiki.net

One of the biggest advantage of desktop translation tools is that they don’t have delays rendering the interface – at least not in such a scale as websites have. In translatewiki.net it is crucial that our pages load very fast. In certain places we can and do use intelligent preloading to remove the delays, in other places we have to employ complex caching algorithms to reach that target. I am regularly monitoring the automatically collected profiling information to avoid regressions and to pick low-hanging fruit from time to time.

In the last sprint my main task was to convert the way we handle the translation of MediaWiki extensions in translatewiki.net to use the same processes and interfaces as pretty much everything else. MediaWiki and MediaWiki extensions were the first things supported in translatewiki.net and now they are among the last things to get modernized to take advantage of better interfaces built on the years of experience supporting various kinds of products.

The only user visible change is improved performance. The new interfaces are more efficient and enable more optimizations, which allows us to deliver faster page views and scale to more messages. It will also simplify the work of translatewiki.net staff, as they don’t need to follow two different processes, especially after we update also MediaWiki translation code.

Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.

As a developer I’m proud that the new code is unit tested. The culmination, however, was a change which removed hundreds of lines of old code: in fact, the above quote applies to software development too.

For those interested in details, the biggest performance boosts were achieved by avoiding the need to parse the translation files in many places – the list of message keys and their values are stored in intermediate cache files in CDB format. In addition there were many smaller performance optimizations, like not using some MediaWiki method to construct a link element, which consumed 20 kilobytes of memory for each link. When there are thousands of links, it adds up and is excessive for just making some hundred bytes of output. I switched it to a more low level method (memory usage: from 175 to 12 MB).

Some low-hanging fruit might not be as easy to pick as it seems at first. (Photo CC-BY-SA by Asit K. Ghosh.)

At the time of writing I still have some more fixes pending further testing and cleanup. For example, to access any message group, those all have to be loaded. They are cached as serialized PHP objects, but loading them takes 20 milliseconds and 10 megabytes of memory. I’m working on making it possible to load cached message groups individually.

The website anyone can translate

Translatewiki.net has started using Puppet. Puppet is a tool designed to manage the configuration of servers. Like Wikimedia’s, our configuration is public and stored in the translatewiki.net git repository, where anyone can submit patches. I don’t expect a flood of them coming in anytime soon, my motivations for this were different. If you remember, some months back I had to learn some Puppet to write the Solr configuration for Wikimedia deployment. Now I wanted to learn more and gather more experience on using Puppet. It will also greatly help if we ever need to reinstall the translatewiki.net server from scratch (which is quite likely to happen soon). As a bonus it gives transparency and something I can refer people to when they ask how some particular thing is done in translatewiki.net. As time permits, I will be moving more configuration to Puppet.

Mitä isot edellä, sitä pienet perässä. (Internet suggest the closest translation is Monkey see, monkey do.)

I also added the translatewiki.net repository to Ohloh. If you use translatewiki.net as localisation platform, feel free to add it to your stacks by clicking “I use this”, or to embed its widgets in your website. Ohloh also gives some cool stats:

In a Nutshell, translatewiki.net…

Together with the introduction of Puppet, I also switched the webserver of translatewiki.net from lighttpd to nginx. The biggest reason for this is that https was broken for Google Chrome users, but in general nginx feels faster and more robust and the way PHP is used with it is much simpler (php-fpm instead of spawn-cgi). The Wikimedia operations team is supposedly going to test nginx soon, so we will see whether the tide also goes that way.

Muir Woods has one tree – plural issues in MediaWiki

While I was having fun with the rest of the Wikimedia I18n team in San Francisco, a stream of plural related bug reports started coming in. The cause is that we have recently scrapped the custom plural rules in MediaWiki in favor of using plural rules from the CLDR database. A temporary fix has been applied to mitigate the reported issues.

The problem manifestation is pretty simple; in some languages in some contexts the message was always one something. For example the category page would say This category has one page regardless of how many pages there were in it. At first I was baffled. After all we had written unit tests for all languages in MediaWiki and they reported no regressions. Turns out we had ignored one particular set of languages: those which don’t always use plurals and had no plural rules defined in MediaWiki. The problems started when those language used plural even though they weren’t supposed to. When plural rules are not defined for a language, those languages use the plural rules as defined for the English language: 1 book, 2 books. In CLDR, however, some languages have been defined to not use any plural rules at all.

We could blame the translators for using plural syntax when they are not supported, or we could blame the CLDR for having no plurals rules for languages which do use plurals in some cases. It is not that simple, however. The typical example is a language which doesn’t have distinct plural forms (like some words in English: 1 fish, 2 fish; but for all nouns), but do use plural quantifiers if the number is not present: one fish, many fish.

As a compromise I have proposed an extension to the plural syntax to allow specifying the output when the number is 0 or 1 regardless of the usual plural rules for that language. Let’s take a real example:

Accepted by {{PLURAL:$1|you|$1 users including you}}.

This works fine in English, because the first form is always for number 1. In Belarusian it doesn’t work, because the first form is used for number 1, but also for numbers 21, 31, 41 etc. It could be solved by the following syntax:

{{PLURAL:$1|1=you|$1 users including you}}.

The slightly confusing part here is that now the second form is actually the singular form. This is more evident in the imaginary Belarus translation:


"you" is used for number 1, “one" for 21, 31, 41 but not 1, and the remaining forms as they usually are.

The explicit zero form (0=something) can also be useful for English and many other languages to have a different wording – something which is now usually done with separate messages.

The message used above is from the Translate extension. Unfortunately we cannot start using this syntax until we have dropped backwards compatibility with the last MediaWiki version not supporting  this syntax i.e. 1.20, which would be around when MediaWiki 1.22 is released. We are seriously considering to backport this functionality, but we also need to add support for the same syntax in JavaScript first.

During further testing we also found issues in Hebrew plural rules. The position of dual was changed and we didn’t notice it because the unit tests were wrong. This resulted in problems like the login page saying Remember my login for two days. It just helps reminding how bugs in i18n can cause potentially severe issues.

Niklas in Muir Woods.

Niklas in Muir Woods. Testing new counting methods? (Photo by Pau Giner.)

Wikimania videos: the next billion users on Wikipedia and beyond

Wikimedia DC has started publishing the Wikimania videos on YouTube. They are not split by presentation, only by track, but here are some about localisation and internationalisation.

My Wikimania presentation (see my previous post), Translating the wiki way (starts at 28:05; watch on YouTube):

Amir’s Supporting languages, all of them and Siebrand’s A Tale of Language Support and Ask the Language Support People (watch on YouTube):

Santhosh’s Read and Write in your language has not been published yet and nobody seems to know if it will, or if it has been recorded at all.

Alolita’s The next billion users on Wikipedia with Open Source Webfonts and Amir’s The software localization paradox (watch on YouTube):

See also the category on Wikimania wiki for abstracts and slides for these presentations.

Putting that another pair of eyes into good use

This blog post is about the MediaWiki Translate extension and explains how we got to develop a new set of translation review tools.

One of the core principles at translatewiki.net is that the time of translators is a prestige resource. We show our appreciation to translators by providing tools that let them concentrate 100% on the task at hand and let the (volunteer) staff handle the boring tasks.

It is well known that good translators take pride of their and others work. This may result in a urge to review all translations made by other translators. I consider myself being that kind of translator. The good news is that in recent months the Translate extension has got massively better at supporting reviewing of translations. Some weeks ago we added a new listing where you can click a button to accept a translation. When the list is empty, you know that all translations have either been made or fixed by you, or you have accepted someone elses’ translations.

This is all nice and dandy, but if you want to review new translations as they come in it is not practical. You’d either have to watch the list of recent translations or subscribe to the feed of them. From here you can get to the individual messages, but it takes many clicks to get to the page where you see the button to accept the translation. And iterating over each of the hundreds of message groups to see if there is anything to accept is not practical either.

The solution: a special message group which lists the recent translations in a given language. Since only some of the translators are allowed to review, on the right you can see a screenshot of how it looks like – click to enlarge. One could bookmark this page and have a look at it a few times per week. For me this is a real time saver, and I’m sure others will find it useful too.

To get this implemented, I originally anticipated that some heavy refactoring was needed and I estimated about one and a half day for it. In the end it took only about half a day – I was positively surprised how painless the refactoring was. The problem was that the class which fetches all the messages from the database assumed they all belong in the same MediaWiki namespace. In translatewiki.net we have over ten namespaces for translations of different projects, so it had to be fixed. I’d say this is a prime example of the saying Premature optimization is the root of all evil by Donald Knuth.

In the future we need to link this page from suitable places to make this feature discoverable and also to make sure that more than the current 66 users out of 3000+ translators get the right to use this feature.

translatewiki.net celebrates – so do I

Oh boy time flies. Translatewiki.net turns six years next Saturday. This is the first time we celebrate its birthday. How did it happen?

It was 2005, my last year at upper secondary school when I set up a MediaWiki for myself to do some school work. I was 17, and in the fall of the same year I started studying at a university. Can you imagine how awkward it was to attend university under age of majority (18 years in Finland)? Anyway, I think the wiki was originally called Nukawiki, then Betawiki and finally translatewiki.net. The wiki has gone through many updates. It probably started with Mediawiki 1.4 which boasts in release notes that User interface language can be changed by the user. It’s also gone through many computers starting from my laptop and gradually to more powerful, more dedicated servers.

Already before the summer of 2006, when I started my obligatory military service which lasted six months, I was using the wiki to translate MediaWiki into Finnish and fix i18n problems. In 2006 we started inviting other translators to join. In February 2007 I started translating FreeCol into Finnish and soon they moved all translation related activities into our wiki. One of the initial translators was Siebrand, who has had enormous influence on the direction the project has taken since he joined.

In other words translatewiki.net was a small hobby project for an entirely different purpose, then I used it to scratch a personal itch, and nowadays it is a thriving community with thousands of members. We are already huge in many metrics, we are still growing and there doesn’t seem to be any boundaries for our size. I just cannot imagine how many people the work of translatewiki.net has impacted. For me this means an opportunity, but more importantly a challenge. How do we improve our service while scaling up? How can we provide better tools for translators, for ourselves and for projects that use us? We have been successful thus far, because we have been very efficient – it is almost scary how few people (albeit very dedicated) can keep everything running smoothly.

Translatewiki.net has had and still does have huge impact to my life. It is just not because it is a huge time sink for me. It is a manifestation of the many skills I’ve learned during my life. It feels wrong to say that it is my hobby, because sometimes it feels that studying is the hobby here. Nevertheless my master thesis is nearing completion. I already have a job in mind and I can’t say that translatewiki.net didn’t affect that.

I’m sincerely grateful to each and everyone who has helped translatewiki.net to become what it is today.

Translation engines: black boxes

One would hope that using machine translation system would be as easy as giving text and pair of languages in and getting something out. But at least here in translatewiki.net things are pretty complex under the hood.

First of all these translation engines are external systems which are based on huge corpora of translated texts and statistical methods. Translations are queried trough HTTP requests. The Translate extension implements an algorithm which keeps tracks of failures and disables the whole service for some period. Failures can be error messages, time outs or even failures to establish a connection. For example on translatewiki.net recently moved to a new server which has a bit unstable DNS resolution which needs to be fixed.

Disabling serves multiple purposes. First of all if the service is temporarily down, we don’t waste our nor their time trying. Secondly, if we hit some kind of rate limit (we shouldn’t) we can back off for a while.

Then there is a issue with the contents–the engines like to mungle mangle mingle all things they don’t understand. In interface translation with many special characters and expressions this is annoying. I just recently made some improvements here based on a suggestion from Jeroen De Dauw. The most common special syntaxes are now armored against changes. This includes variables like $1, %s or %foo% and some other things. Line breaks disappear too, but that was already worked around earlier.