Prioritizing MediaWiki’s translation strings

After a very long wait, MediaWiki’s top 500 most important messages are back at translatewiki.net with fresh data. This list helps translators prioritize their work to get the most out of their effort.

What are the most important messages

In this blog post the term message means a translatable string in a piece of software; technically, when a message is shown to users, they see different strings depending on the interface language.

MediaWiki software includes almost 5,000 messages (~40,000 words), or almost 24,000 messages (~177,000 words) if we include extensions. Since 2007, we have made a list of about 500 messages which are used most frequently.

Why? If translators can translate a few hundred words per hour, and translating messages is probably slower than translating running text, it will take weeks to translate everything. Most of our volunteer translators do not have that much time.
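To make the "weeks" estimate concrete, here is a rough back-of-the-envelope sketch. The speed of 300 words per hour and the 40-hour week are illustrative assumptions of mine (the post only says "a few hundred words per hour"); the word counts are the approximate figures quoted above.

```python
# Rough estimate of the effort needed to translate everything.
# ASSUMED_WORDS_PER_HOUR and hours_per_week are illustrative assumptions,
# not measured values.

CORE_WORDS = 40_000        # approximate word count of MediaWiki core
ALL_WORDS = 177_000        # approximate word count including extensions
ASSUMED_WORDS_PER_HOUR = 300


def hours_to_translate(words: int, words_per_hour: float) -> float:
    """Hours needed to translate `words` at a constant speed."""
    return words / words_per_hour


def weeks_to_translate(words: int, words_per_hour: float,
                       hours_per_week: float = 40.0) -> float:
    """Full-time weeks needed, assuming `hours_per_week` of translation."""
    return hours_to_translate(words, words_per_hour) / hours_per_week


if __name__ == "__main__":
    for label, words in (("core", CORE_WORDS), ("core + extensions", ALL_WORDS)):
        hours = hours_to_translate(words, ASSUMED_WORDS_PER_HOUR)
        weeks = weeks_to_translate(words, ASSUMED_WORDS_PER_HOUR)
        print(f"{label}: about {hours:.0f} hours, or {weeks:.1f} full-time weeks")
```

Under these assumptions the full set comes to several hundred hours, which for a volunteer with a few spare hours per week is far more than "weeks".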

Assuming that the messages follow a long tail pattern, a small number of messages are shown* to users very often, like the Edit button at the top of the page in MediaWiki. On the other hand, most messages are only shown in rare error conditions or are part of disabled or restricted features. Thus it makes sense to translate the most visible messages first.

Concretely, translators and i18n fans can easily monitor the progress of MediaWiki localisation by finding meaningful numbers on our statistics page, and we have a clear minimum service level for new locales added to MediaWiki. In particular, the Wikimedia Language committee requires that at the very least all the most important messages are translated in a language before that language is given a Wikimedia project subdomain. This gives an incentive to kickstart the localisation in new languages, ensures that users see Wikimedia projects mostly in their own language and avoids linguistic colonialism.

Screenshot of fi.wiktionary.org

The screenshot shows an example page with messages replaced by their key instead of their string content.

Some history and statistics

The use of the list for monitoring was fantastically impactful in 2007 and 2009, when translatewiki.net was still ramping up. It gave translators concrete goals, and it allowed us to streamline the language proposal mechanism, which had been trapped in a dilemma between a growing number of requests for language subdomains and a growing number of seemingly dead open subdomains. There is some more background on translatewiki.net.

Languages with over 99 % of the most used messages translated were:

There is much more to do, but we now have a functional tool to motivate translators! To reach the peak of 2011, the least translated language among the first 181 will have to translate 233 messages, which is a feasible task. The 300th language is 30 % translated and needs 404 more translations. If we reached such a number, we could confidently say that we really have Wikimedia projects in 280+ languages, however small.

* Not necessarily seen: I’m sure you don’t read the whole sidebar and footer every time you load a page in Wikipedia.

Process

At Wikimedia, first, for about 30 minutes we logged all requests to fetch certain messages by their key. We used this as a proxy variable to measure how often a particular message is shown to the user, which again is a proxy of how often a particular message is seen by the user. This is in no way an exact measurement, but I believe it is good enough for the purpose. After the 30 minutes, we counted how many times each key was requested and we sorted by frequency. The result was a list containing about 17,000 different keys observed in over 15 million calls. This concluded the first phase.
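The counting step of this first phase can be sketched roughly as follows. This is a minimal illustration, not the script that was actually run at Wikimedia: it assumes the log has already been reduced to a plain sequence of message keys, one entry per request, and the function name is made up.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_keys_by_frequency(requested_keys: Iterable[str]) -> List[Tuple[str, int]]:
    """Count how often each message key was requested and sort the result.

    Keys are ordered by descending request count; ties are broken
    alphabetically so that the output is deterministic.
    """
    counts = Counter(requested_keys)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Tiny made-up sample standing in for the real request log.
    sample_log = ["edit", "edit", "history", "edit", "talk", "history"]
    for key, calls in rank_keys_by_frequency(sample_log):
        print(key, calls)
```

`Counter` keeps only one integer per distinct key, so even millions of logged calls reduce to a table with one row per key.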

In the second phase, we applied a rigorous human cleanup to the list with the help of a script, as follows:

  1. We removed all keys not belonging to MediaWiki or any extension. There are lots of keys which can be customized locally, but which don’t correspond to messages to translate.
  2. We removed all messages which were tagged as “ignored” in our system. These messages are not available for translation, usually because they have no linguistic content or are used only for local site-specific customization.
  3. We removed messages called fewer than 100 times in the time span and other messages with no meaningful linguistic content, like messages where there are only dashes or other punctuation which usually don’t need any changes in translation.
  4. We removed any messages we judged to be technical or not shown often to humans, even though they appeared high in this list. This includes some messages which are only seen inside comments in the generated HTML and some messages related to APIs or EXIF features.
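The four cleanup steps above could be expressed roughly like this. It is only a sketch under stated assumptions: the function and parameter names are hypothetical, the "no linguistic content" check is simplified to "contains no letter or digit", and the human judgement of step 4 is represented by a hand-maintained set of keys. The real cleanup combined a script with manual review.

```python
from typing import Dict, Iterable, List, Set, Tuple


def clean_ranked_list(
    ranked: Iterable[Tuple[str, int]],   # (key, request count), most requested first
    known_messages: Dict[str, str],      # key -> source text, core and extensions only
    ignored: Set[str],                   # keys tagged as "ignored" for translation
    manually_excluded: Set[str],         # keys judged technical by a human reviewer
    min_calls: int = 100,
) -> List[str]:
    """Apply the four cleanup steps and return the surviving keys in rank order."""
    kept = []
    for key, calls in ranked:
        # Step 1: drop keys that are not MediaWiki or extension messages.
        if key not in known_messages:
            continue
        # Step 2: drop messages that are not available for translation.
        if key in ignored:
            continue
        # Step 3: drop rarely requested messages and punctuation-only messages.
        if calls < min_calls:
            continue
        if not any(ch.isalnum() for ch in known_messages[key]):
            continue
        # Step 4: drop messages a reviewer judged technical or rarely seen.
        if key in manually_excluded:
            continue
        kept.append(key)
    return kept


if __name__ == "__main__":
    # Made-up sample data; keys and counts are for illustration only.
    ranked = [("edit", 5000), ("local-sidebar", 4000), ("ignored-msg", 3000),
              ("dash", 2000), ("exif-make", 1500), ("history", 900), ("rare", 50)]
    known = {"edit": "Edit", "ignored-msg": "x", "dash": "-",
             "exif-make": "Camera manufacturer", "history": "View history",
             "rare": "Rare error"}
    print(clean_ranked_list(ranked, known, {"ignored-msg"}, {"exif-make"}))
```

Keeping each step as a separate check makes it easy to see which rule removed a given key when the list is regenerated.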

Finally, some coding work was needed by yours truly to let users select those messages for translation at translatewiki.net.

Discoveries

In this process some points emerged that are worth highlighting.

  • 310 messages (62 %) of the previous list (from 2011) are in the new list as well. Superseded user login messages have now been removed.
  • Unsurprisingly, there are new entries from new highly visible extensions like MobileFrontend, Translate, Collection and Echo. However, except in a dozen languages, translators didn’t manage to keep up with such messages in the absence of a list.
  • I just realized that we are probably missing some high visibility messages only used on the JavaScript side. That is something we should address in the future.
  • We slightly expanded the list from 500 to 600 messages, after noticing there were few or no “important” messages beyond this point. This will also allow some breathing space to drop messages which later get removed from the software.
  • We did not repeat the manual step used for the original list, which included «messages that are not that often used, but important for a proper look and feel for all users: create account, sign on, page history, delete page, move page, protect page, watchlist». A message like “watchlist” got removed, which may raise suspicions: but it’s “just” the HTML title of Special:Watchlist, more or less as important as the name “Special:Watchlist” itself, which is not included in the list either (magic words, namespaces or special page names are not included). All in all, the list seems plausible.

Conclusion

Finally, the aim was to make this process reproducible so that we could do it yearly, or even more often. I hope this blog post serves as documentation to achieve that.

I want to thank Ori Livneh for getting the key counts and Nemo for curating the list.


8 thoughts on “Prioritizing MediaWiki’s translation strings”

  1. Nemo

    A ballpark number might be that, by translating those 600 messages, about 90-95 % of pages served have a fully translated interface.

  2. Jon Harald Søby

    I should probably comment on this somewhere else, but… {{int:Metadata-fields}} probably doesn’t belong in the most used, right?

  3. Purodha Blissenbach

    Thank you for this overview! It made me ask myself why I leave some translations incomplete for a considerably long time.

    Here are my top 3 reasons.

    1. There is a word that I do not have a suitable translation for. When translating into a vernacular language that has lacked many technical terms for longer than more mainstream languages, it may take months or even years to discuss and develop terms. Sometimes it does not help. For terms very specific to a program or an application, things may be even worse.

    2. Specific terms or abbreviations may be unknown to me and not understood at once. Often, they require knowledge of the contexts in which they occur, which I do not have. Questions yield helpful information leading to an appropriate, understandable, common-sense translation only about 50 % of the time.

    3. Wording, semantics, contexts of messages may be unclear, ambiguous, complicated, or very specific (e.g. legalese), and cannot be determined well, even with high research effort. Thus valid unquestionable translations are simply impossible given the time I can spend.

    I usually leave messages, or words, untranslated until an undoubtedly valid translation has been found. I mark messages with untranslated words as !!FUZZY!! for later tries.

  4. Nemo

    Thanks Purodha for your experience.

    We now have over 40 languages 99 % translated (including ksh!).
