What are the most important messages
In this blog post the term message means a translatable string in a software; technically, when a message is shown to users, they see different strings depending on the interface language.
MediaWiki software includes almost 5.000 messages (~40.000 words), or almost 24.000 messages (~177.000 words) if we include extensions. Since 2007, we make a list of about 500 messages which are used most frequently.
Why? If translators can translate few hundreds words per hour, and translating messages is probably slower than translating running text, it will take weeks to translate everything. Most of our volunteer translators do not have that much time.
Assuming that the messages follow a long tail pattern, a small number of messages are shown* to users very often, like the Edit button at the top of page in MediaWiki. On the other hand, most messages are only shown on rare error conditions or are part of disabled or restricted features. Thus it makes sense to translate the most visible messages first.
Concretely, translators and i18n fans can monitor the progress of MediaWiki localisation easily, by finding meaningful numbers in our statistics page; and we have an clear minimum service level for new locales added to MediaWiki. In particular, the Wikimedia Language committee requires that at very least all the most important messages are translated in a language before that language is given a Wikimedia project subdomain. This gives an incentive to kickstart the localisation in new languages, ensures that users see Wikimedia projects mostly in their own language and avoids linguistic colonialism.
Some history and statistics
The usage of the list for monitoring was fantastically impactful in 2007 and 2009 when translatewiki.net was still ramping up, because it gave translators concrete goals and it allowed to streamline the language proposal mechanism which had been trapped into a dilemma between a growing number of requests for language subdomains and a growing number of seemingly-dead open subdomains. There is some more background on translatewiki.net.
Languages with over 99 % most used messages translated were:
- 38 in 2007 shortly after the new list;
- 96 one year later against the same list (or 100 with the new list shortly after) , which means the list helped triple the coverage;
- 181 in 2011 with the 2009 list, but actually 82 after updating; back to 115 6 months later and 143 after 6 more months;
- 113 in latest available data for the 2011 list (April 2014), only 17 right now with the new list.
There is much more to do, but we now have a functional tool to motivate translators! To reach the peak of 2011, the least translated language among the first 181 will have to translate 233 messages, which is a feasible task. The 300th language is 30 % translated and needs 404 more translations. If we reached such a number, we could confidently say that we really have Wikimedia projects in 280+ languages, however small.
* Not necessarily seen: I’m sure you don’t read the whole sidebar and footer every time you load a page in Wikipedia.
At Wikimedia, first, for about 30 minutes we logged all requests to fetch certain messages by their key. We used this as a proxy variable to measure how often a particular message is shown to the user, which again is a proxy of how often a particular message is seen by the user. This is in no way an exact measurement, but I believe it good enough for the purpose. After the 30 minutes, we counted how many times each key was requested and we sorted by frequency. The result was a list containing about 17.000 different keys observed in over 15 million calls. This concluded the first phase.
In the second phase, we applied a rigorous human cleanup to the list with the help of a script, as follows:
- We removed all keys not belonging to MediaWiki or any extension. There are lots of keys which can be customized locally, but which don’t correspond to messages to translate.
- We removed all messages which were tagged as “ignored” in our system. These messages are not available for translation, usually because they have no linguistic content or are used only for local site-specific customization.
- We removed messages called less than 100 times in the time span and other messages with no meaningful linguistic content, like messages where there are only dashes or other punctuation which usually don’t need any changes in translation.
- We removed any messages we judged to be technical or not shown often to humans, even though they appeared high in this list. This includes some messages which are only seen inside comments in the generated HTML and some messages related to APIs or EXIF features.
Finally, some coding work was needed by yours truly to let users select those messages for translation at translatewiki.net.
In this process some points emerged that are worth highlighting.
- 310 messages (62 %) of the previous list (from 2011) are in the new list as well. Superseded user login messages have now been removed.
- Unsurprisingly, there are new entries from new highly visible extensions like MobileFrontend, Translate, Collection and Echo. However, except a dozen languages, translators didn’t manage to keep up with such messages in absence of a list.
- We slightly expanded the list from 500 to 600 messages, after noticing there were few or no “important” messages beyond this point. This will also allow some breathing space to remove messages which get removed.
- We did not follow a manual passage as in the original list, which included «messages that are not that often used, but important for a proper look and feel for all users: create account, sign on, page history, delete page, move page, protect page, watchlist». A message like “watchlist” got removed, which may raise suspicions: but it’s “just” the HTML title of Special:Watchlist, more or less as important as the the name “Special:Watchlist” itself, which is not included in the list either (magic words, namespaces or special pages names are not included). All in all, the list seems plausible.
Finally, the aim was to make this process reproducible so that we could do it yearly, or even more often. I hope this blog post serves as a documentation to achieve that.
I want to thank Ori Livneh for getting the key counts and Nemo for curating the list.