Category Archives: translatewiki.net

Report on Wikimedia Hackathon 2017 in Vienna

Long time no post! Let’s fix that with another report from my travels. This one is mostly about work.

Wikimedia hackathon was held in Vienna in May 2017. It is an event many MediaWiki developers come to meet and work together on various kinds of things – the more experiences developers helping the newcomers. For me this was one of the best events of this kind which I have attended because it was very well organized and I had a good balance between working on things and helping others.

The main theme for me for this hackathon was translatewiki.net. This was great, because recently I have not had as much time to work on improving translatewiki.net as I used to. But this does not mean there hasn’t been any, I just haven’t made any noise about it. For example, we have greatly increased automation for importing and exporting translations, new projects are being added, the operating systems have been updates, and so on. But let’s talk about what happened during the hackathon.

I worked with with Nemo_bis and Siebrand (they did most of the work) to go over backlog of support requests from translatewiki.net. We addressed more than half of 101 open support requests, also by adding support for 7 locales. Relatedly, we also helped a couple of people to start translating at translatewiki.net or to contribute to CLDR language database for their language.

Siebrand and I held an open post-mortem (anyone could join) about a 25 hours downtime that happened to translatewiki.net just before the event. There we reflected how we handled the situation, and how we could we do better in the future. The main take-aways are better communication (twitter, status page), server upgrade and using it for increased redundancy and periodically doing restoration practices to ensure we can restore quickly if the need arises.

Amir revived his old project that allows translating messages using a chat application (the prototype uses Telegram). Many people (at least Amir, Taras, Mt.Du and I) worked on different aspects on that project. I installed the MediaWiki OAuth extension (without which it would not be possible to do the translations using the correct user name) to translatewiki.net, and gave over-the-shoulder help for the coders.

Hackathon attendees working on their computers. Photo CC-BY-SA 3.0 by Nemo_bis

As always, some bugs were found and fixed during the hackathon. I fixed an issue in Translate where the machine translation suggestions using the Apertium service hosted by Wikimedia Foundation were not showing up. I also reported an issue with our discussion extension (LiquidThreads) having two toolbars instead of one. This was quickly fixed by Bartosz and Ed.

Finally, I would advertise a presentation about MediaWiki best practices I gave in the Fantastic MediaWikis track. It summarizes a few of the best practices I have come up during my experience maintaining translatewiki.net and many other MediaWiki sites. It has tips about deployment, job queue configuration and short main page URLs and slides are available.

As a small bonus, I finally updated my blog to use https, so that I could write and that you could read this post safely knowing that nobody else but me could have put all the bad puns in the post.

MediaWiki short urls with nginx and main page without redirect

This post has been updated 2015-09-06 with simplified code suggested by Krinkle and again in 2017-04-04.

Google PageSpeed Insights writes:

Redirects trigger an additional HTTP request-response cycle and delay page rendering. In the best case, each redirect will add a single round trip (HTTP request-response), and in the worst it may result in multiple additional round trips to perform the DNS lookup, TCP handshake, and TLS negotiation in addition to the additional HTTP request-response cycle. As a result, you should minimize use of redirects to improve site performance.

Let’s consider the situation where you run MediaWiki as the main thing on your domain. When user goes to your domain example.com, MediaWiki by default will issue a redirect to example.com/wiki/Main_Page, assuming you have configured the recommended short urls.

In addition the short url page writes:

Note that we do not recommend doing a HTTP redirect to your wiki path or main page directly. As redirecting to the main page directly will hard-code variable parts of your wiki’s page setup into your server config. And redirecting to the wiki path will result in two redirects. Simply rewrite the root path to MediaWiki and it will take care of the 301 redirect to the main page itself.

So are we stuck with a suboptimal solution? Fortunately, there is a way and it is not even that complicated. I will share example snippets from translatewiki.net configuration how to do it.

Configuring nginx

For nginx, the only thing we need in addition the default wiki short url rewrite is to rewrite / so that it is forwarded to MediaWiki. The configuration below assumes MediaWiki is installed in the w directory under the document root.

location ~ ^/wiki/ {
	rewrite ^ /w/index.php;
}

location = / {
	rewrite ^ /w/index.php;
}

Whole file for the curious.

Configuring MediaWiki

First, in our LocalSettings.php we have the short url configuration:

$wgArticlePath      = "/wiki/$1";
$wgScriptPath       = "/w";

In addition we use hooks to tell MediaWiki to make / the URL for the main page, not to be redirected:

$wgHooks['GetLocalURL'][] = function ( &$title, &$url, $query ) {
	if ( $title->isExternal() && $query != '' && $title->isMainPage() ) {
		$url = '/';
	}
};

// Tell MediaWiki that "/" should not be redirected
$wgHooks['TestCanonicalRedirect'][] = function ( $request ) {
	return $request->getRequestURL() !== '/';
};

This has the added benefit that all MediaWiki generated links to the main page point to the domain root, so you only have one canonical url for the wiki main page. The if block in the middle with strpos checks for problematic characters ? and & and forces them to use the long URLs, because otherwise they would not work correctly with this nginx rewrite rule.

And that’s it. With these changes you can have your main page displayed on your domain without redirect, also keeping it short for users to copy and share. This method should work for most versions of MediaWiki, including MediaWiki 1.26 which forcefully redirects everything that doesn’t match the canonical URL as seen by MediaWiki.

translatewiki.net – harder, better, faster, stronger

I am very pleased to announce that translatewiki.net has been migrated to new servers sponsored by netcup GmbH. Yes, that is right, we now have two servers, both of which are more powerful than the old server.

Since the two (virtual) servers are located in the same data center and other nitty gritty details, we are not making them redundant for the sake of load balancing or uptime. Rather, we have split the services: ElasticSearch runs on one server, powering the search, translation search and translation memory; everything else runs on the other server.

In addition to faster servers and continuous performance tweaks, we are now faster thanks to the migration from PHP to HHVM. The Wikimedia Foundation did this a while ago with great results, but HHVM has been crashing and freezing on translatewiki.net for unknown reasons. Fortunately, recently I found a lead that the issue is related to a ini_set function, which I was easily able to work around while the investigation on the root cause continues.

Non-free Google Analytics confirms that we now server pages faster.

Non-free Google Analytics confirms that we now serve pages faster: the small speech bubble indicates migration day to new servers and HHVM. Effect on the actual page load times observed by users seems to be less significant.

We now have again lots of room for growth and I challenge everyone to make us grow with more translations, new projects or other legitimate means, so that we reach a point where we will need to upgrade again ;). That’s all for now, stay tuned for more updates.

14 more languages “fully” translated this week

This week, MediaWiki’s priority messages have been fully translated in 14 more languages by about a dozen translators, after we checked our progress. Most users in those languages now see the interface of Wikimedia wikis entirely translated.

In two months since we updated the list of priority translations, languages 99+ % translated went from 17 to 60. No encouragement was even needed: those 60 languages are “organically” active, translators quickly rushed to use the new tool we gave them. Such regular and committed translators deserve a ton of gratitude!

However, we want to do better. We did something simple: tell MediaWiki users that they can make a difference, even if they don’t know. «With a couple hours’ work or less, you can make sure that nearly all visitors see the wiki interface fully translated.» The results we got in few hours speak for themselves:

Special:TranslationStats graph of daily registrations

This week’s peak of new translator daily registrations was ten times the usual

Special:TranslationStats of daily active translators

Many were eager to help: translation activity jumped immediately

Thanks especially to CERminator, David1010, EileenSanda, KartikMistry, Njardarlogar, Pymouss, Ranveig, Servien, StanProg, Sudo77(new), TomášPolonec and Чаховіч Уладзіслаў, who completed priority messages in their languages.

For the curious, the steps to solicit activity were:

take languages 80–99 % translated (about 60);
list active users using those languages on Wikimedia wikis (about 600);
send them a short talk page message (example);
welcome and thank them!

There is a long tail of users who see talk page messages only after weeks or months, so for most of those 60 languages we hope to get more translations later. It will be harder to reach the other hundreds languages, for which there are only 300 active users in Wikimedia according to interface language preferences: about 100 incubating languages do not have a single known speaker on any wiki!

We will need a lot of creativity and word spreading, but the lesson is simple: show people the difference that their contribution can make for free knowledge; the response will be great. Also, do try to reach the long tail of users and languages: if you do it well, you can communicate effectively to a large audience of silent and seemingly unresponsive users on hundreds Wikimedia projects.

Prioritizing MediaWiki’s translation strings

After a very long wait, MediaWiki’s top 500 most important messages are back at translatewiki.net with fresh data. This list helps translators prioritize their work to get most out of their effort.

What are the most important messages

In this blog post the term message means a translatable string in a software; technically, when a message is shown to users, they see different strings depending on the interface language.

MediaWiki software includes almost 5.000 messages (~40.000 words), or almost 24.000 messages (~177.000 words) if we include extensions. Since 2007, we make a list of about 500 messages which are used most frequently.

Why? If translators can translate few hundreds words per hour, and translating messages is probably slower than translating running text, it will take weeks to translate everything. Most of our volunteer translators do not have that much time.

Assuming that the messages follow a long tail pattern, a small number of messages are shown* to users very often, like the Edit button at the top of page in MediaWiki. On the other hand, most messages are only shown on rare error conditions or are part of disabled or restricted features. Thus it makes sense to translate the most visible messages first.

Concretely, translators and i18n fans can monitor the progress of MediaWiki localisation easily, by finding meaningful numbers in our statistics page; and we have an clear minimum service level for new locales added to MediaWiki. In particular, the Wikimedia Language committee requires that at very least all the most important messages are translated in a language before that language is given a Wikimedia project subdomain. This gives an incentive to kickstart the localisation in new languages, ensures that users see Wikimedia projects mostly in their own language and avoids linguistic colonialism.

The screenshot shows an example page with messages replaced by their key instead of their string content. Click for full size.

Some history and statistics

The usage of the list for monitoring was fantastically impactful in 2007 and 2009 when translatewiki.net was still ramping up, because it gave translators concrete goals and it allowed to streamline the language proposal mechanism which had been trapped into a dilemma between a growing number of requests for language subdomains and a growing number of seemingly-dead open subdomains. There is some more background on translatewiki.net.

Languages with over 99 % most used messages translated were:

38 in 2007 shortly after the new list;
96 one year later against the same list (or 100 with the new list shortly after) , which means the list helped triple the coverage;
181 in 2011 with the 2009 list, but actually 82 after updating; back to 115 6 months later and 143 after 6 more months;
113 in latest available data for the 2011 list (April 2014), only 17 right now with the new list.

There is much more to do, but we now have a functional tool to motivate translators! To reach the peak of 2011, the least translated language among the first 181 will have to translate 233 messages, which is a feasible task. The 300th language is 30 % translated and needs 404 more translations. If we reached such a number, we could confidently say that we really have Wikimedia projects in 280+ languages, however small.

* Not necessarily seen: I’m sure you don’t read the whole sidebar and footer every time you load a page in Wikipedia.

Process

At Wikimedia, first, for about 30 minutes we logged all requests to fetch certain messages by their key. We used this as a proxy variable to measure how often a particular message is shown to the user, which again is a proxy of how often a particular message is seen by the user. This is in no way an exact measurement, but I believe it good enough for the purpose. After the 30 minutes, we counted how many times each key was requested and we sorted by frequency. The result was a list containing about 17.000 different keys observed in over 15 million calls. This concluded the first phase.

In the second phase, we applied a rigorous human cleanup to the list with the help of a script, as follows:

We removed all keys not belonging to MediaWiki or any extension. There are lots of keys which can be customized locally, but which don’t correspond to messages to translate.
We removed all messages which were tagged as “ignored” in our system. These messages are not available for translation, usually because they have no linguistic content or are used only for local site-specific customization.
We removed messages called less than 100 times in the time span and other messages with no meaningful linguistic content, like messages where there are only dashes or other punctuation which usually don’t need any changes in translation.
We removed any messages we judged to be technical or not shown often to humans, even though they appeared high in this list. This includes some messages which are only seen inside comments in the generated HTML and some messages related to APIs or EXIF features.

Finally, some coding work was needed by yours truly to let users select those messages for translation at translatewiki.net.

Discoveries

In this process some points emerged that are worth highlighting.

310 messages (62 %) of the previous list (from 2011) are in the new list as well. Superseded user login messages have now been removed.
Unsurprisingly, there are new entries from new highly visible extensions like MobileFrontend, Translate, Collection and Echo. However, except a dozen languages, translators didn’t manage to keep up with such messages in absence of a list.
I just realized that we are probably missing some high visibility messages only used in the JavaScript side. That is something we should address in the future.
We slightly expanded the list from 500 to 600 messages, after noticing there were few or no “important” messages beyond this point. This will also allow some breathing space to remove messages which get removed.
We did not follow a manual passage as in the original list, which included «messages that are not that often used, but important for a proper look and feel for all users: create account, sign on, page history, delete page, move page, protect page, watchlist». A message like “watchlist” got removed, which may raise suspicions: but it’s “just” the HTML title of Special:Watchlist, more or less as important as the the name “Special:Watchlist” itself, which is not included in the list either (magic words, namespaces or special pages names are not included). All in all, the list seems plausible.

Conclusion

Finally, the aim was to make this process reproducible so that we could do it yearly, or even more often. I hope this blog post serves as a documentation to achieve that.

I want to thank Ori Livneh for getting the key counts and Nemo for curating the list.

Oregano deployment tool

This blog post introduces oregano, a non-complex, non-distributed, non-realtime deployment tool. It currently consists of less than 100 lines of shell script and is licensed under the MIT license.

The problem. For a very long time, we have run translatewiki.net straight from a git clone, or svn checkout before that. For years, we have been the one wiki which systematically run latest master, with few hours of delay. That was not a problem while we were young and wild. But nowadays, due to the fact that we carry dozens of local patches and thanks to the introduction of composer, it is quite likely that git pull --rebase will stop in a merge conflict. As a consequence, updates have become less frequent, but have semi-regularly brought the site down for many minutes until the merge conflicts were manually resolved. This had to change.

The solution. I wrote a simple tool, probably re-inventing the wheel for the hundredth time, which separates the current deployment in two stages: preparation and pushing out new code. Since I have been learning a lot about Salt and its quirks, I named my tool “oregano”.

How it works. Basically, oregano is a simple wrapper for symbolic links and rsync. The idea is that you prepare your code in a directory named workdir. To deploy the current state in workdir, you must first create a read-only copy by running oregano tag. After that, you can run oregano deploy, which will update symbolic links so that your web server sees the new code. You can give the name of the tag with both commands, but by default oregano will name a new tag after the current timestamp, and deploy the most recently created tag. If, after deploying, you find out that the new tag is broken, you can quickly go back to the previously deployed code by running oregano rollback. Below this is shown as a command line tutorial.

mkdir /srv/mediawiki/ # the path does not matter, pick whatever you want

cd /srv/mediawiki

# Get MediaWiki. Everything we want to deploy must be inside workdir
git clone https://github.com/wikimedia/mediawiki workdir

oregano tag
oregano deploy

# Now we can use /srv/mediawiki/targets/deployment where we want to deploy
ln -s /srv/mediawiki/targets/deployment /www/example.com/docroot/mediawiki

# To update and deploy a new version
cd workdir
git pull
# You can run maintenance scripts, change configuration etc. here
nano LocalSettings.php

cd .. # Must be in the directory where workdir is located
oregano tag
oregano deploy

# Whoops, we accidentally introduced a syntax error in LocalSettings.php
oregano rollback

As you can see from above, it is still possible to break the site if you don’t check what you are deploying. For this purpose I might add support for hooks, so that one could run syntax checks whose failure would prevent deploying that code. Hooks would also be handy for sending IRC notifications, which is something our existing scripts do when code is updated: as pushing out code is now a separate step, they are currently incorrect.

By default oregano will keep the 4 newest tags, so make sure you have enough disk space. For translatewiki.net, which has MediaWiki and dozens of extensions, each tag takes about 200M. If you store MediaWiki localisation cache, pre-generated for all languages, inside workdir, then you would need 1.2G for each tag. Currently, at translatewiki.net, we store localisation cache outside workdir, which means it is out of sync with the code. We will see if that causes any issues; we will move it inside workdir if needed. Do note that oregano creates a tag with rsync --cvs-exclude to save space. That also has the caveat that you should not name files or directories as core. Be warned; patches welcome.

The code is in the translatewiki repo but, if there is interest, I can move it to a separate repository in GitHub. Oregano is currently used in translatewiki.net and in a pet project of mine nicknamed InTense. If things go well, expect to hear more about this mysterious pet project in the future.

Midsummer cleanup: YAML and file formats, HHVM, translation memory

Wikimania 2014 is now over and that is a good excuse to write updates about the MediaWiki Translate extension and translatewiki.net.
I’ll start with an update related to our YAML format support, which has always been a bit shaky. Translate supports different libraries (we call them drivers) to parse and generate YAML files. Over time the Translate extension has supported four different drivers:

spyc uses spyc, a pure PHP library bundled with the Translate extension,
syck uses libsyck which is a C library (hard to find any details) which we call by shelling out to P erl,
syck-pecl uses libsyck via a PHP extension,
phpyaml uses the libyaml C library via a PHP extension.

The latest change is that I dropped syck-pecl because it does not seem to compile with PHP 5.5 anymore; and I added phpyaml. We tried to use sypc a bit but the output it produced for localisation files was not compatible with Ruby projects: after complaints, I had to find an alternative solution.

Joel Sahleen let me know of phpyaml, which I somehow did not found before: thanks to him we now use the same libyaml library that Ruby projects use, so we should be fully compatible. It is also the fastest driver of the four. Anyone generating YAML files with Translate is highly recommended to use the phpyaml driver. I have not checked how phpyaml works with HHVM but I was told that HHVM ships with a built-in yaml extension.

Speaking of HHVM, the long standing bug which causes HHVM to stop processing requests is still unsolved, but I was able to contribute some information upstream. In further testing we also discovered that emails sent via the MediaWiki JobQueue were not delivered, so there is some issue in command line mode. I have not yet had time to investigate this, so HHVM is currently disabled for web requests and command line.

I have a couple of refactoring projects for Translate going on. The first is about simplifying the StringMangler interface. This has no user visible changes, but the end goal is to make the code more testable and reduce coupling. For example the file format handler classes only need to know their own keys, not how those are converted to MediaWiki titles. The other refactoring I have just started is to split the current MessageCollection. Currently it manages a set of messages, handles message data loading and filters the collection. This might also bring performance improvements: we can be more intelligent and only load data we need.

Aiming high: creating a translation memory that works for Wikipedia; even though a long way from here (photo Marie-Lan Nguyen, CC BY 3.0)

Finally, at Wikimania I had a chance to talk about the future of our translation memory with Nik Everett and David Chan. In the short term, Nik is working on implementing in ElasticSearch an algorithm to sort all search results by edit distance. This should bring translation memory performance on par with the old Solr implementation. After that is done, we can finally retire Solr at Wikimedia Foundation, which is much wanted especially as there are signs that Solr is having problems.

Together with David, I laid out some plans on how to go beyond simply comparing entire paragraphs by edit distance. One of his suggestions is to try doing edit distance over words instead of characters. When dealing with the 300 or so languages of Wikimedia, what is a word is less obvious than what is a character (even that is quite complicated), but I am planning to do some research in this area keeping the needs of the content translation extension in mind.

Translatewiki.net summer update

It’s been a busy while since last update, but how could I have not worked on translatewiki.net? ;) Here is an update on my current activities.
In this episode:

we provide translations for over 70 % of users of the new Wikipedia app,
I read a book on networking performance and get needy for speed,
ElasticSearch tries to eat all of us and our memory,
HHVM finds the place not fancy enough,
Finns and Swedes start cooperating.

Performance

Naturally, I have been thinking of ways to further improve translatewiki.net performance. I have been running HHVM as a beta feature at translatewiki.net many months now, but I have kept turning it on and off due to stability issues. It is currently disabled, but my plan is to try the Wikimedia packaged version of HHVM. Those packages only work in Ubuntu 2014.04, so Siebrand and I first have to upgrade the translatewiki.net server from Ubuntu 2012.04, as we plan to later this month (July). (Update: done as of 2014-07-09, 14 UTC.)

Map of some translatewiki.net translators

A global network of translators is not served well enough from a single location

After reading a book about networking performance I finally decided to give a content distribution network (CDN) a try. Not because they can optimize and cache things on the fly [1], nor because the can do spam protection [2], but because CDN can reduce latency, which is usually the main bottleneck of web browsing. We only have single server in Germany, but our users are international. I am close to the server, so I have much better experience than many of our users. I do not have any numbers yet, but I will do some experiments and gather some numbers to see whether CDN helps us.

[1] MediaWiki is already very aggressive in terms of optimizations for resource delivery.
[2] Restricting account creation already eliminated spam on our wiki.

Wikimedia Mobile Apps

Amir and I have been closely working with the Wikimedia Mobile Apps team to ensure that their apps are well supported. In just a couple weeks, the new app was translated in dozens languages and released, with over 7 millions new installations by non-English users (74 % of the total).

In more detail, we finally addressed a longstanding issue in the Android app which prevented translation of strings containing links. I gave Yuvi access to synchronize translations, ensuring that translators have as much time as possible to translate and the apps have the latest updates before being released. We also discussed about how to notify translators before releases to get more translations in time, and about improvements to their i18n frameworks to bring their flexibility more in line with MediaWiki (including plural support).

To put it bluntly, for some reason the mobile i18n frameworks are ugly and hard to work with. Just as an example, Android did not support many languages at all just for one character too much; support is still partial. I can’t avoid comparing this to the extra effort which has been needed to support old versions of Internet Explorer: we would rather be doing other cool things, but the environment is not going to change anytime soon.

Search

I installed and enabled CirrusSearch on translatewiki.net: for the first time, we have a real search engine for all our pages! I had multiple issues, including running a bit tight on memory while indexing all content.

Translate’s translation memory support for ElasticSearch has been almost ready for a while now. It may take a couple months before we’re ready to migrate from Solr (first on translatewiki.net, then Wikimedia sites). I am looking forward to it: as a system administrator, I do not want to run both Solr and ElasticSearch.

I want to say big thanks to Nik for helping both with the translation memory ElasticSearch backend and my CirrusSearch problems.

Wikimedia Sweden launches a new project

I am expecting to see an increased activity and new features at translatewiki.net thanks to a new project by Wikimedia Sweden together with InternetFonden.Se. The project has been announced on the Wikimedia blog, but in short they want to bring more Swedish translators, new projects for translation and possibly open badges to increase translator engagement. They are already looking for feedback, please do share your thoughts.

Summary of Translate workshop at Zürich hackathon

The hall always provided power and wifi for eager hackers (photo CC-BY-SA by Ludovic Péron)

I held a Translate workshop at the Zürich hackathon. Naturally, others and I worked on Translate and translatewiki.net outside of the workshop as well. Here is a summary of the outcomes.

The workshop itself consisted of three topics of interest. I gave an introduction about the Content translation project, going over the basic design and features, followed by a Q&A. We then split into three small groups. One group continued talking about translating content in wider scope. The second group went over how to add new projects to translatewiki.net, using Huggle and Sharelatex as a concrete example. The third group consisted of me helping with programming questions about the Translate extension.

During the whole hackathon people worked on about 20 bugs and patches. I started a patch for glossary support in the Translate extension: a proof of concept, as simple as possible.

Numbers on translatewiki.net sign-up process

Translatewiki.net features a good user experience for non-technical translators. A crucial or even critical component is signing up. An unrelated data collection for my PhD studies inspired me to get some data on the translatewiki.net user registration process. I will present the results below.

History

At translatewiki.net the process of becoming an approved translator has been, arguably, complicated in some periods.

In the early days of the wiki, permissions were not clearly separated: hundreds users were just given the full set of permissions to edit the MediaWiki namespace and translate that way.

Later, we required people to go through hoops of various kind after registering to be approved as translators. They had to create a user page with certain elements and post a request on a separate page and they would not get notifications when they were approved unless they tweaked their preferences.

At some point, we started using the LiquidThreads extension: now the users could get notifications when approved, at least in theory. That brought its own set of issues though: many people thought that the LiquidThreads search box on the requests page was the place where to write the title of their request. After entering a title, they ended up in a search results page, which was a dead end. This usability issue was so annoying and common that I completely removed the search field from LiquidThreads.
In early 2010 we implemented a special page wizard (FirstSteps) to guide users though the process. For years, this has allowed new users to get approved, and start translating, in few clicks and a handful hours after registering.

In late 2013 we enabled the new main page containing a sign-up form. Using that form, translators can create an account in a sandbox environment. Accounts created this way are normal user accounts except that they can only make example translations to get a feel of the system. Example translations give site administrators some hints on whether to approve or reject the request and approve the user as a translator.

Data collection

The data we have is not ideal.

For example, it is impossible to say what’s our conversion rate from users visiting the main page to actual translators.
A lot of noise is added by spam bots which create user accounts, even though we have a CAPTCHA.
When we go far back in the history, the data gets unreliable or completely missing.
- We only have dates for account created after 2006 or so.
- The log entry format for user permissions has changed multiple times, so the promotion times are missing or even incorrect for many entries until a few years back.

The data collection was made with two scripts I wrote for this purpose. The first script produces a tab separated file (tsv) containing all accounts which have been created. Each line has the following fields:

username,
time of account creation,
number of edits,
whether the user was approved as translator,
time of approval and
whether they used the regular sign-up process or the sandbox.

Some of the fields may be empty because the script was unable to find the data. User accounts for which we do not have account creation time are not listed. I chose not to try some methods which can be used to approximate the account creation time, because the data in that much past is too unreliable to be useful.

The first script takes a couple of minutes to run at translatewiki.net, so I split further processing to a separate script to avoid doing the slow data fetching many times. The second script calculates a few additional values like average and median time for approval and aggregates the data per month.

The data also includes translators who signed up through the sandbox, but got rejected: this information is important for approval rate calculation. For them, we do not know the exact registration date, but we use the time they were rejected instead. This has a small impact on monthly numbers, if a translator registers in one month and gets rejected in a later month. If the script is run again later, numbers for previous months might be somewhat different. For approval times there is no such issue.

Results

Image 1: Account creations and approved translators at translatewiki.net

Image 1 displays all account creations at translatewiki.net as described above, simply grouped by their month of account creation.

We can see that approval rate has gone down over time. I assume this is caused by spam bot accounts. We did not exclude them hence we cannot tell whether the approval rate has gone up or down for human users.

We can also see that the number of approved translators who later turn out to be prolific translators has stayed pretty much constant each month. A prolific translator is an approved translator who has made at least 100 edits. The edits can be from any point of time, the script is just looking at current edit count so the graph above doesn’t say anything about wiki activity at any point in time.

There is an inherent bias towards old users for two reasons. First, at the beginning translators were basically invited to a new tool from existing methods they used, so they were likely to continue to translate with the new tool. Second, new users have had less time to reach 100 edits. On the other hand, we can see that a dozen translators even in the past few months have already made over 100 edits.

I have collected some important events below, which I will then compare against the chart.

2009: Translation rallies in August and December.
2010-02: The special page to assist in filing translator requests was enabled.
2010-04: We created a new (now old) main page.
2010-10: Translation rally.
2011: Translation rallies in April, September and December.
2012: Translation rallies in August and December.
2013-12: The sandbox sign-up process was enabled.

There is an increase in account creations and approved translators a few months after the assisting special page was enabled. The explanation of this is likely to be the new main page which had a big green button to access the special page. The September translation rally in 2011 seems to be very successful in requiting new translators, but also the other rallies are visible in the chart.

Image 2: How long it takes for account creation to be approved.

The second image shows how long it takes from the account creation for a site administrator to approve the request. Before sandbox, users had to submit a request to become translators on their own: the time for them to do so is out of control of the site administrators. With sandbox, that is much less the case, as users get either approved or rejected in a couple of days. Let me give an overview of how the sandbox works.

All users in the sandbox are listed on a special page together with the sandbox translations they have made. The administrators can then approve or reject the users. Administrators usually wait until the user has made a handful translations. Administrators can also send email reminders for the users to make more translations. If translators do not provide translations within some time, or the translations are very bad, they will get rejected. Otherwise they will be approved and can immediately start using the full translation interface.

We can see that the median approval time is just a couple of hours! The average time varies wildly though. I am not completely sure why, but I have two guesses.
First, some very old user accounts have reactivated after being dormant for months or years and have finally requested translator rights. Even one of these can skew the average significantly. On a quick inspection of the data, this seems plausible.
Second, originally we made all translators site administrators. At some point, we introduced the translator user group, and existing translators have gradually been getting this new permission as they returned to the site. The script only counts the time when they were added to the translator group.
Alternatively, the script may have a bug and return wrong times. However, that should not be the case for recent years because the log format has been stable for a while. In any case, the averages are so big as to be useless before the year 2012, so I completely left them out of the graph.

The sandbox has been in use only for a few months. For January and February 2014, the approval rate has been slightly over 50%. If a significant portion of rejected users are not spam bots, there might be a reason for concern.

Suggested action points

Store the original account creation date and “sandbox edit count” for rejected users.
Investigate the high rejection rate. We can ask the site administrator why about a half of the new users are rejected. Perhaps we can also have “mark as spam” action to get insight whether we get a lot of spam. Event logging could also be used, to get more insight on the points of the process where users get stuck.

Source material

Scripts are in Gerrit. Version ‘2’ of the scripts was used for this blog post. Processed data is in a LibreOffice spreadsheet. Original and updated data is available on request, please email me.

-- Niklas Laxström.

It rains like a saavi

About me, me and me