GNU i18n for high priority projects list

Today, for a special occasion, I’m hosting this guest post by Federico Leva, dealing with some frequent topics of my blog.

A special GNU committee has invited everyone to comment on the selection of high priority free software projects (thanks M.L. for spreading the word).

In my limited understanding from looking every now and then in the past few years, the list so far has focused on “flagship” projects which are perceived to be the biggest opportunities, or roadblocks to remove, for the goal of having people only use free/libre/open source software.

A “positive” item is one which makes people want to embrace GNU/Linux and free software in order to use it: «I want to use Octave because it’s more efficient». A “negative” item is an obstacle to free software adoption, which we want removed: «I can’t use GNU/Linux because I need AutoCAD for work».

We want to propose something different: a cross-functional project, which will benefit no specific piece of software, but rather all of them. We believe that the key to the success of each and every free software project is going to be internationalization and localization. No i18n can help if the product is bad: here we assume that the idea of the product is sound and that we are able to scale its development, but we “just” need more users, more work.

What we believe

If free software is about giving control to the user, we believe it must also be about giving control of the language to its speakers. Proper localisation of software can only be done by people with a particular interest and competence in it, ideally native speakers who use the software.

It’s clear that there is little overlap between this group and developers; if nothing else, because most free software projects have at most a handful of developers: all together, they can only know a fraction of the world’s languages. Translation is not, and can’t be, a subset of programming. A GNOME dataset showed a strong specialisation of documenters, coders and i18n contributors.

We believe that the only way to put them in control is to translate the wiki way: easily, the only requirement being language competency; with no or very low barriers on access; using translations immediately in the software; correcting after the fact thanks to their usage, not with pre-publishing gatekeeping.

Translation should not be a labyrinth

In most projects, the i18n process is hard to join and incomprehensible, if explained at all. GNOME has a nice description of their workflow, which however is a perfect example of what the wiki way is not.

A logical consequence of the wiki way is that not all translators will know the software inside out. Hence, to translate correctly, translators need message documentation straight in their translation interface (context, possible values of parameters, grammatical role of words, …): we consider this a non-negotiable feature of any system chosen. Various research agrees.

Ok, but why care?

I18n is a recipe for success

First. Developers and experienced users are often affected by the software localisation paradox, which means they only use software in English and will never care about l10n even if they are in the best position to help it. At this point, they are doomed; but the computer users of the future, e.g. students, are not. New users may start using free software simply because of not knowing English and/or because it’s gratis and used by their school; then they will keep using it.

With words we don’t like much, we could say: if we conquer some currently marginal markets, e.g. people under a certain age or several countries, we can then have a sufficient critical mass to expand to the main market of a product.

Research is very lacking on this aspect: there was quite some research on predicting viability of FLOSS projects, but almost nothing on their i18n/l10n and even less on predicting their success compared to proprietary competitors, let alone on the two combined. However, an analysis of SourceForge data from 2009 showed that there is a strong correlation between high SourceForge rank and having translators (table 5): for successful software, translation is the “most important” work after coding and project management, together with documentation and testing.

Second. Even though translation must not be like programming, translation is a way to introduce more people into the contributor base of each piece of software. Eventually, if they become more involved, translators will get in touch with the developers and/or the code, and potentially contribute there as well. In addition to this practical advantage, there’s also a political one: having one or two orders of magnitude more contributors to free software, worldwide, gives our ideas and actions a much stronger base.

Practically speaking, every package should be i18n-ready from the beginning (the investment pays back immediately) and its “Tools”/“Help” menu, or similarly visible interface element, should include a link to a website where everyone can join its translation. If the user’s locale is not available, the software should actively encourage joining translation.

Arjona Reina et al. 2013, based on the observation of 41 free software projects and 22 translation tools, actually claim that recruiting, informing and rewarding the translators is the most important factor for success of l10n, or even the only really important one.

Exton, Wasala et al. also suggest receiving in situ translations in a “crowdsourcing” or “micro-crowdsourcing” limbo, which we find superseded by a wiki. In fact, they end up requiring a “reviewing mechanism such as observed in the Wikipedia community” anyway, in addition to a voting system. Better keep it simple and use a wiki in the first place.

Third. Extensive language support can be a clear demonstration of the power of free software. Unicode CLDR is an effort we share with companies like Microsoft or Apple, yet no proprietary software in the world can support 350 languages like MediaWiki. We should be able to say this of free software in general, and have the motives to use free software include i18n/l10n.

Research agrees that free software is more favourable for multilingualism because compared to proprietary software translation is more efficient, autonomous and web-based (Flórez & Alcina, 2011; citing Mas 2003, Bowker et al. 2008).

The obstacle here is linguistic colonialism, namely the self-disrespect billions of humans have for their own language. Language rights are often neglected and «some languages dominate» the web (UNO report A/HRC/22/49, §84); but many don’t even try to use their own language even where they could. The solution can’t be exclusively technical.

Fourth. Quality. Proprietary software we see in the wild has terrible translations (for example Google, Facebook, Twitter). They usually use very complex i18n systems or they give up on quality and use vote-based statistical approximation of quality; but the results are generally bad. A striking example is Android, which is “open source” but whose translation is closed as in all Google software, with terrible results.

How to reach quality? There can’t be an authoritative source for what’s the best translation of every single software string: the wiki way is the only way to reach the best quality; by gradual approximation, collaboratively. Free software can be more efficient and have a great advantage here.

Indeed, quality of available free software tools for translation is not a weakness compared to proprietary tools, according to the same Flórez & Alcina, 2011: «Although many agencies and clients require translators to use specific proprietary tools, free programmes make it possible to achieve similar results».

We are not there yet

Many have the tendency to think they have “solved” i18n. The internet is full of companies selling i18n/l10n services as if they had found the panacea. The reality is, most software is not localised at all, or is localised in very few languages, or has terrible translations. Explaining the reasons is not the purpose of this post; we have discussed or will discuss the details elsewhere. Some perspectives:

Gettext is powerful but problematic (cf. the Gettext Localisation horror story, 1999).
Mozilla i18n and L20n have an unclear direction and a tendency to turn localisation into programming.
We know little of most proprietary software.
In the translatewiki.net intro, Translating the wiki way (video) and Localisation for developers (doc) we try to explain what matters for us and how we do things at translatewiki.net.

A 2000 survey confirms that education about i18n is most needed: «There is a curious “localisation paradox”: while customising content for multiple linguistic and cultural market conditions is a valuable business strategy, localisation is not as widely practised as one would expect. One reason for this is lack of understanding of both the value and the procedures for localisation.»

Can we win this battle?

We believe it’s possible. What is written above may look too abstract, but it’s intentionally so. Figuring out the solution is not something we can do in this document, because making i18n our general strength is a difficult project: that’s why we argue it needs to be in the high priority projects list.

The initial phase will probably be one of research and understanding. As shown above, we have opinions everywhere, but too little scientific evidence on what really works: this must change. Where evidence is available, it should be known more than it currently is: a lot of education on i18n is needed. Sharing and producing knowledge also implies discussion, which helps the next step.

The second phase could come with a medium term concrete goal: for instance, it could be decided that within a couple years at least a certain percentage of GNU software projects should (also) offer a modern, web-based, computer-assisted translation tool with low barriers on access etc., compatible with the principles above. Requirements will be shaped by the first phase (including the need to accommodate existing workflows, of course).

This would probably require setting up a new translation platform (or giving new life to an existing one), because current “bigs” are either insufficiently maintained (Pootle and Launchpad) or proprietary. Hopefully, this platform would embed multiple perspectives and needs of projects way beyond GNU, and much more un-i18n’d free software would gravitate here as well.

A third (or fourth) phase would be about exploring the uncharted territory with which we share so little, like the formats, methods and CAT tools existing out there for translation of proprietary software and of things other than software. The whole translation world (millions of translators?) deserves free software. For this, a way broader alliance will be needed, probably with university courses and others, like the authors of Free/Open-Source Software for the Translation Classroom: A Catalogue of Available Tools and tuxtrans.

“What are you doing?”

Fair question. This proposal is not all talk. We are doing our best, with the tools we know. One of the challenges, as Wasala et al. say, is having a shared translation memory to make free software translation more efficient: so, we are building one. InTense is our new showcase of free software l10n and uses existing translations to offer an open translation memory to everyone; we believe we can eventually include practically all free software in the world.

For now, we have added a few dozen GNU projects and others, with 55 thousand strings and about 400 thousand translations. See also the translation interface for some examples.

If translatewiki.net is asked to do its part, we are certainly available. MediaWiki has the potential to scale incredibly, after all: see Wikipedia. In the future, a wiki like InTense could be switched from read-only to read/write and become an über-translatewiki.net, translating thousands of projects.

But that’s not necessarily what we’re advocating for: what matters is the result, how much more well-localised software we get. In fact, MediaWiki gave birth to thousands of wikis; and its success is also in its principles being adopted by others, see e.g. the huge StackExchange family (whose Q&A are wikis and use a free license, though more individual-centred).

Maybe the solution will come with hundreds or thousands of separate installs of one or a handful of software platforms. Maybe the solution will not be to “translate the wiki way”, but a similar yet different concept, which still puts the localisation in the hands of users, giving them real freedom.

What do you think? Tell us in the comments.

Oregano deployment tool

This blog post introduces oregano, a non-complex, non-distributed, non-realtime deployment tool. It currently consists of fewer than 100 lines of shell script and is licensed under the MIT license.

The problem. For a very long time, we have run translatewiki.net straight from a git clone, or svn checkout before that. For years, we have been the one wiki which systematically runs the latest master, with a few hours of delay. That was not a problem while we were young and wild. But nowadays, because we carry dozens of local patches and because of the introduction of composer, it is quite likely that git pull --rebase will stop in a merge conflict. As a consequence, updates have become less frequent, but have semi-regularly brought the site down for many minutes until the merge conflicts were manually resolved. This had to change.

The solution. I wrote a simple tool, probably re-inventing the wheel for the hundredth time, which separates the current deployment in two stages: preparation and pushing out new code. Since I have been learning a lot about Salt and its quirks, I named my tool “oregano”.

How it works. Basically, oregano is a simple wrapper for symbolic links and rsync. The idea is that you prepare your code in a directory named workdir. To deploy the current state in workdir, you must first create a read-only copy by running oregano tag. After that, you can run oregano deploy, which will update symbolic links so that your web server sees the new code. You can give the name of the tag with both commands, but by default oregano will name a new tag after the current timestamp, and deploy the most recently created tag. If, after deploying, you find out that the new tag is broken, you can quickly go back to the previously deployed code by running oregano rollback. This is shown below as a command-line tutorial.

mkdir /srv/mediawiki/ # the path does not matter, pick whatever you want

cd /srv/mediawiki

# Get MediaWiki. Everything we want to deploy must be inside workdir
git clone https://github.com/wikimedia/mediawiki workdir

oregano tag
oregano deploy

# Now we can use /srv/mediawiki/targets/deployment where we want to deploy
ln -s /srv/mediawiki/targets/deployment /www/example.com/docroot/mediawiki

# To update and deploy a new version
cd workdir
git pull
# You can run maintenance scripts, change configuration etc. here
nano LocalSettings.php

cd .. # Must be in the directory where workdir is located
oregano tag
oregano deploy

# Whoops, we accidentally introduced a syntax error in LocalSettings.php
oregano rollback

As you can see from above, it is still possible to break the site if you don’t check what you are deploying. For this purpose I might add support for hooks, so that one could run syntax checks whose failure would prevent deploying that code. Hooks would also be handy for sending IRC notifications, which is something our existing scripts do when code is updated: as pushing out code is now a separate step, they are currently incorrect.

By default oregano will keep the 4 newest tags, so make sure you have enough disk space. For translatewiki.net, which has MediaWiki and dozens of extensions, each tag takes about 200M. If you store MediaWiki localisation cache, pre-generated for all languages, inside workdir, then you would need 1.2G for each tag. Currently, at translatewiki.net, we store localisation cache outside workdir, which means it is out of sync with the code. We will see if that causes any issues; we will move it inside workdir if needed. Do note that oregano creates a tag with rsync --cvs-exclude to save space. That also has the caveat that you should not name any file or directory core, because that option excludes it from the copy. Be warned; patches welcome.
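
The tag, deploy and rollback steps described above could be sketched roughly as follows. This is not the actual oregano code: the tags/ directory, the targets/previous link, the cp fallback and all function names are assumptions made for illustration. Only workdir, targets/deployment, the default of 4 tags and rsync --cvs-exclude come from this post.

```shell
#!/bin/sh
# Hypothetical sketch of the tag/deploy/rollback idea, NOT the real oregano script.
# Assumed layout, relative to the current directory:
#   workdir/            code being prepared
#   tags/<name>/        snapshots of workdir (name assumed to sort chronologically)
#   targets/deployment  symlink to the live tag
#   targets/previous    symlink to the tag that was live before (assumption)
KEEP=4  # number of newest tags to keep

copy_tree() {
    if command -v rsync >/dev/null 2>&1; then
        # --cvs-exclude skips VCS metadata, and also anything named "core"
        rsync -a --cvs-exclude "$1/" "$2/"
    else
        cp -R "$1/." "$2/"   # fallback for this sketch only; copies everything
    fi
}

newest_tag() {
    ls -1 tags 2>/dev/null | sort | tail -n 1
}

tag() {
    name="${1:-$(date +%Y%m%d%H%M%S)}"
    mkdir -p "tags/$name"
    copy_tree workdir "tags/$name"
    # Prune the oldest tags beyond $KEEP.
    # Note: this sketch does not protect the currently deployed tag.
    count=$(ls -1 tags | wc -l | tr -d ' ')
    if [ "$count" -gt "$KEEP" ]; then
        ls -1 tags | sort | head -n "$((count - KEEP))" | while read -r old; do
            rm -rf "tags/$old"
        done
    fi
}

deploy() {
    name="${1:-$(newest_tag)}"
    if [ -z "$name" ] || [ ! -d "tags/$name" ]; then
        echo "no such tag: $name" >&2
        return 1
    fi
    mkdir -p targets
    if [ -L targets/deployment ]; then
        # Remember what was live so that rollback can find it
        ln -sfn "$(readlink targets/deployment)" targets/previous
    fi
    ln -sfn "../tags/$name" targets/deployment
}

rollback() {
    if [ ! -L targets/previous ]; then
        echo "nothing to roll back to" >&2
        return 1
    fi
    ln -sfn "$(readlink targets/previous)" targets/deployment
}
```

Because the web server only ever follows targets/deployment, switching versions is a single symlink update, and slow steps such as git pull or conflict resolution happen in workdir without touching the live code.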

The code is in the translatewiki repo but, if there is interest, I can move it to a separate repository in GitHub. Oregano is currently used in translatewiki.net and in a pet project of mine nicknamed InTense. If things go well, expect to hear more about this mysterious pet project in the future.

Midsummer cleanup: YAML and file formats, HHVM, translation memory

Wikimania 2014 is now over and that is a good excuse to write updates about the MediaWiki Translate extension and translatewiki.net.
I’ll start with an update related to our YAML format support, which has always been a bit shaky. Translate supports different libraries (we call them drivers) to parse and generate YAML files. Over time the Translate extension has supported four different drivers:

spyc uses spyc, a pure PHP library bundled with the Translate extension,
syck uses libsyck which is a C library (hard to find any details) which we call by shelling out to Perl,
syck-pecl uses libsyck via a PHP extension,
phpyaml uses the libyaml C library via a PHP extension.

The latest change is that I dropped syck-pecl because it does not seem to compile with PHP 5.5 anymore; and I added phpyaml. We tried to use spyc a bit but the output it produced for localisation files was not compatible with Ruby projects: after complaints, I had to find an alternative solution.

Joel Sahleen let me know of phpyaml, which I somehow did not find before: thanks to him we now use the same libyaml library that Ruby projects use, so we should be fully compatible. It is also the fastest driver of the four. Anyone generating YAML files with Translate is highly recommended to use the phpyaml driver. I have not checked how phpyaml works with HHVM but I was told that HHVM ships with a built-in yaml extension.

Speaking of HHVM, the long standing bug which causes HHVM to stop processing requests is still unsolved, but I was able to contribute some information upstream. In further testing we also discovered that emails sent via the MediaWiki JobQueue were not delivered, so there is some issue in command line mode. I have not yet had time to investigate this, so HHVM is currently disabled for web requests and command line.

I have a couple of refactoring projects for Translate going on. The first is about simplifying the StringMangler interface. This has no user visible changes, but the end goal is to make the code more testable and reduce coupling. For example the file format handler classes only need to know their own keys, not how those are converted to MediaWiki titles. The other refactoring I have just started is to split the current MessageCollection. Currently it manages a set of messages, handles message data loading and filters the collection. This might also bring performance improvements: we can be more intelligent and only load data we need.

Aiming high: creating a translation memory that works for Wikipedia; even though a long way from here (photo Marie-Lan Nguyen, CC BY 3.0)

Finally, at Wikimania I had a chance to talk about the future of our translation memory with Nik Everett and David Chan. In the short term, Nik is working on implementing in ElasticSearch an algorithm to sort all search results by edit distance. This should bring translation memory performance on par with the old Solr implementation. After that is done, we can finally retire Solr at Wikimedia Foundation, which is much wanted especially as there are signs that Solr is having problems.

Together with David, I laid out some plans on how to go beyond simply comparing entire paragraphs by edit distance. One of his suggestions is to try doing edit distance over words instead of characters. When dealing with the 300 or so languages of Wikimedia, what is a word is less obvious than what is a character (even that is quite complicated), but I am planning to do some research in this area keeping the needs of the content translation extension in mind.
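
To make the difference between the two units concrete, here is a minimal sketch of Levenshtein edit distance that can run over characters or over whitespace-separated words. It is not the ElasticSearch or Translate implementation: the function name edit_distance and its third argument are assumptions for illustration, and splitting on whitespace is exactly the naive notion of a word that would not hold for many languages.

```shell
# Hypothetical helper: edit_distance STRING1 STRING2 [chars|words]
# Prints the Levenshtein distance, counting either characters (default)
# or whitespace-separated words as the unit of editing.
# Caveats: some awk implementations count bytes rather than characters,
# and backslashes in the arguments are interpreted by awk -v.
edit_distance() {
    awk -v a="$1" -v b="$2" -v unit="${3:-chars}" '
        # Fill array out with the tokens of str and return their number
        function tokens(str, out,    i, n) {
            if (unit == "words") return split(str, out, " ")
            n = length(str)
            for (i = 1; i <= n; i++) out[i] = substr(str, i, 1)
            return n
        }
        BEGIN {
            split("", s); split("", t)          # make sure both are arrays
            n = tokens(a, s); m = tokens(b, t)
            for (i = 0; i <= n; i++) d[i, 0] = i
            for (j = 0; j <= m; j++) d[0, j] = j
            for (i = 1; i <= n; i++)
                for (j = 1; j <= m; j++) {
                    # Appending "" forces a string comparison of the tokens
                    cost = ((s[i] "") == (t[j] "")) ? 0 : 1
                    best = d[i-1, j] + 1                                      # deletion
                    if (d[i, j-1] + 1 < best) best = d[i, j-1] + 1            # insertion
                    if (d[i-1, j-1] + cost < best) best = d[i-1, j-1] + cost  # substitution
                    d[i, j] = best
                }
            print d[n, m]
        }'
}
```

For example, edit_distance "the quick brown fox" "the quick red fox" words counts the replaced word as a single edit, however many characters differ between the two words.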

Translatewiki.net summer update

It’s been a busy while since last update, but how could I have not worked on translatewiki.net? ;) Here is an update on my current activities.
In this episode:

we provide translations for over 70 % of users of the new Wikipedia app,
I read a book on networking performance and get needy for speed,
ElasticSearch tries to eat all of us and our memory,
HHVM finds the place not fancy enough,
Finns and Swedes start cooperating.

Performance

Naturally, I have been thinking of ways to further improve translatewiki.net performance. I have been running HHVM as a beta feature at translatewiki.net for many months now, but I have kept turning it on and off due to stability issues. It is currently disabled, but my plan is to try the Wikimedia packaged version of HHVM. Those packages only work in Ubuntu 14.04, so Siebrand and I first have to upgrade the translatewiki.net server from Ubuntu 12.04, as we plan to later this month (July). (Update: done as of 2014-07-09, 14 UTC.)

Map of some translatewiki.net translators

A global network of translators is not served well enough from a single location

After reading a book about networking performance I finally decided to give a content distribution network (CDN) a try. Not because they can optimize and cache things on the fly [1], nor because they can do spam protection [2], but because a CDN can reduce latency, which is usually the main bottleneck of web browsing. We only have a single server in Germany, but our users are international. I am close to the server, so I have a much better experience than many of our users. I do not have any numbers yet, but I will do some experiments and gather some numbers to see whether a CDN helps us.

[1] MediaWiki is already very aggressive in terms of optimizations for resource delivery.
[2] Restricting account creation already eliminated spam on our wiki.

Wikimedia Mobile Apps

Amir and I have been working closely with the Wikimedia Mobile Apps team to ensure that their apps are well supported. In just a couple of weeks, the new app was translated into dozens of languages and released, with over 7 million new installations by non-English users (74 % of the total).

In more detail, we finally addressed a longstanding issue in the Android app which prevented translation of strings containing links. I gave Yuvi access to synchronize translations, ensuring that translators have as much time as possible to translate and the apps have the latest updates before being released. We also discussed how to notify translators before releases to get more translations in time, and improvements to their i18n frameworks to bring their flexibility more in line with MediaWiki (including plural support).

To put it bluntly, for some reason the mobile i18n frameworks are ugly and hard to work with. Just as an example, Android did not support many languages at all, merely because of one character too many; support is still partial. I can’t avoid comparing this to the extra effort which has been needed to support old versions of Internet Explorer: we would rather be doing other cool things, but the environment is not going to change anytime soon.

Search

I installed and enabled CirrusSearch on translatewiki.net: for the first time, we have a real search engine for all our pages! I had multiple issues, including running a bit tight on memory while indexing all content.

Translate’s translation memory support for ElasticSearch has been almost ready for a while now. It may take a couple months before we’re ready to migrate from Solr (first on translatewiki.net, then Wikimedia sites). I am looking forward to it: as a system administrator, I do not want to run both Solr and ElasticSearch.

I want to say big thanks to Nik for helping both with the translation memory ElasticSearch backend and my CirrusSearch problems.

Wikimedia Sweden launches a new project

I am expecting to see increased activity and new features at translatewiki.net thanks to a new project by Wikimedia Sweden together with InternetFonden.Se. The project has been announced on the Wikimedia blog, but in short they want to bring more Swedish translators, new projects for translation and possibly open badges to increase translator engagement. They are already looking for feedback; please do share your thoughts.

Summary of Translate workshop at Zürich hackathon

The hall always provided power and wifi for eager hackers (photo CC-BY-SA by Ludovic Péron)

I held a Translate workshop at the Zürich hackathon. Naturally, others and I worked on Translate and translatewiki.net outside of the workshop as well. Here is a summary of the outcomes.

The workshop itself consisted of three topics of interest. I gave an introduction about the Content translation project, going over the basic design and features, followed by a Q&A. We then split into three small groups. One group continued talking about translating content in wider scope. The second group went over how to add new projects to translatewiki.net, using Huggle and Sharelatex as a concrete example. The third group consisted of me helping with programming questions about the Translate extension.

During the whole hackathon people worked on about 20 bugs and patches. I started a patch for glossary support in the Translate extension: a proof of concept, as simple as possible.

Seminar in Finland about big data in linguistics

Recently I attended a seminar on big data. The discussion included what big data is in linguistics, whether it has arrived yet and whether it is even needed in all places.

It was nice to meet people like Kimmo Koskenniemi (he was the advisor for my Master’s thesis), Antti Kantter and others with whom I worked on the Bank of Finnish terminology in arts and sciences project. I’ve collected the points I found most interesting as a summary below.

Emeritus professor Kimmo Koskenniemi started the seminar by giving examples of really big data, one of them being the Google n-gram viewer. He also raised the issue that the copyright law in Finland does not allow us to do similar things, which can be a problem for local research. He suggested (not seriously, but still) that perhaps we should move linguistics research to some other country.

Next, a corpus project was presented: Arkisyn, a collection of annotated everyday conversation. It was funded by the Kone Foundation Language Programme. A topic of interest was how to achieve uniform tagging when multiple people are working on the collection. Annotation guidelines were produced and the resulting work was cross-checked. Participants were encouraged to document their practices clearly: having the data is no longer enough to be able to reproduce research findings. In fact, it is also necessary for researchers to be able to understand the data and to justify their conclusions.

Toni Suutari from Kotus went into more depth on the political landscape of open data in Finland. He then gave some sneak peeks at the (huge) efforts at Kotus to open some of their data collections (I wonder if volunteers could help with some). Licensing is a difficult topic: for example the word list of contemporary Finnish is licensed under three different licenses as they went through the experience of finding the most suitable one. Also geographical information (paikkatieto) is a big thing now, and they are working on opening geodata on Finnish dialects and other geodata collections. I’m sure the OpenStreetMap project and many others are eagerly waiting already.

Timo Honkela, new professor of Digital information, gave an engineering perspective of big data in linguistics and what it makes possible.

Jarmo Jantunen said that human intuition is still needed, so machines are not replacing linguists. When deciding what to study, big data can give ideas. One might end up studying only a small part of the big data, because there is too much data to go over manually. He also went over a classification scheme that helps in understanding the role of data in the research. Briefly: one can gather supporting examples from the data; one can base the research on analyzing the data; or one can let the data actually drive the research.

Kristiina Jokinen (my doctoral advisor) gave a practical view of issues in multi-modal data collection: privacy issues preventing open data, encoding formats, synchronization, lighting and audio quality. Topics of interest were how to understand the interaction between so many variables (eye gaze, head position, gestures, what is being said, many people interacting) and whether what the machine sees is what a real person would notice. Deb Roy’s research (recording his son while he learned to speak) was also mentioned.

In the end there was a panel discussion. It ranged over a wide variety of topics, but one that struck me was that while some data is becoming available by itself, some kind of data is not coming unless researchers create it, and it is difficult to find resources to create it.

Numbers on translatewiki.net sign-up process

Translatewiki.net features a good user experience for non-technical translators. A crucial or even critical component is signing up. An unrelated data collection for my PhD studies inspired me to get some data on the translatewiki.net user registration process. I will present the results below.

History

At translatewiki.net the process of becoming an approved translator has been, arguably, complicated in some periods.

In the early days of the wiki, permissions were not clearly separated: hundreds of users were just given the full set of permissions to edit the MediaWiki namespace and translate that way.

Later, we required people to go through hoops of various kinds after registering to be approved as translators. They had to create a user page with certain elements and post a request on a separate page, and they would not get notifications when they were approved unless they tweaked their preferences.

At some point, we started using the LiquidThreads extension: now the users could get notifications when approved, at least in theory. That brought its own set of issues though: many people thought that the LiquidThreads search box on the requests page was the place to write the title of their request. After entering a title, they ended up in a search results page, which was a dead end. This usability issue was so annoying and common that I completely removed the search field from LiquidThreads.

In early 2010 we implemented a special page wizard (FirstSteps) to guide users through the process. For years, this has allowed new users to get approved, and start translating, in a few clicks and a handful of hours after registering.

In late 2013 we enabled the new main page containing a sign-up form. Using that form, translators can create an account in a sandbox environment. Accounts created this way are normal user accounts except that they can only make example translations to get a feel of the system. Example translations give site administrators some hints on whether to approve the user as a translator or reject the request.

Data collection

The data we have is not ideal.

For example, it is impossible to say what our conversion rate is from users visiting the main page to actual translators.
A lot of noise is added by spam bots which create user accounts, even though we have a CAPTCHA.
When we go far back in the history, the data gets unreliable or completely missing.
- We only have dates for accounts created after 2006 or so.
- The log entry format for user permissions has changed multiple times, so the promotion times are missing or even incorrect for many entries until a few years back.

I collected the data with two scripts I wrote for this purpose. The first script produces a tab-separated file (TSV) containing all accounts which have been created. Each line has the following fields:

username,
time of account creation,
number of edits,
whether the user was approved as translator,
time of approval and
whether they used the regular sign-up process or the sandbox.

Some of the fields may be empty because the script was unable to find the data. User accounts for which we do not have an account creation time are not listed. I chose not to try methods which can approximate the account creation time, because data from that far back is too unreliable to be useful.

The first script takes a couple of minutes to run at translatewiki.net, so I split further processing to a separate script to avoid doing the slow data fetching many times. The second script calculates a few additional values like average and median time for approval and aggregates the data per month.

The data also includes translators who signed up through the sandbox, but got rejected: this information is important for approval rate calculation. For them, we do not know the exact registration date, but we use the time they were rejected instead. This has a small impact on monthly numbers, if a translator registers in one month and gets rejected in a later month. If the script is run again later, numbers for previous months might be somewhat different. For approval times there is no such issue.
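As a rough illustration of the second step, here is a minimal Python sketch of per-month aggregation. It is not the actual script from Gerrit. The column order follows the field list above, but the timestamp format (ISO 8601), the "1"/"0" approval flag, the function name aggregate_per_month and the sample rows are all assumptions made up for this example.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime
from statistics import mean, median

PROLIFIC_EDITS = 100  # threshold for a "prolific" translator, as defined in this post


def aggregate_per_month(tsv_text):
    """Group accounts by creation month and summarise approvals (hypothetical helper)."""
    months = defaultdict(
        lambda: {"accounts": 0, "approved": 0, "prolific": 0, "delays": []}
    )
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for username, created, edits, approved, approved_at, method in reader:
        created_time = datetime.fromisoformat(created)
        bucket = months[created_time.strftime("%Y-%m")]
        bucket["accounts"] += 1
        if approved != "1":
            continue
        bucket["approved"] += 1
        if int(edits or 0) >= PROLIFIC_EDITS:
            bucket["prolific"] += 1
        if approved_at:  # the approval time may be missing for old log entries
            delay = datetime.fromisoformat(approved_at) - created_time
            bucket["delays"].append(delay.total_seconds() / 3600)

    summary = {}
    for month, bucket in sorted(months.items()):
        delays = bucket["delays"]
        summary[month] = {
            "accounts": bucket["accounts"],
            "approved": bucket["approved"],
            "prolific": bucket["prolific"],
            "mean_hours": mean(delays) if delays else None,
            "median_hours": median(delays) if delays else None,
        }
    return summary


# Invented sample rows: username, created, edits, approved, approved_at, method
SAMPLE = "\n".join([
    "Alice\t2014-01-03T10:00:00\t250\t1\t2014-01-03T12:00:00\tsandbox",
    "Bob\t2014-01-10T08:00:00\t3\t1\t2014-01-10T12:00:00\tsandbox",
    "Spam\t2014-01-11T09:00:00\t0\t0\t\tsandbox",
    "Carol\t2014-02-01T09:00:00\t120\t1\t2014-02-03T09:00:00\tregular",
])

if __name__ == "__main__":
    for month, row in aggregate_per_month(SAMPLE).items():
        print(month, row)
```

Keeping the aggregation separate from the slow data fetching means the TSV file can be reprocessed as often as needed without querying the wiki again.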

Results

Image 1: Account creations and approved translators at translatewiki.net

Image 1 displays all account creations at translatewiki.net as described above, simply grouped by their month of account creation.

We can see that the approval rate has gone down over time. I assume this is caused by spam bot accounts. We did not exclude them, so we cannot tell whether the approval rate has gone up or down for human users.

We can also see that the number of approved translators who later turn out to be prolific translators has stayed pretty much constant each month. A prolific translator is an approved translator who has made at least 100 edits. The edits can be from any point in time: the script only looks at the current edit count, so the graph above doesn’t say anything about wiki activity at any given time.

There is an inherent bias towards old users for two reasons. First, at the beginning translators were basically invited to a new tool from existing methods they used, so they were likely to continue to translate with the new tool. Second, new users have had less time to reach 100 edits. On the other hand, we can see that a dozen translators even in the past few months have already made over 100 edits.

I have collected some important events below, which I will then compare against the chart.

2009: Translation rallies in August and December.
2010-02: The special page to assist in filing translator requests was enabled.
2010-04: We created a new (now old) main page.
2010-10: Translation rally.
2011: Translation rallies in April, September and December.
2012: Translation rallies in August and December.
2013-12: The sandbox sign-up process was enabled.

There is an increase in account creations and approved translators a few months after the assisting special page was enabled. The likely explanation is the new main page, which had a big green button to access the special page. The September translation rally in 2011 seems to have been very successful in recruiting new translators, but the other rallies are also visible in the chart.

Image 2: How long it takes for account creation to be approved.

The second image shows how long it takes from the account creation for a site administrator to approve the request. Before sandbox, users had to submit a request to become translators on their own: the time for them to do so is out of control of the site administrators. With sandbox, that is much less the case, as users get either approved or rejected in a couple of days. Let me give an overview of how the sandbox works.

All users in the sandbox are listed on a special page together with the sandbox translations they have made. The administrators can then approve or reject the users. Administrators usually wait until the user has made a handful of translations. Administrators can also send email reminders for the users to make more translations. If translators do not provide translations within some time, or the translations are very bad, they will get rejected. Otherwise they will be approved and can immediately start using the full translation interface.

We can see that the median approval time is just a couple of hours! The average time varies wildly though. I am not completely sure why, but I have two guesses.
First, some very old user accounts have reactivated after being dormant for months or years and have finally requested translator rights. Even one of these can skew the average significantly. On a quick inspection of the data, this seems plausible.
Second, originally we made all translators site administrators. At some point, we introduced the translator user group, and existing translators have gradually been getting this new permission as they returned to the site. The script only counts the time when they were added to the translator group.
Alternatively, the script may have a bug and return wrong times. However, that should not be the case for recent years because the log format has been stable for a while. In any case, the averages are so big as to be useless before the year 2012, so I completely left them out of the graph.
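The first guess is easy to illustrate with a few lines of Python. The approval times below are invented for the example and are not taken from the translatewiki.net data.

```python
from statistics import mean, median

# Invented approval times in hours for one month of new accounts.
approval_hours = [1, 2, 2, 3, 4]

# Add one account that requested translator rights after three dormant years.
dormant_hours = 3 * 365 * 24
with_dormant = approval_hours + [dormant_hours]

print("without dormant account:", mean(approval_hours), median(approval_hours))
print("with dormant account:   ", mean(with_dormant), median(with_dormant))
```

In this made-up sample a single dormant account moves the average from a couple of hours to about half a year, while the median barely changes.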

The sandbox has been in use only for a few months. For January and February 2014, the approval rate has been slightly over 50%. If a significant portion of rejected users are not spam bots, there might be a reason for concern.

Suggested action points

Store the original account creation date and “sandbox edit count” for rejected users.
Investigate the high rejection rate. We can ask the site administrators why about half of the new users are rejected. Perhaps we can also add a “mark as spam” action to learn whether we get a lot of spam. Event logging could also be used to get more insight into the points of the process where users get stuck.

Source material

Scripts are in Gerrit. Version ‘2’ of the scripts was used for this blog post. Processed data is in a LibreOffice spreadsheet. Original and updated data are available on request; please email me.

IWSDS 2014 reflections

Click for an animation of how the poster went through 20 versions.

I attended the International workshop series on spoken dialog systems aka IWSDS as part of my studies. It was my first scientific event where one had to submit papers: a great experience. More details below.

Poster. The paper we had submitted to the workshop was accepted as a poster presentation. Having never done a poster before, especially of A0 size, I was a little bit afraid that I would encounter many technical and design issues. But no time for procrastination: little more than a week before the conference, I just started doing it. Luckily my university provided poster templates, so I didn’t need to worry about general layout. Between Adobe InDesign and PowerPoint templates, I chose the latter because I didn’t want to learn new software but just wanted to get things done. It took a few hours to come up with the initial draft. During the following days I went through 19 versions with my advisers until we declared it ready. Working on weekends and late nights is nothing new to me, but I was surprised my advisers did the same. Whether this tells about dedication, short deadlines or the general issue of work spreading over to free time, I do not know.

Travel. I sent the poster to print, and got it the day before I flew in to San Francisco. My first leg was late, so I had to run in New York to catch my connection. I was advised to include the poster in check-in luggage, so of course it did not make it. That wasn’t an issue though, since I had decided to fly in one day early to recover from jet lag. The next day I went back to the airport to pick up the poster and take my shuttle to the venue, leaving from SFO.

Location. The workshop was held at an inn near Napa. I was told jokingly that the isolated place was chosen so that we couldn’t go out but only stay together to discuss things. There was probably some truth in that. As a counterbalance to the packed conference program, the place and the good food kept it from being overwhelming.

Did you notice the four hot air balloons?

Presentations. The level of presentations varied somewhat. On one end of the scale there was a presentation which focused on math which I could not grasp. On the other end both Dan Bohus and Louis-Philippe Morency gave captivating keynotes which sparked my interest and were easy to follow even for a complete newbie to the topic as I was.

What struck me the most were Dan’s words about situated interaction. He gave an example where they had a robot in the office, and they were creating algorithms to guess whether the person walking past the robot is going to engage with the robot. They were able to use machine learning to guess many seconds in advance whether the user will engage. But when they moved the robot a bit so that users approached from a different direction, the previous model did not work anymore, and they had to retrain it with new data. The point of this and other given examples was to highlight that, in the context of human interaction, we must devise machine-learning algorithms capable of adapting to new contexts. I immediately drew a connection to the presentation by Filip Ginter (p. 28) at Kites Symposium last year, where data mining was used for distributional semantics, i.e. machine learning was used to learn that the words kaunis, ihastuttava and hurmaava are in some way related. There is something very appealing in making machines learn language and interaction without explicitly giving them the rules. In the case of human interaction it is much more difficult because there is less data available and so many variables to take into account.

Presence. I can’t help but wonder why, apart from Microsoft and some car companies, no big companies were present. I’m sure this kind of research is also done in other companies. I asked a few people this question:

The other companies, is what they are doing (in the context of workshop topics) not novel, or are they just not telling about their research and not contributing back; and if so, do we care?

To summarize the answers and my conclusion: Apple Siri, for example, is not so novel, it is just a product done very well using a simple technology; on the other hand big companies have a vast amount of data not available for research. Essentially Google and other big companies have a monopoly on certain areas like machine translation and speech recognition. We do care about this.

Lab tour. After the workshop there was a lab tour in Silicon Valley. We visited Honda Research Institute, Computer History Museum and Microsoft Research. Microsoft scores again for getting my attention by presenting a new version of Kinect. It was also nice to see a functioning adding machine at the Computer History Museum. And I can’t avoid mentioning having a nice spicy pasta lunch in the warm sunshine in good company (in increasing order of significance) in Mountain View, knowing that Google offices were very close.

First day at work

Officially I started on January 1st, but apart from getting an account, today was the first real thing at the university. Still feels great – the “oh my what did I sign up to” feeling still has time to come. ;)

After the WMF daily standup, I had my usual breakfast and headed to the city center, where our research group of four had a meeting. To my surprise, the eduroam network worked immediately. I had configured it at home earlier based on a guide on the site of some university in Switzerland, if I remember correctly: my university didn’t provide good help on how to set it up with Fedora and KDE.

Institute of Behavioural Sciences, University of Helsinki

The building on the left is part of the Institute of Behavioural Sciences. It is just next to the building (not visible) where I started my university studies in 2005. (Photo CC BY-NC-ND by Irmeli Aro.)

On my side, preparations for the IWSDS conference are now the highest priority. I have until Monday to prepare my first ever poster presentation. I found PowerPoint and InDesign templates on the university’s website (ugh proprietary tools). Then there are a few days to get it printed before I fly on Thursday. After the travel I will make a website for the project to allow it to get some visibility and find out about the next steps as well as how to proceed with studies.

After this topic, I got to hear about another part of the research: the collection of data in Sami languages. I connected them with Wikimedia Suomi, which has expressed interest in working with Sami people.

After the meeting, we went hunting for so-called WBS codes which are needed in various places to target the expenses, for example for poster printing and travel plans. (In case someone knows where the abbreviation WBS comes from, there are at least two people in the world who are interested to know.) The people I met there were all very friendly and helpful.

On my way home I met an old friend from Päivölä & university (Mui Jouni!) in the metro. There was also a surprise ticket inspection – 25% inspection rate for my trips this year based on 4 observations. I guess I need more observations before this is statistically significant ;)

One task left for me when I got home was to do the mandatory travel plan. This needs to be done through the university’s travel management software, which is not directly accessible. After trying without success to access it first through their web-based VPN proxy, second with openvpn via NetworkManager via “some random KDE GUI for that” on my laptop and, third, even with a proprietary VPN application on my Android phone, I gave up for today – it’s likely that the VPN connection itself is not the problem and the issue is somewhere else.

It’s still not known where I will get a room (I’m employed in a different department from where I’m doing my PhD). That said, I will likely work from home often, as I am used to.

MediaWiki i18n explained: {{PLURAL}}

This post explains how MediaWiki handles plural rules to developers who need to work with it. In other words, how a string like “This wiki {{PLURAL:$1|0=does not have any pages|has one page|has $1 pages}}” becomes “This wiki has 425 pages”.

Rules

As mentioned before, we have adopted a data-based approach. Our plural rules come from Unicode CLDR (Common Locale Data Repository) in XML format and are stored in languages/data/plurals.xml. These rules are supplemented by local overrides in languages/data/plurals-mediawiki.xml for languages not supported by CLDR or where we are yet to unify our existing local rules to match CLDR rules.

As a short recap, translators handle plurals by writing all possible different forms explicitly. That means that there are different forms for singular, dual, plural, etc., depending on what grammatical numbers the language has. There might be more forms because of other grammatical reasons, for example in Russian the grammatical case of the noun varies depending on the number. The rules from CLDR put all numbers into different boxes, each box corresponding to one form provided by the translator.

Preprocessing

The plural rules are stored in the localisation cache (not to be confused with the message cache and many other caches in MediaWiki) with other language-specific localisation data. The localisation cache can be stored in different places depending on configuration. The default is to use the SQL database, but the data can also be stored in CDB files, as it is at the Wikimedia Foundation and translatewiki.net.

The whole process starts 1) when the user runs php maintenance/rebuildLocalisationCache.php, or 2) during a web request, if the cache is stale and automatic cache rebuild is allowed (as by default).

The code proceeds as follows:

LocalisationCache::readSourceFilesAndRegisterDeps

LocalisationCache::getPluralRules fills pluralRules
- LocalisationCache::loadPluralFiles loads both xml files, merges them and stores the result in in-process cache
LocalisationCache::getCompiledPluralRules fills compiledPluralRules
- LocalisationCache::loadPluralFiles returns rules from in-process cache
- CLDRPluralRuleEvaluator::compile compiles the standard notation into RPN notation
LocalisationCache::getPluralTypes fills pluralRuleTypes

So now for the given language we have three lists (see table 1). The pluralRules are used in frontend (JavaScript) and the compiledPluralRules are used in the backend (PHP) with a custom evaluator. Tim Starling wrote the custom evaluator for performance reasons. The pluralRuleTypes stores the map between numerical indexes and CLDR keywords, which are not used in MediaWiki plural syntax. Please note that Russian has four plural forms: the fourth form, called other, is used when none of the other rules match and is not stored anywhere.

Table 1: Stored plural data for Russian
pluralRuleTypes	pluralRules	compiledPluralRules
“one”	“n mod 10 is 1 and n mod 100 is not 11”	“n 10 mod 1 is n 100 mod 11 is-not and”
“few”	“n mod 10 in 2..4 and n mod 100 not in 12..14”	“n 10 mod 2 4 .. in n 100 mod 12 14 .. not-in and”
“many”	“n mod 10 is 0 or n mod 10 in 5..9 or n mod 100 in 11..14”	“n 10 mod 0 is n 10 mod 5 9 .. in or n 100 mod 11 14 .. in or”
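To make the compiled notation in table 1 more concrete, here is a minimal Python sketch of a stack-based evaluator for it. It is only an illustration and not the PHP CLDRPluralRuleEvaluator: the function names evaluate_compiled and plural_rule_index are made up, and only the operators that appear in table 1 are supported.

```python
# Binary operators seen in table 1; each takes the two topmost stack values.
OPERATORS = {
    "mod": lambda a, b: a % b,
    "..": lambda a, b: range(a, b + 1),  # inclusive range, e.g. 2..4
    "is": lambda a, b: a == b,
    "is-not": lambda a, b: a != b,
    "in": lambda a, b: a in b,
    "not-in": lambda a, b: a not in b,
    "and": lambda a, b: a and b,
    "or": lambda a, b: a or b,
}


def evaluate_compiled(rule, n):
    """Evaluate one rule in RPN form, such as 'n 10 mod 1 is', for the number n."""
    stack = []
    for token in rule.split():
        if token == "n":
            stack.append(n)
        elif token.isdigit():
            stack.append(int(token))
        elif token in OPERATORS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPERATORS[token](left, right))
        else:
            raise ValueError("unsupported token: " + token)
    return bool(stack.pop())


def plural_rule_index(n, compiled_rules):
    """Return the index of the first matching rule, or the index of 'other'."""
    for index, rule in enumerate(compiled_rules):
        if evaluate_compiled(rule, n):
            return index
    return len(compiled_rules)  # 'other' is implicit and not stored


# The compiledPluralRules column of table 1 (Russian).
RUSSIAN_COMPILED = [
    "n 10 mod 1 is n 100 mod 11 is-not and",
    "n 10 mod 2 4 .. in n 100 mod 12 14 .. not-in and",
    "n 10 mod 0 is n 10 mod 5 9 .. in or n 100 mod 11 14 .. in or",
]
RUSSIAN_TYPES = ["one", "few", "many", "other"]

if __name__ == "__main__":
    for number in (1, 2, 5, 11, 21, 425):
        index = plural_rule_index(number, RUSSIAN_COMPILED)
        print(number, RUSSIAN_TYPES[index])
```

With these rules 21 lands in the box “one”, 2 in “few”, and 11 and 425 in “many”. A number that matches none of the three rules gets the index of the implicit fourth form, other.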

The cache also stores the magic word PLURAL, defined in languages/messages/MessagesEn.php and translated to other languages, so in Finnish language wikis they can use {{MONIKKO:$1|$1 talo|$1 taloa}} if they so want. For compatibility reasons, in all interface translations these magic words are used in English.

Invocation on backend

There are roughly three ways to trigger plural parsing:

1. using the plural syntax in a wiki page,
2. calling the plural parser via a Message object with output format text,
3. using the plural syntax in a message with output format parse, which calls the full wikitext parser as in 1.

In all cases, we will get into Parser::replaceVariables, which expands all magic words and templates (anything enclosed in double braces; sometimes also called {{ constructs). It will load the possible translated magic words and see if the {{thing}} in the wikitext or message matches a known magic word. If not, the {{thing}} is considered a template call. If the plural magic word matches, the parser will call CoreParserFunctions::plural which will take the arguments, make them into an array, call the correct language object with Language::convertPlural( number, forms ): see table 2 for function call trace.

In the Language class we first handle explicit plural forms, explained in a previous post on explicit zero and one forms. If none of the explicit plural forms match, they are removed and we continue with the other forms, calling Language::getPluralRuleIndexNumber( number ), which first loads the compiled plural rules into the in-process cache, then calls CLDRPluralRuleEvaluator::evaluateCompiled, which returns the box the number belongs to. Finally we take the matching form given by the translator, or the last form provided if there are fewer forms than boxes. Then the return value is substituted in place of the plural magic word call.
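The steps above can be sketched in Python roughly as follows. This is a simplified illustration, not the code of Language::convertPlural: the function names are made up, the rule index function is passed in as a parameter, and English is assumed to have the two boxes “one” (index 0) and “other” (index 1).

```python
import re

# An explicit form looks like "0=does not have any pages".
EXPLICIT_FORM = re.compile(r"^(\d+)=(.*)$", re.DOTALL)


def convert_plural(count, forms, rule_index):
    """Pick the plural form for count (hypothetical helper).

    rule_index is a function mapping a number to the index of its box.
    """
    remaining = []
    for form in forms:
        match = EXPLICIT_FORM.match(form)
        if match is None:
            remaining.append(form)  # ordinary form, kept for the rule lookup
        elif int(match.group(1)) == count:
            return match.group(2)  # an explicit form matches exactly
        # explicit forms that do not match are dropped
    if not remaining:
        return ""
    # Fall back to the last form if the translator gave fewer forms than boxes.
    index = min(rule_index(count), len(remaining) - 1)
    return remaining[index]


def english_rule_index(n):
    """Assumed English rule: box 0 ('one') for 1, box 1 ('other') otherwise."""
    return 0 if n == 1 else 1


FORMS = ["0=does not have any pages", "has one page", "has $1 pages"]

if __name__ == "__main__":
    for number in (0, 1, 425):
        print(number, "->", convert_plural(number, FORMS, english_rule_index))
```

The $1 placeholder is left untouched here, since replacing the parameters of a message is a separate step from choosing the plural form.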

Table 2: Function call list for plural magic word
Message::parse	Message::text
Message::toString	Message::toString
Message::parseText	Message::transformText
MessageCache::parse	MessageCache::transform
Parser::parse	Parser::transformMsg
Parser::internalParse	Parser::preprocess
[The above lists converge here]
Parser::replaceVariables
PPFrame_DOM::expand
Parser::braceSubstitution
Parser::callParserFunction
call_user_func_array
CoreParserFunctions::plural
Language::convertPlural
[Plural rule evaluation]

Invocation on frontend

The resource loader module mediawiki.language.data (implemented in class ResourceLoaderLanguageDataModule) is responsible for loading the plural rules from localisation cache and delivering them together with other language data to JavaScript.

The resource loader module mediawiki.jqueryMsg provides yet another limited wikitext parser which can handle plural, links and a few other things. The module mediawiki (global mediaWiki, usually aliased to mw) provides the messaging interface with functions like mw.msg() or mw.message().text(). Those will not handle plural without the aforementioned mediawiki.jqueryMsg module. Translated magic words are not supported at the frontend.

If a plural magic word is found, the parser calls the frontend convertPlural method. It is provided in a few hops by the module mediawiki.language, which depends on mediawiki.language.data and mediawiki.cldr. The latter depends on mediawiki.libs.pluralruleparser, which evaluates the (non-compiled) CLDR plural rules to reach the same result as on the PHP side. It is hosted at GitHub and was written by Santhosh Thottingal of the Wikimedia Language Engineering team.

It rains like a saavi

About me, me and me