Category Archives: vapaasuomi

Goes to vapaasuomi planet

Translatewiki.net summer update

It’s been a busy while since last update, but how could I have not worked on translatewiki.net? ;) Here is an update on my current activities.
In this episode:

  • we provide translations for over 70 % of users of the new Wikipedia app,
  • I read a book on networking performance and get needy for speed,
  • ElasticSearch tries to eat all of us and our memory,
  • HHVM finds the place not fancy enough,
  • Finns and Swedes start cooperating.

Performance

Naturally, I have been thinking of ways to further improve translatewiki.net performance. I have been running HHVM as a beta feature at translatewiki.net many months now, but I have kept turning it on and off due to stability issues. It is currently disabled, but my plan is to try the Wikimedia packaged version of HHVM. Those packages only work in Ubuntu 2014.04, so Siebrand and I first have to upgrade the translatewiki.net server from Ubuntu 2012.04, as we plan to later this month (July). (Update: done as of 2014-07-09, 14 UTC.)

Map of some translatewiki.net translators

A global network of translators is not served well enough from a single location

After reading a book about networking performance I finally decided to give a content distribution network (CDN) a try. Not because they can optimize and cache things on the fly [1], nor because the can do spam protection [2], but because CDN can reduce latency, which is usually the main bottleneck of web browsing. We only have single server in Germany, but our users are international. I am close to the server, so I have much better experience than many of our users. I do not have any numbers yet, but I will do some experiments and gather some numbers to see whether CDN helps us.

[1] MediaWiki is already very aggressive in terms of optimizations for resource delivery.
[2] Restricting account creation already eliminated spam on our wiki.

Wikimedia Mobile Apps

Amir and I have been closely working with the Wikimedia Mobile Apps team to ensure that their apps are well supported. In just a couple weeks, the new app was translated in dozens languages and released, with over 7 millions new installations by non-English users (74 % of the total).

In more detail, we finally addressed a longstanding issue in the Android app which prevented translation of strings containing links. I gave Yuvi access to synchronize translations, ensuring that translators have as much time as possible to translate and the apps have the latest updates before being released. We also discussed about how to notify translators before releases to get more translations in time, and about improvements to their i18n frameworks to bring their flexibility more in line with MediaWiki (including plural support).

To put it bluntly, for some reason the mobile i18n frameworks are ugly and hard to work with. Just as an example, Android did not support many languages at all just for one character too much; support is still partial. I can’t avoid comparing this to the extra effort which has been needed to support old versions of Internet Explorer: we would rather be doing other cool things, but the environment is not going to change anytime soon.

Search

I installed and enabled CirrusSearch on translatewiki.net: for the first time, we have a real search engine for all our pages! I had multiple issues, including running a bit tight on memory while indexing all content.

Translate’s translation memory support for ElasticSearch has been almost ready for a while now. It may take a couple months before we’re ready to migrate from Solr (first on translatewiki.net, then Wikimedia sites). I am looking forward to it: as a system administrator, I do not want to run both Solr and ElasticSearch.

I want to say big thanks to Nik for helping both with the translation memory ElasticSearch backend and my CirrusSearch problems.

Wikimedia Sweden launches a new project

I am expecting to see an increased activity and new features at translatewiki.net thanks to a new project by Wikimedia Sweden together with InternetFonden.Se. The project has been announced on the Wikimedia blog, but in short they want to bring more Swedish translators, new projects for translation and possibly open badges to increase translator engagement. They are already looking for feedback, please do share your thoughts.

Numbers on translatewiki.net sign-up process

Translatewiki.net features a good user experience for non-technical translators. A crucial or even critical component is signing up. An unrelated data collection for my PhD studies inspired me to get some data on the translatewiki.net user registration process. I will present the results below.

History

At translatewiki.net the process of becoming an approved translator has been, arguably, complicated in some periods.

In the early days of the wiki, permissions were not clearly separated: hundreds users were just given the full set of permissions to edit the MediaWiki namespace and translate that way.

Later, we required people to go through hoops of various kind after registering to be approved as translators. They had to create a user page with certain elements and post a request on a separate page and they would not get notifications when they were approved unless they tweaked their preferences.

At some point, we started using the LiquidThreads extension: now the users could get notifications when approved, at least in theory. That brought its own set of issues though: many people thought that the LiquidThreads search box on the requests page was the place where to write the title of their request. After entering a title, they ended up in a search results page, which was a dead end. This usability issue was so annoying and common that I completely removed the search field from LiquidThreads.
In early 2010 we implemented a special page wizard (FirstSteps) to guide users though the process. For years, this has allowed new users to get approved, and start translating, in few clicks and a handful hours after registering.

In late 2013 we enabled the new main page containing a sign-up form. Using that form, translators can create an account in a sandbox environment. Accounts created this way are normal user accounts except that they can only make example translations to get a feel of the system. Example translations give site administrators some hints on whether to approve or reject the request and approve the user as a translator.

Data collection

The data we have is not ideal.

  • For example, it is impossible to say what’s our conversion rate from users visiting the main page to actual translators.
  • A lot of noise is added by spam bots which create user accounts, even though we have a CAPTCHA.
  • When we go far back in the history, the data gets unreliable or completely missing.
    • We only have dates for account created after 2006 or so.
    • The log entry format for user permissions has changed multiple times, so the promotion times are missing or even incorrect for many entries until a few years back.

The data collection was made with two scripts I wrote for this purpose. The first script produces a tab separated file (tsv) containing all accounts which have been created. Each line has the following fields:

  1. username,
  2. time of account creation,
  3. number of edits,
  4. whether the user was approved as translator,
  5. time of approval and
  6. whether they used the regular sign-up process or the sandbox.

Some of the fields may be empty because the script was unable to find the data. User accounts for which we do not have account creation time are not listed. I chose not to try some methods which can be used to approximate the account creation time, because the data in that much past is too unreliable to be useful.

The first script takes a couple of minutes to run at translatewiki.net, so I split further processing to a separate script to avoid doing the slow data fetching many times. The second script calculates a few additional values like average and median time for approval and aggregates the data per month.

The data also includes translators who signed up through the sandbox, but got rejected: this information is important for approval rate calculation. For them, we do not know the exact registration date, but we use the time they were rejected instead. This has a small impact on monthly numbers, if a translator registers in one month and gets rejected in a later month. If the script is run again later, numbers for previous months might be somewhat different. For approval times there is no such issue.

Results

Account creations and approved translators at translatewiki.net

Image 1: Account creations and approved translators at translatewiki.net

Image 1 displays all account creations at translatewiki.net as described above, simply grouped by their month of account creation.

We can see that approval rate has gone down over time. I assume this is caused by spam bot accounts. We did not exclude them hence we cannot tell whether the approval rate has gone up or down for human users.

We can also see that the number of approved translators who later turn out to be prolific translators has stayed pretty much constant each month. A prolific translator is an approved translator who has made at least 100 edits. The edits can be from any point of time, the script is just looking at current edit count so the graph above doesn’t say anything about wiki activity at any point in time.

There is an inherent bias towards old users for two reasons. First, at the beginning translators were basically invited to a new tool from existing methods they used, so they were likely to continue to translate with the new tool. Second, new users have had less time to reach 100 edits. On the other hand, we can see that a dozen translators even in the past few months have already made over 100 edits.

I have collected some important events below, which I will then compare against the chart.

  • 2009: Translation rallies in August and December.
  • 2010-02: The special page to assist in filing translator requests was enabled.
  • 2010-04: We created a new (now old) main page.
  • 2010-10: Translation rally.
  • 2011: Translation rallies in April, September and December.
  • 2012: Translation rallies in August and December.
  • 2013-12: The sandbox sign-up process was enabled.

There is an increase in account creations and approved translators a few months after the assisting special page was enabled. The explanation of this is likely to be the new main page which had a big green button to access the special page. The September translation rally in 2011 seems to be very successful in requiting new translators, but also the other rallies are visible in the chart.

Image 2: How long it takes for account creation to be approved.

Image 2: How long it takes for account creation to be approved.

The second image shows how long it takes from the account creation for a site administrator to approve the request. Before sandbox, users had to submit a request to become translators on their own: the time for them to do so is out of control of the site administrators. With sandbox, that is much less the case, as users get either approved or rejected in a couple of days. Let me give an overview of how the sandbox works.

All users in the sandbox are listed on a special page together with the sandbox translations they have made. The administrators can then approve or reject the users. Administrators usually wait until the user has made a handful translations. Administrators can also send email reminders for the users to make more translations. If translators do not provide translations within some time, or the translations are very bad, they will get rejected. Otherwise they will be approved and can immediately start using the full translation interface.

We can see that the median approval time is just a couple of hours! The average time varies wildly though. I am not completely sure why, but I have two guesses.
First, some very old user accounts have reactivated after being dormant for months or years and have finally requested translator rights. Even one of these can skew the average significantly. On a quick inspection of the data, this seems plausible.
Second, originally we made all translators site administrators. At some point, we introduced the translator user group, and existing translators have gradually been getting this new permission as they returned to the site. The script only counts the time when they were added to the translator group.
Alternatively, the script may have a bug and return wrong times. However, that should not be the case for recent years because the log format has been stable for a while. In any case, the averages are so big as to be useless before the year 2012, so I completely left them out of the graph.

The sandbox has been in use only for a few months. For January and February 2014, the approval rate has been slightly over 50%. If a significant portion of rejected users are not spam bots, there might be a reason for concern.

Suggested action points

  1. Store the original account creation date and “sandbox edit count” for rejected users.
  2. Investigate the high rejection rate. We can ask the site administrator why about a half of the new users are rejected. Perhaps we can also have “mark as spam” action to get insight whether we get a lot of spam. Event logging could also be used, to get more insight on the points of the process where users get stuck.

Source material

Scripts are in Gerrit. Version ‘2’ of the scripts was used for this blog post. Processed data is in a LibreOffice spreadsheet. Original and updated data is available on request, please email me.

First day at work

Officially I started January 1st, but apart from getting an account today was the first real thing at the university. Still feels great – the “oh my what did I sign up to” feeling has still time to come. ;)

After having the WMF daily standup, I have a usual breakfast and head to city center, where our research group of four had a meeting. To my surprise, the eduroam network worked immediately. I had configured it at home earlier based on a guide on the site of some university of Switzerland, if I remember correctly: my university didn’t provide good help for how to set it up with Fedora and KDE.

Institute of Behavioural Sciences, University of Helsinki

The building on the left is part of Institute of Behavioural Sciences. It is just next to the building (not visible) where I started my university studies in 2005. (Photo CC BY-NC-ND by Irmeli Aro.)

On my side, preparations for the IWSDS conference are now the highest priority. I have until Monday to prepare my first ever poster presentation. I found PowerPoint and InDesign templates from the university’s website (ugh proprietary tools). Then there are few days to get it printed before I fly on Thursday. After the travel I will make a website for the project to allow it to get some visibility and find out about the next steps as well as how to proceed with studies.

After this topic, I got to hear about other part of the research, collection of data in Sami languages. I connected them with Wikimedia Suomi who has expressed interest to work with Sami people.

After the meeting, we went hunting for so-called WBS codes which are needed in various places to target the expenses, for example for poster printing and travel plans. (In case someone knows where the abbreviation WBS comes from, there are at least two people in the world who are interested to know.) The people I met there were all very friendly and helpful.

On my way home I met an old friend from Päivölä&university (Mui Jouni!) in the metro. There was also a surprise ticket inspection – 25% inspection rate for my trips this year based on 4 observations. I guess I need more observations before this is statistically significant ;)

One task left for me when I got home was to do the mandatory travel plan. This needs to be done through university’s travel management software, which is not directly accessible. After trying without success to access it first through their web based VPN proxy, second with openvpn via NetworkManager via “some random KDE GUI for that” on my laptop and, third, even with a proprietary VPN application on my Android phone I gave up for today – it’s likely that the VPN connection itself is not the problem and the issue is somewhere else.

It’s still not known from where I will get a room (I’m employed in a different department from where I’m doing my PhD). Though I will likely work from home often as I am used to.

MediaWiki i18n explained: {{PLURAL}}

This post explains how MediaWiki handles plural rules to developers who need to work with it. In other words, how a string like “This wiki {{PLURAL:$1|0=does not have any pages|has one page|has $1 pages}}” becomes “This wiki has 425 pages”.

Rules

As mentioned before we have adopted a data-based approach. Our plural rules come from Unicode CLDR (Common Locale Data repository) in XML format and are stored in languages/data/plurals.xml. These rules are supplemented by local overrides in languages/data/plurals-mediawiki.xml for languages not supported by CLDR or where we are yet to unify our existing local rules to match CLDR rules.

As a short recap, translators handle plurals by writing all possible different forms explicitly. That means that there are different forms for singular, dual, plural, etc., depending on what grammatical numbers the language has. There might be more forms because of other grammatical reasons, for example in Russian the grammatical case of the noun varies depending on the number. The rules from CLDR put all numbers into different boxes, each box corresponding to one form provided by the translator.

Preprocessing

The plural rules are stored in localisation cache (not to be confused with message cache and many other caches in MediaWiki) with other language specific localisation data. The localisation cache can be stored in different places depending on configuration. The default is to use the SQL database, but they can also be in CDB files as they are at the Wikimedia Foundation and translatewiki.net.

The whole process starts 1) when the user runs php maintenance/rebuildLocalisationCache.php, or 2) during a web request, if the cache is stale and automatic cache rebuild is allowed (as by default).

The code proceeds as follows:

LocalisationCache::readSourceFilesAndRegisterDeps

  • LocalisationCache::getPluralRules fills pluralRules
    • LocalisationCache::loadPluralFiles loads both xml files, merges them and stores the result in in-process cache
  • LocalisationCache::getComplisedPluralRules fills compiledPluralRules
    • LocalisationCache::loadPluralFiles returns rules from in-process cache
    • CLDRPluralRuleEvaluator::compile compiles the standard notation into RPN notation
  • LocalisationCache::getPluralTypes fills pluralRuleTypes

So now for the given language we have three lists (see table 1). The pluralRules are used in frontend (JavaScript) and the compiledPluralRules are used in the backend (PHP) with a custom evaluator. Tim Starling wrote the custom evaluator for performance reasons. The pluralRuleTypes stores the map between numerical indexes and CLDR keywords, which are not used in MediaWiki plural syntax. Please note that Russian has four plural forms: the fourth form, called other, is used when none of the other rules match and is not stored anywhere.

Table 1: Stored plural data for Russian
pluralRuleTypes pluralRules compiledPluralRules
“one” “n mod 10 is 1 and n mod 100 is not 11” “n 10 mod 1 is n 100 mod 11 is-not and”
“few” “n mod 10 in 2..4 and n mod 100 not in 12..14” “n 10 mod 2 4 .. in n 100 mod 12 14 .. not-in and”
“many” “n mod 10 is 0 or n mod 10 in 5..9 or n mod 100 in 11..14” “n 10 mod 0 is n 10 mod 5 9 .. in or n 100 mod 11 14 .. in or”

The cache also stores the magic word PLURAL, defined in languages/messages/MessageEn.php and translated to other languages, so in Finnish language wikis they can use {{MONIKKO:$1|$1 talo|$1 taloa}} if they so want. For compatibility reasons, in all interface translations these magic words are used in English.

Invocation on backend

There are roughly three ways to trigger plural parsing:

  1. using the plural syntax in a wiki page,
  2. calling the plural parser with Message object with output format text,
  3. using the plural syntax in a message with output format parse, which calls full wikitext parser as in 1.

In all cases, we will get into Parser::replaceVariables, which expands all magic words and templates (anything enclosed in double braces; sometimes also called {{ constructs). It will load the possible translated magic words and see if the {{thing}} in the wikitext or message matches a known magic word. If not, the {{thing}} is considered a template call. If the plural magic word matches, the parser will call CoreParserFunctions::plural which will take the arguments, make them into an array, call the correct language object with Language::convertPlural( number, forms ): see table 2 for function call trace.

In the Language class we first handle explicit plural forms explained in a previous post on explicit zero and one form. If any explicit plural form doesn’t match, they are removed and we will continue on with the other forms, calling Language::getPluralRuleIndexNumber( number ), which first loads the compiled plural rules into the in-process cache, then calls CLDRPluralRuleEvaluator::evaluateCompiled which returns the box the number belongs to. Finally we take the matching form given by the translator, or the last form provided. Then the return value is substituted in place of the plural magic word call.

Table 2: Function call list for plural magic word
Message::parse Message::text
  • Message::toString
  • Message::parseText
  • MessageCache::parse
  • Parser::parse
  • Parser::internalParse
  • Message::toString
  • Message::transformText
  • MessageCache::transform
  • Parser::transformMsg
  • Parser::preprocess
  • [The above lists converge here]
  • Parser::replaceVariables
  • PPFrame_DOM::expand
  • Parser::braceSubstitution
  • Parser::callParserFunction
  • call_user_func_array
  • CoreParserFunctions::plural
  • Language::convertPlural
  • [Plural rule evaluation]

Invocation on frontend

The resource loader module mediawiki.language.data (implemented in class ResourceLoaderLanguageDataModule) is responsible for loading the plural rules from localisation cache and delivering them together with other language data to JavaScript.

The resource loader module mediawiki.jqueryMsg provides yet another limited wikitext parser which can handle plural, links and few other things. The module mediawiki (global mediaWiki, usually aliased to mw) provides the messaging interface with functions like mw.msg() or mw.message().text(). Those will not handle plural without the aforementioned mediawiki.jqueryMsg module. Translated magic words are not supported at the frontend.

If a plural magic word is found, then it will call the frontend convertPlural method. These are provided in few hops by the module mediawiki.language which depends on mediawiki.language.data and mediawiki.cldr. The latter depends on mediawiki.libs.pluralruleparser, which evaluates the (non-compiled) CLDR plural rules to reach the same result as in the PHP side and is hosted at GitHub, written by Santhosh Thottingal of the Wikimedia Language Engineering team.

You can write a paper about that

“You can write a paper” is kind of a running joke in the language engineering team when the discussion sways so far from the original topic that it is no longer helping to get the work done. But sometimes sidelines turn out to be interesting and fruitful. When I was presented an opportunity to do a PhD related to wikis, languages and translation I could not pass it. And because of the joke, I can claim full innocence – they told me to! ;)

The results are in and…. I got accepted! Screams with joy and then quickly shies away hoping nobody noticed.

What does this mean?

Doctoral hat

The doctoral hat is the ultimate goal, right?

If you are a reader of this blog, the topics might get even more incomprehensible. Or the posts might be even more insightful and based on research instead of gut feelings. Hopefully, it doesn’t mean that I won’t have time to write more blog posts.

Practically, I will be starting at the beginning of January with the goal of writing a PhD dissertation and of graduating in about four years. The proposed topic for my dissertation is Supporting creation and interaction of open content with language technology, as part of the project “Finno-Ugric Digital Natives: Linguistic support for Finno-Ugric digital communities in generating online content”. As with my MA, I’ll do this at the University of Helsinki.

Initially I will be working three days a week on that and keep helping the language engineering team as well. We’ll see how it goes.

The first thing I will do is to participate in IWSDS (Workshop on Spoken Dialog Systems) held in January at Napa, California, USA. I will be presenting a paper about multilingual WikiTalk.

Performance is a feature

In case you haven’t already noticed, I like working on performance issues and performance improvements. Performance is a thing where you have to consider the whole stack: the speed of the server, efficient algorithms, server side caching, bandwidth and latency, client side caching and client side code. Here is a short recap of what has been done for translatewiki.net lately and some ideas for the future.

Recent improvements

Flame chart visualization

Chrome 29 (or later release) has added a helpful visualization for profiling data. In this image the speed of ULS JavaScript code is evaluated on a fonts heavy page. Comparing to the collapsible tabs feature, it is doing okay.

Server level. A month ago translatewiki.net got a new server with more memory and faster processors. The main benefit is that we can handle more simultaneous users and background tasks without them slowing each other down. At the same time, we upgraded many of the programs to newer versions. The switch from MySQL to MariaDB is the most important one. We haven’t tested it for our use case, but the Wikimedia Foundation found that the switch had overall positive impact on performance.

Web server level. In the beginning of November I configured our nginx web server to enable support for the SPDY protocol. This should greatly reduce latency when browsing over HTTPS. We are considering to switch to HTTPS by default. While tweaking nginx, I also fixed a few settings that relate on the compression and expiry times of JavaScript, SVG images and font assets when delivered to users. I used AWStats to see if our daily bandwidth usage decreased. It has not decreased significantly, but there is a lot of variation between days that make interpreting the data difficult. PageSpeed was used to ensure that caching headers are optimal and WebPagetest to confirm that pages load faster on different browsers in different places.

Application level. The Language Engineering team has recently worked a lot on the performance of Universal Language Selector (ULS) and Translate extensions. A short summary of the things which were done:

  • Reduce the amount of JavaScript and CSS delivered to the browser.
  • Delay the loading of JavaScript and CSS as much as possible (for example till the user opens ULS).
  • Optimize JPG, SVG and PNG images to the last byte with tools like jpegoptim, optipng.
  • Optimize the JavaScript to avoid slow actions (for example repaint events and dom changes). We used Chrome’s JavaScript profiler as well as the experimental tool “show potential scroll bottlenecks” to identify issues and confirm the fixes (thanks Ori).

In addition I fixed a major performance issue in one of the Translate API modules by replacing an inefficient algorithm with a faster one. While investigating that issue, I also noticed that ReplacementArray-strtr was taking 20% or so of MediaWiki run time. There is a less known PHP module FastStringSearch, which was not installed on the new server. Installing that module made a big difference on the MediaWiki profiling table: ReplacementArray-fss is now taking only about 0.20% of MediaWiki run time.

Finally, a thing called module local storage was enabled on Wikimedia wikis few days ago (the title of this post was taken from that discussion). As is usual for translatewiki.net, we were already beta testing that feature a few weeks before it went live on Wikimedia wikis.

Future plans

It is hard to plan the future for further performance improvements, as the bottlenecks and the places where you can make the most difference for the least effort change constantly, together with the technology and your content. I believe that HHVM, a JIT PHP virtual machine, is likely to be the next step which will make a significant difference. It is however not a straightforward thing to jump from a normal PHP intepreter to HHVM, so I will be keeping a close eye on how my colleagues at the Wikimedia Foundation are progressing with the adoption of HHVM.

Another relatively small thing on the horizon is better compression of inline SVG images in CSS style sheets, by avoiding unnecessary base64 encoding. Or something else might happen even before it.

Finally, I’d like to highlight that while the application-level improvements automatically benefit third party users, there really isn’t any coherent documentation on how to improve performance of a MediaWiki site at all levels. Configuring localisation cache, nginx and/or Varnish, tweaking MySQL or MariaDB and installing Memcached or Redis should be part of any capable sysadmin’s skills; but even just tailoring them for MediaWiki, let alone knowing which PHP modules to install, is likely not known by many. For example, I wouldn’t be surprised if there were very few or even no sites using the FastStringSearch module outside of Wikimedia and translatewiki.net.

Pet project: Optimizing message index to the last byte

The message index is a crucial component of Translate, so I made an experiment by implementing a trie store for the message index to optimize it. The short story is that I could not get it fast enough for practical use easily. Continue for full story.

Pet projects

A tree during fall/ruska

A tree in Helsinki (October) showing something tries can’t produce: wonderful fall colours (ruska in Finnish)

For context, in our development team each developer has time for experimentation, outside of the planned development sprint tasks. During that time the developer can try out new technologies, fix issues that are important to them personally or just do something fun and interesting. We call these pet projects and they let us do some cool things.

For example, the insertables I described in my previous blog post are something I did as a pet project. Insertables were actually part of the original translation UX (TUX) design specifications, but they were not implemented because of other priorities. I decided to implement them because users (not managers) were asking for it. I wasn’t convinced initially, but when I saw users translating with tablets I changed my mind. Insertables were a good pet project because they were relatively small and fun a thing to do.

This is all I have to say about pet projects – the non technical readers can skip the rest of this post, where I go into the details of this pet project.

Message index

I probably have introduced the message index in my earlier posts, but let me do it again quickly. I’ll use an example for this. Let’s assume we have a small software called Greeter. It has a localisation file like this:

# l10n/en/greetings.properties
greeting.noon = Good day
greeting.morning = Good morning
greeting.evening = Good evening
greeting.night = Good night

When this kind of file is set up with the Translate extension (for instance in translatewiki.net), each string is stored as a wiki page. Each translation is a separate page, too.

translatewiki.net/wiki/Greeter:greeting.noon/en -> “Good day”
translatewiki.net/wiki/Greeter:greeting.noon/fi -> “Hyvää päivää”

The bolded parts are called page titles in MediaWiki. The message index can be defined simply as a map from the page title of each known message (without the language code) to the message group it belongs to. If we printed it out it would look something like this:

1244:greeting-noon => [greeter]
1244:greeting-morning => [greeter]
1244:greeting-evening => [greeter]
1244:greeting-night => [greeter]

So, every time someone adds a new message for translation, we need to update the message index. Every time someone makes a translation, we need to query the message index. The user is waiting, so both of these actions need to be fast, while using a reasonable amount of memory.

Implementations problems

When we get to the order of 50 000 or even more known messages, creation and accessing of the message index starts to get slow in PHP, even though it’s basically just a lot of strings, and string processing should be fast, right? Not so in PHP, where holding the message index as an array of arrays takes tens of megabytes in memory. An array in php is kind of a mix of hashtable and linked list. It uses more memory for extra features and versatility.. In the case of message index we would gladly like to trade some features for reduced memory usage.

There are many aspects in message index optimization, but so far I haven’t found a solution without downsides. If the whole index was small enough, it could be kept in memory, making things faster; but currently it can only be stored in various kinds of databases, that allow querying the index one title at the time.

Currently at translatewiki.net we are using CDB files, which are immutable databases stored on a file on the file system. This is okay for our use case: the index is accessed from disk; only when the data changes, you have to build the whole thing from scratch and you have to worry about memory usage and speed. The current problem we have with this approach is that it takes a lot of memory to recreate it, and the few second running time is on the borderline of acceptable speed for having user to wait for it. There isn’t too much room for growth.

To reach the current state, I’ve tried using references to store the group names to avoid repeating them and storing the resulting array in a serialized file. I’ve tried storing the whole structure in a database table, which works well to certain amount of messages. This time I’m going to try something else. The idea is to save space by considering that the message keys share a lot of substrings, for instance the messages of a MediaWiki extension having all keys prefixed with the extension’s name. I decided to use tree structures to experiment.

Trees and tries

Disclaimer: I haven’t studied algorithms in depth so I’m just trying to apply what I know.

We can represent all the relationships between message names and their groups as a set of mostly similar strings which may share common prefixes. I could have used a tree, but I decided to use a trie. A trie is a tree where consecutive nodes which only have one child are merged together. Here is an example of how the message index above would look like in a trie (first image), compared to the full tree (second image). As you can see, the trie is more compact compared to the tree because it has less nodes and branches. The trie is also more compact than an array as the common prefixes can be stored only once and we are not using any hashes which are used in arrays. Click for full size.

Trie Tree

To create a message index using tries, I started by googling if there are any algorithms already implemented in PHP for constructing tries. I could not find any, so I just converted into PHP a Python script (which was likely converted from Java). Then I implemented a custom binary format that could be stored in a file and a custom lookup that would use the data loaded from the file into a memory.
I tried many options for optimizing the creation of the trie while minimizing the storage consumption.

One of the curious things was that, when inserting a new string to the trie, it is faster to loop over all the current children of the node comparing the first letter of the child against the first letter of the string we are inserting, rather than to use binary search to find the correct insertion point. The latter would mean keeping the list of children sorted and doing less comparisons by using binary search when doing lookups and insertions. I assume this is because inserting at the end of the array is fast, but inserting in the middle of the array (to keep it sorted) is slow because (my guess) PHP either recreates the array or updating the linked list pointers is slow for some other reason.

For the storage format I tried various kinds of indexes of strings to store the substrings only once, but all the pointers to the strings and child nodes also take a lot of space (4 bytes per pointer, where 4 bytes can also store four characters assuming ascii keys). I’m sure more space savings could be gained by experimenting with alignments so that smaller pointers could be used. Maybe it would be possible borrow some of the algorithms designed to optimize finite state automata – I believe those are much better than what I can do on my own.

Here are some numbers (approximate because I ran out of time to measure properly) on how it compares to the CDB message index solution:

Property CDB Trie
Size on disk 6 MiB 1.5 MiB (0.5 compressed with gzip)
Time to create 1 second 7 seconds

For now I declare this pet project as something that cannot be used. Maybe some day I will get back to it and try make it good enough for real use, but now I already have other interesting pet projects in my mind. If I get suggestions from you how to reach practical solutions, I will of course try them out sooner. I just want to mention that there a many things that could still be explored: QuickHash, constant hash database or finding ways to store group information so that message index is not needed at all.

Review of Gettext po(t) file format

Gettext shows its age both in developer and translator friendliness. What’s wrong with the old known localisation file formats which Google and Mozilla among others are so keen to replace? I don’t have a full answer to that. Gettext is clearly quite inflexible compared to Mozilla’s file format (which is almost a programming language) and it does not support many of the new features in Google’s resource bundles.

My general recommendation is: use the file format best supported by your i18n framework. If you can choose, prefer key based formats. Only try new file formats if you need the new features, because tool support for them is not as good. There is also no clarity which of the new file formats will “win” the fight and become popular.

When making something new, it is good to look back. The motivation why I wrote this post initially was my annoyances writing a tool which supports this format, but the context I’m going to give is completely different. It has been waiting as draft to be published for a long time because it lacked context where it makes sense. Maybe this also helps people, who are wondering what localisation file format they should use.

Enough of the general thoughts. But let’s start this evaluation with the good things:
Can support plural for many languages. The plural syntax is flexible enough to cover at least most if not all of world’s languages.
Fuzzy translations. It has a standard way to mark outdated translations, which is a necessity for this format which does not identify strings.
Tool support. Gettext can be used in many programming languages and there are plenty of tools for translators.

And then the things I don’t like:
Strings have no identifiers. This is my biggest annoyance with Gettext. Strings are identified by their contents, which means that fixing a typo in the source invalidates all translations. It also makes it impossible to keep any track of history. This causes another problem: Identical strings are collapsed by default. This is especially annoying since in English words like Open (action) and Open (state) are the same but in other languages they are different. This effectively prevents proper translations, unless a message context is provided, but here lies another problem: Not all implementations support passing context. Last time I checked this was the case at least in Python.
And one nasty corner case for tool makers is that empty context is different from no context. If you don’t handle this right you will be producing invalid Gettext files.
I listed plural support above as a plus, but it is not without its problems. One string can only have plural forms depending on one variable. This forces the developers to use lego sentences when there is more than one number, or force the translators to make ungrammatical translations. Not to mention that, in Arabic and other languages where there can be five or even more forms, you need to repeat the whole string as many times with small changes. Lots of overhead updating and proofreading that, as opposed to an inline syntax where you only mark the differences. To be fair, with an inline syntax it might hard to see how each plural form looks in full, but there are solutions to that.
There is no standard way to present authorship information except for last translator. The file header is essentially free form text, making it hard to process and update that information programmatically. To be fair, this is the case for almost all i18n file formats I’ve seen.
The comments for individual strings are funky. There are different kinds of comments that start with “#,” “#|”, not documented anywhere as far as I know, and the order of different kinds of comments matters! Do it wrong and you’ll have a file that some tools refuse to use. Not to mention that developers can also leave comments for the translators, in addition to the context parameter (so there are two ways!): the translators might or might not see them depending on the tool they use and on what is propagated from the pot file to the po file. It is quite a hassle to keep these comments in sync and repeated in all the translation files.

I’m curious to hear whether you would like to see more of these evaluations and perhaps a comparison of the formats. If there isn’t much interest I likely won’t do more.

Insertables in Translate make translating easier

Insertables are a new tool to easily copy some text from the source language to your translation with one click.

Have you ever translated anything with the Translate extension? Did it contain markup like this?

[http://very.long.url/here link description]
{{GENDER:$1|he|she}} posted $2 on $3

If so, then you know what this is about. Have you ever translated anything with the Translate extension while using a tablet or another device without a physical keyboard? If so, then you likely know why this interesting.

When you translate text written in wiki markup, or software interface strings, you will encounter the examples above, and many more parts which you need to copy verbatim while translating. These parts contain special characters like braces, dollar signs, brackets, pipes and so on. These characters are cumbersome to type on non-English keyboards, where they have been moved to more difficult to reach key combinations in favour of local characters – if they exist in the layout at all. If they don’t exist in the keyboard layout, you need to switch keyboard layouts just to type few characters and then switch it back.

Does this sound cumbersome? Many translators in fact do not do that, but instead they copy and paste the text from the source text. On tablets however, copy and paste itself is a cumbersome thing. Insertables are a solution to this usability issue.

We can automatically identify a part of the translatable text which has the following properties: it should not be changed and it is difficult to type. We can then present these parts of strings as buttons near the translation. Clicking or pressing that button inserts the text into the translation. These buttons complement the insert source text button and are optional to use, like all translation helpers we provide.

Happy translator using the new feature

Happy translator using the new feature

As of now, we only detect a few types of these insertables: plural, grammar magic words, and variables in MediaWiki style ($1). Read more on Translate documentation for how to contribute more insertables.

On course to machine translation

It has been a busy spring: I have yet to blog about Translate UX and Universal Language Selector projects, which have been my main efforts.
But now something different. In this field you can never stop learning. So I was very pleased when my boss let me participate in a week-long course, where Francis Tyers and Tommi Pirinen taught how to do machine translation with Apertium. Report of the course follows.

From translation memory to machine translation

Before going to the details about the course, I want to share my thoughts about what is the relation between the different translation memory and machine translations techniques we are using to help translators. The three different techniques are:

  • Crude translation memory: for example the TTMServer of Translate
  • Statistical machine translation: for example Google Translate or Microsoft Translator
  • Rule-based machine translation: for example Apertium

In the figure below, I have used two properties to compare them.

  • On x-axis is the amount of information that is extracted from the stored data. Here the stored data is usually a corpus of aligned* translations in two or more languages.
  • On y-axis is the amount of external knowledge used by the system. This data is usually dictionaries, rules how words inflect and rules about grammar–or even how to split text into sentences and words.

* Aligned means that the system knows which parts of the text correspond to each other in the translations. Alignment can be at paragraph level, sentence level or even smaller parts of the text.

Translation memory and machine translation comparison

A very crude implementation just stores an existing translation and can retrieve it if the very same text is translated again.

TTMServer is a little more sophisticated: it splits the translation into paragraph-sized chunks, and it can retrieve the existing translation even if the new text does not match the old text exactly. This system uses only a little information about the data. Even if all the words exist in it, translated as part of different units (strings), the system still cannot provide any kind of translation. Internally, TTMServer uses some external knowledge on how to split up text into words, in order to speed up translation retrieval.

Statistical machine translation at simplest is just a translation memory which extracts more information about the stored translation data. It gathers a huge database about which words usually occur as translation of the words in the source language. Usually it also stores the context so that in the sentence “walking along the river bank” the term “bank” is not interpreted as a building. Most sophisticated systems can also include knowledge about inflection and grammar to filter out invalid interpretations, or even fix grammatically incorrect forms.

On the right hand side of the figure we have rule-based machine translation systems like Apertium. These systems mainly rely on language dependent information supplied by the maker of the system: bilingual dictionaries, inflection and syntax rules are needed for them to function. Unlike the preceding ones, such systems are always language specific. Creating a machine translation needs a linguist for each language in the system.
Still, even these systems can benefit from statistical methods. While they do not store translation data itself, such data can be analysed and used as input to find the correct way to read ambiguous sentences, or the most common translation of a word in the given context among some alternatives.

The ultimate solution for machine translation is most likely a combination of rules and information extracted from a huge translation corpora.

The course

To create a machine translation system with Apertium, you need to choose a source and target language. I built a system to translate from Kven to Finnish. Kven is very close to Finnish, so it was quite easy to do even though I do not know much Kven. Each student was provided skeleton files and a story in the source language, also translated to the target language by a human translator.

We started by adding words in order of frequency to the lexicon. Lexicon defines part of speech and the inflection paradigms of the words. The paradigms are used to analyze the word forms, and also for generation when translating in the opposite direction. Then we added phonological rules. For example Finnish has a vowel harmony. Because of that, many word endings (cases) have two forms, depending on the word – for example koirassa (in the dog), but hiiressä (in the mouse).

As a third step, we created a bilingual dictionary in a form that is suitable for machines (read: XML). At this point we started seeing some words in the target language. Of course we also had to add the lexicon for the target language, if nobody else had done it already.

Finally we started adding rules.
We added rules to disambiguate sentences with multiple readings. For example, in the sentence “The door is open” we added a rule that open is an adjective rather than a verb, because the sentence already has a verb.
We added rules to convert the grammar. For example Finnish cases are usually replaced with prepositions in English. We might also need to add words: “sataa” needs an explicit subject in English, “it rains”.

At the end we compared the translation produced by our system with the translation made by the human translator. We briefly considered two ways to evaluate the quality of the translation.
First, we can use something like edit distance for words (instead of characters) to count how many insertion, deletions or substitutions are needed to change the machine translation to human translation. Otherwise, we can count how many words the human translator needs to change when copy editing the machine translation.
Machine translation systems start to be useful when you need to fix only one word out of six or more words in the translation.

The future

A little while ago Erik asked how the Wikimedia Foundation could support machine translation, which is now mostly in hands of big commercial entities (though the European Union is also building something) and needs an open source alternative.

We do not have lot of translation corpora like Google. We do have lots of text in different languages, but it is not the same content in all languages and it’s not aligned. Exceptions are translatewiki.net and other places where translations are done with the Translate extension. As a side note I think that translatewiki.net contains one of the most multilingual parallel translation corpora under a free license.

Given that we have lots of people in the Wikimedia movement who are multilingual and interested in languages, I think we should cooperate with an existing open source machine translation system (like Apertium) in a way that allows our users to enhance that system. Doing more translations increases the data stored in a translation memory making it more useful. In a similar fashion, doing more translations with machine translation system should make it better.

Apertium has already been in use on the Nynorsk Wikipedia. Bokmål and nynorsk are closely related languages: the kind of situation where Apertium excels.

One thing I have been thinking is that, now that the Wikimedia Language Engineering team is planning to build tools to help translate Wikipedia articles into other languages, we could closely integrate it with Apertium. We could provide an easy way for translators to add missing words and report unintelligible sentences.

I don’t expect most of our translators to actually write and correct rules, so someone should manage that on Apertium side. But at least word collection could be mostly automated; I bet someone has tried and will try to use Wiktionary data too.

As a first step, Wikimedia Foundation could set up their own Apertium instance as a web service for our needs (existing instances are too unstable). The translate extension, for example, can query such a web service to provide translation suggestions.