Tag Archives: translatewiki.net

MediaWiki short urls with nginx and main page without redirect

This post has been updated 2015-09-06 with simplified code suggested by Krinkle and again in 2017-04-04.

Google PageSpeed Insights writes:

Redirects trigger an additional HTTP request-response cycle and delay page rendering. In the best case, each redirect will add a single round trip (HTTP request-response), and in the worst it may result in multiple additional round trips to perform the DNS lookup, TCP handshake, and TLS negotiation in addition to the additional HTTP request-response cycle. As a result, you should minimize use of redirects to improve site performance.

Let’s consider the situation where you run MediaWiki as the main thing on your domain. When user goes to your domain example.com, MediaWiki by default will issue a redirect to example.com/wiki/Main_Page, assuming you have configured the recommended short urls.

In addition the short url page writes:

Note that we do not recommend doing a HTTP redirect to your wiki path or main page directly. As redirecting to the main page directly will hard-code variable parts of your wiki’s page setup into your server config. And redirecting to the wiki path will result in two redirects. Simply rewrite the root path to MediaWiki and it will take care of the 301 redirect to the main page itself.

So are we stuck with a suboptimal solution? Fortunately, there is a way and it is not even that complicated. I will share example snippets from translatewiki.net configuration how to do it.

Configuring nginx

For nginx, the only thing we need in addition the default wiki short url rewrite is to rewrite / so that it is forwarded to MediaWiki. The configuration below assumes MediaWiki is installed in the w directory under the document root.

location ~ ^/wiki/ {
	rewrite ^ /w/index.php;
}

location = / {
	rewrite ^ /w/index.php;
}

Whole file for the curious.

Configuring MediaWiki

First, in our LocalSettings.php we have the short url configuration:

$wgArticlePath      = "/wiki/$1";
$wgScriptPath       = "/w";

In addition we use hooks to tell MediaWiki to make / the URL for the main page, not to be redirected:

$wgHooks['GetLocalURL'][] = function ( &$title, &$url, $query ) {
	if ( $title->isExternal() && $query != '' && $title->isMainPage() ) {
		$url = '/';
	}
};

// Tell MediaWiki that "/" should not be redirected
$wgHooks['TestCanonicalRedirect'][] = function ( $request ) {
	return $request->getRequestURL() !== '/';
};

This has the added benefit that all MediaWiki generated links to the main page point to the domain root, so you only have one canonical url for the wiki main page. The if block in the middle with strpos checks for problematic characters ? and & and forces them to use the long URLs, because otherwise they would not work correctly with this nginx rewrite rule.

And that’s it. With these changes you can have your main page displayed on your domain without redirect, also keeping it short for users to copy and share. This method should work for most versions of MediaWiki, including MediaWiki 1.26 which forcefully redirects everything that doesn’t match the canonical URL as seen by MediaWiki.

translatewiki.net – harder, better, faster, stronger

I am very pleased to announce that translatewiki.net has been migrated to new servers sponsored by netcup GmbH. Yes, that is right, we now have two servers, both of which are more powerful than the old server.

Since the two (virtual) servers are located in the same data center and other nitty gritty details, we are not making them redundant for the sake of load balancing or uptime. Rather, we have split the services: ElasticSearch runs on one server, powering the search, translation search and translation memory; everything else runs on the other server.

In addition to faster servers and continuous performance tweaks, we are now faster thanks to the migration from PHP to HHVM. The Wikimedia Foundation did this a while ago with great results, but HHVM has been crashing and freezing on translatewiki.net for unknown reasons. Fortunately, recently I found a lead that the issue is related to a ini_set function, which I was easily able to work around while the investigation on the root cause continues.

Non-free Google Analytics confirms that we now server pages faster.

Non-free Google Analytics confirms that we now serve pages faster: the small speech bubble indicates migration day to new servers and HHVM. Effect on the actual page load times observed by users seems to be less significant.

We now have again lots of room for growth and I challenge everyone to make us grow with more translations, new projects or other legitimate means, so that we reach a point where we will need to upgrade again ;). That’s all for now, stay tuned for more updates.

14 more languages “fully” translated this week

This week, MediaWiki’s priority messages have been fully translated in 14 more languages by about a dozen translators, after we checked our progress. Most users in those languages now see the interface of Wikimedia wikis entirely translated.

In two months since we updated the list of priority translations, languages 99+ % translated went from 17 to 60. No encouragement was even needed: those 60 languages are “organically” active, translators quickly rushed to use the new tool we gave them. Such regular and committed translators deserve a ton of gratitude!

However, we want to do better. We did something simple: tell MediaWiki users that they can make a difference, even if they don’t know. «With a couple hours’ work or less, you can make sure that nearly all visitors see the wiki interface fully translated.» The results we got in few hours speak for themselves:

Special:TranslationStats graph of daily registrations

This week’s peak of new translator daily registrations was ten times the usual

Special:TranslationStats of daily active translators

Many were eager to help: translation activity jumped immediately

Thanks especially to CERminator, David1010, EileenSanda, KartikMistry, Njardarlogar, Pymouss, Ranveig, Servien, StanProg, Sudo77(new), TomášPolonec and Чаховіч Уладзіслаў, who completed priority messages in their languages.

For the curious, the steps to solicit activity were:

take languages 80–99 % translated (about 60);
list active users using those languages on Wikimedia wikis (about 600);
send them a short talk page message (example);
welcome and thank them!

There is a long tail of users who see talk page messages only after weeks or months, so for most of those 60 languages we hope to get more translations later. It will be harder to reach the other hundreds languages, for which there are only 300 active users in Wikimedia according to interface language preferences: about 100 incubating languages do not have a single known speaker on any wiki!

We will need a lot of creativity and word spreading, but the lesson is simple: show people the difference that their contribution can make for free knowledge; the response will be great. Also, do try to reach the long tail of users and languages: if you do it well, you can communicate effectively to a large audience of silent and seemingly unresponsive users on hundreds Wikimedia projects.

Oregano deployment tool

This blog post introduces oregano, a non-complex, non-distributed, non-realtime deployment tool. It currently consists of less than 100 lines of shell script and is licensed under the MIT license.

The problem. For a very long time, we have run translatewiki.net straight from a git clone, or svn checkout before that. For years, we have been the one wiki which systematically run latest master, with few hours of delay. That was not a problem while we were young and wild. But nowadays, due to the fact that we carry dozens of local patches and thanks to the introduction of composer, it is quite likely that git pull --rebase will stop in a merge conflict. As a consequence, updates have become less frequent, but have semi-regularly brought the site down for many minutes until the merge conflicts were manually resolved. This had to change.

The solution. I wrote a simple tool, probably re-inventing the wheel for the hundredth time, which separates the current deployment in two stages: preparation and pushing out new code. Since I have been learning a lot about Salt and its quirks, I named my tool “oregano”.

How it works. Basically, oregano is a simple wrapper for symbolic links and rsync. The idea is that you prepare your code in a directory named workdir. To deploy the current state in workdir, you must first create a read-only copy by running oregano tag. After that, you can run oregano deploy, which will update symbolic links so that your web server sees the new code. You can give the name of the tag with both commands, but by default oregano will name a new tag after the current timestamp, and deploy the most recently created tag. If, after deploying, you find out that the new tag is broken, you can quickly go back to the previously deployed code by running oregano rollback. Below this is shown as a command line tutorial.

mkdir /srv/mediawiki/ # the path does not matter, pick whatever you want

cd /srv/mediawiki

# Get MediaWiki. Everything we want to deploy must be inside workdir
git clone https://github.com/wikimedia/mediawiki workdir

oregano tag
oregano deploy

# Now we can use /srv/mediawiki/targets/deployment where we want to deploy
ln -s /srv/mediawiki/targets/deployment /www/example.com/docroot/mediawiki

# To update and deploy a new version
cd workdir
git pull
# You can run maintenance scripts, change configuration etc. here
nano LocalSettings.php

cd .. # Must be in the directory where workdir is located
oregano tag
oregano deploy

# Whoops, we accidentally introduced a syntax error in LocalSettings.php
oregano rollback

As you can see from above, it is still possible to break the site if you don’t check what you are deploying. For this purpose I might add support for hooks, so that one could run syntax checks whose failure would prevent deploying that code. Hooks would also be handy for sending IRC notifications, which is something our existing scripts do when code is updated: as pushing out code is now a separate step, they are currently incorrect.

By default oregano will keep the 4 newest tags, so make sure you have enough disk space. For translatewiki.net, which has MediaWiki and dozens of extensions, each tag takes about 200M. If you store MediaWiki localisation cache, pre-generated for all languages, inside workdir, then you would need 1.2G for each tag. Currently, at translatewiki.net, we store localisation cache outside workdir, which means it is out of sync with the code. We will see if that causes any issues; we will move it inside workdir if needed. Do note that oregano creates a tag with rsync --cvs-exclude to save space. That also has the caveat that you should not name files or directories as core. Be warned; patches welcome.

The code is in the translatewiki repo but, if there is interest, I can move it to a separate repository in GitHub. Oregano is currently used in translatewiki.net and in a pet project of mine nicknamed InTense. If things go well, expect to hear more about this mysterious pet project in the future.

Midsummer cleanup: YAML and file formats, HHVM, translation memory

Wikimania 2014 is now over and that is a good excuse to write updates about the MediaWiki Translate extension and translatewiki.net.
I’ll start with an update related to our YAML format support, which has always been a bit shaky. Translate supports different libraries (we call them drivers) to parse and generate YAML files. Over time the Translate extension has supported four different drivers:

spyc uses spyc, a pure PHP library bundled with the Translate extension,
syck uses libsyck which is a C library (hard to find any details) which we call by shelling out to P erl,
syck-pecl uses libsyck via a PHP extension,
phpyaml uses the libyaml C library via a PHP extension.

The latest change is that I dropped syck-pecl because it does not seem to compile with PHP 5.5 anymore; and I added phpyaml. We tried to use sypc a bit but the output it produced for localisation files was not compatible with Ruby projects: after complaints, I had to find an alternative solution.

Joel Sahleen let me know of phpyaml, which I somehow did not found before: thanks to him we now use the same libyaml library that Ruby projects use, so we should be fully compatible. It is also the fastest driver of the four. Anyone generating YAML files with Translate is highly recommended to use the phpyaml driver. I have not checked how phpyaml works with HHVM but I was told that HHVM ships with a built-in yaml extension.

Speaking of HHVM, the long standing bug which causes HHVM to stop processing requests is still unsolved, but I was able to contribute some information upstream. In further testing we also discovered that emails sent via the MediaWiki JobQueue were not delivered, so there is some issue in command line mode. I have not yet had time to investigate this, so HHVM is currently disabled for web requests and command line.

I have a couple of refactoring projects for Translate going on. The first is about simplifying the StringMangler interface. This has no user visible changes, but the end goal is to make the code more testable and reduce coupling. For example the file format handler classes only need to know their own keys, not how those are converted to MediaWiki titles. The other refactoring I have just started is to split the current MessageCollection. Currently it manages a set of messages, handles message data loading and filters the collection. This might also bring performance improvements: we can be more intelligent and only load data we need.

Aiming high: creating a translation memory that works for Wikipedia; even though a long way from here (photo Marie-Lan Nguyen, CC BY 3.0)

Finally, at Wikimania I had a chance to talk about the future of our translation memory with Nik Everett and David Chan. In the short term, Nik is working on implementing in ElasticSearch an algorithm to sort all search results by edit distance. This should bring translation memory performance on par with the old Solr implementation. After that is done, we can finally retire Solr at Wikimedia Foundation, which is much wanted especially as there are signs that Solr is having problems.

Together with David, I laid out some plans on how to go beyond simply comparing entire paragraphs by edit distance. One of his suggestions is to try doing edit distance over words instead of characters. When dealing with the 300 or so languages of Wikimedia, what is a word is less obvious than what is a character (even that is quite complicated), but I am planning to do some research in this area keeping the needs of the content translation extension in mind.

Translatewiki.net summer update

It’s been a busy while since last update, but how could I have not worked on translatewiki.net? ;) Here is an update on my current activities.
In this episode:

we provide translations for over 70 % of users of the new Wikipedia app,
I read a book on networking performance and get needy for speed,
ElasticSearch tries to eat all of us and our memory,
HHVM finds the place not fancy enough,
Finns and Swedes start cooperating.

Performance

Naturally, I have been thinking of ways to further improve translatewiki.net performance. I have been running HHVM as a beta feature at translatewiki.net many months now, but I have kept turning it on and off due to stability issues. It is currently disabled, but my plan is to try the Wikimedia packaged version of HHVM. Those packages only work in Ubuntu 2014.04, so Siebrand and I first have to upgrade the translatewiki.net server from Ubuntu 2012.04, as we plan to later this month (July). (Update: done as of 2014-07-09, 14 UTC.)

Map of some translatewiki.net translators

A global network of translators is not served well enough from a single location

After reading a book about networking performance I finally decided to give a content distribution network (CDN) a try. Not because they can optimize and cache things on the fly [1], nor because the can do spam protection [2], but because CDN can reduce latency, which is usually the main bottleneck of web browsing. We only have single server in Germany, but our users are international. I am close to the server, so I have much better experience than many of our users. I do not have any numbers yet, but I will do some experiments and gather some numbers to see whether CDN helps us.

[1] MediaWiki is already very aggressive in terms of optimizations for resource delivery.
[2] Restricting account creation already eliminated spam on our wiki.

Wikimedia Mobile Apps

Amir and I have been closely working with the Wikimedia Mobile Apps team to ensure that their apps are well supported. In just a couple weeks, the new app was translated in dozens languages and released, with over 7 millions new installations by non-English users (74 % of the total).

In more detail, we finally addressed a longstanding issue in the Android app which prevented translation of strings containing links. I gave Yuvi access to synchronize translations, ensuring that translators have as much time as possible to translate and the apps have the latest updates before being released. We also discussed about how to notify translators before releases to get more translations in time, and about improvements to their i18n frameworks to bring their flexibility more in line with MediaWiki (including plural support).

To put it bluntly, for some reason the mobile i18n frameworks are ugly and hard to work with. Just as an example, Android did not support many languages at all just for one character too much; support is still partial. I can’t avoid comparing this to the extra effort which has been needed to support old versions of Internet Explorer: we would rather be doing other cool things, but the environment is not going to change anytime soon.

Search

I installed and enabled CirrusSearch on translatewiki.net: for the first time, we have a real search engine for all our pages! I had multiple issues, including running a bit tight on memory while indexing all content.

Translate’s translation memory support for ElasticSearch has been almost ready for a while now. It may take a couple months before we’re ready to migrate from Solr (first on translatewiki.net, then Wikimedia sites). I am looking forward to it: as a system administrator, I do not want to run both Solr and ElasticSearch.

I want to say big thanks to Nik for helping both with the translation memory ElasticSearch backend and my CirrusSearch problems.

Wikimedia Sweden launches a new project

I am expecting to see an increased activity and new features at translatewiki.net thanks to a new project by Wikimedia Sweden together with InternetFonden.Se. The project has been announced on the Wikimedia blog, but in short they want to bring more Swedish translators, new projects for translation and possibly open badges to increase translator engagement. They are already looking for feedback, please do share your thoughts.

Summary of Translate workshop at Zürich hackathon

The hall always provided power and wifi for eager hackers (photo CC-BY-SA by Ludovic Péron)

I held a Translate workshop at the Zürich hackathon. Naturally, others and I worked on Translate and translatewiki.net outside of the workshop as well. Here is a summary of the outcomes.

The workshop itself consisted of three topics of interest. I gave an introduction about the Content translation project, going over the basic design and features, followed by a Q&A. We then split into three small groups. One group continued talking about translating content in wider scope. The second group went over how to add new projects to translatewiki.net, using Huggle and Sharelatex as a concrete example. The third group consisted of me helping with programming questions about the Translate extension.

During the whole hackathon people worked on about 20 bugs and patches. I started a patch for glossary support in the Translate extension: a proof of concept, as simple as possible.

Numbers on translatewiki.net sign-up process

Translatewiki.net features a good user experience for non-technical translators. A crucial or even critical component is signing up. An unrelated data collection for my PhD studies inspired me to get some data on the translatewiki.net user registration process. I will present the results below.

History

At translatewiki.net the process of becoming an approved translator has been, arguably, complicated in some periods.

In the early days of the wiki, permissions were not clearly separated: hundreds users were just given the full set of permissions to edit the MediaWiki namespace and translate that way.

Later, we required people to go through hoops of various kind after registering to be approved as translators. They had to create a user page with certain elements and post a request on a separate page and they would not get notifications when they were approved unless they tweaked their preferences.

At some point, we started using the LiquidThreads extension: now the users could get notifications when approved, at least in theory. That brought its own set of issues though: many people thought that the LiquidThreads search box on the requests page was the place where to write the title of their request. After entering a title, they ended up in a search results page, which was a dead end. This usability issue was so annoying and common that I completely removed the search field from LiquidThreads.
In early 2010 we implemented a special page wizard (FirstSteps) to guide users though the process. For years, this has allowed new users to get approved, and start translating, in few clicks and a handful hours after registering.

In late 2013 we enabled the new main page containing a sign-up form. Using that form, translators can create an account in a sandbox environment. Accounts created this way are normal user accounts except that they can only make example translations to get a feel of the system. Example translations give site administrators some hints on whether to approve or reject the request and approve the user as a translator.

Data collection

The data we have is not ideal.

For example, it is impossible to say what’s our conversion rate from users visiting the main page to actual translators.
A lot of noise is added by spam bots which create user accounts, even though we have a CAPTCHA.
When we go far back in the history, the data gets unreliable or completely missing.
- We only have dates for account created after 2006 or so.
- The log entry format for user permissions has changed multiple times, so the promotion times are missing or even incorrect for many entries until a few years back.

The data collection was made with two scripts I wrote for this purpose. The first script produces a tab separated file (tsv) containing all accounts which have been created. Each line has the following fields:

username,
time of account creation,
number of edits,
whether the user was approved as translator,
time of approval and
whether they used the regular sign-up process or the sandbox.

Some of the fields may be empty because the script was unable to find the data. User accounts for which we do not have account creation time are not listed. I chose not to try some methods which can be used to approximate the account creation time, because the data in that much past is too unreliable to be useful.

The first script takes a couple of minutes to run at translatewiki.net, so I split further processing to a separate script to avoid doing the slow data fetching many times. The second script calculates a few additional values like average and median time for approval and aggregates the data per month.

The data also includes translators who signed up through the sandbox, but got rejected: this information is important for approval rate calculation. For them, we do not know the exact registration date, but we use the time they were rejected instead. This has a small impact on monthly numbers, if a translator registers in one month and gets rejected in a later month. If the script is run again later, numbers for previous months might be somewhat different. For approval times there is no such issue.

Results

Image 1: Account creations and approved translators at translatewiki.net

Image 1 displays all account creations at translatewiki.net as described above, simply grouped by their month of account creation.

We can see that approval rate has gone down over time. I assume this is caused by spam bot accounts. We did not exclude them hence we cannot tell whether the approval rate has gone up or down for human users.

We can also see that the number of approved translators who later turn out to be prolific translators has stayed pretty much constant each month. A prolific translator is an approved translator who has made at least 100 edits. The edits can be from any point of time, the script is just looking at current edit count so the graph above doesn’t say anything about wiki activity at any point in time.

There is an inherent bias towards old users for two reasons. First, at the beginning translators were basically invited to a new tool from existing methods they used, so they were likely to continue to translate with the new tool. Second, new users have had less time to reach 100 edits. On the other hand, we can see that a dozen translators even in the past few months have already made over 100 edits.

I have collected some important events below, which I will then compare against the chart.

2009: Translation rallies in August and December.
2010-02: The special page to assist in filing translator requests was enabled.
2010-04: We created a new (now old) main page.
2010-10: Translation rally.
2011: Translation rallies in April, September and December.
2012: Translation rallies in August and December.
2013-12: The sandbox sign-up process was enabled.

There is an increase in account creations and approved translators a few months after the assisting special page was enabled. The explanation of this is likely to be the new main page which had a big green button to access the special page. The September translation rally in 2011 seems to be very successful in requiting new translators, but also the other rallies are visible in the chart.

Image 2: How long it takes for account creation to be approved.

The second image shows how long it takes from the account creation for a site administrator to approve the request. Before sandbox, users had to submit a request to become translators on their own: the time for them to do so is out of control of the site administrators. With sandbox, that is much less the case, as users get either approved or rejected in a couple of days. Let me give an overview of how the sandbox works.

All users in the sandbox are listed on a special page together with the sandbox translations they have made. The administrators can then approve or reject the users. Administrators usually wait until the user has made a handful translations. Administrators can also send email reminders for the users to make more translations. If translators do not provide translations within some time, or the translations are very bad, they will get rejected. Otherwise they will be approved and can immediately start using the full translation interface.

We can see that the median approval time is just a couple of hours! The average time varies wildly though. I am not completely sure why, but I have two guesses.
First, some very old user accounts have reactivated after being dormant for months or years and have finally requested translator rights. Even one of these can skew the average significantly. On a quick inspection of the data, this seems plausible.
Second, originally we made all translators site administrators. At some point, we introduced the translator user group, and existing translators have gradually been getting this new permission as they returned to the site. The script only counts the time when they were added to the translator group.
Alternatively, the script may have a bug and return wrong times. However, that should not be the case for recent years because the log format has been stable for a while. In any case, the averages are so big as to be useless before the year 2012, so I completely left them out of the graph.

The sandbox has been in use only for a few months. For January and February 2014, the approval rate has been slightly over 50%. If a significant portion of rejected users are not spam bots, there might be a reason for concern.

Suggested action points

Store the original account creation date and “sandbox edit count” for rejected users.
Investigate the high rejection rate. We can ask the site administrator why about a half of the new users are rejected. Perhaps we can also have “mark as spam” action to get insight whether we get a lot of spam. Event logging could also be used, to get more insight on the points of the process where users get stuck.

Source material

Scripts are in Gerrit. Version ‘2’ of the scripts was used for this blog post. Processed data is in a LibreOffice spreadsheet. Original and updated data is available on request, please email me.

Performance is a feature

In case you haven’t already noticed, I like working on performance issues and performance improvements. Performance is a thing where you have to consider the whole stack: the speed of the server, efficient algorithms, server side caching, bandwidth and latency, client side caching and client side code. Here is a short recap of what has been done for translatewiki.net lately and some ideas for the future.

Recent improvements

Chrome 29 (or later release) has added a helpful visualization for profiling data. In this image the speed of ULS JavaScript code is evaluated on a fonts heavy page. Comparing to the collapsible tabs feature, it is doing okay.

Server level. A month ago translatewiki.net got a new server with more memory and faster processors. The main benefit is that we can handle more simultaneous users and background tasks without them slowing each other down. At the same time, we upgraded many of the programs to newer versions. The switch from MySQL to MariaDB is the most important one. We haven’t tested it for our use case, but the Wikimedia Foundation found that the switch had overall positive impact on performance.

Web server level. In the beginning of November I configured our nginx web server to enable support for the SPDY protocol. This should greatly reduce latency when browsing over HTTPS. We are considering to switch to HTTPS by default. While tweaking nginx, I also fixed a few settings that relate on the compression and expiry times of JavaScript, SVG images and font assets when delivered to users. I used AWStats to see if our daily bandwidth usage decreased. It has not decreased significantly, but there is a lot of variation between days that make interpreting the data difficult. PageSpeed was used to ensure that caching headers are optimal and WebPagetest to confirm that pages load faster on different browsers in different places.

Application level. The Language Engineering team has recently worked a lot on the performance of Universal Language Selector (ULS) and Translate extensions. A short summary of the things which were done:

Reduce the amount of JavaScript and CSS delivered to the browser.
Delay the loading of JavaScript and CSS as much as possible (for example till the user opens ULS).
Optimize JPG, SVG and PNG images to the last byte with tools like jpegoptim, optipng.
Optimize the JavaScript to avoid slow actions (for example repaint events and dom changes). We used Chrome’s JavaScript profiler as well as the experimental tool “show potential scroll bottlenecks” to identify issues and confirm the fixes (thanks Ori).

In addition I fixed a major performance issue in one of the Translate API modules by replacing an inefficient algorithm with a faster one. While investigating that issue, I also noticed that ReplacementArray-strtr was taking 20% or so of MediaWiki run time. There is a less known PHP module FastStringSearch, which was not installed on the new server. Installing that module made a big difference on the MediaWiki profiling table: ReplacementArray-fss is now taking only about 0.20% of MediaWiki run time.

Finally, a thing called module local storage was enabled on Wikimedia wikis few days ago (the title of this post was taken from that discussion). As is usual for translatewiki.net, we were already beta testing that feature a few weeks before it went live on Wikimedia wikis.

Future plans

It is hard to plan the future for further performance improvements, as the bottlenecks and the places where you can make the most difference for the least effort change constantly, together with the technology and your content. I believe that HHVM, a JIT PHP virtual machine, is likely to be the next step which will make a significant difference. It is however not a straightforward thing to jump from a normal PHP intepreter to HHVM, so I will be keeping a close eye on how my colleagues at the Wikimedia Foundation are progressing with the adoption of HHVM.

Another relatively small thing on the horizon is better compression of inline SVG images in CSS style sheets, by avoiding unnecessary base64 encoding. Or something else might happen even before it.

Finally, I’d like to highlight that while the application-level improvements automatically benefit third party users, there really isn’t any coherent documentation on how to improve performance of a MediaWiki site at all levels. Configuring localisation cache, nginx and/or Varnish, tweaking MySQL or MariaDB and installing Memcached or Redis should be part of any capable sysadmin’s skills; but even just tailoring them for MediaWiki, let alone knowing which PHP modules to install, is likely not known by many. For example, I wouldn’t be surprised if there were very few or even no sites using the FastStringSearch module outside of Wikimedia and translatewiki.net.

Pet project: Optimizing message index to the last byte

The message index is a crucial component of Translate, so I made an experiment by implementing a trie store for the message index to optimize it. The short story is that I could not get it fast enough for practical use easily. Continue for full story.

Pet projects

A tree in Helsinki (October) showing something tries can’t produce: wonderful fall colours (ruska in Finnish)

For context, in our development team each developer has time for experimentation, outside of the planned development sprint tasks. During that time the developer can try out new technologies, fix issues that are important to them personally or just do something fun and interesting. We call these pet projects and they let us do some cool things.

For example, the insertables I described in my previous blog post are something I did as a pet project. Insertables were actually part of the original translation UX (TUX) design specifications, but they were not implemented because of other priorities. I decided to implement them because users (not managers) were asking for it. I wasn’t convinced initially, but when I saw users translating with tablets I changed my mind. Insertables were a good pet project because they were relatively small and fun a thing to do.

This is all I have to say about pet projects – the non technical readers can skip the rest of this post, where I go into the details of this pet project.

Message index

I probably have introduced the message index in my earlier posts, but let me do it again quickly. I’ll use an example for this. Let’s assume we have a small software called Greeter. It has a localisation file like this:

# l10n/en/greetings.properties
greeting.noon = Good day
greeting.morning = Good morning
greeting.evening = Good evening
greeting.night = Good night

When this kind of file is set up with the Translate extension (for instance in translatewiki.net), each string is stored as a wiki page. Each translation is a separate page, too.

translatewiki.net/wiki/Greeter:greeting.noon/en -> “Good day”
translatewiki.net/wiki/Greeter:greeting.noon/fi -> “Hyvää päivää”

The bolded parts are called page titles in MediaWiki. The message index can be defined simply as a map from the page title of each known message (without the language code) to the message group it belongs to. If we printed it out it would look something like this:

1244:greeting-noon => [greeter]
1244:greeting-morning => [greeter]
1244:greeting-evening => [greeter]
1244:greeting-night => [greeter]

So, every time someone adds a new message for translation, we need to update the message index. Every time someone makes a translation, we need to query the message index. The user is waiting, so both of these actions need to be fast, while using a reasonable amount of memory.

Implementations problems

When we get to the order of 50 000 or even more known messages, creation and accessing of the message index starts to get slow in PHP, even though it’s basically just a lot of strings, and string processing should be fast, right? Not so in PHP, where holding the message index as an array of arrays takes tens of megabytes in memory. An array in php is kind of a mix of hashtable and linked list. It uses more memory for extra features and versatility.. In the case of message index we would gladly like to trade some features for reduced memory usage.

There are many aspects in message index optimization, but so far I haven’t found a solution without downsides. If the whole index was small enough, it could be kept in memory, making things faster; but currently it can only be stored in various kinds of databases, that allow querying the index one title at the time.

Currently at translatewiki.net we are using CDB files, which are immutable databases stored on a file on the file system. This is okay for our use case: the index is accessed from disk; only when the data changes, you have to build the whole thing from scratch and you have to worry about memory usage and speed. The current problem we have with this approach is that it takes a lot of memory to recreate it, and the few second running time is on the borderline of acceptable speed for having user to wait for it. There isn’t too much room for growth.

To reach the current state, I’ve tried using references to store the group names to avoid repeating them and storing the resulting array in a serialized file. I’ve tried storing the whole structure in a database table, which works well to certain amount of messages. This time I’m going to try something else. The idea is to save space by considering that the message keys share a lot of substrings, for instance the messages of a MediaWiki extension having all keys prefixed with the extension’s name. I decided to use tree structures to experiment.

Trees and tries

Disclaimer: I haven’t studied algorithms in depth so I’m just trying to apply what I know.

We can represent all the relationships between message names and their groups as a set of mostly similar strings which may share common prefixes. I could have used a tree, but I decided to use a trie. A trie is a tree where consecutive nodes which only have one child are merged together. Here is an example of how the message index above would look like in a trie (first image), compared to the full tree (second image). As you can see, the trie is more compact compared to the tree because it has less nodes and branches. The trie is also more compact than an array as the common prefixes can be stored only once and we are not using any hashes which are used in arrays. Click for full size.

To create a message index using tries, I started by googling if there are any algorithms already implemented in PHP for constructing tries. I could not find any, so I just converted into PHP a Python script (which was likely converted from Java). Then I implemented a custom binary format that could be stored in a file and a custom lookup that would use the data loaded from the file into a memory.
I tried many options for optimizing the creation of the trie while minimizing the storage consumption.

One of the curious things was that, when inserting a new string to the trie, it is faster to loop over all the current children of the node comparing the first letter of the child against the first letter of the string we are inserting, rather than to use binary search to find the correct insertion point. The latter would mean keeping the list of children sorted and doing less comparisons by using binary search when doing lookups and insertions. I assume this is because inserting at the end of the array is fast, but inserting in the middle of the array (to keep it sorted) is slow because (my guess) PHP either recreates the array or updating the linked list pointers is slow for some other reason.

For the storage format I tried various kinds of indexes of strings to store the substrings only once, but all the pointers to the strings and child nodes also take a lot of space (4 bytes per pointer, where 4 bytes can also store four characters assuming ascii keys). I’m sure more space savings could be gained by experimenting with alignments so that smaller pointers could be used. Maybe it would be possible borrow some of the algorithms designed to optimize finite state automata – I believe those are much better than what I can do on my own.

Here are some numbers (approximate because I ran out of time to measure properly) on how it compares to the CDB message index solution:

Property	CDB	Trie
Size on disk	6 MiB	1.5 MiB (0.5 compressed with gzip)
Time to create	1 second	7 seconds

For now I declare this pet project as something that cannot be used. Maybe some day I will get back to it and try make it good enough for real use, but now I already have other interesting pet projects in my mind. If I get suggestions from you how to reach practical solutions, I will of course try them out sooner. I just want to mention that there a many things that could still be explored: QuickHash, constant hash database or finding ways to store group information so that message index is not needed at all.

-- Niklas Laxström.

It rains like a saavi

About me, me and me