
Pet project: Optimizing message index to the last byte

The message index is a crucial component of Translate, so as an experiment I implemented a trie store for it to try to optimize it. The short story is that I could not easily get it fast enough for practical use. Continue for the full story.

Pet projects

A tree in Helsinki (October) showing something tries can’t produce: wonderful fall colours (ruska in Finnish)

For context, in our development team each developer has time for experimentation, outside of the planned development sprint tasks. During that time the developer can try out new technologies, fix issues that are important to them personally or just do something fun and interesting. We call these pet projects and they let us do some cool things.

For example, the insertables I described in my previous blog post are something I did as a pet project. Insertables were actually part of the original translation UX (TUX) design specifications, but they were not implemented because of other priorities. I decided to implement them because users (not managers) were asking for it. I wasn’t convinced initially, but when I saw users translating with tablets I changed my mind. Insertables were a good pet project because they were a relatively small and fun thing to do.

This is all I have to say about pet projects – non-technical readers can skip the rest of this post, where I go into the details of this particular pet project.

Message index

I have probably introduced the message index in earlier posts, but let me do it again quickly with an example. Let’s assume we have a small piece of software called Greeter. It has a localisation file like this:

# l10n/en/greetings.properties
greeting.noon = Good day
greeting.morning = Good morning
greeting.evening = Good evening
greeting.night = Good night

When this kind of file is set up with the Translate extension (for instance in translatewiki.net), each string is stored as a wiki page. Each translation is a separate page, too.

translatewiki.net/wiki/Greeter:greeting.noon/en -> “Good day”
translatewiki.net/wiki/Greeter:greeting.noon/fi -> “Hyvää päivää”

The parts after /wiki/ (Greeter:greeting.noon/en and so on) are called page titles in MediaWiki. The message index can be defined simply as a map from the page title of each known message (without the language code) to the message group it belongs to. If we printed it out, it would look something like this:

1244:greeting-noon => [greeter]
1244:greeting-morning => [greeter]
1244:greeting-evening => [greeter]
1244:greeting-night => [greeter]
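Conceptually (this is only an illustration, not the actual implementation) the whole index is just an associative array, and a query is a single key lookup:

// The index maps "namespaceId:title" keys (without the language code) to
// the ids of the message groups the message belongs to.
$messageIndex = array(
	'1244:greeting-noon' => array( 'greeter' ),
	'1244:greeting-morning' => array( 'greeter' ),
	'1244:greeting-evening' => array( 'greeter' ),
	'1244:greeting-night' => array( 'greeter' ),
);

// Looking up the groups for a translation of Greeter:greeting.noon.
$groups = isset( $messageIndex['1244:greeting-noon'] )
	? $messageIndex['1244:greeting-noon']
	: array();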

So, every time someone adds a new message for translation, we need to update the message index. Every time someone makes a translation, we need to query the message index. The user is waiting, so both of these actions need to be fast, while using a reasonable amount of memory.

Implementation problems

When we get to the order of 50 000 or more known messages, creating and accessing the message index starts to get slow in PHP, even though it’s basically just a lot of strings, and string processing should be fast, right? Not so in PHP, where holding the message index as an array of arrays takes tens of megabytes of memory. An array in PHP is a kind of mix of a hashtable and a linked list, and it uses more memory to pay for those extra features and its versatility. In the case of the message index we would gladly trade some features for reduced memory usage.

There are many aspects to message index optimization, but so far I haven’t found a solution without downsides. If the whole index were small enough, it could be kept in memory, making things faster; currently it can only be stored in various kinds of databases that allow querying the index one title at a time.

Currently at translatewiki.net we are using CDB files, which are immutable databases stored in a single file on the file system. This is okay for our use case: the index is accessed from disk, and only when the data changes do you have to build the whole thing from scratch and worry about memory usage and speed. The current problem with this approach is that recreating the index takes a lot of memory, and the few seconds it runs are on the borderline of an acceptable time to make the user wait. There isn’t much room for growth.
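This is not the code we actually use (MediaWiki ships its own CDB reader and writer classes), but PHP’s dba extension with its cdb handlers shows the access pattern: rebuild everything when the data changes, then query one key at a time.

// Rebuild: write every key; the file is then immutable until the next rebuild.
$index = array( '1244:greeting-noon' => 'greeter', '1244:greeting-night' => 'greeter' );
$writer = dba_open( '/tmp/messageindex.cdb', 'n', 'cdb_make' );
foreach ( $index as $title => $group ) {
	dba_insert( $title, $group, $writer );
}
dba_close( $writer );

// Query: one title at a time, without loading the whole index into memory.
$reader = dba_open( '/tmp/messageindex.cdb', 'r', 'cdb' );
$group = dba_fetch( '1244:greeting-noon', $reader );
dba_close( $reader );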

To reach the current state, I’ve tried using references to store the group names, to avoid repeating them, and storing the resulting array in a serialized file. I’ve tried storing the whole structure in a database table, which works well up to a certain number of messages. This time I’m going to try something else. The idea is to save space by exploiting the fact that the message keys share a lot of substrings, for instance all the keys of a MediaWiki extension being prefixed with the extension’s name. I decided to experiment with tree structures.

Trees and tries

Disclaimer: I haven’t studied algorithms in depth so I’m just trying to apply what I know.

We can represent all the relationships between message names and their groups as a set of mostly similar strings which may share common prefixes. I could have used a plain tree, but I decided to use a trie: a tree where consecutive nodes that have only one child are merged together. Here is how the message index above would look as a trie (first image), compared to the full tree (second image). As you can see, the trie is more compact than the tree because it has fewer nodes and branches. The trie is also more compact than an array, as the common prefixes are stored only once and we are not using the hashes that arrays need.

[Images: the message index as a trie and as a full tree]

To create a message index using tries, I started by googling whether any trie construction algorithms had already been implemented in PHP. I could not find any, so I converted a Python script (which had likely itself been converted from Java) into PHP. Then I implemented a custom binary format that could be stored in a file, and a custom lookup that uses the data loaded from that file into memory.
I tried many options for optimizing the creation of the trie while minimizing the storage size.

One of the curious things was that, when inserting a new string into the trie, it is faster to loop over all the current children of a node, comparing the first letter of each child against the first letter of the string being inserted, than to use binary search to find the correct insertion point. The latter would mean keeping the list of children sorted and doing fewer comparisons during lookups and insertions. I assume this is because appending at the end of a PHP array is fast, while inserting in the middle (to keep it sorted) is slow, either because PHP recreates the array or because updating the internal linked list pointers is slow for some other reason.
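A minimal sketch of that insertion strategy (a plain character-level trie here for brevity; the real experiment merged single-child chains into one node):

// Each node is array( 'char' => ..., 'children' => array(), 'groups' => array() ).
function trieInsert( array &$node, $key, $groupId ) {
	if ( $key === '' ) {
		$node['groups'][] = $groupId;
		return;
	}

	$first = $key[0];
	// Linear scan over the children, comparing first characters. Appending a
	// missing child at the end of the array turned out to be faster than
	// keeping the children sorted for binary search.
	foreach ( $node['children'] as $i => $child ) {
		if ( $child['char'] === $first ) {
			trieInsert( $node['children'][$i], (string)substr( $key, 1 ), $groupId );
			return;
		}
	}

	$node['children'][] = array( 'char' => $first, 'children' => array(), 'groups' => array() );
	$last = count( $node['children'] ) - 1;
	trieInsert( $node['children'][$last], (string)substr( $key, 1 ), $groupId );
}

// $root = array( 'children' => array(), 'groups' => array() );
// trieInsert( $root, '1244:greeting-noon', 'greeter' );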

For the storage format I tried various kinds of string indexes to store each substring only once, but all the pointers to the strings and child nodes also take a lot of space (4 bytes per pointer, where 4 bytes could instead store four characters, assuming ASCII keys). I’m sure more space savings could be gained by experimenting with alignment so that smaller pointers could be used. Maybe it would be possible to borrow some of the algorithms designed to optimize finite state automata – I believe those are much better than anything I can come up with on my own.
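For a rough idea of where the bytes go (this is not the exact format I used), each serialized node stores its label once, but every child link still costs a fixed-size pointer:

// Serialize one node: length-prefixed label, child count, then one 32-bit
// offset per child. Each pack( 'N', ... ) pointer is 4 bytes, the same space
// as four ASCII characters of key material.
function serializeNode( $label, array $childOffsets ) {
	$blob = chr( strlen( $label ) ) . $label;
	$blob .= chr( count( $childOffsets ) );
	foreach ( $childOffsets as $offset ) {
		$blob .= pack( 'N', $offset );
	}
	return $blob;
}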

Here are some numbers (approximate, because I ran out of time to measure properly) on how the trie compares to the CDB message index solution:

Property        CDB        Trie
Size on disk    6 MiB      1.5 MiB (0.5 MiB compressed with gzip)
Time to create  1 second   7 seconds

For now I declare this pet project as something that cannot be used in practice. Maybe some day I will get back to it and try to make it good enough for real use, but for now I already have other interesting pet projects in mind. If you have suggestions on how to reach a practical solution, I will of course try them out sooner. I just want to mention that there are many things that could still be explored: QuickHash, a constant hash database, or finding ways to store the group information so that the message index is not needed at all.

Review of Gettext po(t) file format

Gettext shows its age in both developer and translator friendliness. What’s wrong with the well-known old localisation file formats that Google and Mozilla, among others, are so keen to replace? I don’t have a full answer to that. Gettext is clearly quite inflexible compared to Mozilla’s file format (which is almost a programming language), and it does not support many of the new features in Google’s resource bundles.

My general recommendation is: use the file format best supported by your i18n framework. If you can choose, prefer key-based formats. Only try new file formats if you need the new features, because tool support for them is not as good. It is also unclear which of the new file formats will “win” and become popular.

When making something new, it is good to look back. I originally wrote this post out of annoyance while writing a tool that supports this format, but the context I’m going to give here is completely different. The post has been waiting as a draft for a long time because it lacked a context in which it makes sense. Maybe it also helps people who are wondering what localisation file format they should use.

Enough of the general thoughts. Let’s start this evaluation with the good things:

  • Plural support for many languages. The plural syntax is flexible enough to cover most if not all of the world’s languages.
  • Fuzzy translations. There is a standard way to mark outdated translations, which is a necessity for a format that does not identify strings.
  • Tool support. Gettext can be used from many programming languages and there are plenty of tools for translators.

And then the things I don’t like:
  • Strings have no identifiers. This is my biggest annoyance with Gettext. Strings are identified by their contents, which means that fixing a typo in the source invalidates all translations. It also makes it impossible to keep any track of history. This causes another problem: identical strings are collapsed by default. That is especially annoying because in English words like Open (action) and Open (state) are written the same, while in other languages they are different. This effectively prevents proper translation unless a message context is provided, but here lies another problem: not all implementations support passing a context. Last time I checked, this was the case at least in Python. (See the PO excerpt after this list.)
  • One nasty corner case for tool makers is that an empty context is different from no context. If you don’t handle this right, you will produce invalid Gettext files.
  • I listed plural support as a plus above, but it is not without problems. One string can only have plural forms that depend on a single variable. This forces developers to use lego sentences when there is more than one number, or forces translators to make ungrammatical translations. Not to mention that in Arabic and other languages with five or even more plural forms, you need to repeat the whole string that many times with small changes. That is a lot of overhead to update and proofread, compared to an inline syntax where you only mark the differences. To be fair, with an inline syntax it might be hard to see how each plural form looks in full, but there are solutions to that.
  • There is no standard way to present authorship information except for the last translator. The file header is essentially free-form text, which makes it hard to process and update that information programmatically. To be fair, this is the case for almost all i18n file formats I’ve seen.
  • The comments for individual strings are funky. There are different kinds of comments, starting with “#,” or “#|”, which are not documented anywhere as far as I know, and the order of the different kinds of comments matters! Do it wrong and you’ll have a file that some tools refuse to use. Not to mention that developers can also leave comments for the translators, in addition to the context parameter (so there are two ways!); translators might or might not see them depending on the tool they use and on what is propagated from the pot file to the po file. It is quite a hassle to keep these comments in sync and repeated in all the translation files.
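To make the context and plural points concrete, here is a hand-written, purely illustrative PO excerpt (hypothetical strings, Finnish translations). Note the different comment types and how every plural form repeats the whole sentence:

#. Developer comment extracted from the source code
#, fuzzy
msgctxt "verb"
msgid "Open"
msgstr "Avaa"

msgid "Deleted %d page"
msgid_plural "Deleted %d pages"
msgstr[0] "Poistettiin %d sivu"
msgstr[1] "Poistettiin %d sivua"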

I’m curious to hear whether you would like to see more of these evaluations and perhaps a comparison of the formats. If there isn’t much interest I likely won’t do more.

Insertables in Translate make translating easier

Insertables are a new tool to easily copy some text from the source language to your translation with one click.

Have you ever translated anything with the Translate extension? Did it contain markup like this?

[http://very.long.url/here link description]
{{GENDER:$1|he|she}} posted $2 on $3

If so, then you know what this is about. Have you ever translated anything with the Translate extension while using a tablet or another device without a physical keyboard? If so, then you likely know why this is interesting.

When you translate text written in wiki markup, or software interface strings, you will encounter the examples above, and many more parts which you need to copy verbatim while translating. These parts contain special characters like braces, dollar signs, brackets, pipes and so on. These characters are cumbersome to type on non-English keyboards, where they have been moved to harder-to-reach key combinations in favour of local characters – if they exist in the layout at all. If they don’t exist in the keyboard layout, you need to switch keyboard layouts just to type a few characters and then switch back.

Does this sound cumbersome? Many translators in fact do not do that, but instead they copy and paste the text from the source text. On tablets however, copy and paste itself is a cumbersome thing. Insertables are a solution to this usability issue.

We can automatically identify parts of the translatable text which have two properties: they should not be changed, and they are difficult to type. We can then present these parts of the string as buttons near the translation; clicking or pressing such a button inserts the text into the translation. These buttons complement the insert-source-text button and are optional to use, like all the translation helpers we provide.
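A rough sketch of the idea (not the actual Translate code): pick out MediaWiki-style variables and magic words so they can be offered as one-click buttons.

// Collect pieces of the source text that must be copied verbatim and are
// hard to type: $1-style variables and {{PLURAL/GRAMMAR/GENDER:...}} words.
function findInsertables( $text ) {
	$insertables = array();

	if ( preg_match_all( '/\$\d+/', $text, $m ) ) {
		$insertables = array_merge( $insertables, $m[0] );
	}

	if ( preg_match_all( '/\{\{(?:PLURAL|GRAMMAR|GENDER):[^}]*\}\}/', $text, $m ) ) {
		$insertables = array_merge( $insertables, $m[0] );
	}

	return array_values( array_unique( $insertables ) );
}

// findInsertables( '{{GENDER:$1|he|she}} posted $2 on $3' )
// returns array( '$1', '$2', '$3', '{{GENDER:$1|he|she}}' )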

Happy translator using the new feature

As of now, we only detect a few types of these insertables: plural and grammar magic words, and variables in MediaWiki style ($1). Read the Translate documentation to learn how to contribute more insertables.

First QUnit test for Translate extension – with tutorial

It’s about time the Translate extension got QUnit tests: the amount of JavaScript in it has exploded in the past year. Here is a quick intro on how to add QUnit tests to a MediaWiki extension that doesn’t have any yet.

Step 1: Create a tests directory.

The Translate extension already has a tests/ directory with a lot of PHPUnit tests. For now I just created a qunit subdirectory under it.

Step 2: Create a test file.

The function I want to test is in a file at
resources/js/ext.translate.parsers.js.
I created a corresponding test file
tests/qunit/ext.translate.parsers.test.js.

Step 3: Register the test file.

In Translate, all the ResourceLoader modules are defined in Resources.php. At the bottom of the file I register the test module via the ResourceLoaderTestModules hook with an anonymous function.

$wgHooks['ResourceLoaderTestModules'][] =
	// Dependencies must be arrays here
	function ( array &$modules ) use ( $resourcePaths ) {
		$modules['qunit']['ext.translate.parsers.test'] = array(
			'scripts' => array( 'tests/qunit/ext.translate.parsers.test.js' ),
			'dependencies' => array( 'ext.translate.parsers' ),
		) + $resourcePaths;

		return true;
	};

I have already defined $resourcePaths earlier:

$resourcePaths = array(
	'localBasePath' => __DIR__,
	'remoteExtPath' => 'Translate'
);

Step 4: Write the tests

Here is a simple example with only one test. Note how assert is taken as a function parameter to avoid using global functions.

/**
 * Tests for ext.translate.parsers.js.
 *
 * @file
 * @licence GPL-2.0+
 */

( function ( $, mw ) {
	'use strict';

	QUnit.module( 'ext.translate.parsers', QUnit.newMwEnvironment() );

	QUnit.test( '-- External links', 1, function ( assert ) {
		assert.strictEqual(
			'This page is [in English]',
			mw.translate.formatMessageGently( 'This page is [in English]' ),
			'Brackets without protocol doesn\'t make a link'
		);
	} );

}( jQuery, mediaWiki ) );

Step 5: Run the tests

I ran the tests in my development wiki and they passed. The patch set is in Gerrit. Also see the QUnit page on mediawiki.org.

On course to machine translation

It has been a busy spring: I have yet to blog about the Translate UX and Universal Language Selector projects, which have been my main efforts.
But now for something different. In this field you can never stop learning, so I was very pleased when my boss let me participate in a week-long course where Francis Tyers and Tommi Pirinen taught how to do machine translation with Apertium. A report of the course follows.

From translation memory to machine translation

Before going into the details of the course, I want to share my thoughts on how the different translation memory and machine translation techniques we use to help translators relate to each other. The three techniques are:

  • Crude translation memory: for example the TTMServer of Translate
  • Statistical machine translation: for example Google Translate or Microsoft Translator
  • Rule-based machine translation: for example Apertium

In the figure below, I have used two properties to compare them.

  • On the x-axis is the amount of information that is extracted from the stored data. Here the stored data is usually a corpus of aligned* translations in two or more languages.
  • On the y-axis is the amount of external knowledge used by the system. This knowledge usually consists of dictionaries, rules about how words inflect and rules about grammar – or even rules about how to split text into sentences and words.

* Aligned means that the system knows which parts of the text correspond to each other in the translations. Alignment can be at paragraph level, sentence level or even smaller parts of the text.

Translation memory and machine translation comparison

A very crude implementation just stores an existing translation and can retrieve it if the very same text is translated again.

TTMServer is a little more sophisticated: it splits the translation into paragraph-sized chunks, and it can retrieve the existing translation even if the new text does not match the old text exactly. This system uses only a little information about the data. Even if all the words exist in it, translated as part of different units (strings), the system still cannot provide any kind of translation. Internally, TTMServer uses some external knowledge on how to split up text into words, in order to speed up translation retrieval.

Statistical machine translation at its simplest is just a translation memory which extracts more information from the stored translation data. It gathers a huge database about which words usually occur as translations of the words in the source language. Usually it also stores the context, so that in the sentence “walking along the river bank” the term “bank” is not interpreted as a building. The most sophisticated systems can also include knowledge about inflection and grammar to filter out invalid interpretations, or even to fix grammatically incorrect forms.

On the right-hand side of the figure we have rule-based machine translation systems like Apertium. These systems mainly rely on language-dependent information supplied by the maker of the system: bilingual dictionaries, inflection rules and syntax rules are needed for them to function. Unlike the preceding ones, such systems are always language specific. Creating a machine translation system needs a linguist for each language in the system.
Still, even these systems can benefit from statistical methods. While they do not store the translation data itself, such data can be analysed and used as input to find the correct way to read ambiguous sentences, or the most common translation of a word in a given context among several alternatives.

The ultimate solution for machine translation is most likely a combination of rules and information extracted from huge translation corpora.

The course

To create a machine translation system with Apertium, you need to choose a source and a target language. I built a system to translate from Kven to Finnish. Kven is very close to Finnish, so it was quite easy to do even though I do not know much Kven. Each student was provided with skeleton files and a story in the source language, which had also been translated into the target language by a human translator.

We started by adding words to the lexicon in order of frequency. The lexicon defines the part of speech and the inflection paradigms of the words. The paradigms are used to analyze word forms, and also for generation when translating in the opposite direction. Then we added phonological rules. For example, Finnish has vowel harmony: because of that, many word endings (cases) have two forms, depending on the word – for example koirassa (in the dog), but hiiressä (in the mouse).

As a third step, we created a bilingual dictionary in a form that is suitable for machines (read: XML). At this point we started seeing some words in the target language. Of course we also had to add the lexicon for the target language, if nobody else had done that already.

Finally, we started adding rules.
We added rules to disambiguate sentences with multiple readings. For example, in the sentence “The door is open” we added a rule saying that open is an adjective rather than a verb, because the sentence already has a verb.
We added rules to convert the grammar. For example, Finnish cases are usually replaced with prepositions in English. We might also need to add words: “sataa” needs an explicit subject in English, “it rains”.

At the end we compared the translation produced by our system with the translation made by the human translator. We briefly considered two ways to evaluate the quality of the translation.
First, we can use something like edit distance over words (instead of characters) to count how many insertions, deletions or substitutions are needed to change the machine translation into the human translation. Alternatively, we can count how many words the human translator needs to change when copy-editing the machine translation.
Machine translation systems start to be useful when you only need to fix about one word out of every six or more in the translation.
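As an illustration of the first idea (my own sketch, not code from the course), a word-level Levenshtein distance counts exactly those insertions, deletions and substitutions:

// Edit distance computed over words instead of characters, using the usual
// dynamic programming recurrence with two rolling rows.
function wordEditDistance( $machine, $human ) {
	$a = preg_split( '/\s+/', trim( $machine ) );
	$b = preg_split( '/\s+/', trim( $human ) );

	$prev = range( 0, count( $b ) );
	foreach ( $a as $i => $wordA ) {
		$curr = array( $i + 1 );
		foreach ( $b as $j => $wordB ) {
			$cost = ( $wordA === $wordB ) ? 0 : 1;
			$curr[] = min(
				$prev[$j + 1] + 1, // delete a word
				$curr[$j] + 1,     // insert a word
				$prev[$j] + $cost  // substitute a word
			);
		}
		$prev = $curr;
	}

	return $prev[count( $b )];
}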

The future

A little while ago Erik asked how the Wikimedia Foundation could support machine translation, which is now mostly in the hands of big commercial entities (though the European Union is also building something) and needs an open source alternative.

We do not have a lot of translation corpora, unlike Google. We do have lots of text in different languages, but it is not the same content in all languages and it is not aligned. The exceptions are translatewiki.net and other places where translations are done with the Translate extension. As a side note, I think translatewiki.net hosts one of the most multilingual parallel translation corpora under a free license.

Given that we have lots of people in the Wikimedia movement who are multilingual and interested in languages, I think we should cooperate with an existing open source machine translation system (like Apertium) in a way that allows our users to enhance that system. Doing more translations increases the data stored in a translation memory making it more useful. In a similar fashion, doing more translations with machine translation system should make it better.

Apertium has already been in use on the Nynorsk Wikipedia. Bokmål and Nynorsk are closely related languages: the kind of situation where Apertium excels.

One thing I have been thinking about is that, now that the Wikimedia Language Engineering team is planning to build tools to help translate Wikipedia articles into other languages, we could integrate them closely with Apertium. We could provide an easy way for translators to add missing words and report unintelligible sentences.

I don’t expect most of our translators to actually write and correct rules, so someone should manage that on Apertium side. But at least word collection could be mostly automated; I bet someone has tried and will try to use Wiktionary data too.

As a first step, the Wikimedia Foundation could set up its own Apertium instance as a web service for our needs (the existing instances are too unstable). The Translate extension, for example, can query such a web service to provide translation suggestions.

FOSDEM talk reflections 3/3: HipHop, communities, public procurement

Nikerabbit arrives at the MediaWiki meetup at FOSDEM

Meetup of MediaWiki community. Or Wikimedia tech? How to call the Wikimedia software development ecosystem? (Photo by henna, copyright status unknown.)

This is the third post about FOSDEM 2013; see 1/3: I18n in the WEB, Mozilla i18n and L20n for the first and 2/3: docs, code and community health, stability for the second. Links to the abstracts in the headers.

Scaling PHP with HipHop

HipHop is still alive, and faster than ever. It has evolved from a PHP-to-C++ translator into a bytecode interpreter with a JIT, much like PHP itself is a bytecode interpreter, only without the JIT. The speedups they are seeing are impressive (it was deployed on Facebook about a month ago). Given that they have removed the compile-everything-before-deploy step, it is now much more feasible to use.

I’m considering giving HipHop a try on translatewiki.net later this year, probably after we have upgraded to at least Ubuntu 12.04, for which Facebook provides packages. It is still as much of a pain in the ass to set up manually as it was a few years ago. The Wikimedia Foundation (WMF) has dropped its evaluation, but perhaps they will reconsider after our experiences; HipHop, or hhvm as it is called now, has indeed changed a lot since then.

It was highlighted that the supported language features and libraries of hhvm and PHP differ to some extent. hhvm provides some nice features like strict type hinting, but it is unlikely I can use those anytime soon, since there is no way to take advantage of them without breaking support for normal PHP, which is something that really cannot be done in the MediaWiki ecosystem.

Community/BOF meetup

Almost 20 people were around, a few of them from outside WMF. Discussions circled around events like the Amsterdam Hackathon and the MediaWiki groups. The most interesting part (to me) was how to refer to the Wikimedia software development ecosystem, so that it can be marketed properly. Suggestions ranged from extending the meaning of MediaWiki to cover everything including mobile, gadgets and so on; to using Wikipedia, as it is the best-known brand; to creating a new Wikitech brand.

There are pros and cons to each of the above, but one thing is true: there is currently no name that refers to everything technical done around MediaWiki and Wikimedia and that would also be understood by potential participants. Also, MediaWiki development is not perceived as cool anymore, because it’s PHP. But it isn’t just PHP: MediaWiki development is also Redis, Varnish, Puppet, git, Solr, HipHop, semantic, node.js, mobile, OpenStack, and more. Quim Gil will continue working on coming up with a brand. Curious visitors can also compare this to what KDE did recently when it expanded the meaning of KDE to cover not only the desktop, but also the community and everything they do. The change process wasn’t painless for them, and it wouldn’t be for us, but at the same time (IMHO) the change has been quite successful and beneficial to KDE.

Qt Project Update

Qt booth at FOSDEM

Qt: a maintainer for each subsystem helps getting your patch reviewed, unlike MediaWiki

Qt is doing well. Qt5 is evolution instead of revolution (which is what Qt4 was to Qt3). Contributors outside of Nokia (and nowadays Digia) have risen to about one third of all commits. They are using Gerrit like MediaWiki does, but unlike MediaWiki they have an explicit hierarchy, with maintainers who are responsible for keeping each subsystem in shape. It also means that there is always at least one person you can talk to to get your patch reviewed, unlike in the MediaWiki community.

Their platform support is also nice: Linux, Windows and OS X are fully supported, while iOS, Android and BlackBerry OS also work more or less.

Fixing public procurement

Forgive me if I use incorrect terms. In a nutshell, there is a law in Finland (coming from the EU) that forbids governmental organizations from procuring software systems by referring to an exact producer. For example, a hypothetical “We want Microsoft Office on all our workstations in department X” is illegal, while “We want an office suite that handles documents, presentations, …” is legal. Free software people in Finland did an analysis of how many times this law has been violated… quite many times… and have been sending letters asking the organizations to read the rules and fix their procurement.

The talk continued with the observation that there is no entity to enforce this rule, and that it is difficult to get the companies that were put at a disadvantage to sue. One side argued that suing over this particular issue harms the wider effort of educating people about this and other issues. When is it useful to sue instead of trying to educate? When are the lost opportunities a bigger harm than the bad publicity and the money spent on suing? Apparently Microsoft has sued successfully in Finland, gaining lots of money without a big PR hit. Open source solutions are usually discarded because the exit costs of the previous system are attributed to the new solution instead of to the old vendor lock-in solution.

All in all, the goal of this kind of work is to further open source use in governments by allowing free competition; they want to do this EU-wide.

The Keeper of Secrets: The Dance of Community Leadership

FOSDEM party crowd

FOSDEM preparty was definitely not a quiet beerless one

This is the first time I’ve seen Leslie Hawthorn speak. From her talk, which was full of beer jokes (a few too many for my taste), I caught some points:

  • Don’t be a jerk.
  • Stop gossiping and talk directly to people you have problems with.
  • Don’t ignore difficult people, be brave enough to let them know your honest opinion.
  • Don’t be a jerk while talking about difficult things with people, do it politely and cooperatively.
  • Face-to-face meetings are essential to community building (my addition: also include the possibility to do it in beerless, quiet places).

FOSDEM talk reflections 2/3: docs, code and community health, stability

This is the second post about FOSDEM 2013; see 1/3: I18n in the WEB, Mozilla i18n and L20n for the first. Links to the abstracts in the headers.

Open Sourcing Documentation

Don’t keep documentation to yourself; release it with an open license. Others might see the forest for the trees when you are too close to the problem. Translators can translate the documentation, but this needs proper tools, something which was not mentioned in the presentation.

Also mentioned in the presentation were webplatform.org (greetings sent to Ryan Lane, who helped build it) and the problem that Mozilla has to support both webplatform.org and its own MDN, where the latter has a wider scope and a more restrictive license than the former.

Coping with the proliferation of tools within your community

FOSDEM entrance

Entering the MediaWiki community and contributing to MediaWiki is hard for many reasons

I got the impression that XWiki is the everything-and-the-kitchen-sink of wikis. It has several nice points related to having everything in one place (one wiki):

  • Only one place to have account and one place to sign in.
  • Can search everything like bugs, commits, IRC logs and documentation at once.

In my opinion it’s a nice starting point for projects, but in the end it’s the quality of the individual tools that matters when projects grow. I don’t see how MediaWiki could replace Gerrit or Bugzilla with something provided by XWiki.

XWiki is among the candidates to take over from MediaWiki if MediaWiki is not able to revitalize its community by improving the extension and gadget development ecosystem. XWiki being written in Java can be a deterrent for some compared to PHP (but the opposite is certainly true, too).

How we made the Jenkins community

It’s not enough to lower the barriers; the barriers must be removed wherever possible. While Wikipedia is (or was) quite open and easy to use (not going too deep into that), MediaWiki development had and still has many barriers. Not long ago, getting commit access took ages and you had to present your CV. Nowadays commit access is easier to get, because we use Gerrit to review code before it is merged. But Gerrit is quite a complex beast, and we still don’t have GitHub integration to accept drive-by patches. The lack of documentation and the difficulty of getting patches reviewed are also problems, not to mention the lack of a shared vision of where MediaWiki should go.

Another problem that I think is not widely recognized is that MediaWiki code, while not as bad as it was, is in my humble opinion not improving fast enough. MediaWiki core is a huge monolithic piece of code, entangled in so many places that it is impossible to extend without first refactoring the affected code. This greatly hampers innovation and the creation of new extensions, and thus is a problem for the MediaWiki ecosystem. The core code needs to become cleaner and more modular. The modularity and quality of the Jenkins APIs seemed to be a major reason for the applaudable growth of its community.

Wish: can we have an integrated search that provides results from mediawiki.org, Bugzilla, IRC logs, mailing lists and other relevant places, and not just a service on Labs known to a lucky few? There is also lots of stuff in etherpads, Google documents and even blogs that will probably never be searchable.

Improving Stability of Mozilla Products

Mozilla booth at FOSDEM

Something we lack compared to Mozilla is the crash metrics (note the diversity at their booth)

This talk made me wish for information on what actual problems MediaWiki users and developers are facing. Unfortunately, due to the nature of PHP, fatal errors and warnings can be caused by so many things, including syntax errors users have made in their LocalSettings.php, that the data would need lots of filtering. There are also privacy concerns, but it should be possible to do something by collecting exceptions and JavaScript errors and aggregating them for a selected few to see. This would help to prioritize bugs, as Bugzilla and IRC channels are bad indicators of how many people are actually encountering a particular issue.

I’d like to be optimistic, but the best thing we have so far is translatewiki.net, which collects PHP errors, warnings and exceptions as well as JavaScript errors. However, aside from the PHP warnings and notices that are announced on IRC in #mediawiki-i18n, that information is only available to me and a few others, and after all translatewiki.net is just a minor piece of the total system. One of my personal wishes is that more priority be given to the issues affecting translatewiki.net, as experience teaches us that issues discovered on translatewiki.net will most often also surface as issues on Wikimedia wikis. Mozilla can state things like “this bug affects 10% of all active users” or “a hundred million users every day”. Can we please have that, too?

It would be so awesome if the Wikimedia Foundation could release similar kinds of information to the wider public.

Shorts

IonMonkey: Yet Another JIT Compiler for JavaScript?

Why is yet another JIT compiler needed? Because the new one is better! Sometimes it makes sense. IonMonkey is based on well-understood techniques like static single assignment.

WebRTC: Real time web communication

The open source technology stack to kill Skype and Google Hangouts is in progress, but it will still take a while to reach a browser near you. Check out http://reveal.rs.af.cm/ with a recent version of Google Chrome for a preview of what can also be accomplished with WebRTC.

PDF.js – Firefox’s HTML5 PDF Viewer

This is something that the Wikimedia Foundation could perhaps use too, though limited browser support is an issue.

The presentation had some nice tidbits on how the PDF standard includes everything and the kitchen sink, such as a 3D model viewer. It also explained what is easy to port to HTML5 (many things can be done with the Canvas specification) and what is not. The HTML5 specifications are going to see additions due to this work.

Changesets evolution with Mercurial

A cool idea: tracking changes to the commit history itself, so you can alter the history without deleting any part of it. It is questionable, though, whether this will see widespread use, and for now it is only in (or coming to) Mercurial, not Git.

FOSDEM talk reflections 1/3: I18n in the WEB, Mozilla i18n and L20n

FOSDEM 2013 t-shirt

FOSDEM 2013 was attended by several Wikimedians.

Now that I’ve slept on the presentations I attended at FOSDEM, it’s a good time to think about what I heard and how it relates to what I am doing. It is also a good time to do so before I forget what I heard. I didn’t get to talk to that many people this year, as I was mostly running from one talk to another.

This series will have three parts. I will start with the i18n-related topics and then cover the other presentations roughly in the order I saw them (headers link to the abstracts). There will also be a follow-up post on the Gettext format, detailing its good and bad sides from today’s point of view. Stay tuned!

An Integrated Localization Environment

Mozilla keeps pushing new i18n stuff, though the general feeling from this and other related talks is that they either have not defined what issue they are fixing, or they have defined it in a way that is completely different from what we are working on.

While we are trying to make translating as easy as possible for translators (in a technical sense; they already have enough complexity to deal with in the language itself), the ILE proposed in this talk is essentially an IDE (integrated development environment): a glorified text editor of the kind programmers use for programming. It has features like syntax highlighting via colors and automatic completion for the translation file syntax.

But do translators really care about the particular syntax of a translation file, or are they in fact happier if they do not need to care about files and version control systems at all, while at the same time having access to aids like translation memories and change tracking in an interface created by UX designers, as we have at translatewiki.net?

“It helps to see the messages above and below to understand the context”
You can see the related messages close to each other in almost any translation tool, even though showing related messages next to each other is not a replacement for proper documentation of context for each message.

“I don’t see how form based translation tools would cope with more complex localisation file formats like L20n”
I don’t think the solution to facilitating proper localisation is to turn localisation itself into programming. The cases where more complex logic is needed are actually relatively few, and I think it is worthwhile to keep the common case as simple as possible while also supporting the more complex cases in a standardized, data-driven way, for example by using the CLDR.

L20n

Mozilla presenters at FOSDEM


Mozilla keeps pushing new i18n stuff: who is the user they are designing new tools for?

This talk was an update to the similar presentation on L20n last year. What I said on the previous post about turning localisation into programming applies here too.

It is nice that you can specify grammatical gender for things, but this format does not really solve the problem that many variables actually come from user input, for which we cannot specify this information.

It is nice that you can make custom plural rules, but in almost all cases the standard set of plural rules that comes from standards like CLDR is enough.

It is nice that you can mix gender and plural and even several plurals in one message using nested hashes (arrays in PHP), but it is not nice at all that you have to translate the message N*M*O times as the number of variables increases. I firmly believe that an inline syntax like {{GENDER:$1|he|she}} eats {{PLURAL:$2|apple|apples}} is superior in this regard.

If we strip the plural, gender, time formatting etc. support from L20n, we are left with just a complex file format for storing things, something we already have many variants of. The aforementioned features are usually provided by the i18n library (or definitely should be; unfortunately this is not always the case), so what they have done is actually move the complexity of language from i18n libraries and software developers to translators. Aiming at “keep the common case simple, but support complex cases where needed”, I don’t think this, as presented, is a good trade-off between simplicity and flexibility.

webL10n: client-side i18n / l10n library

This talk was about adapting some of the nice parts of L20n to the .properties format. The result is somewhat more complex than plain .properties but not as flexible as L20n. Even having gender and plural in the same message is problematic in this format.

I’d like to highlight two ideas in webL10n. Side note: why call it l10n when it is actually an i18n library for developers, similar to jquery.i18n?

The first idea is that you can have html like this:

<div data-l10n-id="retro">
<div>Please <a href="login/">log in</a></div>
</div>

And the translators see this:

retro = <div>Please <a>log in</a></div>

The translation, when displayed, is properly merged into the original HTML so that the classes and link targets are preserved. I don’t know what happens if the translation is outdated and the structure has changed, but I guess we just should not use outdated translations with this system. When escaping is handled properly, this is a very nice way to handle what we call lego messages, where the text of the link is in a separate message because, due to escaping, we can’t have the link and the link text in the same message.

Another idea is that if you have HTML like this:

<input type="search" placeholder="Search messages" title="Message search box">

You can turn it into this:

<input type="search" data-l10n-id="searchbox">

And translators will see this (using the .properties format here):

searchbox.placeholder=Search messages
searchbox.title=Message search box

This simplifies the HTML the developers need to write.

Finally, take a look also at Pau’s Design talks at FOSDEM 2013.

New language stuff for developers and users

MLEB 2013.01 has been released by Amir Aharoni. Lots of development has been happening in Translate due to the work on the new translation interfaces. If you are a developer, please also check out the latest new and changed web APIs and give us a shout in #mediawiki-i18n on freenode if you see something obviously wrong or missing. Also included in this release are bug fixes for the Universal Language Selector, while the other included extensions didn’t see many changes.

Some months ago I wrote about language tag validation in MediaWiki. A nice person named Siebrand Mazeland decided to improve the situation. As of now we have three new methods, developed by the Wikimedia Language Engineering team:

  • isSupportedLanguage
  • isWellFormedLanguageTag
  • isKnownLanguageTag

Unless these methods are backported to MediaWiki 1.19 and MediaWiki 1.20, it will take a while before they are used in extensions, but after a while we should see faster and more readable code.
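A rough illustration of how these could be used (the semantics are paraphrased from memory here, so check the actual documentation):

$code = 'fi';

// Is the tag syntactically well formed (roughly per BCP 47)?
$wellFormed = Language::isWellFormedLanguageTag( $code );

// Does MediaWiki know this tag, i.e. can it give the language a name?
$known = Language::isKnownLanguageTag( $code );

// Is this a language MediaWiki actually supports, for example as an interface language?
$supported = Language::isSupportedLanguage( $code );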

How I debug performance issues in MediaWiki

The earlier post does not describe how I usually approach performance improvements. Usually it starts with me debugging one of the less innocent-looking messages from our IRC bot rakkaus, which relays PHP error messages to the IRC channel. An example:

[01-Nov-2012 20:16:25 UTC] PHP Fatal error: Maximum execution time of 30 seconds exceeded in /www/translatewiki.net/w/extensions/Translate/ttmserver/TTMServer.php on line 100

After this I have to use the timestamp to find the request in our web server access log and see whether I can reproduce the issue by loading the same URL. PHP is very unhelpful in this regard: fatal errors give neither the request URL nor a stack trace. Sometimes the culprit is a command line script, like the job runner initiated via cron. For those cases I’ve implemented simple logging of all maintenance script executions, but they are still annoying to debug. Once I am able to reproduce the issue in the production environment, I try to reproduce it in my development environment as well. Oh boy, it is fun when that is not possible. If I can, however, I will usually start by looking at the per-request profiling included in the page source, with output like this:

0.0558 8.5M Connected to database 0 at localhost
0.0562 8.5M Query sandwiki (14) (slave): SELECT /* SqlBagOStuff::getMulti Nike */ keyname,value,exptime FROM `bw_objectcache` WHERE keyname = 'sw:messages:fi'

Here we see that it takes 56 milliseconds before MediaWiki even connects to the database, and that the first thing it does is load the messages for the current user language. What usually follows is old-style debugging, where I add echo and var_dump statements until I understand what is happening and what is inefficient. After that the creative phase begins: finding a way to make it faster. Usually there is some sort of bug in the code that causes it to do unnecessary work. Rarely is the bad performance actually caused by slow algorithms. This kind of makes sense: the datasets we are processing are usually small, and when they are bigger, the code is usually written in an efficient way in the first place.

I love performance tuning, but I have to be prudent in picking the right things to optimize, because it is also a great time sink, and as a busy person I can afford only a few time sinks at a time.