Archive for the ‘vapaasuomi’ Category

MediaWiki short urls with nginx and main page without redirect

Monday, August 31st, 2015

This post was updated on 2015-09-06 with simplified code suggested by Krinkle, and again on 2017-04-04.

Google PageSpeed Insights writes:

Redirects trigger an additional HTTP request-response cycle and delay page rendering. In the best case, each redirect will add a single roundtrip (HTTP request-response), and in the worst it may result in multiple additional roundtrips to perform the DNS lookup, TCP handshake, and TLS negotiation in addition to the additional HTTP request-response cycle. As a result, you should minimize use of redirects to improve site performance.

Let’s consider the situation where you run MediaWiki as the main thing on your domain. When a user goes to your domain example.com, MediaWiki by default will issue a redirect to example.com/wiki/Main_Page, assuming you have configured the recommended short URLs.

In addition, the short URL page writes:

Note that we do not recommend doing a HTTP redirect to your wiki path or main page directly. As redirecting to the main page directly will hard-code variable parts of your wiki’s page setup into your server config. And redirecting to the wiki path will result in two redirects. Simply rewrite the root path to MediaWiki and it will take care of the 301 redirect to the main page itself.

So are we stuck with a suboptimal solution? Fortunately, there is a way, and it is not even that complicated. I will share example snippets from the translatewiki.net configuration that show how to do it.

Configuring nginx

For nginx, the only thing we need in addition to the default wiki short URL rewrite is to rewrite / so that it is forwarded to MediaWiki. The configuration below assumes MediaWiki is installed in the w directory under the document root.

location ~ ^/wiki/ {
	rewrite ^ /w/index.php;
}

location = / {
	rewrite ^ /w/index.php;
}
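
To make the routing easier to reason about, here is a toy model of these two location blocks in Python (an illustration only; nginx matching is simplified, and this is no substitute for testing the real config):

```python
import re

def route(path):
    """Toy model of the two nginx location blocks above."""
    if re.match(r'^/wiki/', path):  # location ~ ^/wiki/
        return '/w/index.php'
    if path == '/':                 # location = /
        return '/w/index.php'
    return path                     # anything else is served as-is

# Article views and the bare root both reach index.php directly,
# with no redirect in between:
print(route('/wiki/Main_Page'))  # → /w/index.php
print(route('/'))                # → /w/index.php
print(route('/w/load.php'))      # → /w/load.php
```

In the real setup, MediaWiki works out the requested page from the request URI; the rewrite only chooses the entry point.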

Whole file for the curious.

Configuring MediaWiki

First, in our LocalSettings.php we have the short url configuration:

$wgArticlePath      = "/wiki/$1";
$wgScriptPath       = "/w";

In addition, we use hooks to tell MediaWiki to make / the URL for the main page instead of redirecting it:

$wgHooks['GetLocalURL'][] = function ( $title, &$url, $query ) {
	if ( !$title->isExternal() && $query === '' && $title->isMainPage() ) {
		$url = '/';
	}
};

// Tell MediaWiki that "/" should not be redirected
$wgHooks['TestCanonicalRedirect'][] = function ( $request ) {
	return $request->getRequestURL() !== '/';
};

This has the added benefit that all MediaWiki-generated links to the main page point to the domain root, so you only have one canonical URL for the wiki main page. The check on $query in the hook forces links that carry query parameters (for example ?action=edit) to use the long URL form, because otherwise they would not work correctly with this nginx rewrite rule.

And that’s it. With these changes you can have your main page displayed on your domain without a redirect, while also keeping the URL short for users to copy and share. This method should work for most versions of MediaWiki, including MediaWiki 1.26, which forcefully redirects everything that doesn’t match the canonical URL as seen by MediaWiki.

translatewiki.net – harder, better, faster, stronger

Monday, August 3rd, 2015

I am very pleased to announce that translatewiki.net has been migrated to new servers sponsored by netcup GmbH. Yes, that is right, we now have two servers, both of which are more powerful than the old server.

Because the two (virtual) servers are located in the same data center, among other nitty-gritty details, we are not making them redundant for the sake of load balancing or uptime. Rather, we have split the services: ElasticSearch runs on one server, powering search, translation search and translation memory; everything else runs on the other server.

In addition to faster servers and continuous performance tweaks, we are now faster thanks to the migration from PHP to HHVM. The Wikimedia Foundation did this a while ago with great results, but HHVM had been crashing and freezing on translatewiki.net for unknown reasons. Fortunately, I recently found a lead indicating that the issue is related to the ini_set function, which I was easily able to work around while the investigation into the root cause continues.

Non-free Google Analytics confirms that we now serve pages faster: the small speech bubble indicates the day of migration to the new servers and HHVM. The effect on the actual page load times observed by users seems to be less significant.

We now again have lots of room for growth, and I challenge everyone to make us grow with more translations, new projects or other legitimate means, so that we reach a point where we need to upgrade again ;). That’s all for now, stay tuned for more updates.

14 more languages “fully” translated this week

Saturday, May 9th, 2015

This week, MediaWiki’s priority messages have been fully translated into 14 more languages by about a dozen translators, after we checked our progress. Most users in those languages now see the interface of Wikimedia wikis entirely translated.

In the two months since we updated the list of priority translations, the number of languages with 99+ % of them translated went from 17 to 60. No encouragement was even needed: those 60 languages are “organically” active, and translators quickly rushed to use the new tool we gave them. Such regular and committed translators deserve a ton of gratitude!

However, we want to do better. We did something simple: we told MediaWiki users that they can make a difference, even if they don’t know it. «With a couple hours’ work or less, you can make sure that nearly all visitors see the wiki interface fully translated.» The results we got in a few hours speak for themselves:

Special:TranslationStats graph of daily registrations

This week’s peak of new translator daily registrations was ten times the usual

Special:TranslationStats of daily active translators

Many were eager to help: translation activity jumped immediately

Thanks especially to CERminator, David1010, EileenSanda, KartikMistry, Njardarlogar, Pymouss, Ranveig, Servien, StanProg, Sudo77(new), TomášPolonec and Чаховіч Уладзіслаў, who completed priority messages in their languages.

For the curious, the steps to solicit activity were:

There is a long tail of users who see talk page messages only after weeks or months, so for most of those 60 languages we hope to get more translations later. It will be harder to reach the other hundreds of languages, for which there are only 300 active users in Wikimedia according to interface language preferences: about 100 incubating languages do not have a single known speaker on any wiki!

We will need a lot of creativity and word spreading, but the lesson is simple: show people the difference that their contribution can make for free knowledge, and the response will be great. Also, do try to reach the long tail of users and languages: if you do it well, you can communicate effectively to a large audience of silent and seemingly unresponsive users on hundreds of Wikimedia projects.

IWCLUL 3/3: conversations and ideas

Monday, March 9th, 2015

In the IWCLUL talks, Miikka Silfverberg’s mention of collecting words from Wikipedia resonated with my earlier experiences working with Wikipedia dumps, especially the difficulty of it. I talked with some people at the conference and everyone seemed to agree that processing Wikipedia dumps takes a lot of time, which they could spend on something else. I am considering publishing plain text Wikipedia dumps and word frequency lists. While working in the DigiSami project, I familiarized myself with the utilities as well as the Wikimedia Tool Labs, so relatively little effort would be needed. The research value would be low, but it would be worth it if enough people find these dumps and save time. A recent update is that Parsoid is planning to provide a plain text format, so this is likely to become even easier in the future. Still, there might be some work to do to collect pages into one archive and to decide which parts of a page will stay and which will be removed: for example, converting an infobox to a collection of isolated words is not useful for use cases such as WikiTalk, and it can also easily skew word frequencies.

I talked with Sjur Moshagen about keyboards for less resourced languages. Nowadays they have keyboards for Android and iOS, in addition to keyboards for computers (which already existed). They have some impressive additional features, like automatically adding missing accents to typed words. That would be too complicated to implement in jquery.ime, a project used by Wikimedia that implements keyboards in a browser. At least the aforementioned example uses a finite state transducer. Running finite state tools in the browser does not yet feel realistic, even though some solutions exist*. The alternative of making requests to a remote service would slow down typing, except perhaps with some very clever implementation, which would probably be fragile at best. I still have to investigate whether there is some middle ground to bring the basic keyboard implementations to jquery.ime.

*Such as jsfst. One issue is that the implementations and the transducers themselves can take a lot of space, which means we would run into the same issues as when distributing large web fonts at Wikipedia.

I spoke with Tommi Pirinen and Antti Kanner about implementing a dictionary application programming interface (API) for the Bank of Finnish Terminology in Arts and Sciences (BFT). That would allow direct use of BFT resources in translation tools like translatewiki.net and Wikimedia’s Content Translation project. It would also help indirectly, by using a dump for extending word lists in the Apertium machine translation software.

I spoke briefly about language identification with Tommi Jauhiainen, who had a poster presentation about the project “The Finno-Ugric languages and the internet”. I had implemented a language detector myself, using an existing library. Curiously enough, many other people I have met in Wikimedia circles have also made their own implementations. Mine had severe problems classifying languages which are very close to each other. Tommi gave me a link to another language detector, which I would like to test in the future to compare its performance with previous attempts. We also talked about something I call “continuous” language identification, where the detector detects parts of running text which are in a different language. A normal language detector will be useful for my open source translation memory service project, called InTense. Continuous language identification could be used to post-process Wikipedia articles and tag foreign text so that correct fonts are applied, and possibly also in WikiTalk-like applications, to give the text-to-speech (TTS) engine a hint on how to pronounce those words.
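
As an aside, the simplest detectors of this kind fit in a few lines of code. Here is a toy character-trigram scorer in Python (purely illustrative, not one of the tools mentioned above; real detectors use much larger profiles and proper smoothing, and even they struggle with closely related languages):

```python
from collections import Counter

def trigrams(text):
    """Count overlapping character trigrams, with padding spaces."""
    text = ' ' + text.lower() + ' '
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

# Tiny training profiles; real systems use megabytes of text per language.
profiles = {
    'english': trigrams('this is an example of english text for the profile'),
    'finnish': trigrams('tämä on esimerkki suomenkielisestä tekstistä'),
}

def detect(text):
    grams = trigrams(text)
    # Score each language by the overlap between trigram counts.
    scores = {
        lang: sum(min(n, profile[g]) for g, n in grams.items())
        for lang, profile in profiles.items()
    }
    return max(scores, key=scores.get)

print(detect('an example text'))  # → english
```

A “continuous” variant would slide a window over running text and emit a label per window instead of one label for the whole input.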

Reasonator entry for Kimmo Koskenniemi

Reasonator is software that generates visually pleasing summary pages in natural language and structured sections, based on structured data. More specifically, it uses Wikidata, the Wikimedia structured data project, developed by Wikimedia Germany. Reasonator works primarily for persons, though other types of subjects are being developed. Its localisation is limited compared to the about three hundred languages of MediaWiki. Translating software which generates natural language sentences dynamically is very different from the usual software translation, which consists mostly of fixed strings with an occasional placeholder that is replaced dynamically when showing the text to a user.

It is not a new idea to use Grammatical Framework (GF), a language translation software based on an interlingua, for Reasonator. In fact I had proposed this earlier in private discussions with Gerard Meijssen, but this conference renewed my interest in the idea, as I attended the GF workshop held by Aarne Ranta, Inari Listenmaa and Francis Tyers. GF seems to be a good fit here, as it allows limited-context and limited-vocabulary translation to many languages simultaneously; vice versa, Wikidata will contain information like the gender of people, which can be fed to GF to get proper grammar in the generated translations. It would be very interesting to have a prototype of Reasonator-like software using GF as the backend. The downside of GF is that (I assume) it is not easy for our regular translators to work with, so work is needed to make it easier and more accessible. The hypothesis is that with a GF backend we would get better language support (as in grammatically correct and flexible) with less effort in the long run. That would mean providing access to all the Wikidata topics even in smaller languages, without the effort of manually writing articles.

IWCLUL 2/3: morphology, OCR, a corpus vs. Wiktionary

Monday, February 23rd, 2015

More on IWCLUL: now on to the sessions. The first session of the day was by the invited speaker Kimmo Koskenniemi. He is applying his two-level formalism in a new area, old literary Finnish (example of old literary Finnish). By using two-level rules for old written Finnish together with OMorFi, he is able to automatically convert old text to standard Finnish dictionary forms, which can be used, in the main example, as input to a search engine. He uses weighted transducers to rank the most likely equivalent modern-day words. For example, the contemporary spelling of wijsautta is viisautta, which is an inflected form of the noun viisaus (wisdom). He only takes the dictionary forms, because otherwise there are too many unrelated suggestions. This avoids the usual problem of too many unrelated morphological analyses: I had the same problem in my master’s thesis when I attempted to use OMorFi to improve Wikimedia’s search system, which was still using Lucene at that time.

Jeremy Bradley gave a presentation about an online Mari corpus. Their goal was to make a modern English-language textbook for Mari, for people who do not have access to native speakers. I was happy to see they used a free/copyleft Creative Commons license. I asked him whether they had considered Wiktionary. He told me he had discussed it with a person from Wiktionary who was against an import. I will be reaching out to my contacts to see whether another attempt will succeed. The automatic transliteration between Latin, Cyrillic and IPA was nice, as I have been entertaining the idea of doing transliteration from Swedish to Finnish for WikiTalk, to make it able to function in Swedish as well using only Finnish speech components. One point sticks with me: they had to add information about verb complements themselves, as it was not recorded in their sources. I can sympathize with them based on my own language learning experiences.

Stig-Arne Grönroos’ presentation on Low-resource active learning of North Sámi morphological segmentation did not contain any surprises for me after having been exposed to this topic previously. All efforts to support languages where we have to cope with limited resources are welcome and needed. Intermediate results are better than working with nothing while waiting for a full morphological analyser, for example. It is not completely obvious to me how this tool can be used in other language technology applications, so I will be happy to see an example.

Miikka Silfverberg presented on OCR using OMorFi: can morphological analyzers improve the quality of optical character recognition? To summarize heavily, OCR performed worse when OMorFi was used, compared to just taking the top N most common words from Wikipedia. I understood this is not exactly the same problem as the large number of readings generated by a morphological analyser, but something different yet related.

Prioritizing MediaWiki’s translation strings

Thursday, February 19th, 2015

After a very long wait, MediaWiki’s top 500 most important messages are back at translatewiki.net with fresh data. This list helps translators prioritize their work to get the most out of their effort.

What are the most important messages

In this blog post the term message means a translatable string in a piece of software; technically, when a message is shown to users, they see different strings depending on the interface language.

The MediaWiki software includes almost 5.000 messages (~40.000 words), or almost 24.000 messages (~177.000 words) if we include extensions. Since 2007, we have maintained a list of the about 500 messages which are used most frequently.

Why? Translators can translate a few hundred words per hour, and translating messages is probably slower than translating running text, so it would take weeks to translate everything. Most of our volunteer translators do not have that much time.

Assuming that the messages follow a long tail pattern, a small number of messages are shown* to users very often, like the Edit button at the top of page in MediaWiki. On the other hand, most messages are only shown on rare error conditions or are part of disabled or restricted features. Thus it makes sense to translate the most visible messages first.
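
The effect of the long tail is easy to illustrate. Assume, purely for illustration, that message view counts roughly follow a Zipf distribution (a made-up model with round numbers, not measured data):

```python
# Toy Zipf model: the k-th most shown message is viewed about 1/k
# as often as the most shown one.
N = 24_000   # roughly the message count of MediaWiki plus extensions
TOP = 600    # size of the "most important" list

weights = [1 / k for k in range(1, N + 1)]
coverage = sum(weights[:TOP]) / sum(weights)
print(f'{coverage:.0%} of message views covered by the top {TOP} messages')
```

Under this crude model, translating under 3 % of the messages already covers roughly two thirds of what users see; the real distribution of views is likely even more skewed.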

Concretely, translators and i18n fans can monitor the progress of MediaWiki localisation easily, by finding meaningful numbers in our statistics page; and we have a clear minimum service level for new locales added to MediaWiki. In particular, the Wikimedia Language committee requires that at the very least all the most important messages are translated in a language before that language is given a Wikimedia project subdomain. This gives an incentive to kickstart the localisation in new languages, ensures that users see Wikimedia projects mostly in their own language and avoids linguistic colonialism.

Screenshot of fi.wiktionary.org

The screenshot shows an example page with messages replaced by their key instead of their string content. Click for full size.

Some history and statistics

The usage of the list for monitoring was fantastically impactful in 2007 and 2009, when translatewiki.net was still ramping up, because it gave translators concrete goals and it allowed us to streamline the language proposal mechanism, which had been trapped in a dilemma between a growing number of requests for language subdomains and a growing number of seemingly dead open subdomains. There is some more background on translatewiki.net.

Languages with over 99 % of the most used messages translated were:

There is much more to do, but we now have a functional tool to motivate translators! To reach the peak of 2011, the least translated language among the first 181 will have to translate 233 messages, which is a feasible task. The 300th language is 30 % translated and needs 404 more translations. If we reached such a number, we could confidently say that we really have Wikimedia projects in 280+ languages, however small.

* Not necessarily seen: I’m sure you don’t read the whole sidebar and footer every time you load a page in Wikipedia.

Process

At Wikimedia, first, for about 30 minutes we logged all requests that fetched certain messages by their key. We used this as a proxy variable for how often a particular message is shown to the user, which in turn is a proxy for how often a particular message is seen by the user. This is in no way an exact measurement, but I believe it is good enough for the purpose. After the 30 minutes, we counted how many times each key was requested and sorted by frequency. The result was a list containing about 17.000 different keys observed in over 15 million calls. This concluded the first phase.

In the second phase, we applied a rigorous human cleanup to the list with the help of a script, as follows:

  1. We removed all keys not belonging to MediaWiki or any extension. There are lots of keys which can be customized locally, but which don’t correspond to messages to translate.
  2. We removed all messages which were tagged as “ignored” in our system. These messages are not available for translation, usually because they have no linguistic content or are used only for local site-specific customization.
  3. We removed messages called fewer than 100 times in the time span, as well as other messages with no meaningful linguistic content, like messages consisting only of dashes or other punctuation which usually don’t need any changes in translation.
  4. We removed any messages we judged to be technical or not shown often to humans, even though they appeared high in this list. This includes some messages which are only seen inside comments in the generated HTML and some messages related to APIs or EXIF features.
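
The counting and cleanup phases can be sketched as follows (hypothetical key names and a scaled-down threshold; the real scripts worked on ~15 million logged calls):

```python
from collections import Counter

# Phase 1: count how often each message key was requested in the log.
log = ['edit', 'edit', 'history', 'tooltip-ca-edit', 'word-separator',
       'edit', 'sitename-local-hack', 'history']
counts = Counter(log)

known_keys = {'edit', 'history', 'tooltip-ca-edit', 'word-separator'}
ignored = {'word-separator'}  # tagged "ignored": no linguistic content
MIN_CALLS = 2                 # stand-in for the real 100-call cutoff

# Phase 2: keep only real, translatable, frequently shown messages.
# (Step 4, dropping technical messages, needed human judgment and
# is not modelled here.)
ranked = [
    (key, n) for key, n in counts.most_common()
    if key in known_keys      # step 1: belongs to MediaWiki/an extension
    and key not in ignored    # step 2: not tagged "ignored"
    and n >= MIN_CALLS        # step 3: called often enough
]
print(ranked)  # → [('edit', 3), ('history', 2)]
```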

Finally, some coding work was needed by yours truly to let users select those messages for translation at translatewiki.net.

Discoveries

In this process some points emerged that are worth highlighting.

  • 310 messages (62 %) of the previous list (from 2011) are in the new list as well. Superseded user login messages have now been removed.
  • Unsurprisingly, there are new entries from new highly visible extensions like MobileFrontend, Translate, Collection and Echo. However, except in a dozen languages, translators didn’t manage to keep up with such messages in the absence of a list.
  • I just realized that we are probably missing some high-visibility messages used only on the JavaScript side. That is something we should address in the future.
  • We slightly expanded the list from 500 to 600 messages, after noticing there were few or no “important” messages beyond this point. This also leaves some breathing space for when listed messages are removed from the software.
  • We did not follow a manual passage as in the original list, which included «messages that are not that often used, but important for a proper look and feel for all users: create account, sign on, page history, delete page, move page, protect page, watchlist». A message like “watchlist” got removed, which may raise suspicions: but it’s “just” the HTML title of Special:Watchlist, more or less as important as the name “Special:Watchlist” itself, which is not included in the list either (magic words, namespace names and special page names are not included). All in all, the list seems plausible.

Conclusion

Finally, the aim was to make this process reproducible so that we can repeat it yearly, or even more often. I hope this blog post serves as documentation to achieve that.

I want to thank Ori Livneh for getting the key counts and Nemo for curating the list.

IWCLUL event report 1/3: the story

Monday, February 16th, 2015

IWCLUL is short for International Workshop on Computational Linguistics for Uralic Languages. I attended the conference, held on January 16th 2015, and presented a joint paper with Antti on Multilingual Semantic MediaWiki for Finno-Ugric dictionaries at the poster session.

I attentively observe the glimmering city lights of Tromsø as the plane lands in darkness, orienting myself to the maps I studied on my computer before the trip. At the airport I receive a kind welcome from Trond, in Finnish, together with a group of other people going to the conference. While he is driving us to our hotels, Trond elaborates on the sights of the island we pass by. Antti, the co-author of our paper about Multilingual Semantic MediaWiki, and I check in to the hotel and joke about the tendency to forget posters in different places.

The next morning I meet Stig-Arne at breakfast. We decide to go see the local cable car. We wander around the city center until we finally find a place that sells bus tickets. We had asked a few people, but they gave conflicting directions. We take the bus and then Fjellheisen, the cable car, to the top. The sights are wonderful even in winter. I head back and do some walking in the center. I buy some postcards and use that as an excuse to get inside and warm up.

On Friday, the conference day, almost by miracle, we arrive at the conference venue without too many issues, despite seeing no signs on the University of Tromsø campus. More information about the conference itself will be provided in the following parts. And the poster? We forgot to take it with us from the social event after the conference.

GNU i18n for high priority projects list

Sunday, February 1st, 2015

Today, for a special occasion, I’m hosting this guest post by Federico Leva, dealing with some frequent topics of my blog.

A special GNU committee has invited everyone to comment on the selection of high priority free software projects (thanks M.L. for spreading the word).

In my limited understanding, from looking at it every now and then over the past few years, the list so far has focused on “flagship” projects which are perceived to be the biggest opportunities, or roadblocks to remove, for the goal of having people use only free/libre/open source software.

A “positive” item is one which makes people want to embrace GNU/Linux and free software in order to use it: «I want to use Octave because it’s more efficient». A “negative” item is an obstacle to free software adoption, which we want removed: «I can’t use GNU/Linux because I need AutoCAD for work».

We want to propose something different: a cross-functional project, which will benefit no specific piece of software, but rather all of them. We believe that the key to success for each and every free software project is going to be internationalization and localization. No i18n can help if the product is bad: here we assume that the idea of the product is sound and that we are able to scale its development, but we “just” need more users, more work.

What we believe

If free software is about giving control to the user, we believe it must also be about giving control of the language to its speakers. Proper localisation of software can only be done by people with a particular interest and competence in it, ideally language natives who use the software.

It’s clear that there is little overlap between this group and developers; if nothing else, because most free software projects have at most a handful of developers: all together, they can only know a fraction of the world’s languages. Translation is not, and can’t be, a subset of programming. A GNOME dataset showed a strong specialisation of documenters, coders and i18n contributors.

We believe that the only way to put them in control is to translate the wiki way: easily, the only requirement being language competency; with no or very low barriers on access; using translations immediately in the software; correcting after the fact thanks to their usage, not with pre-publishing gatekeeping.

Translation should not be a labyrinth

In most projects, the i18n process is hard to join and incomprehensible, if explained at all. GNOME has a nice description of their workflow, which however is a perfect example of what the wiki way is not.

A logical consequence of the wiki way is that not all translators will know the software like the back of their hand. Hence, to translate correctly, translators need message documentation straight in their translation interface (context, possible values of parameters, grammatical role of words, …): we consider this a non-negotiable feature of any system chosen. Various research agrees.

Ok, but why care?

I18n is a recipe for success

First. Developers and experienced users are often affected by the software localisation paradox, which means they only use software in English and will never care about l10n even if they are in the best position to help it. At this point, they are doomed; but the computer users of the future, e.g. students, are not. New users may start using free software simply because of not knowing English and/or because it’s gratis and used by their school; then they will keep using it.

With words we don’t like much, we could say: if we conquer some currently marginal markets, e.g. people under a certain age or in several countries, we can then have sufficient critical mass to expand to the main market of a product.

Research is very lacking on this aspect: there has been quite some research on predicting the viability of FLOSS projects, but almost nothing on their i18n/l10n, even less on predicting their success compared to proprietary competitors, and nothing at all on the two combined. However, an analysis of SourceForge data from 2009 showed a strong correlation between high SourceForge rank and having translators (table 5): for successful software, translation is the “most important” work after coding and project management, together with documentation and testing.

Second. Even though translation must not be like programming, translation is a way to introduce more people in the contributor base of each piece of software. Eventually, if they become more involved, translators will get in touch with the developers and/or the code, and potentially contribute there as well. In addition to this practical advantage, there’s also a political one: having one or two orders of magnitude more contributors of free software, worldwide, gives our ideas and actions a much stronger base.

Practically speaking, every package should be i18n-ready from the beginning (the investment pays back immediately), and its “Tools”/“Help” menu, or a similarly visible interface element, should include a link to a website where everyone can join its translation. If the user’s locale is not available, the software should actively encourage joining the translation effort.

Arjona Reina et al. 2013, based on the observation of 41 free software projects and 22 translation tools, actually claim that recruiting, informing and rewarding the translators is the most important factor for the success of l10n, or even the only really important one.

Exton, Wasala et al. also suggest receiving in situ translations in a “crowdsourcing” or “micro-crowdsourcing” limbo, which we find superseded by a wiki. In fact, they end up requiring a “reviewing mechanism such as observed in the Wikipedia community” anyway, in addition to a voting system. Better to keep it simple and use a wiki in the first place.

Third. Extensive language support can be a clear demonstration of the power of free software. Unicode CLDR is an effort we share with companies like Microsoft or Apple, yet no proprietary software in the world can support 350 languages like MediaWiki. We should be able to say this of free software in general, and have the motives to use free software include i18n/l10n.

Research agrees that free software is more favourable for multilingualism because compared to proprietary software translation is more efficient, autonomous and web-based (Flórez & Alcina, 2011; citing Mas 2003, Bowker et al. 2008).

The obstacle here is linguistic colonialism, namely the self-disrespect billions of humans have for their own language. Language rights are often neglected and «some languages dominate» the web (UNO report A/HRC/22/49, §84); but many don’t even try to use their own language even where they could. The solution can’t be exclusively technical.

Fourth. Quality. Proprietary software we see in the wild has terrible translations (for example Google, Facebook, Twitter). They usually use very complex i18n systems or they give up on quality and use vote-based statistical approximation of quality; but the results are generally bad. A striking example is Android, which is “open source” but whose translation is closed as in all Google software, with terrible results.

How do we reach quality? There can’t be an authoritative source for the best translation of every single software string: the wiki way is the only way to reach the best quality, by gradual approximation, collaboratively. Free software can be more efficient and has a great advantage here.

Indeed, quality of available free software tools for translation is not a weakness compared to proprietary tools, according to the same Flórez & Alcina, 2011: «Although many agencies and clients require translators to use specific proprietary tools, free programmes make it possible to achieve similar results».

We are not there yet

Many tend to think they have “solved” i18n. The internet is full of companies selling i18n/l10n services as if they had found the panacea. The reality is that most software is not localised at all, or is localised in very few languages, or has terrible translations. Explaining the reasons is not the purpose of this post; we have discussed, or will discuss, the details elsewhere. Some perspectives:

A 2000 survey confirms that education about i18n is most needed: «There is a curious “localisation paradox”: while customising content for multiple linguistic and cultural market conditions is a valuable business strategy, localisation is not as widely practised as one would expect. One reason for this is lack of understanding of both the value and the procedures for localisation.»

Can we win this battle?

We believe it’s possible. The above can look too abstract, but it’s intentionally so. Figuring out the solution is not something we can do in this document, because making i18n our general strength is a difficult project: that’s why we argue it needs to be on the high priority projects list.

The initial phase will probably be one of research and understanding. As shown above, we have opinions everywhere, but too little scientific evidence on what really works: this must change. Where evidence is available, it should be better known than it currently is: a lot of education on i18n is needed. Sharing and producing knowledge also implies discussion, which helps the next step.

The second phase could come with a medium-term concrete goal: for instance, it could be decided that within a couple of years at least a certain percentage of GNU software projects should (also) offer a modern, web-based, computer-assisted translation tool with low barriers to access etc., compatible with the principles above. The requirements will be shaped by the first phase (including the need to accommodate existing workflows, of course).

This would probably require setting up a new translation platform (or giving new life to an existing one), because the current big platforms are either insufficiently maintained (Pootle and Launchpad) or proprietary. Hopefully, this platform would embrace the perspectives and needs of projects well beyond GNU, and much more un-i18n’d free software would gravitate to it as well.

A third (or fourth) phase would be about exploring the uncharted territory with which we currently share so little: the formats, methods and CAT tools that exist out there for the translation of proprietary software and of things other than software. The whole translation world (millions of translators?) deserves free software. For this, a much broader alliance will be needed, probably with university courses and others, like the authors of Free/Open-Source Software for the Translation Classroom: A Catalogue of Available Tools and tuxtrans.

“What are you doing?”

Fair question. This proposal is not all talk. We are doing our best, with the tools we know. One of the challenges, as Wasala et al. say, is having a shared translation memory to make free software translation more efficient: so we are building one. InTense is our new showcase of free software l10n; it uses existing translations to offer an open translation memory to everyone, and we believe we can eventually include practically all the free software in the world.

For now, we have added a few dozen GNU projects and others, with 55 thousand strings and about 400 thousand translations. See also the translation interface for some examples.

If translatewiki.net is asked to do its part, we are certainly available. MediaWiki has the potential to scale incredibly, after all: see Wikipedia. In the future, a wiki like InTense could be switched from read-only to read/write and become an über-translatewiki.net, translating thousands of projects.

But that’s not necessarily what we’re advocating for: what matters is the result, namely how much more well-localised software we get. In fact, MediaWiki gave birth to thousands of wikis, and its success also lies in its principles being adopted by others: see e.g. the huge StackExchange family (whose Q&A sites are wikis and use a free license, though they are more individual-centred).

Maybe the solution will come with hundreds or thousands of separate installs of one or a handful of software platforms. Maybe the solution will not be to “translate the wiki way”, but a similar yet different concept, one which still puts localisation in the hands of users, giving them real freedom.

What do you think? Tell us in the comments.

Oregano deployment tool

Friday, January 2nd, 2015

This blog post introduces oregano, a non-complex, non-distributed, non-realtime deployment tool. It currently consists of less than 100 lines of shell script and is licensed under the MIT license.

The problem. For a very long time, we have run translatewiki.net straight from a git clone, or an svn checkout before that. For years, we have been the one wiki that systematically ran the latest master, with only a few hours of delay. That was not a problem while we were young and wild. But nowadays, because we carry dozens of local patches and because of the introduction of Composer, it is quite likely that git pull --rebase will stop at a merge conflict. As a consequence, updates have become less frequent, but have semi-regularly brought the site down for many minutes until the merge conflicts were manually resolved. This had to change.

The solution. I wrote a simple tool, probably re-inventing the wheel for the hundredth time, which separates the current deployment in two stages: preparation and pushing out new code. Since I have been learning a lot about Salt and its quirks, I named my tool “oregano”.

How it works. Basically, oregano is a simple wrapper around symbolic links and rsync. The idea is that you prepare your code in a directory named workdir. To deploy the current state of workdir, you first create a read-only copy of it by running oregano tag. After that, you can run oregano deploy, which will update the symbolic links so that your web server sees the new code. You can give a tag name to both commands, but by default oregano will name a new tag after the current timestamp and deploy the most recently created tag. If, after deploying, you find out that the new tag is broken, you can quickly go back to the previously deployed code by running oregano rollback. Below this is shown as a command line tutorial.

mkdir /srv/mediawiki/ # the path does not matter, pick whatever you want

cd /srv/mediawiki

# Get MediaWiki. Everything we want to deploy must be inside workdir
git clone https://github.com/wikimedia/mediawiki workdir

oregano tag
oregano deploy

# Now we can use /srv/mediawiki/targets/deployment where we want to deploy
ln -s /srv/mediawiki/targets/deployment /www/example.com/docroot/mediawiki

# To update and deploy a new version
cd workdir
git pull
# You can run maintenance scripts, change configuration etc. here
nano LocalSettings.php

cd .. # Must be in the directory where workdir is located
oregano tag
oregano deploy

# Whoops, we accidentally introduced a syntax error in LocalSettings.php
oregano rollback
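
The core mechanics behind these commands can be sketched in a few lines of portable shell. This is a rough illustration of the idea only, not oregano’s actual code; plain cp stands in here for the real tool’s rsync invocation, and the function and tag names are made up for the example:

```shell
#!/bin/sh
# Sketch of the oregano idea: a "tag" is a timestamped copy of workdir,
# and "deployment" is just a symlink pointing at one of the tags.
set -e

tag() {
    # Snapshot workdir into targets/<name>; default name is a timestamp.
    name=${1:-$(date +%Y%m%d-%H%M%S)}
    mkdir -p targets
    cp -R workdir "targets/$name"   # the real tool uses rsync --cvs-exclude
}

latest_tag() {
    ls -1 targets | grep -v '^deployment$' | sort | tail -n 1
}

deploy() {
    # Re-point the symlink the web server follows; -n replaces the old link.
    ln -sfn "${1:-$(latest_tag)}" targets/deployment
}

rollback() {
    # Re-point the symlink at the second newest tag.
    prev=$(ls -1 targets | grep -v '^deployment$' | sort | tail -n 2 | head -n 1)
    ln -sfn "$prev" targets/deployment
}
```

The real oregano adds tag retention and safety checks on top of this skeleton; the point is only that switching and rolling back code is a single symlink update.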

As you can see from the above, it is still possible to break the site if you don’t check what you are deploying. For this purpose I might add support for hooks, so that one could run syntax checks whose failure would prevent deploying that code. Hooks would also be handy for sending IRC notifications, which is something our existing scripts do when code is updated: as pushing out code is now a separate step, they currently fire at the wrong point.

By default oregano keeps the 4 newest tags, so make sure you have enough disk space. For translatewiki.net, which has MediaWiki and dozens of extensions, each tag takes about 200M. If you stored the MediaWiki localisation cache, pre-generated for all languages, inside workdir, you would need 1.2G for each tag. Currently, at translatewiki.net, we store the localisation cache outside workdir, which means it is out of sync with the code. We will see if that causes any issues, and will move it inside workdir if needed. Do note that oregano creates tags with rsync --cvs-exclude to save space. That also has the caveat that you should not name any files or directories core, because rsync’s default CVS exclude list skips those. Be warned; patches welcome.

The code is in the translatewiki repo but, if there is interest, I can move it to a separate repository on GitHub. Oregano is currently used on translatewiki.net and in a pet project of mine nicknamed InTense. If things go well, expect to hear more about this mysterious pet project in the future.

Midsummer cleanup: YAML and file formats, HHVM, translation memory

Thursday, August 21st, 2014

Wikimania 2014 is now over and that is a good excuse to write updates about the MediaWiki Translate extension and translatewiki.net.
I’ll start with an update related to our YAML format support, which has always been a bit shaky. Translate supports different libraries (we call them drivers) to parse and generate YAML files. Over time the Translate extension has supported four different drivers:

  • spyc uses spyc, a pure PHP library bundled with the Translate extension,
  • syck uses libsyck which is a C library (hard to find any details) which we call by shelling out to Perl,
  • syck-pecl uses libsyck via a PHP extension,
  • phpyaml uses the libyaml C library via a PHP extension.

The latest change is that I dropped syck-pecl, because it does not seem to compile with PHP 5.5 anymore, and I added phpyaml. We tried using spyc for a bit, but the output it produced for localisation files was not compatible with Ruby projects: after complaints, I had to find an alternative solution.

Joel Sahleen let me know about phpyaml, which I had somehow not found before: thanks to him we now use the same libyaml library that Ruby projects use, so we should be fully compatible. It is also the fastest driver of the four. Anyone generating YAML files with Translate is highly recommended to use the phpyaml driver. I have not checked how phpyaml works with HHVM, but I was told that HHVM ships with a built-in yaml extension.
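
For reference, the driver is selected in your wiki’s LocalSettings.php; a minimal fragment looks like this (assuming the setting name has not changed in your version of Translate):

```php
<?php
// In LocalSettings.php, after loading the Translate extension:
// pick the YAML driver. 'phpyaml' requires the yaml PECL extension
// (libyaml bindings); 'spyc' and 'syck' are the other options.
$wgTranslateYamlLibrary = 'phpyaml';
```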

Speaking of HHVM, the long-standing bug which causes HHVM to stop processing requests is still unsolved, but I was able to contribute some information upstream. In further testing we also discovered that emails sent via the MediaWiki JobQueue were not delivered, so there is some issue in command line mode. I have not yet had time to investigate this, so HHVM is currently disabled both for web requests and on the command line.

I have a couple of refactoring projects for Translate going on. The first is about simplifying the StringMangler interface. This has no user-visible changes, but the end goal is to make the code more testable and reduce coupling. For example, the file format handler classes only need to know their own keys, not how those are converted to MediaWiki titles. The other refactoring I have just started is splitting the current MessageCollection. Currently it manages a set of messages, handles message data loading and filters the collection. This might also bring performance improvements: we can be more intelligent and load only the data we need.

Théo Mancheron competes in the men's decathlon pole vault final

Aiming high: creating a translation memory that works for Wikipedia, even though we are a long way from there (photo Marie-Lan Nguyen, CC BY 3.0)

Finally, at Wikimania I had a chance to talk about the future of our translation memory with Nik Everett and David Chan. In the short term, Nik is working on implementing in ElasticSearch an algorithm that sorts all search results by edit distance. This should bring translation memory performance on par with the old Solr implementation. After that is done, we can finally retire Solr at the Wikimedia Foundation, which is much wanted, especially as there are signs that Solr is having problems.

Together with David, I laid out some plans for going beyond simply comparing entire paragraphs by edit distance. One of his suggestions is to try computing edit distance over words instead of characters. When dealing with the 300 or so languages of Wikimedia, what constitutes a word is less obvious than what constitutes a character (and even that is quite complicated), but I am planning to do some research in this area, keeping the needs of the content translation extension in mind.
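
To make the idea concrete, here is a plain Python sketch of my own, not the planned ElasticSearch implementation: the standard Levenshtein dynamic programming works unchanged on any sequence, so feeding it a token list instead of a string gives word-level distance.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    # prev holds the DP row for the prefix of a processed so far.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

s1 = "the quick brown fox"
s2 = "the quick red fox"
print(edit_distance(s1, s2))                   # 4: character edits in "brown" -> "red"
print(edit_distance(s1.split(), s2.split()))   # 1: a single word substitution
```

Word-level distance compresses a multi-character change into one edit, which is often closer to how a translator perceives the difference between two segments.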
