Tag Archives: language engineering team

Midsummer cleanup: YAML and file formats, HHVM, translation memory

Wikimania 2014 is now over and that is a good excuse to write updates about the MediaWiki Translate extension and translatewiki.net.
I’ll start with an update related to our YAML format support, which has always been a bit shaky. Translate supports different libraries (we call them drivers) to parse and generate YAML files. Over time the Translate extension has supported four different drivers:

  • spyc uses spyc, a pure PHP library bundled with the Translate extension,
  • syck uses libsyck, a C library (hard to find any details about it), which we call by shelling out to Perl,
  • syck-pecl uses libsyck via a PHP extension,
  • phpyaml uses the libyaml C library via a PHP extension.

The latest change is that I dropped syck-pecl, because it does not seem to compile with PHP 5.5 anymore, and added phpyaml. We tried to use spyc for a while, but the output it produced for localisation files was not compatible with Ruby projects: after complaints, I had to find an alternative solution.

Joel Sahleen let me know of phpyaml, which I somehow had not found before: thanks to him, we now use the same libyaml library that Ruby projects use, so we should be fully compatible. It is also the fastest driver of the four. I highly recommend the phpyaml driver to anyone generating YAML files with Translate. I have not checked how phpyaml works with HHVM, but I was told that HHVM ships with a built-in yaml extension.
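The driver is selected in the wiki configuration. A minimal LocalSettings.php sketch – the setting name $wgTranslateYamlLibrary is how I recall it, so double-check the Translate documentation for your version:

// In LocalSettings.php, after loading the Translate extension.
if ( extension_loaded( 'yaml' ) ) {
	// Fast, and compatible with Ruby projects, since both use libyaml.
	$wgTranslateYamlLibrary = 'phpyaml';
} else {
	// Fall back to the pure PHP driver bundled with Translate.
	$wgTranslateYamlLibrary = 'spyc';
}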

Speaking of HHVM, the long-standing bug which causes HHVM to stop processing requests is still unsolved, but I was able to contribute some information upstream. In further testing we also discovered that emails sent via the MediaWiki JobQueue were not delivered, so there is some issue in command line mode. I have not yet had time to investigate this, so HHVM is currently disabled both for web requests and on the command line.

I have a couple of refactoring projects for Translate going on. The first is about simplifying the StringMangler interface. This has no user-visible changes, but the end goal is to make the code more testable and to reduce coupling. For example, the file format handler classes only need to know their own keys, not how those are converted to MediaWiki titles. The other refactoring, which I have just started, is to split the current MessageCollection. Currently it manages a set of messages, handles the loading of message data and filters the collection. This might also bring performance improvements: we can be more intelligent and only load the data we need.

Théo Mancheron competes in the men's decathlon pole vault final

Aiming high: creating a translation memory that works for Wikipedia, even though that is still a long way from here (photo Marie-Lan Nguyen, CC BY 3.0)

Finally, at Wikimania I had a chance to talk about the future of our translation memory with Nik Everett and David Chan. In the short term, Nik is working on implementing in ElasticSearch an algorithm to sort all search results by edit distance. This should bring translation memory performance on par with the old Solr implementation. After that is done, we can finally retire Solr at the Wikimedia Foundation, which is much desired, especially as there are signs that Solr is having problems.

Together with David, I laid out some plans on how to go beyond simply comparing entire paragraphs by edit distance. One of his suggestions is to try doing edit distance over words instead of characters. When dealing with the 300 or so languages of Wikimedia, what is a word is less obvious than what is a character (even that is quite complicated), but I am planning to do some research in this area keeping the needs of the content translation extension in mind.
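To make the idea concrete, here is a minimal sketch in plain PHP (not the actual ElasticSearch implementation): the classic Levenshtein algorithm run over whitespace-separated tokens, which of course sidesteps the hard word segmentation problem mentioned above.

// Edit distance counted over words instead of characters.
// Splitting on whitespace is a naive stand-in for a real word segmenter.
function wordEditDistance( $a, $b ) {
	$x = preg_split( '/\s+/u', trim( $a ), -1, PREG_SPLIT_NO_EMPTY );
	$y = preg_split( '/\s+/u', trim( $b ), -1, PREG_SPLIT_NO_EMPTY );

	$prev = range( 0, count( $y ) );
	foreach ( $x as $i => $wx ) {
		$curr = [ $i + 1 ];
		foreach ( $y as $j => $wy ) {
			$cost = ( $wx === $wy ) ? 0 : 1;
			$curr[] = min(
				$prev[$j + 1] + 1, // deletion
				$curr[$j] + 1,     // insertion
				$prev[$j] + $cost  // substitution
			);
		}
		$prev = $curr;
	}
	return end( $prev );
}

echo wordEditDistance( 'the door is open', 'the door was open' ); // 1: one substitution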

Summary of Translate workshop at Zürich hackathon

Hostel hall with hackers

The hall always provided power and wifi for eager hackers (photo CC-BY-SA by Ludovic Péron)

I held a Translate workshop at the Zürich hackathon. Naturally, others and I worked on Translate and translatewiki.net outside of the workshop as well. Here is a summary of the outcomes.

The workshop itself consisted of three topics of interest. I gave an introduction about the Content translation project, going over the basic design and features, followed by a Q&A. We then split into three small groups. One group continued talking about translating content in a wider scope. The second group went over how to add new projects to translatewiki.net, using Huggle and Sharelatex as concrete examples. In the third group, I helped with programming questions about the Translate extension.

During the whole hackathon people worked on about 20 bugs and patches. I started a patch for glossary support in the Translate extension: a proof of concept, as simple as possible.

You can write a paper about that

“You can write a paper” is kind of a running joke in the language engineering team for when the discussion strays so far from the original topic that it is no longer helping to get the work done. But sometimes sidelines turn out to be interesting and fruitful. When I was presented with an opportunity to do a PhD related to wikis, languages and translation, I could not pass it up. And because of the joke, I can claim full innocence – they told me to! ;)

The results are in and… I got accepted! Screams with joy and then quickly shies away hoping nobody noticed.

What does this mean?


The doctoral hat is the ultimate goal, right?

For readers of this blog, this might mean that the topics get even more incomprehensible. Or the posts might become even more insightful, based on research instead of gut feelings. Hopefully it doesn’t mean that I won’t have time to write blog posts anymore.

Practically, I will be starting at the beginning of January with the goal of writing a PhD dissertation and of graduating in about four years. The proposed topic for my dissertation is Supporting creation and interaction of open content with language technology, as part of the project “Finno-Ugric Digital Natives: Linguistic support for Finno-Ugric digital communities in generating online content”. As with my MA, I’ll do this at the University of Helsinki.

Initially I will be working on it three days a week, while continuing to help the language engineering team as well. We’ll see how it goes.

The first thing I will do is participate in IWSDS (Workshop on Spoken Dialog Systems), held in January in Napa, California, USA. I will be presenting a paper about multilingual WikiTalk.

Performance is a feature

In case you haven’t already noticed, I like working on performance issues and performance improvements. Performance is an area where you have to consider the whole stack: the speed of the server, efficient algorithms, server side caching, bandwidth and latency, client side caching and client side code. Here is a short recap of what has been done for translatewiki.net lately and some ideas for the future.

Recent improvements

Flame chart visualization

Chrome 29 (and later releases) has added a helpful visualization for profiling data. In this image the speed of the ULS JavaScript code is evaluated on a font-heavy page. Compared to the collapsible tabs feature, it is doing okay.

Server level. A month ago translatewiki.net got a new server with more memory and faster processors. The main benefit is that we can handle more simultaneous users and background tasks without them slowing each other down. At the same time, we upgraded many of the programs to newer versions. The switch from MySQL to MariaDB is the most important one. We haven’t tested it for our use case, but the Wikimedia Foundation found that the switch had an overall positive impact on performance.

Web server level. At the beginning of November I configured our nginx web server to enable support for the SPDY protocol. This should greatly reduce latency when browsing over HTTPS. We are considering switching to HTTPS by default. While tweaking nginx, I also fixed a few settings that relate to the compression and expiry times of JavaScript, SVG images and font assets when delivered to users. I used AWStats to see if our daily bandwidth usage decreased. It has not decreased significantly, but there is a lot of variation between days, which makes interpreting the data difficult. I used PageSpeed to ensure that the caching headers are optimal and WebPagetest to confirm that pages load faster on different browsers in different places.

Application level. The Language Engineering team has recently worked a lot on the performance of Universal Language Selector (ULS) and Translate extensions. A short summary of the things which were done:

  • Reduce the amount of JavaScript and CSS delivered to the browser.
  • Delay the loading of JavaScript and CSS as much as possible (for example till the user opens ULS).
  • Optimize JPG, SVG and PNG images to the last byte with tools like jpegoptim and optipng.
  • Optimize the JavaScript to avoid slow actions (for example repaint events and DOM changes). We used Chrome’s JavaScript profiler as well as the experimental tool “show potential scroll bottlenecks” to identify issues and confirm the fixes (thanks Ori).

In addition, I fixed a major performance issue in one of the Translate API modules by replacing an inefficient algorithm with a faster one. While investigating that issue, I also noticed that ReplacementArray-strtr was taking 20% or so of MediaWiki run time. There is a lesser-known PHP module, FastStringSearch, which was not installed on the new server. Installing that module made a big difference on the MediaWiki profiling table: ReplacementArray-fss is now taking only about 0.20% of MediaWiki run time.
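For the curious, the difference roughly looks like this in code. This is a hedged sketch: the fss_prep_replace()/fss_exec_replace() names are how I remember the FastStringSearch API, and MediaWiki’s ReplacementArray wraps this logic with an automatic strtr() fallback.

$text = 'Welcome to {{SITENAME}}&nbsp;– happy translating!';
$replacements = [
	'{{SITENAME}}' => 'translatewiki.net',
	'&nbsp;' => ' ',
];

if ( function_exists( 'fss_prep_replace' ) ) {
	// Compile the replacement table once, then apply it to any number of strings.
	$fss = fss_prep_replace( $replacements );
	$result = fss_exec_replace( $fss, $text );
} else {
	// Plain PHP fallback, noticeably slower with large replacement tables.
	$result = strtr( $text, $replacements );
}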

Finally, a thing called module local storage was enabled on Wikimedia wikis a few days ago (the title of this post was taken from that discussion). As is usual for translatewiki.net, we were already beta testing that feature a few weeks before it went live on Wikimedia wikis.

Future plans

It is hard to plan further performance improvements far in advance, as the bottlenecks and the places where you can make the most difference for the least effort change constantly, together with the technology and your content. I believe that HHVM, a JIT PHP virtual machine, is likely to be the next step which will make a significant difference. It is, however, not straightforward to jump from a normal PHP interpreter to HHVM, so I will be keeping a close eye on how my colleagues at the Wikimedia Foundation are progressing with the adoption of HHVM.

Another relatively small thing on the horizon is better compression of inline SVG images in CSS style sheets, by avoiding unnecessary base64 encoding. Or something else might come up even before that.

Finally, I’d like to highlight that while the application-level improvements automatically benefit third party users, there really isn’t any coherent documentation on how to improve the performance of a MediaWiki site at all levels. Configuring the localisation cache, nginx and/or Varnish, tweaking MySQL or MariaDB and installing Memcached or Redis should be within any capable sysadmin’s skills; but how to tailor them for MediaWiki, let alone which PHP modules to install, is likely not widely known. For example, I wouldn’t be surprised if there were very few or even no sites using the FastStringSearch module outside of Wikimedia and translatewiki.net.

On course to machine translation

It has been a busy spring: I have yet to blog about the Translate UX and Universal Language Selector projects, which have been my main efforts.
But now something different. In this field you can never stop learning, so I was very pleased when my boss let me participate in a week-long course where Francis Tyers and Tommi Pirinen taught how to do machine translation with Apertium. A report of the course follows.

From translation memory to machine translation

Before going into the details of the course, I want to share my thoughts about the relation between the different translation memory and machine translation techniques we are using to help translators. The three techniques are:

  • Crude translation memory: for example the TTMServer of Translate
  • Statistical machine translation: for example Google Translate or Microsoft Translator
  • Rule-based machine translation: for example Apertium

In the figure below, I have used two properties to compare them.

  • On the x-axis is the amount of information that is extracted from the stored data. Here the stored data is usually a corpus of aligned* translations in two or more languages.
  • On the y-axis is the amount of external knowledge used by the system. This knowledge usually consists of dictionaries, rules about how words inflect and rules about grammar – or even about how to split text into sentences and words.

* Aligned means that the system knows which parts of the text correspond to each other in the translations. Alignment can be at paragraph level, sentence level or even smaller parts of the text.

Translation memory and machine translation comparison

A very crude implementation just stores an existing translation and can retrieve it if the very same text is translated again.

TTMServer is a little more sophisticated: it splits the translation into paragraph-sized chunks, and it can retrieve an existing translation even if the new text does not match the old text exactly. This system uses only a little of the information in the data: even if all the words of a new text have already been translated as parts of different units (strings), the system still cannot provide any kind of translation for it. Internally, TTMServer uses some external knowledge on how to split text into words, in order to speed up translation retrieval.

Statistical machine translation at its simplest is just a translation memory which extracts more information from the stored translation data. It gathers a huge database about which words usually occur as translations of the words in the source language. Usually it also stores the context, so that in the sentence “walking along the river bank” the term “bank” is not interpreted as a building. The most sophisticated systems can also include knowledge about inflection and grammar to filter out invalid interpretations, or even to fix grammatically incorrect forms.

On the right hand side of the figure we have rule-based machine translation systems like Apertium. These systems rely mainly on language-dependent information supplied by the maker of the system: bilingual dictionaries, inflection rules and syntax rules are needed for them to function. Unlike the preceding ones, such systems are always language-specific. Creating a machine translation system needs a linguist for each language in it.
Still, even these systems can benefit from statistical methods. While they do not store translation data itself, such data can be analysed and used as input to find the correct way to read ambiguous sentences, or the most common translation of a word in the given context among some alternatives.

The ultimate solution for machine translation is most likely a combination of rules and information extracted from a huge translation corpus.

The course

To create a machine translation system with Apertium, you need to choose a source and a target language. I built a system to translate from Kven to Finnish. Kven is very close to Finnish, so it was quite easy to do even though I do not know much Kven. Each student was provided with skeleton files and a story in the source language, which had also been translated into the target language by a human translator.

We started by adding words to the lexicon in order of frequency. The lexicon defines the part of speech and the inflection paradigm of each word. The paradigms are used to analyze word forms, and also for generation when translating in the opposite direction. Then we added phonological rules. For example, Finnish has vowel harmony: because of it, many word endings (cases) have two forms, depending on the word – for example koirassa (in the dog), but hiiressä (in the mouse).
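To make the vowel harmony point concrete, here is a toy sketch in plain PHP (nothing to do with Apertium’s actual rule formalism) that picks between the two variants of the Finnish inessive ending:

// Back-vowel words (containing a, o or u) take -ssa; front-vowel words take -ssä.
// Real rules also deal with compounds, loanwords and neutral-vowel words;
// this only illustrates the basic idea.
function inessive( $stem ) {
	return preg_match( '/[aou]/u', $stem ) ? $stem . 'ssa' : $stem . 'ssä';
}

echo inessive( 'koira' ), "\n"; // koirassa (in the dog)
echo inessive( 'hiire' ), "\n"; // hiiressä (in the mouse; "hiire" is the inflection stem of "hiiri")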

As a third step, we created a bilingual dictionary in a form that is suitable for machines (read: XML). At this point we started seeing some words in the target language. Of course we also had to add the lexicon for the target language, if nobody else had done it already.

Finally we started adding rules.
We added rules to disambiguate sentences with multiple readings. For example, in the sentence “The door is open” we added a rule saying that “open” is an adjective rather than a verb, because the sentence already has a verb.
We added rules to convert the grammar. For example, Finnish cases are usually replaced with prepositions in English. We might also need to add words: “sataa” needs an explicit subject in English, “it rains”.

At the end we compared the translation produced by our system with the translation made by the human translator. We briefly considered two ways to evaluate the quality of the translation.
First, we can use something like edit distance for words (instead of characters) to count how many insertions, deletions or substitutions are needed to change the machine translation into the human translation. Alternatively, we can count how many words the human translator needs to change when copy-editing the machine translation.
Machine translation systems start to be useful when you need to fix only one word out of six or more words in the translation.

The future

A little while ago Erik asked how the Wikimedia Foundation could support machine translation, which is now mostly in the hands of big commercial entities (though the European Union is also building something) and needs an open source alternative.

We do not have lots of translation corpora like Google does. We do have lots of text in different languages, but it is not the same content in all languages and it is not aligned. Exceptions are translatewiki.net and other places where translations are done with the Translate extension. As a side note, I think translatewiki.net contains one of the most multilingual parallel translation corpora under a free license.

Given that we have lots of people in the Wikimedia movement who are multilingual and interested in languages, I think we should cooperate with an existing open source machine translation system (like Apertium) in a way that allows our users to enhance that system. Doing more translations increases the data stored in a translation memory, making it more useful. In a similar fashion, doing more translations with a machine translation system should make it better.

Apertium has already been in use on the Nynorsk Wikipedia. Bokmål and Nynorsk are closely related languages: the kind of situation where Apertium excels.

One thing I have been thinking is that, now that the Wikimedia Language Engineering team is planning to build tools to help translate Wikipedia articles into other languages, we could integrate those tools closely with Apertium. We could provide an easy way for translators to add missing words and report unintelligible sentences.

I don’t expect most of our translators to actually write and correct rules, so someone should manage that on the Apertium side. But at least word collection could be mostly automated; I bet someone has tried, and will try, to use Wiktionary data too.

As a first step, the Wikimedia Foundation could set up its own Apertium instance as a web service for our needs (existing instances are too unstable). The Translate extension, for example, can query such a web service to provide translation suggestions.
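Here is a rough sketch of what such a query could look like in PHP. The endpoint path, parameters and response shape are assumptions modelled on Apertium’s public APy service, so treat them as placeholders rather than a definitive API:

// Fetch a translation suggestion from an APy-style Apertium web service.
// A real MediaWiki implementation would use the MediaWiki HTTP utilities
// instead of file_get_contents().
function getApertiumSuggestion( $apiUrl, $from, $to, $text ) {
	$url = $apiUrl . '/translate?' . http_build_query( [
		'langpair' => "$from|$to",
		'q' => $text,
	] );

	$json = file_get_contents( $url );
	if ( $json === false ) {
		return null;
	}

	$data = json_decode( $json, true );
	return isset( $data['responseData']['translatedText'] )
		? $data['responseData']['translatedText']
		: null;
}

// Example: ask a local instance for a Nynorsk suggestion for a Bokmål sentence.
// $suggestion = getApertiumSuggestion( 'http://localhost:2737', 'nob', 'nno', 'Dette er en test.' );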

New language stuff for developers and users

MLEB 2013.01 has been released by Amir Aharoni. Lots of development has been happening in Translate due to the work on the new translation interfaces. If you are a developer, please also check out the latest new and changed Web APIs and give us a shout in #mediawiki-i18n @freenode if you see something obviously wrong or missing. Also included in this release are bug fixes for Universal Language Selector, while the other included extensions didn’t see many changes.

Some months ago I wrote about Language tag validation in MediaWiki. A nice person named Siebrand Mazeland decided to improve the situation. As of now we have three new methods developed by the Wikimedia Language Engineering team:

  • isSupportedLanguage
  • isWellFormedLanguageTag
  • isKnownLanguageTag

Unless these methods are backported to MediaWiki 1.19 and MediaWiki 1.20, it will take a while before they are used in extensions, but eventually we should see faster and more readable code.
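Here is a quick usage sketch; the expected return values in the comments reflect my understanding of what these methods are meant to do, so verify them against the implementation in your MediaWiki version:

var_dump( Language::isSupportedLanguage( 'fi' ) );           // true: MediaWiki ships a Finnish localisation
var_dump( Language::isWellFormedLanguageTag( 'fi-x-foo' ) ); // true: syntactically valid per BCP 47
var_dump( Language::isKnownLanguageTag( 'fi-x-foo' ) );      // false: no known language name for this tag
var_dump( Language::isKnownLanguageTag( 'fi' ) );            // true: Finnish has a known name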

MediaWiki Language Extension Bundle 2012.12

MediaWiki language extension bundle 2012.12 was released just before Christmas. It is compatible with MediaWiki versions 1.20 and 1.21alpha. Downloads and installation instructions can be found at https://www.mediawiki.org/wiki/MLEB. Announcements of new releases will be posted to mediawiki-i18n mailing list.

Here are the highlights:

cldr

  • English name for Azerbaijani (arz) was added.
  • A bug that caused local names for be-tarask not to be used was fixed.
  • Translations for be-tarask were updated.

Translate

Lots of development is ongoing in the translation user interface redesign project conducted by the WMF Language Engineering Team. The new message list and translation editor (pictured) are in alpha stage, but interested users can activate them by adding the URL parameter tux=1 on Special:Translate; tux=0 brings back the old interface.

$wgTranslateAC and $wgTranslateEC were removed. If you were still using these, switch over to the TranslatePostInitGroups hook or $wgTranslateCC.
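A rough sketch of the hook-based replacement, assuming the hook passes the array of message groups by reference; the group class in the comment is only a placeholder, so consult the Translate documentation for the group type you actually need:

// Register custom message groups when Translate initialises its group list.
$wgHooks['TranslatePostInitGroups'][] = function ( array &$groups ) {
	// Add groups here, keyed by group id, for example:
	// $groups['out-myext'] = new WikiMessageGroup( 'out-myext', 'myext-definitions' );
	return true;
};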

The bundled Solarium library was removed. Install it manually or use the MediaWiki Solarium extension.


Sneak peek from the new translation UX: the new group selector (top) and part of the revamped translation editor.

Other noteworthy changes in Translate:

  • The ApiQueryMessageGroups module has lots of new functionality.
  • There is a new ApiQueryLanguageStats module.
  • (bug 39761) Special:TranslationStats edit counts now also include reviews
  • GettextFFS: Handle empty but existing msgctxt properly
  • New hook: TranslateSupportedLanguages

Universal Language Selector

  • Fixed a display issue in the Modern skin.
  • (bug 42382) Indicate context in input settings/more languages

MediaWiki Language Extension Bundle launches!

The Wikimedia Language Engineering team is pleased to announce the first release of the MediaWiki Language Extension Bundle. The bundle is a collection of a few selected MediaWiki extensions needed by any wiki which desires to be multilingual.
This first bundle release (2012.11) is compatible with MediaWiki 1.19, 1.20 and 1.21alpha.
The Universal Language Selector is a must-have, because it provides essential functionality for any user regardless of the number of languages they speak: language selection, font support for displaying scripts poorly supported by operating systems, and input methods for typing in languages that don’t use the Latin (a-z) alphabet.
Maintaining multilingual content in a wiki is a mess without the Translate extension, which is used by Wikimedia, KDE and translatewiki.net, where hundreds of pieces of documentation and interface translations are updated every day. With Localisation Update your users will always have the latest translations fresh out of the oven. The Clean Changes extension keeps your recent changes page uncluttered by translation activity and other distractions.
Don’t miss the chance to practice your rusty language skills: use the Babel extension to mark the languages you speak and to find other speakers of the same languages in your wiki. And finally, the cldr extension is a database of language and country name translations.
We are aiming to make a new release every month, so that you can easily stay on the cutting edge with the constantly improving language support. The bundle comes with clear installation and upgrade instructions. It is tested against MediaWiki release versions, so you can avoid most of the temporary breaks that would happen if you were using the latest development versions instead.
Because this is our first release, there may be some rough edges. Please give us plenty of feedback so that we can improve the next release.

Muir Woods has one tree – plural issues in MediaWiki

While I was having fun with the rest of the Wikimedia I18n team in San Francisco, a stream of plural-related bug reports started coming in. The cause is that we recently scrapped the custom plural rules in MediaWiki in favor of using the plural rules from the CLDR database. A temporary fix has been applied to mitigate the reported issues.

The problem manifests itself in a pretty simple way: in some languages, in some contexts, the message always said one something. For example the category page would say This category has one page regardless of how many pages there were in it. At first I was baffled. After all, we had written unit tests for all languages in MediaWiki and they reported no regressions. It turns out we had ignored one particular set of languages: those which don’t always use plurals and had no plural rules defined in MediaWiki. The problems started when those languages used plural syntax even though they weren’t supposed to. When plural rules are not defined for a language, it falls back to the plural rules of English: 1 book, 2 books. In CLDR, however, some languages have been defined to not use any plural rules at all.

We could blame the translators for using plural syntax when it is not supported, or we could blame the CLDR for having no plural rules for languages which do use plurals in some cases. It is not that simple, however. The typical example is a language which doesn’t have distinct plural forms (like some words in English: 1 fish, 2 fish; but for all nouns), but which does use plural quantifiers when the number is not present: one fish, many fish.

As a compromise I have proposed an extension to the plural syntax to allow specifying the output when the number is 0 or 1 regardless of the usual plural rules for that language. Let’s take a real example:

Accepted by {{PLURAL:$1|you|$1 users including you}}.

This works fine in English, because the first form is always for number 1. In Belarusian it doesn’t work, because the first form is used for number 1, but also for numbers 21, 31, 41 etc. It could be solved by the following syntax:

{{PLURAL:$1|1=you|$1 users including you}}.

The slightly confusing part here is that now the second form is actually the singular form. This is more evident in the imaginary Belarusian translation:

{{PLURAL:$1|1=you|one|few|many|other}}

“you” is used for the number 1, “one” for 21, 31, 41 (but not 1), and the remaining forms are used as usual.

The explicit zero form (0=something) can also be useful in English and many other languages for giving a completely different wording when there are no items – something which is now usually done with separate messages.
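For example, a category description could then be written as a single message along these lines (an illustrative example, not an actual MediaWiki message):

{{PLURAL:$1|0=This category is empty.|This category has one page.|This category has $1 pages.}}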

The message used above is from the Translate extension. Unfortunately we cannot start using this syntax until we have dropped backwards compatibility with the last MediaWiki version that does not support it, i.e. 1.20, which will be around the time MediaWiki 1.22 is released. We are seriously considering backporting this functionality, but we also need to add support for the same syntax in JavaScript first.

During further testing we also found issues in the Hebrew plural rules. The position of the dual form had changed and we didn’t notice it, because the unit tests were wrong. This resulted in problems like the login page saying Remember my login for two days. It is a good reminder of how bugs in i18n can cause potentially severe issues.


Niklas in Muir Woods. Testing new counting methods? (Photo by Pau Giner.)

Wikimania videos: the next billion users on Wikipedia and beyond

Wikimedia DC has started publishing the Wikimania videos on YouTube. They are not split by presentation, only by track, but here are some about localisation and internationalisation.

My Wikimania presentation (see my previous post), Translating the wiki way (starts at 28:05; watch on YouTube):

Amir’s Supporting languages, all of them and Siebrand’s A Tale of Language Support and Ask the Language Support People (watch on YouTube):

Santhosh’s Read and Write in your language has not been published yet, and nobody seems to know whether it will be, or whether it was recorded at all.

Alolita’s The next billion users on Wikipedia with Open Source Webfonts and Amir’s The software localization paradox (watch on YouTube):

See also the category on Wikimania wiki for abstracts and slides for these presentations.
