Tag Archives: Wikimania

Midsummer cleanup: YAML and file formats, HHVM, translation memory

Wikimania 2014 is now over and that is a good excuse to write updates about the MediaWiki Translate extension and translatewiki.net.
I’ll start with an update related to our YAML format support, which has always been a bit shaky. Translate supports different libraries (we call them drivers) to parse and generate YAML files. Over time the Translate extension has supported four different drivers:

  • spyc uses spyc, a pure PHP library bundled with the Translate extension,
  • syck uses libsyck which is a C library (hard to find any details) which we call by shelling out to Perl,
  • syck-pecl uses libsyck via a PHP extension,
  • phpyaml uses the libyaml C library via a PHP extension.

The latest change is that I dropped syck-pecl, because it does not seem to compile with PHP 5.5 anymore, and added phpyaml. We tried to use spyc for a while, but the output it produced for localisation files was not compatible with Ruby projects: after complaints, I had to find an alternative solution.

Joel Sahleen let me know of phpyaml, which I somehow did not find before: thanks to him we now use the same libyaml library that Ruby projects use, so we should be fully compatible. It is also the fastest driver of the four. Anyone generating YAML files with Translate is strongly advised to use the phpyaml driver. I have not checked how phpyaml works with HHVM, but I was told that HHVM ships with a built-in yaml extension.

Speaking of HHVM, the long-standing bug which causes HHVM to stop processing requests is still unsolved, but I was able to contribute some information upstream. In further testing we also discovered that emails sent via the MediaWiki JobQueue were not delivered, so there is some issue in command line mode as well. I have not yet had time to investigate this, so HHVM is currently disabled for both web requests and the command line.

I have a couple of refactoring projects for Translate going on. The first is about simplifying the StringMangler interface. This has no user-visible changes, but the end goal is to make the code more testable and reduce coupling. For example, the file format handler classes only need to know their own keys, not how those are converted to MediaWiki titles. The other refactoring, which I have just started, is to split the current MessageCollection. Currently it manages a set of messages, handles message data loading and filters the collection. Splitting it might also bring performance improvements: we can be more intelligent and only load the data we need.
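To make the decoupling idea concrete, here is a minimal Python sketch. It is not Translate's actual StringMangler interface (which is PHP): the names `KeyMangler`, `PrefixMangler`, `read_messages` and `to_titles`, the set of unsafe characters and the `=XX` escaping scheme are all assumptions made for illustration. The point is only that the file reader deals in its own keys, while a separate object decides how keys become title-safe strings.

```python
from typing import Dict, Protocol


class KeyMangler(Protocol):
    """Hypothetical interface: converts between file keys and title-safe keys."""

    def mangle(self, key: str) -> str: ...

    def unmangle(self, key: str) -> str: ...


class PrefixMangler:
    """Illustrative mangler: adds a prefix and escapes characters as =XX (hex)."""

    # Assumption: characters we do not want in a title; '=' is the escape marker.
    UNSAFE = "#<>[]|{}="

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix

    def mangle(self, key: str) -> str:
        escaped = "".join(
            f"={ord(ch):02X}" if ch in self.UNSAFE else ch for ch in key
        )
        return self.prefix + escaped

    def unmangle(self, key: str) -> str:
        if not key.startswith(self.prefix):
            raise ValueError(f"key {key!r} lacks prefix {self.prefix!r}")
        body = key[len(self.prefix):]
        out = []
        i = 0
        while i < len(body):
            if body[i] == "=":
                # Decode the two hex digits that follow the escape marker.
                out.append(chr(int(body[i + 1:i + 3], 16)))
                i += 3
            else:
                out.append(body[i])
                i += 1
        return "".join(out)


def read_messages(text: str) -> Dict[str, str]:
    """Toy file format handler: parses 'key = value' lines, knows nothing of titles."""
    messages = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            messages[key.strip()] = value.strip()
    return messages


def to_titles(messages: Dict[str, str], mangler: KeyMangler) -> Dict[str, str]:
    """The only place where file keys and title-safe keys meet."""
    return {mangler.mangle(key): value for key, value in messages.items()}
```

With this split, `read_messages` can be tested with plain strings, and the mangler can be tested on its own by checking that `unmangle(mangle(key))` returns the original key.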

Théo Mancheron competes in the men's decathlon pole vault final

Aiming high: creating a translation memory that works for Wikipedia; even though a long way from here (photo Marie-Lan Nguyen, CC BY 3.0)

Finally, at Wikimania I had a chance to talk about the future of our translation memory with Nik Everett and David Chan. In the short term, Nik is working on implementing in ElasticSearch an algorithm to sort all search results by edit distance. This should bring translation memory performance on par with the old Solr implementation. After that is done, we can finally retire Solr at Wikimedia Foundation, which is much wanted especially as there are signs that Solr is having problems.

Together with David, I laid out some plans on how to go beyond simply comparing entire paragraphs by edit distance. One of his suggestions is to try doing edit distance over words instead of characters. When dealing with the 300 or so languages of Wikimedia, what is a word is less obvious than what is a character (even that is quite complicated), but I am planning to do some research in this area keeping the needs of the content translation extension in mind.
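As a rough illustration of the difference between character-level and word-level edit distance, here is a small Python sketch. It uses the textbook Levenshtein algorithm, not whatever will end up in ElasticSearch or Translate, and it splits words on whitespace, which is a naive assumption that does not hold for many of the languages mentioned above. The function names and the length-normalised similarity score are made up for the example.

```python
from typing import Callable, List, Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance between two sequences (strings or lists of tokens)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity(a: Sequence, b: Sequence) -> float:
    """Distance normalised by the longer length: 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def rank_suggestions(
    query: str,
    candidates: List[str],
    tokenize: Callable[[str], Sequence] = str.split,
) -> List[str]:
    """Sort candidates from most to least similar to the query."""
    query_tokens = tokenize(query)
    return sorted(
        candidates,
        key=lambda text: -similarity(query_tokens, tokenize(text)),
    )


if __name__ == "__main__":
    old = "Save the page"
    new = "Save the current page"
    print(edit_distance(old, new))                  # character level
    print(edit_distance(old.split(), new.split()))  # word level
    print(rank_suggestions(old, ["Delete the page", new, old]))
```

Passing `tokenize=list` to `rank_suggestions` gives the character-level ranking, so the two approaches can be compared on the same data. A real tokenizer would have to be language-aware, which is exactly the open question.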

Wikimania videos: the next billion users on Wikipedia and beyond

Wikimedia DC has started publishing the Wikimania videos on YouTube. They are not split by presentation, only by track, but here are some about localisation and internationalisation.

My Wikimania presentation (see my previous post), Translating the wiki way (starts at 28:05; watch on YouTube):

Amir’s Supporting languages, all of them and Siebrand’s A Tale of Language Support and Ask the Language Support People (watch on YouTube):

Santhosh’s Read and Write in your language has not been published yet and nobody seems to know if it will, or if it has been recorded at all.

Alolita’s The next billion users on Wikipedia with Open Source Webfonts and Amir’s The software localization paradox (watch on YouTube):

See also the category on Wikimania wiki for abstracts and slides for these presentations.

My presentations at Akademy and Wikimania

In July I gave two presentations: one at Akademy 2012 in Tallinn, and one at Wikimania 2012.

Short summary of my Akademy presentation (slides): If you are translating content in MediaWiki and you are not using the Translate extension, you are doing it wrong. Statistics, translation and proofreading interface – you get them all with Translate. Because Translate keeps track of changes to pages, you can spend your time translating instead of trying to figure out what needs translating or updating.

Also, have a look at UserBase: it has now been updated to include the latest features and fixes of the Translate extension, like the ability to group translatable pages into larger groups.

Akademy presenation by Niklas and Claus: click for video. Yes, there’s a typo.

Short summary of my Wikimania presentation (slides; video not yet available): Stop wasting translators’ time.
Forget signing up to e-mail lists, forget sending files back and forth. Use translation platforms that move files from and to the version control system transparently to the translator.
If you have sentences split into multiple messages, you are doing it wrong. If your i18n framework doesn’t have support for plural, gender and grammar dependent translations, you are doing it wrong. If you are not documenting your interface messages for translators, you are doing it wrong.
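To show why plural support and whole-sentence messages matter, here is a small Python sketch. It does not use any real i18n framework: the message key, the strings and the function names are invented for the example, and the plural rules are the CLDR rules for English and Polish as I understand them, restricted to non-negative integers.

```python
from typing import Callable, Dict


def plural_category_en(n: int) -> str:
    """English: one form for 1, another for everything else."""
    return "one" if n == 1 else "other"


def plural_category_pl(n: int) -> str:
    """Polish (integers only): 'one', 'few' and 'many' forms."""
    if n == 1:
        return "one"
    if n % 10 in (2, 3, 4) and n % 100 not in (12, 13, 14):
        return "few"
    return "many"


PLURAL_RULES: Dict[str, Callable[[int], str]] = {
    "en": plural_category_en,
    "pl": plural_category_pl,
}

# Each message is one complete sentence per plural form, so translators can
# reorder words and inflect freely. The key and strings are hypothetical.
MESSAGES: Dict[str, Dict[str, Dict[str, str]]] = {
    "en": {
        "files-deleted": {
            "one": "{n} file was deleted.",
            "other": "{n} files were deleted.",
        },
    },
    "pl": {
        "files-deleted": {
            "one": "Usunięto {n} plik.",
            "few": "Usunięto {n} pliki.",
            "many": "Usunięto {n} plików.",
        },
    },
}


def format_message(lang: str, key: str, n: int) -> str:
    """Pick the plural form for n in the given language and fill in the number."""
    category = PLURAL_RULES[lang](n)
    return MESSAGES[lang][key][category].format(n=n)


if __name__ == "__main__":
    for n in (1, 2, 5, 12, 22):
        print(format_message("en", "files-deleted", n))
        print(format_message("pl", "files-deleted", n))
```

Compare this with code that builds the sentence as `str(n) + " file" + ("s" if n != 1 else "") + " deleted"`: the English-only plural logic and the word order are baked into the source, so no Polish translator could ever get the sentence right.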

Niklas maybe having fun at Library of Congress. Photo tychay, CC-BY-NC-ND
