Midsummer cleanup: YAML and file formats, HHVM, translation memory

Wikimania 2014 is now over and that is a good excuse to write updates about the MediaWiki Translate extension and translatewiki.net.
I’ll start with an update related to our YAML format support, which has always been a bit shaky. Translate supports different libraries (we call them drivers) to parse and generate YAML files. Over time the Translate extension has supported four different drivers:

  • spyc uses spyc, a pure PHP library bundled with the Translate extension,
  • syck uses libsyck which is a C library (hard to find any details) which we call by shelling out to Perl,
  • syck-pecl uses libsyck via a PHP extension,
  • phpyaml uses the libyaml C library via a PHP extension.

The latest change is that I dropped syck-pecl because it does not seem to compile with PHP 5.5 anymore; and I added phpyaml. We tried to use sypc a bit but the output it produced for localisation files was not compatible with Ruby projects: after complaints, I had to find an alternative solution.

Joel Sahleen let me know of phpyaml, which I somehow did not found before: thanks to him we now use the same libyaml library that Ruby projects use, so we should be fully compatible. It is also the fastest driver of the four. Anyone generating YAML files with Translate is highly recommended to use the phpyaml driver. I have not checked how phpyaml works with HHVM but I was told that HHVM ships with a built-in yaml extension.

Speaking of HHVM, the long standing bug which causes HHVM to stop processing requests is still unsolved, but I was able to contribute some information upstream. In further testing we also discovered that emails sent via the MediaWiki JobQueue were not delivered, so there is some issue in command line mode. I have not yet had time to investigate this, so HHVM is currently disabled for web requests and command line.

I have a couple of refactoring projects for Translate going on. The first is about simplifying the StringMangler interface. This has no user visible changes, but the end goal is to make the code more testable and reduce coupling. For example the file format handler classes only need to know their own keys, not how those are converted to MediaWiki titles. The other refactoring I have just started is to split the current MessageCollection. Currently it manages a set of messages, handles message data loading and filters the collection. This might also bring performance improvements: we can be more intelligent and only load data we need.

Théo Mancheron competes in the men's decathlon pole vault final

Aiming high: creating a translation memory that works for Wikipedia; even though a long way from here (photo Marie-Lan Nguyen, CC BY 3.0)

Finally, at Wikimania I had a chance to talk about the future of our translation memory with Nik Everett and David Chan. In the short term, Nik is working on implementing in ElasticSearch an algorithm to sort all search results by edit distance. This should bring translation memory performance on par with the old Solr implementation. After that is done, we can finally retire Solr at Wikimedia Foundation, which is much wanted especially as there are signs that Solr is having problems.

Together with David, I laid out some plans on how to go beyond simply comparing entire paragraphs by edit distance. One of his suggestions is to try doing edit distance over words instead of characters. When dealing with the 300 or so languages of Wikimedia, what is a word is less obvious than what is a character (even that is quite complicated), but I am planning to do some research in this area keeping the needs of the content translation extension in mind.

-- .

Tags: , , , , , , , , , , ,

Leave a Reply