Posts Tagged ‘puppet’

The website anyone can translate

Wednesday, November 14th, 2012

Translatewiki.net has started using Puppet. Puppet is a tool designed to manage the configuration of servers. Like Wikimedia’s, our configuration is public and stored in the translatewiki.net git repository, where anyone can submit patches. I don’t expect a flood of them coming in anytime soon, my motivations for this were different. If you remember, some months back I had to learn some Puppet to write the Solr configuration for Wikimedia deployment. Now I wanted to learn more and gather more experience on using Puppet. It will also greatly help if we ever need to reinstall the translatewiki.net server from scratch (which is quite likely to happen soon). As a bonus it gives transparency and something I can refer people to when they ask how some particular thing is done in translatewiki.net. As time permits, I will be moving more configuration to Puppet.

Mitä isot edellä, sitä pienet perässä. (Internet suggest the closest translation is Monkey see, monkey do.)

I also added the translatewiki.net repository to Ohloh. If you use translatewiki.net as localisation platform, feel free to add it to your stacks by clicking “I use this”, or to embed its widgets in your website. Ohloh also gives some cool stats:

In a Nutshell, translatewiki.net…

Together with the introduction of Puppet, I also switched the webserver of translatewiki.net from lighttpd to nginx. The biggest reason for this is that https was broken for Google Chrome users, but in general nginx feels faster and more robust and the way PHP is used with it is much simpler (php-fpm instead of spawn-cgi). The Wikimedia operations team is supposedly going to test nginx soon, so we will see whether the tide also goes that way.

Efficient translation: Translation memory enabled on all Wikimedia wikis

Friday, September 7th, 2012

I am pleased to announce that a long development project has been released and taken into production. We now have translation memory services enabled on Wikimedia projects (since August 28, in our last sprint).

The translation editor on Wikimania 2013 wiki shows a suggestion from Wikimania 2012 wiki

Users translating for Wikimania 2013 are provided with suggestions from 2012 (right arrow); a click is enough to copy it to the text area (down arrow). See also on Meta, in English interface.

Translation memory is a feature which provides likely translations for a text based on previous translations of similar texts: translators use them to speed up their work and to increase consistency (more in Wikipedia).

If you have translated at translatewiki.net or usebase.kde.org, you may have already noticed it. The translation memory on Wikimedia wikis has been filled with existing translations made with the Translate extension in WMF projects including Meta, mediawiki.org and Wikimania wikis.

Translators from all Wikimedia projects using the Translate extension can now work more efficiently, sharing their work and experience across the boundaries of wikis. Translators on Wikimania 2013 wiki can now find translations already provided for the previous year (see screenshot) and be quicker without sacrificing quality and consistency. Translators of technical documentation on mediawiki.org can benefit from the translation of Wikimedia terminology on Meta-Wiki and vice versa.

Technical challenges

A translation memory service has been in use at translatewiki.net for years, and the process of getting it enabled on Wikimedia was started about a year ago.

Naturally WMF operations is a very different thing from the small shared server translatewiki.net runs on. Yet, there were many unexpected turns that caused delay. The phases here are named retroactively.

Phase1

Originally we used the tmserver component from the translate toolkit. It had its own problems: it was hard to set up, it was an external dependency and the SQLite database engine it used was problematic for updates – it failed if there were multiple processes accessing at the same time. Sometimes the included standalone webserver got stuck and the other option, WSGI, didn’t play nicely with our lighttpd web server.

I did lots of research with Siebrand trying to find other open source translation memories, but failed to find anything that had any active or recent development.

Phase2

The next step was the standalone version. To avoid external dependencies, to make it usable in the WMF infrastructure, and not to require separate services, I started porting the tmserver algorithm from Python to PHP. At the same time I was able to take advantage of MediaWiki’s database abstraction code, which in theory should make it work on SQLite, MySQL and PostgreSQL. At the moment, however, only MySQL is tested and in use at translatewiki.net.

Performance of this new system was mostly the same, though it’s a constant fight for not letting the Levenshtein algorithm, used for ranking in the core, get exponentially slow. The major new feature was the support for shared databases, so that multiple wikis can use the translations made in other wikis for suggestions. A lot of time was spent on this, and also on making the initial bootstrap efficient with use of multiple threads.

Phase3

When we thought everything was ready for deployment on Wikimedia wikis, we waited for feedback from ops and finally we got a simple, yet unwanted reply: “Full-text search with MySQL cannot be used in the WMF cluster (because it depends on the problematic MyISAM storage engine)”. Yay. Back to the drawing board.
Since everything at Wikimedia is using a heavily modified Apache Lucene for full text search, the same was obviously suggested as a solution. So started the development of phase3; if the past predicts anything, this will have been the final rewrite.

I decided not to touch Wikimedia’s version of Lucene, as I already had lots of experience on it due to playing with it for my Master’s thesis (English summary on my blog), and decided to use standard Lucene with a Solr frontend. Solr simplified many things and the development was swift using the PHP Solarium library.

In fact, the most difficult “feature” to develop was the Puppet configuration for Jetty and Solr, and testing it on WMF Labs. So I learned to write Puppet configuration files from scratch and did it mostly myself. Oren Bochman helped a lot with the Labs testing phase. The last hurdle was backporting recent packages of Solr and its dependency Jetty for the Ubuntu that Wikimedia was using on Labs and in production. Luckily I was fortunate enough to get quick help from ops, so I didn’t have to also learn how to make Ubuntu packages.

So somewhat ironically, we went from separate services to standalone and again to a separate service. The first phase is long forgotten, but the standalone and Solr versions complement each other. The former is enabled by default for anyone using the Translate extension, the latter provides superior scalability and hopefully in the future even better suggestions.

Fact is that the Levenshtein based ranking is not the state of the art for translation memories[1] and does not compare to the state of art i18n we are doing with MediaWiki and translatewiki.net.

On to the next adventure!

[1] Paper abstract (full text behind paywall; DOI:10.1007/3-540-39965-8_14).

-- .