Tag Archives: HipHop

translatewiki.net – harder, better, faster, stronger

I am very pleased to announce that translatewiki.net has been migrated to new servers sponsored by netcup GmbH. Yes, that is right, we now have two servers, both of which are more powerful than the old server.

Since the two (virtual) servers are located in the same data center and other nitty gritty details, we are not making them redundant for the sake of load balancing or uptime. Rather, we have split the services: ElasticSearch runs on one server, powering the search, translation search and translation memory; everything else runs on the other server.

In addition to faster servers and continuous performance tweaks, we are now faster thanks to the migration from PHP to HHVM. The Wikimedia Foundation did this a while ago with great results, but HHVM has been crashing and freezing on translatewiki.net for unknown reasons. Fortunately, recently I found a lead that the issue is related to a ini_set function, which I was easily able to work around while the investigation on the root cause continues.

Non-free Google Analytics confirms that we now server pages faster.

Non-free Google Analytics confirms that we now serve pages faster: the small speech bubble indicates migration day to new servers and HHVM. Effect on the actual page load times observed by users seems to be less significant.

We now have again lots of room for growth and I challenge everyone to make us grow with more translations, new projects or other legitimate means, so that we reach a point where we will need to upgrade again ;). That’s all for now, stay tuned for more updates.

Midsummer cleanup: YAML and file formats, HHVM, translation memory

Wikimania 2014 is now over and that is a good excuse to write updates about the MediaWiki Translate extension and translatewiki.net.
I’ll start with an update related to our YAML format support, which has always been a bit shaky. Translate supports different libraries (we call them drivers) to parse and generate YAML files. Over time the Translate extension has supported four different drivers:

  • spyc uses spyc, a pure PHP library bundled with the Translate extension,
  • syck uses libsyck which is a C library (hard to find any details) which we call by shelling out to Perl,
  • syck-pecl uses libsyck via a PHP extension,
  • phpyaml uses the libyaml C library via a PHP extension.

The latest change is that I dropped syck-pecl because it does not seem to compile with PHP 5.5 anymore; and I added phpyaml. We tried to use sypc a bit but the output it produced for localisation files was not compatible with Ruby projects: after complaints, I had to find an alternative solution.

Joel Sahleen let me know of phpyaml, which I somehow did not found before: thanks to him we now use the same libyaml library that Ruby projects use, so we should be fully compatible. It is also the fastest driver of the four. Anyone generating YAML files with Translate is highly recommended to use the phpyaml driver. I have not checked how phpyaml works with HHVM but I was told that HHVM ships with a built-in yaml extension.

Speaking of HHVM, the long standing bug which causes HHVM to stop processing requests is still unsolved, but I was able to contribute some information upstream. In further testing we also discovered that emails sent via the MediaWiki JobQueue were not delivered, so there is some issue in command line mode. I have not yet had time to investigate this, so HHVM is currently disabled for web requests and command line.

I have a couple of refactoring projects for Translate going on. The first is about simplifying the StringMangler interface. This has no user visible changes, but the end goal is to make the code more testable and reduce coupling. For example the file format handler classes only need to know their own keys, not how those are converted to MediaWiki titles. The other refactoring I have just started is to split the current MessageCollection. Currently it manages a set of messages, handles message data loading and filters the collection. This might also bring performance improvements: we can be more intelligent and only load data we need.

Théo Mancheron competes in the men's decathlon pole vault final

Aiming high: creating a translation memory that works for Wikipedia; even though a long way from here (photo Marie-Lan Nguyen, CC BY 3.0)

Finally, at Wikimania I had a chance to talk about the future of our translation memory with Nik Everett and David Chan. In the short term, Nik is working on implementing in ElasticSearch an algorithm to sort all search results by edit distance. This should bring translation memory performance on par with the old Solr implementation. After that is done, we can finally retire Solr at Wikimedia Foundation, which is much wanted especially as there are signs that Solr is having problems.

Together with David, I laid out some plans on how to go beyond simply comparing entire paragraphs by edit distance. One of his suggestions is to try doing edit distance over words instead of characters. When dealing with the 300 or so languages of Wikimedia, what is a word is less obvious than what is a character (even that is quite complicated), but I am planning to do some research in this area keeping the needs of the content translation extension in mind.

Translatewiki.net summer update

It’s been a busy while since last update, but how could I have not worked on translatewiki.net? ;) Here is an update on my current activities.
In this episode:

  • we provide translations for over 70 % of users of the new Wikipedia app,
  • I read a book on networking performance and get needy for speed,
  • ElasticSearch tries to eat all of us and our memory,
  • HHVM finds the place not fancy enough,
  • Finns and Swedes start cooperating.

Performance

Naturally, I have been thinking of ways to further improve translatewiki.net performance. I have been running HHVM as a beta feature at translatewiki.net many months now, but I have kept turning it on and off due to stability issues. It is currently disabled, but my plan is to try the Wikimedia packaged version of HHVM. Those packages only work in Ubuntu 2014.04, so Siebrand and I first have to upgrade the translatewiki.net server from Ubuntu 2012.04, as we plan to later this month (July). (Update: done as of 2014-07-09, 14 UTC.)

Map of some translatewiki.net translators

A global network of translators is not served well enough from a single location

After reading a book about networking performance I finally decided to give a content distribution network (CDN) a try. Not because they can optimize and cache things on the fly [1], nor because the can do spam protection [2], but because CDN can reduce latency, which is usually the main bottleneck of web browsing. We only have single server in Germany, but our users are international. I am close to the server, so I have much better experience than many of our users. I do not have any numbers yet, but I will do some experiments and gather some numbers to see whether CDN helps us.

[1] MediaWiki is already very aggressive in terms of optimizations for resource delivery.
[2] Restricting account creation already eliminated spam on our wiki.

Wikimedia Mobile Apps

Amir and I have been closely working with the Wikimedia Mobile Apps team to ensure that their apps are well supported. In just a couple weeks, the new app was translated in dozens languages and released, with over 7 millions new installations by non-English users (74 % of the total).

In more detail, we finally addressed a longstanding issue in the Android app which prevented translation of strings containing links. I gave Yuvi access to synchronize translations, ensuring that translators have as much time as possible to translate and the apps have the latest updates before being released. We also discussed about how to notify translators before releases to get more translations in time, and about improvements to their i18n frameworks to bring their flexibility more in line with MediaWiki (including plural support).

To put it bluntly, for some reason the mobile i18n frameworks are ugly and hard to work with. Just as an example, Android did not support many languages at all just for one character too much; support is still partial. I can’t avoid comparing this to the extra effort which has been needed to support old versions of Internet Explorer: we would rather be doing other cool things, but the environment is not going to change anytime soon.

Search

I installed and enabled CirrusSearch on translatewiki.net: for the first time, we have a real search engine for all our pages! I had multiple issues, including running a bit tight on memory while indexing all content.

Translate’s translation memory support for ElasticSearch has been almost ready for a while now. It may take a couple months before we’re ready to migrate from Solr (first on translatewiki.net, then Wikimedia sites). I am looking forward to it: as a system administrator, I do not want to run both Solr and ElasticSearch.

I want to say big thanks to Nik for helping both with the translation memory ElasticSearch backend and my CirrusSearch problems.

Wikimedia Sweden launches a new project

I am expecting to see an increased activity and new features at translatewiki.net thanks to a new project by Wikimedia Sweden together with InternetFonden.Se. The project has been announced on the Wikimedia blog, but in short they want to bring more Swedish translators, new projects for translation and possibly open badges to increase translator engagement. They are already looking for feedback, please do share your thoughts.

Performance is a feature

In case you haven’t already noticed, I like working on performance issues and performance improvements. Performance is a thing where you have to consider the whole stack: the speed of the server, efficient algorithms, server side caching, bandwidth and latency, client side caching and client side code. Here is a short recap of what has been done for translatewiki.net lately and some ideas for the future.

Recent improvements

Flame chart visualization

Chrome 29 (or later release) has added a helpful visualization for profiling data. In this image the speed of ULS JavaScript code is evaluated on a fonts heavy page. Comparing to the collapsible tabs feature, it is doing okay.

Server level. A month ago translatewiki.net got a new server with more memory and faster processors. The main benefit is that we can handle more simultaneous users and background tasks without them slowing each other down. At the same time, we upgraded many of the programs to newer versions. The switch from MySQL to MariaDB is the most important one. We haven’t tested it for our use case, but the Wikimedia Foundation found that the switch had overall positive impact on performance.

Web server level. In the beginning of November I configured our nginx web server to enable support for the SPDY protocol. This should greatly reduce latency when browsing over HTTPS. We are considering to switch to HTTPS by default. While tweaking nginx, I also fixed a few settings that relate on the compression and expiry times of JavaScript, SVG images and font assets when delivered to users. I used AWStats to see if our daily bandwidth usage decreased. It has not decreased significantly, but there is a lot of variation between days that make interpreting the data difficult. PageSpeed was used to ensure that caching headers are optimal and WebPagetest to confirm that pages load faster on different browsers in different places.

Application level. The Language Engineering team has recently worked a lot on the performance of Universal Language Selector (ULS) and Translate extensions. A short summary of the things which were done:

  • Reduce the amount of JavaScript and CSS delivered to the browser.
  • Delay the loading of JavaScript and CSS as much as possible (for example till the user opens ULS).
  • Optimize JPG, SVG and PNG images to the last byte with tools like jpegoptim, optipng.
  • Optimize the JavaScript to avoid slow actions (for example repaint events and dom changes). We used Chrome’s JavaScript profiler as well as the experimental tool “show potential scroll bottlenecks” to identify issues and confirm the fixes (thanks Ori).

In addition I fixed a major performance issue in one of the Translate API modules by replacing an inefficient algorithm with a faster one. While investigating that issue, I also noticed that ReplacementArray-strtr was taking 20% or so of MediaWiki run time. There is a less known PHP module FastStringSearch, which was not installed on the new server. Installing that module made a big difference on the MediaWiki profiling table: ReplacementArray-fss is now taking only about 0.20% of MediaWiki run time.

Finally, a thing called module local storage was enabled on Wikimedia wikis few days ago (the title of this post was taken from that discussion). As is usual for translatewiki.net, we were already beta testing that feature a few weeks before it went live on Wikimedia wikis.

Future plans

It is hard to plan the future for further performance improvements, as the bottlenecks and the places where you can make the most difference for the least effort change constantly, together with the technology and your content. I believe that HHVM, a JIT PHP virtual machine, is likely to be the next step which will make a significant difference. It is however not a straightforward thing to jump from a normal PHP intepreter to HHVM, so I will be keeping a close eye on how my colleagues at the Wikimedia Foundation are progressing with the adoption of HHVM.

Another relatively small thing on the horizon is better compression of inline SVG images in CSS style sheets, by avoiding unnecessary base64 encoding. Or something else might happen even before it.

Finally, I’d like to highlight that while the application-level improvements automatically benefit third party users, there really isn’t any coherent documentation on how to improve performance of a MediaWiki site at all levels. Configuring localisation cache, nginx and/or Varnish, tweaking MySQL or MariaDB and installing Memcached or Redis should be part of any capable sysadmin’s skills; but even just tailoring them for MediaWiki, let alone knowing which PHP modules to install, is likely not known by many. For example, I wouldn’t be surprised if there were very few or even no sites using the FastStringSearch module outside of Wikimedia and translatewiki.net.

FOSDEM talk reflections 3/3: HipHop, communities, public procurement

Nikerabbit arrives at the MediaWiki meetup at FOSDEM

Meetup of MediaWiki community. Or Wikimedia tech? How to call the Wikimedia software development ecosystem? (Photo by henna, copyright status unknown.)

This is the third post about FOSDEM 2013; see 1/3: I18n in the WEB, Mozilla i18n and L20n for the first and 2/3: docs, code and community health, stability for the second. Links to the abstracts in the headers.

Scaling PHP with HipHop

HipHop is still alive, and faster than ever. It has evolved from PHP to C translator to a JIT bytecode interpreter system, just like PHP itself is, without JIT of course. The speedups they are seeing are impressive (it was deployed on Facebook about a month ago). Given that they have removed the compile everything before deploy step, it is now much more feasible to use.

I’m considering to give HipHop a try on translatewiki.net later this year, probably after we have upgraded to at least Ubuntu 12.04, where Facebook provides packages for it. It is still a pain in the ass to set up manually, as it was few years ago. Wikimedia Foundation (WMF) has dropped its evaluation, but perhaps they will reconsider after our experiences, and HipHop, or hhvm as it is called now, has indeed changed a lot since then.

It was highlighted that the supported language features and libraries of hhvm and PHP vary to an extent. hhvm provides some nice features like strict type hinting, but it is unlikely I can use those anytime soon, since there is no way to take advantage of these on hhvm without breaking support for normal PHP, which is something that really cannot be done in the MediaWiki ecosystem.

Community/BOF meetup

Almost 20 people were around, a few outside of WMF. Discussions circled around events like the Amsterdam Hackathon and MediaWiki groups. The most interesting part (to me) is how to call the Wikimedia software development ecosystem, so it can be marketed properly. Suggestions ranged from extending the meaning of MediaWiki to cover everything including mobile, gadgets and so on; using Wikipedia as it is the brand most well known; or creating a new Wikitech brand.

There are pros and cons to each of the above, but one thing is true: There is no name that can currently be used to refer to everything technical done around MediaWiki and Wikimedia that would also be understood by potential participants. Also, MediaWiki development is not perceived to be cool anymore, because it’s PHP. But it isn’t. MediaWiki development is also Redis, Varnish, puppet, git, Solr, HipHop, semantic, node.js, mobile, OpenStack, and more. Quim Gil will continue work on coming up with a brand. Curious visitors can also compare this to what KDE did recently when they expanded the meaning of KDE to be not only the desktop, but also the community and everything they do. The change process wasn’t painless for them, and wouldn’t be for us, but at the same time (IMHO) the change has been quite successful and beneficial to KDE.

Qt Project Update

Qt booth at FOSDEM

Qt: a maintainer for each subsystem helps getting your patch reviewed, unlike MediaWiki

Qt is doing well. Qt5 is evolution instead of revolution (what Qt4 was to Qt3). The contributors outside of Nokia (and nowadays Digia) have risen to about one third of all commits. They are using Gerrit like MediaWiki. But unlike MediaWiki they have explicit hierarchy, with maintainers who are responsible for keeping each subsystem in shape. It also means that there is always at least one person you can talk to, to get your patch reviewed, unlike in the MediaWiki community.

Their platform support is also nice: Linux, Windows and OSX are fully supported, while iOS, Android and BlackBerry OS are also working more or less.

Fixing public procurement

Forgive me if I use incorrect terms. In a nutshell there is a law in Finland (coming from the EU) that disallows governmental organizations to request software systems by referring to an exact producer. So for example a hypothetical example “We want Microsoft Office on all our work stations in X department” is illegal, while “We want an office tools suite that includes documents, presentations, …” is legal. Free software people in Finland did an analysis of how many times this law has been violated… quite many… and have been sending letters that ask them to read the rules and fix their procurement.

The talk continued with the observation that there is no entity to enforce this rule, and that it is difficult to get the companies put into disadvantage to sue. One side argued that fixing this particular issue harms doing wide scale education of people on this and other issues. Or when is it useful to sue instead of trying to educate? When are the lost opportunities bigger a harm than bad publicity and money spent on suing? Apparently Microsoft has sued successfully in Finland to gain lots of money without a big PR hit. Open source solutions are usually discarded because the exit costs of the previous system are placed on the new solution instead of the old vendor lock-in solution.

All in all, the target of this kind of work is to further open source use in governments by allowing free competition; they want to do it EU-wide.

The Keeper of Secrets: The Dance of Community Leadership

FOSDEM party crowd

FOSDEM preparty was definitely not a quiet beerless one

This is the first time I’ve seen Leslie Hawthorn speaking. In her talk, which was full of beer jokes (a bit too many to my taste), I caught some points:

  • Don’t be a jerk.
  • Stop gossiping and talk directly to people you have problems with.
  • Don’t ignore difficult people, be brave enough to let them know your honest opinion.
  • Don’t be a jerk while talking about difficult things with people, do it politely and cooperatively.
  • Face to face meetings are essential to community building (my addition: also include possibilities do it in beerless, quiet places).

-- .