Tag Archives: pet projects

Oregano deployment tool

This blog post introduces oregano, a non-complex, non-distributed, non-realtime deployment tool. It currently consists of fewer than 100 lines of shell script and is licensed under the MIT license.

The problem. For a very long time, we have run translatewiki.net straight from a git clone, or an svn checkout before that. For years, we have been the one wiki which systematically runs the latest master, with a few hours of delay. That was not a problem while we were young and wild. But nowadays, because we carry dozens of local patches and thanks to the introduction of composer, it is quite likely that git pull --rebase will stop in a merge conflict. As a consequence, updates have become less frequent, but have semi-regularly brought the site down for many minutes until the merge conflicts were manually resolved. This had to change.

The solution. I wrote a simple tool, probably re-inventing the wheel for the hundredth time, which separates the current deployment into two stages: preparation and pushing out new code. Since I have been learning a lot about Salt and its quirks, I named my tool “oregano”.

How it works. Basically, oregano is a simple wrapper for symbolic links and rsync. The idea is that you prepare your code in a directory named workdir. To deploy the current state of workdir, you must first create a read-only copy by running oregano tag. After that, you can run oregano deploy, which will update symbolic links so that your web server sees the new code. You can give the name of the tag with both commands, but by default oregano will name a new tag after the current timestamp, and deploy the most recently created tag. If, after deploying, you find out that the new tag is broken, you can quickly go back to the previously deployed code by running oregano rollback. The following command line tutorial shows the whole cycle.

mkdir /srv/mediawiki/ # the path does not matter, pick whatever you want

cd /srv/mediawiki

# Get MediaWiki. Everything we want to deploy must be inside workdir
git clone https://github.com/wikimedia/mediawiki workdir

oregano tag
oregano deploy

# Now we can use /srv/mediawiki/targets/deployment where we want to deploy
ln -s /srv/mediawiki/targets/deployment /www/example.com/docroot/mediawiki

# To update and deploy a new version
cd workdir
git pull
# You can run maintenance scripts, change configuration etc. here
nano LocalSettings.php

cd .. # Must be in the directory where workdir is located
oregano tag
oregano deploy

# Whoops, we accidentally introduced a syntax error in LocalSettings.php
oregano rollback

As you can see from above, it is still possible to break the site if you don’t check what you are deploying. For this purpose I might add support for hooks, so that one could run syntax checks whose failure would prevent deploying that code. Hooks would also be handy for sending IRC notifications, which is something our existing scripts do when code is updated: as pushing out code is now a separate step, they are currently incorrect.

By default oregano will keep the 4 newest tags, so make sure you have enough disk space. For translatewiki.net, which has MediaWiki and dozens of extensions, each tag takes about 200M. If you store the MediaWiki localisation cache, pre-generated for all languages, inside workdir, then you would need 1.2G for each tag. Currently, at translatewiki.net, we store the localisation cache outside workdir, which means it is out of sync with the code. We will see if that causes any issues; we will move it inside workdir if needed. Do note that oregano creates a tag with rsync --cvs-exclude to save space. That also has the caveat that files or directories named core are silently left out of the tag, so do not use that name. Be warned; patches welcome.
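To make the tag, deploy and rollback cycle more concrete, here is a minimal Python sketch of the same idea. It is not oregano's actual shell code: the tags directory, the previous link and all function names are assumptions made for illustration, and only workdir and targets/deployment come from the tutorial above. It also skips the read-only marking and the rsync excludes, and it picks the "newest" tag by sorting names, which only works for timestamp-like names.

```python
import os
import shutil
import time
from pathlib import Path
from typing import Optional


def tag(root: Path, name: Optional[str] = None) -> str:
    """Copy workdir into tags/<name>; the name defaults to a timestamp."""
    name = name or time.strftime("%Y%m%dT%H%M%S")
    shutil.copytree(root / "workdir", root / "tags" / name, symlinks=True)
    return name


def _switch(link: Path, target: Path) -> None:
    """Repoint a symlink by renaming a temporary link over it."""
    tmp = link.with_name(link.name + ".tmp")
    if tmp.is_symlink():
        tmp.unlink()
    tmp.symlink_to(target)
    os.replace(tmp, link)  # the rename swaps the link in a single step


def deploy(root: Path, name: Optional[str] = None) -> str:
    """Point targets/deployment at a tag and remember the old target."""
    name = name or sorted(p.name for p in (root / "tags").iterdir())[-1]
    targets = root / "targets"
    targets.mkdir(exist_ok=True)
    current = targets / "deployment"
    if current.is_symlink():
        _switch(targets / "previous", Path(os.readlink(current)))
    _switch(current, root / "tags" / name)
    return name


def rollback(root: Path) -> str:
    """Point targets/deployment back at the previously deployed tag."""
    previous = Path(os.readlink(root / "targets" / "previous"))
    _switch(root / "targets" / "deployment", previous)
    return previous.name


def prune(root: Path, keep: int = 4) -> None:
    """Delete all but the newest `keep` tags (by name order)."""
    for old in sorted((root / "tags").iterdir())[:-keep]:
        shutil.rmtree(old)
```

The web server only ever follows the targets/deployment link, so switching versions never touches the files it is currently serving. A real tool would also make sure that pruning never deletes the tag that is deployed at the moment.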

The code is in the translatewiki repo but, if there is interest, I can move it to a separate repository on GitHub. Oregano is currently used in translatewiki.net and in a pet project of mine nicknamed InTense. If things go well, expect to hear more about this mysterious pet project in the future.

Midsummer cleanup: YAML and file formats, HHVM, translation memory

Wikimania 2014 is now over and that is a good excuse to write updates about the MediaWiki Translate extension and translatewiki.net.
I’ll start with an update related to our YAML format support, which has always been a bit shaky. Translate supports different libraries (we call them drivers) to parse and generate YAML files. Over time the Translate extension has supported four different drivers:

spyc uses spyc, a pure PHP library bundled with the Translate extension,
syck uses libsyck which is a C library (hard to find any details) which we call by shelling out to Perl,
syck-pecl uses libsyck via a PHP extension,
phpyaml uses the libyaml C library via a PHP extension.

The latest change is that I dropped syck-pecl because it does not seem to compile with PHP 5.5 anymore; and I added phpyaml. We tried to use spyc a bit but the output it produced for localisation files was not compatible with Ruby projects: after complaints, I had to find an alternative solution.

Joel Sahleen let me know of phpyaml, which I somehow did not find before: thanks to him we now use the same libyaml library that Ruby projects use, so we should be fully compatible. It is also the fastest driver of the four. I highly recommend that anyone generating YAML files with Translate use the phpyaml driver. I have not checked how phpyaml works with HHVM but I was told that HHVM ships with a built-in yaml extension.

Speaking of HHVM, the long-standing bug which causes HHVM to stop processing requests is still unsolved, but I was able to contribute some information upstream. In further testing we also discovered that emails sent via the MediaWiki JobQueue were not delivered, so there is some issue in command line mode. I have not yet had time to investigate this, so HHVM is currently disabled for both web requests and the command line.

I have a couple of refactoring projects for Translate going on. The first is about simplifying the StringMangler interface. This has no user visible changes, but the end goal is to make the code more testable and reduce coupling. For example the file format handler classes only need to know their own keys, not how those are converted to MediaWiki titles. The other refactoring I have just started is to split the current MessageCollection. Currently it manages a set of messages, handles message data loading and filters the collection. This might also bring performance improvements: we can be more intelligent and only load data we need.

Aiming high: creating a translation memory that works for Wikipedia; even though a long way from here (photo Marie-Lan Nguyen, CC BY 3.0)

Finally, at Wikimania I had a chance to talk about the future of our translation memory with Nik Everett and David Chan. In the short term, Nik is working on implementing in ElasticSearch an algorithm to sort all search results by edit distance. This should bring translation memory performance on par with the old Solr implementation. After that is done, we can finally retire Solr at the Wikimedia Foundation, which is much wanted especially as there are signs that Solr is having problems.

Together with David, I laid out some plans on how to go beyond simply comparing entire paragraphs by edit distance. One of his suggestions is to try doing edit distance over words instead of characters. When dealing with the 300 or so languages of Wikimedia, what is a word is less obvious than what is a character (even that is quite complicated), but I am planning to do some research in this area keeping the needs of the content translation extension in mind.
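To illustrate the difference between the two granularities, here is a small Python sketch of Levenshtein edit distance that works on any sequence, so the same function can compare characters or words. It is only an illustration, not the ElasticSearch or Translate implementation. Splitting on whitespace is an assumption that clearly does not hold for many of the languages mentioned above, which is exactly the open question.

```python
from typing import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance between two sequences (strings or token lists)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def word_distance(a: str, b: str) -> int:
    """Edit distance over whitespace-separated words (a naive tokenizer)."""
    return edit_distance(a.split(), b.split())
```

With this sketch, "Good morning everyone" and "Good evening everyone" are three character edits apart but only one word edit apart, so a word-level measure treats a replaced word as a single change regardless of its length.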

Performance is a feature

In case you haven’t already noticed, I like working on performance issues and performance improvements. Performance is a thing where you have to consider the whole stack: the speed of the server, efficient algorithms, server side caching, bandwidth and latency, client side caching and client side code. Here is a short recap of what has been done for translatewiki.net lately and some ideas for the future.

Recent improvements

Chrome 29 (or a later release) has added a helpful visualization for profiling data. In this image the speed of ULS JavaScript code is evaluated on a font-heavy page. Compared to the collapsible tabs feature, it is doing okay.

Server level. A month ago translatewiki.net got a new server with more memory and faster processors. The main benefit is that we can handle more simultaneous users and background tasks without them slowing each other down. At the same time, we upgraded many of the programs to newer versions. The switch from MySQL to MariaDB is the most important one. We haven’t tested it for our use case, but the Wikimedia Foundation found that the switch had an overall positive impact on performance.

Web server level. At the beginning of November I configured our nginx web server to enable support for the SPDY protocol. This should greatly reduce latency when browsing over HTTPS. We are considering switching to HTTPS by default. While tweaking nginx, I also fixed a few settings that relate to the compression and expiry times of JavaScript, SVG images and font assets when delivered to users. I used AWStats to see if our daily bandwidth usage decreased. It has not decreased significantly, but there is a lot of variation between days, which makes interpreting the data difficult. PageSpeed was used to ensure that caching headers are optimal and WebPagetest to confirm that pages load faster on different browsers in different places.

Application level. The Language Engineering team has recently worked a lot on the performance of Universal Language Selector (ULS) and Translate extensions. A short summary of the things which were done:

Reduce the amount of JavaScript and CSS delivered to the browser.
Delay the loading of JavaScript and CSS as much as possible (for example till the user opens ULS).
Optimize JPG, SVG and PNG images to the last byte with tools like jpegoptim, optipng.
Optimize the JavaScript to avoid slow actions (for example repaint events and DOM changes). We used Chrome’s JavaScript profiler as well as the experimental tool “show potential scroll bottlenecks” to identify issues and confirm the fixes (thanks Ori).

In addition I fixed a major performance issue in one of the Translate API modules by replacing an inefficient algorithm with a faster one. While investigating that issue, I also noticed that ReplacementArray-strtr was taking 20% or so of MediaWiki run time. There is a lesser-known PHP module, FastStringSearch, which was not installed on the new server. Installing that module made a big difference in the MediaWiki profiling table: ReplacementArray-fss is now taking only about 0.20% of MediaWiki run time.

Finally, a thing called module local storage was enabled on Wikimedia wikis a few days ago (the title of this post was taken from that discussion). As is usual for translatewiki.net, we were already beta testing that feature a few weeks before it went live on Wikimedia wikis.

Future plans

It is hard to plan further performance improvements, as the bottlenecks and the places where you can make the most difference for the least effort change constantly, together with the technology and your content. I believe that HHVM, a JIT PHP virtual machine, is likely to be the next step which will make a significant difference. It is however not a straightforward thing to jump from a normal PHP interpreter to HHVM, so I will be keeping a close eye on how my colleagues at the Wikimedia Foundation are progressing with the adoption of HHVM.

Another relatively small thing on the horizon is better compression of inline SVG images in CSS style sheets, by avoiding unnecessary base64 encoding. Or something else might happen even before it.
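As a rough illustration of why dropping base64 can help, the following Python sketch builds both kinds of data URI for a made-up SVG and compares their sizes after DEFLATE compression, the algorithm behind gzip. The sample image, the function names and the choice of which characters to leave unencoded are all assumptions for this example; which characters browsers accept unescaped inside a CSS url() is not covered here.

```python
import base64
import zlib
from urllib.parse import quote


def base64_data_uri(svg: str) -> str:
    """Data URI with the SVG payload encoded as base64."""
    payload = base64.b64encode(svg.encode("utf-8")).decode("ascii")
    return "data:image/svg+xml;base64," + payload


def text_data_uri(svg: str) -> str:
    """Data URI with the SVG payload percent-encoded as text."""
    return "data:image/svg+xml," + quote(svg, safe="/:=")


def compressed_size(text: str) -> int:
    """Size in bytes after DEFLATE at the highest compression level."""
    return len(zlib.compress(text.encode("ascii"), 9))


def sample_svg() -> str:
    """A made-up, repetitive SVG, similar in spirit to simple icons."""
    rects = "".join(
        f'<rect x="{i}" y="{i}" width="10" height="10" fill="#36c"/>'
        for i in range(0, 100, 5)
    )
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">'
        + rects + "</svg>"
    )


if __name__ == "__main__":
    svg = sample_svg()
    for label, uri in (("base64", base64_data_uri(svg)),
                       ("text", text_data_uri(svg))):
        print(label, len(uri), "bytes raw,", compressed_size(uri), "compressed")
```

The percent-encoded form can be longer before compression, but it keeps the repeated markup visible to the compressor, whereas base64 scrambles those repetitions. Since style sheets are normally served compressed, the compressed size is the one that matters.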

Finally, I’d like to highlight that while the application-level improvements automatically benefit third party users, there really isn’t any coherent documentation on how to improve performance of a MediaWiki site at all levels. Configuring localisation cache, nginx and/or Varnish, tweaking MySQL or MariaDB and installing Memcached or Redis should be part of any capable sysadmin’s skills; but even just tailoring them for MediaWiki, let alone knowing which PHP modules to install, is likely not known by many. For example, I wouldn’t be surprised if there were very few or even no sites using the FastStringSearch module outside of Wikimedia and translatewiki.net.

Pet project: Optimizing message index to the last byte

The message index is a crucial component of Translate, so as an experiment I implemented a trie store for the message index to optimize it. The short story is that I could not easily get it fast enough for practical use. Read on for the full story.

Pet projects

A tree in Helsinki (October) showing something tries can’t produce: wonderful fall colours (ruska in Finnish)

For context, in our development team each developer has time for experimentation, outside of the planned development sprint tasks. During that time the developer can try out new technologies, fix issues that are important to them personally or just do something fun and interesting. We call these pet projects and they let us do some cool things.

For example, the insertables I described in my previous blog post are something I did as a pet project. Insertables were actually part of the original translation UX (TUX) design specifications, but they were not implemented because of other priorities. I decided to implement them because users (not managers) were asking for them. I wasn’t convinced initially, but when I saw users translating with tablets I changed my mind. Insertables were a good pet project because they were a relatively small and fun thing to do.

This is all I have to say about pet projects – the non technical readers can skip the rest of this post, where I go into the details of this pet project.

Message index

I have probably introduced the message index in my earlier posts, but let me do it again quickly. I’ll use an example for this. Let’s assume we have a small piece of software called Greeter. It has a localisation file like this:

# l10n/en/greetings.properties
greeting.noon = Good day
greeting.morning = Good morning
greeting.evening = Good evening
greeting.night = Good night

When this kind of file is set up with the Translate extension (for instance in translatewiki.net), each string is stored as a wiki page. Each translation is a separate page, too.

translatewiki.net/wiki/Greeter:greeting.noon/en -> “Good day”
translatewiki.net/wiki/Greeter:greeting.noon/fi -> “Hyvää päivää”

The part of each address after /wiki/ is called the page title in MediaWiki. The message index can be defined simply as a map from the page title of each known message (without the language code) to the message group it belongs to. If we printed it out it would look something like this:

1244:greeting-noon => [greeter]
1244:greeting-morning => [greeter]
1244:greeting-evening => [greeter]
1244:greeting-night => [greeter]

So, every time someone adds a new message for translation, we need to update the message index. Every time someone makes a translation, we need to query the message index. The user is waiting, so both of these actions need to be fast, while using a reasonable amount of memory.
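The two operations can be sketched in Python with a plain dictionary. This is only an illustration of the data structure, not Translate's PHP code: the function names are made up, and treating everything after the last slash as the language code is a simplifying assumption.

```python
from typing import Dict, Iterable, List

# Maps a message's page title (without language code) to its group ids.
MessageIndex = Dict[str, List[str]]


def add_messages(index: MessageIndex, group: str, titles: Iterable[str]) -> None:
    """Update step: register every title of a group in the index."""
    for title in titles:
        groups = index.setdefault(title, [])
        if group not in groups:
            groups.append(group)


def groups_for_page(index: MessageIndex, page_title: str) -> List[str]:
    """Query step: find the groups for a translation page like 'key/fi'."""
    title = page_title.rsplit("/", 1)[0]  # drop the language code, if any
    return index.get(title, [])


if __name__ == "__main__":
    index: MessageIndex = {}
    add_messages(index, "greeter", [
        "1244:greeting-noon",
        "1244:greeting-morning",
        "1244:greeting-evening",
        "1244:greeting-night",
    ])
    print(groups_for_page(index, "1244:greeting-noon/fi"))
```

A hash map like this makes both operations simple; the difficulty described below is what such a map costs in memory and rebuild time once it holds tens of thousands of entries.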

Implementation problems

When we get to the order of 50 000 or even more known messages, creating and accessing the message index starts to get slow in PHP, even though it’s basically just a lot of strings, and string processing should be fast, right? Not so in PHP, where holding the message index as an array of arrays takes tens of megabytes of memory. An array in PHP is kind of a mix of a hash table and a linked list. It uses more memory in exchange for extra features and versatility. In the case of the message index we would gladly trade some features for reduced memory usage.

There are many aspects to message index optimization, but so far I haven’t found a solution without downsides. If the whole index were small enough, it could be kept in memory, making things faster; but currently it can only be stored in various kinds of databases that allow querying the index one title at a time.

Currently at translatewiki.net we are using CDB files, which are immutable databases stored in a file on the file system. This is okay for our use case: the index is accessed from disk, and only when the data changes do you have to build the whole thing from scratch and worry about memory usage and speed. The current problem we have with this approach is that it takes a lot of memory to recreate the index, and the running time of a few seconds is on the borderline of what is acceptable for making the user wait. There isn’t too much room for growth.

To reach the current state, I’ve tried using references to store the group names to avoid repeating them, and storing the resulting array in a serialized file. I’ve tried storing the whole structure in a database table, which works well up to a certain number of messages. This time I’m going to try something else. The idea is to save space by exploiting the fact that the message keys share a lot of substrings: for instance, the messages of a MediaWiki extension have all their keys prefixed with the extension’s name. I decided to experiment with tree structures.

Trees and tries

Disclaimer: I haven’t studied algorithms in depth so I’m just trying to apply what I know.

We can represent all the relationships between message names and their groups as a set of mostly similar strings which may share common prefixes. I could have used a plain prefix tree with one character per node, but I decided to use a compact trie (also called a radix tree), in which consecutive nodes which only have one child are merged together. Here is an example of what the message index above would look like in a trie (first image), compared to the full tree (second image). As you can see, the trie is more compact than the tree because it has fewer nodes and branches. The trie is also more compact than an array, as the common prefixes are stored only once and we are not using any of the hashes which are used in arrays.
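For readers who prefer code to pictures, here is a minimal Python sketch of such a compact trie with insertion and lookup. It is an independent illustration of the data structure, not the PHP code used in the experiment, and all class and function names are made up.

```python
from typing import List, Optional


class Node:
    """A trie node whose label may hold several characters."""

    def __init__(self, label: str = "", value: Optional[List[str]] = None):
        self.label = label
        self.value = value
        self.children: List["Node"] = []


def common_prefix_length(a: str, b: str) -> int:
    """Number of leading characters that a and b share."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return i


def insert(root: Node, key: str, value: List[str]) -> None:
    """Add key to the trie, splitting a node where the key diverges."""
    node = root
    while key:
        for child in node.children:
            shared = common_prefix_length(child.label, key)
            if shared == 0:
                continue
            if shared < len(child.label):
                # Keep the shared part here and push the rest one level down.
                rest = Node(child.label[shared:], child.value)
                rest.children = child.children
                child.label = child.label[:shared]
                child.value = None
                child.children = [rest]
            key = key[shared:]
            node = child
            break
        else:
            node.children.append(Node(key, value))
            return
    node.value = value


def lookup(root: Node, key: str) -> Optional[List[str]]:
    """Return the value stored for key, or None if it is not present."""
    node = root
    while key:
        for child in node.children:
            if key.startswith(child.label):
                key = key[len(child.label):]
                node = child
                break
        else:
            return None
    return node.value


def count_nodes(node: Node) -> int:
    """Count this node and everything below it."""
    return 1 + sum(count_nodes(child) for child in node.children)
```

Inserting the four greeting keys from the message index example stores the shared prefix 1244:greeting- in a single node, with the endings hanging below it.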

To create a message index using tries, I started by googling whether there are any algorithms already implemented in PHP for constructing tries. I could not find any, so I just converted into PHP a Python script (which was likely converted from Java). Then I implemented a custom binary format that could be stored in a file and a custom lookup that would use the data loaded from the file into memory. I tried many options for optimizing the creation of the trie while minimizing the storage consumption.

One of the curious things was that, when inserting a new string into the trie, it is faster to loop over all the current children of the node, comparing the first letter of each child against the first letter of the string we are inserting, than to use binary search to find the correct insertion point. The latter would mean keeping the list of children sorted and doing fewer comparisons by using binary search when doing lookups and insertions. I assume this is because inserting at the end of the array is fast, but inserting in the middle of the array (to keep it sorted) is slow because (my guess) PHP either recreates the array or updating the linked list pointers is slow for some other reason.
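The two strategies compared above can be sketched in Python as follows. This only illustrates the logic; Python lists behave differently from PHP arrays, so nothing here says which variant is faster in PHP. The function names are made up, and child labels are assumed to be non-empty with distinct first letters, as they are in a trie node.

```python
import bisect
from typing import List


def find_or_add_linear(children: List[str], label: str) -> int:
    """Scan all children; append the new label at the end if none matches."""
    for index, existing in enumerate(children):
        if existing[0] == label[0]:
            return index
    children.append(label)
    return len(children) - 1


def find_or_add_sorted(children: List[str], label: str) -> int:
    """Binary search a sorted child list; insert in place if none matches."""
    index = bisect.bisect_left(children, label[0])
    if index < len(children) and children[index][0] == label[0]:
        return index
    children.insert(index, label)  # shifts every later element
    return index
```

The sorted variant needs fewer comparisons per lookup, but each insertion into the middle has to move the elements after it, which is the cost suspected above.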

For the storage format I tried various kinds of indexes of strings to store the substrings only once, but all the pointers to the strings and child nodes also take a lot of space (4 bytes per pointer, where 4 bytes can also store four characters assuming ASCII keys). I’m sure more space savings could be gained by experimenting with alignments so that smaller pointers could be used. Maybe it would be possible to borrow some of the algorithms designed to optimize finite state automata – I believe those are much better than what I can do on my own.

Here are some numbers (approximate because I ran out of time to measure properly) on how it compares to the CDB message index solution:

Property	CDB	Trie
Size on disk	6 MiB	1.5 MiB (0.5 compressed with gzip)
Time to create	1 second	7 seconds

For now I declare this pet project as something that cannot be used. Maybe some day I will get back to it and try to make it good enough for real use, but I already have other interesting pet projects in mind. If I get suggestions from you on how to reach practical solutions, I will of course try them out sooner. I just want to mention that there are many things that could still be explored: QuickHash, constant hash database or finding ways to store group information so that the message index is not needed at all.

Insertables in Translate make translating easier

Insertables are a new tool to easily copy some text from the source language to your translation with one click.

Have you ever translated anything with the Translate extension? Did it contain markup like this?

[http://very.long.url/here link description]
{{GENDER:$1|he|she}} posted $2 on $3

If so, then you know what this is about. Have you ever translated anything with the Translate extension while using a tablet or another device without a physical keyboard? If so, then you likely know why this is interesting.

When you translate text written in wiki markup, or software interface strings, you will encounter the examples above, and many more parts which you need to copy verbatim while translating. These parts contain special characters like braces, dollar signs, brackets, pipes and so on. These characters are cumbersome to type on non-English keyboards, where they have been moved to harder-to-reach key combinations in favour of local characters, if they exist in the layout at all. If they don’t exist in the keyboard layout, you need to switch keyboard layouts just to type a few characters and then switch back.

Does this sound cumbersome? Many translators in fact do not do that, but instead they copy and paste the text from the source text. On tablets however, copy and paste itself is a cumbersome thing. Insertables are a solution to this usability issue.

We can automatically identify a part of the translatable text which has the following properties: it should not be changed and it is difficult to type. We can then present these parts of strings as buttons near the translation. Clicking or pressing that button inserts the text into the translation. These buttons complement the insert source text button and are optional to use, like all translation helpers we provide.

Happy translator using the new feature

As of now, we only detect a few types of these insertables: plural and grammar magic words, and variables in MediaWiki style ($1). See the Translate documentation for how to contribute more insertables.
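To give an idea of what such detection can look like, here is a small Python sketch that finds MediaWiki style variables and the opening part of plural and grammar magic words with regular expressions. It is not the code Translate uses: the patterns, the function name and the decision to offer only the opening part of a magic word are assumptions for illustration.

```python
import re
from typing import List

# MediaWiki style variables such as $1 or $12.
VARIABLE = re.compile(r"\$\d+")
# The opening part of a plural or grammar magic word, e.g. {{PLURAL:$1|
MAGIC_WORD = re.compile(r"\{\{(?:PLURAL|GRAMMAR):[^|{}]*\|")


def find_insertables(source: str) -> List[str]:
    """Return the distinct insertable snippets found in a source string."""
    found: List[str] = []
    for pattern in (VARIABLE, MAGIC_WORD):
        for match in pattern.finditer(source):
            if match.group(0) not in found:
                found.append(match.group(0))
    return found


if __name__ == "__main__":
    print(find_insertables("$1 has {{PLURAL:$2|one page|$2 pages}}"))
```

Each snippet found this way would become one button next to the translation field, so the translator never has to type the braces, pipes or dollar signs by hand.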

-- Niklas Laxström.
