
IWCLUL 3/3: conversations and ideas

In the IWCLUL talks, Miikka Silfverberg’s mention of collecting words from Wikipedia resonated with my earlier experiences working with Wikipedia dumps, especially how difficult that is. I talked with some people at the conference, and everyone seemed to agree that processing Wikipedia dumps takes a lot of time which they could spend on something else. I am considering publishing plain text Wikipedia dumps and word frequency lists. While working in the DigiSami project, I familiarized myself with the relevant utilities as well as the Wikimedia Tool Labs, so relatively little effort would be needed. The research value would be low, but it would be worth it if enough people find these dumps and save time. A recent update is that Parsoid is planning to provide a plain text format, so this is likely to become even easier in the future. Still, there might be some work left to collect the pages into one archive and to decide which parts of a page stay and which are removed: for example, converting an infobox into a collection of isolated words is not useful for use cases such as WikiTalk, and it can also easily skew word frequencies.
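
As an illustration, a first version of such a frequency list could be produced from a plain text dump with a classic shell pipeline like the one below. This is only a sketch: the input file name is made up, and the byte-oriented tr tokenisation mangles non-ASCII letters such as ä and ö, so a real pipeline would need a Unicode-aware tokeniser.

# Sketch: build a word frequency list from a plain text dump (file name is an assumption)
tr -cs '[:alpha:]' '\n' < fiwiki-plaintext.txt |
  tr '[:upper:]' '[:lower:]' |
  sort | uniq -c | sort -rn > fiwiki-frequency-list.txt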

I talked with Sjur Moshagen about keyboards for less resourced languages. Nowadays they have keyboards for Android and iOS, in addition to the keyboards for computers which already existed. They have some impressive additional features, like automatically adding missing accents to typed words. That would be too complicated to implement in jquery.ime, a project used by Wikimedia that implements keyboards in the browser; at least the aforementioned feature relies on a finite state transducer. Running finite state tools in the browser does not yet feel realistic, even though some solutions exist*. The alternative of making requests to a remote service would slow down typing, except perhaps with some very clever implementation, which would probably be fragile at best. I still have to investigate whether there is some middle ground for bringing the basic keyboard implementations to jquery.ime.

*Such as jsfst. One issue is that the implementations and the transducers themselves can take a lot of space, which means we would run into the same issues as when distributing large web fonts on Wikipedia.

I spoke with Tommi Pirinen and Antti Kanner about implementing a dictionary application programming interface (API) for the Bank of Finnish Terminology in Arts and Sciences (BFT). It would allow direct use of BFT resources in translation tools like translatewiki.net and Wikimedia’s Content Translation project. It would also help indirectly: a data dump could be used to extend the word lists of the Apertium machine translation software.

I spoke briefly about language identification with Tommi Jauhiainen, who had a poster presentation about the project “The Finno-Ugric languages and the internet”. I had implemented one language detector myself, using an existing library. Curiously enough, many other people I have met in Wikimedia circles have also made their own implementations. Mine had severe problems classifying languages which are very close to each other. Tommi gave me a link to another language detector, which I would like to test in the future to compare its performance with my previous attempts. We also talked about something I call “continuous” language identification, where the detector finds the parts of running text which are in a different language. A normal language detector will be useful for my open source translation memory service project, called InTense. Continuous language identification could be used to post-process Wikipedia articles and tag foreign text so that correct fonts are applied, and possibly also in WikiTalk-like applications, to give the text-to-speech (TTS) system a hint on how to pronounce those words.

Reasonator entry for Kimmo Koskenniemi

Reasonator is software that generates visually pleasing summary pages, with natural language text and structured sections, based on structured data. More specifically, it uses Wikidata, the Wikimedia structured data project developed by Wikimedia Germany. Reasonator works primarily for persons, though support for other kinds of subjects is being developed. Its localisation is limited, compared to the roughly three hundred languages of MediaWiki. Translating software which generates natural language sentences dynamically is very different from the usual software translation, which consists mostly of fixed strings with an occasional placeholder that is replaced dynamically when the text is shown to a user.

Using Grammatical Framework (GF), an interlingua-based translation system, for Reasonator is not a new idea; in fact, I had proposed it earlier in private discussions with Gerard Meijssen. This conference renewed my interest in the idea, as I attended the GF workshop held by Aarne Ranta, Inari Listenmaa and Francis Tyers. GF seems to be a good fit here, as it allows limited-context and limited-vocabulary translation into many languages simultaneously; conversely, Wikidata contains information such as the gender of a person, which can be fed to GF to get grammatically correct output in the generated translations. It would be very interesting to have a prototype of a Reasonator-like tool using GF as the backend. The downside of GF is that (I assume) it is not easy for our regular translators to work with, so work would be needed to make it easier and more accessible. The hypothesis is that with a GF backend we would get better language support (as in grammatically correct and flexible text) with less effort in the long run. That would mean providing access to all Wikidata topics even in smaller languages, without the effort of manually writing articles.
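
As a small, concrete example of the kind of structured data that could drive such text generation, the gender of a person can be read directly from Wikidata. The snippet below is only an illustration: Q42 is an arbitrary example item, P21 is the sex-or-gender property, and jq is assumed to be installed.

# Fetch an item's data and print its sex/gender (P21) value as an item ID, e.g. Q6581097 ("male")
curl -s https://www.wikidata.org/wiki/Special:EntityData/Q42.json |
  jq -r '.entities.Q42.claims.P21[0].mainsnak.datavalue.value.id'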

Oregano deployment tool

This blog post introduces oregano, a non-complex, non-distributed, non-realtime deployment tool. It currently consists of less than 100 lines of shell script and is licensed under the MIT license.

The problem. For a very long time, we have run translatewiki.net straight from a git clone, or an svn checkout before that. For years, we have been the one wiki which systematically runs the latest master, with a few hours of delay. That was not a problem while we were young and wild. But nowadays, because we carry dozens of local patches and because of the introduction of Composer, it is quite likely that git pull --rebase will stop at a merge conflict. As a consequence, updates have become less frequent, but they have semi-regularly brought the site down for many minutes until the merge conflicts were manually resolved. This had to change.

The solution. I wrote a simple tool, probably re-inventing the wheel for the hundredth time, which separates the current deployment in two stages: preparation and pushing out new code. Since I have been learning a lot about Salt and its quirks, I named my tool “oregano”.

How it works. Basically, oregano is a simple wrapper around symbolic links and rsync. The idea is that you prepare your code in a directory named workdir. To deploy the current state of workdir, you must first create a read-only copy of it by running oregano tag. After that, you can run oregano deploy, which updates the symbolic links so that your web server sees the new code. You can give the name of the tag with both commands, but by default oregano will name a new tag after the current timestamp and deploy the most recently created tag. If, after deploying, you find out that the new tag is broken, you can quickly go back to the previously deployed code by running oregano rollback. This is shown below as a command line tutorial.

mkdir /srv/mediawiki/ # the path does not matter, pick whatever you want

cd /srv/mediawiki

# Get MediaWiki. Everything we want to deploy must be inside workdir
git clone https://github.com/wikimedia/mediawiki workdir

oregano tag
oregano deploy

# Now we can use /srv/mediawiki/targets/deployment wherever we want to deploy
ln -s /srv/mediawiki/targets/deployment /www/example.com/docroot/mediawiki

# To update and deploy a new version
cd workdir
git pull
# You can run maintenance scripts, change configuration etc. here
nano LocalSettings.php

cd .. # Must be in the directory where workdir is located
oregano tag
oregano deploy

# Whoops, we accidentally introduced a syntax error in LocalSettings.php
oregano rollback

As you can see from the above, it is still possible to break the site if you do not check what you are deploying. For this purpose I might add support for hooks, so that one could run syntax checks whose failure would prevent deploying that code. Hooks would also be handy for sending IRC notifications, which is something our existing scripts do when the code is updated: as pushing out the code is now a separate step, those notifications currently fire at the wrong time.
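
For example, a pre-deploy hook could run a simple PHP lint pass over the working directory; the command below is only a sketch of that idea, since hooks do not exist yet.

# Hypothetical pre-deploy check: exits non-zero if any PHP file in workdir fails php -l
find workdir -name '*.php' -print0 | xargs -0 -r -n 1 php -l > /dev/null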

By default oregano keeps the 4 newest tags, so make sure you have enough disk space. For translatewiki.net, which has MediaWiki and dozens of extensions, each tag takes about 200 MB. If you store the MediaWiki localisation cache, pre-generated for all languages, inside workdir, you would need about 1.2 GB for each tag. Currently, at translatewiki.net, we store the localisation cache outside workdir, which means it is out of sync with the code. We will see whether that causes any issues, and move it inside workdir if needed. Do note that oregano creates a tag with rsync --cvs-exclude to save space. That has the caveat that you should not name any files or directories core, as they would be silently left out of the tag. Be warned; patches welcome.
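
To make the mechanism more concrete, below is a rough sketch of what the tag, deploy and rollback steps could look like with rsync and symbolic links. It is not the actual oregano code; the directory layout, tag naming and rollback logic are simplifications based on the description above.

#!/bin/sh
# Sketch only; the real script handles details such as remembering which tag is deployed.
set -eu
cmd=$1

case "$cmd" in
tag)
    # Snapshot workdir into a read-only copy under tags/, named after the timestamp by default
    tag=${2:-$(date +%Y%m%d-%H%M%S)}
    mkdir -p "tags/$tag"
    rsync -a --delete --cvs-exclude workdir/ "tags/$tag/"
    chmod -R a-w "tags/$tag"
    ;;
deploy)
    # Point targets/deployment at the given tag, or at the newest one (timestamps sort chronologically)
    tag=${2:-$(ls tags | tail -n 1)}
    mkdir -p targets
    ln -sfn "../tags/$tag" targets/deployment
    ;;
rollback)
    # Simplification: go back to the second newest tag
    prev=$(ls tags | tail -n 2 | head -n 1)
    ln -sfn "../tags/$prev" targets/deployment
    ;;
esac

The targets/deployment path stays constant, so the web server configuration never needs to change; only the symbolic link behind it is updated.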

The code is in the translatewiki repository but, if there is interest, I can move it to a separate repository on GitHub. Oregano is currently used on translatewiki.net and in a pet project of mine nicknamed InTense. If things go well, expect to hear more about this mysterious pet project in the future.
