Niklas Laxström

Doing stuff with language and translation.

Report on Wikimedia Hackathon 2017 in Vienna

Long time no post! Let’s fix that with another report from my travels. This one is mostly about work.

The Wikimedia hackathon was held in Vienna in May 2017. It is an event where many MediaWiki developers come to meet and work together on all kinds of things, the more experienced developers helping the newcomers. For me this was one of the best events of this kind I have attended, because it was very well organized and I had a good balance between working on things and helping others.

The main theme for me for this hackathon was translatewiki.net. This was great, because recently I have not had as much time to work on improving translatewiki.net as I used to. But this does not mean there hasn’t been any progress; I just haven’t made any noise about it. For example, we have greatly increased automation for importing and exporting translations, new projects are being added, the operating systems have been updated, and so on. But let’s talk about what happened during the hackathon.

I worked with Nemo_bis and Siebrand (they did most of the work) to go over the backlog of support requests from translatewiki.net. We addressed more than half of the 101 open support requests, among other things by adding support for 7 locales. Relatedly, we also helped a couple of people to start translating at translatewiki.net or to contribute to the CLDR language database for their language.

Siebrand and I held an open post-mortem (anyone could join) about a 25-hour downtime that hit translatewiki.net just before the event. There we reflected on how we handled the situation and how we could do better in the future. The main take-aways are better communication (Twitter, a status page), a server upgrade and using it for increased redundancy, and periodically practising restores to ensure we can recover quickly if the need arises.

Amir revived his old project that allows translating messages using a chat application (the prototype uses Telegram). Many people (at least Amir, Taras, Mt.Du and I) worked on different aspects of that project. I installed the MediaWiki OAuth extension on translatewiki.net (without which it would not be possible to make the translations under the correct user name) and gave over-the-shoulder help to the coders.

Hackathon attendees working on their computers. Photo CC-BY-SA 3.0 by Nemo_bis

As always, some bugs were found and fixed during the hackathon. I fixed an issue in Translate where the machine translation suggestions from the Apertium service hosted by the Wikimedia Foundation were not showing up. I also reported an issue with our discussion extension (LiquidThreads) showing two toolbars instead of one; this was quickly fixed by Bartosz and Ed.

Finally, I would like to advertise the presentation about MediaWiki best practices I gave in the Fantastic MediaWikis track. It summarizes a few of the best practices I have come up with while maintaining translatewiki.net and many other MediaWiki sites. It has tips about deployment, job queue configuration and short main page URLs; the slides are available.

As a small bonus, I finally updated my blog to use HTTPS, so that I could write and you could read this post safely, knowing that nobody but me could have put all the bad puns in it.

A travelogue from India

I visited India on a combined work and holiday trip. Here is a short illustrated account.

Bangalore

The first week consisted of a meeting of our work team. We stayed at the Royal Orchid Central, which was also a popular party venue. One day a note had been left in the room announcing a DJ party and apologising for the noise it might cause (which luckily could not be heard in my room).

The temperature was said to be exceptionally high, around 40 degrees at its hottest. In the meeting rooms two ceiling fans were there to help us.

During the week I got to explore the local food culture from many angles; it was not as fiery as on my earlier trips, or perhaps I had simply become better accustomed to it. The food was always good, ranging from local delicacies to the cheeseburger I ate at a place called “the white room”, which to my mind was more of a BBQ burger anyway.

Iced tea served in a bathtub-shaped vessel.

Delhi

After work a few of us headed to Delhi for a day; there it was even warmer than in Bangalore, with the temperature rising to 43 degrees at its peak. In Delhi we stayed in an old house that had been restored into a guesthouse. Its lobby was full of various paintings, and a mango tree with its mangoes could be seen from the room’s terrace. We visited Humayun’s tomb, which looked splendid as it had recently been restored. We ate raan, a spiced leg of lamb.

Agra

From Delhi the journey continued with a roughly three-hour drive to Agra, where we of course visited the Taj Mahal and the Agra Fort. And of course it was hot here too, the sun blazing from a cloudless sky. At least the sunscreen kept my skin from burning.

Two sights in one picture! A view from the Agra Fort towards the Taj Mahal.

Nainital

From Agra the journey went back to Delhi, and after an overnight stay there began a roughly seven-hour drive into the Himalayas, to Nainital. All I can say is that anything can happen on the road: lane markings are merely indicative, cars sit across the road or drive in the wrong direction, overtaking happens like in Russia, and the horn is in constant use. In the mountains the roads are also very narrow and winding, and they show signs of landslides. This time I saw no elephants on the road, but monkeys, dogs, cows, camels, mules and horses, yes. One of the trip’s exciting moments came when a truck driving ahead of us had stopped and one of the steel cables it was carrying caught onto an oncoming vehicle. The monkeys at the scene naturally laughed up their sleeves.

A rare sight: a cow beside the road instead of on it.

Nainital is a “small” village in the mountains on a lakeshore. The name means the lake of the eye, and according to legend it was formed at the spot where Sati’s eye fell.

The view from the hotel room in the morning.

We went rowing on the lake (with a rower), but one could also have got there in a familiar-looking vehicle.

Binsar

From Nainital the trip culminated in its hottest and coolest place. At this altitude the temperature began to approach a Finnish summer. A roughly three-hour drive took us even higher, to the Binsar wildlife sanctuary. Inside it stands the British-built Grand Oak Manor, which has been converted into a hotel (quite respectfully, in my opinion).

For the last leg of the journey, the car was swapped for a jeep.

A cosy hotel room.

The hotel also had a sauna. How did they ever get it up to the top? It probably isn’t heated with the hotel’s solar power.

In Binsar our plan was to relax, see the snowy mountain peaks and spot wild animals. And how did it turn out…

Exotic wild animals, part 1/4. A brimstone butterfly.

Exotic wild animals, part 2/4. A watchful lizard.

Exotic wild animals, part 3/4. A grasshopper can be found in the middle of the picture.

Exotic (wild?) animals, part 4/4. An Indian dog.

The leopards and Asiatic black bears went unseen, but perhaps that was safer. The visit got extra excitement from the nearby forest fires, which both hid the view of the mountains behind their smoke and temporarily closed the road leading up here.

The forest fire came very close indeed. Some of the fires were controlled burns, meant to stop uncontrolled fires from spreading.

Luckily everything went well, and we were able to leave on schedule, marvelling at the burnt areas. From there the journey continued back towards Delhi, a drive of about 10 hours, followed by the flight home.

Mountain traffic and a monkey.

It is smoky and dry.

Traffic in Delhi. On this road too there was no shortage of people going the wrong way, a roller skater, and road crossers ranging from daredevils to small schoolchildren. A horse cart can be seen above the rear of the bus.

MediaWiki short urls with nginx and main page without redirect

This post was updated on 2015-09-06 with simplified code suggested by Krinkle, and again on 2017-04-04.

Google PageSpeed Insights writes:

Redirects trigger an additional HTTP request-response cycle and delay page rendering. In the best case, each redirect will add a single round trip (HTTP request-response), and in the worst it may result in multiple additional round trips to perform the DNS lookup, TCP handshake, and TLS negotiation in addition to the additional HTTP request-response cycle. As a result, you should minimize use of redirects to improve site performance.

Let’s consider the situation where you run MediaWiki as the main thing on your domain. When a user goes to your domain example.com, MediaWiki will by default issue a redirect to example.com/wiki/Main_Page, assuming you have configured the recommended short URLs.

In addition, the short URL manual page says:

Note that we do not recommend doing a HTTP redirect to your wiki path or main page directly. As redirecting to the main page directly will hard-code variable parts of your wiki’s page setup into your server config. And redirecting to the wiki path will result in two redirects. Simply rewrite the root path to MediaWiki and it will take care of the 301 redirect to the main page itself.

So are we stuck with a suboptimal solution? Fortunately, there is a way out, and it is not even that complicated. I will share example snippets from the translatewiki.net configuration showing how to do it.

Configuring nginx

For nginx, the only thing we need in addition to the default wiki short URL rewrite is to rewrite / so that it is forwarded to MediaWiki. The configuration below assumes MediaWiki is installed in the w directory under the document root.

location ~ ^/wiki/ {
	rewrite ^ /w/index.php;
}

location = / {
	rewrite ^ /w/index.php;
}

Whole file for the curious.

Configuring MediaWiki

First, in our LocalSettings.php we have the short url configuration:

$wgArticlePath      = "/wiki/$1";
$wgScriptPath       = "/w";

In addition, we use hooks to tell MediaWiki to make / the URL of the main page and not to redirect it:

$wgHooks['GetLocalURL'][] = function ( &$title, &$url, $query ) {
	// Rewrite only local links to the main page that have no query string
	if ( !$title->isExternal() && $query === '' && $title->isMainPage() ) {
		$url = '/';
	}
};

// Tell MediaWiki that "/" should not be redirected
$wgHooks['TestCanonicalRedirect'][] = function ( $request ) {
	return $request->getRequestURL() !== '/';
};

This has the added benefit that all MediaWiki-generated links to the main page point to the domain root, so you only have one canonical URL for the wiki main page. The query check in the first hook forces URLs with query parameters (which contain the problematic characters ? and &) to keep the long form, because they would not work correctly with this nginx rewrite rule.

And that’s it. With these changes you can have your main page displayed on your domain without a redirect, while also keeping the URL short for users to copy and share. This method should work for most versions of MediaWiki, including MediaWiki 1.26, which forcefully redirects everything that doesn’t match the canonical URL as seen by MediaWiki.

translatewiki.net – harder, better, faster, stronger

I am very pleased to announce that translatewiki.net has been migrated to new servers sponsored by netcup GmbH. Yes, that is right, we now have two servers, both of which are more powerful than the old server.

Because the two (virtual) servers are located in the same data center, among other nitty-gritty details, we are not making them redundant for the sake of load balancing or uptime. Rather, we have split the services: ElasticSearch runs on one server, powering search, translation search and the translation memory; everything else runs on the other server.

In addition to faster servers and continuous performance tweaks, we are now faster thanks to the migration from PHP to HHVM. The Wikimedia Foundation did this a while ago with great results, but HHVM had been crashing and freezing on translatewiki.net for unknown reasons. Fortunately, I recently found a lead indicating that the issue is related to the ini_set function, which I was easily able to work around while the investigation into the root cause continues.

Non-free Google Analytics confirms that we now serve pages faster: the small speech bubble indicates migration day to new servers and HHVM. Effect on the actual page load times observed by users seems to be less significant.

We now again have lots of room for growth, and I challenge everyone to make us grow with more translations, new projects or other legitimate means, so that we reach a point where we need to upgrade again ;). That’s all for now, stay tuned for more updates.

14 more languages “fully” translated this week

This week, MediaWiki’s priority messages have been fully translated into 14 more languages by about a dozen translators, after we checked our progress. Most users in those languages now see the interface of Wikimedia wikis entirely translated.

In the two months since we updated the list of priority translations, the number of languages that are 99+ % translated went from 17 to 60. No encouragement was even needed: those 60 languages are “organically” active, and translators quickly rushed to use the new tool we gave them. Such regular and committed translators deserve a ton of gratitude!

However, we want to do better. We did something simple: tell MediaWiki users that they can make a difference, even if they don’t know it. «With a couple hours’ work or less, you can make sure that nearly all visitors see the wiki interface fully translated.» The results we got in a few hours speak for themselves:

Special:TranslationStats graph of daily registrations

This week’s peak of new translator daily registrations was ten times the usual

Special:TranslationStats of daily active translators

Many were eager to help: translation activity jumped immediately

Thanks especially to CERminator, David1010, EileenSanda, KartikMistry, Njardarlogar, Pymouss, Ranveig, Servien, StanProg, Sudo77(new), TomášPolonec and Чаховіч Уладзіслаў, who completed priority messages in their languages.

For the curious, the steps to solicit activity were:

There is a long tail of users who see talk page messages only after weeks or months, so for most of those 60 languages we hope to get more translations later. It will be harder to reach the other hundreds of languages, for which there are only 300 active users in Wikimedia according to interface language preferences: about 100 incubating languages do not have a single known speaker on any wiki!

We will need a lot of creativity and word-spreading, but the lesson is simple: show people the difference their contribution can make for free knowledge, and the response will be great. Also, do try to reach the long tail of users and languages: if you do it well, you can communicate effectively with a large audience of silent and seemingly unresponsive users on hundreds of Wikimedia projects.

IWCLUL 3/3: conversations and ideas

In the IWCLUL talks, Miikka Silfverberg’s mention of collecting words from Wikipedia resonated with my earlier experiences of working with Wikipedia dumps, especially the difficulty of it. I talked with some people at the conference and everyone seemed to agree that processing Wikipedia dumps takes a lot of time which they could spend on something else. I am considering publishing plain text Wikipedia dumps and word frequency lists. While working in the DigiSami project, I familiarized myself with the relevant utilities as well as the Wikimedia Tool Labs, so relatively little effort would be needed. The research value would be low, but it would be worth it if enough people find these dumps and save time. A recent update is that Parsoid is planning to provide a plain text format, so this is likely to become even easier in the future. Still, there might be some work to do to collect pages into one archive and to decide which parts of a page stay and which are removed: for example, converting an infobox into a collection of isolated words is not useful for use cases such as WikiTalk, and it can also easily skew word frequencies.

I talked with Sjur Moshagen about keyboards for less resourced languages. Nowadays they have keyboards for Android and iOS, in addition to the keyboards for computers which already existed. They have some impressive additional features, like automatically adding missing accents to typed words. That would be too complicated to implement in jquery.ime, a project used by Wikimedia that implements keyboards in a browser. At least the aforementioned accent feature uses a finite state transducer. Running finite state tools in the browser does not yet feel realistic, even though some solutions exist*. The alternative of making requests to a remote service would slow down typing, except perhaps with some very clever implementation, which would probably be fragile at best. I still have to investigate whether there is some middle ground to bring the basic keyboard implementations to jquery.ime.

*Such as jsfst. One issue is that the implementations and the transducers themselves can take a lot of space, which means we would run into the same issues as when distributing large web fonts on Wikipedia.

I spoke with Tommi Pirinen and Antti Kanner about implementing a dictionary application programming interface (API) for the Bank of Finnish Terminology in Arts and Sciences (BFT). That would allow direct use of BFT resources in translation tools like translatewiki.net and Wikimedia’s Content Translation project. It would also help indirectly, by using a dump for extending word lists in the Apertium machine translation software.

I spoke briefly about language identification with Tommi Jauhiainen, who had a poster presentation about the project “The Finno-Ugric languages and the internet”. I had implemented a language detector myself, using an existing library. Curiously enough, many other people I have met in Wikimedia circles have also made their own implementations. Mine had severe problems classifying languages which are very close to each other. Tommi gave me a link to another language detector, which I would like to test in the future to compare its performance with my previous attempts. We also talked about something I call “continuous” language identification, where the detector detects the parts of running text which are in a different language. A normal language detector will be useful for my open source translation memory service project, called InTense. Continuous language identification could be used to post-process Wikipedia articles and tag foreign text so that correct fonts are applied, and possibly also in WikiTalk-like applications, to give the text-to-speech (TTS) system a hint on how to pronounce those words.

Reasonator entry for Kimmo Koskenniemi

Reasonator is software that generates visually pleasing summary pages, in natural language and structured sections, based on structured data. More specifically, it uses Wikidata, the Wikimedia structured data project developed by Wikimedia Germany. Reasonator works primarily for persons, though other types of subjects are being developed. Its localisation is limited compared to the roughly three hundred languages of MediaWiki. Translating software which generates natural language sentences dynamically is very different from the usual software translation, which consists mostly of fixed strings with an occasional placeholder that is replaced dynamically when showing the text to a user.

It is not a new idea to use Grammatical Framework (GF), a language translation software based on an interlingua, for Reasonator. In fact I had proposed this earlier in private discussions with Gerard Meijssen, but this conference renewed my interest in the idea, as I attended the GF workshop held by Aarne Ranta, Inari Listenmaa and Francis Tyers. GF seems to be a good fit here, as it allows limited-context and limited-vocabulary translation into many languages simultaneously; vice versa, Wikidata contains information like the gender of people, which can be fed to GF to get proper grammar in the generated translations. It would be very interesting to have a prototype of Reasonator-like software using GF as the backend. The downside of GF is that (I assume) it is not easy for our regular translators to work with, so work is needed to make it easier and more accessible. The hypothesis is that with a GF backend we would get better language support (as in grammatically correct and flexible) with less effort in the long run. That would mean providing access to all the Wikidata topics even in smaller languages, without the effort of manually writing articles.

IWCLUL 2/3: morphology, OCR, a corpus vs. Wiktionary

More on IWCLUL: now on to the sessions. The first session of the day was by the invited speaker Kimmo Koskenniemi. He is applying his two-level formalism in a new area, old literary Finnish (example of old literary Finnish). By using two-level rules for old written Finnish together with OMorFi, he is able to automatically convert old text to standard Finnish dictionary forms, which can be used, in the main example, as input to a search engine. He uses weighted transducers to rank the most likely equivalent modern-day words. For example, the contemporary spelling of wijsautta is viisautta, which is an inflected form of the noun viisaus (wisdom). He only takes the dictionary forms, because otherwise there are too many unrelated suggestions. This avoids the usual problem of too many unrelated morphological analyses: I had the same problem in my master’s thesis when I attempted to use OMorFi to improve Wikimedia’s search system, which was still using Lucene at that time.

Jeremy Bradley gave a presentation about an online Mari corpus. Their goal was to make a modern English-language textbook for Mari, for people who do not have access to native speakers. I was happy to see they used a free/copyleft Creative Commons license. I asked him whether they had considered Wiktionary. He told me he had discussed it with a person from Wiktionary who was against an import. I will be reaching out to my contacts to see whether another attempt will succeed. The automatic transliteration between Latin, Cyrillic and IPA was nice, as I have been entertaining the idea of doing transliteration from Swedish to Finnish for WikiTalk, to make it able to function in Swedish as well using only Finnish speech components. One point sticks with me: they had to add information about verb complements themselves, as these were not recorded in their sources. I can sympathize with them based on my own language learning experiences.

Stig-Arne Grönroos’ presentation on Low-resource active learning of North Sámi morphological segmentation did not contain any surprises for me after having been exposed to this topic previously. All efforts to support languages where we have to cope with limited resources are welcome and needed. Intermediate results are better than working with nothing while waiting for a full morphological analyser, for example. It is not completely obvious to me how this tool can be used in other language technology applications, so I will be happy to see an example.

Miikka Silfverberg presented on OCR using OMorFi: can morphological analyzers improve the quality of optical character recognition? To summarize heavily, OCR performed worse when OMorFi was used, compared to just taking the top N most common words from Wikipedia. I understood this is not exactly the same problem as the large number of readings generated by a morphological analyser, but rather something different yet related.

Prioritizing MediaWiki’s translation strings

After a very long wait, MediaWiki’s top 500 most important messages are back at translatewiki.net with fresh data. This list helps translators prioritize their work to get the most out of their effort.

What are the most important messages

In this blog post the term message means a translatable string in software; technically, when a message is shown to users, they see different strings depending on their interface language.
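
To make this concrete, here is a minimal, purely illustrative sketch in PHP: the message key, the array layout and the helper function below are made up for this post and are not how MediaWiki stores its messages (real messages live in per-language i18n files).

// Illustrative only: one message key, several per-language strings.
$messages = [
	'en' => [ 'example-edit-button' => 'Edit' ],
	'fi' => [ 'example-edit-button' => 'Muokkaa' ],
	'de' => [ 'example-edit-button' => 'Bearbeiten' ],
];

// Pick the string for the user's interface language, falling back to English.
function getMessage( array $messages, string $lang, string $key ): string {
	return $messages[$lang][$key] ?? $messages['en'][$key];
}

echo getMessage( $messages, 'fi', 'example-edit-button' ); // prints "Muokkaa"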

MediaWiki software includes almost 5.000 messages (~40.000 words), or almost 24.000 messages (~177.000 words) if we include extensions. Since 2007, we have maintained a list of about 500 messages which are used most frequently.

Why? If translators can translate a few hundred words per hour, and translating messages is probably slower than translating running text, it will take weeks to translate everything. Most of our volunteer translators do not have that much time.

Assuming that the messages follow a long tail pattern, a small number of messages are shown* to users very often, like the Edit button at the top of a page in MediaWiki. On the other hand, most messages are only shown in rare error conditions or are part of disabled or restricted features. Thus it makes sense to translate the most visible messages first.

Concretely, translators and i18n fans can easily monitor the progress of MediaWiki localisation by finding meaningful numbers on our statistics page; and we have a clear minimum service level for new locales added to MediaWiki. In particular, the Wikimedia Language committee requires that at the very least all the most important messages are translated into a language before that language is given a Wikimedia project subdomain. This gives an incentive to kickstart the localisation of new languages, ensures that users see Wikimedia projects mostly in their own language and avoids linguistic colonialism.

Screenshot of fi.wiktionary.org

The screenshot shows an example page with messages replaced by their key instead of their string content.

Some history and statistics

The usage of the list for monitoring was fantastically impactful in 2007 and 2009, when translatewiki.net was still ramping up, because it gave translators concrete goals and allowed us to streamline the language proposal mechanism, which had been trapped in a dilemma between a growing number of requests for language subdomains and a growing number of seemingly dead open subdomains. There is some more background on translatewiki.net.

Languages with over 99 % most used messages translated were:

There is much more to do, but we now have a functional tool to motivate translators! To reach the peak of 2011, the least translated language among the first 181 will have to translate 233 messages, which is a feasible task. The 300th language is 30 % translated and needs 404 more translations. If we reached such a number, we could confidently say that we really have Wikimedia projects in 280+ languages, however small.

* Not necessarily seen: I’m sure you don’t read the whole sidebar and footer every time you load a page in Wikipedia.

Process

At Wikimedia, we first logged, for about 30 minutes, all requests to fetch certain messages by their key. We used this as a proxy variable for how often a particular message is shown to the user, which again is a proxy for how often a particular message is seen by the user. This is in no way an exact measurement, but I believe it is good enough for the purpose. After the 30 minutes, we counted how many times each key was requested and sorted by frequency. The result was a list containing about 17.000 different keys observed in over 15 million calls. This concluded the first phase.
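
As an illustration, here is a minimal sketch of that counting step, assuming a hypothetical plain text log with one requested message key per line; the file name and format are made up, since the real logging happened inside Wikimedia’s infrastructure.

// Sketch: count how often each message key appears in a request log
// and print the keys from most to least requested.
$counts = [];
$handle = fopen( 'message-key-requests.log', 'r' );
while ( ( $line = fgets( $handle ) ) !== false ) {
	$key = trim( $line );
	if ( $key !== '' ) {
		$counts[$key] = ( $counts[$key] ?? 0 ) + 1;
	}
}
fclose( $handle );

arsort( $counts ); // highest counts first
foreach ( $counts as $key => $count ) {
	echo "$count\t$key\n";
}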

In the second phase, we applied a rigorous human cleanup to the list with the help of a script, as follows:

  1. We removed all keys not belonging to MediaWiki or any extension. There are lots of keys which can be customized locally, but which don’t correspond to messages to translate.
  2. We removed all messages which were tagged as “ignored” in our system. These messages are not available for translation, usually because they have no linguistic content or are used only for local site-specific customization.
  3. We removed messages called fewer than 100 times in the time span, and other messages with no meaningful linguistic content, like messages consisting only of dashes or other punctuation which usually don’t need any changes in translation (see the sketch after this list).
  4. We removed any messages we judged to be technical or not shown often to humans, even though they appeared high in this list. This includes some messages which are only seen inside comments in the generated HTML and some messages related to APIs or EXIF features.
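
Here is the sketch referenced above, covering only the mechanical filters (steps 2 and 3) and continuing from the $counts array built earlier; the ignored-key list is a made-up placeholder standing in for the keys tagged as ignored in our system.

// Continuing from $counts above: drop ignored keys and rarely requested keys.
$ignoredKeys = [ 'example-ignored-key' ]; // placeholder; the real list comes from our message group configuration
$filtered = array_filter(
	$counts,
	function ( $count, $key ) use ( $ignoredKeys ) {
		return $count >= 100 && !in_array( $key, $ignoredKeys, true );
	},
	ARRAY_FILTER_USE_BOTH
);
// Steps 1 and 4 (keys not belonging to MediaWiki, technical messages)
// still need human judgment and are not automated here.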

Finally, some coding work was needed by yours truly to let users select those messages for translation at translatewiki.net.

Discoveries

In this process some points emerged that are worth highlighting.

  • 310 messages (62 %) of the previous list (from 2011) are in the new list as well. Superseded user login messages have now been removed.
  • Unsurprisingly, there are new entries from new highly visible extensions like MobileFrontend, Translate, Collection and Echo. However, except for a dozen languages, translators didn’t manage to keep up with such messages in the absence of a list.
  • I just realized that we are probably missing some high-visibility messages only used on the JavaScript side. That is something we should address in the future.
  • We slightly expanded the list from 500 to 600 messages, after noticing there were few or no “important” messages beyond this point. This also gives some breathing space for when messages get removed from the software.
  • We did not follow a manual passage as in the original list, which included «messages that are not that often used, but important for a proper look and feel for all users: create account, sign on, page history, delete page, move page, protect page, watchlist». A message like “watchlist” got removed, which may raise suspicions: but it’s “just” the HTML title of Special:Watchlist, more or less as important as the name “Special:Watchlist” itself, which is not included in the list either (magic words, namespaces or special pages names are not included). All in all, the list seems plausible.

Conclusion

Finally, the aim was to make this process reproducible so that we can do it yearly, or even more often. I hope this blog post serves as documentation for achieving that.

I want to thank Ori Livneh for getting the key counts and Nemo for curating the list.

IWCLUL event report 1/3: the story

IWCLUL is short for International Workshop on Computational Linguistics for Uralic Languages. I attended the conference, held on January 16th 2015, and presented a joint paper with Antti on Multilingual Semantic MediaWiki for Finno-Ugric dictionaries at the poster session.

I attentively observe the glimmering city lights of Tromsø as the plane lands in darkness, to orient myself against the maps I studied on my computer before the trip. At the airport I receive a kind welcome from Trond, in Finnish, together with a group of other people going to the conference. While he is driving us to our hotels, Trond elaborates on the sights of the island we pass by. Antti, the co-author of our paper about Multilingual Semantic MediaWiki, and I check in to the hotel and joke about the tendency to forget posters in different places.

The next morning I meet Stig-Arne at breakfast. We decide to go see the local cable car. We wander around the city center until we finally find a place that sells bus tickets. We had asked a few people, but they gave conflicting directions. We take the bus and then Fjellheisen, the cable car, to the top. The sights are wonderful even in winter. I head back and do some walking in the center. I buy some postcards and use that as an excuse to get inside and warm up.

On Friday, the conference day, almost by a miracle, we end up at the conference venue without too many issues, despite seeing no signs on the University of Tromsø campus. More information about the conference itself will be provided in the following parts. And the poster? We forgot to take it with us from the social event after the conference.

Travels

You are in Israel. With a group of friends. You are the only one who doesn’t yet have a train ticket. You try to buy a train ticket. There is a ticket machine, but it looks a bit different from what you are used to. To pay with a credit card you need to control the reading power of the card reader manually. You adjust the power knob, but before you have time to think about what happened it gets overloaded, and what remains of your credit card is a charred piece and smoke.

You are now stuck in Israel with no money. You run from one light rail to another with the group of friends. Due to jet lag you move slowly and almost miss the connection.

You try to get some people to help you, but they are very reluctant. You are supposed to go to a hotel in another town and you need money to pay for the room. Yet nobody promises even to give you money for a train ticket.

Later, the group of friends goes away. You are alone. But you have a laptop. You use the laptop to send an email with the title “[URGENT]: stuck in Israel”. The body of the email is colorful and has a physical texture. As you continue to type, the text stops making any sense, deteriorating into single letters here and there.

You try to Google any info about the Finnish consulate without luck. The battery is almost empty. Suddenly you hear weird noises. You hear a voice not unlike Gollum. Your eyes enlarge wide open as you stare a humanoid monster. You toss away the laptop quickly but calmly next to the rails and flee up to small stairs leading to a bathroom. You lock the door but you can’t help but to realize the lock wont