Monthly Archives: March 2012

Report from the Multilingual Web Workshop

I attended the W3C Workshop about multilingual web with Gerard Meijssen for the Wikimedia Localisation team. Aside from the long list of new things you learn at every conference, this time I was surprised by the number of links that appeared between things I already knew. For example, META-SHARE was mentioned multiple times in different contexts.

Presentations. The workshop was split into two days. The first day was packed with short presentations from participants. Some observations:

  • The keynote was about the semantic web and how it can help us reach a multilingual web.
  • Microsoft presented their translation toolkit. It didn’t seem to include translation management at all: “You can then send the empty translation file by email”. Also in the example application mph was not localised to km/h.
  • There was a poster presentation about open source language guessers. We do tag the language used in Wikipedia pages, but still most of the guessers didn’t get it right. To me this says that there is training data out there, but nobody bothers to use it.
  • New language related features (bidirectional text, ruby, translate-flag) in HTML5 ignited lots of discussion: they were welcomed but people wanted to do more.
  • XSL-FO is still years ahead of CSS by having direction-neutral start, end, before and after keywords. That is one of the few features I like in that language.
  • Some WTF-moments: “unicode languages”, using flags for languages and locales and one of the best practices for bidirectional text was to “avoid using it”.
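
To illustrate the direction-neutral keywords mentioned in the XSL-FO item above, here is a minimal sketch in Python. The table and the function name resolve_side are hypothetical, made up for this illustration, and the mapping only covers horizontal text that flows from top to bottom. The point is that a layout written with start and end does not need to be rewritten for right-to-left languages; only the lookup changes.

```python
# Hypothetical lookup table: logical (direction-neutral) keywords mapped to
# physical sides. Assumes horizontal lines that stack from top to bottom.
PHYSICAL_SIDES = {
    "ltr": {"start": "left", "end": "right", "before": "top", "after": "bottom"},
    "rtl": {"start": "right", "end": "left", "before": "top", "after": "bottom"},
}


def resolve_side(keyword: str, direction: str) -> str:
    """Return the physical side that a logical keyword refers to."""
    return PHYSICAL_SIDES[direction][keyword]


# The same logical rule, "put the margin at the start", lands on
# different physical sides depending on the text direction.
print(resolve_side("start", "ltr"))
print(resolve_side("start", "rtl"))
```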

Open linked data. There is a big demand for all kinds of linguistic data. One of the discussion groups on the second day was about open linked data. It was emphasized that open data means that the data is in a standard format, not tied to one application. But for me an explicit open license is more important, since it allows converting a proprietary format into other formats and *distributing* them.

Open linked data. Links are another side of open linked data. Links were said to be as important as the data itself, something easy to agree with. What would for example Wikipedia be without links? The number of links is increasing, but currently the links are clustered into centers. Links are crucial to discover what data is actually available, but projects like META-SHARE do their part too. For me this compares closely to the UNIX philosophy of having each tool do one thing and do it well.
An example of this idea is in the Bank of Finnish Terminology in Arts and Sciences. Contributors are encouraged to write short definitions for terms, while long explanations are better suited to be included in Wikipedia. We are also using Semantic MediaWiki to increase the links inside the data itself.

Open linked data. A type of linked open data I would like to see is translation memory data. This is also something the open source and open content projects including Wikimedia and can contribute, since we have lots of translations that can be used to build translation memories and parallel corpuses. Have you ever wanted to compare the same text in 50+ languages? We have it. I also see nice post-processing possibilities to increase the usefulness of the data by doing sentence or even word level alignment; we only have paragraph alignment for now.
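
As a rough sketch of what building a translation memory from paragraph-aligned translations could look like, consider the Python snippet below. The data layout (a dictionary from language code to a list of paragraphs) and the function name build_translation_memory are assumptions made for this illustration, not how any existing tool stores its data.

```python
def build_translation_memory(translations, source_lang, target_lang):
    """Pair up paragraphs of two languages that are aligned by position.

    `translations` is assumed to map a language code to a list of
    paragraphs, where paragraph i is the same text in every language.
    Paragraphs missing in either language (empty strings) are skipped.
    """
    source = translations[source_lang]
    target = translations[target_lang]
    if len(source) != len(target):
        raise ValueError("paragraph lists are not aligned")
    return [(s, t) for s, t in zip(source, target) if s and t]


# Made-up sample data: the second paragraph has no Finnish translation yet.
page = {
    "en": ["Welcome to the wiki.", "Please log in."],
    "fi": ["Tervetuloa wikiin.", ""],
}
print(build_translation_memory(page, "en", "fi"))
```

Sentence or word level alignment would be a post-processing step on top of pairs like these.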

Updates on translation review feature of Translate extension

About three months ago I blogged about the translation review feature that we developed for the Translate extension. It is time to have a look at how it has been received. Thanks to Siebrand Mazeland we can now draw graphs for review and reviewer activity. This feature came just in time for the Gnome 3.4 Finnish Translation Sprint that I’m organizing. If you look at its main page, you can see graphs for translation and review activity. The activity isn’t exactly over the top, so if you speak or can translate into Finnish, please join and help us.

I’m aware of three places using this feature:, Wikimedia Foundation and the translation sprint mentioned above. In the review ability is not as open as I originally envisioned it to be: only experienced translators can get it by request.  Only about 2% of over 3500 registered translators currently have the review right in For the other two places, everyone who can translate can also review.

When looking at the graphs for we can see without doubt that translation reviewing activity is not yet anywhere near the translation activity, and we should remember that there is also a huge backlog of previous translations that should be reviewed. We don’t even see steady growth in the review activity (around the change of the year we had a translation sprint which temporarily increased translation and review activity to higher than normal levels). We don’t have graphs for Wikimedia projects yet, but looking at the logs the review feature seems to be in relatively more active use there. I would personally like to see all new translations from now on reviewed by at least one other user.

The next step would be to add a review level column to Special:LanguageStats and Special:MessageGroupStats pages. That would need some idea on how to convey both quantity and coverage. For example, a hundred translators reviewing the same message doesn’t mean that the review coverage is good. Perhaps we should just start with coverage and bring quantity later. This could be a nice small project for someone who wants to help to develop the Translate extension with help from us.
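
To make the difference between quantity and coverage concrete, here is a small Python sketch. The input format (pairs of message id and reviewer name) and the function name review_stats are hypothetical; the Translate extension does not necessarily store or compute anything this way.

```python
def review_stats(reviews, total_messages):
    """Return (quantity, coverage) for a message group.

    `reviews` is assumed to be an iterable of (message_id, reviewer) pairs.
    Quantity counts every review; coverage is the share of messages that
    have been reviewed at least once.
    """
    reviews = list(reviews)
    quantity = len(reviews)
    reviewed_messages = {message_id for message_id, _ in reviews}
    coverage = len(reviewed_messages) / total_messages if total_messages else 0.0
    return quantity, coverage


# A hundred translators reviewing the same message in a group of 50 messages:
# the quantity is high, but only one message out of 50 is covered.
same_message = [("msg-1", f"user{i}") for i in range(100)]
print(review_stats(same_message, 50))
```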
