Report from the Multilingual Web Workshop

I attended the W3C workshop on the multilingual web with Gerard Meijssen for the Wikimedia Localisation team. Aside from the long list of new things you learn at every conference, this time I was surprised by the number of links that appeared between things I already knew. For example, META-SHARE was mentioned multiple times in different contexts.

Presentations. The workshop was split into two days. The first day was packed with short presentations from participants. Some observations:

  • The keynote was about the semantic web and how it can help us reach a multilingual web.
  • Microsoft presented their translation toolkit. It didn’t seem to include translation management at all: “You can then send the empty translation file by email”. Also in the example application mph was not localised to km/h.
  • There was a poster presentation about open source language guessers. We do tag the language used in Wikipedia pages, but still most of the guessers didn’t get it right. To me this says that there is training data out there, but nobody bothers to use it.
  • New language related features (bidirectional text, ruby, translate-flag) in HTML5 ignited lots of discussion: they were welcomed but people wanted to do more.
  • XSL-FO is still years ahead of CSS by having direction-neutral start, end, before and after keywords. That is one of the few features I like in that language.
  • Some WTF-moments: “unicode languages”, using flags for languages and locales, and one of the best practices for bidirectional text being to “avoid using it”.
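
To illustrate the point about language guessers and training data: any text that is already tagged with its language can be turned into a model. Below is a minimal sketch of a character trigram guesser. It is my own toy example, not the method of any guesser shown at the workshop; the function names and the two training sentences are made up for illustration, and a real model would be trained on far more text, for instance language-tagged wiki pages.

```python
import math
from collections import Counter


def trigrams(text):
    """Return the character trigrams of text, padded so word edges count."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(tagged_texts):
    """Build one trigram count table per language from (language, text) pairs."""
    models = {}
    for lang, text in tagged_texts:
        models.setdefault(lang, Counter()).update(trigrams(text))
    return models


def guess(models, text):
    """Return the language whose trigram counts best explain text."""
    # Add-one smoothing so unseen trigrams do not give a zero probability.
    vocab = len({gram for counts in models.values() for gram in counts})
    best_lang, best_score = None, -math.inf
    for lang, counts in models.items():
        total = sum(counts.values())
        score = sum(
            math.log((counts[gram] + 1) / (total + vocab))
            for gram in trigrams(text)
        )
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


# Hypothetical training data; in practice this would be tagged page text.
models = train([
    ("en", "the quick brown fox jumps over the lazy dog and the cat sat "
           "on the mat with the other animals"),
    ("fi", "nopea ruskea kettu hyppää laiskan koiran yli ja kissa istui "
           "matolla muiden eläinten kanssa"),
])
print(guess(models, "the dog and the cat"))
print(guess(models, "kissa ja koira"))
```

With only one sentence per language this is fragile, which is exactly why the amount of tagged text matters.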

Open linked data. There is a big demand for all kinds of linguistic data. One of the discussion groups on the second day was about open linked data. It was emphasized that open data means that the data is in a standard format, not tied to one application. But for me an explicit open license is more important, since it allows converting the proprietary format into other formats and *distributing* them.

Links are another side of open linked data. Links were said to be as important as the data itself, something easy to agree with. What would, for example, Wikipedia be without links? The number of links is increasing, but currently the links are clustered into centers. Links are crucial for discovering what data is actually available, but projects like META-SHARE do their part too. For me this compares closely to the UNIX philosophy of having each tool do one thing and do it well.
An example of this idea is in the Bank of Finnish Terminology in Arts and Sciences. Contributors are encouraged to write short definitions for terms, while long explanations are better suited to be included in Wikipedia. We are also using Semantic MediaWiki to increase the links inside the data itself.

A type of linked open data I would like to see is translation memory data. This is also something that open source and open content projects, including Wikimedia and translatewiki.net, can contribute, since we have lots of translations that can be used to build translation memories and parallel corpora. Have you ever wanted to compare the same text in 50+ languages? We have it. I also see nice post-processing possibilities to increase the usefulness of the data by doing sentence or even word level alignment; we only have paragraph alignment for now.
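
As a rough idea of what going from paragraph alignment to sentence alignment could look like, here is a deliberately naive sketch. It assumes that when a paragraph and its translation split into the same number of sentences, the sentences correspond in order; otherwise it keeps the paragraph pair as it is. The function names and the example strings are hypothetical, and this is not how our data is processed today. A serious aligner would use something like length-based alignment (for example the Gale-Church method) instead of this shortcut.

```python
import re

# Split after sentence-final punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(paragraph):
    """Split a paragraph into sentences with a simple punctuation rule."""
    return [s for s in _SENTENCE_END.split(paragraph.strip()) if s]


def align_paragraph(source, target):
    """Refine one paragraph-aligned pair into sentence pairs where possible.

    If both sides have the same number of sentences, pair them in order.
    Otherwise fall back to the original paragraph pair, so nothing is lost.
    """
    src = split_sentences(source)
    tgt = split_sentences(target)
    if len(src) == len(tgt):
        return list(zip(src, tgt))
    return [(source.strip(), target.strip())]


pairs = align_paragraph(
    "Save the page. Then reload it.",
    "Tallenna sivu. Lataa se sitten uudelleen.",
)
for source_sentence, target_sentence in pairs:
    print(source_sentence, "|", target_sentence)
```

Falling back to the paragraph pair keeps the output usable as a translation memory even when the heuristic cannot decide.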
