Monthly Archives: August 2012

Language validation in MediaWiki

Validating language codes like en or fi or chr might seem to be an easy task at first. You would expect this problem is already solved in MediaWiki, but that is far from the truth.

In fact, we are not even handling language codes, but language tags as defined by IETF. The linked standard brings together many standards, like the two and three letter language codes from ISO 639 standards, script names and region names, and more. This means that we have to handle language tags like pt-BR, sr-Latn, be-x-old and of course in the mix are invalid tags like de-formal and tokipona, and deprecated language codes like bat-smg (better: sgs).

The language tags are case insensitive, but there is preferred casing for different parts. MediaWiki has wfBCP47() which handles the “pretty-formatting”.

Let me list the language tag validation functions that already exists…

  • Language::isValidCode() – Contrary to its name, this function only checks that the language tag doesn’t contain certain characters which are not valid in page names or unsafe in html. Recently we had some issues with XSS exploits when code expected language codes to be html safe.
  • Language::isValidBuiltinCode() – This is slightly more strict, it only accepts language tags which consist of letters a-z, numbers 0-9 and hyphens.
…and what I think should exist – these will be probably implemented very soon:
  • Language::isKnownLanguageTag() – Checks that the language tag is known to MediaWiki. This basically means that we know the name of the language in English or in another language. Sources of known language codes are the built-in Names.php, the codes optionally added through the CLDR extension and the list of language names in English (pending merge).
  • Language::isSupportedLanguageTag() – Checks whether any localisation is available for that language tag in MediaWiki (MessagesXx.php exists).
I can also imagine a use case for:
  • Language::isWellFormedLanguageTag() – Checks whether the language tag is well formed. Like isKnownLanguageTag but less tight and more flexible. Would accept non-sense stuff like fi-Cyrl-JA-x-foo that semantically makes no sense but is valid according to the rules.

Wikimania videos: the next billion users on Wikipedia and beyond

Wikimedia DC has started publishing the Wikimania videos on YouTube. They are not split by presentation, only by track, but here are some about localisation and internationalisation.

My Wikimania presentation (see my previous post), Translating the wiki way (starts at 28:05; watch on YouTube):

Amir’s Supporting languages, all of them and Siebrand’s A Tale of Language Support and Ask the Language Support People (watch on YouTube):

Santhosh’s Read and Write in your language has not been published yet and nobody seems to know if it will, or if it has been recorded at all.

Alolita’s The next billion users on Wikipedia with Open Source Webfonts and Amir’s The software localization paradox (watch on YouTube):

See also the category on Wikimania wiki for abstracts and slides for these presentations.

My presentations at Akademy and Wikimania

In July I gave two presentations: one at Akademy 2012 in Tallinn, and one at Wikimania 2012.

Short summary of my Akademy presentation (slides): If you are translating content in MediaWiki and you are not using Translate extension, you are doing it wrong. Statistics, translation and proofreading interface – you get them all with Translate. Because Translate keeps track of changes to pages, you can spend your time translating instead of trying to figure what needs translating or updating.

Also, have a look at UserBase, it has now been updated to include the latest features and fixes of Translate extension, like the ability to group translatable pages into larger groups.

Akademy presenation by Niklas and Claus: click for video. Yes, there’s a a typo.

Short summary of my Wikimania presentation (slides; video not yet available): Stop wasting translators’ time.
Forget signing up to e-mail lists, forget sending files back and forth. Use translation platforms that move files from and to the version control system transparently to the translator.
If you have sentences split into multiple messages, you are doing it wrong. If your i18n framework doesn’t have support for plural, gender and grammar dependent translations, you are doing it wrong. If you are not documenting your interface messages for translators, you are doing it wrong.

Niklas maybe having fun at Library of Congress. Photo tychay, CC-BY-NC-ND

-- .