Language validation in MediaWiki

Validating language codes like en, fi or chr might seem like an easy task at first. You would expect this problem to be already solved in MediaWiki, but that is far from the truth.

In fact, we are not even handling language codes, but language tags as defined by the IETF. That standard (BCP 47) brings together many other standards, like the two- and three-letter language codes from the ISO 639 standards, script names, region names and more. This means that we have to handle language tags like pt-BR, sr-Latn and be-x-old. Also in the mix, of course, are invalid tags like de-formal and tokipona, and deprecated language codes like bat-smg (better: sgs).

Language tags are case-insensitive, but there is a preferred casing for the different parts. MediaWiki has wfBCP47(), which handles the “pretty-formatting”.
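
To make the preferred casing concrete, here is a minimal sketch in Python. It is not the code of wfBCP47(); the function name format_language_tag is made up for this example, and the rule it applies is the usual BCP 47 convention: everything lowercase, except that four-letter subtags get an initial capital and two-letter subtags are uppercased, unless they start the tag or come after a single-character subtag such as x.

```python
def format_language_tag(tag: str) -> str:
    """Apply conventional BCP 47 casing to a language tag (illustrative sketch)."""
    formatted = []
    after_singleton = False  # becomes True once a one-character subtag is seen
    for index, subtag in enumerate(tag.split("-")):
        if index == 0 or after_singleton:
            # The first subtag and everything after a singleton stay lowercase.
            formatted.append(subtag.lower())
        elif len(subtag) == 4 and subtag.isalpha():
            # Four letters in this position: treated as a script, e.g. Latn.
            formatted.append(subtag.title())
        elif len(subtag) == 2 and subtag.isalpha():
            # Two letters in this position: treated as a region, e.g. BR.
            formatted.append(subtag.upper())
        else:
            formatted.append(subtag.lower())
        if len(subtag) == 1:
            after_singleton = True
    return "-".join(formatted)


for raw in ("PT-br", "SR-LATN", "BE-X-OLD"):
    print(raw, "->", format_language_tag(raw))
```

Note that this only changes the casing. It does not check whether the tag means anything, which is the subject of the rest of this post.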

Let me list the language tag validation functions that already exist…

  • Language::isValidCode() – Contrary to its name, this function only checks that the language tag doesn’t contain certain characters which are not valid in page names or are unsafe in HTML. Recently we had some issues with XSS exploits where code expected language codes to be HTML-safe.
  • Language::isValidBuiltinCode() – This is slightly stricter: it only accepts language tags which consist of letters a-z, numbers 0-9 and hyphens.
…and what I think should exist – these will probably be implemented very soon:
  • Language::isKnownLanguageTag() – Checks that the language tag is known to MediaWiki. This basically means that we know the name of the language in English or in another language. Sources of known language codes are the built-in Names.php, the codes optionally added through the CLDR extension and the list of language names in English (pending merge).
  • Language::isSupportedLanguageTag() – Checks whether any localisation is available for that language tag in MediaWiki (MessagesXx.php exists).
I can also imagine a use case for:
  • Language::isWellFormedLanguageTag() – Checks whether the language tag is well formed. Like isKnownLanguageTag, but looser: it would accept nonsense like fi-Cyrl-JA-x-foo, which makes no sense semantically but follows the syntax rules.
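
The difference between these levels of strictness can be sketched in a few lines of Python. None of this is MediaWiki code: the function names are Python stand-ins for the methods listed above, KNOWN_NAMES is a tiny hypothetical table standing in for Names.php and the other name sources, and the regular expression is my reading of the BCP 47 langtag grammar, leaving out grandfathered tags and tags that consist only of private-use subtags.

```python
import re

# Character-level check as described above: only a-z, 0-9 and hyphens.
BUILTIN_CODE_RE = re.compile(r"[a-z0-9-]+")

# Shape of a BCP 47 langtag (simplified, see the assumptions stated above).
WELL_FORMED_RE = re.compile(
    r"""
    (?:[a-z]{2,3}(?:-[a-z]{3}){0,3}|[a-z]{4,8})   # language, optional extlangs
    (?:-[a-z]{4})?                                # script
    (?:-(?:[a-z]{2}|[0-9]{3}))?                   # region
    (?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*      # variants
    (?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*          # extensions
    (?:-x(?:-[a-z0-9]{1,8})+)?                    # private use
    """,
    re.VERBOSE | re.IGNORECASE,
)

# Hypothetical stand-in for the real sources of language names.
KNOWN_NAMES = {
    "en": "English",
    "fi": "Finnish",
    "chr": "Cherokee",
    "pt-br": "Brazilian Portuguese",
}


def is_valid_builtin_code(code: str) -> bool:
    """Only checks which characters are used, nothing about structure."""
    return BUILTIN_CODE_RE.fullmatch(code) is not None


def is_well_formed_language_tag(tag: str) -> bool:
    """Checks the syntax of the tag, not whether its subtags mean anything."""
    return WELL_FORMED_RE.fullmatch(tag) is not None


def is_known_language_tag(tag: str) -> bool:
    """Checks that we have a name for the tag (case-insensitive lookup)."""
    return tag.lower() in KNOWN_NAMES


for tag in ("pt-BR", "fi-Cyrl-JA-x-foo", "tokipona", "en_US"):
    print(tag, is_well_formed_language_tag(tag), is_known_language_tag(tag))
```

With this sketch, fi-Cyrl-JA-x-foo, de-formal and tokipona all pass the well-formedness check while failing the known-tag check, which is exactly why a syntax check alone cannot replace a lookup against a list of known tags.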

