Validating language codes like chr might seem to be an easy task at first. You would expect that this problem has already been solved in MediaWiki, but that is far from the truth.
In fact, we are not even handling language codes, but language tags as defined by the IETF in BCP 47. That standard brings together several others: the two- and three-letter language codes from the ISO 639 standards, script names, region names, and more. This means that we have to handle language tags like be-x-old, and of course in the mix are invalid tags like tokipona, as well as deprecated language codes.
Language tags are case insensitive, but there is a preferred casing for the different parts. MediaWiki has wfBCP47(), which handles the “pretty-formatting”.
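To make the preferred casing concrete, here is a minimal Python sketch of the convention the standard recommends: subtags are lowercase, except that a two-letter subtag that is not the first one is written in uppercase (a region) and a four-letter one in title case (a script). The function name format_bcp47_case is hypothetical, and this is not the code of wfBCP47(), which may differ in details.

```python
def format_bcp47_case(tag: str) -> str:
    """Apply the conventional BCP 47 casing to a language tag (sketch)."""
    formatted = []
    seen_singleton = False  # set once a one-character subtag such as "x" appears
    for index, subtag in enumerate(tag.split("-")):
        if index == 0 or seen_singleton:
            # The first subtag and everything after a singleton stay lowercase.
            # Lowercasing everything after the singleton is a choice made here.
            formatted.append(subtag.lower())
        elif len(subtag) == 2:
            # Two-letter subtag in a later position: region, uppercase.
            formatted.append(subtag.upper())
        elif len(subtag) == 4:
            # Four-letter subtag in a later position: script, title case.
            formatted.append(subtag.capitalize())
        else:
            formatted.append(subtag.lower())
        if len(subtag) == 1:
            seen_singleton = True
    return "-".join(formatted)


if __name__ == "__main__":
    for example in ("FI-cyrl-ja-X-FOO", "BE-X-OLD", "zh-hant-tw"):
        print(example, "->", format_bcp47_case(example))
```

Because tags are case insensitive, a function like this only changes how a tag is displayed, not which language it identifies.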
Let me list the language tag validation functions that already exist:
- Language::isValidCode() – Contrary to its name, this function only checks that the language tag doesn’t contain certain characters which are invalid in page names or unsafe in HTML. Recently we had some issues with XSS exploits when code expected language codes to be HTML safe.
- Language::isValidBuiltinCode() – This is slightly stricter: it only accepts language tags which consist of the letters a-z, the digits 0-9 and hyphens.
- Language::isKnownLanguageTag() – Checks that the language tag is known to MediaWiki. This basically means that we know the name of the language in English or in another language. Sources of known language codes are the built-in Names.php, the codes optionally added through the CLDR extension and the list of language names in English (pending merge).
- Language::isSupportedLanguageTag() – Checks whether any localisation is available for that language tag in MediaWiki (MessagesXx.php exists).
- Language::isWellFormedLanguageTag() – Checks whether the language tag is well formed, that is, whether it follows the syntax of the standard. It is looser than isKnownLanguageTag() and accepts nonsense like fi-Cyrl-JA-x-foo, which semantically makes no sense but is valid according to the rules.
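
The difference between these levels of strictness is easier to see side by side. The following Python sketch is not MediaWiki code: the function names and the KNOWN_TAGS table are hypothetical, the character check assumes lowercase letters only, and the syntax pattern is a simplified reading of the BCP 47 grammar that leaves out grandfathered tags and tags consisting only of private use subtags.

```python
import re

# Character-level check: only a-z, 0-9 and hyphens. It says nothing about
# the structure of the tag. Restricting it to lowercase is an assumption.
BUILTIN_CHARS = re.compile(r"[a-z0-9-]+")

# Simplified BCP 47 "langtag" syntax (no grandfathered or private-only tags).
WELL_FORMED = re.compile(
    r"""
    (?P<language>[a-z]{2,3}(?:-[a-z]{3}){0,3}|[a-z]{4,8})  # language, optional extlang
    (?:-(?P<script>[a-z]{4}))?                              # script, e.g. Cyrl
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?                     # region, e.g. FI or 419
    (?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*                # variants
    (?:-[a-wy-z0-9](?:-[a-z0-9]{2,8})+)*                    # extensions
    (?:-(?P<private>x(?:-[a-z0-9]{1,8})+))?                 # private use
    """,
    re.VERBOSE | re.IGNORECASE,
)

# Hypothetical stand-in for a table of languages whose names are known.
KNOWN_TAGS = {"chr", "fi", "be-x-old"}


def has_builtin_characters(tag: str) -> bool:
    """True if the tag uses only the characters a-z, 0-9 and hyphen."""
    return BUILTIN_CHARS.fullmatch(tag) is not None


def is_well_formed(tag: str) -> bool:
    """True if the tag follows the simplified syntax above."""
    return WELL_FORMED.fullmatch(tag) is not None


def is_known(tag: str) -> bool:
    """True if the tag is listed in the hypothetical table of known tags."""
    return tag.lower() in KNOWN_TAGS


if __name__ == "__main__":
    for tag in ("chr", "be-x-old", "tokipona", "fi-Cyrl-JA-x-foo", "en--us"):
        print(tag, has_builtin_characters(tag), is_well_formed(tag), is_known(tag))
```

In this sketch, en--us passes the character check but fails the syntax check, while tokipona and fi-Cyrl-JA-x-foo pass the syntax check but are missing from the table of known tags. A purely syntactic check cannot tell a real language from an invented one, which is why the stricter functions consult a list.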