Posts Tagged ‘CLDR’

MediaWiki i18n explained: {{PLURAL}}

Sunday, January 5th, 2014

This post explains how MediaWiki handles plural rules to developers who need to work with it. In other words, how a string like “This wiki {{PLURAL:$1|0=does not have any pages|has one page|has $1 pages}}” becomes “This wiki has 425 pages”.

Rules

As mentioned before we have adopted a data-based approach. Our plural rules come from Unicode CLDR (Common Locale Data repository) in XML format and are stored in languages/data/plurals.xml. These rules are supplemented by local overrides in languages/data/plurals-mediawiki.xml for languages not supported by CLDR or where we are yet to unify our existing local rules to match CLDR rules.

As a short recap, translators handle plurals by writing all possible different forms explicitly. That means that there are different forms for singular, dual, plural, etc., depending on what grammatical numbers the language has. There might be more forms because of other grammatical reasons, for example in Russian the grammatical case of the noun varies depending on the number. The rules from CLDR put all numbers into different boxes, each box corresponding to one form provided by the translator.

Preprocessing

The plural rules are stored in localisation cache (not to be confused with message cache and many other caches in MediaWiki) with other language specific localisation data. The localisation cache can be stored in different places depending on configuration. The default is to use the SQL database, but they can also be in CDB files as they are at the Wikimedia Foundation and translatewiki.net.

The whole process starts 1) when the user runs php maintenance/rebuildLocalisationCache.php, or 2) during a web request, if the cache is stale and automatic cache rebuild is allowed (as by default).

The code proceeds as follows:

LocalisationCache::readSourceFilesAndRegisterDeps

  • LocalisationCache::getPluralRules fills pluralRules
    • LocalisationCache::loadPluralFiles loads both xml files, merges them and stores the result in in-process cache
  • LocalisationCache::getComplisedPluralRules fills compiledPluralRules
    • LocalisationCache::loadPluralFiles returns rules from in-process cache
    • CLDRPluralRuleEvaluator::compile compiles the standard notation into RPN notation
  • LocalisationCache::getPluralTypes fills pluralRuleTypes

So now for the given language we have three lists (see table 1). The pluralRules are used in frontend (JavaScript) and the compiledPluralRules are used in the backend (PHP) with a custom evaluator. Tim Starling wrote the custom evaluator for performance reasons. The pluralRuleTypes stores the map between numerical indexes and CLDR keywords, which are not used in MediaWiki plural syntax. Please note that Russian has four plural forms: the fourth form, called other, is used when none of the other rules match and is not stored anywhere.

Table 1: Stored plural data for Russian
pluralRuleTypes pluralRules compiledPluralRules
“one” “n mod 10 is 1 and n mod 100 is not 11” “n 10 mod 1 is n 100 mod 11 is-not and”
“few” “n mod 10 in 2..4 and n mod 100 not in 12..14” “n 10 mod 2 4 .. in n 100 mod 12 14 .. not-in and”
“many” “n mod 10 is 0 or n mod 10 in 5..9 or n mod 100 in 11..14” “n 10 mod 0 is n 10 mod 5 9 .. in or n 100 mod 11 14 .. in or”

The cache also stores the magic word PLURAL, defined in languages/messages/MessageEn.php and translated to other languages, so in Finnish language wikis they can use {{MONIKKO:$1|$1 talo|$1 taloa}} if they so want. For compatibility reasons, in all interface translations these magic words are used in English.

Invocation on backend

There are roughly three ways to trigger plural parsing:

  1. using the plural syntax in a wiki page,
  2. calling the plural parser with Message object with output format text,
  3. using the plural syntax in a message with output format parse, which calls full wikitext parser as in 1.

In all cases, we will get into Parser::replaceVariables, which expands all magic words and templates (anything enclosed in double braces; sometimes also called {{ constructs). It will load the possible translated magic words and see if the {{thing}} in the wikitext or message matches a known magic word. If not, the {{thing}} is considered a template call. If the plural magic word matches, the parser will call CoreParserFunctions::plural which will take the arguments, make them into an array, call the correct language object with Language::convertPlural( number, forms ): see table 2 for function call trace.

In the Language class we first handle explicit plural forms explained in a previous post on explicit zero and one form. If any explicit plural form doesn’t match, they are removed and we will continue on with the other forms, calling Language::getPluralRuleIndexNumber( number ), which first loads the compiled plural rules into the in-process cache, then calls CLDRPluralRuleEvaluator::evaluateCompiled which returns the box the number belongs to. Finally we take the matching form given by the translator, or the last form provided. Then the return value is substituted in place of the plural magic word call.

Table 2: Function call list for plural magic word
Message::parse Message::text
  • Message::toString
  • Message::parseText
  • MessageCache::parse
  • Parser::parse
  • Parser::internalParse
  • Message::toString
  • Message::transformText
  • MessageCache::transform
  • Parser::transformMsg
  • Parser::preprocess
  • [The above lists converge here]
  • Parser::replaceVariables
  • PPFrame_DOM::expand
  • Parser::braceSubstitution
  • Parser::callParserFunction
  • call_user_func_array
  • CoreParserFunctions::plural
  • Language::convertPlural
  • [Plural rule evaluation]

Invocation on frontend

The resource loader module mediawiki.language.data (implemented in class ResourceLoaderLanguageDataModule) is responsible for loading the plural rules from localisation cache and delivering them together with other language data to JavaScript.

The resource loader module mediawiki.jqueryMsg provides yet another limited wikitext parser which can handle plural, links and few other things. The module mediawiki (global mediaWiki, usually aliased to mw) provides the messaging interface with functions like mw.msg() or mw.message().text(). Those will not handle plural without the aforementioned mediawiki.jqueryMsg module. Translated magic words are not supported at the frontend.

If a plural magic word is found, then it will call the frontend convertPlural method. These are provided in few hops by the module mediawiki.language which depends on mediawiki.language.data and mediawiki.cldr. The latter depends on mediawiki.libs.pluralruleparser, which evaluates the (non-compiled) CLDR plural rules to reach the same result as in the PHP side and is hosted at GitHub, written by Santhosh Thottingal of the Wikimedia Language Engineering team.

-- .