Archive for January, 2014

First day at work

Wednesday, January 8th, 2014

Officially I started on January 1st, but apart from getting an account, today was the first real thing at the university. It still feels great – the “oh my, what did I sign up for” feeling still has time to come. ;)

After the WMF daily standup, I had my usual breakfast and headed to the city center, where our research group of four had a meeting. To my surprise, the eduroam network worked immediately. I had configured it at home earlier based on a guide on the site of some university in Switzerland, if I remember correctly: my university didn’t provide good help on how to set it up with Fedora and KDE.

Institute of Behavioural Sciences, University of Helsinki

The building on the left is part of the Institute of Behavioural Sciences. It is just next to the building (not visible) where I started my university studies in 2005. (Photo CC BY-NC-ND by Irmeli Aro.)

On my side, preparations for the IWSDS conference are now the highest priority. I have until Monday to prepare my first ever poster presentation. I found PowerPoint and InDesign templates on the university’s website (ugh, proprietary tools). Then there are a few days to get it printed before I fly on Thursday. After the trip I will make a website for the project to give it some visibility, and find out about the next steps as well as how to proceed with my studies.

After this topic, I got to hear about the other part of the research: collecting data in Sami languages. I connected them with Wikimedia Suomi, which has expressed interest in working with Sami people.

After the meeting, we went hunting for so-called WBS codes which are needed in various places to target the expenses, for example for poster printing and travel plans. (In case someone knows where the abbreviation WBS comes from, there are at least two people in the world who are interested to know.) The people I met there were all very friendly and helpful.

On my way home I met an old friend from Päivölä & university (Mui Jouni!) in the metro. There was also a surprise ticket inspection – a 25% inspection rate for my trips this year, based on 4 observations. I guess I need more observations before this is statistically significant ;)

One task left for me when I got home was to do the mandatory travel plan. This needs to be done through the university’s travel management software, which is not directly accessible. I tried to access it first through their web-based VPN proxy, second with openvpn via NetworkManager via “some random KDE GUI for that” on my laptop and, third, even with a proprietary VPN application on my Android phone. None of these worked, so I gave up for today – it’s likely that the VPN connection itself is not the problem and the issue is somewhere else.

It’s still not known where I will get a room (I’m employed in a different department from the one where I’m doing my PhD), though I will likely work from home often, as I am used to.

MediaWiki i18n explained: {{PLURAL}}

Sunday, January 5th, 2014

This post explains, for developers who need to work with it, how MediaWiki handles plural rules. In other words, how a string like “This wiki {{PLURAL:$1|0=does not have any pages|has one page|has $1 pages}}” becomes “This wiki has 425 pages”.

Rules

As mentioned before, we have adopted a data-based approach. Our plural rules come from Unicode CLDR (Common Locale Data Repository) in XML format and are stored in languages/data/plurals.xml. These rules are supplemented by local overrides in languages/data/plurals-mediawiki.xml for languages not supported by CLDR or where we have yet to unify our existing local rules with the CLDR rules.

As a short recap, translators handle plurals by writing all possible different forms explicitly. That means that there are different forms for singular, dual, plural, etc., depending on what grammatical numbers the language has. There might be more forms because of other grammatical reasons, for example in Russian the grammatical case of the noun varies depending on the number. The rules from CLDR put all numbers into different boxes, each box corresponding to one form provided by the translator.
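
As a rough illustration of the boxes, the sketch below sorts whole numbers into the Russian categories and picks the matching form. The function names and the example word are made up for this illustration and are not part of MediaWiki; the conditions are the Russian rules listed in Table 1 of this post.

```python
# Illustrative only: hypothetical helper names, not MediaWiki code.
# The conditions mirror the Russian rules listed in Table 1.
def russian_box(n: int) -> str:
    """Return the CLDR keyword of the box a whole number falls into."""
    if n % 10 == 1 and n % 100 != 11:
        return "one"
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return "few"
    if n % 10 == 0 or 5 <= n % 10 <= 9 or 11 <= n % 100 <= 14:
        return "many"
    return "other"  # fallback box when no rule matches


# One translator-provided form per box, here for the word "page".
FORMS = {"one": "страница", "few": "страницы", "many": "страниц"}


def pages(n: int) -> str:
    """Pick the form for n, falling back to the last form given."""
    form = FORMS.get(russian_box(n), FORMS["many"])
    return f"{n} {form}"


for n in (1, 3, 11, 21):
    print(pages(n))
```

Note how 21 lands in the same box as 1, while 11 does not: the boxes follow the grammar of the language, not a simple singular and plural split.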

Preprocessing

The plural rules are stored in the localisation cache (not to be confused with the message cache and many other caches in MediaWiki) together with other language-specific localisation data. The localisation cache can be stored in different places depending on configuration. The default is to use the SQL database, but it can also be stored in CDB files, as it is at the Wikimedia Foundation and translatewiki.net.

The whole process starts 1) when the user runs php maintenance/rebuildLocalisationCache.php, or 2) during a web request, if the cache is stale and automatic cache rebuild is allowed (as by default).

The code proceeds as follows:

LocalisationCache::readSourceFilesAndRegisterDeps

  • LocalisationCache::getPluralRules fills pluralRules
    • LocalisationCache::loadPluralFiles loads both xml files, merges them and stores the result in in-process cache
  • LocalisationCache::getCompiledPluralRules fills compiledPluralRules
    • LocalisationCache::loadPluralFiles returns rules from in-process cache
    • CLDRPluralRuleEvaluator::compile compiles the standard notation into RPN (reverse Polish notation)
  • LocalisationCache::getPluralTypes fills pluralRuleTypes
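
To give an idea of what the compile step does, here is a minimal sketch that turns the standard notation into RPN with a small shunting-yard loop. It is not the code of CLDRPluralRuleEvaluator::compile: the function names and the precedence table are assumptions made for this example, and it only covers the operators that appear in the Russian rules.

```python
import re

# Assumed precedence for this sketch: a higher number binds tighter.
PRECEDENCE = {
    "or": 1, "and": 2,
    "is": 3, "is-not": 3, "in": 3, "not-in": 3,
    "..": 4, "mod": 5,
}


def tokenize(rule):
    """Split a rule into tokens, joining 'is not' and 'not in' into one operator."""
    raw = re.findall(r"\d+|\.\.|[a-z]+", rule)
    tokens = []
    i = 0
    while i < len(raw):
        pair = raw[i:i + 2]
        if pair == ["is", "not"] or pair == ["not", "in"]:
            tokens.append("-".join(pair))
            i += 2
        else:
            tokens.append(raw[i])
            i += 1
    return tokens


def compile_rule(rule):
    """Convert infix notation to RPN (all operators treated as left-associative)."""
    output, operators = [], []
    for token in tokenize(rule):
        if token in PRECEDENCE:
            # Emit operators that bind at least as tightly as the new one.
            while operators and PRECEDENCE[operators[-1]] >= PRECEDENCE[token]:
                output.append(operators.pop())
            operators.append(token)
        else:
            output.append(token)  # operand: n or a number
    while operators:
        output.append(operators.pop())
    return " ".join(output)


print(compile_rule("n mod 10 is 1 and n mod 100 is not 11"))
```

The point of the RPN form is that the operands always come before their operator, so it can later be evaluated with a simple stack and no further parsing.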

So now for the given language we have three lists (see table 1). The pluralRules are used in the frontend (JavaScript) and the compiledPluralRules are used in the backend (PHP) with a custom evaluator. Tim Starling wrote the custom evaluator for performance reasons. The pluralRuleTypes list stores the map between numerical indexes and CLDR keywords, which are not used in MediaWiki plural syntax. Please note that Russian has four plural forms: the fourth form, called other, is used when none of the other rules match and is not stored anywhere.

Table 1: Stored plural data for Russian
pluralRuleTypes | pluralRules | compiledPluralRules
“one” | “n mod 10 is 1 and n mod 100 is not 11” | “n 10 mod 1 is n 100 mod 11 is-not and”
“few” | “n mod 10 in 2..4 and n mod 100 not in 12..14” | “n 10 mod 2 4 .. in n 100 mod 12 14 .. not-in and”
“many” | “n mod 10 is 0 or n mod 10 in 5..9 or n mod 100 in 11..14” | “n 10 mod 0 is n 10 mod 5 9 .. in or n 100 mod 11 14 .. in or”
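
The compiled rules in Table 1 can be evaluated with a plain stack. The following is a minimal sketch of that idea, not the actual CLDRPluralRuleEvaluator code: the function names are hypothetical and only the operators seen in the Russian rules are handled.

```python
# Compiled Russian rules copied from Table 1, in index order (one, few, many).
COMPILED_RU = [
    "n 10 mod 1 is n 100 mod 11 is-not and",
    "n 10 mod 2 4 .. in n 100 mod 12 14 .. not-in and",
    "n 10 mod 0 is n 10 mod 5 9 .. in or n 100 mod 11 14 .. in or",
]


def evaluate_rpn(rule, n):
    """Evaluate one compiled rule for the whole number n."""
    stack = []
    for token in rule.split():
        if token == "n":
            stack.append(n)
        elif token.isdigit():
            stack.append(int(token))
        else:
            # Every operator here is binary: pop the right operand first.
            right = stack.pop()
            left = stack.pop()
            if token == "mod":
                stack.append(left % right)
            elif token == "..":
                stack.append(range(left, right + 1))  # inclusive range
            elif token == "is":
                stack.append(left == right)
            elif token == "is-not":
                stack.append(left != right)
            elif token == "in":
                stack.append(left in right)
            elif token == "not-in":
                stack.append(left not in right)
            elif token == "and":
                stack.append(left and right)
            elif token == "or":
                stack.append(left or right)
            else:
                raise ValueError(f"unknown token: {token}")
    return stack.pop()


def plural_index(n, rules=COMPILED_RU):
    """Index of the first matching rule; len(rules) stands for 'other'."""
    for index, rule in enumerate(rules):
        if evaluate_rpn(rule, n):
            return index
    return len(rules)


print([plural_index(n) for n in (1, 3, 5, 11, 21)])
```

The returned index is what pluralRuleTypes maps back to a CLDR keyword; an index one past the end of the list corresponds to the unstored other form.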

The cache also stores the magic word PLURAL, defined in languages/messages/MessagesEn.php and translated to other languages, so in Finnish language wikis they can use {{MONIKKO:$1|$1 talo|$1 taloa}} if they so want. For compatibility reasons, the English magic words are used in all interface translations.

Invocation on backend

There are roughly three ways to trigger plural parsing:

  1. using the plural syntax in a wiki page,
  2. calling the plural parser with a Message object with output format text,
  3. using the plural syntax in a message with output format parse, which calls full wikitext parser as in 1.

In all cases, we will get into Parser::replaceVariables, which expands all magic words and templates (anything enclosed in double braces; sometimes also called {{ constructs). It will load the possible translated magic words and see if the {{thing}} in the wikitext or message matches a known magic word. If not, the {{thing}} is considered a template call. If the plural magic word matches, the parser will call CoreParserFunctions::plural, which takes the arguments, makes them into an array and calls Language::convertPlural( number, forms ) on the correct language object; see table 2 for the function call trace.

In the Language class we first handle explicit plural forms, explained in a previous post on explicit zero and one forms. If none of the explicit plural forms match, they are removed and we continue with the other forms, calling Language::getPluralRuleIndexNumber( number ), which first loads the compiled plural rules into the in-process cache, then calls CLDRPluralRuleEvaluator::evaluateCompiled, which returns the box the number belongs to. Finally we take the matching form given by the translator, or the last form provided if there is no form for that box. The return value is then substituted in place of the plural magic word call.
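
The selection logic described above can be sketched roughly as follows. This is a simplified illustration and not the code of Language::convertPlural: the function names are hypothetical, the rule lookup is passed in as a plain function, and the English rule used in the example (index 0 for exactly one, otherwise the last form) is assumed. Replacing $1 with the number is a separate step that this sketch leaves out.

```python
def convert_plural(count, forms, plural_index):
    """Pick a form: explicit 'N=text' forms first, then the rule-based box."""
    remaining = []
    for form in forms:
        key, sep, text = form.partition("=")
        if sep and key.isdigit():
            # Explicit form such as "0=..." or "1=...".
            if int(key) == count:
                return text
            # Non-matching explicit forms are dropped.
        else:
            remaining.append(form)
    if not remaining:
        return ""
    # Use the form for the box, or the last form if there are too few.
    index = min(plural_index(count), len(remaining) - 1)
    return remaining[index]


def english_index(n):
    """Assumed English rule for the example: 'one' for 1, 'other' otherwise."""
    return 0 if n == 1 else 1


forms = ["0=does not have any pages", "has one page", "has $1 pages"]
for n in (0, 1, 425):
    print(n, "->", convert_plural(n, forms, english_index))
```

With the forms from the example at the start of this post, 0 picks the explicit zero form, 1 picks “has one page” and 425 picks “has $1 pages”.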

Table 2: Function call list for plural magic word
Message::parse:
  • Message::toString
  • Message::parseText
  • MessageCache::parse
  • Parser::parse
  • Parser::internalParse

Message::text:
  • Message::toString
  • Message::transformText
  • MessageCache::transform
  • Parser::transformMsg
  • Parser::preprocess

[The above lists converge here]
  • Parser::replaceVariables
  • PPFrame_DOM::expand
  • Parser::braceSubstitution
  • Parser::callParserFunction
  • call_user_func_array
  • CoreParserFunctions::plural
  • Language::convertPlural
  • [Plural rule evaluation]

Invocation on frontend

The resource loader module mediawiki.language.data (implemented in class ResourceLoaderLanguageDataModule) is responsible for loading the plural rules from the localisation cache and delivering them together with other language data to JavaScript.

The resource loader module mediawiki.jqueryMsg provides yet another limited wikitext parser, which can handle plural, links and a few other things. The module mediawiki (global mediaWiki, usually aliased to mw) provides the messaging interface with functions like mw.msg() or mw.message().text(). Those will not handle plural without the aforementioned mediawiki.jqueryMsg module. Translated magic words are not supported in the frontend.

If a plural magic word is found, the parser calls the frontend convertPlural method. This is provided in a few hops by the module mediawiki.language, which depends on mediawiki.language.data and mediawiki.cldr. The latter depends on mediawiki.libs.pluralruleparser, which evaluates the (non-compiled) CLDR plural rules to reach the same result as on the PHP side. It is hosted at GitHub and was written by Santhosh Thottingal of the Wikimedia Language Engineering team.
