MediaWiki i18n explained: {{PLURAL}}

Sunday, January 5th, 2014

This post explains to developers who need to work with plurals how MediaWiki handles plural rules: in other words, how a string like “This wiki {{PLURAL:$1|0=does not have any pages|has one page|has $1 pages}}” becomes “This wiki has 425 pages”.

Rules

As mentioned before, we have adopted a data-based approach. Our plural rules come from Unicode CLDR (Common Locale Data Repository) in XML format and are stored in languages/data/plurals.xml. These rules are supplemented by local overrides in languages/data/plurals-mediawiki.xml for languages that CLDR does not support or whose existing local rules we have yet to unify with the CLDR rules.

As a short recap, translators handle plurals by writing all possible different forms explicitly. That means that there are different forms for singular, dual, plural, etc., depending on what grammatical numbers the language has. There might be more forms because of other grammatical reasons, for example in Russian the grammatical case of the noun varies depending on the number. The rules from CLDR put all numbers into different boxes, each box corresponding to one form provided by the translator.
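
To make the "boxes" idea concrete, here is a minimal Python sketch that sorts integers into the Russian boxes, using the rule text quoted for Russian later in this post. The function name is hypothetical and this is only an illustration, not MediaWiki code; it assumes non-negative integers.

```python
def russian_plural_box(n):
    """Return the CLDR keyword (the "box") for a non-negative integer n."""
    # "one": n mod 10 is 1 and n mod 100 is not 11
    if n % 10 == 1 and n % 100 != 11:
        return "one"
    # "few": n mod 10 in 2..4 and n mod 100 not in 12..14
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return "few"
    # "many": n mod 10 is 0 or n mod 10 in 5..9 or n mod 100 in 11..14
    if n % 10 == 0 or 5 <= n % 10 <= 9 or 11 <= n % 100 <= 14:
        return "many"
    # "other": whatever no rule matched
    return "other"


for number in (1, 3, 5, 11, 21, 112):
    print(number, russian_plural_box(number))
```

The translator supplies one form per box, so 1 and 21 share a form, as do 5, 11 and 112.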

Preprocessing

The plural rules are stored in the localisation cache (not to be confused with the message cache and many other caches in MediaWiki) with other language-specific localisation data. The localisation cache can be stored in different places depending on configuration. The default is to use the SQL database, but it can also be stored in CDB files, as it is at the Wikimedia Foundation and translatewiki.net.

The whole process starts 1) when the user runs php maintenance/rebuildLocalisationCache.php, or 2) during a web request, if the cache is stale and automatic cache rebuild is allowed (as by default).

The code proceeds as follows:

LocalisationCache::readSourceFilesAndRegisterDeps

  • LocalisationCache::getPluralRules fills pluralRules
    • LocalisationCache::loadPluralFiles loads both xml files, merges them and stores the result in in-process cache
  • LocalisationCache::getCompiledPluralRules fills compiledPluralRules
    • LocalisationCache::loadPluralFiles returns rules from in-process cache
    • CLDRPluralRuleEvaluator::compile compiles the standard notation into reverse Polish notation (RPN)
  • LocalisationCache::getPluralTypes fills pluralRuleTypes
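
The compile step can be illustrated with a small shunting-yard sketch in Python. This is not the code of CLDRPluralRuleEvaluator::compile. The function names and precedence values are assumptions, chosen so that the Russian rules come out in the compiled form quoted in this post, and only the operators used in those rules are handled.

```python
import re

# Higher number binds tighter. These values are an assumption for this sketch.
PRECEDENCE = {"or": 1, "and": 2, "is": 3, "is-not": 3,
              "in": 3, "not-in": 3, "mod": 4, "..": 5}


def tokenize(rule):
    """Split a rule into tokens, joining "is not" and "not in" into one operator."""
    tokens = []
    for tok in re.findall(r"\.\.|[a-z]+|\d+", rule):
        if tokens and (tokens[-1], tok) == ("is", "not"):
            tokens[-1] = "is-not"
        elif tokens and (tokens[-1], tok) == ("not", "in"):
            tokens[-1] = "not-in"
        else:
            tokens.append(tok)
    return tokens


def compile_rule(rule):
    """Convert an infix plural rule to RPN (all operators left-associative)."""
    output, stack = [], []
    for tok in tokenize(rule):
        if tok in PRECEDENCE:
            # Flush operators that bind at least as tightly as the new one.
            while stack and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]:
                output.append(stack.pop())
            stack.append(tok)
        else:
            output.append(tok)  # operand: "n" or a number
    while stack:
        output.append(stack.pop())
    return " ".join(output)


print(compile_rule("n mod 10 is 1 and n mod 100 is not 11"))
# prints: n 10 mod 1 is n 100 mod 11 is-not and
```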

So now for the given language we have three lists (see table 1). The pluralRules are used in the frontend (JavaScript) and the compiledPluralRules are used in the backend (PHP) with a custom evaluator. Tim Starling wrote the custom evaluator for performance reasons. pluralRuleTypes stores the map between numerical indexes and CLDR keywords, which are not used in MediaWiki plural syntax. Please note that Russian has four plural forms: the fourth form, called other, is used when none of the other rules match and is not stored anywhere.

Table 1: Stored plural data for Russian
pluralRuleTypes | pluralRules | compiledPluralRules
“one” | “n mod 10 is 1 and n mod 100 is not 11” | “n 10 mod 1 is n 100 mod 11 is-not and”
“few” | “n mod 10 in 2..4 and n mod 100 not in 12..14” | “n 10 mod 2 4 .. in n 100 mod 12 14 .. not-in and”
“many” | “n mod 10 is 0 or n mod 10 in 5..9 or n mod 100 in 11..14” | “n 10 mod 0 is n 10 mod 5 9 .. in or n 100 mod 11 14 .. in or”
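
The compiled form is easy to evaluate with a stack, which is the point of compiling it. The following Python sketch shows the idea for the Russian rules in table 1. It is not the actual PHP evaluator: the function names are hypothetical, and it handles only integers and the operators that appear in the table.

```python
RUSSIAN_COMPILED = [
    "n 10 mod 1 is n 100 mod 11 is-not and",                         # one
    "n 10 mod 2 4 .. in n 100 mod 12 14 .. not-in and",              # few
    "n 10 mod 0 is n 10 mod 5 9 .. in or n 100 mod 11 14 .. in or",  # many
]

# Every operator takes the two topmost stack values (left, right).
OPERATORS = {
    "mod": lambda a, b: a % b,
    "..": lambda a, b: range(a, b + 1),  # inclusive range
    "is": lambda a, b: a == b,
    "is-not": lambda a, b: a != b,
    "in": lambda a, b: a in b,
    "not-in": lambda a, b: a not in b,
    "and": lambda a, b: a and b,
    "or": lambda a, b: a or b,
}


def evaluate_rpn(compiled, n):
    """Evaluate one compiled rule for the integer n and return True or False."""
    stack = []
    for tok in compiled.split():
        if tok == "n":
            stack.append(n)
        elif tok.isdigit():
            stack.append(int(tok))
        else:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPERATORS[tok](left, right))
    return bool(stack.pop())


def plural_index(n, compiled_rules=RUSSIAN_COMPILED):
    """Return the index of the first matching rule, or the "other" index."""
    for index, rule in enumerate(compiled_rules):
        if evaluate_rpn(rule, n):
            return index
    return len(compiled_rules)  # "other": no stored rule matched


print(plural_index(425))  # prints 2, the "many" box
```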

The cache also stores the magic word PLURAL, defined in languages/messages/MessagesEn.php and translated to other languages, so in Finnish language wikis they can use {{MONIKKO:$1|$1 talo|$1 taloa}} if they so want. For compatibility reasons, the English magic words are used in all interface translations.

Invocation on backend

There are roughly three ways to trigger plural parsing:

  1. using the plural syntax in a wiki page,
  2. calling the plural parser with a Message object with output format text,
  3. using the plural syntax in a message with output format parse, which calls full wikitext parser as in 1.

In all cases, we end up in Parser::replaceVariables, which expands all magic words and templates (anything enclosed in double braces; sometimes also called {{ constructs). It loads the possibly translated magic words and checks whether the {{thing}} in the wikitext or message matches a known magic word. If not, the {{thing}} is considered a template call. If the plural magic word matches, the parser calls CoreParserFunctions::plural, which collects the arguments into an array and calls Language::convertPlural( number, forms ) on the correct language object. See table 2 for the function call trace.
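
The magic-word-or-template decision can be sketched in a few lines of Python. This is a deliberately simplified illustration, not how Parser::replaceVariables is implemented: the alias table and function name are hypothetical, nesting is ignored, and names are matched case-sensitively.

```python
# Hypothetical alias table: canonical id -> names accepted in wikitext.
MAGIC_WORD_ALIASES = {"plural": ["PLURAL", "MONIKKO"]}


def classify(construct):
    """Classify a single, non-nested {{...}} construct."""
    inner = construct[2:-2]  # strip the surrounding braces
    name, colon, rest = inner.partition(":")
    if colon:
        for magic_id, aliases in MAGIC_WORD_ALIASES.items():
            if name.strip() in aliases:
                # Known magic word: split the remainder into arguments.
                return ("parser function", magic_id, rest.split("|"))
    # Anything else is treated as a template call.
    title, *args = inner.split("|")
    return ("template", title.strip(), args)


print(classify("{{MONIKKO:$1|$1 talo|$1 taloa}}"))
# prints: ('parser function', 'plural', ['$1', '$1 talo', '$1 taloa'])
print(classify("{{Infobox|name=x}}"))
# prints: ('template', 'Infobox', ['name=x'])
```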

In the Language class we first handle the explicit plural forms, explained in a previous post on explicit zero and one forms. If none of the explicit plural forms match, they are removed and we continue with the other forms, calling Language::getPluralRuleIndexNumber( number ), which first loads the compiled plural rules into the in-process cache, then calls CLDRPluralRuleEvaluator::evaluateCompiled, which returns the box the number belongs to. Finally we take the matching form given by the translator, or the last form provided if the translator gave fewer forms. The return value is then substituted in place of the plural magic word call.
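
The selection logic described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the code of Language::convertPlural: the function names are hypothetical, the rule index is passed in as a plain function, and the $1 parameter is substituted by hand at the end.

```python
def english_rule_index(n):
    """Assumed English rule: index 0 for exactly one, otherwise 1 ("other")."""
    return 0 if n == 1 else 1


def convert_plural(count, forms, rule_index):
    """Pick the form for count from the translator-provided forms."""
    remaining = []
    for form in forms:
        key, equals, text = form.partition("=")
        if equals and key.isdigit():
            # Explicit form such as "0=...": use it only on an exact match.
            if int(key) == count:
                return text
        else:
            remaining.append(form)
    if not remaining:
        return ""
    # Fall back to the last form if fewer forms were given than the index needs.
    index = min(rule_index(count), len(remaining) - 1)
    return remaining[index]


forms = ["0=does not have any pages", "has one page", "has $1 pages"]
for count in (0, 1, 425):
    chosen = convert_plural(count, forms, english_rule_index)
    print("This wiki " + chosen.replace("$1", str(count)))
```

With the counts 0, 1 and 425 this prints “This wiki does not have any pages”, “This wiki has one page” and “This wiki has 425 pages”.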

Table 2: Function call list for plural magic word

Message::parse
  • Message::toString
  • Message::parseText
  • MessageCache::parse
  • Parser::parse
  • Parser::internalParse

Message::text
  • Message::toString
  • Message::transformText
  • MessageCache::transform
  • Parser::transformMsg
  • Parser::preprocess

[The above lists converge here]
  • Parser::replaceVariables
  • PPFrame_DOM::expand
  • Parser::braceSubstitution
  • Parser::callParserFunction
  • call_user_func_array
  • CoreParserFunctions::plural
  • Language::convertPlural
  • [Plural rule evaluation]

Invocation on frontend

The resource loader module mediawiki.language.data (implemented in class ResourceLoaderLanguageDataModule) is responsible for loading the plural rules from localisation cache and delivering them together with other language data to JavaScript.

The resource loader module mediawiki.jqueryMsg provides yet another limited wikitext parser, which can handle plural, links and a few other things. The module mediawiki (global mediaWiki, usually aliased to mw) provides the messaging interface with functions like mw.msg() or mw.message().text(). Those will not handle plural without the aforementioned mediawiki.jqueryMsg module. Translated magic words are not supported in the frontend.

If a plural magic word is found, the parser calls the frontend convertPlural method. This method is provided, in a few hops, by the module mediawiki.language, which depends on mediawiki.language.data and mediawiki.cldr. The latter depends on mediawiki.libs.pluralruleparser, which evaluates the (non-compiled) CLDR plural rules to reach the same result as on the PHP side. It is hosted at GitHub and was written by Santhosh Thottingal of the Wikimedia Language Engineering team.

FOSDEM talk reflections 2/3: docs, code and community health, stability

Monday, February 11th, 2013

This is the second post about FOSDEM 2013; see 1/3: I18n in the WEB, Mozilla i18n and L20n for the first. Links to the abstracts in the headers.

Open Sourcing Documentation

Don’t keep documentation to yourself; release it with an open license. Others might see the forest for the trees when you are too close to the problem. Translators can translate the documentation, but this needs proper tools, something which was not mentioned in the presentation.

Also mentioned in the presentation were webplatform.org (greetings sent to Ryan Lane, who helped to build it) and the problems of Mozilla having to support both webplatform and its own MDN, where the latter has a wider scope and a more restrictive license than the former.

Coping with the proliferation of tools within your community

FOSDEM entrance

Entering the MediaWiki community and contributing to MediaWiki is hard for many reasons

I got the impression that XWiki is the everything-and-the-kitchen-sink of wikis. It has several nice points related to having everything in one place (one wiki):

  • Only one place to have account and one place to sign in.
  • Can search everything like bugs, commits, IRC logs and documentation at once.

In my opinion it’s a nice starting point for projects, but in the end it’s the quality of the individual tools that matters when projects grow. I don’t see how MediaWiki could replace Gerrit or Bugzilla with something provided by XWiki.

XWiki is one of the candidates to take over from MediaWiki if MediaWiki is not able to revitalize its community by improving the extension and gadget development ecosystem. That XWiki is written in Java rather than PHP can be a deterrent for some (but the opposite is certainly true, too).

How we made the Jenkins community

It’s not enough to lower the barriers; the barriers must be removed wherever possible. While Wikipedia is/was quite open and easy to use (not going too deep into that), MediaWiki development had and still has many barriers. It was not long ago that getting commit access took ages and you had to present your CV. Nowadays commit access is easier to get because we use Gerrit to review code before it is merged. But Gerrit is quite a complex beast and we still don’t have GitHub integration to accept drive-by patches. Lack of documentation and the difficulty of getting patches reviewed are also problems, not to mention the lack of a shared vision of where MediaWiki should go.

Another problem that I think is not being recognized is that MediaWiki code, while not as bad as it was, is in my humble opinion not improving fast enough. MediaWiki core is a huge monolithic piece of code, entangled in many places to the extent that it is impossible to extend without refactoring the affected code first. This greatly affects innovation and the creation of new extensions, and thus is a problem for the MediaWiki ecosystem. The core code needs to become cleaner and more modular. The modularity and quality of the Jenkins APIs seemed to be a major reason for the laudable growth of its community.

Wish: Can we have an integrated search that provides results from mediawiki.org, Bugzilla, IRC logs, mailing lists and other relevant places and not just a service on labs known to the lucky few? There is also lots of stuff in etherpads, Google documents and even on blogs that will probably never be searchable.

Improving Stability of Mozilla Products

Mozilla booth at FOSDEM

Something we lack compared to Mozilla is the crash metrics (note the diversity at their booth)

This talk made me wish for information on what actual problems MediaWiki users and developers are facing. Unfortunately, due to the nature of PHP, fatal errors and warnings can be caused by so many things, including syntax errors users made in their LocalSettings.php, that the data would need lots of filtering. There are also privacy concerns, but it should be possible to do something by collecting exceptions and JavaScript errors and aggregating them for some selected few to see. This would help to prioritize bugs, as Bugzilla and IRC channels are bad indicators of how many people are actually encountering a particular issue.

I’d like to be optimistic, but the best thing we have so far is translatewiki.net, which collects PHP errors, warnings and exceptions as well as JavaScript errors. However, aside from the PHP warnings and notices that are announced on IRC in #mediawiki-i18n, that information is only available to me and a few others, and after all translatewiki.net is just a minor piece of the total system. One of my personal wishes is to have more priority given to the issues affecting translatewiki.net, as experience teaches us that issues discovered on translatewiki.net will most often also surface as issues in Wikimedia wikis. Mozilla can state things like: This bug affects 10% of all active users or hundred million users every day. Can we please have that, too?

It would be so awesome if the Wikimedia Foundation could release similar kinds of information to the wider public.

Shorts

IonMonkey: Yet Another JIT Compiler for JavaScript?

Why is yet another JIT compiler needed? Because the new one is better! Sometimes it makes sense. IonMonkey is based on well-understood techniques like static single assignment.

WebRTC: Real time web communication

The open source technology stack to kill Skype and Google Hangouts is in progress, but will still take a while to reach a browser near you. Check out http://reveal.rs.af.cm/ with a recent version of Google Chrome for a preview of what can also be accomplished with WebRTC.

PDF.js – Firefox’s HTML5 PDF Viewer

This is something that the Wikimedia Foundation could perhaps use too, though limited browser support is an issue.

The presentation had some nice tidbits on how the PDF standard includes everything and the kitchen sink, such as a 3D model viewer. It also explained what is easy to port to HTML5 (many things can be done with the Canvas specification) and what is not. HTML5 specifications are going to see additions due to this work.

Changesets evolution with Mercurial

A cool idea: tracking the changes to the commit history itself, so you can alter the history without deleting any parts of it. It is questionable, though, whether this will see widespread use, and for now it’s only in (or coming to) Mercurial, not Git.
