Monthly Archives: February 2013

FOSDEM talk reflections 3/3: HipHop, communities, public procurement

Nikerabbit arrives at the MediaWiki meetup at FOSDEM

Meetup of the MediaWiki community. Or Wikimedia tech? What should we call the Wikimedia software development ecosystem? (Photo by henna, copyright status unknown.)

This is the third post about FOSDEM 2013; see 1/3: I18n in the WEB, Mozilla i18n and L20n for the first and 2/3: docs, code and community health, stability for the second. Links to the abstracts in the headers.

Scaling PHP with HipHop

HipHop is still alive, and faster than ever. It has evolved from a PHP-to-C++ translator into a bytecode interpreter with a JIT compiler; PHP itself is also a bytecode interpreter, but without the JIT of course. The speedups they are seeing are impressive (it was deployed at Facebook about a month ago). Given that they have removed the compile-everything-before-deploy step, it is now much more feasible to use.

I’m considering giving HipHop a try on translatewiki.net later this year, probably after we have upgraded to at least Ubuntu 12.04, for which Facebook provides packages. It is still a pain in the ass to set up manually, as it was a few years ago. The Wikimedia Foundation (WMF) has dropped its evaluation, but HipHop, or hhvm as it is called now, has changed a lot since then, so perhaps they will reconsider after hearing about our experiences.

It was highlighted that the supported language features and libraries of hhvm and PHP differ to some extent. hhvm provides some nice features like strict type hinting, but it is unlikely that I can use those anytime soon: there is no way to take advantage of them on hhvm without breaking support for normal PHP, which really cannot be done in the MediaWiki ecosystem.

Community/BOF meetup

Almost 20 people were around, a few of them from outside WMF. Discussions circled around events like the Amsterdam Hackathon and MediaWiki groups. The most interesting part (to me) was what to call the Wikimedia software development ecosystem, so that it can be marketed properly. Suggestions ranged from extending the meaning of MediaWiki to cover everything including mobile, gadgets and so on; through using Wikipedia, as it is the best-known brand; to creating a new Wikitech brand.

There are pros and cons to each of the above, but one thing is true: there is currently no name that can be used to refer to everything technical done around MediaWiki and Wikimedia and that would also be understood by potential participants. Also, MediaWiki development is not perceived to be cool anymore, because it’s PHP. But it isn’t just PHP. MediaWiki development is also Redis, Varnish, puppet, git, Solr, HipHop, semantic, node.js, mobile, OpenStack, and more. Quim Gil will continue working on coming up with a brand. Curious visitors can also compare this to what KDE did recently when they expanded the meaning of KDE to be not only the desktop, but also the community and everything they do. The change process wasn’t painless for them, and wouldn’t be for us, but at the same time (IMHO) the change has been quite successful and beneficial to KDE.

Qt Project Update

Qt booth at FOSDEM

Qt: a maintainer for each subsystem helps getting your patch reviewed, unlike MediaWiki

Qt is doing well. Qt5 is an evolution rather than a revolution (which is what Qt4 was to Qt3). Contributors from outside Nokia (and nowadays Digia) now account for about one third of all commits. They are using Gerrit, like MediaWiki. But unlike MediaWiki they have an explicit hierarchy, with maintainers who are responsible for keeping each subsystem in shape. It also means that there is always at least one person you can talk to in order to get your patch reviewed, unlike in the MediaWiki community.

Their platform support is also nice: Linux, Windows and OS X are fully supported, while iOS, Android and BlackBerry OS also work, more or less.

Fixing public procurement

Forgive me if I use incorrect terms. In a nutshell, there is a law in Finland (coming from the EU) that forbids governmental organizations from requesting software systems by naming a specific vendor. So, as a hypothetical example, “We want Microsoft Office on all our work stations in X department” is illegal, while “We want an office tools suite that includes documents, presentations, …” is legal. Free software people in Finland did an analysis of how many times this law has been violated (quite many) and have been sending letters that ask the organizations in question to read the rules and fix their procurement.

The talk continued with the observation that there is no entity to enforce this rule, and that it is difficult to get the disadvantaged companies to sue. One side argued that fixing this particular issue harms the wide-scale education of people on this and other issues. Or when is it useful to sue instead of trying to educate? When are the lost opportunities a bigger harm than the bad publicity and money spent on suing? Apparently Microsoft has sued successfully in Finland to gain lots of money without a big PR hit. Open source solutions are usually discarded because the exit costs of the previous system are attributed to the new solution instead of the old vendor lock-in solution.

All in all, the target of this kind of work is to further open source use in governments by allowing free competition; they want to do it EU-wide.

The Keeper of Secrets: The Dance of Community Leadership

FOSDEM party crowd

FOSDEM preparty was definitely not a quiet beerless one

This is the first time I’ve seen Leslie Hawthorn speaking. In her talk, which was full of beer jokes (a bit too many for my taste), I caught some points:

  • Don’t be a jerk.
  • Stop gossiping and talk directly to people you have problems with.
  • Don’t ignore difficult people; be brave enough to let them know your honest opinion.
  • Don’t be a jerk while talking about difficult things with people; do it politely and cooperatively.
  • Face to face meetings are essential to community building (my addition: also include possibilities to do it in beerless, quiet places).

FOSDEM talk reflections 2/3: docs, code and community health, stability

This is the second post about FOSDEM 2013; see 1/3: I18n in the WEB, Mozilla i18n and L20n for the first. Links to the abstracts in the headers.

Open Sourcing Documentation

Don’t keep documentation to yourself; release it with an open license. Others might see the forest for the trees when you are too close to the problem. Translators can translate the documentation, but this needs proper tools, something which was not mentioned in the presentation.

Also mentioned in the presentation were webplatform.org (greetings sent to Ryan Lane, who helped to build it) and the problems of Mozilla having to support both webplatform and its own MDN, where the latter has a wider scope and a more restrictive license than the former.

Coping with the proliferation of tools within your community

FOSDEM entrance

Entering the MediaWiki community and contributing to MediaWiki is hard for many reasons

I got the impression that XWiki is the everything-and-the-kitchen-sink of wikis. It has several nice points related to having everything in one place (one wiki):

  • Only one place to have an account and one place to sign in.
  • Can search everything like bugs, commits, IRC logs and documentation at once.

In my opinion it’s a nice starting point for projects, but in the end it is also the quality of the individual tools that matters when projects grow. I don’t see how MediaWiki could replace Gerrit or Bugzilla with something provided by XWiki.

XWiki is one of the candidates to take over from MediaWiki if MediaWiki is not able to revitalize its community by improving the extension and gadget development ecosystem. XWiki being written in Java rather than PHP can be a deterrent for some (but the opposite is certainly true, too).

How we made the Jenkins community

It’s not enough to lower the barriers; the barriers must be removed wherever possible. While Wikipedia is/was quite open and easy to use (not going too deep into that), MediaWiki development had and still has many barriers. It was not long ago that getting commit access took ages and you had to present your CV. Nowadays commit access is easier to get because we use Gerrit to review code before it is merged. But Gerrit is quite a complex beast and we still don’t have GitHub integration to accept drive-by patches. Lack of documentation and the difficulty of getting patches reviewed are also problems, not to mention the lack of a shared vision of where MediaWiki should go.

Another problem that I think is not being realized is that MediaWiki code, while not as bad as it was, is in my humble opinion not improving fast enough. MediaWiki core is a huge monolithic piece of code, entangled in many places to the extent that it is impossible to extend without refactoring the affected code first. This greatly hampers innovation and the creation of new extensions, and thus is a problem for the MediaWiki ecosystem. The core code needs to become cleaner and more modular. The modularity and quality of the Jenkins APIs seemed to be a major reason for the laudable growth of its community.

Wish: Can we have an integrated search that provides results from mediawiki.org, Bugzilla, IRC logs, mailing lists and other relevant places and not just a service on labs known to the lucky few? There is also lots of stuff in etherpads, Google documents and even on blogs that will probably never be searchable.

Improving Stability of Mozilla Products

Mozilla booth at FOSDEM

Something we lack compared to Mozilla is the crash metrics (note the diversity at their booth)

This talk made me wish for information on what actual problems MediaWiki users and developers are facing. Unfortunately, due to the nature of PHP, fatal errors and warnings can be caused by so many things, including syntax errors users made in their LocalSettings.php, that the data would need lots of filtering. There are also privacy concerns, but it should be possible to do something by collecting exceptions and JavaScript errors and aggregating them for some selected few to see. This would help to prioritize bugs, as Bugzilla and IRC channels are bad indicators of how many people are actually encountering a particular issue.

I’d like to be optimistic, but the best thing we have so far is translatewiki.net, which collects PHP errors, warnings and exceptions as well as JavaScript errors. However, aside from the PHP warnings and notices that are announced on IRC in #mediawiki-i18n, that information is only available to me and a few others, and after all translatewiki.net is just a minor piece of the total system. One of my personal wishes is to have more priority given to the issues affecting translatewiki.net, as experience teaches us that issues discovered on translatewiki.net will most often also surface as issues in Wikimedia wikis. Mozilla can state things like: This bug affects 10% of all active users, or a hundred million users every day. Can we please have that, too?
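To make the wish concrete, here is a minimal Python sketch of the kind of aggregation that would produce a “this bug affects N% of users” figure. Everything in it is an assumption for illustration: the function name `affected_share`, the shape of the error reports (a user identifier paired with an error signature) and the sample data are made up, and a real system would need the filtering and privacy safeguards discussed above.

```python
from collections import defaultdict

def affected_share(reports, active_users):
    """Map each error signature to the share of active users who hit it.

    `reports` is an iterable of (user_id, signature) pairs; repeated
    reports from the same user for the same signature count only once.
    """
    users_by_signature = defaultdict(set)
    for user_id, signature in reports:
        users_by_signature[signature].add(user_id)
    return {
        signature: len(users) / active_users
        for signature, users in users_by_signature.items()
    }

# Made-up sample data: three users, two distinct error signatures.
reports = [
    ("user-1", "TypeError in example.js"),
    ("user-2", "TypeError in example.js"),
    ("user-1", "TypeError in example.js"),  # duplicate, counted once
    ("user-3", "Exception in ExampleClass"),
]
for signature, share in affected_share(reports, active_users=10).items():
    print(f"{signature}: {share:.0%} of active users")
```

Counting distinct users rather than raw reports is the design choice that matters here: one user stuck in a reload loop should not make a bug look widespread.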

It would be so awesome if the Wikimedia Foundation could release similar kinds of information to the wider public.

Shorts

IonMonkey: Yet Another JIT Compiler for JavaScript?

Why is yet another JIT compiler needed? Because the new one is better! Sometimes it makes sense. IonMonkey is based on well-understood techniques like static single assignment.

WebRTC: Real time web communication

The open source technology stack to kill Skype and Google Hangout is in progress, but will still take a while to reach a browser near you. Check out http://reveal.rs.af.cm/ with a recent version of Google Chrome for a preview of what can also be accomplished with WebRTC.

PDF.js – Firefox’s HTML5 PDF Viewer

This is something that the Wikimedia Foundation could perhaps use too, though limited browser support is an issue.

The presentation had some nice tidbits on how the PDF standard includes everything and the kitchen sink, such as a 3D model viewer. It also explained what is easy to port to HTML5 (many things can be done with the Canvas specification) and what is not. HTML5 specifications are going to see additions due to this work.

Changesets evolution with Mercurial

A cool idea: tracking the changes to the commit history itself, so you can alter the history without deleting any parts of it. It is questionable, though, whether this will see widespread use, and it is only now coming to Mercurial, not to Git.

FOSDEM talk reflections 1/3: I18n in the WEB, Mozilla i18n and L20n

FOSDEM 2013 t-shirt

FOSDEM 2013 was attended by several Wikimedians.

Now that I’ve slept on the presentations I attended at FOSDEM, it’s a good time to think about what I heard and how it relates to what I am doing. It is also a good time to do so before I forget what I heard. I didn’t get to talk to that many people this year, as I was mostly running from one talk to another.

There will be three parts to this series of blog posts. I will start with i18n-related topics and then cover other presentations roughly in the order I saw them (headers link to abstracts). There will also be a follow-up post on the gettext format detailing the good and bad sides from today’s point of view. Stay tuned!

An Integrated Localization Environment

Mozilla keeps pushing new i18n stuff, though the general feeling of this and other related talks is that they either have not defined what issue they are fixing, or they have defined it in a way that is completely different from what we are working on.

While we are trying to make it as easy as possible for translators to translate (in a technical sense; they already have enough complexity to deal with due to language itself), the ILE proposed in this talk is essentially an IDE (integrated development environment): a glorified text editor of the kind programmers often use for programming. It has features like syntax highlighting and automatic completion for translation file syntax.

But do translators really care about the particular syntax of translations in a file, or are they in fact happier if they do not need to care about files and version control systems at all, while at the same time having access to aids like translation memories and change tracking in an interface created by UX designers, as we have in translatewiki.net?

“It helps to see the messages above and below to understand the context”
You can see the related messages close to each other in almost any translation tool, even though showing related messages next to each other is not a replacement for proper documentation of context for each message.

“I don’t see how form based translation tools would cope with more complex localisation file formats like L20n”
I don’t think the solution to facilitating proper localisation is to turn the localisation itself into programming. The cases where more complex logic is needed are actually relatively few and I think it is worthwhile to keep the common case as simple as possible while supporting also the more complex cases in a standardized, data driven way, like using the CLDR.

L20n

Mozilla presenters at FOSDEM


Mozilla keeps pushing new i18n stuff: who is the user they are designing new tools for?

This talk was an update to the similar presentation on L20n last year. What I said in the previous section about turning localisation into programming applies here too.

It is nice that you specify grammatical gender for things, but this format does not really solve the problem that many variables actually come from user input, for which we cannot specify this information.

It is nice that you can make custom plural rules, but in almost all cases the standard set of plural rules that comes from standards like CLDR is enough.

It is nice that you can mix gender and plural, and even multiple plurals, in one message using nested hashes (arrays in PHP), but it is not nice at all that you have to translate the message N*M*O times as the number of variables increases. I firmly believe that inline syntax like {{GENDER:$1|he|she}} eats {{PLURAL:$2|apple|apples}} is superior in this regard.
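To illustrate why the inline syntax scales better, here is a minimal Python sketch of a resolver for the two constructs used above. It is not MediaWiki’s parser: the function name `render`, the hard-coded English plural rule and the fallback for unknown gender are assumptions made for this illustration only.

```python
import re

# Hypothetical, simplified matcher for {{GENDER:$n|...}} and {{PLURAL:$n|...}}.
PATTERN = re.compile(r"\{\{(GENDER|PLURAL):\$(\d+)\|([^{}]*)\}\}")

def render(message, params):
    """Replace each inline construct using the parameter it refers to."""
    def resolve(match):
        kind = match.group(1)
        value = params[int(match.group(2)) - 1]  # $1 is the first parameter
        forms = match.group(3).split("|")
        if kind == "GENDER":
            # Forms are given as male|female|unknown (assumed order).
            choice = {"male": 0, "female": 1}.get(value, 2)
            # Assumption: without a third form, unknown uses the first form.
            return forms[choice] if choice < len(forms) else forms[0]
        # PLURAL: English-style rule, singular only for exactly 1.
        choice = 0 if value == 1 else 1
        return forms[min(choice, len(forms) - 1)]
    return PATTERN.sub(resolve, message)

message = "{{GENDER:$1|he|she}} eats {{PLURAL:$2|apple|apples}}"
print(render(message, ["female", 3]))  # she eats apples
```

The translator writes one message with two forms in each slot. With the nested-hash approach the same message would need a separate full translation for every gender and plural combination.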

If we strip the plural, gender, time formatting etc. support from L20n, we actually just get a complex file format for storing things, something which we already have many variants of. The aforementioned features are usually provided by the i18n library (or definitely should be; unfortunately this is not always the case), so what they have done is actually to move the complexity of language from i18n libraries and software developers to translators. Aiming at “keep the common case simple, but support complex cases where needed”, I don’t think this, as presented, is a good trade-off between simplicity and flexibility.

webL10n: client-side i18n / l10n library

This talk was about adapting some nice parts of L20n to .properties format. The result is somewhat more complex than plain .properties and not as flexible as L20n. Even having gender and plural in the same message is problematic in this format.

I’d like to highlight two ideas in webL10n. (Sidenote: why call it l10n when it is actually an i18n library for developers, similar to jquery.i18n?)

The first idea is that you can have HTML like this:

<div data-l10n-id="retro">
<div>Please <a href="login/">log in</a></div>
</div>

And the translators see this:

retro = <div>Please <a>log in</a></div>

The translation, when displayed, is properly merged into the original HTML so that the classes and link targets are preserved. I don’t know what happens if the translation is outdated and the structure has changed, but I guess we just should not use outdated translations with this system. When escaping is handled properly, this is a very nice way to avoid what we call lego messages, where the text of the link is in a separate message because, due to escaping, we can’t have the link and the link text in the same message.

Another idea is that if you have HTML like this:

<input type="search" placeholder="Search messages" title="Message search box">

You can turn it into this:

<input type="search" data-l10n-id="searchbox">

And translators will see this (using .properties format here):

searchbox.placeholder=Search messages
searchbox.title=Message search box

This simplifies the HTML the developers need to write.
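How such dotted keys map back onto element attributes can be sketched in a few lines of Python. This is only an illustration of the idea, not webL10n’s actual implementation (which is JavaScript running in the browser); the function name `group_by_element` and the choice to store dot-less keys under `textContent` are assumptions.

```python
from collections import defaultdict

def group_by_element(properties_text):
    """Group 'id.attribute=value' lines into {id: {attribute: value}}."""
    grouped = defaultdict(dict)
    for line in properties_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        element_id, _, attribute = key.strip().partition(".")
        # Assumption: a key without a dot holds the element's text content.
        grouped[element_id][attribute or "textContent"] = value.strip()
    return dict(grouped)

translations = group_by_element("""
searchbox.placeholder=Search messages
searchbox.title=Message search box
""")
print(translations["searchbox"])
```

A library can then look up the element’s `data-l10n-id` in the resulting mapping and set each listed attribute on it.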

Finally, take a look also at Pau’s Design talks at FOSDEM 2013.

New language stuff for developers and users

MLEB 2013.01 has been released by Amir Aharoni. Lots of development has been happening in Translate due to the work on the new translation interfaces. If you are a developer, please also check out the latest new and changed Web APIs and give us a shout in #mediawiki-i18n @freenode if you see something obviously wrong or missing. Also included in this release are bug fixes for Universal Language Selector, while the other included extensions didn’t see many changes.

Some months ago I wrote about Language tag validation in MediaWiki. A nice person named Siebrand Mazeland decided to improve the situation. As of now we have three new methods developed by the Wikimedia Language engineering team:

  • isSupportedLanguage
  • isWellFormedLanguageTag
  • isKnownLanguageTag

Unless these methods are backported to MediaWiki 1.19 and MediaWiki 1.20, it will take a while before they are used in extensions, but after a while we should see faster and more readable code.
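For readers unfamiliar with the distinction, a well-formedness check only looks at the shape of a tag, not at whether the language is known or supported. The following Python sketch shows what such a check might look like. It is not the MediaWiki implementation: the function name is hypothetical and the pattern covers only a simplified subset of the BCP 47 grammar.

```python
import re

# Simplified subset of the BCP 47 grammar: language, optional script,
# optional region, optional variants. Extensions, private-use subtags and
# grandfathered tags are deliberately left out of this sketch.
LANGTAG = re.compile(
    r"[a-z]{2,3}"                               # primary language subtag
    r"(-[a-z]{4})?"                             # script, e.g. Latn
    r"(-(?:[a-z]{2}|[0-9]{3}))?"                # region, e.g. RS or 419
    r"(-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*",  # variants, e.g. 1996
    re.IGNORECASE,
)

def is_well_formed_language_tag(tag):
    """Return True if the whole tag matches the simplified pattern."""
    return LANGTAG.fullmatch(tag) is not None

for tag in ["fi", "sr-Latn-RS", "de-1996", "en_US"]:
    print(tag, is_well_formed_language_tag(tag))
```

Checks like this are cheap, which is why having them in core should make calling code both faster and easier to read than ad hoc validation.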

How I debug performance issues in MediaWiki

The earlier post does not describe how I usually do performance improvements. Usually it starts with debugging the less innocent-looking messages from our IRC bot rakkaus, which relays PHP error messages to the IRC channel. An example:

[01-Nov-2012 20:16:25 UTC] PHP Fatal error: Maximum execution time of 30 seconds exceeded in /www/translatewiki.net/w/extensions/Translate/ttmserver/TTMServer.php on line 100
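Since the timestamp is the only handle such a message offers, finding the offending request means searching the access log for requests that started shortly before it. Here is a minimal Python sketch of that search. It assumes an Apache-style access log whose bracketed timestamp marks when the request was received; the function names, the 60 second lookback and the sample access log lines are all made up for illustration.

```python
import re
from datetime import datetime, timedelta, timezone

ERROR_TIME = re.compile(r"^\[(\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2}) UTC\]")
ACCESS_TIME = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")

def error_time(line):
    """Parse the leading '[01-Nov-2012 20:16:25 UTC]' of a PHP error line."""
    raw = ERROR_TIME.match(line).group(1)
    return datetime.strptime(raw, "%d-%b-%Y %H:%M:%S").replace(tzinfo=timezone.utc)

def candidates(error_line, access_lines, lookback=timedelta(seconds=60)):
    """Return access log lines received within `lookback` before the error."""
    failed_at = error_time(error_line)
    hits = []
    for line in access_lines:
        match = ACCESS_TIME.search(line)
        if not match:
            continue
        received = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        if failed_at - lookback <= received <= failed_at:
            hits.append(line)
    return hits

error = "[01-Nov-2012 20:16:25 UTC] PHP Fatal error: Maximum execution time of 30 seconds exceeded"
access = [  # hypothetical sample lines
    '192.0.2.1 - - [01/Nov/2012:20:15:54 +0000] "GET /example-slow-page HTTP/1.1" 500 0',
    '192.0.2.2 - - [01/Nov/2012:20:10:00 +0000] "GET /example-fast-page HTTP/1.1" 200 5120',
]
print(candidates(error, access))
```

The lookback is deliberately wider than the 30 second limit, because the limit does not necessarily correspond to wall-clock time.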

After this I have to use the timestamp to find the matching entry in our webserver access log and see whether I can reproduce the issue by loading the same URL. PHP is very unhelpful in this regard: fatal errors give neither the request URL nor a stack trace. Sometimes it is a command line script, like the job runner initiated via cron. For those cases I’ve implemented simple logging of all maintenance script executions, but they are still annoying to debug. Once I am able to reproduce the issue in the production environment, I try to reproduce it in my development environment as well. Oh boy, it is fun if that is not possible. If I can, however, I will usually start by looking at the per-request profiling included in the page source, with output like this:

0.0558 8.5M Connected to database 0 at localhost
0.0562 8.5M Query sandwiki (14) (slave): SELECT /* SqlBagOStuff::getMulti Nike */ keyname,value,exptime FROM `bw_objectcache` WHERE keyname = 'sw:messages:fi'

Here we see that it takes 56 milliseconds before MediaWiki even connects to the database, and the first thing it does is to load messages for the current user language. What usually follows is old-style debugging, where I add echo and var_dump statements until I have understood what is happening and what is inefficient. After that, the creative phase begins: finding a way to make it faster. Usually there is some sort of bug in the code that causes it to do unnecessary work. Bad performance is rarely caused by slow algorithms. This kind of makes sense: the datasets we are processing are usually small, and when they are bigger, the code is usually written in an efficient way in the first place.
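Reading such output by eye works for a handful of lines, but with hundreds of entries it helps to compute the gaps between them. Here is a minimal Python sketch, assuming each entry is one line of the form “elapsed seconds, memory, message” as in the sample above; the function names are made up for illustration.

```python
def parse_debug_log(text):
    """Turn 'seconds memory message' lines into (seconds, memory, message)."""
    entries = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        try:
            seconds = float(parts[0])
        except ValueError:
            continue  # not a profiling line
        entries.append((seconds, parts[1], parts[2]))
    return entries

def slowest_gaps(entries, top=3):
    """Return the largest time gaps, each paired with the entry that ended it."""
    gaps = []
    previous = 0.0
    for seconds, _memory, message in entries:
        gaps.append((seconds - previous, message))
        previous = seconds
    return sorted(gaps, reverse=True)[:top]

SAMPLE = """\
0.0558 8.5M Connected to database 0 at localhost
0.0562 8.5M Query sandwiki (14) (slave): SELECT /* SqlBagOStuff::getMulti Nike */ keyname,value,exptime FROM `bw_objectcache` WHERE keyname = 'sw:messages:fi'
"""

for gap, message in slowest_gaps(parse_debug_log(SAMPLE)):
    print(f"{gap * 1000:.1f} ms before: {message[:40]}")
```

A large gap only says that time passed before the entry was logged, not what consumed it, so this narrows down where to put the echo and var_dump statements rather than replacing them.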

I love performance tuning, but I have to be prudent and pick the right things to optimize, because it is also a great time sink, and as a busy person I am entitled only to a few time sinks at a time.
