Tag Archives: Search

FOSDEM talk reflections 2/3: docs, code and community health, stability

This is the second post about FOSDEM 2013; see 1/3: I18n in the WEB, Mozilla i18n and L20n for the first. Links to the abstracts in the headers.

Open Sourcing Documentation

Don’t keep documentation to yourself, release it with an open license. Others might see the forest from the trees when you are too close to the problem. Translators can translate the documentation, but this needs proper tools, something which was not mentioned in the presentation.

Also mentioned in the presentation, webplatform.org (greetings sent to Ryan Lane who helped to build it) and the problems of Mozilla having to support both webplatform and its own MDN, where the latter has wider scope and more restrictive license than the former.

Coping with the proliferation of tools within your community

FOSDEM entrance

Entering the MediaWiki community and contributing to MediaWiki is hard for many reasons

I got the impression that XWiki is the everything and a kitchen sink of wikis. It has several nice points related to having everything in one place (one wiki):

  • Only one place to have account and one place to sign in.
  • Can search everything like bugs, commits, IRC logs and documentation at once.

In my opinion it’s a nice starting point for projects, but in the end it’s also about the quality of individual tools that matters when projects grow. I don’t see how MediaWiki could replace Gerrit or Bugzilla with something provided by XWiki.

XWiki is among one of the candidates to take over MediaWiki if MediaWiki is not able to revitalize its community by improving the extension and gadget development ecosystem. XWiki being written in Java can be a deterrent for some over PHP (but the opposite is certainly true, too).

How we made the Jenkins community

It’s not enough to lower the barriers, the barriers must be removed wherever possible. While Wikipedia is/was quite open and easy to use (not going too deep into that), MediaWiki development had and still has many barriers. It was not long ago that getting commit access took ages and you had to present your CV. Nowadays commit access is easier to get because we use Gerrit to review code before it is merged. But Gerrit is quite a complex beast and we still don’t have GitHub integration to accept drive-by patches. Lack of documentation and difficulty of getting patches reviewed is also a problem, not to mention the lack of a shared vision where MediaWiki should go.

Another problem that I think is not being realized is that MediaWiki code, while not as bad as it was, is in my humble opinion not improving fast enough. MediaWiki core is a huge monolithic piece of code, entangled in many places to the extent that it is impossible to extend without refactoring the affected code first. This greatly affects the innovation and creation of new extensions and thus is a problem for the MediaWiki ecosystem. The core code needs to become cleaner and more modular. The modularity and quality of Jenkins APIs seemed to be a major reason for the applaudable growth of its community.

Wish: Can we have an integrated search that provides results from mediawiki.org, Bugzilla, IRC logs, mailing lists and other relevant places and not just a service on labs known to the lucky few? There is also lots of stuff in etherpads, Google documents and even on blogs that will probably never be searchable.

Improving Stability of Mozilla Products

Mozilla booth at FOSDEM

Something we lack compared to Mozilla is the crash metrics (note the diversity at their booth)

This talk made me wish for information on what actual problems MediaWiki users and developers are facing. Unfortunately, due to the nature of PHP, fatal errors and warnings can be caused by so many things, including syntax errors users made in their LocalSettings.php, that the data would need lots of filtering. There are also privacy concerns, but it should be possible to do something by collecting exceptions and JavaScript errors and aggregating them for some selected few to see. This would help to prioritize bugs, as Bugzilla and IRC channels are bad indicators of how many people are actually encountering a particular issue.

I’d like to be optimistic, but the best thing we have so far is translatewiki.net, which collects PHP errors, warnings and exceptions as well as JavaScript errors. However, aside from the PHP warnings and notices that are announced on IRC in #mediawiki-i18n, that information is only available to me and few others, and after all translatewiki.net is just a minor piece of the total system. One of my personal wishes is to have more priority given to the issues affecting translatewiki.net, as experience learns us that issues discovered on translatewiki.net, will most often also surface as issues in Wikimedia wikis. Mozilla can state things like: This bug affects 10% of all active users or hundred million users every day. Can we please have that, too?

It would be so awesome if the Wikimedia Foundation could release similar kinds of information to the wider public.


IonMonkey: Yet Another JIT Compiler for JavaScript?

Why is yet another JIT compiler needed? Because the new one is better! Sometimes it makes sense. IonMonkey is based on well-understood techniques like static single assigment.

WebRTC: Real time web communication

The open source technology stack to kill Skype and Google Hangout is in process, but will still take a while to reach a browser near you. Check out http://reveal.rs.af.cm/ with a recent version of Google Chrome for a preview of what can also be accomplished with WebRTC.

PDF.js – Firefox’s HTML5 PDF Viewer

This is something that the Wikimedia Foundation could perhaps use too, though limited browser support is an issue.

The presentation had some nice tidbits on how the PDF standard includes everything and a kitchen sink like a 3D model viewer. It also explained what is easy to port to HTML5 (many things can be done with Canvas specification) and what is not. HTML5 specifications are going to see additions due to this work.

Changesets evolution with Mercurial

A cool idea: Tracking the changes to the commit history itself, so you can alter the history without deleting any parts of it. Questionable though if this will see wide spread use, and it’s only (coming to) in Mercurial now, not in Git.

Efficient translation: Translation memory enabled on all Wikimedia wikis

I am pleased to announce that a long development project has been released and taken into production. We now have translation memory services enabled on Wikimedia projects (since August 28, in our last sprint).

The translation editor on Wikimania 2013 wiki shows a suggestion from Wikimania 2012 wiki

Users translating for Wikimania 2013 are provided with suggestions from 2012 (right arrow); a click is enough to copy it to the text area (down arrow). See also on Meta, in English interface.

Translation memory is a feature which provides likely translations for a text based on previous translations of similar texts: translators use them to speed up their work and to increase consistency (more in Wikipedia).

If you have translated at translatewiki.net or usebase.kde.org, you may have already noticed it. The translation memory on Wikimedia wikis has been filled with existing translations made with the Translate extension in WMF projects including Meta, mediawiki.org and Wikimania wikis.

Translators from all Wikimedia projects using the Translate extension can now work more efficiently, sharing their work and experience across the boundaries of wikis. Translators on Wikimania 2013 wiki can now find translations already provided for the previous year (see screenshot) and be quicker without sacrificing quality and consistency. Translators of technical documentation on mediawiki.org can benefit from the translation of Wikimedia terminology on Meta-Wiki and vice versa.

Technical challenges

A translation memory service has been in use at translatewiki.net for years, and the process of getting it enabled on Wikimedia was started about a year ago.

Naturally WMF operations is a very different thing from the small shared server translatewiki.net runs on. Yet, there were many unexpected turns that caused delay. The phases here are named retroactively.


Originally we used the tmserver component from the translate toolkit. It had its own problems: it was hard to set up, it was an external dependency and the SQLite database engine it used was problematic for updates – it failed if there were multiple processes accessing at the same time. Sometimes the included standalone webserver got stuck and the other option, WSGI, didn’t play nicely with our lighttpd web server.

I did lots of research with Siebrand trying to find other open source translation memories, but failed to find anything that had any active or recent development.


The next step was the standalone version. To avoid external dependencies, to make it usable in the WMF infrastructure, and not to require separate services, I started porting the tmserver algorithm from Python to PHP. At the same time I was able to take advantage of MediaWiki’s database abstraction code, which in theory should make it work on SQLite, MySQL and PostgreSQL. At the moment, however, only MySQL is tested and in use at translatewiki.net.

Performance of this new system was mostly the same, though it’s a constant fight for not letting the Levenshtein algorithm, used for ranking in the core, get exponentially slow. The major new feature was the support for shared databases, so that multiple wikis can use the translations made in other wikis for suggestions. A lot of time was spent on this, and also on making the initial bootstrap efficient with use of multiple threads.


When we thought everything was ready for deployment on Wikimedia wikis, we waited for feedback from ops and finally we got a simple, yet unwanted reply: “Full-text search with MySQL cannot be used in the WMF cluster (because it depends on the problematic MyISAM storage engine)”. Yay. Back to the drawing board.
Since everything at Wikimedia is using a heavily modified Apache Lucene for full text search, the same was obviously suggested as a solution. So started the development of phase3; if the past predicts anything, this will have been the final rewrite.

I decided not to touch Wikimedia’s version of Lucene, as I already had lots of experience on it due to playing with it for my Master’s thesis (English summary on my blog), and decided to use standard Lucene with a Solr frontend. Solr simplified many things and the development was swift using the PHP Solarium library.

In fact, the most difficult “feature” to develop was the Puppet configuration for Jetty and Solr, and testing it on WMF Labs. So I learned to write Puppet configuration files from scratch and did it mostly myself. Oren Bochman helped a lot with the Labs testing phase. The last hurdle was backporting recent packages of Solr and its dependency Jetty for the Ubuntu that Wikimedia was using on Labs and in production. Luckily I was fortunate enough to get quick help from ops, so I didn’t have to also learn how to make Ubuntu packages.

So somewhat ironically, we went from separate services to standalone and again to a separate service. The first phase is long forgotten, but the standalone and Solr versions complement each other. The former is enabled by default for anyone using the Translate extension, the latter provides superior scalability and hopefully in the future even better suggestions.

Fact is that the Levenshtein based ranking is not the state of the art for translation memories[1] and does not compare to the state of art i18n we are doing with MediaWiki and translatewiki.net.

On to the next adventure!

[1] Paper abstract (full text behind paywall; DOI:10.1007/3-540-39965-8_14).

-- .