Category Archives: MediaWiki

New UIs in MediaWiki Translate extension

I’m not a designer. Yet, I am a designer. During the many years of developing the Translate extension, I have done about every task that belongs to a software project: coding, translating, documenting, testing, system administration, marketing and user interface (UI) design among them. My UI design skills are limited to personal interest and one university course. But I try to pay attention to the UIs I create, and I listen to feedback. For once we got some good feedback about the issues in the current UIs and some suggestions on how to improve them.

Based on this feedback I have made two significant changes to Special:Translate – the main translation interface of the Translate extension. The first is to split the page into a few different tasks: translating, proofreading, statistics and export. I implemented these as tabs. Typically the user starts from language statistics and selects the project he wants to translate or proofread. This has the following benefits:

  • The tasks are clearly separated: users can see at a glance what can be done with the interface.
  • Switching between tasks is seamless: previously there was no easy way to go back to language statistics from translating or proofreading.
  • Fewer options are visible at a time: the UI just looks nicer and takes less space.

The second change is an embedded translation editor. This feature is still in beta, and if we get enough positive feedback about it, we will switch over from the old popup-based editor. You can test the editor by going to Special:Translate and double-clicking the text you want to translate. This should prevent the hassle of moving and resizing dialogs. On the other hand, the editor moves on the screen when you advance to the next message, and it does not stand out well from the surrounding context. I’m investigating if and how we can mitigate these issues. I’ve already changed some styling to make the editor stand out more and the whole table appear less heavy. As a bonus the embedded editor feels faster, because I’ve added some preloading: when you save your translation and go to the next message, it shows up instantly because it has already been loaded.

Exploring the state(s) of open source search stack supporting Finnish

In July 2011, before starting my Wikimedia job, I completed my master’s thesis. I finally spent some time polishing and submitting it, which means that I will graduate!

In my thesis I investigated the feasibility of using a Finnish morphology implementation with the Lucene search system. With the same Lucene-search package that is used by the Wikimedia Foundation, I built two search indexes: one with the existing Porter stemming algorithm and the other with morphological analysis. The corpus I used was the current text dump of the Finnish Wikipedia.

Finnish is among the group of languages with relatively rich and extensive morphology. For you English speakers, this means that instead of using prepositions, our words actually change depending on the context they are in. This makes exact pattern matching in searching mostly useless, because it only matches a fraction of the inflected forms. In Finnish, nouns, verbs and adjectives can each have over a thousand different forms when combining all the cases, plural markers, possessive suffixes and other clitics.

Simple stemmers have no or very limited vocabulary; they strip letters off words according to rules. A morphological analyser, by contrast, comes with an extensive word list and can find all the possible interpretations of a given inflected word – and only those. The difference is sketched below. The morphology I used is based on the Omorfi interpretative finite state transducer, which returns the basic dictionary forms of the inflected words given as input. The transducer was brand new: Omorfi is the first open implementation of Finnish morphology.
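Here is a minimal sketch of that difference in PHP – the stripping rules and the word list are tiny made-up examples, not Omorfi or the Porter stemmer:

    <?php
    // A stemmer has no vocabulary: it just strips suffixes by rule.
    function naiveStem( string $word ): string {
        return preg_replace( '/(issa|ssa|lla)$/u', '', $word );
    }

    // A morphological analyser looks the inflected form up in a lexicon
    // and returns every possible dictionary form, and only those.
    function analyse( string $word, array $lexicon ): array {
        return $lexicon[$word] ?? [];
    }

    $lexicon = [
        'taloissa' => [ 'talo' ],       // "in the houses" -> "talo" (house)
        'teiden'   => [ 'tee', 'tie' ], // genitive plural of "tee" and "tie"
    ];

    var_dump( naiveStem( 'taloissa' ) );       // "talo" -- rule happens to work
    var_dump( naiveStem( 'teiden' ) );         // "teiden" -- rule misses it
    var_dump( analyse( 'teiden', $lexicon ) ); // [ "tee", "tie" ]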

From a technical perspective, I came up with seven requirements for the new algorithm and its implementation (thanks to help from Roan and Ariel at Wikimedia) before it can be deployed on Wikimedia sites:

  1. it has to be open source,
  2. the code must be reviewed,
  3. the performance should be on par with the current system,
  4. it must be stable, no crashing or bugs requiring reindexing whole wikis,
  5. it must be easily installable with dependencies,
  6. searching must not be harder and the search interface must not change,
  7. it must return improved search results.

Now I will tell you how well it met these requirements.

  1. Omorfi and the lookup utility I use to drive the transducer are both open source (GPL and Apache).
  2. Code review might be tricky due to lack of resources at Wikimedia. However, we’re not at that stage yet.
  3. Indexing time is five to ten times slower, but searches are about as fast, and the search index size grew only by 10 to 20 percent. Since indexing is done only once, that’s not a big deal. The speed can be improved, too: the lookup utility is not optimized.
  4. I got some out-of-memory errors and crashes while developing the system – the components I used were very new, and I was usually their first user.
  5. The lookup utility is a simple Java library and the transducer is just a file – easy to install or bundle.
  6. The search syntax and interface have not changed at all.
  7. And the most important point: the quality of search results. The Wikimedia Foundation provided me with a corpus of actual search queries; I ran them on both indexes and analysed the variations in the results. I got very mixed results here, with many searches performing significantly better and many significantly worse. This is probably explained by a major mistake I found in my own implementation: the alternatives proposed by the morphology sometimes got full weight when they matched the searched keyword. For example, searching for tee (tea) returned many pages which contained the inflected form teiden, which can be the genitive plural of either tee or tie (road), or the word teesi (thesis), which was interpreted as tee with a possessive suffix (your tea). The problem could be solved by marking the interpreted words with a % prefix, so that they would not get as much weight as real exact matches in the document (see the sketch below). I was not able to execute this fix during my thesis, but it would be the first thing to try among the ample possibilities for further research.
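To illustrate the proposed fix, here is a hedged sketch – not the actual Lucene-search code – of indexing both the surface form and its prefix-marked lemmas:

    <?php
    // Index the exact surface form at full weight, and mark every
    // analyser-produced lemma with "%" so it can be weighted lower.
    function indexTokens( string $surface, array $lemmas ): array {
        $tokens = [ $surface ];
        foreach ( $lemmas as $lemma ) {
            if ( $lemma !== $surface ) {
                $tokens[] = '%' . $lemma;
            }
        }
        return $tokens;
    }

    // "teiden" is indexed as [ "teiden", "%tee", "%tie" ], so a query
    // for the exact word "tee" no longer matches it at full weight.
    print_r( indexTokens( 'teiden', [ 'tee', 'tie' ] ) );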

Even with the problems I encountered in my research, I believe this approach is viable and could – with further improvements – replace the current stemmer algorithm.
This was the first time that open content, an open search engine and an open Finnish morphology were put together.

The thesis (PDF) is written in Finnish, but I’m happy to tell you more about it. Just ask!

New translation memories near you soon

In the last sprint I developed a translation memory server in PHP almost from scratch. Well, it’s not really a server: it runs inside MediaWiki during client requests. It closely follows the logic of tmserver from translatetoolkit, which uses Python and SQLite.

The logic of how it works is pretty simple: you store all definitions and translations in a database, and then you can query suggestions for a given text. We use string length and fulltext search to narrow down the initial list of candidate messages. After that we use a text similarity algorithm to rank the suggestions and do the final filtering; a simplified sketch follows. The logic is explained in more detail in the Translate extension help.
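This is roughly what the query flow looks like, assuming the candidate rows have already been narrowed down by the database’s fulltext search; the threshold value and helper names are illustrative, not the extension’s actual ones:

    <?php
    // Rank translation memory candidates for $text; each $row holds a
    // source 'definition' and its stored 'translation'.
    function suggest( string $text, array $rows, float $threshold = 0.75 ): array {
        $len = mb_strlen( $text );
        $suggestions = [];
        foreach ( $rows as $row ) {
            $candLen = mb_strlen( $row['definition'] );
            if ( $len === 0 || $candLen === 0 ) {
                continue;
            }
            // Cheap length filter: strings of very different lengths can
            // never be similar enough, so skip the expensive comparison.
            if ( min( $len, $candLen ) / max( $len, $candLen ) < $threshold ) {
                continue;
            }
            // Expensive step: edit distance, normalised to a 0..1 quality.
            $distance = levenshtein( $text, $row['definition'] );
            $quality = 1 - $distance / max( $len, $candLen );
            if ( $quality >= $threshold ) {
                $suggestions[$row['translation']] = $quality;
            }
        }
        arsort( $suggestions ); // best suggestions first
        return $suggestions;
    }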

PHP provides a text matching function, but we (Santhosh) had to implement a pure-PHP fallback for strings longer than 255 bytes or strings containing anything other than ASCII. The pure-PHP version is much slower, although that is offset a little because it operates on characters instead of bytes, so multibyte text needs fewer comparison steps. More importantly, it works correctly even when not handling English text. The faster implementation is used when possible. Before we optimized the matching process, it was the slowest part; after those optimizations the time is now bound by database access. Both functions implement the Levenshtein edit distance algorithm.
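A minimal sketch of that dispatch, with an assumed character-based fallback (the real fallback is Santhosh’s code, not this one) – PHP’s built-in levenshtein() works on bytes and refuses strings longer than 255 bytes:

    <?php
    function editDistance( string $a, string $b ): int {
        $asciiOnly = !preg_match( '/[\x80-\xff]/', $a . $b );
        if ( $asciiOnly && strlen( $a ) <= 255 && strlen( $b ) <= 255 ) {
            return levenshtein( $a, $b ); // fast built-in C implementation
        }
        return mbLevenshtein( $a, $b );
    }

    // Plain dynamic-programming Levenshtein over UTF-8 characters.
    function mbLevenshtein( string $a, string $b ): int {
        $ac = preg_split( '//u', $a, -1, PREG_SPLIT_NO_EMPTY );
        $bc = preg_split( '//u', $b, -1, PREG_SPLIT_NO_EMPTY );
        $m = count( $ac );
        $n = count( $bc );
        $prev = range( 0, $n );
        for ( $i = 1; $i <= $m; $i++ ) {
            $cur = [ $i ];
            for ( $j = 1; $j <= $n; $j++ ) {
                $cost = $ac[$i - 1] === $bc[$j - 1] ? 0 : 1;
                $cur[$j] = min( $prev[$j] + 1, $cur[$j - 1] + 1, $prev[$j - 1] + $cost );
            }
            $prev = $cur;
        }
        return $prev[$n];
    }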

End users won’t see much difference. Wanting a translation memory on Wikimedia wikis was the original reason for reimplementing it in PHP, and in the coming sprints we are going to enable it on wikis where Translate is enabled (currently meta-wiki, mediawiki.org, incubator and wikimania2012). It is just over 300 lines of code [1] including comments, plus the database table definitions [2].

Now, having explained what was done and why, I can reveal the cool stuff, if you are still reading. There will also be a MediaWiki API module that allows querying the translation memory, and there is a simple switch in the configuration to choose whether the memory is public or private. In the future this will allow querying translation memories from other sites, too.
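To give an idea of what this could look like, here is a hypothetical configuration snippet – the variable, action and parameter names below are my illustrative assumptions, not the final interface:

    // Assumed switch: expose the translation memory through the API.
    $wgTranslateTranslationMemoryPublic = true;

    // A public memory could then be queried with a request such as
    // (action and parameter names are also assumptions):
    // api.php?action=translationmemory&language=fi&text=Save%20page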

Putting another pair of eyes to good use

This blog post is about the MediaWiki Translate extension and explains how we came to develop a new set of translation review tools.

One of the core principles at translatewiki.net is that the time of translators is a precious resource. We show our appreciation to translators by providing tools that let them concentrate 100% on the task at hand, and let the (volunteer) staff handle the boring tasks.

It is well known that good translators take pride in their own and others’ work. This may result in an urge to review all translations made by other translators. I consider myself that kind of translator. The good news is that in recent months the Translate extension has become massively better at supporting the reviewing of translations. Some weeks ago we added a new listing where you can click a button to accept a translation. When the list is empty, you know that all translations have either been made or fixed by you, or you have accepted someone else’s translations.

This is all nice and dandy, but it is not practical if you want to review new translations as they come in. You’d either have to watch the list of recent translations or subscribe to its feed. From there you can get to the individual messages, but it takes many clicks to reach the page with the button to accept the translation. And iterating over each of the hundreds of message groups to see if there is anything to accept is not practical either.

The solution: a special message group which lists the recent translations in a given language. Only some of the translators are allowed to review; on the right you can see a screenshot of how it looks – click to enlarge. One could bookmark this page and have a look at it a few times per week. For me this is a real time saver, and I’m sure others will find it useful too.

To get this implemented, I originally anticipated that some heavy refactoring was needed and estimated about one and a half days for it. In the end it took only about half a day – I was positively surprised how painless the refactoring was. The problem was that the class which fetches all the messages from the database assumed they all belonged to the same MediaWiki namespace. At translatewiki.net we have over ten namespaces for translations of different projects, so it had to be fixed. I’d say this is a prime example of Donald Knuth’s saying that premature optimization is the root of all evil.

In the future we need to link this page from suitable places to make the feature discoverable, and also to make sure that more than the current 66 of our 3000+ translators get the right to use it.

MediaWiki grows up – no more playing with Lego

User interface messages that are built from pieces of text, or that leave some parts out of a message, are what are called Lego messages. The end result of this practice is not a glittering Lego castle. The end result is more like a shady shack with a leaking roof.

Major Lego message usage in MediaWiki will soon be in the past, as I have refactored the MediaWiki logging system and brought the code up to what we expect from internationalisation today. Instead of snippets like “moved X to Y”, translators can now work with full sentences like “U moved X to Y”. That makes it possible to change the message to “Page X was moved to Y by U”. Consider the languages where sentences don’t begin with the subject: for them the old way must have been as awkward as “moved U X to Y” would be in English.

There is more: translations can now take the gender of the user who performed the action into account. English almost always gets away with not taking sides in interface messages, but that is not the case in many other languages.

We already have many translations using these new possibilities:

  • English: Nike moved page Hapsen to Saalen
  • Welsh: Symudwyd y dudalen Hapsen i Saalen gan Nike
  • Russian (male): Nike переименовал страницу Hapsen в Saalen
  • Russian (female): Никa переименовала страницу Hapsen в Saalen
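For the curious, this is roughly what such a message looks like in MediaWiki’s message files; the key follows the new log system, but the exact strings here are illustrative:

    // In MessagesEn.php -- English verbs do not inflect for gender:
    'logentry-move-move' => '$1 moved page $3 to $4',

    // In MessagesRu.php -- $2 carries the gender of the acting user,
    // and the {{GENDER:}} parser function picks the correct verb form:
    'logentry-move-move' => '$1 {{GENDER:$2|переименовал|переименовала}} страницу $3 в $4',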

WebWorld 2011 – wrap-up

Unfortunately my time machine is broken, so instead of telling you what cool features are coming, you will have to bear with a summary of what I did during the WebWorld sprint.

As you know, the UserBase wiki uses the Translate extension to translate wiki content. I can now cross a common feature request off my todo list: moving and deleting translated pages. Since each language has its own page, and the system uses even more pages behind the scenes, the normal move and delete actions of MediaWiki were insufficient. With some hackish code I was able to hijack those actions and replace them with my own. It is now possible to move or delete a page with all of its translations in a few clicks. You can also choose to delete only one translation, which is useful if the translator accidentally used the wrong language.

For those who are addicted to stats, Special:LanguageStats now has a row which states the overall translation coverage. The number can be off by a small amount for a reason unknown to me. I have to investigate why and fix it, since statistics never lie :)

And there is one more nicety regarding Special:MyLanguage, which takes care of redirecting users to their preferred language translation of a page, assuming such a translation exists. If the given page does not exist at all, the link using Special:MyLanguage is now red just like normal links to non-existing pages are.

The sprint itself was productive. There were problems that needed to be solved, and I think we all did a good job tackling the many issues. We also managed to create some new problems: Ingo needs to learn how to not have toys stuck in inconvenient high places :) And I like the logo very much :)

WebWorld 2011 – MediaWiki and UserBase

Greetings to the Planet KDE readers, where this should be my first blog post. My nickname on the net is Nikerabbit. I’ve been developing MediaWiki for many years, and I’m the author of the Translate extension. The Translate extension is used at translatewiki.net, a wiki site and community that does open source software translation. The Translate extension can also be used to translate wiki pages, which is how it is used on userbase.kde.org.

That is also the reason why I am now here at the KDE WebWorld sprint. I have updated the Translate extension on UserBase and fixed a number of bugs in it – mostly minor things that can confuse normal users, who are no longer automatically directed to pages that are of no use to them. And UserBase now has its own translation memory. It’s the same kind we use at translatewiki.net: a very simple one provided by the translate toolkit. Currently the two are independent of each other, but maybe in the future we can find a way to use each other’s translations.

While working on UserBase issues here, I realized some problems in MediaWiki itself. First of all, the Translate extension is supposed to be compatible with two MediaWiki versions: the latest stable version (1.16, used on UserBase) and the latest development version (1.19alpha, used on translatewiki.net). I am not going to talk about why there are two unreleased versions plus a third one in development. Anyway, a lot of development has happened between those versions, including major new features and big rewrites. I spend considerable time keeping the Translate extension compatible with 1.16 while developing new features for it. It also makes the code more complex and doubles the testing required. It is not made easier by the fact that not all changes are documented in the appropriate places. For example, there is no mention in hooks.txt that the parameters of the SkinSubPageSubtitle hook were changed at some point.

Another thing is that MediaWiki has so many ways to tweak it, most of them undocumented and not easily discoverable. Take all the messages in the MediaWiki: namespace, some of which may even be empty by default: there is no way to find a suitable message unless you already know that such a message exists and what it is called. The same applies to configuration variables. They are at least documented in DefaultSettings.php, and some also on mediawiki.org, but again it is hard to find the specific thing that could help you (if it even occurs to you that such a thing might exist).

This means that people can’t really find out everything you can do with MediaWiki and they either end up not doing some things at all or creating something new from scratch.

translatewiki.net celebrates – so do I

Oh boy, time flies. Translatewiki.net turns six next Saturday. This is the first time we celebrate its birthday. How did it happen?

It was 2005, my last year at upper secondary school, when I set up a MediaWiki for myself to do some school work. I was 17, and in the fall of the same year I started studying at a university. Can you imagine how awkward it was to attend university under the age of majority (18 years in Finland)? Anyway, I think the wiki was originally called Nukawiki, then Betawiki and finally translatewiki.net. The wiki has gone through many updates. It probably started with MediaWiki 1.4, whose release notes boast that the user interface language can be changed by the user. It has also gone through many computers, starting from my laptop and moving gradually to more powerful, more dedicated servers.

Already before the summer of 2006, when I started my obligatory six-month military service, I was using the wiki to translate MediaWiki into Finnish and fix i18n problems. In 2006 we started inviting other translators to join. In February 2007 I started translating FreeCol into Finnish, and soon they moved all their translation-related activities into our wiki. One of the initial translators was Siebrand, who has had an enormous influence on the direction the project has taken since he joined.

In other words, translatewiki.net began as a small hobby project for an entirely different purpose, then I used it to scratch a personal itch, and nowadays it is a thriving community with thousands of members. We are already huge by many metrics, we are still growing, and there don’t seem to be any limits to our size. I just cannot imagine how many people the work of translatewiki.net has impacted. For me this means an opportunity, but more importantly a challenge. How do we improve our service while scaling up? How can we provide better tools for translators, for ourselves and for the projects that use us? We have been successful thus far because we have been very efficient – it is almost scary how few people (albeit very dedicated ones) can keep everything running smoothly.

Translatewiki.net has had, and still has, a huge impact on my life, and not just because it is a huge time sink for me. It is a manifestation of the many skills I’ve learned during my life. It feels wrong to say that it is my hobby, because sometimes it feels that studying is the hobby here. Nevertheless my master’s thesis is nearing completion. I already have a job in mind, and I can’t say that translatewiki.net didn’t affect that.

I’m sincerely grateful to each and every one who has helped translatewiki.net become what it is today.

Translatewiki.net is happy

Many of the issues that have been annoying us all at translatewiki.net have been fixed lately. To show my appreciation on behalf of translatewiki.net, I’d like to highlight these fixes.

Issue one: saving messages on talk pages fails. If you pressed “Save” just once, you got an error message about broken session data; the reply was saved only if you clicked “Save” a second time. I don’t know how many messages we lost due to this. I lost a couple, because after replying to a thread I went off to do other things. Many of my replies were delayed, because I didn’t notice immediately that the save had failed. What was worse, usually one had to scroll down the page to even see the error message! I’m very happy that it is fixed now. Many thanks to Andrew Garrett!

Issue two: portions of changes were not shown at all when viewing differences between two versions. Not as annoying as the first item, but this was still nasty and confusing. I submitted a test case for this bug in the wikidiff2 extension, and fortunately Tim Starling was able to reproduce it. Soon after, he committed a fix. Thanks Tim!

Issue three: message groups for projects which store all translations in a single file (like Pywikipediabot) were stuck in the “has changes” status. This bug only annoyed the project leaders of translatewiki.net. After some encouragement Robert Leverington came up with a fix, and also found a serious bug in the code which determines whether there have been any changes to the messages. The fix affects all message groups. To Robert: good catch and big thanks.

Issue four: Microsoft® Translator – one of the translation services we use to suggest translations to our translators, next to Google Translate, Apertium and our own tmserver – is often incorrectly identified as being down. Brian Wolff and Sam Reed have helped to investigate the issue, but it is not yet fully fixed.

Finally, many thanks to those who help us keep translatewiki.net running from day to day – you are many. A special thanks goes out to netcup.de – Webhosting, vServer, Servermanagement, who have provided us with their flagship product “vCloud 8000”, which allows us to serve our pages faster than ever before. We need lots of help with challenges that range from coding to writing and design. Don’t hesitate to ask how you could help us!

Chatty bots and minimizing disruptions in continuous integration

Those who use IRC are probably familiar with bots. Essentially, a bot is a client which is not a human. This time I’m talking about a specific kind of bot; let’s call them reporting bots. Their purpose is to alert the channel about recent happenings in (near) real time. An open source project channel usually has at least a bot that reports every new commit and every bug report filed.

The translatewiki.net channel #mediawiki-i18n also has reporting bots. We have one CIA bot reporting any i18n-related commit to any of our supported projects. I have to mention that the ability to have our own ruleset for picking and formatting commits is just awesome. There is also another bot, rakkaus (“love” in Finnish).

Its purpose is to report issues with the site. To accomplish this we pipe the output of error_log, which contains PHP warnings, database errors and MediaWiki exceptions, to the bot. This worked mostly fine, except that the bot would flood everyone when the log grew fast. A few days ago it went too far. We had a database error (a deadlock), which was reported by the bot… including the database query… which happened to contain a few hundred kilobytes of serialized and compressed data – in other words, binary garbage. Guess how happy we were to see the channel full of that?

Okay, something had to be done. And so I did. I wrote a short PHP script – sketched after the list – which:

  • Reads new data every 10 seconds
  • Takes the last line, truncates it to a suitable length, forwards only that snippet, and notes how many log lines were skipped
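Since the real script is not committed yet, here is a hypothetical sketch of it; the file paths and the sendToBot() transport are assumptions for illustration:

    <?php
    // Stand-in for whatever IRC transport the real bot uses.
    function sendToBot( string $msg ): void {
        echo $msg, "\n";
    }

    $log = '/var/log/php/error_log';  // assumed log location
    $state = '/tmp/error_log.offset'; // remembers how far we have read

    while ( true ) {
        $offset = (int)@file_get_contents( $state );
        clearstatcache();
        $size = filesize( $log );
        if ( $size < $offset ) {
            $offset = 0; // the log was rotated, start over
        }
        if ( $size > $offset ) {
            $fp = fopen( $log, 'r' );
            fseek( $fp, $offset );
            $lines = explode( "\n", trim( stream_get_contents( $fp ) ) );
            fclose( $fp );

            // Forward a truncated snippet of only the newest line, plus
            // a note about how many other lines were skipped.
            $snippet = substr( end( $lines ), 0, 400 );
            $skipped = count( $lines ) - 1;
            if ( $skipped > 0 ) {
                $snippet .= " [$skipped more lines in the log]";
            }
            sendToBot( $snippet );
        }
        file_put_contents( $state, (string)$size );
        sleep( 10 ); // poll every 10 seconds
    }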

And now everything is nice again :) The script is not yet in SVN, but I will commit it later.

By the way, this bot is half of the reason why we might complain to you within a few minutes after you commit code which breaks something in MediaWiki. Fortunately MediaWiki has taken steps to prevent committing code which doesn’t even compile, so we can skip some of the needless mistakes caused by carelessness.

Because we care about the users using translatewiki.net, we want to minimize any disruptions. The measures we have taken are:

  • Even though we update the code often, we can roll back easily. With small updates it is easy to identify the cause of a problem, and chances are it gets fixed very fast too.
  • I personally do code review, trying to spot most issues before they reach us.