Monthly Archives: February 2012

My take-away from Open Advice

I told my friend Nemo that I have been reading the recently published Open Advice book and he basically forced me to write a review of it. This isn’t really a review, but rather something the book made me think about. When I started reading the book I expected to get some simple tips on how I could do things better or on new things I could do. Well, I didn’t get those, but I got something else.

The book consists of many short stories of open source from different starting points – each story is written by a different author. It was nice to notice that among the writers there were many who I’ve met or at least whose name and work I knew. Most of the stories didn’t tell anything new to me, and the section about translation was annoyingly short of content. The book is worth reading, especially since each story is short, which makes it easy to read.

When I read the following passage in Markus Krötzsch’s Out of the lab, into the Wild, I started thinking:

When a certain density of users is reached, support starts to happen from user to user. This is always a magical moment for a project, and a sure sign that it is on a good path.

I have been developing the Translate extension (and by extension translatewiki.net too) for many years now, but apart from seeing it being used more and more, I haven’t really stopped to think what it means for a software project to grow up and be successful. So I made up some milestones:

  1. You write something for yourself
  2. Other people find it useful and start using it
  3. The users of your software are providing peer to peer help
  4. Other developers are able to take over maintenance and development of the software

Now we have something we can measure. I started writing Translate over five years ago. Some years later there were already tens of translators using it. This year the Translate extension is used in many Wikimedia projects as well as in KDE UserBase in addition to translatewiki.net. Lots of new people need to learn how to use the Translate extension from a management point of view, and more and more often they get an answer not from me but from someone else or by reading the documentation.

So what about step 4? Until very recently Translate has been my world and my world only, apart from some patch contributions. But I have now taken it as my personal goal to change this. And what a lucky person I am! The Wikimedia Localisation Team – which I am a member of – has the development of the Translate extension as one of its major goals. Even better, we are an agile team, which means that each and every developer of the team should be able to do any development task in the team. To achieve this we divide tasks among team members so that nobody works only on their own favourite project. In addition we are explicitly reserving time for knowledge transfer, which happens through code review, proofreading the documentation one of us has written, explicit sessions where a team member covers a topic they know well, and pair programming. This has already been going on for some months and it is not going to stop.

In addition to schooling the other developers in our team, I also plan to keep expanding the documentation, adding more tutorials and organizing tasks suitable for new developers, so that it is easy for interested volunteer developers to start contributing to Translate. Because in the end knowledge is useless if the developer has no reason to develop, and the best reason to develop is to scratch your own itch. I believe those developers are to be found among the users of the Translate extension who have a slightly different and new use case which needs development work.

I haven’t yet finished my plans on the fifth step (world domination), so stay tuned for coming blog posts.

New UIs in MediaWiki Translate extension

I’m not a designer. Yet, I am a designer. During the many years of development of the Translate extension, I have done just about everything related to the development of a software project: coding, translating, documenting, testing, system administration, marketing and user interface (UI) design among those. My UI design skills are limited to personal interest and one university course. But I try to pay attention to the UIs I create, and I listen for feedback.

For once we got some good feedback about the issues in the current UIs and some suggestions about how to improve them. Based on this feedback I have made two significant changes to Special:Translate – the main translation interface of the Translate extension. The first significant change is to split the page into a few different tasks: translating, proofreading, statistics and export. I implemented these as tabs. Typically the user starts from language statistics and selects the project they want to translate or proofread. This has the following benefits:

  • The tasks are clearly separated: users can see at a glance what can be done with the interface.
  • Switching between tasks is seamless: previously there was no easy way to go back to language statistics from translating or proofreading.
  • There are fewer visible options at a time: the UI just looks nicer and takes less space.

The second change is an embedded translation editor. This feature is still in the beta phase, and if we get enough positive feedback about it, we will switch over from the old popup-based editor. You can test the editor by going to Special:Translate and double-clicking the text you want to translate. This should prevent the hassle of moving and resizing dialogs. On the other hand, the editor moves on the screen when you advance to the next message, and it also stands out less from the surrounding context. I’m investigating if and how we can mitigate these issues. I’ve already changed some styling to make the editor stand out more and the whole table appear less heavy. As a bonus the embedded editor feels faster, because I’ve added some preloading. This means that when you save your translation and go to the next message, it will show up instantly because it has already been loaded.

Exploring the state(s) of open source search stack supporting Finnish

In July 2011, before starting my Wikimedia job, I completed my master’s thesis. I have finally spent some time polishing and submitting it, which means that I will graduate!

In my thesis I investigated the feasibility of using a Finnish morphology implementation with the Lucene search system. With the same Lucene-search package that is used by the Wikimedia Foundation I built two search indexes: one with the existing Porter stemming algorithm and the other one with morphological analysis. The corpus I used was the current text dump of Finnish Wikipedia.

Finnish is among the group of languages with relatively vibrant and extensive morphology. For you English speakers, this means that instead of using prepositions, our words actually change depending on the context they are in. This makes exact pattern matching in searching mostly useless, because it only matches a fraction of the inflected forms. In Finnish, nouns, verbs and adjectives can each have over a thousand different forms when combining all the cases, plural markers, possessive suffixes and other clitics.

Simple stemmers have no or very limited vocabulary and they strip letters off the words according to rules. A morphological analyser instead comes with an extensive word list and can find all the possible interpretations of a given inflected word, and only those. The morphology is based on the Omorfi interpretative finite state transducer, which returns the basic dictionary forms of the inflected words given as input. The transducer I used was brand new. Omorfi is the first open implementation of Finnish morphology.

From a technical perspective I came up with seven requirements for the new algorithm and its implementation (thanks to help from Roan and Ariel at Wikimedia) before it can be deployed in Wikimedia:

  1. it has to be open source,
  2. the code must be reviewed,
  3. the performance should be on par with the current system,
  4. it must be stable, no crashing or bugs requiring reindexing whole wikis,
  5. it must be easily installable with dependencies,
  6. searching must not be harder and the search interface must not change,
  7. it must return improved search results.

Here is how well it met these requirements.

  1. Omorfi and the lookup utility I use to drive the transducer are both open source (GPL and Apache).
  2. Code review might be tricky due to lack of resources in Wikimedia. However we’re not at this stage yet.
  3. Indexing is five to ten times slower, but searches are about as fast and the search index size grew only by 10 to 20 percent. Since indexing is done only once, it’s not such a big deal. The speed can be improved, though: the lookup utility is not optimized.
  4. I got some out of memory errors and crashes while developing the system – the components I used were very new and I usually was their first user.
  5. The lookup utility is a simple Java library and the transducer is just a file – easy to install or bundle.
  6. The search syntax and interface have not changed at all.
  7. And the most important point: the quality of search results. The Wikimedia Foundation provided me with a corpus of actual search queries: I ran them on both indexes and analysed the variations in the results they gave. I got very mixed results here, with many searches performing significantly better and many significantly worse. This is probably explained by a major mistake I found in my own implementation. The alternatives proposed by the morphology sometimes got full weight when they matched the searched keyword. For example, searching for tee (tea) returned many pages which contained the inflected word form teiden, which can be the genitive plural of tee or tie (road), or the word teesi (thesis), which was interpreted as tee with a possessive suffix (your tea). The problem could be solved by marking the interpreted words with a % prefix, so that they wouldn’t get as much weight as real exact matches in the document. I was not able to execute this fix during my thesis; however, it would be the first thing to try among the ample possibilities for further research.
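
To make the weighting idea in point 7 concrete, here is a minimal Python sketch. It is not the Lucene-search code from the thesis: the toy ANALYSES table, the way the % prefix is applied and the weight of 0.3 are all assumptions made up for this illustration. Only the tee, teiden and teesi ambiguity comes from the example above.

```python
# Toy illustration of down-weighting morphological interpretations.
# ANALYSES stands in for a real analyser; INTERPRETED_WEIGHT is an arbitrary
# value chosen for the example, not a tuned parameter.
ANALYSES = {
    "teiden": ["tee", "tie"],   # genitive plural of either word
    "teesi": ["teesi", "tee"],  # "thesis", or "tee" with a possessive suffix
}
INTERPRETED_WEIGHT = 0.3


def index_terms(token):
    """Return the surface form plus %-prefixed base forms proposed by the analyser."""
    terms = [token]
    for base in ANALYSES.get(token, []):
        if base != token:
            terms.append("%" + base)
    return terms


def score(query, document_tokens):
    """Full weight for exact matches, reduced weight for interpreted matches."""
    total = 0.0
    for token in document_tokens:
        for term in index_terms(token):
            if term == query:
                total += 1.0
            elif term == "%" + query:
                total += INTERPRETED_WEIGHT
    return total


print(score("tee", ["tee"]))     # exact match in the document
print(score("tee", ["teiden"]))  # match only through an interpretation
```

With this kind of marking, a page that actually contains tee still ranks above a page that only contains teiden or teesi, while the inflected forms remain findable.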

Even with the problems I encountered in my research, I believe this approach is viable and could – with further improvements – replace the current stemmer algorithm.
This was the first time that open content, open search engine and open Finnish morphology were put together.

The thesis (PDF) is written in Finnish, but I’m happy to tell you more about it. Just ask!

New translation memories near you soon

In the last sprint I developed a translation memory server in PHP almost from scratch. Well, it’s not really a server. It’s run inside MediaWiki during client requests. It closely follows the logic of tmserver from translatetoolkit, which uses Python and SQLite.

The logic of how it works is pretty simple: you store all definitions and translations in a database. Then you can query suggestions for a certain text. We use string length and fulltext search to filter the initial list of candidate messages down. After that we use a text similarity algorithm to rank the suggestions and do the final filtering. The logic is explained in more detail in the Translate extension help.
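
The lookup described above can be sketched roughly like this in Python. This is only an illustration of the filter-then-rank idea, not the code in the Translate extension: the in-memory MEMORY list, the length bounds, the shared-word check (standing in for the database fulltext search) and the 0.6 threshold are all assumptions.

```python
# Minimal translation memory lookup: cheap filters first, similarity ranking last.
# All data and thresholds here are illustrative assumptions.
MEMORY = [
    ("Save the page", "Tallenna sivu"),
    ("Save the file", "Tallenna tiedosto"),
    ("Delete the page", "Poista sivu"),
    ("Log in", "Kirjaudu sisään"),
]


def levenshtein(a, b):
    """Edit distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale the edit distance to a value between 0 and 1."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def suggest(text, memory=MEMORY, min_similarity=0.6):
    words = set(text.lower().split())
    suggestions = []
    for source, translation in memory:
        # Filter 1: skip sources whose length is too different.
        if not 0.5 * len(text) <= len(source) <= 2 * len(text):
            continue
        # Filter 2: stand-in for fulltext search, require a shared word.
        if not words & set(source.lower().split()):
            continue
        # Final ranking and filtering by text similarity.
        quality = similarity(text, source)
        if quality >= min_similarity:
            suggestions.append((quality, source, translation))
    return sorted(suggestions, reverse=True)


for quality, source, translation in suggest("Save the page"):
    print(round(quality, 2), source, "->", translation)
```

The point of the two cheap filters is that the expensive similarity function only runs on a short list of candidates.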

PHP provides a text matching function, but we (Santhosh) had to implement a pure PHP fallback for strings longer than 255 bytes or strings containing anything other than ASCII. The pure PHP version is much slower, although that is offset a little because it works on characters rather than bytes, and non-ASCII strings have fewer characters than bytes. More importantly, it works correctly even when the text is not English. The faster implementation is used when possible. Before we made some optimizations to the matching process, it was the slowest part. After those optimizations the time is now bound by database access. Both functions implement the Levenshtein edit distance algorithm.
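
Why a byte-based function is not enough for non-ASCII text can be shown with a small Python sketch. It is not the PHP code in the extension, and the example words are my own choice. The same Levenshtein function is run once on characters and once on the UTF-8 bytes of the same strings.

```python
def levenshtein(a, b):
    """Edit distance between two sequences: str compares characters, bytes compares bytes."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


word_a, word_b = "sää", "saa"

by_characters = levenshtein(word_a, word_b)
by_bytes = levenshtein(word_a.encode("utf-8"), word_b.encode("utf-8"))

print(by_characters)  # 2: two characters differ
print(by_bytes)       # 4: each ä is two bytes in UTF-8, so the distance is inflated
```

A similarity score derived from the byte distance would therefore rank this pair lower than a human would, which is the kind of error the character-based fallback avoids.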

End users won’t see much difference. Wanting a translation memory on Wikimedia wikis was the original reason for reimplementing translation memory in PHP, and in the coming sprints we are going to enable it on wikis where Translate is enabled (meta-wiki, mediawiki.org, incubator and wikimania2012 currently). It is just over 300 lines of code [1] including comments and in addition there are database table definitions [2].

Now, having explained what was done and why, I can reveal the cool stuff, if you are still reading. There will also be a MediaWiki API module that allows querying the translation memory. There is a simple switch in the configuration to choose whether the memory is public or private. In the future this will allow querying translation memories from other sites, too.
