In July 2011, before starting my Wikimedia job, I completed my master’s thesis. Finally I spent some time to polish and submit it, which means that I will graduate!
In my thesis I investigated the feasibility of using a Finnish morphology implementation with the Lucene search system. With the same Lucene-search package that is used by the Wikimedia Foundation I built two search indexes: one with the existing Porter stemming algorithm and the other one with morphological analysis. The corpus I used was the current text dump of Finnish Wikipedia.
Finnish is among the group of languages with relatively vibrant and extensive morphology. For you English speakers, this means that instead of using prepositions, our words actually change depending on the context they are in. This makes exact pattern matching in searching mostly useless, because it only matches a fraction of the inflected forms. In Finnish nouns, verbs and adjectives can each have over a thousand of different forms when combining all the cases, plural markers, possessive suffixes and other clitics.
Simple stemmers have no or very limited vocabulary and they strip letters off the words according to rules. Morphological analyser instead comes with an extensive word list and can find all the possible interpretations of a given inflected word and only those. The morphology is based on the Omorfi interpretative finite state transducer, which returns the basic dictionary forms of the inflected words given as input. The transducer I used was brand new. Omorfi is the first open implementation of Finnish morphology.
From a technical perspective I came up with seven requirements for the new algorithm and its implementation (thanks to help from Roan and Ariel at Wikimedia) before it can be deployed in Wikimedia:
- it has to be open source,
- the code must be reviewed,
- the performance should be on par with the current system,
- it must be stable, no crashing or bugs requiring reindexing whole wikis,
- it must be easily installable with dependencies,
- searching must not be harder and the search interface must not change,
- it must return improved search results.
Now I will tell how well it met these requirements.
- Omorfi and the lookup utility I use to drive the transducer are both open source (GPL and Apache).
- Code review might be tricky due to lack of resources in Wikimedia. However we’re not at this stage yet.
- Indexing time is from five to ten times slower, but searches are about as fast and search index size grew only by 10 to 20 percent. Since indexing is done only once, it’s not such a big deal. The speed can be improved though, the lookup utility is not optimized.
- I got some out of memory errors and crashes while developing the system – the components I used were very new and I usually was their first user.
- The lookup utility is a simple Java library and the transducer is just a file – easy to install or bundle.
- The search syntax and interface has not changed at all.
- And the most important point: the quality of search results. The Wikimedia Foundation provided me with a corpus of actual search queries: I ran them on both indexes and I analysed the variations in the results they gave. I got very mixed results here, with many searches performing significantly better and many significantly worse. This is probably explained by a major implementation mistake I found in my own implementation. The alternatives proposed by the morphology sometimes got full weight when they matched the searched keyword. For example searching for tee (tea) returned many pages which contained the inflected word form teiden which can be genitive plural of tee or tie (road) or word teesi (thesis) which was interpreted as tee with possessive suffix (your tea). The problem could be solved by marking the interpreted words with a % prefix, so that they wouldn’t get as much weight as real exact matches in the document. I was not able to execute this fix during my thesis, however it would be the first thing to try among the ample possibilities of further research.
Even with the problems I encountered in my research, I believe this approach is viable and could – with further improvements – replace the current stemmer algorithm.
This was the first time that open content, open search engine and open Finnish morphology were put together.
The thesis (PDF) is written in Finnish, but I’m happy to tell you more about it. Just ask!
Pingback: Efficient translation: Translation memory enabled on all Wikimedia wikis « It rains like a saavi
Pingback: IWCLUL 2/3: morphology, OCR, a corpus vs. Wiktionary « It rains like a saavi