Tag Archives: WikiTalk

IWCLUL 2/3: morphology, OCR, a corpus vs. Wiktionary

More on IWCLUL: now on to the sessions. The first session of the day was by the invited speaker Kimmo Koskenniemi. He is applying his two-level formalism in a new area, old literary Finnish (example of old literary Finnish). By using two-level rules for old written Finnish together with OMorFi, he is able to automatically convert old text to standard Finnish dictionary forms, which can be used, in the main example, as input text to a search engine. He uses weighted transducers to rank the most likely equivalent modern-day words. For example, the contemporary spelling of wijsautta is viisautta, which is an inflected form of the noun viisaus (wisdom). He only takes the dictionary forms, because otherwise there are too many unrelated suggestions. This avoids the usual problem of too many unrelated morphological analyses: I had the same problem in my master’s thesis when I attempted to use OMorFi to improve Wikimedia’s search system, which was still using Lucene at that time.
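To make the idea of ranking modern equivalents concrete, here is a minimal sketch in plain Python. It is not Koskenniemi's implementation and it does not use real transducers or OMorFi: the spelling rules, their weights, the penalty and the tiny lexicon are all made up for illustration. It only shows the general shape: generate weighted rewrites of an old spelling, keep the ones a lexicon recognizes, and return dictionary forms ordered by weight.

```python
# Hypothetical old -> modern spelling correspondences with made-up weights
# (lower weight = more likely). These are NOT the actual two-level rules.
RULES = [("w", "v", 0.5), ("ij", "ii", 0.5), ("x", "ks", 1.0)]

# Hypothetical cost for leaving a rewritable old spelling unchanged.
KEEP_PENALTY = 2.0

# Toy stand-in for a morphological analyser such as OMorFi:
# modern word form -> dictionary form.
LEXICON = {"viisautta": "viisaus", "viisaus": "viisaus"}


def candidates(word):
    """Yield (candidate, weight) for every way of applying the rules.

    This enumerates all combinations, so it is only suitable for short
    words; a real weighted transducer does this far more efficiently.
    """
    if not word:
        yield "", 0.0
        return
    # Option 1: rewrite a prefix of the word with a matching rule.
    for old, new, weight in RULES:
        if word.startswith(old):
            for rest, rest_weight in candidates(word[len(old):]):
                yield new + rest, weight + rest_weight
    # Option 2: copy the first character; penalize it if a rule applied here.
    rule_applies = any(word.startswith(old) for old, _, _ in RULES)
    keep_cost = KEEP_PENALTY if rule_applies else 0.0
    for rest, rest_weight in candidates(word[1:]):
        yield word[0] + rest, keep_cost + rest_weight


def modernise(word):
    """Return (dictionary_form, weight) pairs, best (lowest weight) first."""
    best = {}
    for candidate, weight in candidates(word):
        if candidate in LEXICON:
            lemma = LEXICON[candidate]
            best[lemma] = min(weight, best.get(lemma, float("inf")))
    return sorted(best.items(), key=lambda item: item[1])


print(modernise("wijsautta"))
```

With these toy rules, wijsautta produces four candidate spellings, but only viisautta is in the lexicon, so the only suggestion is the dictionary form viisaus. Returning dictionary forms only, rather than every analysis, is what keeps the suggestion list short.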

Jeremy Bradley gave a presentation about an online Mari corpus. Their goal was to make a modern English-language textbook for Mari, for people who do not have access to native speakers. I was happy to see they used a free/copyleft Creative Commons license. I asked him whether they had considered Wiktionary. He told me he had discussed it with a person from Wiktionary who was against an import. I will reach out to my contacts and see whether another attempt will succeed. The automatic transliteration between Latin, Cyrillic and IPA was nice, as I have been entertaining the idea of doing transliteration from Swedish to Finnish for WikiTalk, to make it able to function in Swedish as well by only using Finnish speech components. One point sticks with me: they had to add information about verb complements themselves, as these were not recorded in their sources. I can sympathize with them based on my own language learning experiences.

Stig-Arne Grönroos’ presentation on Low-resource active learning of North Sámi morphological segmentation did not contain any surprises for me after having been exposed to this topic previously. All efforts to support languages where we have to cope with limited resources are welcome and needed. Intermediate results are better than working with nothing while waiting for a full morphological analyser, for example. It is not completely obvious to me how this tool can be used in other language technology applications, so I will be happy to see an example.

Miikka Silfverberg gave a presentation about OCR using OMorFi: can morphological analyzers improve the quality of optical character recognition? To summarize heavily, OCR performed worse when OMorFi was used, compared to just taking the top N most common words from Wikipedia. As I understood it, this is not exactly the same problem as the large number of readings generated by a morphological analyser, but rather something different yet related.
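The "top N most common words" baseline is easy to picture in code. The following is only a rough sketch under my own assumptions, not the setup used in the presentation: I assume the OCR engine offers several scored readings per word, and that the lexicon is used to prefer a reading that is a known word. The function names and the toy data are hypothetical.

```python
from collections import Counter


def top_n_lexicon(tokens, n):
    """Return the n most frequent word forms in a token list as a set."""
    return {word for word, _count in Counter(tokens).most_common(n)}


def pick_reading(ocr_candidates, lexicon):
    """Pick one reading from (word, score) pairs produced by an OCR engine.

    Prefer the highest-scoring candidate found in the lexicon; if none is
    found, fall back to the highest-scoring candidate overall.
    """
    ranked = sorted(ocr_candidates, key=lambda pair: pair[1], reverse=True)
    for word, _score in ranked:
        if word in lexicon:
            return word
    return ranked[0][0]


# Toy corpus standing in for word counts extracted from Wikipedia.
corpus = "ja on ja se on ja talo".split()
lexicon = top_n_lexicon(corpus, 2)

# Hypothetical OCR output: the engine slightly prefers the misreading "cn".
print(pick_reading([("cn", 0.6), ("on", 0.4)], lexicon))
```

A full morphological analyser accepts far more word forms than such a list, including rare ones, which gives it more chances to accept a wrong reading. That may be part of why the simple frequency list did better, but this is my speculation.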

You can write a paper about that

“You can write a paper” is kind of a running joke in the language engineering team when the discussion strays so far from the original topic that it is no longer helping to get the work done. But sometimes sidelines turn out to be interesting and fruitful. When I was presented with an opportunity to do a PhD related to wikis, languages and translation, I could not pass it up. And because of the joke, I can claim full innocence – they told me to! ;)

The results are in and… I got accepted! Screams with joy and then quickly shies away hoping nobody noticed.

What does this mean?

The doctoral hat is the ultimate goal, right?

If you are a reader of this blog, the topics might get even more incomprehensible. Or the posts might be even more insightful and based on research instead of gut feelings. Hopefully, it doesn’t mean that I won’t have time to write more blog posts.

Practically, I will be starting at the beginning of January with the goal of writing a PhD dissertation and of graduating in about four years. The proposed topic for my dissertation is Supporting creation and interaction of open content with language technology, as part of the project “Finno-Ugric Digital Natives: Linguistic support for Finno-Ugric digital communities in generating online content”. As with my MA, I’ll do this at the University of Helsinki.

Initially I will be working three days a week on that and keep helping the language engineering team as well. We’ll see how it goes.

The first thing I will do is participate in IWSDS (International Workshop on Spoken Dialogue Systems), held in January in Napa, California, USA. I will be presenting a paper about multilingual WikiTalk.