Tag Archives: IWCLUL

IWCLUL 3/3: conversations and ideas

In IWCLUL talks, Miikka Silfverberg’s mention of collecting words from Wikipedia resonated with my earlier experiences working with Wikipedia dumps, especially the difficulty of it. I talked with some people at the conference and everyone seemed to agree that processing Wikipedia dumps takes a lot of time, which they could spend for something else. I am considering to publish plain text Wikipedia dumps and word frequency lists. While working in the DigiSami project, I familiarized myself with the utilities as well as the Wikimedia Tool Labs, so relatively little effort would be needed. The research value would be low, but it would be worth it, if enough people find these dumps and save time. A recent update is that Parsoid is planning to provide plain text format, so this is likely to become even easier in the future. Still, there might be some work to do collect pages into one archive and decide which parts of page will stay and which will be removed: for example converting an infobox to collection of isolated words is not useful for use cases such as WikiTalk, and it can also easily skew word frequencies.

I talked with Sjur Moshagen about keyboards for less resourced languages. Nowadays they have keyboards for Android and iOS, in addition to keyboards for computers (which already existed). They have some impressing additional features, like automatically adding missing accents to typed words. That would be too complicated to implement in jquery.ime, a project used by Wikimedia that implements keyboards in a browser. At least the aforementioned example uses finite state transducer. Running finite state tools in the browser does not yet feel realistic, even though some solutions exist*. The alternative of making requests to a remote service would slow down typing, except perhaps with some very clever implementation, which would probably be fragile at best. I have still to investigate whether there is some middle ground to bring the basic keyboard implementations to jquery.ime.

*Such as jsfst. One issue is that the implementations and the transducers themselves can take lot of space, which means we will run into same issues as when distributing large web fonts at Wikipedia.

I spoke with Tommi Pirinen and Antti Kanner about implementing a dictionary application programming interface (API) for the Bank of Finnish Terminology in Arts and Sciences (BFT). That would allow direct use of BFT resources in translation tools like translatewiki.net and Wikimedia’s Content Translation project. It would also help indirectly, by using a dump for extending word lists in the Apertium machine translation software.

I spoke briefly about language identification with Tommi Jauhiainen who had a poster presentation about the project “The Finno-Ugric languages and the internet”. I had implemented one language detector myself, using an existing library. Curiously enough, many other people met in Wikimedia circles have also made their own implementations. Mine had severe problems classifying languages which are very close to each other. Tommi gave me a link for another language detector, which I would like to test in the future to compare its performance with previous attempts. We also talked about something I call “continuous” language identification, where the detector would detect parts of running text which are in a different language. A normal language detector will be useful for my open source translation memory service project, called InTense. Continuous language identification could be used to post-process Wikipedia articles and tag foreign text so that correct fonts are applied, and possibly also in WikiTalk-like applications, to provide the text-to-speech (TTS) with a hint on how to pronounce those words.

Reasonator entry for Kimmo KoskenniemiReasonator is a software that generates visually pleasing summary pages in natural language and structured sections, based on structured data. More specifically, it uses Wikidata, which is the Wikimedia structured data project, developed by Wikimedia Germany. Reasonator works primarily for persons, though other types or subjects are being developed. Its localisation is limited, compared to the about three hundred languages of MediaWiki. Translating software which generates natural language sentences dynamically is very different from the usual software translation, which consists mostly of fixed strings with occasional placeholder which is replaced dynamically when showing text to an user.

It is not a new idea to use grammatical framework (GF), which is a language translation software based on interlingua, for Reasonator. In fact I had proposed this earlier in private discussions to Gerard Meijssen, but this conference renewed my interest in the idea, as I attended the GF workshop held by Aarne Ranta, Inari Listenmaa and Francis Tyers. GF seems to be a good fit here, as it allows limited context and limited vocabulary translation to many languages simultaneously; vice versa, Wikidata will contain information like gender of people, which can be fed to GF to get proper grammar in the generated translations. It would be very interesting to have a prototype of a Reasonator-like software using GF as the backend. The downside of GF is that (I assume) it is not easy for our regular translators to work with, so work is needed to make it easier and more accessible. The hypothesis is that with GF backend we would get a better language support (as in grammatically correct and flexible) with less effort on the long run. That would mean providing access to all the Wikidata topics even in smaller languages, without the effort of manually writing articles.

IWCLUL 2/3: morphology, OCR, a corpus vs. Wiktionary

More on IWCLUL: now on the sessions. The first session of the day was by the invited speaker Kimmo Koskenniemi. He is applying his two-level formalism in a new area, old literary Finnish (example of old literary Finnish). By using two-level rules for old written Finnish together with OMorFi, he is able to automatically convert old text to standard Finnish dictionary forms, which can be used, in the main example, as an input text to an search engine. He uses weighted transducers to rank the most likely equivalent modern day words. For example the contemporary spelling of wijsautta is viisautta, which is an inflected form of the noun viisaus (wisdom). He only takes the dictionary forms, because otherwise there are too many unrelated suggestions. This avoids the usual problems of too many unrelated morphological analyses: I had the same problen in my master’s thesis when I attempted using OMorFi to improve Wikimedia’s search system, which was still using Lucene at that time.

Jeremy Bradley gave presentation about an online Mari corpus. Their goal was to make a modern English-language textbook for Mari, for people who do not have access to native speakers. I was happy to see they used a free/copyleft Creative Commons license. I asked him whether they considered Wiktionary. He told me he had discussed with a person from Wiktionary who was against an import. I will be reaching my contacts and see whether an another attempt will succeed. The automatic transliteration between Latin, Cyrillic and IPA was nice, as I have been entertaining the idea of doing transliteration from Swedish to Finnish for WikiTalk, to make it able to function in Swedish as well by only using Finnish speech components. One point sticks with me: they had to add information about verb complements themselves, as they were not recorded in their sources. I can sympathize with them based on my own language learning experiences.

Stig-Arne Grönroos’ presentation on Low-resource active learning of North Sámi morphological segmentation did not contain any surprises for me after having been exposed to this topic previously. All efforts to support languages where we have to cope with limited resources are welcome and needed. Intermediate results are better than working with nothing while waiting for a full morphological analyser, for example. It is not completely obvious to me how this tool can be used in other language technology applications, so I will be happy to see an example.

Miikka Silfverberg presented about OCR, using OMorFi: can morphological analyzers improve the quality of optical character recognition? To summarize heavily, OCR performed worse when OMorFi was used, compared to just taking the top N most common words from Wikipedia. I understood this is not exactly the same problem of large number of readings generated by morphological analyser, rather something different but related.

IWCLUL event report 1/3: the story

IWCLUL is short for International Workshop on Computational Linguistics for Uralic Languages. I attended the conference, held on January 16th 2015, and presented a joint paper with Antti on Multilingual Semantic MediaWiki for Finno-Ugric dictionaries at the poster session.

I attentively observe the glimmering city lights of Tromsø as the plane lands in darkness to orientate myself to the maps I studied on my computer before the trip. At the airport I receive a kind welcome by Trond, in Finnish, together with a group of other people going to the conference. While he is driving us into our hotels, Trond elaborates the sights of the island we pass by. I and Antti, who is co-author of our paper about Multilingual Semantic MediaWiki, check in to the hotel and joke about the tendency of forgetting posters in different places.

Next morning I meet Stig-Arne at breakfast. We decide to go see the local cable car. We wander around the city center until we finally find a place where they sell bus tickets. We had asked a few people but they gave conflicting different directions. We take the bus and then Fjellheisen, the cable car, to the top. The sights are wonderful even in winter. I head back, do some walking in the center. I buy some postcards and use that as an excuse to get inside and warm up.

On Friday, on the conference day, almost by miracle, we end up in the conference place without too many issues, despite seeing no signs in the University of Tromsø campus. More information of the conference itself will be provided in the following parts. And the poster? We forgot to take it with us from the social event after the conference.

-- .