Wikipedia | It rains like a saavi

In the IWCLUL talks, Miikka Silfverberg’s mention of collecting words from Wikipedia resonated with my earlier experiences of working with Wikipedia dumps, especially how difficult it is. I talked with some people at the conference and everyone seemed to agree that processing Wikipedia dumps takes a lot of time, which they could spend on something else. I am considering publishing plain text Wikipedia dumps and word frequency lists. While working in the DigiSami project, I familiarized myself with the utilities as well as with Wikimedia Tool Labs, so relatively little effort would be needed. The research value would be low, but it would be worth it if enough people find these dumps and save time. A recent update is that Parsoid is planning to provide a plain text format, so this is likely to become even easier in the future. Still, there might be some work to do to collect pages into one archive and to decide which parts of a page will stay and which will be removed: for example, converting an infobox to a collection of isolated words is not useful for use cases such as WikiTalk, and it can also easily skew word frequencies.
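To give an idea of what a word frequency list involves once the plain text is available, here is a minimal sketch. It assumes the text has already been extracted from the dump; the tokenisation rule (runs of letters, lowercased) and the function name are my own simplifications, not what any existing tool does.

```python
import re
from collections import Counter

# A "word" here is a run of Unicode letters; digits and underscores are excluded.
WORD_RE = re.compile(r"[^\W\d_]+")

def word_frequencies(text):
    """Count lowercased word forms in a plain text string."""
    return Counter(match.group(0).lower() for match in WORD_RE.finditer(text))

sample = "Sataa kuin saavista. Sataa taas!"
for word, count in word_frequencies(sample).most_common(2):
    print(word, count)
```

Infobox contents and similar page parts would have to be filtered out before this step, otherwise they end up in the counts.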

I talked with Sjur Moshagen about keyboards for less resourced languages. Nowadays they have keyboards for Android and iOS, in addition to keyboards for computers (which already existed). They have some impressive additional features, like automatically adding missing accents to typed words. That would be too complicated to implement in jquery.ime, a project used by Wikimedia that implements keyboards in a browser. At least the aforementioned example uses a finite state transducer. Running finite state tools in the browser does not yet feel realistic, even though some solutions exist^*. The alternative of making requests to a remote service would slow down typing, except perhaps with some very clever implementation, which would probably be fragile at best. I have yet to investigate whether there is some middle ground that would bring the basic keyboard implementations to jquery.ime.

^* Such as jsfst. One issue is that the implementations and the transducers themselves can take a lot of space, which means we will run into the same issues as when distributing large web fonts at Wikipedia.

I spoke with Tommi Pirinen and Antti Kanner about implementing a dictionary application programming interface (API) for the Bank of Finnish Terminology in Arts and Sciences (BFT). That would allow direct use of BFT resources in translation tools like translatewiki.net and Wikimedia’s Content Translation project. BFT could also help indirectly: a dump of it could be used to extend the word lists in the Apertium machine translation software.

I spoke briefly about language identification with Tommi Jauhiainen, who had a poster presentation about the project “The Finno-Ugric languages and the internet”. I had implemented one language detector myself, using an existing library. Curiously enough, many other people I have met in Wikimedia circles have also made their own implementations. Mine had severe problems classifying languages which are very close to each other. Tommi gave me a link to another language detector, which I would like to test in the future to compare its performance with previous attempts. We also talked about something I call “continuous” language identification, where the detector would detect parts of running text which are in a different language. A normal language detector will be useful for my open source translation memory service project, called InTense. Continuous language identification could be used to post-process Wikipedia articles and tag foreign text so that correct fonts are applied, and possibly also in WikiTalk-like applications, to give the text-to-speech (TTS) system a hint on how to pronounce those words.
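To make the idea of “continuous” language identification concrete, here is a toy sketch. It is not any of the detectors mentioned above: the tiny word lists and the rule that unknown words inherit the language of the previous word are assumptions made purely for illustration, and a real detector would use a trained model.

```python
from itertools import groupby

# Hypothetical, tiny word lists; a real detector would use a trained model.
WORDLISTS = {
    "fi": {"ja", "on", "se", "että", "ei", "kuin"},
    "en": {"the", "and", "is", "it", "that", "not"},
}

def guess_language(word):
    """Return a language code if the word is in one of the lists, else None."""
    cleaned = word.strip(".,;:!?\"'()").lower()
    for lang, words in WORDLISTS.items():
        if cleaned in words:
            return lang
    return None

def tag_spans(text, default="fi"):
    """Split running text into consecutive spans that share a guessed language."""
    labelled = []
    current = default
    for word in text.split():
        # Unknown words keep the language of the preceding word.
        current = guess_language(word) or current
        labelled.append((current, word))
    return [
        (lang, " ".join(word for _, word in group))
        for lang, group in groupby(labelled, key=lambda pair: pair[0])
    ]

print(tag_spans("se on the best ja"))
```

The resulting spans are what could be tagged with a language code for font selection or passed to the TTS as a pronunciation hint.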

Reasonator is software that generates visually pleasing summary pages in natural language and structured sections, based on structured data. More specifically, it uses Wikidata, which is the Wikimedia structured data project, developed by Wikimedia Germany. Reasonator works primarily for persons, though other types of subjects are being developed. Its localisation is limited, compared to the roughly three hundred languages of MediaWiki. Translating software which generates natural language sentences dynamically is very different from the usual software translation, which consists mostly of fixed strings with an occasional placeholder that is replaced dynamically when showing text to a user.

It is not a new idea to use Grammatical Framework (GF), which is language translation software based on an interlingua, for Reasonator. In fact I had proposed this earlier in private discussions with Gerard Meijssen, but this conference renewed my interest in the idea, as I attended the GF workshop held by Aarne Ranta, Inari Listenmaa and Francis Tyers. GF seems to be a good fit here, as it allows limited context and limited vocabulary translation to many languages simultaneously; conversely, Wikidata will contain information like the gender of people, which can be fed to GF to get proper grammar in the generated translations. It would be very interesting to have a prototype of Reasonator-like software using GF as the backend. The downside of GF is that (I assume) it is not easy for our regular translators to work with, so work is needed to make it easier and more accessible. The hypothesis is that with a GF backend we would get better language support (as in grammatically correct and flexible) with less effort in the long run. That would mean providing access to all the Wikidata topics even in smaller languages, without the effort of manually writing articles.

It has been a busy spring: I have yet to blog about Translate UX and Universal Language Selector projects, which have been my main efforts.
But now something different. In this field you can never stop learning. So I was very pleased when my boss let me participate in a week-long course, where Francis Tyers and Tommi Pirinen taught how to do machine translation with Apertium. A report of the course follows.

From translation memory to machine translation

Before going into the details of the course, I want to share my thoughts on how the different translation memory and machine translation techniques we use to help translators relate to each other. The three techniques are:

Crude translation memory: for example the TTMServer of Translate
Statistical machine translation: for example Google Translate or Microsoft Translator
Rule-based machine translation: for example Apertium

In the figure below, I have used two properties to compare them.

On the x-axis is the amount of information that is extracted from the stored data. Here the stored data is usually a corpus of aligned^* translations in two or more languages.
On the y-axis is the amount of external knowledge used by the system. This knowledge usually consists of dictionaries, rules for how words inflect and rules about grammar, or even rules for how to split text into sentences and words.

^* Aligned means that the system knows which parts of the text correspond to each other in the translations. Alignment can be at paragraph level, sentence level or even smaller parts of the text.

Figure: Translation memory and machine translation comparison

A very crude implementation just stores an existing translation and can retrieve it if the very same text is translated again.

TTMServer is a little more sophisticated: it splits the translation into paragraph-sized chunks, and it can retrieve the existing translation even if the new text does not match the old text exactly. This system uses only a little information about the data. Even if all the words exist in it, translated as part of different units (strings), the system still cannot provide any kind of translation. Internally, TTMServer uses some external knowledge on how to split up text into words, in order to speed up translation retrieval.
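The difference between the very crude memory and one that tolerates small changes can be sketched as follows. This is not how TTMServer is implemented: the class, the similarity measure (Python's difflib) and the 0.8 threshold are assumptions chosen only to illustrate fuzzy retrieval.

```python
from difflib import SequenceMatcher

class TranslationMemory:
    """Stores source/translation pairs and retrieves close matches."""

    def __init__(self):
        self._units = {}

    def add(self, source, translation):
        self._units[source] = translation

    def suggest(self, text, threshold=0.8):
        """Return (score, source, translation) tuples, best match first."""
        matches = []
        for source, translation in self._units.items():
            # ratio() is 1.0 for identical strings and approaches 0.0 for unrelated ones.
            score = SequenceMatcher(None, text, source).ratio()
            if score >= threshold:
                matches.append((score, source, translation))
        return sorted(matches, reverse=True)

tm = TranslationMemory()
tm.add("Save the page", "Tallenna sivu")
print(tm.suggest("Save this page"))
```

With the threshold set to 1.0 this degrades to the very crude exact-match memory. Note that it still cannot produce anything for a text whose words have only been seen in other units.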

Statistical machine translation at its simplest is just a translation memory which extracts more information from the stored translation data. It gathers a huge database of which words usually occur as translations of the words in the source language. Usually it also stores the context, so that in the sentence “walking along the river bank” the term “bank” is not interpreted as a building. The most sophisticated systems can also include knowledge about inflection and grammar to filter out invalid interpretations, or even fix grammatically incorrect forms.
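The “database of which words usually occur as translations” can be illustrated with a very small sketch. It assumes sentence-aligned pairs and scores word pairs with the Dice coefficient, which is a simplification I picked for brevity; real statistical systems use proper alignment models and also take context into account. The three-sentence corpus is made up.

```python
from collections import Counter

def train(pairs):
    """Count in how many aligned sentence pairs each word and word pair occurs."""
    source_counts, target_counts, pair_counts = Counter(), Counter(), Counter()
    for source, target in pairs:
        source_words = set(source.lower().split())
        target_words = set(target.lower().split())
        source_counts.update(source_words)
        target_counts.update(target_words)
        pair_counts.update((s, t) for s in source_words for t in target_words)
    return source_counts, target_counts, pair_counts

def best_translation(word, model):
    """Pick the target word with the highest Dice score for the source word."""
    source_counts, target_counts, pair_counts = model
    scores = {
        t: 2 * n / (source_counts[s] + target_counts[t])
        for (s, t), n in pair_counts.items()
        if s == word
    }
    return max(scores, key=scores.get) if scores else None

# Made-up, sentence-aligned toy corpus (Finnish, English).
corpus = [
    ("koira juoksee", "dog runs"),
    ("koira nukkuu", "dog sleeps"),
    ("kissa nukkuu", "cat sleeps"),
]
model = train(corpus)
print(best_translation("koira", model))   # dog
print(best_translation("nukkuu", model))  # sleeps
```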

On the right hand side of the figure we have rule-based machine translation systems like Apertium. These systems mainly rely on language dependent information supplied by the maker of the system: bilingual dictionaries, inflection and syntax rules are needed for them to function. Unlike the preceding ones, such systems are always language specific. Creating a machine translation system needs a linguist for each language in the system.
Still, even these systems can benefit from statistical methods. While they do not store translation data itself, such data can be analysed and used as input to find the correct way to read ambiguous sentences, or the most common translation of a word in the given context among some alternatives.

The ultimate solution for machine translation is most likely a combination of rules and information extracted from huge translation corpora.

The course

To create a machine translation system with Apertium, you need to choose a source and a target language. I built a system to translate from Kven to Finnish. Kven is very close to Finnish, so it was quite easy to do even though I do not know much Kven. Each student was provided with skeleton files and a story in the source language, which had also been translated to the target language by a human translator.

We started by adding words to the lexicon in order of frequency. The lexicon defines the part of speech and the inflection paradigm of each word. The paradigms are used to analyze the word forms, and also for generation when translating in the opposite direction. Then we added phonological rules. For example, Finnish has vowel harmony. Because of that, many word endings (cases) have two forms, depending on the word: for example koirassa (in the dog), but hiiressä (in the mouse).
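The vowel harmony example can be written out as a tiny rule. This is only a sketch in Python, not the rule formalism used on the course, and it assumes the inflection stem is already known (hiire- for hiiri) and ignores compounds and loanwords, where harmony is more complicated.

```python
BACK_VOWELS = set("aou")

def inessive(stem):
    """Attach the inessive ending: -ssa after back vowels, otherwise -ssä."""
    has_back_vowel = any(ch in BACK_VOWELS for ch in stem.lower())
    return stem + ("ssa" if has_back_vowel else "ssä")

print(inessive("koira"))  # koirassa
print(inessive("hiire"))  # hiiressä
```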

As a third step, we created a bilingual dictionary in a form that is suitable for machines (read: XML). At this point we started seeing some words in the target language. Of course we also had to add the lexicon for the target language, if nobody else had done it already.

Finally we started adding rules.
We added rules to disambiguate sentences with multiple readings. For example, in the sentence “The door is open” we added a rule that open is an adjective rather than a verb, because the sentence already has a verb.
We added rules to convert the grammar. For example Finnish cases are usually replaced with prepositions in English. We might also need to add words: “sataa” needs an explicit subject in English, “it rains”.
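The disambiguation rule for “The door is open” can be sketched like this. The data structure (a list of words, each with a set of possible readings) and the tag names are hypothetical; the actual rules on the course were written in Apertium's own rule files, not in Python.

```python
def disambiguate(sentence):
    """Drop the verb reading of ambiguous words when the sentence already has a verb.

    sentence is a list of (word, readings) pairs, where readings is a set of tags.
    """
    has_certain_verb = any(readings == {"verb"} for _, readings in sentence)
    result = []
    for word, readings in sentence:
        if has_certain_verb and "verb" in readings and len(readings) > 1:
            readings = readings - {"verb"}
        result.append((word, readings))
    return result

sentence = [
    ("The", {"det"}),
    ("door", {"noun"}),
    ("is", {"verb"}),
    ("open", {"adj", "verb"}),
]
print(disambiguate(sentence)[-1])  # ('open', {'adj'})
```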

At the end we compared the translation produced by our system with the translation made by the human translator. We briefly considered two ways to evaluate the quality of the translation.
First, we can use something like edit distance for words (instead of characters) to count how many insertions, deletions or substitutions are needed to change the machine translation into the human translation. Alternatively, we can count how many words the human translator needs to change when copy editing the machine translation.
Machine translation systems start to be useful when you need to fix only one word out of six or more words in the translation.
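The first evaluation method can be sketched as a standard Levenshtein distance computed over words instead of characters. Splitting on whitespace and the function name are my simplifications; the evaluation scripts used on the course may differ.

```python
def word_edit_distance(hypothesis, reference):
    """Count word insertions, deletions and substitutions between two sentences."""
    hyp, ref = hypothesis.split(), reference.split()
    # previous[j] is the distance between the hypothesis words handled so far
    # and the first j reference words.
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            current.append(min(
                previous[j] + 1,         # delete a hypothesis word
                current[j - 1] + 1,      # insert a reference word
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

print(word_edit_distance("the door open", "the door is open"))  # 1
```

Dividing the distance by the number of words in the human translation gives an error rate that can be compared with the one-in-six rule of thumb.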

The future

A little while ago Erik asked how the Wikimedia Foundation could support machine translation, which is now mostly in the hands of big commercial entities (though the European Union is also building something) and needs an open source alternative.

We do not have a lot of translation corpora like Google does. We do have lots of text in different languages, but it is not the same content in all languages and it is not aligned. Exceptions are translatewiki.net and other places where translations are done with the Translate extension. As a side note, I think that translatewiki.net contains one of the most multilingual parallel translation corpora under a free license.

Given that we have lots of people in the Wikimedia movement who are multilingual and interested in languages, I think we should cooperate with an existing open source machine translation system (like Apertium) in a way that allows our users to enhance that system. Doing more translations increases the data stored in a translation memory, making it more useful. In a similar fashion, doing more translations with a machine translation system should make it better.

Apertium has already been in use on the Nynorsk Wikipedia. Bokmål and Nynorsk are closely related languages: the kind of situation where Apertium excels.

One thing I have been thinking about is that, now that the Wikimedia Language Engineering team is planning to build tools to help translate Wikipedia articles into other languages, we could closely integrate those tools with Apertium. We could provide an easy way for translators to add missing words and report unintelligible sentences.

I don’t expect most of our translators to actually write and correct rules, so someone should manage that on the Apertium side. But at least word collection could be mostly automated; I bet someone has tried and will try to use Wiktionary data too.

As a first step, the Wikimedia Foundation could set up its own Apertium instance as a web service for our needs (existing instances are too unstable). The Translate extension, for example, can query such a web service to provide translation suggestions.

