
IWSDS 2014 reflections


I attended the International Workshop Series on Spoken Dialog Systems, aka IWSDS, as part of my studies. It was my first scientific event where one had to submit papers: a great experience. More details below.

Poster. The paper we had submitted to the workshop was accepted as a poster presentation. Having never done a poster before, especially one of A0 size, I was a little afraid that I would run into many technical and design issues. But there was no time for procrastination: little more than a week before the conference, I just started doing it. Luckily my university provided poster templates, so I didn’t need to worry about the general layout. Between the Adobe InDesign and PowerPoint templates, I chose the latter because I didn’t want to learn new software but just wanted to get things done. It took a few hours to come up with the initial draft. During the following days I went through 19 versions with my advisers until we declared it ready. Working on weekends and late nights is nothing new to me, but I was surprised that my advisers did the same. Whether that speaks of dedication, short deadlines or the general issue of work spilling over into free time, I do not know.

Travel. I sent the poster to print and got it the day before I flew to San Francisco. My first leg was late, so I had to run in New York to catch my connection. I had been advised to put the poster in my checked luggage, so of course it did not make it. That wasn’t an issue, though, since I had decided to fly in a day early to recover from jet lag. The next day I went back to the airport to pick up the poster and catch my shuttle to the venue, which left from SFO.

Location. The workshop was held at an inn near Napa. I was told, jokingly, that the isolated place was chosen so that we couldn’t go out and would instead stay together and discuss things. There was probably some truth in that. The place and the good food counterbalanced the packed conference program, so it never felt overwhelming.

Did you notice the four hot air balloons?

Presentations. The level of the presentations varied somewhat. At one end of the scale there was a presentation focused on math that I could not grasp. At the other end, both Dan Bohus and Louis-Philippe Morency gave captivating keynotes that sparked my interest and were easy to follow even for a complete newbie to the topic like me.

What struck me the most were Dan’s words about situated interaction. He gave an example where they had a robot in the office and were creating algorithms to guess whether a person walking past the robot was going to engage with it. Using machine learning, they were able to guess many seconds in advance whether the person would engage. But when they moved the robot a bit, so that people approached from a different direction, the previous model no longer worked and they had to retrain it with new data. The point of this and the other examples was that, in the context of human interaction, we must devise machine-learning algorithms capable of adapting to new contexts. I immediately drew a connection to the presentation by Filip Ginter (p. 28) at the Kites Symposium last year, where data mining was used for distributional semantics, i.e. machine learning was used to learn that the words kaunis, ihastuttava and hurmaava (beautiful, delightful, charming) are in some way related. There is something very appealing in making machines learn language and interaction without explicitly giving them the rules. In the case of human interaction it is much more difficult, because there is less data available and so many variables to take into account.

Presence. I can’t help but wonder why, apart from Microsoft and some car companies, no other big companies were present. I’m sure this kind of research is also done at other companies. I asked a few people this question:

The other companies: is what they are doing (in the context of the workshop topics) not novel, or are they simply not talking about their research and not contributing back; and if so, do we care?

To summarize the answers and my conclusion: Apple’s Siri, for example, is not so novel; it is just a product done very well with simple technology. On the other hand, big companies have a vast amount of data that is not available for research. Essentially, Google and other big companies have a monopoly in certain areas like machine translation and speech recognition. We do care about this.

Lab tour. After the workshop there was a lab tour in Silicon Valley. We visited the Honda Research Institute, the Computer History Museum and Microsoft Research. Microsoft scores again for getting my attention by presenting a new version of Kinect. It was also nice to see a functioning adding machine at the Computer History Museum. And I can’t avoid mentioning a nice spicy pasta lunch in the warm sunshine with good company (in increasing order of significance) in Mountain View, knowing that the Google offices were very close.

On course to machine translation

It has been a busy spring: I have yet to blog about the Translate UX and Universal Language Selector projects, which have been my main efforts.
But now for something different. In this field you can never stop learning, so I was very pleased when my boss let me take part in a week-long course where Francis Tyers and Tommi Pirinen taught how to do machine translation with Apertium. A report of the course follows.

From translation memory to machine translation

Before going into the details of the course, I want to share my thoughts on the relation between the different translation memory and machine translation techniques we use to help translators. The three techniques are:

  • Crude translation memory: for example the TTMServer of Translate
  • Statistical machine translation: for example Google Translate or Microsoft Translator
  • Rule-based machine translation: for example Apertium

In the figure below, I have used two properties to compare them.

  • On the x-axis is the amount of information that is extracted from the stored data. Here the stored data is usually a corpus of aligned* translations in two or more languages.
  • On the y-axis is the amount of external knowledge used by the system. This knowledge usually consists of dictionaries, rules about how words inflect and rules of grammar, or even rules on how to split text into sentences and words.

* Aligned means that the system knows which parts of the text correspond to each other in the translations. Alignment can be at paragraph level, sentence level or even smaller parts of the text.

Translation memory and machine translation comparison

A very crude implementation just stores an existing translation and can retrieve it if the very same text is translated again.

TTMServer is a little more sophisticated: it splits the translations into paragraph-sized chunks, and it can retrieve an existing translation even if the new text does not match the old text exactly. This system uses only a little of the information in the data. Even if all the words of a new text already exist in the memory, translated as parts of different units (strings), the system still cannot provide any kind of translation. Internally, TTMServer uses some external knowledge about how to split text into words, in order to speed up translation retrieval.
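Just to make the fuzzy-matching idea concrete, here is a toy sketch in Python. It is not how TTMServer is actually implemented; the stored strings and the similarity threshold are made up, and a real system would index the data instead of scanning it linearly.

```python
# Toy translation memory: return a stored translation when a whole stored
# unit is similar enough to the new source text.
from difflib import SequenceMatcher

# Stored units: source string -> existing translation (invented examples).
memory = {
    "Save the page": "Tallenna sivu",
    "Delete the file": "Poista tiedosto",
}

def suggest(new_source, threshold=0.8):
    """Return (similarity, old source, translation) tuples, best first."""
    suggestions = []
    for old_source, translation in memory.items():
        similarity = SequenceMatcher(None, new_source, old_source).ratio()
        if similarity >= threshold:
            suggestions.append((similarity, old_source, translation))
    return sorted(suggestions, reverse=True)

print(suggest("Save this page"))  # close to "Save the page": a suggestion
print(suggest("Save the file"))   # every word is known, but from different units: nothing
```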

Statistical machine translation, at its simplest, is just a translation memory that extracts more information from the stored translation data. It gathers a huge database of which words usually occur as translations of the words in the source language. Usually it also stores the context, so that in the sentence “walking along the river bank” the word “bank” is not interpreted as a building. The most sophisticated systems can also include knowledge about inflection and grammar to filter out invalid interpretations, or even to fix grammatically incorrect forms.
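As a rough illustration of what “extracting more information from the stored data” means, the sketch below counts which target-language words co-occur with each source word in a tiny invented aligned corpus and then picks the most frequent one. Real statistical systems (word alignment models, phrase tables, language models) are far more sophisticated, but the basic idea of learning from counts is the same.

```python
# Deliberately naive word-translation model learned from aligned sentence
# pairs: count co-occurrences and pick the most frequent target word.
from collections import Counter, defaultdict

corpus = [
    ("the dog sleeps", "koira nukkuu"),
    ("the dog eats",   "koira syö"),
    ("the cat sleeps", "kissa nukkuu"),
]

cooccurrence = defaultdict(Counter)
for source, target in corpus:
    for s in source.split():
        for t in target.split():
            cooccurrence[s][t] += 1

def translate_word(word):
    """Return the target word seen most often with the source word."""
    counts = cooccurrence.get(word)
    return counts.most_common(1)[0][0] if counts else word

print(translate_word("dog"))     # 'koira'
print(translate_word("sleeps"))  # 'nukkuu'
```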

On the right-hand side of the figure we have rule-based machine translation systems like Apertium. These systems mainly rely on language-dependent information supplied by the maker of the system: bilingual dictionaries, inflection rules and syntax rules are needed for them to function. Unlike the preceding ones, such systems are always language specific. Creating a machine translation system requires a linguist for each language in it.
Still, even these systems can benefit from statistical methods. While they do not store the translation data itself, such data can be analysed and used to find the correct reading of an ambiguous sentence, or the most common translation of a word in a given context among the alternatives.

The ultimate solution for machine translation is most likely a combination of rules and information extracted from a huge corpus of translations.

The course

To create a machine translation system with Apertium, you need to choose a source and a target language. I built a system to translate from Kven to Finnish. Kven is very close to Finnish, so it was quite easy to do even though I do not know much Kven. Each student was provided with skeleton files and a story in the source language, together with a translation into the target language made by a human translator.

We started by adding words to the lexicon in order of frequency. The lexicon defines the part of speech and the inflection paradigms of the words. The paradigms are used to analyze word forms, and also to generate them when translating in the opposite direction. Then we added phonological rules. For example, Finnish has vowel harmony: because of it, many word endings (cases) have two forms depending on the word, for example koirassa (in the dog) but hiiressä (in the mouse).
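The vowel harmony rule can be written down very compactly. The sketch below is plain Python rather than the rule formalisms used in the course, and it only handles the single case ending mentioned above; in the actual system this knowledge lives in the paradigms and phonological rules rather than in ad hoc code like this.

```python
# Finnish vowel harmony, reduced to one case ending: back vowels (a, o, u)
# in the stem select -ssa, otherwise the front variant -ssä is used.
BACK_VOWELS = set("aou")

def inessive(stem):
    """Attach the inessive ('in ...') ending with the right harmony."""
    harmony = "ssa" if any(c in BACK_VOWELS for c in stem.lower()) else "ssä"
    return stem + harmony

print(inessive("koira"))  # koirassa, 'in the dog'
print(inessive("hiire"))  # hiiressä, 'in the mouse' (inflection stem of 'hiiri')
```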

As a third step, we created a bilingual dictionary in a form that is suitable for machines (read: XML). At this point we started seeing some words in the target language. Of course we also had to add the words to the lexicon of the target language, if nobody else had done that already.

Finally we started adding rules.
We added rules to disambiguate sentences with multiple readings. For example, in the sentence “The door is open” we added a rule saying that open is an adjective rather than a verb, because the sentence already has a verb.
We added rules to convert the grammar. For example, Finnish cases are usually replaced with prepositions in English. We might also need to add words: “sataa” needs an explicit subject in English, “it rains”.
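To give a feel for what such rules do, here is a toy version of the two examples in plain Python. Apertium has its own formalisms for disambiguation and transfer; the token format and the rules below are invented purely for illustration.

```python
# Tokens are (word, [possible part-of-speech tags]) pairs produced by a
# hypothetical analyser; everything here is simplified for illustration.

def disambiguate(tokens):
    """If the sentence already has an unambiguous verb, read an
    ambiguous verb/adjective word as an adjective."""
    has_verb = any(tags == ["verb"] for _, tags in tokens)
    result = []
    for word, tags in tokens:
        if has_verb and set(tags) == {"verb", "adj"}:
            result.append((word, ["adj"]))
        else:
            result.append((word, tags[:1]))
    return result

sentence = [("the", ["det"]), ("door", ["noun"]),
            ("is", ["verb"]), ("open", ["verb", "adj"])]
print(disambiguate(sentence))  # 'open' comes out as an adjective

def add_subject(tokens):
    """Transfer rule: a subjectless weather verb like 'sataa' gets an
    explicit dummy subject in English ('it rains')."""
    if tokens and tokens[0] == ("sataa", ["verb"]):
        return [("it", ["prn"]), ("rains", ["verb"])] + tokens[1:]
    return tokens

print(add_subject([("sataa", ["verb"])]))  # [('it', ['prn']), ('rains', ['verb'])]
```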

At the end we compared the translation produced by our system with the translation made by the human translator. We briefly considered two ways to evaluate the quality of the translation.
First, we can use something like edit distance over words (instead of characters) to count how many insertions, deletions or substitutions are needed to change the machine translation into the human translation. Alternatively, we can count how many words the human translator needs to change when copy-editing the machine translation.
Machine translation systems start to be useful when you need to fix only one word out of six or more words in the translation.
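The first evaluation idea is easy to sketch: it is ordinary edit distance computed over words instead of characters. The sentences below are invented; a real evaluation would of course run over the whole translated story.

```python
# Word-level edit distance: how many insertions, deletions or substitutions
# turn the machine translation into the human reference, counted over words.
def word_edit_distance(hypothesis, reference):
    h, r = hypothesis.split(), reference.split()
    # dp[i][j] = edits to turn the first i hypothesis words
    # into the first j reference words
    dp = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        dp[i][0] = i
    for j in range(len(r) + 1):
        dp[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1]

# One word out of six differs: roughly the "starts to be useful" threshold.
print(word_edit_distance("the dog sleeps on the mat",
                         "the dog sleeps on a mat"))  # 1
```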

The future

A little while ago Erik asked how the Wikimedia Foundation could support machine translation, which is now mostly in the hands of big commercial entities (though the European Union is also building something) and needs an open source alternative.

We do not have large translation corpora like Google does. We do have lots of text in different languages, but it is not the same content in all languages and it is not aligned. Exceptions are translatewiki.net and other places where translations are done with the Translate extension. As a side note, I think translatewiki.net contains one of the most multilingual parallel translation corpora under a free license.

Given that we have lots of people in the Wikimedia movement who are multilingual and interested in languages, I think we should cooperate with an existing open source machine translation system (like Apertium) in a way that allows our users to improve that system. Doing more translations increases the data stored in a translation memory, making it more useful. In a similar fashion, doing more translations with a machine translation system should make it better.

Apertium has already been in use on the Nynorsk Wikipedia. Bokmål and Nynorsk are closely related languages: the kind of situation where Apertium excels.

One thing I have been thinking about is that, now that the Wikimedia Language Engineering team is planning to build tools to help translate Wikipedia articles into other languages, we could integrate those tools closely with Apertium. We could provide an easy way for translators to add missing words and to report unintelligible sentences.

I don’t expect most of our translators to actually write and correct rules, so someone would have to manage that on the Apertium side. But at least word collection could be mostly automated; I bet someone has tried, and will try, to use Wiktionary data too.

As a first step, the Wikimedia Foundation could set up its own Apertium instance as a web service for our needs (existing instances are too unstable). The Translate extension, for example, could query such a web service to provide translation suggestions.
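As a rough sketch of what such a query could look like from a client’s point of view: the snippet below assumes an Apertium APy-style HTTP API with a /translate endpoint taking langpair and q parameters, and the base URL is a placeholder, not a real Wikimedia service.

```python
# Minimal client-side sketch for asking an Apertium web service for a
# translation suggestion. The URL is a placeholder; the endpoint and
# parameters follow the Apertium APy convention (an assumption, not a
# description of an existing Wikimedia service).
import json
import urllib.parse
import urllib.request

APERTIUM_URL = "https://apertium.example.org/translate"  # placeholder

def suggest_translation(text, source="nob", target="nno"):
    query = urllib.parse.urlencode({"langpair": f"{source}|{target}", "q": text})
    with urllib.request.urlopen(f"{APERTIUM_URL}?{query}") as response:
        data = json.load(response)
    return data["responseData"]["translatedText"]

# Example (would only work against a real instance):
# print(suggest_translation("Hvordan har du det?"))
```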
