Category Archives: PhD

IWCLUL 3/3: conversations and ideas

In IWCLUL talks, Miikka Silfverberg’s mention of collecting words from Wikipedia resonated with my earlier experiences working with Wikipedia dumps, especially the difficulty of it. I talked with some people at the conference and everyone seemed to agree that processing Wikipedia dumps takes a lot of time, which they could spend on something else. I am considering publishing plain text Wikipedia dumps and word frequency lists. While working in the DigiSami project, I familiarized myself with the utilities as well as the Wikimedia Tool Labs, so relatively little effort would be needed. The research value would be low, but it would be worth it if enough people find these dumps and save time. A recent update is that Parsoid is planning to provide a plain text format, so this is likely to become even easier in the future. Still, there might be some work to do to collect pages into one archive and to decide which parts of a page will stay and which will be removed: for example, converting an infobox to a collection of isolated words is not useful for use cases such as WikiTalk, and it can also easily skew word frequencies.
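As a rough illustration of what such a dump-to-frequency-list pipeline could look like, here is a minimal sketch. It assumes a standard pages-articles XML dump, the mwparserfromhell library for stripping markup and a hypothetical file name; it is just one way to do it, not the tooling mentioned above.

```python
# Minimal sketch: stream a pages-articles XML dump, strip wiki markup
# and count word frequencies. mwparserfromhell's strip_code() drops
# templates such as infoboxes, which helps avoid the skewed frequencies
# mentioned above. File name and tokenization are simplifications.
import bz2
import re
import xml.etree.ElementTree as ET
from collections import Counter

import mwparserfromhell


def word_frequencies(dump_path):
    counts = Counter()
    with bz2.open(dump_path, 'rb') as dump:
        for _event, elem in ET.iterparse(dump):
            # Tag names carry the export schema namespace; match by suffix.
            if elem.tag.endswith('}text') and elem.text:
                plain = mwparserfromhell.parse(elem.text).strip_code()
                counts.update(w.lower() for w in re.findall(r'\w+', plain))
            elem.clear()  # keep memory use bounded while streaming
    return counts


if __name__ == '__main__':
    frequencies = word_frequencies('fiwiki-latest-pages-articles.xml.bz2')
    for word, count in frequencies.most_common(20):
        print(word, count)
```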

I talked with Sjur Moshagen about keyboards for less resourced languages. Nowadays they have keyboards for Android and iOS, in addition to the keyboards for computers which already existed. They have some impressive additional features, like automatically adding missing accents to typed words. That would be too complicated to implement in jquery.ime, a project used by Wikimedia that implements keyboards in a browser. At least the aforementioned example uses a finite state transducer. Running finite state tools in the browser does not yet feel realistic, even though some solutions exist*. The alternative of making requests to a remote service would slow down typing, except perhaps with some very clever implementation, which would probably be fragile at best. I still have to investigate whether there is some middle ground to bring the basic keyboard implementations to jquery.ime.

*Such as jsfst. One issue is that the implementations and the transducers themselves can take a lot of space, which means we would run into the same issues as when distributing large web fonts at Wikipedia.

I spoke with Tommi Pirinen and Antti Kanner about implementing a dictionary application programming interface (API) for the Bank of Finnish Terminology in Arts and Sciences (BFT). That would allow direct use of BFT resources in translation tools like translatewiki.net and Wikimedia’s Content Translation project. It would also help indirectly, by using a dump for extending word lists in the Apertium machine translation software.

I spoke briefly about language identification with Tommi Jauhiainen, who had a poster presentation about the project “The Finno-Ugric languages and the internet”. I had implemented one language detector myself, using an existing library. Curiously enough, many other people I have met in Wikimedia circles have also made their own implementations. Mine had severe problems classifying languages which are very close to each other. Tommi gave me a link to another language detector, which I would like to test in the future to compare its performance with my previous attempts. We also talked about something I call “continuous” language identification, where the detector would detect parts of running text which are in a different language. A normal language detector will be useful for my open source translation memory service project, called InTense. Continuous language identification could be used to post-process Wikipedia articles and tag foreign text so that correct fonts are applied, and possibly also in WikiTalk-like applications, to give the text-to-speech (TTS) component a hint on how to pronounce those words.
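To make the problem concrete, below is a toy character-trigram detector: a sketch of the general technique only, not the detector linked above nor the one I wrote. Trigram profiles of closely related languages overlap heavily, which is exactly why such simple models confuse them; the same scoring can be slid over windows of running text to get a crude “continuous” variant.

```python
# Toy character-trigram language detector. Train one trigram profile
# per language from sample text, then score new text by how familiar
# its trigrams are to each profile. Closely related languages share
# many trigrams, which is why models like this confuse them.
from collections import Counter


def trigrams(text):
    text = ' ' + text.lower() + ' '
    return [text[i:i + 3] for i in range(len(text) - 2)]


def train(samples):
    """samples: dict mapping language code to training text."""
    return {lang: Counter(trigrams(text)) for lang, text in samples.items()}


def identify(models, text):
    grams = trigrams(text)
    scores = {lang: sum(model[g] for g in grams) / (sum(model.values()) or 1)
              for lang, model in models.items()}
    return max(scores, key=scores.get)


def identify_windows(models, text, size=60, step=30):
    """Crude 'continuous' identification: label overlapping windows."""
    return [(i, identify(models, text[i:i + size]))
            for i in range(0, max(len(text) - size, 1), step)]
```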

Reasonator entry for Kimmo Koskenniemi

Reasonator is software that generates visually pleasing summary pages in natural language and structured sections, based on structured data. More specifically, it uses Wikidata, the Wikimedia structured data project developed by Wikimedia Germany. Reasonator works primarily for persons, though other types of subjects are being developed. Its localisation is limited, compared to the roughly three hundred languages of MediaWiki. Translating software which generates natural language sentences dynamically is very different from the usual software translation, which consists mostly of fixed strings with an occasional placeholder that is replaced dynamically when the text is shown to a user.

Using Grammatical Framework (GF), an interlingua-based language translation system, for Reasonator is not a new idea. In fact I had proposed it earlier in private discussions with Gerard Meijssen, but this conference renewed my interest in the idea, as I attended the GF workshop held by Aarne Ranta, Inari Listenmaa and Francis Tyers. GF seems to be a good fit here, as it allows limited-context and limited-vocabulary translation into many languages simultaneously; conversely, Wikidata contains information like the gender of people, which can be fed to GF to get proper grammar in the generated translations. It would be very interesting to see a prototype of Reasonator-like software using GF as the backend. The downside of GF is that (I assume) it is not easy for our regular translators to work with, so work is needed to make it easier and more accessible. The hypothesis is that with a GF backend we would get better language support (as in grammatically correct and flexible output) with less effort in the long run. That would mean providing access to all the Wikidata topics even in smaller languages, without the effort of manually writing articles.

IWCLUL 2/3: morphology, OCR, a corpus vs. Wiktionary

More on IWCLUL: now on to the sessions. The first session of the day was by the invited speaker Kimmo Koskenniemi. He is applying his two-level formalism in a new area, old literary Finnish (example of old literary Finnish). By using two-level rules for old written Finnish together with OMorFi, he is able to automatically convert old text to standard Finnish dictionary forms, which can be used, in the main example, as input to a search engine. He uses weighted transducers to rank the most likely equivalent modern-day words. For example, the contemporary spelling of wijsautta is viisautta, which is an inflected form of the noun viisaus (wisdom). He only takes the dictionary forms, because otherwise there are too many unrelated suggestions. This avoids the usual problem of too many unrelated morphological analyses: I had the same problem in my master’s thesis when I attempted to use OMorFi to improve Wikimedia’s search system, which was still using Lucene at that time.
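Just to illustrate the general idea, in a far cruder form than Koskenniemi’s weighted two-level rules: one can generate candidate modernisations with a few spelling correspondences and keep those found in a modern lexicon. The rules and the wordlist below are toy stand-ins, not his actual rule set.

```python
# Toy old-to-modern Finnish spelling normalisation: apply a handful of
# hand-written correspondences in every combination and keep candidates
# found in a modern wordlist. The real system uses weighted finite
# state transducers and a full morphological analyser instead.
RULES = [('w', 'v'), ('ij', 'ii'), ('c', 'k'), ('x', 'ks')]
MODERN_WORDS = {'viisaus', 'viisautta'}  # stand-in for a real lexicon


def candidates(word):
    results = {word}
    for old, new in RULES:
        # Each rule may or may not apply; when it does, apply it globally.
        results |= {w.replace(old, new) for w in results}
    return results


def modernise(word):
    return sorted(c for c in candidates(word) if c in MODERN_WORDS)


print(modernise('wijsautta'))  # -> ['viisautta']
```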

Jeremy Bradley gave a presentation about an online Mari corpus. Their goal was to make a modern English-language textbook for Mari, for people who do not have access to native speakers. I was happy to see they used a free/copyleft Creative Commons license. I asked him whether they had considered Wiktionary. He told me he had discussed it with a person from Wiktionary who was against an import. I will be reaching out to my contacts to see whether another attempt will succeed. The automatic transliteration between Latin, Cyrillic and IPA was nice, as I have been entertaining the idea of doing transliteration from Swedish to Finnish for WikiTalk, to make it able to function in Swedish as well using only Finnish speech components. One point stuck with me: they had to add information about verb complements themselves, as it was not recorded in their sources. I can sympathize with them based on my own language learning experiences.

Stig-Arne Grönroos’ presentation on Low-resource active learning of North Sámi morphological segmentation did not contain any surprises for me after having been exposed to this topic previously. All efforts to support languages where we have to cope with limited resources are welcome and needed. Intermediate results are better than working with nothing while waiting for a full morphological analyser, for example. It is not completely obvious to me how this tool can be used in other language technology applications, so I will be happy to see an example.

Miikka Silfverberg presented on OCR using OMorFi: can morphological analyzers improve the quality of optical character recognition? To summarize heavily, OCR performed worse when OMorFi was used, compared to just taking the top N most common words from Wikipedia. I understood this is not exactly the same problem as the large number of readings generated by a morphological analyser, but something different yet related.

IWCLUL event report 1/3: the story

IWCLUL is short for International Workshop on Computational Linguistics for Uralic Languages. I attended the conference, held on January 16th 2015, and presented a joint paper with Antti on Multilingual Semantic MediaWiki for Finno-Ugric dictionaries at the poster session.

I attentively observe the glimmering city lights of Tromsø as the plane lands in darkness, trying to orient myself against the maps I studied on my computer before the trip. At the airport I receive a kind welcome from Trond, in Finnish, together with a group of other people going to the conference. While he is driving us to our hotels, Trond elaborates on the sights of the island we pass by. Antti, the co-author of our paper about Multilingual Semantic MediaWiki, and I check in to the hotel and joke about the tendency to forget posters in various places.

Next morning I meet Stig-Arne at breakfast. We decide to go see the local cable car. We wander around the city center until we finally find a place that sells bus tickets. We had asked a few people, but they gave conflicting directions. We take the bus and then Fjellheisen, the cable car, to the top. The views are wonderful even in winter. I head back and walk around the center for a while. I buy some postcards and use that as an excuse to get inside and warm up.

On Friday, the conference day, almost by miracle, we find the conference venue without too many issues, despite seeing no signs on the University of Tromsø campus. More about the conference itself in the following parts. And the poster? We forgot to take it with us from the social event after the conference.

Seminar in Finland about big data in linguistics

Recently I attended a seminar on big data. The discussion included what big data is in linguistics, whether it has arrived yet and whether it is even needed in all places.

It was nice to meet people like Kimmo Koskenniemi (the advisor of my Master’s thesis), Antti Kanner and others with whom I worked on the Bank of Finnish Terminology in Arts and Sciences project. I’ve collected the points I found most interesting as a summary below.

Emeritus professor Kimmo Koskenniemi started the seminar by giving examples of really big data, one of them being the Google n-gram viewer. He also raised the issue that the copyright law in Finland does not allow us to do similar things, which can be a problem for local research. He suggested (not seriously, but still) that perhaps we should move linguistics research to some other country.

Next, a corpus project called Arkisyn was presented: a collection of annotated everyday conversations, funded by the Kone Foundation’s Language Programme. A topic of interest was how to achieve uniform tagging when multiple people are working on the collection. Annotation guidelines were produced and the resulting work was cross-checked. Participants were encouraged to document their practices clearly: having the data is no longer enough to be able to reproduce research findings. In fact, it is also necessary for researchers to be able to understand the data and to justify their conclusions.

Toni Suutari from Kotus went more in depth into the politics of open data in Finland. He then gave some sneak peeks at the (huge) efforts at Kotus to open some of their data collections (I wonder if volunteers could help with some). Licensing is a difficult topic: for example, the word list of contemporary Finnish is licensed under three different licenses, as they went through the experience of finding the most suitable one. Geographical information (paikkatieto) is also a big thing now, and they are working on opening geodata on Finnish dialects and other geodata collections. I’m sure the OpenStreetMap project and many others are eagerly waiting already.

Timo Honkela, the new professor of Digital Information, gave an engineering perspective on big data in linguistics and what it makes possible.

Jarmo Jantunen said that human intuition is still needed, so machines are not replacing linguists. Big data can give ideas when deciding what to study. One might end up studying only a small part of the big data, as there is too much to go over manually. He also went over a classification scheme that helps in understanding the role of data in research. Briefly: one can gather supporting examples from the data; one can base the research on analyzing the data; or one can let the data actually drive the research.

Kristiina Jokinen (my doctoral advisor) gave a practical view of issues in multi-modal data collection: privacy issues preventing open data, encoding formats, synchronization, lighting and audio quality. Topics of interest were how to understand the interaction between so many variables (eye gaze, head position, gestures, what is being said, many people interacting) and whether what the machine sees is what a real person would notice. Deb Roy’s research (recording his son while he learned to speak) was also mentioned.

In the end there was a panel discussion. It ranged over a wide variety of topics, but one point that struck me was that while some data is becoming available by itself, some kinds of data will not appear unless researchers create them, and it is difficult to find resources to do that.

Numbers on translatewiki.net sign-up process

Translatewiki.net features a good user experience for non-technical translators. A crucial or even critical component is signing up. An unrelated data collection for my PhD studies inspired me to get some data on the translatewiki.net user registration process. I will present the results below.

History

At translatewiki.net the process of becoming an approved translator has been, arguably, complicated in some periods.

In the early days of the wiki, permissions were not clearly separated: hundreds of users were simply given the full set of permissions to edit the MediaWiki namespace and translate that way.

Later, we required people to jump through various hoops after registering to be approved as translators. They had to create a user page with certain elements and post a request on a separate page, and they would not get notifications when they were approved unless they tweaked their preferences.

At some point, we started using the LiquidThreads extension: now users could get notifications when approved, at least in theory. That brought its own set of issues though: many people thought that the LiquidThreads search box on the requests page was the place to write the title of their request. After entering a title, they ended up on a search results page, which was a dead end. This usability issue was so annoying and common that I completely removed the search field from LiquidThreads.
In early 2010 we implemented a special page wizard (FirstSteps) to guide users through the process. For years, this has allowed new users to get approved, and start translating, in a few clicks and within a handful of hours of registering.

In late 2013 we enabled the new main page containing a sign-up form. Using that form, translators can create an account in a sandbox environment. Accounts created this way are normal user accounts, except that they can only make example translations to get a feel for the system. The example translations give site administrators a basis for deciding whether to approve the user as a translator or reject the request.

Data collection

The data we have is not ideal.

  • For example, it is impossible to say what our conversion rate is from users visiting the main page to actual translators.
  • A lot of noise is added by spam bots which create user accounts, even though we have a CAPTCHA.
  • When we go far back in history, the data becomes unreliable or is missing completely.
    • We only have dates for accounts created after 2006 or so.
    • The log entry format for user permissions has changed multiple times, so the promotion times are missing or even incorrect for many entries until a few years back.

The data collection was done with two scripts I wrote for this purpose. The first script produces a tab-separated values (TSV) file containing all accounts which have been created. Each line has the following fields:

  1. username,
  2. time of account creation,
  3. number of edits,
  4. whether the user was approved as translator,
  5. time of approval and
  6. whether they used the regular sign-up process or the sandbox.

Some of the fields may be empty because the script was unable to find the data. User accounts for which we do not have an account creation time are not listed. I chose not to try methods which approximate the account creation time, because data that far back is too unreliable to be useful.

The first script takes a couple of minutes to run at translatewiki.net, so I split further processing into a separate script to avoid doing the slow data fetching many times. The second script calculates a few additional values, like the average and median time to approval, and aggregates the data per month.

The data also includes translators who signed up through the sandbox but got rejected: this information is important for calculating the approval rate. For them, we do not know the exact registration date, so we use the time they were rejected instead. This has a small impact on the monthly numbers if a translator registers in one month and gets rejected in a later month: if the script is run again later, numbers for previous months might be somewhat different. For approval times there is no such issue.
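For the curious, the aggregation step is conceptually simple. The sketch below shows the idea, assuming the field layout listed above; the column formats, file name and exact statistics are illustrative rather than a copy of the real script.

```python
# Sketch of the aggregation: read the TSV described above (username,
# creation time, edit count, approved flag, approval time, sandbox
# flag) and compute per-month account creations, approval rate and
# median approval delay. Timestamp formats and flag values are
# assumptions, not the real script's.
import csv
import statistics
from collections import defaultdict
from datetime import datetime

TIME_FORMAT = '%Y-%m-%d %H:%M:%S'


def monthly_stats(path):
    stats = defaultdict(lambda: {'created': 0, 'approved': 0, 'delays': []})
    with open(path, newline='', encoding='utf-8') as tsv:
        for user, created, edits, approved, approved_at, sandbox in csv.reader(tsv, delimiter='\t'):
            created_dt = datetime.strptime(created, TIME_FORMAT)
            month = created_dt.strftime('%Y-%m')
            stats[month]['created'] += 1
            if approved == '1':
                stats[month]['approved'] += 1
                if approved_at:
                    delay = datetime.strptime(approved_at, TIME_FORMAT) - created_dt
                    stats[month]['delays'].append(delay.total_seconds() / 3600)
    return stats


if __name__ == '__main__':
    for month, s in sorted(monthly_stats('twn-accounts.tsv').items()):
        median = statistics.median(s['delays']) if s['delays'] else float('nan')
        rate = s['approved'] / s['created']
        print('%s\tcreated=%d\tapproval_rate=%.0f%%\tmedian_hours=%.1f'
              % (month, s['created'], 100 * rate, median))
```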

Results


Image 1: Account creations and approved translators at translatewiki.net

Image 1 displays all account creations at translatewiki.net as described above, simply grouped by their month of account creation.

We can see that the approval rate has gone down over time. I assume this is caused by spam bot accounts. We did not exclude them, hence we cannot tell whether the approval rate has gone up or down for human users.

We can also see that the number of approved translators who later turn out to be prolific translators has stayed pretty much constant each month. A prolific translator is an approved translator who has made at least 100 edits. The edits can be from any point in time; the script only looks at the current edit count, so the graph above does not say anything about wiki activity at any particular moment.

There is an inherent bias towards old users for two reasons. First, at the beginning translators were basically invited to a new tool from existing methods they used, so they were likely to continue to translate with the new tool. Second, new users have had less time to reach 100 edits. On the other hand, we can see that a dozen translators even in the past few months have already made over 100 edits.

I have collected some important events below, which I will then compare against the chart.

  • 2009: Translation rallies in August and December.
  • 2010-02: The special page to assist in filing translator requests was enabled.
  • 2010-04: We created a new (now old) main page.
  • 2010-10: Translation rally.
  • 2011: Translation rallies in April, September and December.
  • 2012: Translation rallies in August and December.
  • 2013-12: The sandbox sign-up process was enabled.

There is an increase in account creations and approved translators a few months after the assisting special page was enabled. The likely explanation is the new main page, which had a big green button leading to the special page. The September 2011 translation rally seems to have been very successful in recruiting new translators, but the other rallies are also visible in the chart.


Image 2: How long it takes from account creation to approval.

The second image shows how long it takes from account creation for a site administrator to approve the request. Before the sandbox, users had to submit a request to become translators on their own: the time it took them to do so was outside the site administrators’ control. With the sandbox, that is much less the case, as users get either approved or rejected within a couple of days. Let me give an overview of how the sandbox works.

All users in the sandbox are listed on a special page together with the sandbox translations they have made. The administrators can then approve or reject the users. Administrators usually wait until the user has made a handful of translations. Administrators can also send email reminders asking users to make more translations. If translators do not provide translations within some time, or the translations are very bad, they get rejected. Otherwise they are approved and can immediately start using the full translation interface.

We can see that the median approval time is just a couple of hours! The average time varies wildly though. I am not completely sure why, but I have two guesses.
First, some very old user accounts have become active again after being dormant for months or years and have finally requested translator rights. Even one of these can skew the average significantly. On a quick inspection of the data, this seems plausible.
Second, originally we made all translators site administrators. At some point, we introduced the translator user group, and existing translators have gradually been getting this new permission as they returned to the site. The script only counts the time when they were added to the translator group.
Alternatively, the script may have a bug and return wrong times. However, that should not be the case for recent years because the log format has been stable for a while. In any case, the averages are so big as to be useless before the year 2012, so I completely left them out of the graph.

The sandbox has been in use only for a few months. For January and February 2014, the approval rate has been slightly over 50%. If a significant portion of rejected users are not spam bots, there might be a reason for concern.

Suggested action points

  1. Store the original account creation date and “sandbox edit count” for rejected users.
  2. Investigate the high rejection rate. We can ask the site administrators why about half of the new users are rejected. Perhaps we can also add a “mark as spam” action to get insight into whether we get a lot of spam. Event logging could also be used to get more insight into the points of the process where users get stuck.

Source material

Scripts are in Gerrit. Version ‘2’ of the scripts was used for this blog post. Processed data is in a LibreOffice spreadsheet. Original and updated data is available on request, please email me.

IWSDS 2014 reflections

Click for an animation of how it went through 20 versions.

I attended the International workshop series on spoken dialog systems aka IWSDS as part of my studies. It was my first scientific event where one had to submit papers: a great experience. More details below.

Poster. The paper we had submitted to the workshop was accepted as a poster presentation. Having never done a poster before, especially of A0 size, I was a little bit afraid that I would run into many technical and design issues. But there was no time for procrastination: little more than a week before the conference, I just started doing it. Luckily my university provided poster templates, so I didn’t need to worry about the general layout. Between the Adobe InDesign and PowerPoint templates, I chose the latter because I felt that I didn’t want to learn new software but just wanted to get things done. It took a few hours to come up with the initial draft. During the following days I went through 19 versions with my advisers until we declared it ready. Working on weekends and late nights is nothing new to me, but I was surprised my advisers did the same. Whether this says something about dedication, short deadlines or the general issue of work spilling over into free time, I do not know.

Travel. I sent the poster to print and got it the day before I flew to San Francisco. My first leg was late, so I had to run in New York to catch my connection. I had been advised to put the poster in my checked luggage, so of course it did not make it. That wasn’t an issue though, since I had decided to fly in one day early to recover from jet lag. The next day I went back to the airport to pick up the poster and take my shuttle to the venue, leaving from SFO.

Location. The workshop was held in an inn near Napa. I was told, jokingly, that the isolated place was chosen so that we couldn’t go out but could only stay together and discuss things. There was probably some truth in that. As a counterbalance to the packed conference program, the place and the good food kept it from being overwhelming.


Did you notice the four hot air balloons?

Presentations. The level of the presentations varied somewhat. On one end of the scale there was a presentation focused on math that I could not grasp. On the other end, both Dan Bohus and Louis-Philippe Morency gave captivating keynotes which sparked my interest and were easy to follow even for a complete newbie to the topic like me.

What struck me the most were Dan’s words about situated interaction. He gave an example where they had a robot in the office and were creating algorithms to guess whether a person walking past was going to engage with the robot. Using machine learning, they were able to guess many seconds in advance whether the person would engage. But when they moved the robot a bit, so that people approached from a different direction, the previous model did not work anymore and they had to retrain it with new data. The point of this and the other examples was to highlight that, in the context of human interaction, we must devise machine learning algorithms capable of adapting to new contexts. I immediately drew a connection to the presentation by Filip Ginter (p. 28) at the Kites Symposium last year, where data mining was used for distributional semantics, i.e. machine learning was used to learn that the words kaunis, ihastuttava and hurmaava are in some way related. There is something very appealing in making machines learn language and interaction without explicitly giving them the rules. In the case of human interaction it is much more difficult, because there is less data available and there are so many variables to take into account.

Presence. I can’t help but wonder why, apart from Microsoft and some car companies, no big companies were present. I’m sure this kind of research is also done in other companies. I asked a few people this question:

Is what the other companies are doing (in the context of the workshop topics) not novel, or are they simply not telling about their research and not contributing back; and if so, do we care?

To summarize the answers and my conclusion: Apple Siri, for example, is not so novel, it is just a product done very well using simple technology; on the other hand, big companies have a vast amount of data that is not available for research. Essentially Google and other big companies have a monopoly in certain areas like machine translation and speech recognition. We do care about this.

Lab tour. After the workshop there was a lab tour in Silicon Valley. We visited the Honda Research Institute, the Computer History Museum and Microsoft Research. Microsoft scored again in getting my attention by presenting a new version of Kinect. It was also nice to see a functioning adding machine at the Computer History Museum. And I can’t avoid mentioning having a nice spicy pasta lunch in the warm sunshine with good company (in increasing order of significance) in Mountain View, knowing that the Google offices were very close.

First day at work

Officially I started on January 1st, but apart from getting an account, today was the first real day at the university. It still feels great – the “oh my, what did I sign up for” feeling still has time to come. ;)

After the WMF daily standup, I had my usual breakfast and headed to the city center, where our research group of four had a meeting. To my surprise, the eduroam network worked immediately. I had configured it at home earlier based on a guide on the site of some Swiss university, if I remember correctly: my own university didn’t provide good instructions for setting it up with Fedora and KDE.


The building on the left is part of the Institute of Behavioural Sciences, University of Helsinki. It is right next to the building (not visible) where I started my university studies in 2005. (Photo CC BY-NC-ND by Irmeli Aro.)

On my side, preparations for the IWSDS conference are now the highest priority. I have until Monday to prepare my first ever poster presentation. I found PowerPoint and InDesign templates on the university’s website (ugh, proprietary tools). Then there are a few days to get it printed before I fly out on Thursday. After the trip I will make a website for the project to give it some visibility, and find out about the next steps as well as how to proceed with my studies.

After this topic, I got to hear about another part of the research, the collection of data in Sami languages. I connected the researchers with Wikimedia Suomi, which has expressed interest in working with Sami people.

After the meeting, we went hunting for so-called WBS codes, which are needed in various places to allocate expenses, for example for poster printing and travel plans. (In case someone knows where the abbreviation WBS comes from, there are at least two people in the world who are interested to know.) The people I met there were all very friendly and helpful.

On my way home I met an old friend from Päivölä and university days (hi Jouni!) in the metro. There was also a surprise ticket inspection – a 25% inspection rate for my trips this year, based on 4 observations. I guess I need more observations before this is statistically significant. ;)

One task left for me when I got home was the mandatory travel plan. It needs to be done through the university’s travel management software, which is not directly accessible. After trying without success to access it, first through their web-based VPN proxy, second with OpenVPN via NetworkManager via “some random KDE GUI for that” on my laptop and, third, even with a proprietary VPN application on my Android phone, I gave up for today – it is likely that the VPN connection itself is not the problem and the issue is somewhere else.

It is still not known where I will get a room (I’m employed in a different department from the one where I’m doing my PhD). I will likely work from home often anyway, as I am used to.

You can write a paper about that

“You can write a paper” is kind of a running joke in the language engineering team when the discussion sways so far from the original topic that it is no longer helping to get the work done. But sometimes sidelines turn out to be interesting and fruitful. When I was presented with an opportunity to do a PhD related to wikis, languages and translation, I could not pass it up. And because of the joke, I can claim full innocence – they told me to! ;)

The results are in and… I got accepted! Screams with joy and then quickly shies away, hoping nobody noticed.

What does this mean?


The doctoral hat is the ultimate goal, right?

If you are a reader of this blog, the topics might get even more incomprehensible. Or the posts might be even more insightful and based on research instead of gut feelings. Hopefully, it doesn’t mean that I won’t have time to write more blog posts.

Practically, I will be starting at the beginning of January with the goal of writing a PhD dissertation and of graduating in about four years. The proposed topic for my dissertation is Supporting creation and interaction of open content with language technology, as part of the project “Finno-Ugric Digital Natives: Linguistic support for Finno-Ugric digital communities in generating online content”. As with my MA, I’ll do this at the University of Helsinki.

Initially I will be working three days a week on that and keep helping the language engineering team as well. We’ll see how it goes.

The first thing I will do is participate in IWSDS (the International Workshop on Spoken Dialog Systems), held in January in Napa, California, USA. I will be presenting a paper about multilingual WikiTalk.
