Tag Archives: research

Seminar in Finland about big data in linguistics

Recently I attended a seminar on big data. The discussion included what big data is in linguistics, whether it has arrived yet and whether it is even needed in all places.

It was nice to meet people like Kimmo Koskennimi (he was advisor for Master’s thesis), Antti Kantter and others with whom I worked on the Bank of Finnish terminology in arts and sciences project. I’ve collected the points I found most interesting as a summary below.

Emeritus professor Kimmo Koskenniemi started the seminar by giving examples of really big data, one of them being the Google n-gram viewer. He also raised the issue that the copyright law in Finland does not allow us to do similar things, which can be a problem for local research. He suggested (not seriously, but still) that perhaps we should move linguistics research to some other country.

Next was presented a corpus project, Arkisyn, which is a collection of annotated everyday conversation. It was funded by Kone foundation Language programme. Topic of interest was how to achieve uniform tagging when multiple people are working on the collection. Annotation guidelines were produced and the resulting work was cross-checked. Participants were encouraged to document clearly their practices: having the data is no longer enough to be able to reproduce research findings. In fact, it is also necessary for researchers to be able to understand the data and to justify their conclusions.

Toni Suutari from Kotus went more in depth in the political landscape of open data politics in Finland. He then gave some sneak peeks on the (huge) efforts at Kotus to open some of their data collections (I wonder if volunteers could help with some). Licensing is a difficult topic: for example the word list of contemporary Finnish is licensed under three different licenses as they went through the experience of finding the most suitable one. Also geographical information (paikkatieto) is a big thing now, and they are working on opening geodata on Finnish dialects and other geodata collections. I’m sure the OpenStreetMap project and many others are eagerly waiting already.

Timo Honkela, new professor of Digital information, gave an engineering perspective of big data in linguistics and what it makes possible.

Jarmo Jantunen said that human intuition is still needed, so machines are not replacing linguistics. When deciding what to study, big data can give ideas. One might end studying a small part of the big data, there is too much data to go over manually. He also went over classification scheme that helps understanding the role of data in the research. Briefly: one can gather supporting examples from the data; one can base the research on analyzing the data; or one can let the data actually drive the research.

Kristiina Jokinen (my doctoral advisor) gave a practical view of issues in multi-modal data collection: privacy issues preventing open data, encoding formats, synchronization, lightning and audio quality. Topics of interest were how to understand the interaction between so many variables (eye glaze, head position, gestures, what is being said, many people interacting) and whether what the machine sees is what a real person would notice. Deb Roy’s research (recording his son while he learned to speak) was also mentioned.

In the end there was a panel discussion. It ranged over a wide variety of topics, but one that struck me was that while some data is becoming available by itself, some kind of data is not coming unless researchers create it, and it is difficult to find resources to create it.

Numbers on translatewiki.net sign-up process

Translatewiki.net features a good user experience for non-technical translators. A crucial or even critical component is signing up. An unrelated data collection for my PhD studies inspired me to get some data on the translatewiki.net user registration process. I will present the results below.

History

At translatewiki.net the process of becoming an approved translator has been, arguably, complicated in some periods.

In the early days of the wiki, permissions were not clearly separated: hundreds users were just given the full set of permissions to edit the MediaWiki namespace and translate that way.

Later, we required people to go through hoops of various kind after registering to be approved as translators. They had to create a user page with certain elements and post a request on a separate page and they would not get notifications when they were approved unless they tweaked their preferences.

At some point, we started using the LiquidThreads extension: now the users could get notifications when approved, at least in theory. That brought its own set of issues though: many people thought that the LiquidThreads search box on the requests page was the place where to write the title of their request. After entering a title, they ended up in a search results page, which was a dead end. This usability issue was so annoying and common that I completely removed the search field from LiquidThreads.
In early 2010 we implemented a special page wizard (FirstSteps) to guide users though the process. For years, this has allowed new users to get approved, and start translating, in few clicks and a handful hours after registering.

In late 2013 we enabled the new main page containing a sign-up form. Using that form, translators can create an account in a sandbox environment. Accounts created this way are normal user accounts except that they can only make example translations to get a feel of the system. Example translations give site administrators some hints on whether to approve or reject the request and approve the user as a translator.

Data collection

The data we have is not ideal.

For example, it is impossible to say what’s our conversion rate from users visiting the main page to actual translators.
A lot of noise is added by spam bots which create user accounts, even though we have a CAPTCHA.
When we go far back in the history, the data gets unreliable or completely missing.
- We only have dates for account created after 2006 or so.
- The log entry format for user permissions has changed multiple times, so the promotion times are missing or even incorrect for many entries until a few years back.

The data collection was made with two scripts I wrote for this purpose. The first script produces a tab separated file (tsv) containing all accounts which have been created. Each line has the following fields:

username,
time of account creation,
number of edits,
whether the user was approved as translator,
time of approval and
whether they used the regular sign-up process or the sandbox.

Some of the fields may be empty because the script was unable to find the data. User accounts for which we do not have account creation time are not listed. I chose not to try some methods which can be used to approximate the account creation time, because the data in that much past is too unreliable to be useful.

The first script takes a couple of minutes to run at translatewiki.net, so I split further processing to a separate script to avoid doing the slow data fetching many times. The second script calculates a few additional values like average and median time for approval and aggregates the data per month.

The data also includes translators who signed up through the sandbox, but got rejected: this information is important for approval rate calculation. For them, we do not know the exact registration date, but we use the time they were rejected instead. This has a small impact on monthly numbers, if a translator registers in one month and gets rejected in a later month. If the script is run again later, numbers for previous months might be somewhat different. For approval times there is no such issue.

Results

Image 1: Account creations and approved translators at translatewiki.net

Image 1 displays all account creations at translatewiki.net as described above, simply grouped by their month of account creation.

We can see that approval rate has gone down over time. I assume this is caused by spam bot accounts. We did not exclude them hence we cannot tell whether the approval rate has gone up or down for human users.

We can also see that the number of approved translators who later turn out to be prolific translators has stayed pretty much constant each month. A prolific translator is an approved translator who has made at least 100 edits. The edits can be from any point of time, the script is just looking at current edit count so the graph above doesn’t say anything about wiki activity at any point in time.

There is an inherent bias towards old users for two reasons. First, at the beginning translators were basically invited to a new tool from existing methods they used, so they were likely to continue to translate with the new tool. Second, new users have had less time to reach 100 edits. On the other hand, we can see that a dozen translators even in the past few months have already made over 100 edits.

I have collected some important events below, which I will then compare against the chart.

2009: Translation rallies in August and December.
2010-02: The special page to assist in filing translator requests was enabled.
2010-04: We created a new (now old) main page.
2010-10: Translation rally.
2011: Translation rallies in April, September and December.
2012: Translation rallies in August and December.
2013-12: The sandbox sign-up process was enabled.

There is an increase in account creations and approved translators a few months after the assisting special page was enabled. The explanation of this is likely to be the new main page which had a big green button to access the special page. The September translation rally in 2011 seems to be very successful in requiting new translators, but also the other rallies are visible in the chart.

Image 2: How long it takes for account creation to be approved.

The second image shows how long it takes from the account creation for a site administrator to approve the request. Before sandbox, users had to submit a request to become translators on their own: the time for them to do so is out of control of the site administrators. With sandbox, that is much less the case, as users get either approved or rejected in a couple of days. Let me give an overview of how the sandbox works.

All users in the sandbox are listed on a special page together with the sandbox translations they have made. The administrators can then approve or reject the users. Administrators usually wait until the user has made a handful translations. Administrators can also send email reminders for the users to make more translations. If translators do not provide translations within some time, or the translations are very bad, they will get rejected. Otherwise they will be approved and can immediately start using the full translation interface.

We can see that the median approval time is just a couple of hours! The average time varies wildly though. I am not completely sure why, but I have two guesses.
First, some very old user accounts have reactivated after being dormant for months or years and have finally requested translator rights. Even one of these can skew the average significantly. On a quick inspection of the data, this seems plausible.
Second, originally we made all translators site administrators. At some point, we introduced the translator user group, and existing translators have gradually been getting this new permission as they returned to the site. The script only counts the time when they were added to the translator group.
Alternatively, the script may have a bug and return wrong times. However, that should not be the case for recent years because the log format has been stable for a while. In any case, the averages are so big as to be useless before the year 2012, so I completely left them out of the graph.

The sandbox has been in use only for a few months. For January and February 2014, the approval rate has been slightly over 50%. If a significant portion of rejected users are not spam bots, there might be a reason for concern.

Suggested action points

Store the original account creation date and “sandbox edit count” for rejected users.
Investigate the high rejection rate. We can ask the site administrator why about a half of the new users are rejected. Perhaps we can also have “mark as spam” action to get insight whether we get a lot of spam. Event logging could also be used, to get more insight on the points of the process where users get stuck.

Source material

Scripts are in Gerrit. Version ‘2’ of the scripts was used for this blog post. Processed data is in a LibreOffice spreadsheet. Original and updated data is available on request, please email me.

IWSDS 2014 reflections

Click for animation how it went through 20 versions.

I attended the International workshop series on spoken dialog systems aka IWSDS as part of my studies. It was my first scientific event where one had to submit papers: a great experience. More details below.

Poster. The paper we had submitted to the workshop was accepted as poster presentation. Having never done a poster before, especially of A0 size, I was a little bit afraid that I would encounter many technical and design issues. But no time for procrastination: little more than a week before the conference, I just started doing it. Luckily my University provided poster templates, so I didn’t need to worry about general layout. Between Adobe InDesign and PowerPoint templates, I chose the latter because I felt that I didn’t want to learn new software but just want to get things done. It took a few hours to come up with the initial draft. During the following days I went through 19 versions with my advisers until we declared it as ready. Working on weekends and late nights is nothing new to me, but I was surprised my advisers did the same. Whether this tells about dedication, short deadlines or the general issue of work spreading over to free time, I do not know.

Travel. I sent the poster to print, and got it the day before I flew in to San Francisco. My first leg was late, so I had to run in New York to catch my connection. I was advised to include the poster in check-in luggage, so of course it did not make it. That wasn’t an issue though, since I had decided to fly in one day early to recover from jet lag. The next day I went back to the airport to pick the poster and take my shuttle to the venue, leaving from SFO.

Location. The workshop was held in an inn in near Napa. I was told jokingly that the isolated place was chosen so that we can’t go out but only stay together to discuss about things. There was probably some truth in that. As a counterbalance for the packed conference program, the place and good food made it not overwhelming.

Did you notice the four hot air balloons?

Presentations. The level of presentations varied somewhat. On one end of the scale there was a presentation which focused on math which I could not grasp. On the other end both Dan Bohus and Louis-Philippe Morency gave captivating keynotes which sparked my interest and were easy to follow even for a complete newbie to the topic as I was.

What struck me the most were Dan’s words about situated interaction. He gave an example where they had a robot in the office, and they were creating algorithms to guess whether the person walking past the robot is going to engage with the robot. They were able to use machine learning to guess many seconds in advance whether the user will engage. But when they moved the robot a bit so that users approached from a different direction, the previous model did not work anymore, and they had to retrain it with new data. The point of this and other given examples was to highlight that, in the context of human interaction, we must devise machine-learning algorithms capable of adapting to new contexts. I immediately draw a connection to the presentation by Filip Ginter (p. 28) at Kites Symposium last year, where data mining was used for distributional semantics, i.e. machine learning was used to learn that words kaunis, ihastuttava and hurmaava are in some way related. There is something very appealing in making machines learn language and interaction without explicitly giving them the rules. In the case of human interaction it is much more difficult because there is less data available and so many variables to take into account.

Presence. I can’t help but wonder why, apart from Microsoft and some car companies, no big companies were present. I’m sure this kind of research is also done in other companies. I asked a few people this question:

The other companies, is what they are doing (in the context of workshop topics) not novel, or are they just not telling about their research and not contributing back; and if so, do we care?

To summarize the answers and my conclusion: Apple Siri, for example, is not so novel, it is just a product done very well using a simple technology; on the other hand big companies have a vast amount of data not available for research. Essentially Google and other big companies have a monopoly on certain areas like machine translation and speech recognition. We do care about this.

Lab tour. After the workshop there was a lab tour in the Silicon Valley. We visited Honda Research Institute, Computer History Museum and Microsoft Research. Microsoft scores again for getting my attention by presenting a new version of Kinect. It was also nice to see a functioning adding machine at the Computer History Museum. And I can’t avoid mentioning having a nice spicy pasta lunch in the warm sunshine with a good company (in increasing order of significance) in Mountain View knowing that Google offices were very close.

First day at work

Officially I started January 1st, but apart from getting an account today was the first real thing at the university. Still feels great – the “oh my what did I sign up to” feeling has still time to come. ;)

After having the WMF daily standup, I have a usual breakfast and head to city center, where our research group of four had a meeting. To my surprise, the eduroam network worked immediately. I had configured it at home earlier based on a guide on the site of some university of Switzerland, if I remember correctly: my university didn’t provide good help for how to set it up with Fedora and KDE.

Institute of Behavioural Sciences, University of Helsinki

The building on the left is part of Institute of Behavioural Sciences. It is just next to the building (not visible) where I started my university studies in 2005. (Photo CC BY-NC-ND by Irmeli Aro.)

On my side, preparations for the IWSDS conference are now the highest priority. I have until Monday to prepare my first ever poster presentation. I found PowerPoint and InDesign templates from the university’s website (ugh proprietary tools). Then there are few days to get it printed before I fly on Thursday. After the travel I will make a website for the project to allow it to get some visibility and find out about the next steps as well as how to proceed with studies.

After this topic, I got to hear about other part of the research, collection of data in Sami languages. I connected them with Wikimedia Suomi who has expressed interest to work with Sami people.

After the meeting, we went hunting for so-called WBS codes which are needed in various places to target the expenses, for example for poster printing and travel plans. (In case someone knows where the abbreviation WBS comes from, there are at least two people in the world who are interested to know.) The people I met there were all very friendly and helpful.

On my way home I met an old friend from Päivölä&university (Mui Jouni!) in the metro. There was also a surprise ticket inspection – 25% inspection rate for my trips this year based on 4 observations. I guess I need more observations before this is statistically significant ;)

One task left for me when I got home was to do the mandatory travel plan. This needs to be done through university’s travel management software, which is not directly accessible. After trying without success to access it first through their web based VPN proxy, second with openvpn via NetworkManager via “some random KDE GUI for that” on my laptop and, third, even with a proprietary VPN application on my Android phone I gave up for today – it’s likely that the VPN connection itself is not the problem and the issue is somewhere else.

It’s still not known from where I will get a room (I’m employed in a different department from where I’m doing my PhD). Though I will likely work from home often as I am used to.

You can write a paper about that

“You can write a paper” is kind of a running joke in the language engineering team when the discussion sways so far from the original topic that it is no longer helping to get the work done. But sometimes sidelines turn out to be interesting and fruitful. When I was presented an opportunity to do a PhD related to wikis, languages and translation I could not pass it. And because of the joke, I can claim full innocence – they told me to! ;)

The results are in and…. I got accepted! Screams with joy and then quickly shies away hoping nobody noticed.

What does this mean?

The doctoral hat is the ultimate goal, right?

If you are a reader of this blog, the topics might get even more incomprehensible. Or the posts might be even more insightful and based on research instead of gut feelings. Hopefully, it doesn’t mean that I won’t have time to write more blog posts.

Practically, I will be starting at the beginning of January with the goal of writing a PhD dissertation and of graduating in about four years. The proposed topic for my dissertation is Supporting creation and interaction of open content with language technology, as part of the project “Finno-Ugric Digital Natives: Linguistic support for Finno-Ugric digital communities in generating online content”. As with my MA, I’ll do this at the University of Helsinki.

Initially I will be working three days a week on that and keep helping the language engineering team as well. We’ll see how it goes.

The first thing I will do is to participate in IWSDS (Workshop on Spoken Dialog Systems) held in January at Napa, California, USA. I will be presenting a paper about multilingual WikiTalk.

-- Niklas Laxström.

It rains like a saavi

About me, me and me