Tag Archives: gettext

GNU i18n for high priority projects list

Today, for a special occasion, I’m hosting this guest post by Federico Leva, dealing with some frequent topics of my blog.

A special GNU committee has invited everyone to comment on the selection of high priority free software projects (thanks M.L. for spreading the word).

As far as I can tell from checking in every now and then over the past few years, the list so far has focused on “flagship” projects which are perceived to be the biggest opportunities, or roadblocks to remove, for the goal of having people only use free/libre/open source software.

A “positive” item is one which makes people want to embrace GNU/Linux and free software in order to use it: «I want to use Octave because it’s more efficient». A “negative” item is an obstacle to free software adoption, which we want removed: «I can’t use GNU/Linux because I need AutoCAD for work».

We want to propose something different: a cross-functional project, which will benefit no specific piece of software, but rather all of them. We believe that the key to the success of each and every free software project is going to be internationalization and localization. No i18n can help if the product is bad: here we assume that the idea of the product is sound and that we are able to scale its development, but we “just” need more users, more work.

What we believe

If free software is about giving control to the user, we believe it must also be about giving control of the language to its speakers. Proper localisation of a piece of software can only be done by people with a particular interest and competence in it, ideally native speakers who use the software.

It’s clear that there is little overlap between this group and developers; if nothing else, because most free software projects have at most a handful of developers: all together, they can only know a fraction of the world’s languages. Translation is not, and can’t be, a subset of programming. A GNOME dataset showed a strong specialisation of documenters, coders and i18n contributors.

We believe that the only way to put them in control is to translate the wiki way: easily, the only requirement being language competency; with no or very low barriers to access; using translations immediately in the software; correcting after the fact thanks to their usage, not with pre-publishing gatekeeping.

Translation should not be a labyrinth

In most projects, the i18n process is hard to join and incomprehensible, if explained at all. GNOME has a nice description of their workflow, which however is a perfect example of what the wiki way is not.

A logical consequence of the wiki way is that not all translators will know the software like the back of their hand. Hence, to translate correctly, translators need message documentation straight in their translation interface (context, possible values of parameters, grammatical role of words, …): we consider this a non-negotiable feature of any system chosen. Various research agrees.

Ok, but why care?

I18n is a recipe for success

First. Developers and experienced users are often affected by the software localisation paradox, which means they only use software in English and will never care about l10n even if they are in the best position to help it. At this point, they are doomed; but the computer users of the future, e.g. students, are not. New users may start using free software simply because of not knowing English and/or because it’s gratis and used by their school; then they will keep using it.

With words we don’t like much, we could say: if we conquer some currently marginal markets, e.g. people under a certain age or several countries, we can then have a sufficient critical mass to expand to the main market of a product.

Research is very lacking on this aspect: there was quite some research on predicting viability of FLOSS projects, but almost nothing on their i18n/l10n and even less on predicting their success compared to proprietary competitors, let alone on the two combined. However, an analysis of SourceForge data from 2009 showed that there is a strong correlation between high SourceForge rank and having translators (table 5): for successful software, translation is the “most important” work after coding and project management, together with documentation and testing.

Second. Even though translation must not be like programming, translation is a way to bring more people into the contributor base of each piece of software. Eventually, if they become more involved, translators will get in touch with the developers and/or the code, and potentially contribute there as well. In addition to this practical advantage, there’s also a political one: having one or two orders of magnitude more contributors to free software, worldwide, gives our ideas and actions a much stronger base.

Practically speaking, every package should be i18n-ready from the beginning (the investment pays back immediately) and its “Tools”/”Help” menu, or similarly visible interface element, should include a link to a website where everyone can join its translation. If the user’s locale is not available, the software should actively encourage joining translation.
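The “actively encourage” part could look roughly like the following. This is only a sketch: `AVAILABLE_LOCALES`, `TRANSLATE_URL` and `pick_locale` are made-up names for illustration, and the URL is a placeholder, not a real service.

```python
# Hypothetical sketch: all names and the URL below are illustrative assumptions.
AVAILABLE_LOCALES = {"en", "fi", "it", "pt_BR"}
TRANSLATE_URL = "https://translate.example.org/myproject"  # placeholder


def pick_locale(user_locale, available=AVAILABLE_LOCALES):
    """Return (locale_to_use, invitation_or_None) for a value like 'it_IT.UTF-8'."""
    # Strip encoding and modifier: 'ca_ES.UTF-8@valencia' -> 'ca_ES'
    base = user_locale.split(".")[0].split("@")[0]
    # The neutral POSIX locales are not a language, so nobody needs inviting.
    if base in ("C", "POSIX"):
        return "en", None
    # Try the full locale first, then the bare language code.
    for candidate in (base, base.split("_")[0]):
        if candidate in available:
            return candidate, None
    invitation = (
        f"This program is not yet available in your language ({base}). "
        f"You can help translate it at {TRANSLATE_URL}"
    )
    return "en", invitation
```

A program would call this once at startup and, when an invitation comes back, show it somewhere visible (for instance next to the “Help” menu entry) rather than silently falling back to English.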

Arjona Reina et al. 2013, based on the observation of 41 free software projects and 22 translation tools, actually claim that recruiting, informing and rewarding the translators is the most important factor for success of l10n, or even the only really important one.

Exton, Wasala et al. also suggest receiving in situ translations in a “crowdsourcing” or “micro-crowdsourcing” limbo, which we find superseded by a wiki. In fact, they end up requiring a “reviewing mechanism such as observed in the Wikipedia community” anyway, in addition to a voting system. Better to keep it simple and use a wiki in the first place.

Third. Extensive language support can be a clear demonstration of the power of free software. Unicode CLDR is an effort we share with companies like Microsoft or Apple, yet no proprietary software in the world can support 350 languages like MediaWiki. We should be able to say this of free software in general, and have the motives to use free software include i18n/l10n.

Research agrees that free software is more favourable to multilingualism because, compared to proprietary software, translation is more efficient, autonomous and web-based (Flórez & Alcina, 2011; citing Mas 2003, Bowker et al. 2008).

The obstacle here is linguistic colonialism, namely the self-disrespect billions of humans have for their own language. Language rights are often neglected and «some languages dominate» the web (UNO report A/HRC/22/49, §84); but many don’t even try to use their own language even where they could. The solution can’t be exclusively technical.

Fourth. Quality. Proprietary software we see in the wild has terrible translations (for example Google, Facebook, Twitter). They usually use very complex i18n systems or they give up on quality and use vote-based statistical approximation of quality; but the results are generally bad. A striking example is Android, which is “open source” but whose translation is closed as in all Google software, with terrible results.

How to reach quality? There can’t be an authoritative source for what’s the best translation of every single software string: the wiki way is the only way to reach the best quality; by gradual approximation, collaboratively. Free software can be more efficient and have a great advantage here.

Indeed, quality of available free software tools for translation is not a weakness compared to proprietary tools, according to the same Flórez & Alcina, 2011: «Although many agencies and clients require translators to use specific proprietary tools, free programmes make it possible to achieve similar results».

We are not there yet

Many tend to think they have “solved” i18n. The internet is full of companies selling i18n/l10n services as if they had found the panacea. The reality is, most software is not localised at all, or is localised in very few languages, or has terrible translations. Explaining the reasons is not the purpose of this post; we have discussed or will discuss the details elsewhere. Some perspectives:

A 2000 survey confirms that education about i18n is most needed: «There is a curious “localisation paradox”: while customising content for multiple linguistic and cultural market conditions is a valuable business strategy, localisation is not as widely practised as one would expect. One reason for this is lack of understanding of both the value and the procedures for localisation.»

Can we win this battle?

We believe it’s possible. The above may look too abstract, but it’s intentionally so. Figuring out the solution is not something we can do in this document, because making i18n our general strength is a difficult project: that’s why we argue it needs to be in the high priority projects list.

The initial phase will probably be one of research and understanding. As shown above, we have opinions everywhere, but too little scientific evidence on what really works: this must change. Where evidence is available, it should be known more than it currently is: a lot of education on i18n is needed. Sharing and producing knowledge also implies discussion, which helps the next step.

The second phase could come with a concrete medium-term goal: for instance, it could be decided that within a couple of years at least a certain percentage of GNU software projects should (also) offer a modern, web-based, computer-assisted translation tool with low barriers to access etc., compatible with the principles above. Requirements will be shaped by the first phase (including the need to accommodate existing workflows, of course).

This would probably require setting up a new translation platform (or giving new life to an existing one), because current “bigs” are either insufficiently maintained (Pootle and Launchpad) or proprietary. Hopefully, this platform would embed multiple perspectives and needs of projects way beyond GNU, and much more un-i18n’d free software would gravitate here as well.

A third (or fourth) phase would be about exploring the uncharted territory with which we share so little, like the formats, methods and CAT tools existing out there for translation of proprietary software and of things other than software. The whole translation world (millions of translators?) deserves free software. For this, a way broader alliance will be needed, probably with university courses and others, like the authors of Free/Open-Source Software for the Translation Classroom: A Catalogue of Available Tools and tuxtrans.

“What are you doing?”

Fair question. This proposal is not all talk. We are doing our best, with the tools we know. One of the challenges, as Wasala et al. say, is having a shared translation memory to make free software translation more efficient: so, we are building one. InTense is our new showcase of free software l10n and uses existing translations to offer an open translation memory to everyone; we believe we can eventually include practically all free software in the world.

For now, we have added a few dozen GNU projects and others, with 55 thousand strings and about 400 thousand translations. See also the translation interface for some examples.
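To make the idea of a translation memory concrete, here is a toy sketch in Python. It is not how InTense or translatewiki.net work internally; the class name, the similarity cutoff and the Italian sample strings are assumptions chosen for illustration, and the fuzzy matching simply relies on the standard `difflib` module.

```python
import difflib


class TranslationMemory:
    """Toy in-memory translation memory for a single target language."""

    def __init__(self):
        self._entries = {}  # source string -> existing translation

    def add(self, source, translation):
        self._entries[source] = translation

    def suggest(self, source, cutoff=0.8):
        """Return (matched_source, translation, similarity) or None."""
        # An exact hit can be reused as is.
        if source in self._entries:
            return source, self._entries[source], 1.0
        # Otherwise offer the closest known string, if it is similar enough.
        close = difflib.get_close_matches(
            source, list(self._entries), n=1, cutoff=cutoff
        )
        if not close:
            return None
        match = close[0]
        similarity = difflib.SequenceMatcher(None, match, source).ratio()
        return match, self._entries[match], similarity


tm = TranslationMemory()
tm.add("Open file", "Apri file")
tm.add("Save file", "Salva file")
```

With this data, `tm.suggest("Open files")` proposes the translation of “Open file” together with a similarity below 1.0, so a translator (or the interface) knows the suggestion needs checking. Sharing the memory across projects only means filling `add()` from many software packages instead of one.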

If translatewiki.net is asked to do its part, we are certainly available. MediaWiki has the potential to scale incredibly, after all: see Wikipedia. In the future, a wiki like InTense could be switched from read-only to read/write and become an über-translatewiki.net, translating thousands of projects.

But that’s not necessarily what we’re advocating for: what matters is the result, how much more well-localised software we get. In fact, MediaWiki gave birth to thousands of wikis; and its success is also in its principles being adopted by others, see e.g. the huge StackExchange family (whose Q&A are wikis and use a free license, though more individual-centred).

Maybe the solution will come with hundreds or thousands of separate installs of one or a handful of software platforms. Maybe the solution will not be to “translate the wiki way”, but a similar yet different concept, which still puts the localisation in the hands of users, giving them real freedom.

What do you think? Tell us in the comments.

Review of Gettext po(t) file format

Gettext shows its age both in developer and translator friendliness. What’s wrong with the old known localisation file formats which Google and Mozilla among others are so keen to replace? I don’t have a full answer to that. Gettext is clearly quite inflexible compared to Mozilla’s file format (which is almost a programming language) and it does not support many of the new features in Google’s resource bundles.

My general recommendation is: use the file format best supported by your i18n framework. If you can choose, prefer key-based formats. Only try new file formats if you need the new features, because tool support for them is not as good. It is also unclear which of the new file formats will “win” the fight and become popular.
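As a rough illustration of what “key-based” buys you, compare a catalogue keyed by the source text (the Gettext approach) with one keyed by a stable identifier. This is a sketch with made-up data and function names, not the API of any particular framework; it only models what happens at lookup time.

```python
# Same Italian translation stored two ways (sample data, purely illustrative).
by_source_text = {"Delete teh file?": "Eliminare il file?"}
by_key = {"confirm-delete": "Eliminare il file?"}


def lookup_by_text(source_text):
    # Falls back to the untranslated source string when there is no match.
    return by_source_text.get(source_text, source_text)


def lookup_by_key(key, source_text):
    # The key stays the same even when the source wording changes.
    return by_key.get(key, source_text)


# The developer fixes the typo "teh" -> "the" in the source string.
fixed = "Delete the file?"

# Keyed by text: the existing translation no longer matches.
assert lookup_by_text(fixed) == "Delete the file?"
# Keyed by identifier: the translation survives the typo fix.
assert lookup_by_key("confirm-delete", fixed) == "Eliminare il file?"
```

The flip side is that with a key-based format somebody has to decide whether a changed source string still means the same thing, which is what Gettext's fuzzy marking tries to handle.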

When making something new, it is good to look back. My initial motivation for writing this post was my annoyance while writing a tool which supports this format, but the context I’m going to give is completely different. The post has been waiting as a draft for a long time because it lacked a context where it makes sense. Maybe it also helps people who are wondering which localisation file format they should use.

Enough of the general thoughts. Let’s start this evaluation with the good things:
Can support plural for many languages. The plural syntax is flexible enough to cover at least most if not all of the world’s languages.
Fuzzy translations. It has a standard way to mark outdated translations, which is a necessity for this format which does not identify strings.
Tool support. Gettext can be used in many programming languages and there are plenty of tools for translators.
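To give an idea of the plural syntax: each translation file declares a `Plural-Forms` header with a C-like expression that maps a number to the index of the form to use. The sketch below rewrites two commonly used expressions (English and Polish) as Python functions; the function names and the sample Polish strings are only for illustration.

```python
def plural_en(n):
    # Plural-Forms: nplurals=2; plural=(n != 1);
    return int(n != 1)


def plural_pl(n):
    # Plural-Forms: nplurals=3;
    #   plural=(n==1 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2);
    if n == 1:
        return 0
    if 2 <= n % 10 <= 4 and (n % 100 < 10 or n % 100 >= 20):
        return 1
    return 2


# msgstr[0], msgstr[1], msgstr[2] of one Polish entry for "%d file" / "%d files"
POLISH_FORMS = ["%d plik", "%d pliki", "%d plików"]


def files_pl(n):
    """Pick the Polish form for n and fill in the number."""
    return POLISH_FORMS[plural_pl(n)] % n
```

For example, `files_pl(2)` gives “2 pliki” while `files_pl(12)` gives “12 plików”, a distinction a simple singular/plural switch cannot express.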

And then the things I don’t like:
Strings have no identifiers. This is my biggest annoyance with Gettext. Strings are identified by their contents, which means that fixing a typo in the source invalidates all translations. It also makes it impossible to keep track of a string’s history. This causes another problem: Identical strings are collapsed by default. This is especially annoying since in English words like Open (action) and Open (state) are the same but in other languages they are different. This effectively prevents proper translations, unless a message context is provided, but here lies another problem: Not all implementations support passing context. Last time I checked this was the case at least in Python.
And one nasty corner case for tool makers is that empty context is different from no context. If you don’t handle this right you will be producing invalid Gettext files.
I listed plural support above as a plus, but it is not without its problems. One string can only have plural forms depending on one variable. This forces the developers to use lego sentences when there is more than one number, or forces the translators to make ungrammatical translations. Not to mention that, in Arabic and other languages where there can be five or even more forms, you need to repeat the whole string as many times with small changes. That means lots of overhead updating and proofreading, as opposed to an inline syntax where you only mark the differences. To be fair, with an inline syntax it might be hard to see how each plural form looks in full, but there are solutions to that.
There is no standard way to present authorship information except for last translator. The file header is essentially free form text, making it hard to process and update that information programmatically. To be fair, this is the case for almost all i18n file formats I’ve seen.
The comments for individual strings are funky. There are different kinds of comments that start with “#,” or “#|”, not documented anywhere as far as I know, and the order of different kinds of comments matters! Do it wrong and you’ll have a file that some tools refuse to use. Not to mention that developers can also leave comments for the translators, in addition to the context parameter (so there are two ways!): the translators might or might not see them depending on the tool they use and on what is propagated from the pot file to the po file. It is quite a hassle to keep these comments in sync and repeated in all the translation files.
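Two of the corner cases above can be sketched in a few lines of Python. The first function shows why an empty context is not the same as no context: in compiled catalogues the context is glued in front of the message with an EOT character (`\x04`), so the two produce different lookup keys. The second part sorts the comment lines of one entry into the order GNU tools usually write them. The prefix list reflects the conventions commonly produced by GNU gettext tools as far as I understand them; the function names are made up, and this is nowhere near a full PO parser.

```python
GLUE = "\x04"  # EOT, separates context from message in compiled catalogues


def lookup_key(msgid, msgctxt=None):
    """Build the key a message is looked up by; None means 'no context'."""
    if msgctxt is None:
        return msgid
    return msgctxt + GLUE + msgid


# Comment prefixes, in the order they usually appear within one entry.
KINDS = {
    "# ": "translator comment",
    "#.": "extracted comment from the developer",
    "#:": "source reference",
    "#,": "flags such as fuzzy",
    "#|": "previous source string",
}
ORDER = list(KINDS)


def _prefix(line):
    prefix = "# " if line == "#" else line[:2]
    if prefix not in KINDS:
        raise ValueError(f"not a recognised PO comment line: {line!r}")
    return prefix


def comment_kind(line):
    return KINDS[_prefix(line)]


def sort_comments(lines):
    """Return the comment lines of one entry in the usual order (stable sort)."""
    return sorted(lines, key=lambda line: ORDER.index(_prefix(line)))
```

Here `lookup_key("Open")` and `lookup_key("Open", "")` differ, which is exactly the trap for tool makers: a tool that treats an empty context as “none” silently merges two distinct messages. For what it is worth, Python’s standard `gettext` module has offered `pgettext()` since version 3.8 and uses the same glue character.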

I’m curious to hear whether you would like to see more of these evaluations and perhaps a comparison of the formats. If there isn’t much interest I likely won’t do more.
