Posts Tagged ‘po’

Review of Gettext po(t) file format

Wednesday, November 20th, 2013

Gettext shows its age both in developer and translator friendliness. What’s wrong with the old known localisation file formats which Google and Mozilla among others are so keen to replace? I don’t have a full answer to that. Gettext is clearly quite inflexible compared to Mozilla’s file format (which is almost a programming language) and it does not support many of the new features in Google’s resource bundles.

My general recommendation is: use the file format best supported by your i18n framework. If you can choose, prefer key based formats. Only try new file formats if you need the new features, because tool support for them is not as good. There is also no clarity which of the new file formats will “win” the fight and become popular.

When making something new, it is good to look back. The motivation why I wrote this post initially was my annoyances writing a tool which supports this format, but the context I’m going to give is completely different. It has been waiting as draft to be published for a long time because it lacked context where it makes sense. Maybe this also helps people, who are wondering what localisation file format they should use.

Enough of the general thoughts. But let’s start this evaluation with the good things:
Can support plural for many languages. The plural syntax is flexible enough to cover at least most if not all of world’s languages.
Fuzzy translations. It has a standard way to mark outdated translations, which is a necessity for this format which does not identify strings.
Tool support. Gettext can be used in many programming languages and there are plenty of tools for translators.

And then the things I don’t like:
Strings have no identifiers. This is my biggest annoyance with Gettext. Strings are identified by their contents, which means that fixing a typo in the source invalidates all translations. It also makes it impossible to keep any track of history. This causes another problem: Identical strings are collapsed by default. This is especially annoying since in English words like Open (action) and Open (state) are the same but in other languages they are different. This effectively prevents proper translations, unless a message context is provided, but here lies another problem: Not all implementations support passing context. Last time I checked this was the case at least in Python.
And one nasty corner case for tool makers is that empty context is different from no context. If you don’t handle this right you will be producing invalid Gettext files.
I listed plural support above as a plus, but it is not without its problems. One string can only have plural forms depending on one variable. This forces the developers to use lego sentences when there is more than one number, or force the translators to make ungrammatical translations. Not to mention that, in Arabic and other languages where there can be five or even more forms, you need to repeat the whole string as many times with small changes. Lots of overhead updating and proofreading that, as opposed to an inline syntax where you only mark the differences. To be fair, with an inline syntax it might hard to see how each plural form looks in full, but there are solutions to that.
There is no standard way to present authorship information except for last translator. The file header is essentially free form text, making it hard to process and update that information programmatically. To be fair, this is the case for almost all i18n file formats I’ve seen.
The comments for individual strings are funky. There are different kinds of comments that start with “#,” “#|”, not documented anywhere as far as I know, and the order of different kinds of comments matters! Do it wrong and you’ll have a file that some tools refuse to use. Not to mention that developers can also leave comments for the translators, in addition to the context parameter (so there are two ways!): the translators might or might not see them depending on the tool they use and on what is propagated from the pot file to the po file. It is quite a hassle to keep these comments in sync and repeated in all the translation files.

I’m curious to hear whether you would like to see more of these evaluations and perhaps a comparison of the formats. If there isn’t much interest I likely won’t do more.

-- .