Review of Gettext po(t) file format

Gettext shows its age both in developer and translator friendliness. What’s wrong with the old known localisation file formats which Google and Mozilla among others are so keen to replace? I don’t have a full answer to that. Gettext is clearly quite inflexible compared to Mozilla’s file format (which is almost a programming language) and it does not support many of the new features in Google’s resource bundles.

My general recommendation is: use the file format best supported by your i18n framework. If you can choose, prefer key-based formats. Only try new file formats if you need the new features, because tool support for them is not as good. It is also unclear which of the new file formats will “win” the fight and become popular.

When making something new, it is good to look back. What initially motivated this post was my annoyance at writing a tool that supports this format, but the context I’m going to give is completely different. The post has been waiting as a draft for a long time because it lacked a context in which it made sense. Maybe it also helps people who are wondering which localisation file format they should use.

Enough of the general thoughts. Let’s start this evaluation with the good things:
Plural support for many languages. The plural syntax is flexible enough to cover at least most, if not all, of the world’s languages.
Fuzzy translations. It has a standard way to mark outdated translations, which is a necessity for this format which does not identify strings.
Tool support. Gettext can be used in many programming languages and there are plenty of tools for translators.
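
To make the plural point concrete, here is a minimal sketch in Python of how a catalog's plural rule maps a number to a form index. The three rules are the commonly cited Plural-Forms expressions for English, Russian and Arabic, transcribed by hand; the function names are made up for this illustration and are not part of any Gettext API.

```python
from typing import Callable, Sequence


def plural_index_english(n: int) -> int:
    # Plural-Forms: nplurals=2; plural=(n != 1);
    return int(n != 1)


def plural_index_russian(n: int) -> int:
    # Plural-Forms: nplurals=3; one / few / many
    if n % 10 == 1 and n % 100 != 11:
        return 0
    if 2 <= n % 10 <= 4 and (n % 100 < 10 or n % 100 >= 20):
        return 1
    return 2


def plural_index_arabic(n: int) -> int:
    # Plural-Forms: nplurals=6; zero / one / two / few / many / other
    if n == 0:
        return 0
    if n == 1:
        return 1
    if n == 2:
        return 2
    if 3 <= n % 100 <= 10:
        return 3
    if n % 100 >= 11:
        return 4
    return 5


def choose_form(forms: Sequence[str], rule: Callable[[int], int], n: int) -> str:
    # A PO entry stores one msgstr[i] per index; the rule picks which one to show.
    return forms[rule(n)]
```

For example, `choose_form(["файл", "файла", "файлов"], plural_index_russian, 22)` picks the second form, because 22 ends in 2 and is not in the 12 to 14 range.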

And then the things I don’t like:
Strings have no identifiers. This is my biggest annoyance with Gettext. Strings are identified by their contents, which means that fixing a typo in the source invalidates all translations. It also makes it impossible to keep any track of history. This causes another problem: Identical strings are collapsed by default. This is especially annoying since in English words like Open (action) and Open (state) are the same but in other languages they are different. This effectively prevents proper translations, unless a message context is provided, but here lies another problem: Not all implementations support passing context. Last time I checked this was the case at least in Python.
And one nasty corner case for tool makers is that empty context is different from no context. If you don’t handle this right you will be producing invalid Gettext files.
I listed plural support above as a plus, but it is not without its problems. One string can only have plural forms depending on one variable. This forces the developers to use lego sentences when there is more than one number, or forces the translators to make ungrammatical translations. Not to mention that, in Arabic and other languages where there can be five or even more forms, you need to repeat the whole string that many times with small changes. That is a lot of overhead in updating and proofreading, as opposed to an inline syntax where you only mark the differences. To be fair, with an inline syntax it might be hard to see how each plural form looks in full, but there are solutions to that.
There is no standard way to present authorship information except for last translator. The file header is essentially free form text, making it hard to process and update that information programmatically. To be fair, this is the case for almost all i18n file formats I’ve seen.
The comments for individual strings are funky. There are different kinds of comments, starting with markers such as “#,” and “#|”, which are not documented anywhere as far as I know, and the order of the different kinds matters! Get it wrong and you’ll have a file that some tools refuse to use. On top of that, developers can leave comments for the translators in addition to the context parameter (so there are two ways!), and the translators might or might not see them depending on the tool they use and on what is propagated from the pot file to the po file. It is quite a hassle to keep these comments in sync and repeated in all the translation files.
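
For tool makers, the two pitfalls above (comment ordering, and empty context versus no context) can be sketched as a small Python serializer. This is only an illustration under assumptions: `format_entry` and `po_quote` are hypothetical names, the comment order follows the entry layout given in the Gettext manual's PO file description, and long strings are not wrapped.

```python
from typing import Optional, Sequence


def po_quote(text: str) -> str:
    # Escape the characters that would otherwise break a quoted PO string.
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\t", "\\t")
    )
    return f'"{escaped}"'


def format_entry(
    msgid: str,
    msgstr: str,
    msgctxt: Optional[str] = None,
    translator_comments: Sequence[str] = (),
    extracted_comments: Sequence[str] = (),
    references: Sequence[str] = (),
    flags: Sequence[str] = (),
    previous_msgid: Optional[str] = None,
) -> str:
    lines = []
    # Order matters: translator, extracted, references, flags, previous msgid.
    lines += [f"# {comment}" for comment in translator_comments]
    lines += [f"#. {comment}" for comment in extracted_comments]
    if references:
        lines.append("#: " + " ".join(references))
    if flags:
        lines.append("#, " + ", ".join(flags))
    if previous_msgid is not None:
        lines.append("#| msgid " + po_quote(previous_msgid))
    # Test against None, not truthiness: msgctxt "" is a real (empty) context
    # and must be written out, while None means "no msgctxt line at all".
    if msgctxt is not None:
        lines.append("msgctxt " + po_quote(msgctxt))
    lines.append("msgid " + po_quote(msgid))
    lines.append("msgstr " + po_quote(msgstr))
    return "\n".join(lines)
```

Writing `if msgctxt:` instead of `if msgctxt is not None:` is exactly the kind of mistake that silently merges an empty-context entry with a context-free one.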

I’m curious to hear whether you would like to see more of these evaluations and perhaps a comparison of the formats. If there isn’t much interest I likely won’t do more.


16 thoughts on “Review of Gettext po(t) file format”

  1. Alexander E. Patrakov

    [I have reached your post via Planet KDE]

    Please also consider the following showstopper: gettext is unusable if the programmers don’t know English well. This is not often the case in open-source development, but it is practically impossible to hire a developer in Russia for a closed-source project who writes correct English from the first attempt. This means LOTS of msgid bugs, which, as you already mentioned, are very cumbersome to fix.

    And writing English sentences in this context (closed-source Russian company that nevertheless wants to sell its services and products in USA) is a separate paid job for a Russian-to-English translator.

    So, if you can recommend some alternative translation framework that supports the use case where programmers don’t need to write correct English from the first try, and has a ready-made tutorial how to use it with Django, please share your recommendation here. From my viewpoint, even a “postprocess all python sources and templates with this tool after each change” would be an acceptable step. Proprietary tools are OK for me if they are not Windows-only.

  2. Abdurrahman AVCI

    It was an interesting read for me. In fact, this was one of the very few posts I have read completely today. So please, continue on doing evaluations and comparisons like this.

  3. Lockal

    As far as I know, Gettext does not support multiple plural forms. For example, in Russian we have 1-2-5 forms and one-many forms which we use in different cases. You may think that Gettext covers many languages, but actually these plural rules are bad even for common languages, such as Russian.

  4. Franklin

    I’m a translator for zh_TW. What bothered me was actually the habits and behaviors of developers, rather than the po format itself. For example, “Identical strings are collapsed by default” is not bad in some respects, but it requires developers to actively add comments so that the strings can be distinguished. (And you can see that not all developers do that, even with a good commenting scheme.) Another problem may be the development tools. I would sometimes see a lot (really a lot!) of HTML tags in the msgid. And some entries were marked as fuzzy just because one HTML tag was changed.

  5. Vincent

    Hi,

    @Lockal
    This is an interesting remark (I don’t know Russian). What framework fits the “Russian needs”?

    Cheers

  6. Pau Garcia i Quiles

    Have you read about Gettext Generalized?

    http://nedohodnik.net/gettextbis/

    It’s essentially what KDE uses. It’s an extension of Gettext and it solves most of the problems you describe.

    @Alexander Patrakov: I don’t understand the problem with Russian. You can use Russian as the basis language (the .pot) and English as the second language (have a en.po).

  7. Heiko

    Thanks for this interesting posting. I add a comment to all items and additionally use msgctxt to identify strings.

    #: btndelete.caption
    msgctxt "btnDelete.Caption"
    msgid "&Delete"
    msgstr "&Löschen"

    But of course this is not a good solution. I’m looking forward to your follow-up posts.

  8. Alexander E. Patrakov

    @Pau Garcia i Quiles: it is not possible to use Russian as the basis language, because of the ngettext function. For the target language, it can adapt to virtually any scheme for plural forms (including Russian). However, for the basis language, it has the rule valid for Germanic family of languages only hard-coded in its signature. The invalid rule is: “there is only one plural form, to be used with n > 1”. The signature is:

    char * ngettext (const char * msgid, const char * msgid_plural, unsigned long int n);

  9. Chusslove Illich

    It cannot really be said that strings in PO have no identifiers — that is purely a convention. There is nothing stopping a given project from agreeing upon using identifier-like strings in code for msgid, and then providing English PO file too. If one does it this way, one has solved the problem of fixing typos, and the problem of collapsed strings. But might it be that some other problems would be created? It is a matter of perspective.

    The implementations that do not support passing contexts are old and need to be updated. But even when they are not, they don’t present a real problem. The given example of Python’s implementation can be fixed locally (per project) by a three-line function, and it is in practice.
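
A minimal sketch of the kind of short wrapper this comment alludes to, assuming the usual GNU convention that a catalog stores a contextualized message under the key context + "\x04" + msgid. The function name and argument order are chosen here for illustration only.

```python
import gettext

# GNU catalogs join msgctxt and msgid with an EOT character to form the lookup key.
CONTEXT_SEPARATOR = "\x04"


def pgettext(translations: gettext.NullTranslations, context: str, message: str) -> str:
    key = context + CONTEXT_SEPARATOR + message
    translated = translations.gettext(key)
    # An untranslated lookup returns the key itself; fall back to the bare message
    # so the separator never leaks into the user interface.
    return message if translated == key else translated
```

Newer Python versions (3.8 and later) provide `pgettext` in the standard `gettext` module, so a wrapper like this is only needed on older versions.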

    That empty context is different from no context can be considered this or that way, but the only really important thing is that it is well defined as such. Because the situation of deciding on equivalence between no-string and empty-string anyway arises many times when writing any non-trivial tool, and which solution is better depends on many things.

    Having general multiple-number plural strings implemented in a “fixed fashion”, would horribly complicate everything: coding, programming language support, translation tool support. Single-number plurals are the practical limit for fixed-fashion support. Multiple-number plurals are in practice achieved by joining single-plural strings together, with ample comments/contexts for translators. To have sane support for multiple-number plurals, one has to go for runtime string interpretation. Pau has given the link to a proposal for such a system (I’m the author), and this is what KDE’s translation system can do, since some years.

    The file header is a free-form text, but with a well established convention. Since the convention is not supported by syntax, it is true that translation tools and manual editing mess things up. Something indeed could be done here, and I would go for # … comments, like there are for messages. As for such comments on messages, their meaning and ordering, they are defined by the Gettext manual: http://www.gnu.org/software/gettext/manual/html_node/PO-Files.html#PO-Files . (The header comment convention is also presented somewhere in the manual.) In general, the Gettext manual is the reference for the PO format (among all other parts of the Gettext system), but for a more gradual translator-view exposition I suggest my text at http://pology.nedohodnik.net/doc/user/en_US/ch-poformat.html .

    One can also use non-English-plural language as the base language. In the given example of Russian, one can define a three-string plural function for use in the code, and instruct xgettext to extract as msgid and msgid_plural only, say, the first and the third string.
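
One way to read this suggestion, sketched in Python under stated assumptions: `ngettext_ru` is a hypothetical project-level helper, the Russian rule is the commonly cited three-form one, and the extractor would be told to pick up arguments 1 and 3 (for instance with an xgettext keyword specification along the lines of `--keyword=ngettext_ru:1,3`).

```python
import gettext


def russian_plural_index(n: int) -> int:
    # Three Russian forms: 0 = one (1, 21, ...), 1 = few (2-4, 22-24, ...), 2 = many.
    if n % 10 == 1 and n % 100 != 11:
        return 0
    if 2 <= n % 10 <= 4 and (n % 100 < 10 or n % 100 >= 20):
        return 1
    return 2


def ngettext_ru(one: str, few: str, many: str, n: int,
                translations: gettext.NullTranslations) -> str:
    # Only `one` and `many` reach the catalog as msgid and msgid_plural.
    translated = translations.ngettext(one, many, n)
    # Assumption: getting a source string back means "no translation found".
    # A translation that happens to equal the source would also take this branch,
    # which is harmless because the source-language rule then applies.
    if translated in (one, many):
        return (one, few, many)[russian_plural_index(n)]
    return translated
```

With no catalog loaded, `ngettext_ru("файл", "файла", "файлов", 3, gettext.NullTranslations())` falls back to the Russian rule and yields the "few" form, which plain `ngettext` cannot express.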

  10. Niklas Laxström Post author

    I think your points just reinforce my opinion that one needs to go the extra mile to work around the deficiencies in the Gettext format. For example, using identifiers in msgid is feasible, but makes them incompatible or cumbersome with some translation tools. If the format was naturally id-based, that problem would not exist. I agree that id-based formats also have their own set of problems.

    We can score the issues of each format for example by following criteria:

    1. prevents proper translation
    2. makes translators’ life harder
    3. makes developers’ life harder

    This order is from the most important to the least important issue, in my opinion. I am curious to know if we have an agreement at this level.

    I strongly disagree with your statement “having general multiple-number plural strings […] would horribly complicate everything”. Not because it is false, but because we are programmers and our job is to solve complex problems so that the users (in this case translators) can do their job and do it well. Lego messages are not what translators want.

    I calculated some stats for MediaWiki, which has about 27k messages. 14% of the messages varying in plural do so on more than one number (153 out of 1109).

    I also read about Gettext Generalized, linked above. It is an interesting page, but that’s not the same thing as Gettext, so I’m leaving it out of this evaluation. I did spot a nice thing there: named variables. In MediaWiki we are just numbering the variables based on position. I do agree that naming variables would to some extent make them self explanatory, reducing the need for manual explanation.

  11. Chusslove Illich

    I entirely agree with your criteria, both in their statement and priority. On all three accounts, I consider Gettext to be better than any other translation system. And here I do mean the baseline Gettext, not including the proposed generalization.

    With respect to ID-based vs. text-based identifiers, I think that the latter is overall advantageous. With an ID-based system, to perform more sophisticated translation handling (in the sense of improving upon the stated three criteria), it is necessary to dynamically expand it to text-based, i.e. to link source to translation. On the other hand, the advantages of an ID-based system, such as clean typo fixes or non-collapsing of short strings, are superficial. (However, from what I read and heard, I acknowledge the weighting may be different for closed-source software, due to inefficiencies and inflexibilities in the development process.) I’d like to hear about the advantages of an ID-based system other than those already mentioned.

    As for my claim of multi-plurals with fixed syntax being too much, it is sufficient to consider the combinatorial problem on the translation side. For a language with two plural forms, a two-number text will yield 4 distinct fixed-syntax strings. But for a language with four plural forms (such as mine), it will result in 16 distinct strings. A three-number text produces 8 and 64 strings respectively. The problem is not merely the number of strings, but tracking which string corresponds to which declension set. This becomes nigh impossible; lego text is by far the saner solution. The only way to correct this is to move from fixed-syntax to in-text interpolation, which would be equivalent to what I proposed (and KDE extensions already provide), only with sharply limited functionality.
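
The counts in this comment follow from raising the number of plural forms to the power of the number of plural-governing variables. A small Python sketch (the function names and form labels are made up for illustration) reproduces them and enumerates the combinations a translator would have to keep apart:

```python
from itertools import product
from typing import List, Sequence, Tuple


def variant_count(plural_forms: int, numbers: int) -> int:
    # Each number independently selects one of the plural forms.
    return plural_forms ** numbers


def variant_labels(forms: Sequence[str], numbers: int) -> List[Tuple[str, ...]]:
    # Every full string needed under a fixed syntax, labelled by the form
    # that each number takes in it.
    return list(product(forms, repeat=numbers))
```

For two numbers, `variant_count(2, 2)` gives 4 and `variant_count(4, 2)` gives 16; for three numbers the results are 8 and 64, matching the figures above.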

  12. Nemo

    Chusslove Illich, I don’t understand what you mean that the need to “link source to translation” makes good translation harder: can you provide an example? Maybe it’s a matter of poor interfaces: with MediaWiki’s Translate, search and replace of any translation text can be easily done on wiki.

    One could list more advantages of ID-based formats as you ask, but 1) I don’t see any disadvantage in the first place, 2) Gettext’s disadvantages above – and in particular “identical strings are collapsed” – are enough of a deal breaker. I also don’t see how improving the source text could be considered a “superficial” advantage: in MediaWiki we care a lot about the quality of our localisation and a simple log comparison shows that we had 2906 changes to core messages in the last few years, of which only 1419 added or removed a message; so the majority of changes is to improve some message and this usecase is important.

    As for the multiplication of messages for plurals, saying that lego (i.e. poor translation) is the only viable way is a frank admission of a failure of Gettext. An inline syntax is what the blog post suggests and you can see an example in MediaWiki’s plural syntax.

  13. Chusslove Illich

    I didn’t want to say that the need to link source to translation makes translation harder, but that to do almost any translation processing (including, well, translation itself) it is necessary to first link source to translation. With a text-key translation format this link is present by definition, while with an ID-key translation format this link has to be established first by almost every translation-related tool (including, well, the translation editor itself). That is my first problem with ID-key formats: unless the full text creation and translation process is tightly integrated (such as on a wiki, but unlike with random software), performing this linking properly is hard. Any inconsistencies in the linking lead to decreasing translation quality, i.e. violating Niklas’ stated criterion No. 1.

    My second problem with ID-key formats is that I simply don’t trust the source-text creator (programmer) to decide for me as translator when to change the ID so that translators can see the updated string. What is a typo fix? Does a given fix really not influence the translation into my language? When I’m faced with a body of text in an ID-key format, the first thing I do is establish a conversion round-trip (trying hard to get it right) to Gettext PO, thereby intentionally canceling this dubious advantage of ID-key formats. Again, criterion No. 1 is the king. When it really is a “typo fix”, i.e. a small source change not requiring a translation change, mighty Gettext-based translation tools show me this in a heartbeat. The effort I expend to unfuzzy such strings is negligible compared to the total update effort. So criterion No. 2 (“don’t make translator’s life harder”) is not violated in a quantitative sense.

    I also heard that programmers don’t like ID-key formats: too much bookkeeping is required on their part. In the rarer instances when I’m the programmer, I wholly concur with this stance. With the text-key format, when I program something that even remotely looks like it may get a “release” eventually, I start wrapping strings with no-op _() calls long before any translation system is actually in place. In other words, ID-key formats very much violate criterion No. 3 (“don’t make programmer’s life harder”). Add on top that the first programmer I don’t trust to decide when a new ID is needed, so that translators will be nudged, is myself.

    About collapsing strings, it should be kept in mind that while Gettext collapses by default, it also preserves the information that a string is collapsed. Should the programmer wish so, it is entirely trivial to make a script that will look at an extracted PO template and report any collapsed strings, so that they may be uncollapsed by adding context. This could be one of any number of scripts that maintainers use to verify aspects of their software. Why, then, does no project I know of do this? The reason in free software is simple: very low-barrier cooperation. For example, when I’m updating a KDE PO file and see a collapsed string that should be separated for my language, I can (and do) at that moment reach into the code, add disambiguating context, and commit. Other people just shoot a short message to the mailing list, and someone will quickly do the right thing in the code.
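
A rough Python sketch of such a script, under simplifying assumptions: entries are separated by blank lines, source references sit on `#:` lines separated by whitespace, and escape sequences in msgid are left as written. `collapsed_strings` is a hypothetical name; a real project would more likely build on an existing PO parsing library.

```python
import re
from typing import List, Tuple


def collapsed_strings(pot_text: str) -> List[Tuple[str, List[str]]]:
    """Return (msgid, references) for entries extracted from several places."""
    report = []
    for block in re.split(r"\n\s*\n", pot_text.strip()):
        references: List[str] = []
        msgid_parts: List[str] = []
        in_msgid = False
        for raw_line in block.splitlines():
            line = raw_line.strip()
            if line.startswith("#:"):
                references.extend(line[2:].split())
                in_msgid = False
            elif line.startswith("msgid "):
                # Drop the keyword and the surrounding double quotes.
                msgid_parts.append(line[len("msgid "):].strip()[1:-1])
                in_msgid = True
            elif in_msgid and line.startswith('"'):
                # Continuation line of a multi-line msgid.
                msgid_parts.append(line[1:-1])
            else:
                in_msgid = False
        msgid = "".join(msgid_parts)
        # The header entry has an empty msgid and is skipped.
        if msgid and len(references) > 1:
            report.append((msgid, references))
    return report
```

Each reported msgid is a candidate for review; whether it actually needs a disambiguating context is still a judgment call for the maintainer or translator.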

    To summarize, compared to text-key formats, ID-key formats are more adverse to criterion No. 1, improve only illusorily on criterion No. 2, and violate criterion No. 3 much more. Where there is a preference for ID-key formats, I ascribe it to some combination of lacking translation tools and inefficient programmer-translator cooperation.

    Back to plurals: right, in the meantime it slipped my mind that the post suggested inline syntax for plurals. But I too concurred that that is the only viable way (“in-text interpolation”). I only further suggested that introducing such syntax only for plurals would be a wasted opportunity, because once everything technically necessary to allow it were in place, one could as well let it be a general text transformation facility. The fact that this is still not available in baseline Gettext cannot be ascribed to a simple lack of aptitude on the part of Gettext developers. This is because the intention behind Gettext is to work for all software packages, regardless of the programming framework of implementation. In-text syntax would require a major shift in that sense, from wrapping framework-specific strings to forcing a single string format across all frameworks. All solutions currently providing in-text syntax work only with the one framework they were developed for (such as the given MediaWiki example).

  14. Pingback: GNU i18n for high priority projects list « It rains like a saavi
