New translation memories near you soon

In the last sprint I developed a translation memory server in PHP almost from scratch. Well, it’s not really a server: it runs inside MediaWiki during client requests. It closely follows the logic of tmserver from the Translate Toolkit, which uses Python and SQLite.

The logic is pretty simple: you store all definitions and translations in a database, and then you can query suggestions for a given text. We use string length and fulltext search to narrow down the initial list of candidate messages. After that we use a text similarity algorithm to rank the suggestions and do the final filtering. The logic is explained in more detail in the Translate extension help.
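To make the two-stage idea concrete, here is a minimal sketch in Python rather than PHP. It is not the actual Translate code: the function name `suggest`, the `threshold` and `length_slack` parameters, the shared-word check standing in for fulltext search, and the sample data are all assumptions for illustration. The similarity score comes from Python's `difflib.SequenceMatcher`, used here as a convenient stand-in for the real similarity algorithm.

```python
from difflib import SequenceMatcher


def suggest(query, memory, threshold=0.75, length_slack=0.5):
    """Return (similarity, source, translation) tuples, best match first.

    memory is an iterable of (source, translation) pairs. All names and
    default values here are hypothetical, chosen only for illustration.
    """
    query_words = set(query.lower().split())
    shortest = len(query) * (1 - length_slack)
    longest = len(query) * (1 + length_slack)
    results = []
    for source, translation in memory:
        # Cheap filter 1: skip sources whose length is far from the query's.
        if not shortest <= len(source) <= longest:
            continue
        # Cheap filter 2: crude stand-in for fulltext search,
        # requiring at least one shared word.
        if not query_words & set(source.lower().split()):
            continue
        # Expensive step: score only the candidates that survived.
        similarity = SequenceMatcher(None, query, source).ratio()
        if similarity >= threshold:
            results.append((similarity, source, translation))
    results.sort(key=lambda item: item[0], reverse=True)
    return results


# Made-up sample memory (English source, Finnish translation).
memory = [
    ("Save the page", "Tallenna sivu"),
    ("Save this page", "Tallenna tämä sivu"),
    ("Delete", "Poista"),
    ("Upload a new version of this file", "Tallenna uusi versio tästä tiedostosta"),
]

for similarity, source, translation in suggest("Save the pages", memory):
    print(f"{similarity:.2f}  {source}  ->  {translation}")
```

With this sample data, "Delete" and the upload message are dropped by the length filter before any similarity is computed, which is the point of filtering first: the costly comparison only runs on a short list.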

PHP provides a text matching function, but we (Santhosh) had to implement a pure PHP fallback for strings longer than 255 bytes or strings containing anything other than ASCII. Both functions implement the Levenshtein edit distance algorithm. The pure PHP version is much slower, although that is offset a little because it compares characters rather than bytes, so it has less work to do when a string has fewer characters than bytes. More importantly, it works correctly on text that is not English. The faster implementation is used whenever possible. Before we optimized the matching process, it was the slowest part; now the time is bound by database access.
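Why bytes versus characters matters is easiest to see with a small example. The following is a sketch in Python, not the PHP code from the extension: a textbook Levenshtein implementation that accepts any sequence, run once on characters and once on the UTF-8 bytes of the same strings. The function name and the sample words are my own choices for illustration.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (str, bytes, list, ...)."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j items of b.
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


a = "p\u00e4iv\u00e4"  # "päivä", with each ä as a single code point
b = "paiva"

print(levenshtein(a, b))                                  # 2
print(levenshtein(a.encode("utf-8"), b.encode("utf-8")))  # 4
```

Two letters differ, so the character-level distance is 2. Each ä takes two bytes in UTF-8, so a byte-level comparison needs one substitution and one deletion per letter and reports 4, which would make non-ASCII text look less similar than it really is.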

End users won’t see much difference. Wanting a translation memory on Wikimedia wikis was the original reason for reimplementing translation memory in PHP, and in the coming sprints we are going to enable it on the wikis where Translate is enabled (currently meta-wiki, mediawiki.org, incubator and wikimania2012). The implementation is just over 300 lines of code [1], including comments, plus the database table definitions [2].

Now, having explained what was done and why, I can reveal the cool stuff, if you are still reading. There will also be a MediaWiki API module that allows querying the translation memory. There is a simple switch in the configuration to choose whether the memory is public or private. In the future this will allow querying translation memories from other sites, too.


One thought on “New translation memories near you soon”

  1. shaforostoff

    Please see the Longest Common Substring based comparison algorithm in Lokalize, tailored for the comparison of strings.

    It compares sentences word-by-word, then it compares words that differ character-by-character. This way it works much faster. This is not as fast as other algorithms, but it is guaranteed to be of the best quality for the real world sentence comparison.

    http://websvn.kde.org/trunk/KDE/kdesdk/lokalize/src/common/diff.cpp?revision=1204514&view=markup

    Also see the SelectJob::doSelect method in
    http://websvn.kde.org/trunk/KDE/kdesdk/lokalize/src/tm/jobs.cpp?revision=1270890&view=markup
    for an advanced formula used to measure the difference level of strings, which I built empirically based on my experience, and of course some functional analysis knowledge ))) (look for a variable named score)
