Difference between revisions of "Internationalization/file format"

Revision as of 06:56, 2 September 2006


Summary

Here we evaluate various file formats used for the translation of programs. For the moment we are considering:

  • XML
  • po
  • xliff (good description in this homepage)
  • creating our own format

PO Files

Format of PO files

A PO file has an entry for each string that has to be translated. There are two kinds of entry: a "normal" one and one that involves plural forms.

Normal entry

Here is the general structure of a "normal" entry:

white-space
#  translator-comments
#. extracted-comments
#: references...
#, flag...
msgid untranslated-string
msgstr translated-string

The translator-comments are created and maintained exclusively by the translator; these comments have some white space immediately following the #. The other comments are generated by the program that created the PO file. References are space-separated lists of locations (sourcefile:linenumber) specifying where the translation unit is found in the source files. The special comment "#," is followed by flags; for example, fuzzy indicates that the msgstr string might not be a correct translation, i.e. the translator is not sure of the work. The untranslated-string is the string as it appears in the original program source. The translated-string is, as the name suggests, the translation; if there is no translation yet, it is an empty string.
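The structure described above can be read with very little code. The following is only a minimal sketch in Python, not a complete PO parser: the PoEntry class, the parse_entry function and the sample entry (including the file names src/main.e and src/dialog.e) are made up for illustration. It assumes that msgid and msgstr values are written in double quotes, as they are in real PO files, and it ignores escape sequences, plural entries and obsolete entries.

```python
from dataclasses import dataclass, field


@dataclass
class PoEntry:
    """One "normal" PO entry (hypothetical container, for illustration only)."""
    translator_comments: list = field(default_factory=list)
    extracted_comments: list = field(default_factory=list)
    references: list = field(default_factory=list)
    flags: list = field(default_factory=list)
    msgid: str = ""
    msgstr: str = ""


def _unquote(text):
    # Strip the surrounding double quotes; escape sequences are left untouched.
    text = text.strip()
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        return text[1:-1]
    return text


def parse_entry(block):
    """Parse the text of a single normal entry into a PoEntry."""
    entry = PoEntry()
    current = None  # field that a continuation line ("...") belongs to
    for line in block.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#."):
            entry.extracted_comments.append(line[2:].strip())
        elif line.startswith("#:"):
            entry.references.extend(line[2:].split())
        elif line.startswith("#,"):
            entry.flags.extend(f.strip() for f in line[2:].split(",") if f.strip())
        elif line.startswith("#"):
            # Any other comment is treated as a translator comment.
            entry.translator_comments.append(line[1:].strip())
        elif line.startswith("msgid "):
            entry.msgid = _unquote(line[len("msgid "):])
            current = "msgid"
        elif line.startswith("msgstr "):
            entry.msgstr = _unquote(line[len("msgstr "):])
            current = "msgstr"
        elif line.startswith('"') and current:
            # Continuation line: append to the field started most recently.
            setattr(entry, current, getattr(entry, current) + _unquote(line))
    return entry


SAMPLE = '''
#  Checked by the translator
#. Shown in the title bar
#: src/main.e:42 src/dialog.e:7
#, fuzzy
msgid "Open file"
msgstr "Apri file"
'''

entry = parse_entry(SAMPLE)
print(entry.references, entry.flags, entry.msgid, entry.msgstr)
```

The order of the checks matters: the "#.", "#:" and "#," prefixes are tested before the bare "#", so only the remaining comments end up as translator comments.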

Plural form entry

white-space
#  translator-comments
#. extracted-comments
#: reference...
#, flag...
msgid untranslated-string-singular
msgid_plural untranslated-string-plural
msgstr[0] translated-string-case-0
...
msgstr[N] translated-string-case-n
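How one of the msgstr[N] cases is chosen can be illustrated with a short sketch. In gettext the catalog header normally declares the plural rule in a Plural-Forms field; here hand-written Python functions stand in for that expression. The function names and the sample strings are assumptions made for this example only.

```python
def germanic_plural(n):
    # Two forms: index 0 for exactly one item, index 1 otherwise (n != 1).
    return int(n != 1)


def polish_plural(n):
    # Three forms, following the rule commonly given for Polish.
    if n == 1:
        return 0
    if 2 <= n % 10 <= 4 and not 10 <= n % 100 < 20:
        return 1
    return 2


def select_case(msgstr_cases, plural_rule, n):
    """Return the msgstr[N] case that the rule selects for the count n."""
    return msgstr_cases[plural_rule(n)]


english = ["%d file", "%d files"]              # msgstr[0], msgstr[1]
polish = ["%d plik", "%d pliki", "%d plików"]  # msgstr[0] .. msgstr[2]

for count in (1, 2, 5, 22):
    print(select_case(english, germanic_plural, count) % count,
          "/",
          select_case(polish, polish_plural, count) % count)
```

The number of msgstr[N] lines in an entry therefore depends on the target language: two for a Germanic-style rule, three for the Polish rule sketched here.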

Supported character encodings

Character encodings that can be used are limited to those supported by both GNU libc and GNU libiconv. These are: ASCII, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-13, ISO-8859-15, KOI8-R, KOI8-U, CP850, CP866, CP874, CP932, CP949, CP950, CP1250, CP1251, CP1252, CP1253, CP1254, CP1255, CP1256, CP1257, GB2312, EUC-JP, EUC-KR, EUC-TW, BIG5, BIG5-HKSCS, GBK, GB18030, SHIFT_JIS, JOHAB, TIS-620, VISCII, UTF-8.

I think that is quite a lot...

PO Editors

  • poEdit
  • KBabel
  • Gtranslator
  • LocFactoryEditor (XLIFF and PO editor for Mac OS X)

Positive aspects

  • Powerful plural handling
  • Format created specifically for translation
  • Easy for humans to read
  • Used by gettext, KBabel, Rosetta and many other programs
  • Support and processing tools for almost all platforms

Negative aspects

  • The PO file format does not provide a way of identifying the source and target language within a file. By GNU standards, GNU software is written in American English (en-US), and this is reflected in Gettext by only having support for Germanic plural forms in the source language. It is therefore recommended to set the source-language attribute to en-US by default.
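The recommendation above can be sketched as follows, assuming that the "source-language attribute" refers to the attribute of that name on the XLIFF <file> element, i.e. that the PO data is being converted to XLIFF. The function name and the file name messages.po are hypothetical, and the element is simplified (no XML namespace, no header or body).

```python
import xml.etree.ElementTree as ET


def make_xliff_file_element(original, target_language=None, source_language="en-US"):
    """Build a simplified XLIFF <file> element for a PO catalog.

    A PO file does not record its source language, so en-US is used
    unless the caller knows better.
    """
    attrs = {
        "original": original,
        "source-language": source_language,
        "datatype": "po",
    }
    if target_language:
        # The target language must be supplied from outside the PO file.
        attrs["target-language"] = target_language
    return ET.Element("file", attrs)


file_el = make_xliff_file_element("messages.po", target_language="it-IT")
print(ET.tostring(file_el, encoding="unicode"))
```

Keeping the default in one place means a project whose sources are not written in American English only has to pass a different source_language.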

ts Files

Format of ts files

The .ts file format is used by Trolltech for Qt applications. The files are XML-conforming. Here is an example of a .ts file generated by lupdate (a tool made by Trolltech that extracts translatable text from the C++ source code of a Qt application):

<!DOCTYPE TS><TS>
    <context>
        <name>MyExample</name>
        <message>
            <source>i18n=Internationalization</source>
            <translation type="unfinished"></translation>
        </message>
    </context>
</TS>

And after the translation (for example with Qt Linguist) it would look like this:

<!DOCTYPE TS><TS>
    <context>
        <name>MyExample</name>
        <message>
            <source>i18n=Internationalization</source>
            <translation>i20e=Internazionalizzazione</translation>
        </message>
    </context>
</TS>

The .ts file is then converted with a tool named lrelease to the .qm file format, a compact binary format that provides extremely fast lookups for translations.

The .qm files can also be created with the GNU gettext tools: "xgettext --qt" serves as the string extractor that produces the .pot file, and the translated file (.po) is then converted with the "msgfmt --qt" command to create the .qm file.
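Because a .ts file is XML, the structure shown in the examples above can be read with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the function name read_messages is made up for illustration, and only the elements that appear in the examples (context, name, message, source, translation) are handled.

```python
import xml.etree.ElementTree as ET

TS_SAMPLE = """<!DOCTYPE TS><TS>
    <context>
        <name>MyExample</name>
        <message>
            <source>i18n=Internationalization</source>
            <translation type="unfinished"></translation>
        </message>
    </context>
</TS>"""


def read_messages(ts_text):
    """Yield (context, source, translation, finished) for each <message>."""
    root = ET.fromstring(ts_text)
    for context in root.iter("context"):
        name = context.findtext("name", default="")
        for message in context.iter("message"):
            translation = message.find("translation")
            text = (translation.text or "") if translation is not None else ""
            # A translation marked type="unfinished" still needs work.
            finished = translation is not None and translation.get("type") != "unfinished"
            yield name, message.findtext("source", default=""), text, finished


messages = list(read_messages(TS_SAMPLE))
for context_name, source, translation, finished in messages:
    print(context_name, repr(source), repr(translation), finished)
```

A tool built this way could, for example, list all messages that are still unfinished before lrelease is run.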

Positive aspects

  • Full support for Unicode character encodings
  • In Trolltech's opinion, it is human-readable text

Negative aspects

  • Qt's translation framework does not support plurals
  • The Qt message catalog format supports Unicode only in the translated strings, not in the untranslated strings

New Format

Format of our Format

  • It doesn't exist yet, so we don't know what it looks like.
  • We could give the file format our own extension, for example .et (eiffel translation) or .babe (babylon eiffel) or .eint (eiffel i18n) ... (huge advantage)

Positive aspects

  • Free to do what we want
  • We don't have to care about licenses
  • Possibility of making it the most human-readable format
  • We could say that we invented a new file format
  • Better integrated and more consistent with Eiffel syntax

Negative aspects

  • A new format? Why should we be different?
  • More work than necessary (there are already good formats)
  • It would take a long time to become well known

Conclusions

Our decision is to use the po file format.

References