Internationalization/file format
Summary
Here we evaluate various file formats used for the translation of programs. For the moment we are considering:
- XML
- po
- a new format of our own
PO Files
Format of PO files
A PO file has one entry for each string that has to be translated. There are two kinds of entries: a "normal" one and one that involves plural forms.
Normal entry
Here is the general structure of a "normal" entry:
white-space
# translator-comments
#. automatic-comments
#: reference...
#, flag...
msgid untranslated-string
msgstr translated-string
The translator-comments are created and maintained exclusively by the translator; these comments have some white space immediately following the #. The other comments are created by the program that generated the PO file. After the special comment "#," there can be one or more flags. For example, fuzzy indicates that the msgstr string might not be a correct translation, e.g. because the translator is not sure of the work. The untranslated-string is the string as it appears in the original program source. The translated-string is, as the name suggests, the translation; if there is no translation yet, it is an empty string.
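To make the structure above more concrete, here is a minimal sketch of how a single "normal" entry could be read. It is only an illustration, not a full PO parser: the PoEntry class, the unquote and parse_entry helpers, and the sample entry (including the reference src/main.c:42) are all hypothetical names invented for this example. Obsolete entries, plural forms and most escape sequences are deliberately ignored.

```python
from dataclasses import dataclass, field


@dataclass
class PoEntry:
    """One "normal" PO entry (hypothetical container, not part of gettext)."""
    translator_comments: list = field(default_factory=list)
    automatic_comments: list = field(default_factory=list)
    references: list = field(default_factory=list)
    flags: list = field(default_factory=list)
    msgid: str = ""
    msgstr: str = ""


def unquote(text):
    """Strip the surrounding double quotes and undo two common escapes.

    Simplification: only \\" and \\n are handled; other escapes are left as-is.
    """
    text = text.strip()
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        text = text[1:-1]
    return text.replace('\\"', '"').replace("\\n", "\n")


def parse_entry(lines):
    """Build a PoEntry from the lines of one normal entry."""
    entry = PoEntry()
    target = None  # field that a continuation line ("...") extends
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # leading white space / blank separator
        if line.startswith("#."):
            entry.automatic_comments.append(line[2:].strip())
        elif line.startswith("#:"):
            entry.references.extend(line[2:].split())
        elif line.startswith("#,"):
            entry.flags.extend(f.strip() for f in line[2:].split(",") if f.strip())
        elif line.startswith("#"):
            # Assumption: anything else starting with # is a translator comment.
            entry.translator_comments.append(line[1:].strip())
        elif line.startswith("msgid "):
            entry.msgid = unquote(line[len("msgid "):])
            target = "msgid"
        elif line.startswith("msgstr "):
            entry.msgstr = unquote(line[len("msgstr "):])
            target = "msgstr"
        elif line.startswith('"') and target:
            # A quoted line on its own continues the previous string.
            setattr(entry, target, getattr(entry, target) + unquote(line))
    return entry


# Made-up sample entry, used only to exercise the sketch.
sample = '''
# Checked by the translator
#. Shown in the title bar
#: src/main.c:42
#, fuzzy
msgid "Hello"
msgstr "Ciao"
'''.splitlines()

entry = parse_entry(sample)
print(entry.msgid, "->", entry.msgstr, entry.flags)
```

A tool built along these lines could, for instance, list every entry whose flags contain fuzzy so that a translator can review it.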
Plural form entry
white-space
# translator-comments
#. automatic-comments
#: reference...
#, flag...
msgid untranslated-string-singular
msgid_plural untranslated-string-plural
msgstr[0] translated-string-case-0
...
msgstr[N] translated-string-case-n
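The following sketch shows one way a program might pick the right msgstr[i] for a given count. In gettext the mapping from a count to an index comes from the catalog's Plural-Forms header; here it is simply passed in as a Python function. The names select_plural and two_form_rule, the two-form rule itself, and the Italian sample strings are assumptions made for illustration.

```python
def select_plural(msgstr_forms, n, plural_rule):
    """Return the translation to use for count n.

    msgstr_forms: list where index i holds the text of msgstr[i].
    plural_rule:  function mapping a count to an index into msgstr_forms.
    """
    index = plural_rule(n)
    if not 0 <= index < len(msgstr_forms):
        raise IndexError(
            f"plural rule returned {index}, but only "
            f"{len(msgstr_forms)} forms are available"
        )
    return msgstr_forms[index]


def two_form_rule(n):
    """Assumed rule for a language with two forms: singular only when n == 1."""
    return 0 if n == 1 else 1


# Illustrative translations for msgid "%d message" / msgid_plural "%d messages".
forms = ["%d messaggio", "%d messaggi"]

for count in (1, 3):
    print(select_plural(forms, count, two_form_rule) % count)
```

Languages with more than two plural forms would supply a longer list and a different rule; the lookup itself stays the same.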
Supported character encodings
The character encodings that can be used are limited to those supported by both GNU libc and GNU libiconv. These are: ASCII, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-13, ISO-8859-15, KOI8-R, KOI8-U, CP850, CP866, CP874, CP932, CP949, CP950, CP1250, CP1251, CP1252, CP1253, CP1254, CP1255, CP1256, CP1257, GB2312, EUC-JP, EUC-KR, EUC-TW, BIG5, BIG5-HKSCS, GBK, GB18030, SHIFT_JIS, JOHAB, TIS-620, VISCII, UTF-8.
That is quite a lot of encodings...
Positive aspects
- Powerful plural handling
- Format created specifically for translation
- Easy for humans to read
- Used by gettext, KBabel, Rosetta and many other programs
Negative aspects
- We have to write a PO parser
XML
Format of XML
XML is used, for example, by Trolltech for its .ts files. Here is an example of a .ts file generated by lupdate (a Trolltech tool that extracts translatable text from the C++ source code of a Qt application):
<!DOCTYPE TS><TS>
<context>
    <name>MyExample</name>
    <message>
        <source>i18n=Internationalization</source>
        <translation type="unfinished"></translation>
    </message>
</context>
</TS>
And after the translation (for example with Qt Linguist) it would look like this:
<!DOCTYPE TS><TS>
<context>
    <name>MyExample</name>
    <message>
        <source>i18n=Internationalization</source>
        <translation>i20e=Internazionalizzazione</translation>
    </message>
</context>
</TS>
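As a rough illustration of how a program might read such a file, the sketch below uses Python's standard xml.etree.ElementTree module to walk the contexts and messages of the example above. The read_messages helper is a hypothetical name, and treating every translation without type="unfinished" as finished is an assumption of this sketch, not a statement about how Qt's own tools behave.

```python
import xml.etree.ElementTree as ET

# The untranslated example from above, embedded as a string.
TS_TEXT = """<!DOCTYPE TS><TS>
<context>
    <name>MyExample</name>
    <message>
        <source>i18n=Internationalization</source>
        <translation type="unfinished"></translation>
    </message>
</context>
</TS>"""


def read_messages(ts_text):
    """Yield (context, source, translation, finished) for each message."""
    root = ET.fromstring(ts_text)
    for context in root.iter("context"):
        name = context.findtext("name", default="")
        for message in context.iter("message"):
            source = message.findtext("source", default="")
            translation = message.find("translation")
            text = ""
            finished = False
            if translation is not None:
                text = translation.text or ""
                # Assumption: anything not marked "unfinished" counts as done.
                finished = translation.get("type") != "unfinished"
            yield name, source, text, finished


for name, source, text, finished in read_messages(TS_TEXT):
    status = "done" if finished else "unfinished"
    print(f"[{name}] {source!r} -> {text!r} ({status})")
```

Reading the file takes only a few lines because an XML parser is already available; the format-specific work is limited to knowing which elements and attributes to look at.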
Positive aspects
- Full support for Unicode character encodings
- There is already a parser in EiffelBase
- In Trolltech's opinion it is human-readable text
Negative aspects
- Not everybody knows it
- Microsoft seeks XML-related patents that could restrict the use of XML (there should be a "Very negative aspect" section)
- In my opinion it is not human-readable text (fortunately, not all human beings are computer scientists)
New Format
Format of our own format
It doesn't exist yet, so we don't know what it would look like.
Positive aspects
- Free to do what we want
Negative aspects
- A new format? Why should we be different?