Difference between revisions of "Internationalization/file format"

Revision as of 23:39, 1 September 2006


Summary

Here we evaluate various file formats used for the translation of programs. For the moment we are considering:

  • XML
  • po
  • XLIFF (good description on this homepage)
  • creating a format of our own

PO Files

Format of PO files

A PO file has one entry for each string that has to be translated. There are two kinds of entries: a "normal" one and one that involves plural forms.

Normal entry

Here is the general structure of a "normal" entry:

white-space
#  translator-comments
#. automatic-comments
#: reference...
#, flag...
msgid untranslated-string
msgstr translated-string

The translator-comments are created and maintained exclusively by the translator; these comments have white space immediately after the #. The other comments are generated by the program that created the PO file. The special comment "#," introduces flags; for example, the fuzzy flag shows that the msgstr string might not be a correct translation, i.e. the translator is not sure of the work. The untranslated-string is the string as it appears in the original program source. The translated-string is, as the name suggests, its translation; if there is no translation yet, it is an empty string.
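The structure above can be read with very little code. The following is a minimal sketch in Python, not a complete PO parser: the function names `unquote` and `parse_entry`, the dictionary layout and the sample entry are made up for illustration. It assumes that strings are written in double quotes and may be continued on following quoted lines, as gettext does.

```python
# Escape sequences handled by this sketch (only a common subset).
ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}


def unquote(text):
    """Strip the surrounding double quotes and resolve backslash escapes."""
    text = text.strip()
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        text = text[1:-1]
    out = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            out.append(ESCAPES.get(text[i + 1], text[i + 1]))
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def parse_entry(block):
    """Parse one "normal" PO entry (hypothetical helper, no plural support)."""
    entry = {
        "translator_comments": [],
        "automatic_comments": [],
        "references": [],
        "flags": [],
        "msgid": "",
        "msgstr": "",
    }
    current = None  # field that quoted continuation lines are appended to
    for line in block.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#."):
            entry["automatic_comments"].append(line[2:].strip())
        elif line.startswith("#:"):
            entry["references"].extend(line[2:].split())
        elif line.startswith("#,"):
            entry["flags"].extend(
                flag.strip() for flag in line[2:].split(",") if flag.strip()
            )
        elif line.startswith("#"):
            entry["translator_comments"].append(line[1:].strip())
        elif line.startswith("msgid "):
            current = "msgid"
            entry[current] = unquote(line[len("msgid "):])
        elif line.startswith("msgstr "):
            current = "msgstr"
            entry[current] = unquote(line[len("msgstr "):])
        elif line.startswith('"') and current:
            entry[current] += unquote(line)
    return entry


# Invented sample entry; the file name and texts are placeholders.
sample = '''
#  Checked by a native speaker
#. Shown in the main window title
#: src/main.e:42
#, fuzzy
msgid "Hello, world!\\n"
msgstr "Hallo, Welt!\\n"
'''

entry = parse_entry(sample)
print(entry["flags"], repr(entry["msgstr"]))
```

A real parser would also have to deal with obsolete entries, the header entry and error reporting, which this sketch leaves out.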

Plural form entry

white-space
#  translator-comments
#. automatic-comments
#: reference...
#, flag...
msgid untranslated-string-singular
msgid_plural untranslated-string-plural
msgstr[0] translated-string-case-0
...
msgstr[N] translated-string-case-n
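How one of the msgstr[N] strings is chosen at run time can be sketched as follows. In gettext the index is computed from the number by a rule declared in the catalog's Plural-Forms header; here two such rules are simply written by hand as Python functions. The entry, the Polish texts and all function names are illustrative assumptions, not part of any existing library.

```python
def plural_index_german(n):
    """Two forms: index 0 for exactly one item, index 1 otherwise."""
    return int(n != 1)


def plural_index_polish(n):
    """Three forms, mirroring the rule usually given for Polish."""
    if n == 1:
        return 0
    if 2 <= n % 10 <= 4 and not (10 <= n % 100 < 20):
        return 1
    return 2


# A parsed plural entry, stored as a plain dictionary for illustration.
entry = {
    "msgid": "%d file",
    "msgid_plural": "%d files",
    "msgstr": ["%d plik", "%d pliki", "%d plików"],
}


def translate_plural(entry, n, plural_index):
    """Pick msgstr[plural_index(n)]; fall back to the untranslated strings."""
    forms = entry["msgstr"]
    index = plural_index(n)
    if index < len(forms) and forms[index]:
        return forms[index] % n
    source = entry["msgid"] if n == 1 else entry["msgid_plural"]
    return source % n


for count in (1, 3, 5, 12, 22):
    print(translate_plural(entry, count, plural_index_polish))
```

The point of the design is that the number of forms and the rule are data of the target language, so the program source only ever distinguishes singular and plural.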

Supported character encodings

The character encodings that can be used are limited to those supported by both GNU libc and GNU libiconv. These are: ASCII, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-13, ISO-8859-15, KOI8-R, KOI8-U, CP850, CP866, CP874, CP932, CP949, CP950, CP1250, CP1251, CP1252, CP1253, CP1254, CP1255, CP1256, CP1257, GB2312, EUC-JP, EUC-KR, EUC-TW, BIG5, BIG5-HKSCS, GBK, GB18030, SHIFT_JIS, JOHAB, TIS-620, VISCII, UTF-8.

That is quite a lot of encodings.
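Because so many encodings are allowed, a reader of PO files first has to find out which one a given file uses. A PO file normally states this in its header entry, in a line of the form "Content-Type: text/plain; charset=...". The sketch below looks for that declaration in the raw bytes and then decodes the file; the function name, the default encoding and the sample data are assumptions made for this example.

```python
import re


def detect_po_charset(raw, default="utf-8"):
    """Return the charset named in the PO header, or a default if absent."""
    match = re.search(rb"charset=([A-Za-z0-9_.:-]+)", raw)
    if match is None:
        return default
    name = match.group(1).decode("ascii")
    # Untranslated templates carry the placeholder "CHARSET" instead of a name.
    return default if name.upper() == "CHARSET" else name


# Invented sample file, encoded as ISO-8859-1 on purpose.
raw = (
    'msgid ""\n'
    'msgstr ""\n'
    '"Content-Type: text/plain; charset=ISO-8859-1\\n"\n'
    "\n"
    'msgid "Café"\n'
    'msgstr "Kaffeehaus"\n'
).encode("iso-8859-1")

charset = detect_po_charset(raw)
text = raw.decode(charset)
print(charset)
```

Whether the decoding step succeeds depends on the codecs available in the runtime, so a parser written in another language would need a comparable conversion library.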

PO Editors

  • poEdit
  • KBabel
  • Gtranslator
  • LocFactoryEditor (XLIFF and PO editor for Mac OS X)

Positive aspects

  • Powerful plural handling
  • Format created for translation purpose
  • Easy for humans to read
  • Used by gettext, KBabel, Rosetta and many other programs
  • Support and processing tools for almost all platforms

Negative aspects

  • We have to write a PO parser

ts Files

Format of ts files

The .ts file format is used by Trolltech for Qt applications. The files are well-formed XML. Here is an example of a .ts file, generated by lupdate (a tool made by Trolltech that extracts translatable text from the C++ source code of a Qt application, see here for further information):

<!DOCTYPE TS><TS>
    <context>
        <name>MyExample</name>
        <message>
            <source>i18n=Internationalization</source>
            <translation type="unfinished"></translation>
        </message>
    </context>
</TS>

And after the translation (for example with Qt Linguist) it would look like this:

<!DOCTYPE TS><TS>
    <context>
        <name>MyExample</name>
        <message>
            <source>i18n=Internationalization</source>
            <translation>i20e=Internazionalizzazione</translation>
        </message>
    </context>
</TS>
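Since .ts files are XML, a generic XML parser is enough to read them. The following sketch uses Python's standard xml.etree.ElementTree module on the translated example above and collects the finished translations in a dictionary keyed by context name and source string. The function name `load_ts` and the choice of key are assumptions for illustration; only the elements shown in the example are handled.

```python
import xml.etree.ElementTree as ET

# The translated example from the text, embedded as a string.
TS_TEXT = """<!DOCTYPE TS><TS>
    <context>
        <name>MyExample</name>
        <message>
            <source>i18n=Internationalization</source>
            <translation>i20e=Internazionalizzazione</translation>
        </message>
    </context>
</TS>"""


def load_ts(text):
    """Map (context name, source string) to the finished translation."""
    catalog = {}
    root = ET.fromstring(text)
    for context in root.iter("context"):
        name = context.findtext("name", default="")
        for message in context.iter("message"):
            source = message.findtext("source", default="")
            translation = message.find("translation")
            # Skip messages without a translation or still marked unfinished.
            if translation is None or translation.get("type") == "unfinished":
                continue
            catalog[(name, source)] = translation.text or ""
    return catalog


catalog = load_ts(TS_TEXT)
print(catalog[("MyExample", "i18n=Internationalization")])
```

Messages whose translation element carries type="unfinished", as in the first example, are left out of the dictionary, so a lookup can fall back to the source string.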

The .ts file is then converted with a tool named lrelease to the .qm file format, a compact binary format that provides extremely fast lookups for translations.

The .qm files can also be created with the GNU gettext tools: "xgettext --qt" serves as the string extractor that produces the .pot file, and "msgfmt --qt" converts the translated .po file into a .qm file.

Positive aspects

  • Full support for Unicode character encodings
  • In Trolltech's opinion it is human-readable text

Negative aspects

  • Not everybody knows it
  • Qt's translation framework does not support plurals

New Format

Format of our Format

  • It doesn't exist yet, so we don't know what it will look like.
  • We could give the file format our own extension, for example .et (Eiffel translation) or .babe (Babylon Eiffel) or .eint (Eiffel i18n) ... (huge advantage)

Positive aspects

  • Free to do what we want
  • We don't have to care about licenses
  • Possibility to make it the best human readable format
  • So that we can say that we invented a new file format
  • Better integrated and more consistent with Eiffel syntax

Negative aspects

  • A new format? Why should we be different?
  • More work than necessary (there are already good formats)
  • It would take a long time to become widely known

Conclusions

Our decision is to use the PO file format.
