Difference between revisions of "Internationalization/file format"
Revision as of 06:54, 2 September 2006
Summary
Here we evaluate various file formats used for the translation of programs. For the moment we are considering:
- XML
- po
- xliff (good description in this homepage)
- a new format of our own
PO Files
Format of PO files
A PO file has an entry for each string that has to be translated. There are two kinds of entries: a "normal" one and one that involves plural forms.
Normal entry
Here is the general structure of a "normal" entry:
    white-space
    # translator-comments
    #. extracted-comments
    #: references...
    #, flag...
    msgid untranslated-string
    msgstr translated-string
The translator-comments are created and maintained exclusively by the translator; these comments have some white space immediately following the #. The other comments are created by the program that generated the PO file. References are space-separated lists of locations (sourcefile:linenumber) specifying where the translation unit is found in the source files. After the special comment "#," there can be some flags; for example, fuzzy indicates that the msgstr string might not be a correct translation, i.e. the translator is not sure of the work. The untranslated-string is the untranslated string as it appears in the original program source. The translated-string is (as the name suggests) the translated string; if there is no translation, it is an empty string.
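To make the structure of a normal entry concrete, here is a minimal sketch in Python of how one entry could be split into its parts. It is only an illustration under simplifying assumptions: strings fit on a single line, escape sequences are not decoded, and the function names (parse_entry, unquote) and the sample file names are hypothetical.

```python
def unquote(text):
    """Strip the surrounding double quotes from a PO string (no escape handling)."""
    text = text.strip()
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        return text[1:-1]
    return text


def parse_entry(lines):
    """Split the lines of one normal PO entry into its parts (simplified sketch)."""
    entry = {
        "translator_comments": [],
        "extracted_comments": [],
        "references": [],
        "flags": [],
        "msgid": "",
        "msgstr": "",
    }
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # white-space separating entries
        if line.startswith("#."):
            entry["extracted_comments"].append(line[2:].strip())
        elif line.startswith("#:"):
            # space-separated list of sourcefile:linenumber locations
            entry["references"].extend(line[2:].split())
        elif line.startswith("#,"):
            entry["flags"].extend(f.strip() for f in line[2:].split(",") if f.strip())
        elif line.startswith("#"):
            # "#" followed by white space: written by the translator
            entry["translator_comments"].append(line[1:].strip())
        elif line.startswith("msgid "):
            entry["msgid"] = unquote(line[len("msgid "):])
        elif line.startswith("msgstr "):
            entry["msgstr"] = unquote(line[len("msgstr "):])
    return entry


# Hypothetical sample entry; the file names are made up for illustration.
SAMPLE = '''
# checked by the translator
#. shown in the File menu
#: src/menu.e:42 src/toolbar.e:7
#, fuzzy
msgid "Open file"
msgstr "Apri file"
'''.splitlines()

entry = parse_entry(SAMPLE)
print(entry["references"], entry["flags"], entry["msgstr"])
```

The order of the prefix checks matters: "#.", "#:" and "#," have to be tested before the plain "#" case, otherwise every comment would be classified as a translator comment.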
Plural form entry
    white-space
    # translator-comments
    #. automatic-comments
    #: reference...
    #, flag...
    msgid untranslated-string-singular
    msgid_plural untranslated-string-plural
    msgstr[0] translated-string-case-0
    ...
    msgstr[N] translated-string-case-n
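The following Python sketch illustrates how the msgstr[0] ... msgstr[N] cases of a plural entry might be selected at run time. It assumes that the language's plural rule is available as a function from a number to a case index (in gettext this rule comes from the Plural-Forms header of the PO file, which is not described on this page). The names translate_plural and polish_rule and the sample entry are hypothetical.

```python
def polish_rule(n):
    """Plural rule for Polish (three cases), written out as a Python function."""
    if n == 1:
        return 0
    if 2 <= n % 10 <= 4 and (n % 100 < 10 or n % 100 >= 20):
        return 1
    return 2


def translate_plural(entry, n, plural_rule):
    """Pick the translated string for the number n (simplified sketch)."""
    index = plural_rule(n)
    forms = entry["msgstr"]
    if 0 <= index < len(forms) and forms[index]:
        return forms[index]
    # No translation available: fall back to the untranslated strings.
    return entry["msgid"] if n == 1 else entry["msgid_plural"]


# Hypothetical plural entry, already parsed into a dictionary.
ENTRY = {
    "msgid": "%d file",
    "msgid_plural": "%d files",
    "msgstr": ["%d plik", "%d pliki", "%d plików"],
}

for count in (1, 3, 5, 22):
    print(translate_plural(ENTRY, count, polish_rule) % count)
```

A language such as Polish needs three cases, which is why the format allows an arbitrary number of msgstr[N] lines instead of a fixed singular/plural pair.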
Supported character encodings
The character encodings that can be used are limited to those supported by both GNU libc and GNU libiconv. These are: ASCII, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-13, ISO-8859-15, KOI8-R, KOI8-U, CP850, CP866, CP874, CP932, CP949, CP950, CP1250, CP1251, CP1252, CP1253, CP1254, CP1255, CP1256, CP1257, GB2312, EUC-JP, EUC-KR, EUC-TW, BIG5, BIG5-HKSCS, GBK, GB18030, SHIFT_JIS, JOHAB, TIS-620, VISCII, UTF-8.
I think they are a lot...
PO Editors
- poEdit
- KBabel
- Gtranslator
- LocFactoryEditor (XLIFF and PO editor for Mac OS X)
Positive aspects
- Powerful plural handling
- Format created specifically for translation
- Easy for humans to read
- Used by gettext, KBabel, Rosetta and many other programs
- Support and processing tools for almost all platforms
Negative aspects
- We have to write a PO parser
ts Files
Format of ts files
The .ts file format is used by Trolltech for Qt applications. The files are well-formed XML. Here is an example of a .ts file, generated by lupdate (a tool made by Trolltech that extracts translatable text from the C++ source code of the Qt application, see here for further information):
    <!DOCTYPE TS><TS>
    <context>
        <name>MyExample</name>
        <message>
            <source>i18n=Internationalization</source>
            <translation type="unfinished"></translation>
        </message>
    </context>
    </TS>
And after the translation (for example with Qt Linguist) it would look like this:
    <!DOCTYPE TS><TS>
    <context>
        <name>MyExample</name>
        <message>
            <source>i18n=Internationalization</source>
            <translation>i20e=Internazionalizzazione</translation>
        </message>
    </context>
    </TS>
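Because .ts files are plain XML, they can be read with any XML library. As a rough sketch, the following Python code uses the standard xml.etree.ElementTree module to list the messages of the untranslated example above. It only handles the elements that appear in that example; the function name read_ts and the dictionary keys are hypothetical.

```python
import xml.etree.ElementTree as ET

# The untranslated example from above, as produced by lupdate.
TS_TEXT = """<!DOCTYPE TS><TS>
<context>
    <name>MyExample</name>
    <message>
        <source>i18n=Internationalization</source>
        <translation type="unfinished"></translation>
    </message>
</context>
</TS>"""


def read_ts(text):
    """Return one dictionary per <message> element (simplified sketch)."""
    root = ET.fromstring(text)
    messages = []
    for context in root.findall("context"):
        name = context.findtext("name", default="")
        for message in context.findall("message"):
            translation = message.find("translation")
            text_value = ""
            finished = False
            if translation is not None:
                text_value = translation.text or ""
                # lupdate marks untranslated messages with type="unfinished"
                finished = translation.get("type") != "unfinished"
            messages.append({
                "context": name,
                "source": message.findtext("source", default=""),
                "translation": text_value,
                "finished": finished,
            })
    return messages


for msg in read_ts(TS_TEXT):
    print(msg)
```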
The .ts file is then converted with a tool named lrelease to the .qm file format, a compact binary format that provides extremely fast lookups for translations.
The .qm files can also be created with the GNU gettext tools: "xgettext --qt" works as the string extractor that produces the .pot file, and the translated file (.po) is then converted with the "msgfmt --qt" command to create the .qm file.
Positive aspects
- Full support for Unicode character encodings
- In Trolltech's opinion it is human-readable text
Negative aspects
- Qt's translation framework does not support plurals
- Qt message catalog format supports Unicode only in the translated strings, not in the untranslated strings
New Format
Format of our Format
- It doesn't exist yet, so we don't know what it looks like.
- We could give the file format our own extension, for example .et (eiffel translation) or .babe (babylon eiffel) or .eint (eiffel i18n) ... (huge advantage)
Positive aspects
- Free to do what we want
- We don't have to care about licenses
- Possibility to make it the most human-readable format
- We could say that we invented a new file format
- Better integrated and more consistent with Eiffel syntax
Negative aspects
- A new format? Why should we be different?
- More work than needed (there are already good formats)
- It would take a long time to become well known
Conclusions
Our decision is to use the PO file format.