Internationalization/mo parser

Revision as of 06:36, 15 May 2006 by Etienner (Hash Function: removed precondition)

Summary

This part of the project should achieve the following:

  • read and parse MO files containing the strings and their translations
  • organize the object collection incrementally: don't load the whole file if it's not needed
  • give a simple interface to the localization class, so that the strings can be printed out without too much effort

What is an MO file

MO stands for Machine Object. As the name suggests, MO files are meant to be read by programs and are binary in nature. The GNU program that creates them from a PO file is msgfmt.

Reading and parsing

Parser structure

I'll propose the class structure of the parser later.

MO file structure

As described in the gettext manual:

          byte
               +------------------------------------------+
            0  | magic number = 0x950412de                |
               |                                          |
            4  | file format revision = 0                 |
               |                                          |
            8  | number of strings                        |  == N
               |                                          |
           12  | offset of table with original strings    |  == O
               |                                          |
           16  | offset of table with translation strings |  == T
               |                                          |
           20  | size of hashing table                    |  == S
               |                                          |
           24  | offset of hashing table                  |  == H
               |                                          |
               .                                          .
               .    (possibly more entries later)         .
               .                                          .
               |                                          |
            O  | length & offset 0th string  ----------------.
        O + 8  | length & offset 1st string  ------------------.
                ...                                    ...   | |
O + ((N-1)*8)  | length & offset (N-1)th string           |  | |
               |                                          |  | |
            T  | length & offset 0th translation  ---------------.
        T + 8  | length & offset 1st translation  -----------------.
                ...                                    ...   | | | |
T + ((N-1)*8)  | length & offset (N-1)th translation      |  | | | |
               |                                          |  | | | |
            H  | start hash table                         |  | | | |
                ...                                    ...   | | | |
    H + S * 4  | end hash table                           |  | | | |
               |                                          |  | | | |
               | NUL terminated 0th string  <----------------' | | |
               |                                          |    | | |
               | NUL terminated 1st string  <------------------' | |
               |                                          |      | |
                ...                                    ...       | |
               |                                          |      | |
               | NUL terminated 0th translation  <---------------' |
               |                                          |        |
               | NUL terminated 1st translation  <-----------------'
               |                                          |
                ...                                    ...
               |                                          |
               +------------------------------------------+

Magic number

The magic number identifies GNU MO files. It is stored in the byte order of the generating machine, so a reader really has to accept two numbers: 0x950412de when the file's byte order matches the reader's, and 0xde120495 when it is reversed.
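As a minimal sketch of how a reader could use the magic number, the C code below tries both byte orders on the first four bytes and then reads the remaining header fields in the order that matched. The names mo_header, mo_read32 and mo_read_header are hypothetical, the struct fields simply reuse the letters from the layout diagram, and the code assumes the header has already been loaded into memory.

```c
#include <stddef.h>
#include <stdint.h>

#define MO_MAGIC 0x950412deu

/* Hypothetical header record; n, o, t, s, h follow the layout diagram. */
struct mo_header {
    int      big_endian;   /* 1 if the file was written big-endian */
    uint32_t revision;
    uint32_t n, o, t, s, h;
};

/* Assemble a 32-bit value from four bytes in the given byte order. */
static uint32_t mo_read32(const unsigned char *p, int big_endian)
{
    if (big_endian)
        return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
             | ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[1] << 8)  |  (uint32_t)p[0];
}

/* Returns 0 on success, -1 if buf is too short or has no MO magic. */
static int mo_read_header(const unsigned char *buf, size_t len,
                          struct mo_header *out)
{
    if (len < 28)
        return -1;                       /* the seven fields need 28 bytes */
    if (mo_read32(buf, 0) == MO_MAGIC)
        out->big_endian = 0;
    else if (mo_read32(buf, 1) == MO_MAGIC)
        out->big_endian = 1;
    else
        return -1;                       /* not a GNU MO file */
    out->revision = mo_read32(buf + 4,  out->big_endian);
    out->n        = mo_read32(buf + 8,  out->big_endian);
    out->o        = mo_read32(buf + 12, out->big_endian);
    out->t        = mo_read32(buf + 16, out->big_endian);
    out->s        = mo_read32(buf + 20, out->big_endian);
    out->h        = mo_read32(buf + 24, out->big_endian);
    return 0;
}
```

Reading byte by byte keeps the sketch independent of the byte order of the machine running the parser: checking for 0x950412de in both orders is equivalent to reading a native integer and comparing it against the two numbers above.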

Tables

In the tables, each string descriptor uses two 32-bit integers: the first is the string length, and the second is the offset of the string in the MO file, counted in bytes from the start of the file. The first table contains descriptors for the original strings and is sorted in lexicographical order. The second table contains descriptors for the translated strings and is parallel to the first table: to find the translation of the original string at a given index, one accesses the slot with the same index in the second table.
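Because the first table is sorted, a translation can be found by binary search even when the file has no hash table. The following C sketch illustrates this under several assumptions that are not part of the description above: the whole file is already in memory, it is little-endian, and the offsets in it are trusted (a real parser would check every offset against the file size). The function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Read a little-endian 32-bit value (byte order is an assumption here). */
static uint32_t rd32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Return the i-th string of the descriptor table starting at table_off.
   Each descriptor is 8 bytes: length first, then offset from file start. */
static const char *mo_string(const unsigned char *mo, uint32_t table_off,
                             uint32_t i)
{
    const unsigned char *desc = mo + table_off + 8u * i;
    return (const char *)(mo + rd32(desc + 4));
}

/* Binary search in the sorted table of originals; on a hit, return the
   string at the same index in the translation table, otherwise NULL. */
static const char *mo_lookup(const unsigned char *mo, const char *msgid)
{
    uint32_t n = rd32(mo + 8);    /* N: number of strings        */
    uint32_t o = rd32(mo + 12);   /* O: table of originals       */
    uint32_t t = rd32(mo + 16);   /* T: table of translations    */
    uint32_t lo = 0, hi = n;

    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        int cmp = strcmp(msgid, mo_string(mo, o, mid));
        if (cmp == 0)
            return mo_string(mo, t, mid);
        if (cmp < 0)
            hi = mid;
        else
            lo = mid + 1;
    }
    return NULL;
}
```

Since only the descriptors and the strings actually compared are touched, this style of lookup also fits the goal of not loading the whole file when it is not needed.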

Hash table

The hash table is not contained in all MO files; when it is absent, the size S is zero. It is sometimes better to omit the hash table, because a precomputed hash table takes disk space and does not gain much speed. The hash table contains indices into the sorted array of strings in the MO file.

Hash Function

Here is the description of the hash function used by gettext. I've taken it from its source code (gettext-0.14.5).


Each string has an associated hash value V computed by a fixed function (see below). To locate the string, open addressing with double hashing is used. The first index will be V % S, where S is the size of the hash table. If the string is not found at that index, probing continues with a step computed by a second, independent hash function. This second value will be:

1 + V % (S - 2).
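The probe sequence can be sketched in C as follows. The sketch works on plain in-memory arrays rather than on an MO file, and the names hashpjw, probe_insert and probe_lookup are hypothetical. It also makes an assumption that the text above does not state: a slot holding 0 is treated as empty and an occupied slot holds the string index plus one. It further assumes S is a prime greater than 2 and the table is never completely full, so every probe sequence reaches an empty slot.

```c
#include <stdint.h>
#include <string.h>

/* 32-bit variant of the hashpjw function described in this section. */
static uint32_t hashpjw(const char *s)
{
    uint32_t h = 0, g;
    while (*s != '\0') {
        h <<= 4;
        h += (unsigned char)*s++;
        g = h & 0xf0000000u;          /* top four bits */
        if (g != 0) {
            h ^= g >> 24;
            h ^= g;
        }
    }
    return h;
}

/* Store keys[i] in the table: first slot V % S, then steps of
   1 + V % (S - 2) until an empty slot (value 0) is found. */
static void probe_insert(uint32_t *tab, uint32_t s,
                         const char *const *keys, uint32_t i)
{
    uint32_t v = hashpjw(keys[i]);
    uint32_t idx = v % s;
    uint32_t incr = 1 + v % (s - 2);
    while (tab[idx] != 0)
        idx = (idx + incr) % s;
    tab[idx] = i + 1;                 /* index + 1, so 0 stays "empty" */
}

/* Return the index of key in keys, or -1 if it is not in the table. */
static long probe_lookup(const uint32_t *tab, uint32_t s,
                         const char *const *keys, const char *key)
{
    uint32_t v = hashpjw(key);
    uint32_t idx = v % s;
    uint32_t incr = 1 + v % (s - 2);
    for (;;) {
        uint32_t entry = tab[idx];
        if (entry == 0)
            return -1;                /* empty slot: unsuccessful search */
        if (strcmp(keys[entry - 1], key) == 0)
            return (long)(entry - 1);
        idx = (idx + incr) % s;       /* next probe */
    }
}
```

With S prime, the step 1 + V % (S - 2) is never a multiple of S, so the probes visit every slot before repeating.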

The approximate number of probes will be

for unsuccessful search: (1 - N / S) ^ -1
for successful search: - (N / S) ^ -1 * ln (1 - N / S)
where N is the number of keys.

If we now choose S to be the next prime bigger than 4 / 3 * N, then N / S is about 3 / 4, and the formulas give about 4 and 1.85 probes respectively: (1 - 3 / 4) ^ -1 = 4 and - (3 / 4) ^ -1 * ln (1 / 4) ≈ 1.85.

Because unsuccessful searches are unlikely, this is a good value. Formulas: [Knuth, The Art of Computer Programming, Volume 3, Sorting and Searching, 1973, Addison Wesley]
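The choice of S can be sketched in C as below. This is only an illustration of the rule stated above, not the code msgfmt uses: hash_table_size and is_prime are hypothetical names, 4 / 3 * N is rounded down with integer arithmetic, and the result is clamped to at least 3 so that the second hash 1 + V % (S - 2) never divides by zero.

```c
#include <stdint.h>

/* Trial division; adequate for the table sizes considered here. */
static int is_prime(uint32_t n)
{
    uint32_t d;
    if (n < 2)
        return 0;
    for (d = 2; d <= n / d; d++)
        if (n % d == 0)
            return 0;
    return 1;
}

/* Smallest prime strictly greater than floor(4 * n / 3), but at least 3
   (the lower bound is an assumption of this sketch). */
static uint32_t hash_table_size(uint32_t n)
{
    uint32_t s = (uint32_t)(((uint64_t)n * 4) / 3) + 1;
    if (s < 3)
        s = 3;
    while (!is_prime(s))
        s++;
    return s;
}
```

For example, 100 strings give 4 / 3 * N = 133 after rounding down, and the next prime above that is 137, so N / S is about 0.73.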

The hash function used to convert strings to an integer is the so-called hashpjw function by P.J. Weinberger [see Aho/Sethi/Ullman, COMPILERS: Principles, Techniques and Tools, 1986, 1987 Bell Telephone Laboratories, Inc.]

The C code of the function comes from the source code of gettext (./gettext-0.14.5/gettext-runtime/intl/hash-string.h):

#define HASHWORDBITS 32
unsigned long int hash_string (const char *str_param) {
   unsigned long int hval, g;
   const char *str = str_param;

   /* Compute the hash value for the given string.  */
   hval = 0;
   while (*str != '\0')
     {
       hval <<= 4;
       hval += (unsigned char) *str++;
       /* Isolate the top four bits of the 32-bit hash word.  */
       g = hval & ((unsigned long int) 0xf << (HASHWORDBITS - 4));
       if (g != 0)
         {
           /* Fold them into the low bits, then clear them.  */
           hval ^= g >> (HASHWORDBITS - 8);
           hval ^= g;
         }
     }
   return hval;
 }

Here is the translated function in Eiffel:

hash_string (a_string: STRING): INTEGER is
    -- Compute the hash value for the given string
require
    valid_string: a_string /= Void
local
    position: INTEGER
    l_result, g: NATURAL_32
do
    from
        position := 1
    invariant
        position >= 1
        position <= a_string.count + 1
    variant
        a_string.count + 1 - position
    until
        position > a_string.count
    loop
        l_result := l_result |<< 4
        l_result := l_result + a_string.code(position)
        g := l_result & ({NATURAL_32} 0xf |<< 28)
        if g /= 0 then
            l_result := l_result.bit_xor(g |>> 24)
            l_result := l_result.bit_xor(g)
        end
        position := position + 1
    end
    Result := l_result.as_integer_32
end