Difference between revisions of "Internationalization/mo parser"

m (Hash Function)
m (Hash Function)
Line 69: Line 69:
  
  
Each string has an associate hashing value Vm computed by a fixed function (see below).  To locate the string, open addressing with double hashing is used. The first index will be V % S, where S is the size of the hashing table. If no entry is found, iterating with a second, independent hashing function takes place. This second value will be:
+
Each string has an associate hashing value V computed by a fixed function (see below).  To locate the string, open addressing with double hashing is used. The first index will be V % S, where S is the size of the hashing table. If no entry is found, iterating with a second, independent hashing function takes place. This second value will be:
 
  1 + V % (S - 2).
 
  1 + V % (S - 2).
 
The approximate number of probes will be
 
The approximate number of probes will be

Revision as of 22:00, 10 May 2006

Summary

That's what this part of the project should achieve:

  • reading and parsing of MO files containing the strings and their translations
  • organize the object collection in an incremental way: don't load the whole file if it's not needed
  • give a simple interface to the localization class, so that the strings can be printed out without too much efforts

Reading and parsing

Parser structure

I'll propose the class structure of the parser, later.

MO file structure

As reported from the gettext manual.

          byte
               +------------------------------------------+
            0  | magic number = 0x950412de                |
               |                                          |
            4  | file format revision = 0                 |
               |                                          |
            8  | number of strings                        |  == N
               |                                          |
           12  | offset of table with original strings    |  == O
               |                                          |
           16  | offset of table with translation strings |  == T
               |                                          |
           20  | size of hashing table                    |  == S
               |                                          |
           24  | offset of hashing table                  |  == H
               |                                          |
               .                                          .
               .    (possibly more entries later)         .
               .                                          .
               |                                          |
            O  | length & offset 0th string  ----------------.
        O + 8  | length & offset 1st string  ------------------.
                ...                                    ...   | |
O + ((N-1)*8)  | length & offset (N-1)th string           |  | |
               |                                          |  | |
            T  | length & offset 0th translation  ---------------.
        T + 8  | length & offset 1st translation  -----------------.
                ...                                    ...   | | | |
T + ((N-1)*8)  | length & offset (N-1)th translation      |  | | | |
               |                                          |  | | | |
            H  | start hash table                         |  | | | |
                ...                                    ...   | | | |
    H + S * 4  | end hash table                           |  | | | |
               |                                          |  | | | |
               | NUL terminated 0th string  <----------------' | | |
               |                                          |    | | |
               | NUL terminated 1st string  <------------------' | |
               |                                          |      | |
                ...                                    ...       | |
               |                                          |      | |
               | NUL terminated 0th translation  <---------------' |
               |                                          |        |
               | NUL terminated 1st translation  <-----------------'
               |                                          |
                ...                                    ...
               |                                          |
               +------------------------------------------+

Hash Function

Here the description of the hash function used by gettext. I've taken it from it's source code gettext-0.14.5.


Each string has an associate hashing value V computed by a fixed function (see below). To locate the string, open addressing with double hashing is used. The first index will be V % S, where S is the size of the hashing table. If no entry is found, iterating with a second, independent hashing function takes place. This second value will be:

1 + V % (S - 2).

The approximate number of probes will be

for unsuccessful search: (1 - N / S) ^ -1
for successful search: - (N / S) ^ -1 * ln (1 - N / S)
where N is the number of keys.

If we now choose S to be the next prime bigger than 4 / 3 * N, we get the values

4 and 1.85 resp.

Because unsuccessful searches are unlikely this is a good value. Formulas: [Knuth, The Art of Computer Programming, Volume 3, Sorting and Searching, 1973, Addison Wesley]

The hash function used to convert stings to an integer, is the so called hashpjw function by P.J. Weinberger [see Aho/Sethi/Ullman, COMPILERS: Principles, Techniques and Tools, 1986, 1987 Bell Telephone Laboratories, Inc.]

The C code of the function is the following (Eiffel translation comming soon...):

#define HASHWORDBITS 32

hash_string (const char *str_param)
{
  unsigned long int hval, g;
  const char *str = str_param;

  /* Compute the hash value for the given string.  */
  hval = 0;
  while (*str != '\0')
    {
      hval <<= 4;
      hval += (unsigned char) *str++;
      g = hval & ((unsigned long int) 0xf << (HASHWORDBITS - 4));
      if (g != 0)
	{
	  hval ^= g >> (HASHWORDBITS - 8);
	  hval ^= g;
	}
    }
  return hval;
}
This function comes from the source code of gettext (./gettext-0.14.5/gettext-runtime/intl/hash-string.h)