Difference between revisions of "Internationalization/mo parser"

(hash function used for for .mo files)
(Hash Function)
Line 67: Line 67:
 
Here the description of the hash function used by gettext. I've taken it from it's source code [ftp://mirrors.kernel.org/gnu/gettext/gettext-0.14.5.tar.gz gettext-0.14.5].
 
Here the description of the hash function used by gettext. I've taken it from it's source code [ftp://mirrors.kernel.org/gnu/gettext/gettext-0.14.5.tar.gz gettext-0.14.5].
  
Each string has an associate hashing value V, computed by a fixed function.  To locate the string, open addressing with double hashing is used. The first index will be V % S, where S is the size of the hashing table. If no entry is found, iterating with a second, independent hashing function takes place. This second value will be:
+
 
 +
 
 +
Each string has an associate hashing value Vm computed by a fixed function (see below).  To locate the string, open addressing with double hashing is used. The first index will be V % S, where S is the size of the hashing table. If no entry is found, iterating with a second, independent hashing function takes place. This second value will be:
 
  1 + V % (S - 2).
 
  1 + V % (S - 2).
 
The approximate number of probes will be
 
The approximate number of probes will be
Line 80: Line 82:
 
Because unsuccessful searches are unlikely this is a good value.
 
Because unsuccessful searches are unlikely this is a good value.
 
Formulas: [Knuth, The Art of Computer Programming, Volume 3, Sorting and Searching, 1973, Addison Wesley]
 
Formulas: [Knuth, The Art of Computer Programming, Volume 3, Sorting and Searching, 1973, Addison Wesley]
 +
 +
The hash function used is the so called 'hashpjw' function by P.J. Weinberger [see Aho/Sethi/Ullman, COMPILERS: Principles, Techniques and Tools, 1986, 1987 Bell Telephone Laboratories, Inc.]
 +
 +
the C function is the following (Eiffel translation comming soon...):
 +
 +
hash_string (<font color="Brown">const char</font> *str_param)
 +
{
 +
  <font color="Brown">unsigned long int</font> hval, g;
 +
  <font color="Brown">const char</font> *str = str_param;
 +
 +
  <font color="Gray">/* Compute the hash value for the given string.  */</font>
 +
  hval = <font color="Blue">0</font>;
 +
  while (*str != <font color="Pink">'\0'</font>)
 +
    {
 +
      hval <<= <font color="Blue">4</font>;
 +
      hval += (<font color="Brown">unsigned char</font>) *str++;
 +
      g = hval & ((<font color="Brown">unsigned long int</font>) <font color="Green">0xf</font> << (HASHWORDBITS - <font color="Blue">4</font>));
 +
      if (g != <font color="Blue">0</font>)
 +
{
 +
  hval ^= g >> (HASHWORDBITS - <font color="Blue">8</font>);
 +
  hval ^= g;
 +
}
 +
    }
 +
  return hval;
 +
}

Revision as of 08:47, 9 May 2006

Summary

That's what this part of the project should achieve:

  • reading and parsing of MO files containing the strings and their translations
  • organize the object collection in an incremental way: don't load the whole file if it's not needed
  • give a simple interface to the localization class, so that the strings can be printed out without too much efforts

Reading and parsing

Parser structure

I'll propose the class structure of the parser, later.

MO file structure

As reported from the gettext manual.

          byte
               +------------------------------------------+
            0  | magic number = 0x950412de                |
               |                                          |
            4  | file format revision = 0                 |
               |                                          |
            8  | number of strings                        |  == N
               |                                          |
           12  | offset of table with original strings    |  == O
               |                                          |
           16  | offset of table with translation strings |  == T
               |                                          |
           20  | size of hashing table                    |  == S
               |                                          |
           24  | offset of hashing table                  |  == H
               |                                          |
               .                                          .
               .    (possibly more entries later)         .
               .                                          .
               |                                          |
            O  | length & offset 0th string  ----------------.
        O + 8  | length & offset 1st string  ------------------.
                ...                                    ...   | |
O + ((N-1)*8)  | length & offset (N-1)th string           |  | |
               |                                          |  | |
            T  | length & offset 0th translation  ---------------.
        T + 8  | length & offset 1st translation  -----------------.
                ...                                    ...   | | | |
T + ((N-1)*8)  | length & offset (N-1)th translation      |  | | | |
               |                                          |  | | | |
            H  | start hash table                         |  | | | |
                ...                                    ...   | | | |
    H + S * 4  | end hash table                           |  | | | |
               |                                          |  | | | |
               | NUL terminated 0th string  <----------------' | | |
               |                                          |    | | |
               | NUL terminated 1st string  <------------------' | |
               |                                          |      | |
                ...                                    ...       | |
               |                                          |      | |
               | NUL terminated 0th translation  <---------------' |
               |                                          |        |
               | NUL terminated 1st translation  <-----------------'
               |                                          |
                ...                                    ...
               |                                          |
               +------------------------------------------+

Hash Function

Here the description of the hash function used by gettext. I've taken it from it's source code gettext-0.14.5.


Each string has an associate hashing value Vm computed by a fixed function (see below). To locate the string, open addressing with double hashing is used. The first index will be V % S, where S is the size of the hashing table. If no entry is found, iterating with a second, independent hashing function takes place. This second value will be:

1 + V % (S - 2).

The approximate number of probes will be

for unsuccessful search: (1 - N / S) ^ -1
for successful search: - (N / S) ^ -1 * ln (1 - N / S)
where N is the number of keys.

If we now choose S to be the next prime bigger than 4 / 3 * N, we get the values

4 and 1.85 resp.

Because unsuccessful searches are unlikely this is a good value. Formulas: [Knuth, The Art of Computer Programming, Volume 3, Sorting and Searching, 1973, Addison Wesley]

The hash function used is the so called 'hashpjw' function by P.J. Weinberger [see Aho/Sethi/Ullman, COMPILERS: Principles, Techniques and Tools, 1986, 1987 Bell Telephone Laboratories, Inc.]

the C function is the following (Eiffel translation comming soon...):

hash_string (const char *str_param)
{
  unsigned long int hval, g;
  const char *str = str_param;

  /* Compute the hash value for the given string.  */
  hval = 0;
  while (*str != '\0')
    {
      hval <<= 4;
      hval += (unsigned char) *str++;
      g = hval & ((unsigned long int) 0xf << (HASHWORDBITS - 4));
      if (g != 0)
	{
	  hval ^= g >> (HASHWORDBITS - 8);
	  hval ^= g;
	}
    }
  return hval;
}