Difference between revisions of "Internationalization/mo parser"

m (Hash Function)
m
 
(12 intermediate revisions by 3 users not shown)
Line 1: Line 1:
 +
[[Category:Internationalization]]
 +
 
== Summary ==
 
== Summary ==
 
That's what this part of the project should achieve:
 
That's what this part of the project should achieve:
Line 4: Line 6:
 
* organize the object collection in an incremental way: don't load the whole file if it's not needed
 
* organize the object collection in an incremental way: don't load the whole file if it's not needed
 
* give a simple interface to the localization class, so that the strings can be printed out without too much efforts
 
* give a simple interface to the localization class, so that the strings can be printed out without too much efforts
 +
 +
==What is a mo file==
 +
 +
MO stands for '''M'''achine '''O'''bject, as the name suggests, they are meant to be read by programs, and are binary in nature. The GNU program that creates them out of a PO file, is [http://www.scit.wlv.ac.uk/cgi-bin/mansec?1+msgfmt ''msgfmt''].
  
 
== Reading and parsing ==
 
== Reading and parsing ==
Line 62: Line 68:
 
                 |                                          |
 
                 |                                          |
 
                 +------------------------------------------+
 
                 +------------------------------------------+
 +
====Magic number====
 +
 +
The magic number signals GNU MO files, it is stored in the byte order of the generating machine, so the magic number really is two numbers: 0x950412de and 0xde120495
 +
 +
====Tables====
 +
 +
In the tables, each string descriptor uses two 32 bits integers, the first for the string length, and the second for the offset of the string in the MO file, counting in bytes from the start of the file. The first table contains descriptors for the original strings, and is sorted in lexicographical order. The second table contains descriptors for the translated strings, and is parallel to the first table: to find the corresponding translation one has to access the array slot in the second array with the same index.
 +
 +
====Hash table====
 +
 +
The hash table is not contained in al Mo files, in this cases the size S is zero. It is sometimes better to not include the hash table because a precomputed hashing table takes disk space, and does not win that much speed. The hash table contains indices to the sorted array of strings in the MO file.
  
 
==Hash Function==
 
==Hash Function==
Line 69: Line 86:
  
  
Each string has an associate hashing value Vm computed by a fixed function (see below).  To locate the string, open addressing with double hashing is used. The first index will be V % S, where S is the size of the hashing table. If no entry is found, iterating with a second, independent hashing function takes place. This second value will be:
+
Each string has an associate hashing value V computed by a fixed function (see below).  To locate the string, open addressing with double hashing is used. The first index will be V % S, where S is the size of the hashing table. If no entry is found, iterating with a second, independent hashing function takes place. This second value will be:
 
  1 + V % (S - 2).
 
  1 + V % (S - 2).
 
The approximate number of probes will be
 
The approximate number of probes will be
Line 85: Line 102:
 
The hash function used to convert stings to an integer, is the so called ''hashpjw'' function by P.J. Weinberger [see Aho/Sethi/Ullman, COMPILERS: Principles, Techniques and Tools, 1986, 1987 Bell Telephone Laboratories, Inc.]
 
The hash function used to convert stings to an integer, is the so called ''hashpjw'' function by P.J. Weinberger [see Aho/Sethi/Ullman, COMPILERS: Principles, Techniques and Tools, 1986, 1987 Bell Telephone Laboratories, Inc.]
  
The C code of the function is the following (Eiffel translation comming soon...):
+
The C code of the function comes from the source code of gettext (./gettext-0.14.5/gettext-runtime/intl/hash-string.h):
  
<font color="Green">#define HASHWORDBITS 32</font>
+
<code>[c,N]
+
#define HASHWORDBITS 32
hash_string (<font color="Brown">const char</font> *str_param)
+
hash_string (const char *str_param) {
{
+
   unsigned long int hval, g;
   <font color="Brown">unsigned long int</font> hval, g;
+
   const char *str = str_param;
   <font color="Brown">const char</font> *str = str_param;
+
 
   
 
   
   <font color="Gray">/* Compute the hash value for the given string.  */</font>
+
   /* Compute the hash value for the given string.  */
   hval = <font color="Blue">0</font>;
+
   hval = 0
   while (*str != <font color="Pink">'\0'</font>)
+
   while (*str != '\0'
 
     {
 
     {
       hval <<= <font color="Blue">4</font>;
+
       hval <<= 4;
       hval += (<font color="Brown">unsigned char</font>) *str++;
+
       hval += (unsigned char) *str++;
       g = hval & ((<font color="Brown">unsigned long int</font>) <font color="Green">0xf</font> << (HASHWORDBITS - <font color="Blue">4</font>));
+
       g = hval & ((unsigned long int) 0xf << (HASHWORDBITS - 4));
       if (g != <font color="Blue">0</font>)
+
       if (g != 0
 
  {
 
  {
    hval ^= g >> (HASHWORDBITS - <font color="Blue">8</font>);
+
    hval ^= g >> (HASHWORDBITS - 8);
 
    hval ^= g;
 
    hval ^= g;
 
  }
 
  }
Line 109: Line 125:
 
   return hval;
 
   return hval;
 
  }
 
  }
 +
</code>
  
:This function comes from the source code of gettext (./gettext-0.14.5/gettext-runtime/intl/hash-string.h)
+
Here is the translated function in Eiffel:
 +
<code>[eiffel,N]
 +
hash_string (a_string: STRING): INTEGER is
 +
    -- Compute the hash value for the given string
 +
require
 +
    valid_string: a_string /= Void
 +
local
 +
    position: INTEGER
 +
    l_result, g: NATURAL_32
 +
do
 +
    from
 +
        position := 1
 +
    invariant
 +
        position >= 1
 +
        position <= a_string.count + 1
 +
    variant
 +
        a_string.count + 1 - position
 +
    until
 +
        position > a_string.count
 +
    loop
 +
        l_result := l_result |<< 4
 +
        l_result := l_result + a_string.code(position)
 +
        g := l_result & ({NATURAL_32} 0xf |<< 28)
 +
        if g /= 0 then
 +
            l_result := l_result.bit_xor(g |>> 24)
 +
            l_result := l_result.bit_xor(g)
 +
        end
 +
        position := position + 1
 +
    end
 +
    Result := l_result.as_integer_32
 +
end
 +
</code>

Latest revision as of 21:13, 2 June 2006


Summary

That's what this part of the project should achieve:

  • reading and parsing of MO files containing the strings and their translations
  • organize the object collection in an incremental way: don't load the whole file if it's not needed
  • give a simple interface to the localization class, so that the strings can be printed out without too much efforts

What is a mo file

MO stands for Machine Object, as the name suggests, they are meant to be read by programs, and are binary in nature. The GNU program that creates them out of a PO file, is msgfmt.

Reading and parsing

Parser structure

I'll propose the class structure of the parser, later.

MO file structure

As reported from the gettext manual.

          byte
               +------------------------------------------+
            0  | magic number = 0x950412de                |
               |                                          |
            4  | file format revision = 0                 |
               |                                          |
            8  | number of strings                        |  == N
               |                                          |
           12  | offset of table with original strings    |  == O
               |                                          |
           16  | offset of table with translation strings |  == T
               |                                          |
           20  | size of hashing table                    |  == S
               |                                          |
           24  | offset of hashing table                  |  == H
               |                                          |
               .                                          .
               .    (possibly more entries later)         .
               .                                          .
               |                                          |
            O  | length & offset 0th string  ----------------.
        O + 8  | length & offset 1st string  ------------------.
                ...                                    ...   | |
O + ((N-1)*8)  | length & offset (N-1)th string           |  | |
               |                                          |  | |
            T  | length & offset 0th translation  ---------------.
        T + 8  | length & offset 1st translation  -----------------.
                ...                                    ...   | | | |
T + ((N-1)*8)  | length & offset (N-1)th translation      |  | | | |
               |                                          |  | | | |
            H  | start hash table                         |  | | | |
                ...                                    ...   | | | |
    H + S * 4  | end hash table                           |  | | | |
               |                                          |  | | | |
               | NUL terminated 0th string  <----------------' | | |
               |                                          |    | | |
               | NUL terminated 1st string  <------------------' | |
               |                                          |      | |
                ...                                    ...       | |
               |                                          |      | |
               | NUL terminated 0th translation  <---------------' |
               |                                          |        |
               | NUL terminated 1st translation  <-----------------'
               |                                          |
                ...                                    ...
               |                                          |
               +------------------------------------------+

Magic number

The magic number signals GNU MO files, it is stored in the byte order of the generating machine, so the magic number really is two numbers: 0x950412de and 0xde120495

Tables

In the tables, each string descriptor uses two 32 bits integers, the first for the string length, and the second for the offset of the string in the MO file, counting in bytes from the start of the file. The first table contains descriptors for the original strings, and is sorted in lexicographical order. The second table contains descriptors for the translated strings, and is parallel to the first table: to find the corresponding translation one has to access the array slot in the second array with the same index.

Hash table

The hash table is not contained in al Mo files, in this cases the size S is zero. It is sometimes better to not include the hash table because a precomputed hashing table takes disk space, and does not win that much speed. The hash table contains indices to the sorted array of strings in the MO file.

Hash Function

Here the description of the hash function used by gettext. I've taken it from it's source code gettext-0.14.5.


Each string has an associate hashing value V computed by a fixed function (see below). To locate the string, open addressing with double hashing is used. The first index will be V % S, where S is the size of the hashing table. If no entry is found, iterating with a second, independent hashing function takes place. This second value will be:

1 + V % (S - 2).

The approximate number of probes will be

for unsuccessful search: (1 - N / S) ^ -1
for successful search: - (N / S) ^ -1 * ln (1 - N / S)
where N is the number of keys.

If we now choose S to be the next prime bigger than 4 / 3 * N, we get the values

4 and 1.85 resp.

Because unsuccessful searches are unlikely this is a good value. Formulas: [Knuth, The Art of Computer Programming, Volume 3, Sorting and Searching, 1973, Addison Wesley]

The hash function used to convert stings to an integer, is the so called hashpjw function by P.J. Weinberger [see Aho/Sethi/Ullman, COMPILERS: Principles, Techniques and Tools, 1986, 1987 Bell Telephone Laboratories, Inc.]

The C code of the function comes from the source code of gettext (./gettext-0.14.5/gettext-runtime/intl/hash-string.h):

#define HASHWORDBITS 32
hash_string (const char *str_param) {
   unsigned long int hval, g;
   const char *str = str_param;
 
   /* Compute the hash value for the given string.  */
   hval = 0
   while (*str != '\0'
     {
       hval <<= 4;
       hval += (unsigned char) *str++;
       g = hval & ((unsigned long int) 0xf << (HASHWORDBITS - 4));
       if (g != 0
 	{
 	  hval ^= g >> (HASHWORDBITS - 8);
 	  hval ^= g;
 	}
     }
   return hval;
 }

Here is the translated function in Eiffel:

hash_string (a_string: STRING): INTEGER is
    -- Compute the hash value for the given string
require
    valid_string: a_string /= Void
local
    position: INTEGER
    l_result, g: NATURAL_32
 do
    from
        position := 1
    invariant
        position >= 1
        position <= a_string.count + 1
    variant
        a_string.count + 1 - position
    until
        position > a_string.count
    loop
        l_result := l_result |<< 4
        l_result := l_result + a_string.code(position)
        g := l_result & ({NATURAL_32} 0xf |<< 28)
        if g /= 0 then
            l_result := l_result.bit_xor(g |>> 24)
            l_result := l_result.bit_xor(g)
        end
        position := position + 1
    end
    Result := l_result.as_integer_32
end