Difference between revisions of "Internationalization/mo parser"

m (small changes)
m
 
(7 intermediate revisions by 3 users not shown)
Line 1: Line 1:
 +
[[Category:Internationalization]]
 +
 
== Summary ==
 
== Summary ==
 
That's what this part of the project should achieve:
 
That's what this part of the project should achieve:
Line 100: Line 102:
 
The hash function used to convert stings to an integer, is the so called ''hashpjw'' function by P.J. Weinberger [see Aho/Sethi/Ullman, COMPILERS: Principles, Techniques and Tools, 1986, 1987 Bell Telephone Laboratories, Inc.]
 
The hash function used to convert stings to an integer, is the so called ''hashpjw'' function by P.J. Weinberger [see Aho/Sethi/Ullman, COMPILERS: Principles, Techniques and Tools, 1986, 1987 Bell Telephone Laboratories, Inc.]
  
The C code of the function is the following:
+
The C code of the function comes from the source code of gettext (./gettext-0.14.5/gettext-runtime/intl/hash-string.h):
  
<font color="Green">#define HASHWORDBITS 32</font>
+
<code>[c,N]
 +
#define HASHWORDBITS 32
 +
hash_string (const char *str_param) {
 +
  unsigned long int hval, g;
 +
  const char *str = str_param;
 
   
 
   
hash_string (<font color="Brown">const char</font> *str_param)
+
   /* Compute the hash value for the given string.  */
{
+
   hval = 0
   <font color="Brown">unsigned long int</font> hval, g;
+
   while (*str != '\0'
  <font color="Brown">const char</font> *str = str_param;
+
+
  <font color="Gray">/* Compute the hash value for the given string.  */</font>
+
   hval = <font color="Blue">0</font>;
+
   while (*str != <font color="Pink">'\0'</font>)
+
 
     {
 
     {
       hval <<= <font color="Blue">4</font>;
+
       hval <<= 4;
       hval += (<font color="Brown">unsigned char</font>) *str++;
+
       hval += (unsigned char) *str++;
       g = hval & ((<font color="Brown">unsigned long int</font>) <font color="Green">0xf</font> << (HASHWORDBITS - <font color="Blue">4</font>));
+
       g = hval & ((unsigned long int) 0xf << (HASHWORDBITS - 4));
       if (g != <font color="Blue">0</font>)
+
       if (g != 0
 
  {
 
  {
    hval ^= g >> (HASHWORDBITS - <font color="Blue">8</font>);
+
    hval ^= g >> (HASHWORDBITS - 8);
 
    hval ^= g;
 
    hval ^= g;
 
  }
 
  }
Line 124: Line 125:
 
   return hval;
 
   return hval;
 
  }
 
  }
:This function comes from the source code of gettext (./gettext-0.14.5/gettext-runtime/intl/hash-string.h)
+
</code>
 
+
Here is my translated function, unfortunately it does not work as it should, I think it is because in the C function the variable ''g'' is an unsigned integer and hence if it's MSB is 1, the 'right shift', in the if statement, is different from the one computed in Eiffel (which I think, has no unsigned integer). So, if anybody could help me... :-)
+
 
+
<!--
+
IF YOU WANT TO COPY AND PASTE THE EIFFEL CODE, I RECOMEND YOU TO DO IT FROM THE ARTICLE :-)
+
-->
+
  
hash_string (a_sting : <font color="Blue">STRING</font>) : <font color="Blue">INTEGER</font> <font color=#1D0489>is</font>
+
Here is the translated function in Eiffel:
    <font color="Brown">--Compute the hash value for the given string</font>
+
<code>[eiffel,N]
<font color=#1D0489>require</font>
+
hash_string (a_string: STRING): INTEGER is
    valid_string: a_sting <font color="Red">/=</font> <font color="Blue">Void</font>
+
    -- Compute the hash value for the given string
<font color=#1D0489>local</font>
+
require
    position : <font color="Blue">INTEGER</font>
+
    valid_string: a_string /= Void
    g : <font color="Blue">INTEGER</font>
+
local
  <font color=#1D0489>do</font>
+
    position: INTEGER
    <font color=#1D0489>from</font>
+
    l_result, g: NATURAL_32
        position <font color="Red">:=</font> <font color=#A709CA>1</font>
+
do
    <font color=#1D0489>until</font>
+
    from
        position <font color="Red">></font> a_sting.count
+
        position := 1
    <font color=#1D0489>loop</font>
+
    invariant
        <font color=#1D0489>Result</font> <font color="Red">:=</font> <font color=#1D0489>Result</font> |<< <font color=#A709CA>4</font>
+
        position >= 1
        <font color=#1D0489>Result</font> <font color="Red">:=</font> <font color=#1D0489>Result</font> <font color="Red">+</font> a_sting.item (position).code
+
        position <= a_string.count + 1
        g <font color="Red">:=</font> <font color=#1D0489>Result</font> & (<font color=#A709CA>0xf</font> |<< <font color=#A709CA>28</font>)
+
    variant
        <font color=#1D0489>if</font> g <font color="Red">/=</font> <font color=#A709CA>0</font> then
+
        a_string.count + 1 - position
            <font color="Blue">Result</font> <font color="Red">:=</font> <font color="Blue">Result</font>.bit_xor (g |>> <font color=#A709CA>24</font>)
+
    until
            <font color="Blue">Result</font> <font color="Red">:=</font> <font color="Blue">Result</font>.bit_xor (g)
+
        position > a_string.count
        <font color=#1D0489>end</font>
+
    loop
        position <font color="Red">:=</font> position+<font color=#A709CA>1</font>
+
        l_result := l_result |<< 4
    <font color=#1D0489>end</font>
+
        l_result := l_result + a_string.code(position)
<font color=#1D0489>end</font>
+
        g := l_result & ({NATURAL_32} 0xf |<< 28)
 +
        if g /= 0 then
 +
            l_result := l_result.bit_xor(g |>> 24)
 +
            l_result := l_result.bit_xor(g)
 +
        end
 +
        position := position + 1
 +
    end
 +
    Result := l_result.as_integer_32
 +
end
 +
</code>

Latest revision as of 22:13, 2 June 2006


Summary

That's what this part of the project should achieve:

  • reading and parsing of MO files containing the strings and their translations
  • organize the object collection in an incremental way: don't load the whole file if it's not needed
  • give a simple interface to the localization class, so that the strings can be printed out without too much efforts

What is a mo file

MO stands for Machine Object, as the name suggests, they are meant to be read by programs, and are binary in nature. The GNU program that creates them out of a PO file, is msgfmt.

Reading and parsing

Parser structure

I'll propose the class structure of the parser, later.

MO file structure

As reported from the gettext manual.

          byte
               +------------------------------------------+
            0  | magic number = 0x950412de                |
               |                                          |
            4  | file format revision = 0                 |
               |                                          |
            8  | number of strings                        |  == N
               |                                          |
           12  | offset of table with original strings    |  == O
               |                                          |
           16  | offset of table with translation strings |  == T
               |                                          |
           20  | size of hashing table                    |  == S
               |                                          |
           24  | offset of hashing table                  |  == H
               |                                          |
               .                                          .
               .    (possibly more entries later)         .
               .                                          .
               |                                          |
            O  | length & offset 0th string  ----------------.
        O + 8  | length & offset 1st string  ------------------.
                ...                                    ...   | |
O + ((N-1)*8)  | length & offset (N-1)th string           |  | |
               |                                          |  | |
            T  | length & offset 0th translation  ---------------.
        T + 8  | length & offset 1st translation  -----------------.
                ...                                    ...   | | | |
T + ((N-1)*8)  | length & offset (N-1)th translation      |  | | | |
               |                                          |  | | | |
            H  | start hash table                         |  | | | |
                ...                                    ...   | | | |
    H + S * 4  | end hash table                           |  | | | |
               |                                          |  | | | |
               | NUL terminated 0th string  <----------------' | | |
               |                                          |    | | |
               | NUL terminated 1st string  <------------------' | |
               |                                          |      | |
                ...                                    ...       | |
               |                                          |      | |
               | NUL terminated 0th translation  <---------------' |
               |                                          |        |
               | NUL terminated 1st translation  <-----------------'
               |                                          |
                ...                                    ...
               |                                          |
               +------------------------------------------+

Magic number

The magic number signals GNU MO files, it is stored in the byte order of the generating machine, so the magic number really is two numbers: 0x950412de and 0xde120495

Tables

In the tables, each string descriptor uses two 32 bits integers, the first for the string length, and the second for the offset of the string in the MO file, counting in bytes from the start of the file. The first table contains descriptors for the original strings, and is sorted in lexicographical order. The second table contains descriptors for the translated strings, and is parallel to the first table: to find the corresponding translation one has to access the array slot in the second array with the same index.

Hash table

The hash table is not contained in al Mo files, in this cases the size S is zero. It is sometimes better to not include the hash table because a precomputed hashing table takes disk space, and does not win that much speed. The hash table contains indices to the sorted array of strings in the MO file.

Hash Function

Here the description of the hash function used by gettext. I've taken it from it's source code gettext-0.14.5.


Each string has an associate hashing value V computed by a fixed function (see below). To locate the string, open addressing with double hashing is used. The first index will be V % S, where S is the size of the hashing table. If no entry is found, iterating with a second, independent hashing function takes place. This second value will be:

1 + V % (S - 2).

The approximate number of probes will be

for unsuccessful search: (1 - N / S) ^ -1
for successful search: - (N / S) ^ -1 * ln (1 - N / S)
where N is the number of keys.

If we now choose S to be the next prime bigger than 4 / 3 * N, we get the values

4 and 1.85 resp.

Because unsuccessful searches are unlikely this is a good value. Formulas: [Knuth, The Art of Computer Programming, Volume 3, Sorting and Searching, 1973, Addison Wesley]

The hash function used to convert stings to an integer, is the so called hashpjw function by P.J. Weinberger [see Aho/Sethi/Ullman, COMPILERS: Principles, Techniques and Tools, 1986, 1987 Bell Telephone Laboratories, Inc.]

The C code of the function comes from the source code of gettext (./gettext-0.14.5/gettext-runtime/intl/hash-string.h):

#define HASHWORDBITS 32
hash_string (const char *str_param) {
   unsigned long int hval, g;
   const char *str = str_param;
 
   /* Compute the hash value for the given string.  */
   hval = 0
   while (*str != '\0'
     {
       hval <<= 4;
       hval += (unsigned char) *str++;
       g = hval & ((unsigned long int) 0xf << (HASHWORDBITS - 4));
       if (g != 0
 	{
 	  hval ^= g >> (HASHWORDBITS - 8);
 	  hval ^= g;
 	}
     }
   return hval;
 }

Here is the translated function in Eiffel:

hash_string (a_string: STRING): INTEGER is
    -- Compute the hash value for the given string
require
    valid_string: a_string /= Void
local
    position: INTEGER
    l_result, g: NATURAL_32
 do
    from
        position := 1
    invariant
        position >= 1
        position <= a_string.count + 1
    variant
        a_string.count + 1 - position
    until
        position > a_string.count
    loop
        l_result := l_result |<< 4
        l_result := l_result + a_string.code(position)
        g := l_result & ({NATURAL_32} 0xf |<< 28)
        if g /= 0 then
            l_result := l_result.bit_xor(g |>> 24)
            l_result := l_result.bit_xor(g)
        end
        position := position + 1
    end
    Result := l_result.as_integer_32
end