Encoding library

Revision as of 07:07, 25 January 2007 by Ted

Overview

The encoding library converts string streams between various encodings. It was developed mainly to internationalize the batch version of EiffelStudio. The idea is that sending strings in the localized encoding to the console on Windows, and in UTF-8 on Unix, lets text in the supported local languages be displayed correctly.

Layout

 encoding
  |-ENCODING
  |-CODE_PAGE_CONSTANTS
  |-implementation
    |-ENCODING_I
    |-unix
      |-ENCODING_IMP
      |-CODE_SET
    |-windows
      |-ENCODING_IMP
      |-CODE_PAGE

Usage

  • Usage takes two steps:
 - Create two ENCODING objects, one for the source ("from") encoding and one for the target ("to") encoding, each initialized with a `code_page'.
 - Call {ENCODING}.convert_to on the source ENCODING object. `convert_to' takes the target ENCODING object and the original string as arguments, and returns the string in the target encoding.
  • A `code_page' must be valid on the given OS for the conversion to succeed. On Windows, valid `code_page' values are mostly the code page identifiers defined at MSDN; a few values that are not in that table are also valid, as defined in CODE_PAGE_CONSTANTS. On Unix, a valid `code_page' is the name of an encoding supported by libiconv. To guarantee a valid `code_page', take it either from CODE_PAGE_CONSTANTS or from {I18N_LOCALE}.info.code_page of the i18n library.
  • "a_from_string" must actually be in the character set and encoding specified by the source ENCODING object. Otherwise an error may occur, and no output or unexpected output may be returned.
  • Data converted from Unicode UTF-16 to non-Unicode code pages (code pages other than UTF-7 or UTF-8) is subject to data loss, because a code page might not be able to represent every character used in the specific Unicode data.
  • Example:
foo is
		local
			l_encoding_from, l_encoding_to: ENCODING
			l_string_from: STRING_32
			l_output: STRING_GENERAL
		do
			create l_string_from.make (2)
			l_string_from.append_code (0x0E0041)
			l_string_from.append_string ("A")
 
			create l_encoding_from.make ((create {CODE_PAGE_CONSTANTS}).utf32)
			create l_encoding_to.make ((create {CODE_PAGE_CONSTANTS}).utf16)
 
			l_output := l_encoding_from.convert_to (l_encoding_to, l_string_from)
				-- l_string_from is now 0x000E0041 0x00000041.
				-- l_output is now 0x0000DB40 0x0000DC41 0x00000041.
		end

Implementation

In general, the library wraps the Windows API on Windows and the iconv library on Unix.

Windows

  • The main Windows APIs used are WideCharToMultiByte and MultiByteToWideChar. MultiByteToWideChar maps a character string to a wide character (Unicode UTF-16) string. WideCharToMultiByte maps a wide character string to a new character string. The new character string is not necessarily from a multibyte character set.
  • UTF-32 to/from UTF-16 conversion is implemented in pure Eiffel code. The algorithms are taken from the book Unicode Demystified by Richard Gillam. See {ENCODING_I}.utf32_to_utf16 and {ENCODING_I}.utf16_to_utf32.
  • Big endian to/from little endian conversion is also implemented in pure Eiffel code. See {ENCODING_I}.string_32_switch_endian and {ENCODING_I}.string_16_switch_endian.
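
The two pure Eiffel conversions described above come down to a few bit operations. The class below is only a minimal sketch of the standard UTF-16 surrogate arithmetic and of a 16-bit byte swap; it is not the library's actual code. The class name UTF_SKETCH and all of its feature names are hypothetical, and the real routines in ENCODING_I may be organized differently. It is written in current Eiffel syntax, without the `is' keyword. For the code point 0x0E0041 used in the Usage example, `high_surrogate' gives 0xDB40 and `low_surrogate' gives 0xDC41, which matches the output shown there.

```eiffel
class
	UTF_SKETCH
		-- Hypothetical helper class for illustration only;
		-- not part of the encoding library.

feature -- UTF-32 to UTF-16

	high_surrogate (a_code: NATURAL_32): NATURAL_32
			-- Leading UTF-16 code unit for supplementary code point `a_code'.
		require
			supplementary: a_code >= {NATURAL_32} 0x10000 and a_code <= {NATURAL_32} 0x10FFFF
		do
				-- Remove the 0x10000 offset, then keep the upper 10 of the
				-- remaining 20 bits.
			Result := {NATURAL_32} 0xD800 + (a_code - {NATURAL_32} 0x10000).bit_shift_right (10)
		ensure
			in_range: Result >= {NATURAL_32} 0xD800 and Result <= {NATURAL_32} 0xDBFF
		end

	low_surrogate (a_code: NATURAL_32): NATURAL_32
			-- Trailing UTF-16 code unit for supplementary code point `a_code'.
		require
			supplementary: a_code >= {NATURAL_32} 0x10000 and a_code <= {NATURAL_32} 0x10FFFF
		do
				-- Keep the lower 10 of the 20 bits left after removing the offset.
			Result := {NATURAL_32} 0xDC00 + (a_code - {NATURAL_32} 0x10000).bit_and ({NATURAL_32} 0x3FF)
		ensure
			in_range: Result >= {NATURAL_32} 0xDC00 and Result <= {NATURAL_32} 0xDFFF
		end

feature -- Endianness

	swapped_16 (a_unit: NATURAL_32): NATURAL_32
			-- 16-bit code unit `a_unit' with its two bytes exchanged.
		require
			fits_16_bits: a_unit <= {NATURAL_32} 0xFFFF
		do
				-- Move the low byte up and the high byte down.
			Result := a_unit.bit_and ({NATURAL_32} 0xFF).bit_shift_left (8).bit_or (a_unit.bit_shift_right (8))
		end

end
```

Code points below 0x10000 need no surrogates and are copied to UTF-16 unchanged, which is why the trailing "A" (0x41) in the Usage example stays a single code unit. Switching the endianness of a whole string amounts to applying a swap such as `swapped_16' to every code unit.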

Unix