Encoding library

Latest revision as of 06:54, 25 January 2007

Overview

The encoding library converts strings between various encodings. It was developed mainly to internationalize batch EiffelStudio: writing strings to the console in the localized encoding on Windows, and in UTF-8 on Unix, lets the supported local languages display correctly. Another potential use in EiffelStudio is reading files in any encoding so that they can later be displayed in editors that support Unicode.

Layout

 encoding
  |-ENCODING
  |-CODE_PAGE_CONSTANTS
  |-implementation
    |-ENCODING_I
    |-unix
      |-ENCODING_IMP
      |-CODE_SET
    |-windows
      |-ENCODING_IMP
      |-CODE_PAGE

Usage

  • The usage is simple:
 - Create a "from" ENCODING object and a "to" ENCODING object, each with a `code_page'.
 - Call {ENCODING}.convert_to on the from ENCODING object. `convert_to' takes the to ENCODING object and the original string as arguments, and returns the string in the target encoding.
  • A `code_page' must be valid on the given OS for the conversion to succeed. On Windows, valid values are mostly the code page identifiers listed at MSDN (http://msdn2.microsoft.com/en-us/library/ms776446.aspx); a few valid values outside that table are defined in CODE_PAGE_CONSTANTS. On Unix, a valid `code_page' is the name of an encoding supported by libiconv (http://www.gnu.org/software/libiconv/documentation/libiconv/iconv_open.3.html). To guarantee a valid `code_page', take it either from CODE_PAGE_CONSTANTS or from {I18N_LOCALE}.info.code_page of the i18n library.
  • "a_from_string" must actually be in the character set and encoding specified by the from ENCODING object. Otherwise an error could occur, or no output or unexpected output might be returned.
  • Data converted from Unicode UTF-16 to non-Unicode code pages (code pages other than UTF-7 or UTF-8) is subject to data loss, because a code page might not be able to represent every character used in the specific Unicode data.
  • Example:
	foo is
			-- Convert a UTF-32 string to UTF-16.
		local
			l_encoding_from, l_encoding_to: ENCODING
			l_string_from: STRING_32
			l_output: STRING_GENERAL
		do
				-- U+E0041 lies outside the BMP, so UTF-16 needs a surrogate pair for it.
			create l_string_from.make (2)
			l_string_from.append_code (0x0E0041)
			l_string_from.append_string ("A")

			create l_encoding_from.make ((create {CODE_PAGE_CONSTANTS}).utf32)
			create l_encoding_to.make ((create {CODE_PAGE_CONSTANTS}).utf16)

			l_output := l_encoding_from.convert_to (l_encoding_to, l_string_from)
				-- l_string_from is now 0x000E0041 0x00000041.
				-- l_output is now 0x0000DB40 0x0000DC41 0x00000041.
		end
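
The code unit values in the comments above, and the data-loss warning in the Usage list, can be checked independently of the library. The following is a minimal sketch in Python rather than Eiffel. It uses Python's built-in codecs, not the encoding library, and the helper name `utf16_code_units` and the sample strings are assumptions made for illustration.

```python
# Illustration only: Python's built-in codecs, not the Eiffel encoding library.

def utf16_code_units(text):
    """Return the UTF-16 code units of `text` as a list of integers."""
    raw = text.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

# Same input as the Eiffel example: U+E0041 followed by "A".
source = "\U000E0041A"
units = utf16_code_units(source)
print([f"0x{u:04X}" for u in units])

# Converting to a non-Unicode code page can lose data: Windows-1252 has no
# character for U+4E2D, so the codec substitutes "?" when told to replace.
original = "caf\u00e9 \u4e2d"
lossy = original.encode("cp1252", errors="replace")
print(lossy)
```

Decoding `lossy` back from Windows-1252 does not restore the original string, which is the kind of loss the Usage notes warn about.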

Implementation

  • In general, the library wraps the Windows API on Windows and the iconv library on Unix.
  • The whole library is structured as a simple bridge.
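
For readers unfamiliar with the bridge structure, here is a rough sketch in Python rather than Eiffel. It mirrors only the roles suggested by the Layout section (ENCODING as the abstraction, ENCODING_I as the implementation interface, one ENCODING_IMP per platform). All class and method names below are hypothetical, both implementations are stand-ins that use Python's own codecs instead of the Windows API or iconv, and the implementation is chosen at run time here purely for illustration.

```python
import sys

class EncodingImp:
    """Implementation side of the bridge (the role ENCODING_I plays)."""
    def convert(self, data, from_name, to_name):
        raise NotImplementedError

class UnixImp(EncodingImp):
    """Stand-in for an iconv-backed implementation (assumption)."""
    def convert(self, data, from_name, to_name):
        return data.decode(from_name).encode(to_name)

class WindowsImp(EncodingImp):
    """Stand-in for a Windows-API-backed implementation (assumption)."""
    def convert(self, data, from_name, to_name):
        return data.decode(from_name).encode(to_name)

class Encoding:
    """Abstraction side of the bridge (the role ENCODING plays)."""
    def __init__(self, code_page, imp=None):
        self.code_page = code_page
        # Clients never see which implementation is in use.
        self.imp = imp or (WindowsImp() if sys.platform == "win32" else UnixImp())

    def convert_to(self, to_encoding, data):
        return self.imp.convert(data, self.code_page, to_encoding.code_page)

utf32 = Encoding("utf-32-be")
utf16 = Encoding("utf-16-be")
converted = utf32.convert_to(utf16, "\U000E0041A".encode("utf-32-be"))
print(converted.hex())
```

The point of the structure is that client code depends only on the abstraction, so the platform-specific part can be swapped without touching callers.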

Windows

  • The main Windows APIs are WideCharToMultiByte (http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_2bj9.asp) and MultiByteToWideChar (http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_17si.asp). MultiByteToWideChar maps a character string to a wide-character (Unicode UTF-16) string. WideCharToMultiByte maps a wide-character string to a new character string, which is not necessarily from a multibyte character set. These APIs are supported by all Windows versions from Windows 95 to Vista.
  • UTF-32 to/from UTF-16 conversion is implemented in pure Eiffel. The algorithms are taken from the book Unicode Demystified by Richard Gillam. See {ENCODING_I}.utf32_to_utf16 and {ENCODING_I}.utf16_to_utf32.
  • Big-endian to/from little-endian conversion is also implemented in pure Eiffel. See {ENCODING_I}.string_32_switch_endian and {ENCODING_I}.string_16_switch_endian.
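
The two pure-Eiffel conversions mentioned above follow standard Unicode arithmetic. The sketch below shows that arithmetic in Python rather than Eiffel. It is not the library's code: the function names are borrowed loosely from the feature names above, and working on lists of integers instead of strings is an assumption made to keep the example short.

```python
# Sketch of the standard Unicode algorithms; not the library's Eiffel code.

def utf32_to_utf16(code_points):
    """Encode a sequence of code points as UTF-16 code units."""
    units = []
    for cp in code_points:
        if cp < 0x10000:
            units.append(cp)  # BMP character: one code unit
        else:
            offset = cp - 0x10000  # 20 significant bits remain
            units.append(0xD800 + (offset >> 10))    # high surrogate
            units.append(0xDC00 + (offset & 0x3FF))  # low surrogate
    return units

def utf16_to_utf32(units):
    """Decode UTF-16 code units; unpaired surrogates pass through unchanged."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        has_pair = (0xD800 <= unit <= 0xDBFF and i + 1 < len(units)
                    and 0xDC00 <= units[i + 1] <= 0xDFFF)
        if has_pair:
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            code_points.append(unit)
            i += 1
    return code_points

def switch_endian_16(units):
    """Swap the two bytes of every 16-bit unit."""
    return [((u & 0xFF) << 8) | (u >> 8) for u in units]

def switch_endian_32(units):
    """Reverse the four bytes of every 32-bit unit."""
    return [int.from_bytes(u.to_bytes(4, "big"), "little") for u in units]

print([hex(u) for u in utf32_to_utf16([0xE0041, 0x41])])
```

Applying either endian switch twice returns the original values, which makes a convenient sanity check.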

Unix

  • The iconv library (http://www.gnu.org/software/libiconv/) is a GNU project. Most Unix systems have it installed, but a few reportedly do not include it in a standard installation.
  • The iconv library handles most conversions correctly, so the implementation is relatively simple. One thing worth mentioning is that iconv prefixes its output with a byte order mark (0xFFFE or 0xFEFF) when the target encoding is UTF-16 or UTF-32. The encoding library strips these marks for consistency with the Windows implementation.
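
Stripping the byte order mark amounts to checking for a known prefix. The sketch below is in Python rather than Eiffel and uses Python's "utf-16" and "utf-32" codecs, which also prepend a mark, as a stand-in for iconv output. The function `strip_bom` and its `encoding` parameter are hypothetical names, not part of the library.

```python
import codecs

# Byte order marks for each encoding family, in both byte orders.
_BOMS = {
    "utf-16": (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE),
    "utf-32": (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE),
}

def strip_bom(data, encoding):
    """Return `data` without a leading byte order mark for `encoding`.

    The target encoding is passed explicitly because the UTF-32 little-endian
    mark begins with the same two bytes as the UTF-16 little-endian mark.
    """
    for bom in _BOMS.get(encoding, ()):
        if data.startswith(bom):
            return data[len(bom):]
    return data

with_bom = "A".encode("utf-16")  # Python prepends a mark in native byte order
without_bom = strip_bom(with_bom, "utf-16")
print(with_bom.hex(), "->", without_bom.hex())
```

Data without a mark, or in an encoding that has no entry in the table, is returned unchanged.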