File Name Handling in EiffelBase
Latest revision as of 16:02, 8 November 2012
File name handling would be simpler if the Unix world had not decided that a file name is simply a sequence of bytes. The consequence of that decision is that the same sequence of bytes can be interpreted differently depending on the locale settings. For example, a French user looking at the name of a file may see a different name than a Chinese user looking at the exact same file.
On Windows the story is better, as all file names are UTF-16 encoded. However, Windows does not guarantee that a file name is a valid UTF-16 sequence, which means that not all file names can be represented in printed form.
But what about Eiffel, which provides a multiplatform way to access files?
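To make the locale problem concrete, here is a small sketch in Python (not Eiffel, and not part of EiffelBase). It assumes, purely for illustration, that the file was created under a GBK locale and is then viewed under a Latin-1 locale; the helper `name_as_seen` is hypothetical.

```python
# Bytes of a file name as they might be stored on disk by a user
# working in a GBK (Chinese) locale. The choice of GBK is an assumption.
raw_name = "你好.txt".encode("gbk")


def name_as_seen(raw: bytes, locale_encoding: str) -> str:
    """Decode raw file-name bytes the way a tool using that locale would.

    Hypothetical helper for illustration only.
    """
    return raw.decode(locale_encoding, errors="replace")


# The bytes on disk are identical; only the interpretation differs.
seen_in_chinese_locale = name_as_seen(raw_name, "gbk")
seen_in_western_locale = name_as_seen(raw_name, "latin-1")
```

The two resulting strings differ even though `raw_name` never changes, which is the situation described above.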
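As an illustration of what an invalid UTF-16 sequence looks like, the following Python sketch (not Eiffel) checks whether a sequence of 16-bit units decodes as UTF-16. The helper `is_valid_utf16` and the sample names are hypothetical; the invalid sample starts with an unpaired high surrogate (0xD800), which is one way a sequence can fail to be valid UTF-16.

```python
def is_valid_utf16(raw: bytes) -> bool:
    """Return True if `raw` is a well-formed little-endian UTF-16 sequence.

    Hypothetical helper for illustration only.
    """
    try:
        raw.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


# A well-formed name encoded as UTF-16.
well_formed = "résumé.txt".encode("utf-16-le")

# The same kind of data, but starting with an unpaired high surrogate
# (0xD800). A file system that does not validate names could store this.
lone_surrogate = b"\x00\xd8" + "a.txt".encode("utf-16-le")
```

A name like `lone_surrogate` can exist as a sequence of 16-bit units but has no valid textual form.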
EiffelStudio 7.1 and older
In 7.1 and earlier, the only way to reference the file system was to use STRING_8 instances to represent a path.
On Unix, it did not matter whether a file name was Unicode, because a file name is really just a sequence of bytes, and such a sequence can be represented in a STRING_8 instance.
On Windows, if a file name contained a character whose code is greater than 255, it would work just fine most of the time, since the Eiffel runtime used the ANSI version of the Microsoft APIs. However, if you obtained a path from the Windows Open File dialog, you would be unable to read that file, since the dialog returns a full-blown Unicode string as a STRING_32 instance.
EiffelStudio 7.2 and later
Eiffel has had support for Unicode since EiffelStudio 5.7, but not at the level of the FILE classes. In 7.2, we introduced the possibility of referring to a file using a PATH or a READABLE_STRING_GENERAL instance. PATH is a new class in EiffelBase.
PATH
As mentioned before, the main issue in handling file names is that they are sequences of 1-byte units (Unix) or 2-byte units (Windows). Usually the encoding will be UTF-8 (Unix) or UTF-16 (Windows), but there is no guarantee.
When a file name or path does not follow the expected encoding, or mixes several encodings, it is not representable textually. This is where PATH is handy. PATH is immutable, which makes it easy to manipulate and to share.
All APIs of EiffelBase and other libraries have been updated to also take a PATH instance as an argument wherever there is a need to access a file.
READABLE_STRING_GENERAL
In order not to break existing code while still allowing users to provide a Unicode path as a STRING_32, we have also equipped most APIs to take a READABLE_STRING_GENERAL as argument. Note that when an API takes a READABLE_STRING_GENERAL, it interprets every string as a Unicode character sequence: a STRING_8 is a sequence of Unicode characters whose codes are between 0 and 255, and a STRING_32 covers the full set.
The only exception is FILE and DIRECTORY, where STRING_8 instances are interpreted in the current locale.
Otherwise, existing APIs that used to take a STRING_8 are left as is so as not to break backward compatibility, and the STRING_8 sequence is interpreted not as a Unicode character sequence but as a sequence of bytes, as before.
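The three interpretations of an 8-bit string described above can be compared on a concrete value. The sketch below is Python, not Eiffel, and does not show how EiffelBase performs the conversion; it only contrasts the interpretations. It assumes, for illustration, that the current locale is UTF-8, and all variable names are hypothetical.

```python
# Two 8-bit values, 0xC3 0xA9, standing in for the content of an
# 8-bit string. They happen to be the UTF-8 encoding of "é".
raw = "é".encode("utf-8")

# 1. Unicode interpretation: each 8-bit value is taken as the Unicode
#    character with that code (0..255).
as_code_points = "".join(chr(b) for b in raw)

# 2. Locale interpretation, ASSUMING the current locale is UTF-8:
#    the bytes are decoded according to the locale's encoding.
as_locale_text = raw.decode("utf-8")

# 3. Byte interpretation: no character meaning is attached at all.
as_bytes = list(raw)
```

Under the first interpretation the value is a two-character string, under the second it is the single character "é", and under the third it is just the two numbers 0xC3 and 0xA9.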