File Name Handling in EiffelBase

Latest revision as of 16:02, 8 November 2012

File name handling could be simpler if the Unix world hadn't decided that a file name is simply a sequence of bytes. The consequence of that decision is that the same sequence of bytes can be interpreted differently depending on your locale settings. For example, a French user looking at the name of a file could see a different name than a Chinese user looking at the exact same file.
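The effect is independent of Eiffel and can be reproduced in a few lines. The following is a minimal Python sketch, assuming ISO-8859-1 as the French user's locale encoding and GBK as the Chinese user's; both choices, and the file name itself, are illustrative assumptions.

```python
# Raw bytes of a file name as stored by the OS.
# Assumption: the name "café.txt" was written by a program using UTF-8.
raw_name = "caf\u00e9.txt".encode("utf-8")

# The same bytes, decoded under two assumed locale encodings.
seen_by_french_user = raw_name.decode("iso-8859-1")
seen_by_chinese_user = raw_name.decode("gbk")

# The bytes are identical, the displayed names are not.
print(seen_by_french_user == seen_by_chinese_user)  # False
```

Neither user sees the name the author intended, because neither locale matches the encoding that was used when the file was created.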

On Windows the story is better, as all file names are stored as sequences of UTF-16 code units. However, Windows does not guarantee that a name is a valid UTF-16 sequence, which means that not all file names can be represented in printed form.
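To make "not a valid UTF-16 sequence" concrete, here is a small Python sketch that checks whether a list of 16-bit code units is well formed. The helper name `is_well_formed_utf16` and the sample names are assumptions made for this illustration.

```python
import struct

def is_well_formed_utf16(code_units):
    """Return True if the 16-bit code units form valid UTF-16."""
    data = struct.pack("<%dH" % len(code_units), *code_units)
    try:
        data.decode("utf-16-le")
    except UnicodeDecodeError:
        return False
    return True

# 'a' followed by a proper surrogate pair (U+1F600).
valid_name = [0x0061, 0xD83D, 0xDE00]
# 'a' followed by a high surrogate with no partner: storable, but not text.
broken_name = [0x0061, 0xD83D]

print(is_well_formed_utf16(valid_name))   # True
print(is_well_formed_utf16(broken_name))  # False
```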

But what about Eiffel, which provides a multiplatform way to access files?

EiffelStudio 7.1 and older

In 7.1 and older, the only way to reference the file system was to use STRING_8 instances to represent a path.

On Unix, it did not matter whether the file name was Unicode, because to the system it is not really Unicode: it is just a sequence of bytes, which can be represented in a STRING_8 instance.

On Windows, if a file name contained a character whose code is greater than 255, it would work just fine most of the time, since the Eiffel runtime used the ANSI version of the Microsoft APIs. However, if you got a path from the Windows Open File dialog, you would be unable to read that file, since the dialog returns a full-blown Unicode string as a STRING_32 instance.
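Why this worked "most of the time" can be sketched in Python: a name survives the trip through an 8-bit ANSI code page only if every character exists in that code page. The code page `cp1252` and the helper `fits_in_ansi` are assumptions for this illustration; the actual ANSI code page depends on the language of the OS.

```python
def fits_in_ansi(name, code_page="cp1252"):
    """Return True if every character of name exists in the given code page."""
    try:
        name.encode(code_page)
    except UnicodeEncodeError:
        return False
    return True

# U+0153 (oe ligature) has a code above 255 but is part of cp1252.
print(fits_in_ansi("c\u0153ur.txt"))        # True
# CJK characters have no cp1252 equivalent.
print(fits_in_ansi("\u65e5\u672c.txt"))     # False
```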


EiffelStudio 7.2 and later

Eiffel has supported Unicode since EiffelStudio 5.7, but not at the level of the FILE classes. In 7.2, we introduced the possibility to refer to a file using a PATH or a READABLE_STRING_GENERAL instance. PATH is a new class of EiffelBase.

PATH

As mentioned before, the main issue in handling file names is that they are sequences of 1-byte units (Unix) or 2-byte units (Windows). Usually the encoding will be UTF-8 (Unix) or UTF-16 (Windows), but there is no guarantee.

When a file name or path is not valid in the expected encoding, or mixes several encodings, it cannot be represented textually. This is where PATH is handy. PATH is immutable, which makes it easy to manipulate and to share.

All APIs of EiffelBase and other libraries that need to access a file have been updated to also take a PATH instance as argument.
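The idea of an immutable value that keeps the original bytes as the source of truth, and treats text only as a best-effort view, can be sketched in Python as follows. `RawPath` and its members are hypothetical names invented for this illustration; this is not the EiffelBase PATH class, and a Unix-style `/` separator is assumed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RawPath:
    """Hypothetical stand-in for the idea behind PATH, not the real class."""
    raw: bytes

    def extended(self, entry: bytes) -> "RawPath":
        # Build a new value; the receiver is never modified.
        return RawPath(self.raw + b"/" + entry)

    def display_name(self) -> str:
        # Best-effort text for printing: undecodable bytes become U+FFFD.
        return self.raw.decode("utf-8", errors="replace")

base = RawPath(b"/tmp")
# 0xE9 is "é" in Latin-1 but is not valid UTF-8 on its own.
odd = base.extended(b"caf\xe9.txt")

print(odd.raw)       # the exact bytes, still usable to open the file
print(base.raw)      # unchanged by the call to extended
```

The displayed name is lossy, but the stored bytes are not, so the file can still be opened even though its name cannot be printed faithfully.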

READABLE_STRING_GENERAL

In order not to break existing code while still letting users provide a Unicode path as a STRING_32, we have also equipped most APIs to take a READABLE_STRING_GENERAL as argument. Note that when an API takes a READABLE_STRING_GENERAL, it interprets every string as a sequence of Unicode characters, which means that a STRING_8 holds Unicode characters whose codes are between 0 and 255, and a STRING_32 holds the full set.

The only exception is FILE and DIRECTORY, where STRING_8 instances are interpreted in the current locale.

Otherwise, existing APIs that used to take a STRING_8 are left as is to preserve backward compatibility: the STRING_8 sequence is not interpreted as a sequence of Unicode characters but as a sequence of bytes, as before.
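The difference between the two interpretations of an 8-bit string described above can be shown with a single byte. In this Python sketch, decoding as Latin-1 stands for "each value is the Unicode character with that code", and `cp1252` stands for the current locale; the choice of `cp1252` is an assumption for the example.

```python
# An 8-bit name whose first byte is 0x80.
byte_name = bytes([0x80]) + b".txt"

# Interpretation 1: each byte is the Unicode character with the same code (0..255).
as_code_points = byte_name.decode("latin-1")

# Interpretation 2: the bytes are text in the current locale (assumed cp1252).
as_locale_text = byte_name.decode("cp1252")

print(hex(ord(as_code_points[0])))  # 0x80
print(hex(ord(as_locale_text[0])))  # 0x20ac
```

The same 8-bit content therefore designates two different names, which is why it matters which rule a given API applies.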