Difference between revisions of "File Name Handling in EiffelBase"

Revision as of 21:56, 15 October 2012

File name handling could be simpler if the Unix world had not decided that a file name is simply a sequence of bytes. The consequence is that the same sequence of bytes can be interpreted differently depending on your locale settings. For example, a French user looking at the name of a file might see a different name than a Chinese user looking at the exact same file.
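The effect can be reproduced outside Eiffel. The following Python sketch is only an illustration: the two codecs (GBK and ISO 8859-1) are assumptions standing in for the two users' locale settings, and the variable names are made up for the example.

```python
# Illustration only: one byte sequence, read under two assumed locale encodings.
chinese_char = "\u4f60"                  # a single Chinese character
raw_name = chinese_char.encode("gbk")    # the bytes as stored by the file system

seen_by_chinese_user = raw_name.decode("gbk")       # locale assumed to use GBK
seen_by_french_user = raw_name.decode("latin-1")    # locale assumed to use ISO 8859-1

# Same bytes on disk, but the two users do not see the same name.
print(seen_by_chinese_user == chinese_char)
print(seen_by_french_user == seen_by_chinese_user)
```

The bytes never change; only the interpretation applied by each user's environment does.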

On Windows the story is better as all file names are UTF-16 encoded.

But what about Eiffel, which provides a multiplatform way to access files?

EiffelStudio 7.1 and older

In 7.1 and older, the only way to reference the file system was to use STRING_8 instances to represent a path.

On Unix, it did not matter whether a file name was Unicode, because to the operating system it is not Unicode at all: it is just a sequence of bytes, which can be represented in a STRING_8 instance.

On Windows, if the file had a name that contained a character code greater than 255, it would work just fine most of the time, since the Eiffel runtime used the ANSI version of the Microsoft APIs. However, if you got a path from the Windows Open File dialog, you would be unable to read that file, since the dialog returns a full-blown Unicode string as a STRING_32 instance.


EiffelStudio 7.2 and later

Eiffel has had support for Unicode since EiffelStudio 5.7, but not at the level of the FILE classes. In 7.2, we introduced the possibility to refer to a file using a STRING_32 instance, with some restrictions that depend on the platform you are running on.

Normal behavior

In order not to break existing code, whenever you pass a STRING_8 instance as the name for the FILE classes, we preserve the older behavior. On Windows, the string is converted to Unicode using the ANSI code page, and on Unix it is passed unchanged.

If you pass a STRING_32 instance as the name, then the behavior changes slightly. On Windows, the Unicode string is still converted, but this time the conversion is simply an encoding of the Unicode string into UTF-16. On Unix, the Unicode string is encoded as a UTF-8 sequence, which is used as the file name.
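As a rough illustration of the two conversions described above, here is a minimal Python sketch. It is not the EiffelBase implementation: the helper names `encode_unicode_name` and `ansi_to_unicode` are hypothetical, and code page 1252 is only an assumed example of an ANSI code page.

```python
def encode_unicode_name(name: str, platform: str) -> bytes:
    """Hypothetical helper: encode a Unicode name as the text describes
    for STRING_32 names (UTF-16 on Windows, UTF-8 on Unix)."""
    if platform == "windows":
        return name.encode("utf-16-le")  # UTF-16, little endian, no byte order mark
    if platform == "unix":
        return name.encode("utf-8")
    raise ValueError(f"unknown platform: {platform}")


def ansi_to_unicode(raw: bytes, code_page: str = "cp1252") -> str:
    """Hypothetical helper: the STRING_8 case on Windows, where the bytes are
    interpreted through the ANSI code page (cp1252 is an assumed example)."""
    return raw.decode(code_page)


name = "caf\u00e9"                                   # "cafe" with an acute accent
windows_bytes = encode_unicode_name(name, "windows")
unix_bytes = encode_unicode_name(name, "unix")
legacy_name = ansi_to_unicode(b"caf\xe9")            # one byte per character

print(len(windows_bytes), len(unix_bytes), legacy_name == name)
```

The same four-character name ends up as eight bytes in UTF-16 and five bytes in UTF-8, which is why the byte sequence handed to the operating system differs per platform.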

The known limitations for Unix

One Unix limitation is that when an API returns a Unicode string for a file, there is no guarantee that the file name is actually UTF-8 encoded in the file system. If it is not UTF-8 encoded, you will not be able to access the file this way.

The DIRECTORY class provides a way to iterate over the contents of a directory. If the directory contains files whose names use encodings other than UTF-8, then we cannot translate them to STRING_32 without breaking the roundtrip (i.e. using that STRING_32 instance, we should be able to open the corresponding file).
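The broken roundtrip can be shown with a short Python sketch. This is only an illustration under assumptions: the file name bytes are a made-up example encoded in ISO 8859-1, and `decodes_as_utf8` is a hypothetical helper, not part of any Eiffel library.

```python
def decodes_as_utf8(raw_name: bytes) -> bool:
    """Hypothetical helper: True if the raw file name bytes are valid UTF-8."""
    try:
        raw_name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


latin1_name = "caf\u00e9".encode("latin-1")   # name written by a non-UTF-8 program
utf8_name = "caf\u00e9".encode("utf-8")       # name written as UTF-8

# Forcing the non-UTF-8 bytes into a Unicode string loses information:
lossy = latin1_name.decode("utf-8", errors="replace")
round_tripped = lossy.encode("utf-8")

print(decodes_as_utf8(utf8_name), decodes_as_utf8(latin1_name))
print(round_tripped == latin1_name)   # the re-encoded name no longer matches the file
```

Once the bytes cannot be decoded as UTF-8, any Unicode string built from them re-encodes to a different byte sequence, so it no longer designates the original file.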

As mentioned above, the Unix APIs for file manipulation handle a file name as a byte sequence without specifying any encoding. So if I have a STRING_32 instance and want to create a file, I need to choose an encoding, but which one? In 7.2, we decided that if you pass a STRING_8 instance, we pass the byte sequence directly to the OS, as was done prior to 7.2. However, if you pass a STRING_32 instance, we first convert it to UTF-8 before passing it to the OS.

So far so good. The rule becomes the following:

  1. You create your own files: use STRING_32 and Unicode, and your file names will be encoded in UTF-8.
  2. You need to read files in paths where no encoding is specified: use a STRING_8 byte sequence to represent the path.

The caveat for Unix is with the DIRECTORY class. How do we interpret the content of a directory listing since we do not know in advance