Difference between revisions of "File Name Handling in EiffelBase"

(The known limitations for Unix)
(Added exception for FILE and DIRECTORY)
 
(One intermediate revision by the same user not shown)
Line 1:

  File name handling could be simpler if the Unix world hadn't decided that a file name is simply a sequence of bytes. The impact of such statement is that sequence of bytes can be interpreted differently depending on your locale settings. For example a French user looking at the name of a file would see a different name than a Chinese person looking at the exact same file.

− On Windows the story is better as all file names are UTF-16 encoded.
+ On Windows the story is better as all file names are UTF-16 encoded but Windows does not guarantee that it is a valid UTF-16 sequence, which means that not all file names are representable in a printed manner.

  But what about Eiffel that provides a multiplatform way to access files?
Line 14:

  == EiffelStudio 7.2 and later ==

− Eiffel had support for Unicode since EiffelStudio 5.7 but not at the level of the '''FILE''' classes. In 7.2, we have introduced the possibility to refer to a file using a STRING_32 instance with some restrictions which depend on the platform you are running.
+ Eiffel had support for Unicode since EiffelStudio 5.7 but not at the level of the '''FILE''' classes. In 7.2, we have introduced the possibility to refer to a file using a PATH or a READABLE_STRING_GENERAL instance. PATH is a new class of EiffelBase.

− === Normal behavior ===
+ === PATH ===

− In order to not break existing code, whenever you pass a '''STRING_8''' instance as the name for the '''FILE''' classes, we preserve the older behavior. On Windows, the string is converted using the ANSI code page to Unicode, and on Unix it stays the same.
− If you pass a '''STRING_32''' instance as the name, then the behavior changes slightly. On Windows the Unicode string is still converted but this time the conversion is really just an encoding of the Unicode string into UTF-16. On Unix the Unicode string is encoded as a UTF-8 sequence used as the file name.
− === The known limitations for Unix ===
− * One Unix limitation is that if you have an API returning you a Unicode string for a file, it doesn't tell you for sure that the file is actually UTF-8 encoded in the file system, if it was not UTF-8 encoded then you would not be able to access that file this way.
− * The DIRECTORY class provides a way to iterate a directory content. If the content has files with different encodings which are not UTF-8, then we cannot translate them back to STRING_32 without breaking the roundtrip (i.e. using that STRING_32 instance, we should be able to open the corresponding file). What shall we do? The directory listing is still possible but you need to use the version that gives you back STRING_8 instances, that you can then use to create a FILE instance (but no conversion to STRING_32 should intervene in the process).
− * The final limitation is that building paths cannot mix both STRING_32 and STRING_8 instances.
− To summarize, on Unix, the rule becomes the following:
− # Create your own files: use STRING_32 and your file names will be saved in UTF-8.
− # Read files in path where no encoding is specified: use a STRING_8 sequence to represent the path.
− # Building paths cannot mix STRING_32 and STRING_8.

+ As mentioned before the main issue with handling file names is that they use 1-byte sequence (UNIX) or a 2-byte sequence (Windows). Usually the encoding will be UTF-8 (UNIX) or UTF-16 (Windows) but there is no guarantee.
+ When a file name/path contains such mixed encoding, the file name/path is not representable textually. This is where PATH is handy. PATH is immutable making it easy to manipulate and to share.
+ All APIs of EiffelBase and other libraries have been updated to also take a PATH instance as argument each time there is a need to access a file.
+ === READABLE_STRING_GENERAL ===
+ In order to not break existing code but still offer users to provide Unicode path as STRING_32, we have also equipped most APIs to take a READABLE_STRING_GENERAL as argument. Note that when an API takes a READABLE_STRING_GENERAL, it will interpret all strings as if they were a Unicode character sequence, which means that STRING_8 are Unicode character whose code is between 0 and 255, and STRING_32 is the full set.
+ The only exception is with FILE and DIRECTORY where STRING_8 are interpreted in the current locale.
+ Otherwise, existing APIs that used to take STRING_8 are left as is to not break backward compatibility and the STRING_8 sequence is not interpreted as Unicode character sequence but as a sequence of bytes as before.

Latest revision as of 16:02, 8 November 2012

File name handling would be simpler if the Unix world had not decided that a file name is simply a sequence of bytes. The consequence is that the same sequence of bytes can be interpreted differently depending on your locale settings. For example, a French user looking at the name of a file could see a different name than a Chinese user looking at the exact same file.
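The effect can be illustrated outside Eiffel. The following is a minimal Python sketch, not EiffelBase code; the byte sequence, the helper `shown_as`, and the choice of Latin-1 and GBK as stand-ins for a French and a Chinese locale are all assumptions made for illustration.

```python
# The same raw bytes, as a Unix file system would store them.
raw_name = b"\xe9\xe8.txt"


def shown_as(raw: bytes, encoding: str) -> str:
    """Decode raw file-name bytes the way a locale using `encoding` would."""
    return raw.decode(encoding)


# Assumed locale encodings: ISO-8859-1 for a Western European user,
# GBK for a simplified Chinese user.
french_view = shown_as(raw_name, "latin-1")
chinese_view = shown_as(raw_name, "gbk")

print(french_view)
print(chinese_view)
print(french_view == chinese_view)
```

Both users refer to the same file, yet the two decoded names differ in content and even in length, because GBK reads the first two bytes as a single character.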

On Windows the story is better, since all file names are stored as UTF-16. However, Windows does not guarantee that a name is a valid UTF-16 sequence, which means that not all file names can be represented in printed form.
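As a rough illustration of what an invalid UTF-16 name looks like, the Python sketch below builds a sequence of 16-bit units containing an unpaired surrogate. The helper `is_valid_utf16` and the sample names are hypothetical; nothing here calls a Windows or EiffelBase API.

```python
def is_valid_utf16(units: bytes) -> bool:
    """Return True if `units` decodes as well-formed little-endian UTF-16."""
    try:
        units.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


# A well-formed name.
good = "report.txt".encode("utf-16-le")

# 'a', then the unpaired high surrogate 0xD800, then 'b'.
bad = "a".encode("utf-16-le") + b"\x00\xd8" + "b".encode("utf-16-le")

print(is_valid_utf16(good))
print(is_valid_utf16(bad))

# Printing the bad name requires substituting something for the bad unit,
# so the printed form no longer identifies the file exactly.
printable = bad.decode("utf-16-le", errors="replace")
print(printable)
```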

But what about Eiffel, which provides a multiplatform way to access files?

EiffelStudio 7.1 and older

Prior to 7.2, the only way to reference the file system was to use STRING_8 instances to represent a path.

On Unix, it did not matter whether the file name was Unicode, because the file system does not really store Unicode: it stores just a sequence of bytes, which can be represented in a STRING_8 instance.

On Windows, if the file had a name that contained a code greater than 255, it would work just fine most of the time, since the Eiffel runtime used the ANSI version of the Microsoft APIs. However, if you got a path from the Windows Open File dialog, you would be unable to read that file, since the dialog returns a full-blown Unicode string as a STRING_32 instance.
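To make the ANSI restriction concrete, here is a small Python sketch that checks whether a Unicode name can be expressed in a given code page. The function `representable_in_code_page` is hypothetical, and cp1252 is only an assumed example of an ANSI code page; the actual code page depends on the Windows installation.

```python
def representable_in_code_page(name: str, code_page: str = "cp1252") -> bool:
    """Return True if every character of `name` exists in `code_page`."""
    try:
        name.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


# A Western European name fits in cp1252.
print(representable_in_code_page("caf\u00e9.txt"))

# A Chinese name does not, so a byte-oriented API working in that
# code page cannot express it.
print(representable_in_code_page("\u6587\u4ef6.txt"))
```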


EiffelStudio 7.2 and later

Eiffel has supported Unicode since EiffelStudio 5.7, but not at the level of the FILE classes. In 7.2, we introduced the possibility of referring to a file using a PATH or a READABLE_STRING_GENERAL instance. PATH is a new class in EiffelBase.

PATH

As mentioned before, the main issue with handling file names is that they are sequences of 1-byte units (Unix) or 2-byte units (Windows). Usually the encoding is UTF-8 (Unix) or UTF-16 (Windows), but there is no guarantee.

When a file name or path contains units that are invalid in the expected encoding, or mixes several encodings, it cannot be represented textually. This is where PATH is handy. PATH is immutable, which makes it easy to manipulate and to share.

All APIs of EiffelBase and other libraries have been updated to also accept a PATH instance as an argument wherever a file needs to be accessed.
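The idea of keeping the exact stored name and deriving text from it only for display can be sketched as follows. This is a conceptual Python illustration under stated assumptions: the class `RawPath` and its features `extended` and `display_name` are hypothetical and do not reproduce the implementation or the API of the EiffelBase PATH class.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class RawPath:
    """Hypothetical immutable path that keeps the exact stored bytes."""

    raw: bytes  # bytes exactly as a Unix-like file system would store them

    def extended(self, component: bytes) -> "RawPath":
        """Return a new path with `component` appended; self is unchanged."""
        return RawPath(self.raw + b"/" + component)

    def display_name(self) -> str:
        """Best-effort text for printing; lossy for non-UTF-8 bytes."""
        return self.raw.decode("utf-8", errors="replace")


base = RawPath(b"/tmp/data")
# The last component uses Latin-1 bytes, which are not valid UTF-8.
odd = base.extended(b"r\xe9sum\xe9.txt")

print(odd.display_name())  # lossy, for display only
print(odd.raw)             # exact, usable to reach the file
```

Because the object is immutable, `base` can be shared freely: extending it produces a new value instead of altering the original.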

READABLE_STRING_GENERAL

In order not to break existing code while still letting users provide a Unicode path as a STRING_32, we have also equipped most APIs to take a READABLE_STRING_GENERAL as argument. Note that when an API takes a READABLE_STRING_GENERAL, it interprets every string as a sequence of Unicode characters, which means that a STRING_8 holds Unicode characters whose codes are between 0 and 255, while a STRING_32 can hold the full set.

The only exception is FILE and DIRECTORY, where STRING_8 instances are interpreted in the current locale.

Otherwise, existing APIs that used to take a STRING_8 are left as is to preserve backward compatibility: the STRING_8 sequence is not interpreted as a Unicode character sequence but as a sequence of bytes, as before.
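The practical consequence of reading a STRING_8 as Unicode codes 0 to 255 can be shown with a short Python sketch. The function `as_unicode_codes` is a hypothetical stand-in for that interpretation and is not an EiffelBase feature.

```python
def as_unicode_codes(string_8: bytes) -> str:
    """Treat each 8-bit value as the Unicode code point of the same number."""
    return "".join(chr(code) for code in string_8)


# A single 8-bit value 233 is read as the single character U+00E9.
single = as_unicode_codes(b"\xe9")

# The UTF-8 encoding of that same character is two bytes (195, 169),
# so it is read as two characters, U+00C3 and U+00A9.
utf8_bytes = "\u00e9".encode("utf-8")
double = as_unicode_codes(utf8_bytes)

print(len(single), [ord(c) for c in single])
print(len(double), [ord(c) for c in double])
```

A STRING_8 that already contains UTF-8 bytes is therefore not the same name under this interpretation as the equivalent STRING_32.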