File Name Handling in EiffelBase

Revision as of 23:20, 15 October 2012 by Alexander Kogtenkov (Fixed typos.)

File name handling would be simpler if the Unix world had not decided that a file name is simply a sequence of bytes. The consequence is that the same sequence of bytes can be interpreted differently depending on your locale settings. For example, a French user and a Chinese user looking at the exact same file may each see a different name.
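As a rough illustration of the locale problem, the sketch below decodes one byte sequence under two encodings. The sample name and the choice of GBK and Latin-1 as stand-ins for a Chinese and a French locale are assumptions made for the example, not something the runtime prescribes.

```python
# Hypothetical file name, stored on disk as the bytes a GBK-locale user would write.
raw_name = "你好".encode("gbk")

def view_as(name_bytes: bytes, encoding: str) -> str:
    """Decode a raw file name the way a tool using `encoding` would display it."""
    return name_bytes.decode(encoding, errors="replace")

chinese_view = view_as(raw_name, "gbk")     # assumed Chinese locale encoding
french_view = view_as(raw_name, "latin-1")  # assumed French locale encoding

print(chinese_view)
print(french_view)
print(chinese_view == french_view)  # False: same bytes, two different names
```

The bytes on disk never change; only the interpretation applied by each user's environment does.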

On Windows the story is better as all file names are UTF-16 encoded.

But what about Eiffel, which provides a multiplatform way to access files?

EiffelStudio 7.1 and older

In 7.1 and older, the only way to reference the file system was to use STRING_8 instances to represent a path.

On Unix, it did not matter whether a file name was meant to be Unicode, because the file system does not store Unicode: it stores a sequence of bytes, and any such sequence can be represented in a STRING_8 instance.

On Windows, if the file had a name that contained a character code above 255, it would work just fine most of the time, since the Eiffel runtime used the ANSI version of the Microsoft APIs. However, if you got a path from the Windows Open File dialog, you would be unable to read that file, since the dialog returned a full-blown Unicode string as a STRING_32 instance.
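To make the Windows limitation concrete, here is a small sketch of why a full Unicode name may not survive a trip through an 8-bit ANSI code page. It assumes Windows-1252 as the active code page, and the helper name `fits_in_code_page` is hypothetical; it does not reproduce what the Eiffel runtime did internally.

```python
def fits_in_code_page(name: str, code_page: str = "cp1252") -> bool:
    """Return True if every character of `name` exists in the 8-bit code page."""
    try:
        name.encode(code_page)  # assumed ANSI code page: Windows-1252
        return True
    except UnicodeEncodeError:
        return False

print(fits_in_code_page("résumé.txt"))  # True: all characters exist in cp1252
print(fits_in_code_page("日本語.txt"))   # False: no 8-bit representation
```

A name of the second kind can be returned by a Unicode-aware dialog but cannot be passed on through an interface limited to 8-bit characters.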


EiffelStudio 7.2 and later

Eiffel has had support for Unicode since EiffelStudio 5.7, but not at the level of the FILE classes. In 7.2, we introduced the possibility to refer to a file using a STRING_32 instance, with some restrictions that depend on the platform you are running on.

Normal behavior

In order not to break existing code, whenever you pass a STRING_8 instance as the name to the FILE classes, we preserve the older behavior. On Windows, the string is converted to Unicode using the ANSI code page, and on Unix it stays the same.

If you pass a STRING_32 instance as the name, then the behavior changes slightly. On Windows, the Unicode string is still converted, but this time the conversion is just an encoding of the Unicode string into UTF-16. On Unix, the Unicode string is encoded as a UTF-8 sequence, and that sequence is used as the file name.
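The two behaviors described above can be sketched as follows. This is only a model: Python `bytes` stands in for STRING_8 and `str` for STRING_32, the ANSI code page is assumed to be Windows-1252, UTF-16 is assumed little-endian, and both function names are hypothetical.

```python
def windows_name(name) -> bytes:
    """Model of the name handed to a UTF-16 Windows API."""
    if isinstance(name, bytes):       # STRING_8 stand-in
        name = name.decode("cp1252")  # ANSI code page -> Unicode (assumed cp1252)
    return name.encode("utf-16-le")   # Unicode -> UTF-16

def unix_name(name) -> bytes:
    """Model of the byte sequence used as the file name on Unix."""
    if isinstance(name, bytes):       # STRING_8 stand-in: bytes are used as-is
        return name
    return name.encode("utf-8")       # STRING_32 stand-in: encoded as UTF-8

print(unix_name(b"caf\xe9"))  # b'caf\xe9'
print(unix_name("café"))      # b'caf\xc3\xa9'
print(windows_name(b"caf\xe9") == windows_name("café"))  # True
```

Note that on Unix the 8-bit and the Unicode spelling of the same visible name yield different byte sequences, and therefore refer to different files.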

The known limitations for Unix

  • If an API returns a Unicode string for a file name, nothing guarantees that the name is actually UTF-8 encoded in the file system. If it is not, you cannot access the file through that string.
  • The DIRECTORY class provides a way to iterate over the content of a directory. If some entries have names that are not UTF-8 encoded, we cannot translate them to STRING_32 without breaking the roundtrip (i.e. using that STRING_32 instance, we should be able to open the corresponding file). Directory listing is still possible, but you need to use the version that returns STRING_8 instances, which you can then use to create a FILE instance, provided no conversion to STRING_32 happens along the way.
  • Building a path cannot mix STRING_32 and STRING_8 instances.
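The roundtrip problem in the list above can be illustrated with a minimal sketch. The sample entries and the helper `to_unicode_name` are hypothetical; Python `bytes` again stands in for the raw name a directory listing would yield.

```python
from typing import Optional

def to_unicode_name(raw: bytes) -> Optional[str]:
    """Return the Unicode name if `raw` is valid UTF-8, else None.

    A strict UTF-8 decode is reversible, so a returned name can be
    re-encoded to exactly the original bytes.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None  # no Unicode name leads back to these bytes via UTF-8

# Hypothetical directory entries: one UTF-8 name, one Latin-1 name.
entries = [b"caf\xc3\xa9.txt", b"caf\xe9.txt"]

for raw in entries:
    print(raw, "->", to_unicode_name(raw))
```

The second entry has no Unicode form that encodes back to the same bytes, so the only safe way to open it is to keep the raw 8-bit name untouched.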

To summarize, on Unix, the rule becomes the following:

  1. You create your own files: use STRING_32 and Unicode, and your file names will be saved in UTF-8.
  2. You need to read files in a path whose encoding is not known: use a STRING_8 instance to represent the path.
  3. Building a path cannot mix STRING_32 and STRING_8 instances.
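Python's standard library happens to follow similar rules, which makes for a convenient analogy (it says nothing about how EiffelBase is implemented). In the sketch below, a `bytes` path plays the role of STRING_8 and a `str` path the role of STRING_32; the file name `notes.txt` is made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Rule 1 analogue: create a file from a Unicode (str) path.
    open(os.path.join(tmp, "notes.txt"), "w").close()

    # Rule 2 analogue: list with a bytes path to get raw, undecoded names back.
    raw_entries = os.listdir(os.fsencode(tmp))

# Rule 3 analogue: mixing the two kinds of path component is rejected.
try:
    os.path.join("dir", b"file")
    mixing_rejected = False
except TypeError:
    mixing_rejected = True

print(raw_entries)      # [b'notes.txt']
print(mixing_rejected)  # True
```

Keeping a path entirely in one representation from listing to opening avoids any hidden conversion step.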