File Name Handling in EiffelBase

Revision as of 21:40, 15 October 2012 by Manus (Work in progress)

File name handling could be simpler if the Unix world hadn't decided that a file name is simply a sequence of bytes. The consequence is that the same sequence of bytes can be interpreted differently depending on your locale settings. For example, a French user looking at the name of a file could see a different name than a Chinese user looking at the exact same file.
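To make the locale dependence concrete, here is a minimal sketch in Python (used purely for illustration; this is not EiffelBase code). The two bytes and the two encodings, Latin-1 and GBK, are arbitrary choices assumed for the example.

```python
# The same two bytes, as they might be stored in a Unix directory entry.
raw_name = b"\xc4\xe3"

# A user whose locale uses Latin-1 (ISO-8859-1) sees two accented letters.
as_latin1 = raw_name.decode("latin-1")

# A user whose locale uses GBK sees a single Chinese character.
as_gbk = raw_name.decode("gbk")

print(len(as_latin1), as_latin1)
print(len(as_gbk), as_gbk)
```

Neither reading is "the" correct one: the file system stores only the bytes, and each user's locale decides how they are displayed.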

On Windows, the story is better, as all file names are UTF-16 encoded.

But what about Eiffel, which provides a multiplatform way to access files?

STRING_8

Up to 7.1, the only way to reference the file system was to use STRING_8 instances to represent a path.

On Unix, whether the file name was Unicode did not matter, because it is not really Unicode at all: it is just a sequence of bytes.

On Windows, if the file had a name that contained a character code above 255, it would work just fine most of the time, since the Eiffel runtime used the ANSI version of the Microsoft APIs. However, if you got a path from the Windows Open File dialog, you would be unable to read that file, since the dialog returns a STRING_32 instance.


STRING_32

Eiffel has had support for Unicode for quite a few releases, but not at the level of the file classes. In 7.2, we introduced this capability, which unfortunately has some restrictions for Unix users.

First, on Windows: if you provide a STRING_32 path to the file classes, it is converted internally to UTF-16, and the Microsoft APIs are then used to open the file, read the content of a directory, and so on. Whenever we get a file name back from Windows, we convert it from UTF-16 to STRING_32. If you pass a STRING_8 instance instead, we did not want to break any existing code, so we convert it to UTF-16 the same way Microsoft does, using the ANSI code page, which basically depends on the language of your OS. You can then query back either the STRING_8 instance or its STRING_32 conversion.
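The dependence of the STRING_8 conversion on the ANSI code page can be sketched as follows. This is an illustration in Python, not the Eiffel runtime's code: the helper name `ansi_to_utf16` is hypothetical, and code pages 1252 (Western European) and 1251 (Cyrillic) are assumed as examples of what different Windows installations might use.

```python
def ansi_to_utf16(name_bytes: bytes, code_page: str) -> bytes:
    """Decode bytes with the given ANSI code page, then encode as UTF-16 (little endian).

    Hypothetical helper that mimics the STRING_8 -> UTF-16 step described above.
    """
    return name_bytes.decode(code_page).encode("utf-16-le")


# One STRING_8-style name: four bytes, the last one above 127.
string_8_name = b"caf\xe9"

# The same bytes map to different UTF-16 names depending on the code page.
western = ansi_to_utf16(string_8_name, "cp1252")
cyrillic = ansi_to_utf16(string_8_name, "cp1251")

print(western.decode("utf-16-le"))
print(cyrillic.decode("utf-16-le"))
print(western == cyrillic)
```

The point of the sketch is that a STRING_8 path containing bytes above 127 does not name the same file on every Windows machine, whereas a STRING_32 path does.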

What is the problem for Unix users?

As mentioned above, the Unix APIs for file manipulation handle a file name as a byte sequence without specifying any encoding. So if I have a STRING_32 instance and want to create a file, I need to choose an encoding, but which one? In 7.2, we decided that if you pass a STRING_8 instance, we pass the byte sequence directly to the OS, as was done prior to 7.2. However, if you pass a STRING_32 instance, we first convert it to UTF-8 before passing it to the OS.

So far so good. The rule becomes the following:

  1. You create your own files: use STRING_32 and Unicode, and the file names will be stored in UTF-8.
  2. You need to read files from paths where no encoding is specified: use a STRING_8 instance to represent the path.
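The Unix rule above can be sketched in a few lines. This is an illustration in Python under stated assumptions, not the EiffelBase implementation: `bytes` stands in for STRING_8, `str` stands in for STRING_32, and the function name `unix_os_name` is hypothetical.

```python
from typing import Union


def unix_os_name(path: Union[bytes, str]) -> bytes:
    """Return the byte sequence that would be handed to the Unix file APIs.

    bytes models a STRING_8 path, str models a STRING_32 path (assumption).
    """
    if isinstance(path, bytes):
        # STRING_8 case: pass the bytes through untouched.
        return path
    # STRING_32 case: encode the Unicode text as UTF-8.
    return path.encode("utf-8")


# A byte path in some unknown legacy encoding is left exactly as it is.
legacy = unix_os_name(b"caf\xe9")

# A Unicode path is always turned into UTF-8 bytes.
unicode_name = unix_os_name("caf\u00e9")

print(legacy)
print(unicode_name)
```

Note that the two results differ even though both could be displayed as the same name: the pass-through case preserves whatever bytes are on disk, while the Unicode case commits to UTF-8.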

The caveat for Unix is with the DIRECTORY class. How do we interpret the content of a directory listing since we do not know in advance