Heuristics for detecting class text encoding

When Eiffel Software implements support for class texts written in Unicode, it will be important to detect which Unicode encoding scheme is in use. Class authors will certainly want to write class texts in UTF-8 (support for other encoding schemes is sporadic amongst programmers' text editors), probably in UTF-16/UTF-16BE/UTF-16LE (especially on Windows systems, or in East Asia), and least likely in UTF-32/UTF-32BE/UTF-32LE. The ECMA standard does not (yet; I have raised the issue) address the matter of encoding schemes. I trust it will either allow all seven freely or, better, allow just UTF-8 (without a BOM), and UTF-16 and UTF-32 (with the additional requirement of a BOM).

If my suggestion in Mixing Unicode and Latin-1 class texts is followed, and I hope it is, then the compiler will already know from the cluster definition whether or not a class is written in a Unicode encoding scheme. Therefore confusion with Latin-1 texts does not arise.

If only UTF-8 (without a BOM), and UTF-16 and UTF-32 (with the additional requirement of a BOM) are allowed, then the heuristic is simple: examine the first four bytes. If they begin with a BOM, the BOM determines the encoding; if they do not, the text is UTF-8.
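The restricted case can be sketched in a few lines. This is only an illustration in Python, not a proposal for the compiler's actual implementation; the function name `detect_restricted` is hypothetical. It assumes that a class text never begins with a NUL character, so the UTF-32LE BOM (FF FE 00 00) cannot be mistaken for a UTF-16LE BOM followed by U+0000.

```python
import codecs


def detect_restricted(first_bytes: bytes) -> str:
    """Classify a class text when only UTF-8 (no BOM), UTF-16 (BOM)
    and UTF-32 (BOM) are allowed. Only the first four bytes matter."""
    # Test UTF-32 first: the UTF-32LE BOM starts with the UTF-16LE BOM.
    if first_bytes.startswith((codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE)):
        return "UTF-32"
    if first_bytes.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "UTF-16"
    # No BOM: under the restricted rule the text must be UTF-8.
    return "UTF-8"
```

The order of the tests matters: checking for a UTF-16 BOM first would misreport every little-endian UTF-32 text as UTF-16.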

Otherwise, simple heuristics can reliably determine the encoding scheme of the class by reading the first line of the class text as a sequence of octets (e.g. reading it as a STRING_8 and interpreting the `code' of each "character").

The class text can only begin with one of the following:

White space, indexing, note, class, deferred, expanded, frozen

or a byte-order-mark (have I missed anything? --Ericb 19:21, 30 March 2007 (CEST): yes: comments starting with -- ).

So each of these can be tested for in all seven possible encoding schemes (you only have to read 8 bytes).
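One way to carry out that test, sketched in Python under stated assumptions: every permitted opening (white space, a keyword, or a `--` comment) starts with two ASCII characters, so where there is no BOM the position of the zero octets in the first four bytes reveals the code unit width and byte order. The function name `guess_encoding_scheme` is hypothetical, and reporting a UTF-8 signature as plain UTF-8 is my own choice rather than anything the standard prescribes.

```python
import codecs


def guess_encoding_scheme(head: bytes) -> str:
    """Guess the encoding scheme from the first (up to 8) octets."""
    # A BOM settles the matter. UTF-32 BOMs are tested before UTF-16
    # ones because the UTF-32LE BOM begins with the UTF-16LE BOM.
    boms = (
        (codecs.BOM_UTF32_BE, "UTF-32"),
        (codecs.BOM_UTF32_LE, "UTF-32"),
        (codecs.BOM_UTF8, "UTF-8"),
        (codecs.BOM_UTF16_BE, "UTF-16"),
        (codecs.BOM_UTF16_LE, "UTF-16"),
    )
    for bom, name in boms:
        if head.startswith(bom):
            return name
    # No BOM: the first character is ASCII, so look at where the
    # zero octets fall around it.
    if len(head) >= 4:
        if head[:3] == b"\x00\x00\x00" and head[3] != 0:
            return "UTF-32BE"
        if head[0] != 0 and head[1:4] == b"\x00\x00\x00":
            return "UTF-32LE"
    if len(head) >= 2:
        if head[0] == 0 and head[1] != 0:
            return "UTF-16BE"
        if head[0] != 0 and head[1] == 0:
            return "UTF-16LE"
    return "UTF-8"
```

For example, `guess_encoding_scheme("class FOO".encode("utf-16-be")[:8])` yields `"UTF-16BE"`. A stricter version could go on to compare the decoded characters with the list of permitted openings and reject anything else.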

Note that the EiffelStudio editor should save class texts in an encoding scheme chosen by a user preference. I recommend allowing only UTF-8 (without a BOM), and UTF-16 and UTF-32 (with the additional requirement of a BOM).