Wish CHARACTER 16

Colin Paul Adams 07:05, 18 January 2007 (CET): On the wish-list, there is a request for a class CHARACTER_16.

I wrote some reasons there why it is a bad idea, but due to space limitations it does not read very well. I thought it worth the time to write a proper answer here.

First, the requestor's justification for the class:

"Smaller footprint than CHARACTER_32, UTF-16 is the default internal text representation for Windows/.Net/OS X/Java so conversion from CHARACTER_32 is costly, particular when dealing with large amounts of text"

There are a number of unjustified assumptions here.

First:

"conversion from CHARACTER_32 is costly"

But you are not starting with CHARACTER_32; you are starting with STRING_32.

There is nothing in ECMA that specifies that STRING_32 must be UTF-32 (though that is perhaps the intent; there are a large number of points related to Unicode in ECMA that are unclear or under-specified, and I have written a report for the ECMA committee detailing these).

So as it stands, an implementor is free to implement STRING_32 as UTF-16 encoded, in which case there would be zero conversion cost (leaving aside any endianness conversion).

But let us suppose that the ECMA committee clarifies the standard to say that STRING_32 is indeed meant to be implemented as UTF-32. Is there then a need for CHARACTER_16?

No. In fact, it makes no sense at all.

First, let us assume that in the interests of conversion efficiency, we have a class representing UTF-16 encoded strings. Let us call it for the sake of argument UC_UTF_16 (STRING_16 might suggest 16-bit characters, which is not the case for UTF-16).

What then will you get if you call `item' on this class? To explore the answer, let us look at the definition of `item' in STRING_32:

item, infix "@" (i: INTEGER): WIDE_CHARACTER assign put is
		-- Character at position `i'
	do
		Result := area.item (i - 1)
	end

(Clicking on WIDE_CHARACTER gives me CHARACTER_32)

The point is the header comment - you get a character, not a Unicode code point.

Now if we call `item' on our hypothetical class UC_UTF_16, what should we get back? Again, a character. Should it be CHARACTER_32 or the requested CHARACTER_16?

The answer is that it must be CHARACTER_32. In UTF-16, a single character may be represented by a pair of 16-bit code units, but the components of this pair are not individual characters: the code points corresponding to these 16-bit values are technically known as surrogate code points, and The Unicode Standard defines them as never representing characters on their own. So if you were somehow able to get a CHARACTER_16 for each half of such a pair, you would have an Eiffel object that supposedly represents a character but in fact does not.
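To make this concrete, here is a rough sketch (my own, not taken from any existing class) of what `item' on the hypothetical UC_UTF_16 would have to do in order to return a CHARACTER_32. Here `code_unit' is an assumed query giving the 16-bit code unit at a given position as an INTEGER, `to_character_32' is assumed as the INTEGER-to-CHARACTER_32 conversion, and the index is assumed to count 16-bit code units rather than characters:

item (i: INTEGER): CHARACTER_32 is
		-- Character at position `i', decoding a surrogate pair when
		-- the code unit at `i' is a leading (high) surrogate.
	local
		high, low: INTEGER
	do
		high := code_unit (i)
		if high >= 0xD800 and high <= 0xDBFF then
				-- Leading surrogate: combine with the trailing surrogate
				-- that follows to recover the supplementary-plane code point.
			low := code_unit (i + 1)
			Result := (0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)).to_character_32
		else
			Result := high.to_character_32
		end
	end

For example, the pair 0xD834 / 0xDD1E combines to the single character U+1D11E - neither half is a character by itself.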

So the whole idea of CHARACTER_16 is a nonsense.

In some scenarios you may need access to the individual 16-bit code units (UTF-16 debugging, for instance), but to address this, the correct queries would be something like:

is_surrogate_pair (a_index: INTEGER): BOOLEAN
    -- Is the 16-bit code unit at `a_index' one half of a surrogate pair?

lower_surrogate (a_index: INTEGER): NATURAL_16
    -- Low surrogate of the pair at `a_index'
  require
    surrogate_pair: is_surrogate_pair (a_index)

upper_surrogate (a_index: INTEGER): NATURAL_16
    -- High surrogate of the pair at `a_index'
  require
    surrogate_pair: is_surrogate_pair (a_index)
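As a usage illustration only (UC_UTF_16 remains hypothetical, `count' is assumed here to give the number of 16-bit code units, and `is_surrogate_pair' is assumed to report True at the leading unit of a pair), a debugging routine might walk the code units like this:

print_surrogate_pairs (s: UC_UTF_16) is
		-- Print the position and raw halves of each surrogate pair in `s'.
	local
		i: INTEGER
	do
		from
			i := 1
		until
			i > s.count
		loop
			if s.is_surrogate_pair (i) then
				print ("Surrogate pair at " + i.out + ": ")
				print (s.upper_surrogate (i).out + " / " + s.lower_surrogate (i).out + "%N")
				i := i + 2
			else
				i := i + 1
			end
		end
	end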

I hope that whoever added CHARACTER_16 to the wish list will now remove it, and replace it with a request for a UTF-16 class (which should NOT be called STRING_16 - I would suggest UTF_16_STRING).