Compiler and encoding
As of 6.7, the compiler is equipped with a Unicode parser. For simplicity and generality, the core of the parser accepts only UTF-8 source code. Before source code reaches the core parsing stage, it is preprocessed and converted to UTF-8.
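The preprocessing step can be pictured with a minimal Python sketch. This is not the compiler's implementation: the helper name `to_utf8_source`, the use of a UTF-8 BOM to recognise UTF-8 input, and the ISO-8859-1 fallback are all assumptions made for illustration.

```python
import codecs


def to_utf8_source(raw: bytes, fallback_encoding: str = "iso-8859-1") -> bytes:
    """Return `raw` re-encoded as UTF-8 without a BOM (hypothetical helper)."""
    if raw.startswith(codecs.BOM_UTF8):
        # Assumption: a UTF-8 BOM marks the file as UTF-8 already.
        body = raw[len(codecs.BOM_UTF8):]
        body.decode("utf-8")  # raises UnicodeDecodeError on malformed input
        return body
    # Assumption: without a BOM, the source is in the fallback encoding.
    return raw.decode(fallback_encoding).encode("utf-8")


# The same character, written in two source encodings, yields the same
# UTF-8 bytes for the core parser.
from_utf8 = to_utf8_source(codecs.BOM_UTF8 + b"\xc3\xa9")
from_latin1 = to_utf8_source(b"\xe9")
```

Either way, the core parser only ever sees one encoding, which is the simplification the text describes.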
Internals
The abstract syntax tree now stores UTF-8 data as STRING_8 on each node. Separate features expose the UTF-8 form, the UTF-32 form, or the bytes as written in the source. Here is an example of how the character é is represented at various levels.
| Source encoding | UTF-8 (BOM) | ISO-8859-1 |
|---|---|---|
| Bytes in source | 0xC3A9 | 0xE9 |
| 1. {STRING_AS}.value | 0xC3A9 | 0xC3A9 |
| 2. {STRING_AS}.binary_value | 0xC3A9 | 0xE9 |
| 3. {STRING_AS}.value_32 | 0xE9 | 0xE9 |
| 4. {STRING_AS}.string_value_32 | 0xE9 | 0xE9 |
| 5. Runtime | 0xC3A9 (STRING_8, rejected); 0xE9 (STRING_32) | 0xE9 (STRING_8); 0xE9 (STRING_32) |
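The byte values in this example can be checked independently of the compiler. The short Python sketch below only illustrates the encodings involved; it does not use or model the `STRING_AS` features themselves.

```python
ch = "\u00e9"  # the character é

utf8_bytes = ch.encode("utf-8")         # bytes in a UTF-8 source file
latin1_bytes = ch.encode("iso-8859-1")  # byte in an ISO-8859-1 source file
code_point = ord(ch)                    # value of the UTF-32 code unit

print(utf8_bytes.hex(), latin1_bytes.hex(), hex(code_point))
# prints: c3a9 e9 0xe9
```

The ISO-8859-1 byte and the Unicode code point coincide (both 0xE9), which is why the STRING_8 and STRING_32 runtime values match for ISO-8859-1 source but not for UTF-8 source.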