Difference between revisions of "Compiler and encoding"
Revision as of 18:11, 30 May 2012
Since version 6.7, the compiler has been equipped with a Unicode parser. For simplicity and generality, the core of the parser accepts only UTF-8 source code. Before source code reaches the core parsing stage, it is preprocessed and converted to UTF-8.
Internals
Data Storage
Each node of the abstract syntax tree now stores its text as UTF-8 data in a STRING_8. Separate features export the UTF-8 form, the UTF-32 form, or the bytes as they were written in the source.
The following table shows how the character é is represented at each level.
| Source encoding | UTF-8 (BOM) | ISO-8859-1 |
| Bytes in source | 0xC3A9 | 0xE9 |
| 1. {STRING_AS}.value | 0xC3A9 | 0xC3A9 |
| 2. {STRING_AS}.binary_value | 0xC3A9 | 0xE9 |
| 3. {STRING_AS}.value_32 | 0xE9 | 0xE9 |
| 4. {STRING_AS}.string_value_32 | 0xE9 | 0xE9 |
| 5. Runtime | 0xC3A9 (STRING_8 Rejected); 0xE9 (STRING_32) | 0xE9 (STRING_8); 0xE9 (STRING_32) |
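The byte values in the table can be reproduced outside the compiler. The following is a minimal sketch in Python, not Eiffel, and it uses only Python's standard codecs; the helper name `hex_of` is made up for illustration and has no counterpart in the compiler.

```python
# Illustrative only: reproduces the table's byte values with Python's codecs.
# Nothing here calls or models the Eiffel compiler itself.

ch = "\u00e9"  # the character é

utf8_bytes = ch.encode("utf-8")          # bytes of é in a UTF-8 source file
latin1_bytes = ch.encode("iso-8859-1")   # bytes of é in an ISO-8859-1 source file
code_point = ord(ch)                     # Unicode code point, as held in UTF-32


def hex_of(data: bytes) -> str:
    """Hypothetical helper: format bytes like the table does, e.g. 0xC3A9."""
    return "0x" + data.hex().upper()


# Converting the ISO-8859-1 byte to UTF-8 yields the same two bytes as the
# UTF-8 source, which is why row 1 of the table is identical in both columns.
normalised = latin1_bytes.decode("iso-8859-1").encode("utf-8")

print(hex_of(utf8_bytes))    # 0xC3A9
print(hex_of(latin1_bytes))  # 0xE9
print(hex_of(normalised))    # 0xC3A9
print(hex(code_point))       # 0xe9
```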
Validity
Source code encoding is either explicitly or implicitly specified.
- Explicit
  - File level: UTF-8 (BOM) is implemented
  - Class level: note clause (not implemented)
  - Configuration file: .ecf (not implemented)
- Implicit
  - If no source code encoding is specified, ISO-8859-1 is assumed for compatibility.
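The rules above, together with the conversion to UTF-8 before parsing, can be sketched as follows. This is a rough illustration in Python rather than the compiler's actual code: the function name `to_utf8` is hypothetical, and stripping the BOM before the text reaches the parser is an assumption, not something this page states.

```python
import codecs


def to_utf8(raw: bytes) -> bytes:
    """Hypothetical preprocessing step: return source bytes as UTF-8.

    Assumed rules, taken from the list above:
      - a UTF-8 BOM at the start of the file marks the file as UTF-8;
      - otherwise the file is treated as ISO-8859-1.
    """
    if raw.startswith(codecs.BOM_UTF8):
        # Assumption: the BOM is dropped before the text reaches the parser.
        body = raw[len(codecs.BOM_UTF8):]
        # Validate the content; raises UnicodeDecodeError if it is not UTF-8.
        body.decode("utf-8")
        return body
    # No explicit encoding: every byte maps to the code point of the same value.
    return raw.decode("iso-8859-1").encode("utf-8")


with_bom = codecs.BOM_UTF8 + "\u00e9".encode("utf-8")  # EF BB BF C3 A9
without_bom = b"\xe9"                                   # é in ISO-8859-1

print(to_utf8(with_bom).hex())     # c3a9
print(to_utf8(without_bom).hex())  # c3a9
```

Both inputs produce the same UTF-8 bytes, matching row 1 of the table. Note that a UTF-8 file saved without a BOM would be read as ISO-8859-1 under these rules, so its multi-byte characters would be misinterpreted.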

