Compiler and encoding
Latest revision as of 20:04, 7 June 2012
Since version 6.7, the compiler has been equipped with a Unicode parser. For simplicity and generality, the core of the parser accepts only UTF-8 source code. Before source code is passed to the core parsing process, it is preprocessed and converted to UTF-8.
Internals
Data Storage
The abstract syntax tree now stores the text of each node as UTF-8 data in a STRING_8. Separate features export the UTF-8 form, the UTF-32 form, or the bytes as written in the source.
Here is an example of how the character é is represented at each level.
Source encoding | UTF-8 (BOM) | ISO-8859-1
Bytes in source | 0xC3A9 | 0xE9
1. {STRING_AS}.value | 0xC3A9 | 0xC3A9
2. {STRING_AS}.binary_value | 0xC3A9 | 0xE9
3. {STRING_AS}.value_32 | 0xE9 | 0xE9
4. {STRING_AS}.string_value_32 | 0xE9 | 0xE9
5. Runtime | 0xE9 (STRING_8), 0xE9 (STRING_32) | 0xE9 (STRING_8), 0xE9 (STRING_32)
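The values in the table can be reproduced outside the compiler. The following is a minimal sketch in Python, using only the standard codecs; `hex_bytes` is a hypothetical helper introduced for illustration, and nothing here calls a compiler feature such as `{STRING_AS}.value`.

```python
# Illustrative only: reproduces the byte and code point values from the table
# with Python's standard codecs. No compiler feature is called.

def hex_bytes(data: bytes) -> str:
    """Format a byte sequence as a single uppercase hex literal."""
    return "0x" + data.hex().upper()

char = "\u00e9"  # the character é, Unicode code point U+00E9

utf8_bytes = char.encode("utf-8")         # bytes in a UTF-8 source file
latin1_bytes = char.encode("iso-8859-1")  # bytes in an ISO-8859-1 source file

# Converting the ISO-8859-1 bytes to UTF-8 yields the same bytes as the
# UTF-8 source, which matches row 1 of the table.
normalized = latin1_bytes.decode("iso-8859-1").encode("utf-8")

# The code point is the same whatever the source encoding was (rows 3 and 4).
code_point = ord(char)

print(hex_bytes(utf8_bytes))    # 0xC3A9
print(hex_bytes(latin1_bytes))  # 0xE9
print(hex_bytes(normalized))    # 0xC3A9
print(hex(code_point))          # 0xe9
```

The two source encodings differ only in the written bytes (row 2); once converted, both produce the same UTF-8 data and the same code point.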
Validity
The source code encoding is specified either explicitly or implicitly.
- Explicit
  - File level: UTF-8 (BOM) is implemented
  - Class level: note clause (not implemented)
  - Configuration file: .ecf (not implemented)
- Implicit
  - If no source code encoding is specified, ISO-8859-1 is assumed for compatibility.
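The two implemented cases (UTF-8 with a BOM, and the ISO-8859-1 fallback) can be sketched as a small preprocessing step. This is a minimal illustration in Python, not the compiler's code: `to_utf8` is a hypothetical name, and stripping the BOM and rejecting malformed UTF-8 are assumptions about behavior the text does not describe.

```python
import codecs

def to_utf8(raw: bytes) -> bytes:
    """Convert raw source bytes to UTF-8 (hypothetical helper)."""
    if raw.startswith(codecs.BOM_UTF8):
        # Explicit, file level: a UTF-8 BOM marks the file as UTF-8.
        # Assumption: the BOM is dropped before parsing.
        body = raw[len(codecs.BOM_UTF8):]
        # Assumption: malformed UTF-8 is rejected (raises UnicodeDecodeError).
        body.decode("utf-8")
        return body
    # Implicit: no encoding was specified, so ISO-8859-1 is assumed.
    # Every byte value is valid ISO-8859-1, so this branch never fails.
    return raw.decode("iso-8859-1").encode("utf-8")
```

For example, `to_utf8(b"\xe9")` and `to_utf8(codecs.BOM_UTF8 + b"\xc3\xa9")` both return `b"\xc3\xa9"`, the UTF-8 form of é.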
The following table shows how manifest strings are validated by the compiler:
Manifest type | Explicit Encoding | Implicit Encoding (ISO-8859-1)
STRING_8 manifest | Unicode code point (0-255)? | Valid (taken as bytes)
STRING_32 manifest | Valid | Valid
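One possible reading of the table is sketched below in Python. The function name and parameters are hypothetical, and the STRING_8 rule under an explicit encoding is an assumption: the table marks that cell with a question mark, so the check "every code point is in 0-255" may not match what the compiler actually does.

```python
def is_valid_manifest(text: str, target: str, explicit_encoding: bool) -> bool:
    """Hypothetical check mirroring the validation table."""
    if target == "STRING_32":
        # STRING_32 manifests are valid in both columns of the table.
        return True
    if target != "STRING_8":
        raise ValueError("target must be 'STRING_8' or 'STRING_32'")
    if not explicit_encoding:
        # Implicit ISO-8859-1: the written bytes are taken as they are.
        return True
    # Assumption (the table cell is uncertain): with an explicit encoding,
    # every character must fit in a single 8-bit code unit.
    return all(ord(ch) <= 255 for ch in text)
```

Under this reading, `is_valid_manifest("é", "STRING_8", True)` holds because é is U+00E9, while a manifest containing a character above U+00FF would be accepted only as STRING_32.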