Compiler and encoding

Latest revision as of 20:04, 7 June 2012


Since version 6.7, the compiler has been equipped with a Unicode parser. For simplicity and generality, the core of the parser accepts only UTF-8 source code. Source code in any other encoding is preprocessed and converted to UTF-8 before it reaches the core parser.
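
The preprocessing step can be illustrated outside the compiler. The following is a minimal Python sketch, not the compiler's own code: the function name `to_utf8` is hypothetical, and it assumes that a UTF-8 BOM marks an explicitly UTF-8 file, that the BOM is dropped before parsing, and that any file without a BOM is treated as ISO-8859-1.

```python
import codecs


def to_utf8(raw: bytes) -> bytes:
    """Return the source bytes re-encoded as UTF-8 without a BOM (illustrative only)."""
    if raw.startswith(codecs.BOM_UTF8):
        # Explicit file-level encoding: strip the 3-byte BOM and keep the rest.
        body = raw[len(codecs.BOM_UTF8):]
        body.decode("utf-8")  # raises UnicodeDecodeError if the bytes are malformed
        return body
    # No BOM: assume ISO-8859-1, where each byte is the code point of the same value.
    return raw.decode("iso-8859-1").encode("utf-8")
```

Under these assumptions, both ways of writing é in a source file reach the parser as the same two bytes, 0xC3 0xA9.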

Internals

Data Storage

The abstract syntax tree (AST) now stores STRING_8 as UTF-8 data on each node. Separate features export the value as UTF-8, as UTF-32, or as the bytes originally written in the source.

The following example shows how the character é is represented at each level.

{| border="1"
! Source encoding
! UTF-8 (BOM)
! ISO-8859-1
|-
| Bytes in source
| 0xC3A9
| 0xE9
|-
| 1. {STRING_AS}.value
| 0xC3A9
| 0xC3A9
|-
| 2. {STRING_AS}.binary_value
| 0xC3A9
| 0xE9
|-
| 3. {STRING_AS}.value_32
| 0xE9
| 0xE9
|-
| 4. {STRING_AS}.string_value_32
| 0xE9
| 0xE9
|-
| 5. Runtime
| 0xE9 (STRING_8), 0xE9 (STRING_32)
| 0xE9 (STRING_8), 0xE9 (STRING_32)
|}
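
The values in this example can be reproduced with a few lines of Python. This is only a sketch of the relationships between the representations: the function `representations` and its dictionary keys are hypothetical and merely borrow the feature names from the example; they are not the compiler's API.

```python
def representations(source_bytes: bytes, source_encoding: str) -> dict:
    """Derive the per-level representations of a manifest string (illustrative only)."""
    text = source_bytes.decode(source_encoding)
    return {
        "value": text.encode("utf-8"),       # UTF-8 data stored on the AST node
        "binary_value": source_bytes,        # bytes exactly as written in the source
        "value_32": [ord(c) for c in text],  # Unicode code points (UTF-32 view)
    }


utf8_case = representations(b"\xc3\xa9", "utf-8")     # BOM assumed already removed
latin1_case = representations(b"\xe9", "iso-8859-1")
```

The two cases differ only in `binary_value`, which preserves the original source bytes; the UTF-8 value and the code point are the same regardless of how the file was encoded.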

Validity

Source code encoding is either explicitly or implicitly specified.

  • Explicit
    • File level: UTF-8 (BOM) is implemented
    • Class level: note clause (not implemented)
    • Configuration file: .ecf (not implemented)
  • Implicit
    • If no source code encoding is specified, ISO-8859-1 is assumed for compatibility.

The following table shows how manifest strings are validated by the compiler:

{| border="1"
!
! Explicit encoding
! Implicit encoding (ISO-8859-1)
|-
| STRING_8 manifest
| Unicode point (0-255)?
| Valid (taken as bytes)
|-
| STRING_32 manifest
| Valid
| Valid
|}
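
One reading of the "Unicode point (0-255)?" entry is that, under an explicit encoding, a STRING_8 manifest is accepted only if every character has a code point in the range 0 to 255. The Python sketch below illustrates that check under this assumption; the name `fits_string_8` is hypothetical and the source does not confirm that the compiler validates manifests exactly this way.

```python
def fits_string_8(text: str) -> bool:
    """True if every character has a code point that fits in one byte (0..255)."""
    return all(ord(ch) <= 0xFF for ch in text)
```

With this check, "é" (U+00E9) would be accepted in a STRING_8 manifest, while "€" (U+20AC) would require a STRING_32 manifest.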