Design of Roundtrip Parser
Contents
Introduction to roundtrip parser
Source code transformation has a lot of applications, within which, refactoring and pretty-printing are included. Often, the basic function needed by a code transformation is to regenerate source code from abstract syntax tree (AST). For example, in a feature renaming refactoring, except for the names of the feature to be renamed, all other code formatting should be preserved, e.g., indentation, letter case, comments.
Originally, AST doesn't need to store tokens, for their abstract information is implicitly included in non-terminal AST nodes. In order to perform the above code transformation, we need to modify the parser to remember all the formatting information as well to enable regenerating the source code back, hence the name, roundtrip parser.
How to remember comments, separators and text
To make roundtrip code regeneration possible, the following things have to be maintained in AST nodes so when those AST nodes are processed later, correct formatting information can be retrieved:
- Case information of identifiers such as class/feature/local names
- Case information of keywords
- Comments
- Spaces such as normal space, tab, newline characters.
For identifiers, it is easy to store case information of identifiers into AST nodes as identifiers are represented as a node in AST. For keywords, because originally, there is no AST nodes representing them, one easy solution is to add some keyword nodes into AST. For example, in AST node for if statement, some keyword sub nodes can be added for keyword "if", "then", "end". Keywords can only appear at fixed locations in an AST node, so there is no ambiguity of their order. By fixed location, I mean, for example, in an AST node for if statement, the keyword "if" can only appear as the first token, the "end" keyword can only appear as the last token.
For comments and spaces, it's even tricker,
In an Eiffel file, comments are defined by the following regular expression:
"--".*
And separators are defined by:
[ \t\r]+ \n+
We define that comment and separator are called break. And a leading break of a token is all the comments and separators that occur just before that token.
To remember break information, we add actions to the above regular expressions in eiffel.l. For example:
"--".* { last_leading_break := ast_factory.append_break (last_leading_break) } [ \t\r]+ { last_leading_break := ast_factory.append_break (last_leading_break) } \n+ { last_leading_break := ast_factory.append_break (last_leading_break) }
We need to do some extra work when matching reserved word once because there are regular expressions embedded in the expression to match once.
The reason to use an ast_factory while not just last_leading_break.append (text) is for roundtrip / non-roundtrip mode switch. e.g., if we are in non-roundtrip mode, ast_factory.append_break does nothing.
When we have matched a token, we create an AST node for this token, in which we store last_leading_break and actual text of this token. And then ,we clear last_leading_break so the next token will get new leading break. For example:
[aA][gG][eE][nN][tT] { last_token := TE_AGENT last_keyword_as_value := ast_factory.new_keyword_as (TE_AGENT, text, last_leading_break, line, column,position,5) ast_factory.clear_last_leading_break (last_leading_break) }
So, all the information of an token will be stored in an AST node.
Note about leading and trailing break In this design, I use leading break, and in a tool BANG2CREATE included in Gobo, trailing break is used. The reason to use leading break here is to avoid changing every regular expression in eiffel.l (The Gobo lex file is designed to deal with trailing break). And also, both leading and trailing break are not good. See section Alternative Design.
As mentioned before, a token, which is a terminal, will appear at leaf nodes in an AST tree. So for roundtrip parser, we need to store some extra information of token into AST tree hierarchy.
The following is the original class hierarchy of part of the AST tree:
And we modify it into:
KEYWORD_AS is for keywords and SIMBOL_AS is for Eiffel simbols is such as >,=.
We add 2 attributes into LEAF_AS: `text' and `leading_break'. `text' is used to store the actual literal of the token and `leading_break' is for break that occurs just before the token.
There is one problem left: how to store break that occur after the whole class declaration, that is:
class A feature … end -- Comments and separators here are called trailing breaks.
To solve that, we add an attribute trailing_break into CLASS_AS. And after parsing a class, we set it with last_leading_break.
Until now, all the information are stored for later use by the parser.
How to store information into an AST tree
In order to store addition information for roundtrip ability to AST tree, we modify all the XXX_AS classes which represent non-terminals. We add some attributes into these classes to store information of terminals which appear in leaves of the tree.
For example, the following is part of a rule in eiffel.y:
Declaration_body: TE_COLON Type Assigner_mark_opt Dotnet_indexing { -- Attribute case $$ := ast_factory.new_body_as (Void, $2, $3, Void) feature_indexes := $4 }
We add a parameter named `colon’ into class BODY_AS and we modify ast_factory.new_body_as to add a parameter so the code becomes:
Declaration_body: TE_COLON Type Assigner_mark_opt Dotnet_indexing { -- Attribute case $$ := ast_factory.new_body_as (Void, $2, $3, Void, $1) feature_indexes := $4 }
To do so, we store information of TE_COLON into the AST node to which it belongs.
For recursive rules, we need to store all the tokens between two items in the list generated by the rule. We add an attribute `terminal_list’ whose type is EIFFE_LIST [LEAF_AS] into class EIFFEL_LIST. And in ast_factory, we add some features like:
new_terminal_list (ter1: LEAF_AS): EIFFEL_LIST [LEAF_AS] new_terminal_list2 (ter1: LEAF_AS; ter2:LEAF_AS): EIFFEL_LIST [LEAF_AS] new_terminal_list3 (ter1: LEAF_AS; ter2:LEAF_AS; ter3:LEAF_AS): EIFFEL_LIST [LEAF_AS]
to build list with different number of items.
- new_terminal_list is used when there is only one terminal appears between two items in the list, just like the following example.
- new_terminal_listn is used when there are n terminals appear between two items in the list.
For example:
We changed the following code
New_feature_list: New_feature { $$ := ast_factory.new_eiffel_list_feature_name (counter_value + 1) if $$ /= Void and $1 /= Void then $$.reverse_extend ($1) end } | New_feature { increment_counter } TE_COMMA New_feature_list { $$ := $4 if $$ /= Void and $1 /= Void then $$.reverse_extend ($1) end } ;
into
New_feature_list: New_feature { $$ := ast_factory.new_eiffel_list_feature_name (counter_value + 1) if $$ /= Void and $1 /= Void then $$.reverse_extend ($1) end } | New_feature { increment_counter } TE_COMMA New_feature_list { $$ := $4 if $$ /= Void and $1 /= Void then $$.reverse_extend ($1) $$.reverse_extend_terminals (ast_factory.new_terminal_list($3)) end } ;
In the second rule
New_feature { increment_counter } TE_COMMA New_feature_list
because an terminal TE_COMMA appears between two items, we must store it, so we invoke
$$.reverse_extend_terminals (ast_factory.new_terminal_list($3))
ast_factory.new_terminal_list will return a EIFFEL_LIST [LEAF_AS] object in which $3 is contained. And then we use reverse_extend_terminals to store this object in `terminal_list' in the right order.
Switch between roundtrip and non-roundtrip mode
All the modifications are conducted through ast_factory to ensure when we want to switch between roundtrip mode and non-roundtrip mode, we only need to do:
1. give the parser relative AST factory.
2. set case_sensitive to True (in roundtrip mode) or to False (in non-roundtrip mode).
Alternative Design
In essence, the above method relates a certain break (a leading break, to be more precisely) to an AST node. But sometimes, it is not proper to do this. Always, there is some mismatch between break (comments, separators) and semantic structures.
For example, in the following code
feature -- Access item: STRING
comment `-- Access’ will be attached to token `item’. This is not only inappropriate but also makes it very difficult to extract information like “What is the tag for this feature clause?”.
Another example:
foo is do goo -- Check flag if flag = 3 then ... end end
If we are using a trailing break mechanism, comment `-- Check flag' will be attached to feature `goo'. But this is certainly wrong.
In fact, how to deal with break in AST tree is an open question. Because sometimes it is difficult to tell which AST node should a break be attached to. Here is some words from a parser design:
Comments, newlines, spaces, and indentation in code make it easier for humans to read, but computers just throw these extras away. Virtually every compiler tool just discards comments and whitespace, because they're not really "part of the program" as far as the tool is concerned. The ANTLR C grammar that Jazillian uses is no exception. So we changed the grammar to keep the comments, but rather than have each comment converted to a token (and thus mess up all our other rules), each comment is associated with a particular token in the input. The problem then becomes "what if that token is deleted or changed?"
When a token that's got comments associated with it is deleted or replaced, a new token is chosen (using heuristics to choose a good one) to hold the comments.
The paper "Preserving the Documentary Structure of Source Code in Language-based transformation Tools" by Michael L. Van De Vanter at Sun Laboratories explains why this approach sometimes doesn't work well.
The paper can be found at http://research.sun.com/projects/jackpot/COM.sun.mlvdv.doc.scam_nov01.abs.html
So, there is another way: Store all the tokens (now comments and separators are considered as tokens too) in a linked-list.
And store the terminals into AST nodes just like in the above method, but do not store breaks. e.g., store `feature', `item', `:', `STRING' but do not store new-line, space, `-- Comment' into AST nodes. Breaks are maintained only in the linked-list.
Every time when a visitor or a pretty-printer needs to print the source code, it will access to this linked-list to get information. It is easy to spot the location of every single piece of break.
In case that the AST tree hierarchy is static (we don't allow to change part of the AST tree dynamically, just as the parser is for the moment), we only need to use an ARRAYED_LIST and store an integer index of a terminal in non-terminal AST node to save space.
The drawback is time must be spent in searching this long list.
But it is more flexible when we want to do some refactoring.
Yet another design
How to store tokens
In this design, we don't attach a break to any token. Instead, we use a a list called `match_list' to store all tokens read, including breaks. In other words, the whole source code is stored in `match_list'.
In AST_FACTORY, we add the following features:
new_match_list (l_size: INTEGER): AST_NODE_LIST extend_match_list (a_list: AST_NODE_LIST; a_match: LEAF_AS)
When the compiler is working on roundtrip mode (which means we give the parse a AST_ROUNDTRIP_FACTORY), it will create a `match_list' before it starts to parse a class, and during parsing, and when a token is read, it will extend the token into this `match_list'. When parsing ends, this `match_list' will be given to the result CLASS_AS object.
How to manage tokens
In parser level, we must decide to which non-terminal structure a terminal (except breaks and redundent keywords, for example "inherit;") belongs.
To do that, in every AST_EIFFEL node, we maintain a list, in which all index of terminals of this non-terminal structure are stored.
Note: If we maintain a list in AST_EIFFEL, some unnecessary memory will be wasted. Because every terminal belongs to AST_EIFFEL, and certainly, it's meaningless to maintain a list of terminal index in a terminal object. So a better way to do is split the hierarchy of AST_EIFFEL into NON_LEAF_AS and LEAF_AS, and only maintain a list in NON_LEAF_AS.
So, in AST_EIFFEL, we add the following:
terminal_index: TERMINAL_INDEX_LIST set_terminal_index (a_tuple: TUPLE)
and in AST_FACTORY, we add:
set_terminal_index (a_as: AST_EIFFEL a_tuple: TUPLE)
and in eiffel.y, we do things like:
Obsolete: -- Empty -- { $$ := Void } |TE_OBSOLETE Manifest_string { $$ := $2 ast_factory.set_terminal_index ($$, [$1]) } ;
The reason to use TUPLE is to deal with different number of terminals in a uniform.
For recursive rules, we have:
Index_terms: Index_value { $$ := ast_factory.new_eiffel_list_atomic_as (counter_value + 1) if $$ /= Void and $1 /= Void then $$.reverse_extend ($1) end } |Index_value { increment_counter } TE_COMMA Index_terms { $$ := $4 if $$ /= Void and $1 /= Void then $$.reverse_extend ($1) ast_factory.reverse_extend_item_index ($$, [$3]) end } ...
The line
ast_factory.reverse_extend_item_index ($$, [$3])
will store the information of the TE_COMMA terminal in reversed order. and `reverse_extend_item_index' is used to store information of terminals that appear between every two items in a list structure.
How to regenerate code
Use a vistor to regenerate code. This visitor will recursively process the AST tree with the help of `match_list' and index information stored in every non-terminal AST node.
The visitor works like this:
the visitor has some index:
last_processed_token_index, next_token_index
last_processed_token_index shows that last token in `match_list' that has been processed. And `next_token_index' shows the next token that will be processed.
For example:
As the picture shown, the `end' token has been processed, and now we are about to process the `feature' token.
Before we process the `feature' token, the visitor will uses these two index to check if there are some tokens in between these index that have not been processed. If so, the visitor will process those tokens first, and then process the `feature' token. In this way, we will not lost a single token.