Design of Roundtrip Parser

 
== Introduction to roundtrip parser ==

Source code transformation has many applications, including refactoring and pretty-printing. Often, the basic function a code transformation needs is to regenerate source code from the abstract syntax tree (AST). For example, in a feature-renaming refactoring, everything except the names of the feature to be renamed should be preserved, e.g., indentation, letter case and comments.

Originally, an AST doesn't need to store tokens, for their abstract information is implicitly included in non-terminal AST nodes. In order to perform code transformation, we need to modify the parser to remember all the formatting information as well, so that the source code can be regenerated; hence the name, roundtrip parser.
 
== How to remember comments, separators and text ==

To make roundtrip code regeneration possible, the following things have to be maintained in AST nodes, so that when those AST nodes are processed later, the correct formatting information can be retrieved:

* Case information of identifiers such as class/feature/local names
* Case information of keywords
* Comments
* Spaces such as normal space, tab and newline characters. We define that comments and separators are collectively called '''breaks'''.
  
For identifiers, it is easy to store case information into AST nodes because identifiers are already represented as nodes in the AST.
  
 
For keywords, because originally there are no AST nodes representing them, one easy solution is to add keyword nodes to the AST. For example, in the AST node for an ''if'' statement, keyword sub-nodes can be added for the keywords "if", "then" and "end". Keywords can only appear at fixed locations in an AST node, so there is no ambiguity about their order. By fixed location, I mean, for example, that in an AST node for an ''if'' statement, the keyword "if" can only appear as the first token and the "end" keyword can only appear as the last token.

For comments and spaces, it's even trickier, for two reasons:

* They are not representable in the original AST nodes. This prevents the way we deal with keywords from being used here.
* Their appearances are not in a normal form. For example, in an ''if'' statement, you can use a single space as the delimiter between the "if" keyword, the conditional expression and the "then" keyword, but you can also use any combination of space, tab and newline as delimiters. Another example is that between two statements, there can be a comment, or not.
  
 
So, how do we attach break information to AST nodes? A popular method is to attach break information to the AST node that appears just before it (this kind of break is called a '''trailing break''') or just after it (a '''leading break'''). For example, in the following code, comments are attached to the AST node that appears just after them:

<code>[eiffel, N]
feature -- Access
    item: STRING
</code>
Comment `-- Access' will be attached to token `item'. This is not only inappropriate but also makes it very difficult to extract information like "What is the tag for this feature clause?".

And in the following example, comments are attached to the AST nodes that appear just before them:

<code>[eiffel, N]
statement1
    -- Comment 1
statement2
    -- Comment 2
</code>

"-- Comment 1" is attached to the AST node representing statement1, and "-- Comment 2" is attached to the AST node representing statement2. This method can be applied to any kind of break, i.e., we can consider that the newline character after statement1 and the indentation before "-- Comment 1" are also attached to the AST node of statement1.

One issue of this method is that it may result in unintended wrong attachments. Consider the following snippet:

<code>[eiffel, N]
foo is
    do
        goo
            -- Check flag
        if flag = 3 then
            ...
        end
    end
</code>

If we are using a trailing break mechanism, comment `-- Check flag' will be attached to feature `goo'. But this is certainly wrong.

In fact, how to deal with breaks in an AST tree is an open question, because sometimes it is difficult to tell which AST node a break should be attached to. Here are some words from another parser design:

    Comments, newlines, spaces, and indentation in code make it easier for humans to read,
    but computers just throw these extras away. Virtually every compiler tool just discards
    comments and whitespace, because they're not really "part of the program" as far as the
    tool is concerned. The ANTLR C grammar that Jazillian uses is no exception. So we changed
    the grammar to keep the comments, but rather than have each comment converted to a token
    (and thus mess up all our other rules), each comment is associated with a particular token
    in the input. The problem then becomes "what if that token is deleted or changed?"

    When a token that's got comments associated with it is deleted or replaced, a new token is
    chosen (using heuristics to choose a good one) to hold the comments.

    The paper "Preserving the Documentary Structure of Source Code in Language-based
    Transformation Tools" by Michael L. Van De Vanter at Sun Laboratories explains why this
    approach sometimes doesn't work well.

The paper can be found at http://research.sun.com/projects/jackpot/COM.sun.mlvdv.doc.scam_nov01.abs.html

So, there is another way: store all the tokens (now comments and separators are considered tokens too) in a linked list. For the code:

<code>[eiffel, N]
foo is
        -- Comment
    do
    end
</code>

We have the linked list:

[[Image:Match list.jpg]]

We store the terminals into AST nodes just like in the above method, but do not store breaks, e.g., we store "foo", "is", "do" and "end", but do not store the newlines, spaces or `-- Comment' in AST nodes. Breaks are maintained only in the linked list. A token in the linked list can be accessed by its position index, for example, 100 for "foo" and 106 for "end".
In case the AST tree hierarchy is static (we don't allow changing part of the AST tree dynamically, just as the parser works for the moment), we only need to use an ARRAYED_LIST and store an integer index of a terminal in the non-terminal AST node, to save space. From now on, we'll call this list a '''match list'''. For example, for the "do" keyword in the above code snippet, we create an AST node, but instead of storing the text of the keyword in this AST node, we only store the index of the "do" keyword token in the match list, which is 104.
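The idea can be modelled in a few lines of Python (illustrative only; names and the exact index layout are assumptions based on the positions quoted above): the match list owns every token text, breaks included, and terminal AST nodes store only integer indexes into it.

```python
# A match list for the snippet above: one entry per token, breaks included.
# Index layout assumed from the text: "foo" at 100, "do" at 104, "end" at 106.
match_list = {
    100: "foo", 101: " ", 102: "is",
    103: "\n        -- Comment\n    ",   # break token holding the comment
    104: "do", 105: "\n    ", 106: "end",
}

class KeywordNode:
    """A terminal AST node that stores an index instead of the token text."""
    def __init__(self, index):
        self.index = index

    def text(self, match_list):
        return match_list[self.index]    # original literal, case preserved

do_node = KeywordNode(104)               # the "do" keyword from the example
```

Concatenating the match list entries in index order reproduces the original source text exactly, which is what makes the roundtrip possible.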
  
 
== Implementation details ==

To implement the roundtrip function described above, we need to modify the AST nodes and the Eiffel scanner/parser in the following way:

* Add nodes representing keywords to existing AST nodes.
* For every token recognized by the scanner, insert the token (it can be either a normal token or a break token) into the match list. If it is a normal token, attach its index in the match list to the related AST node.
  
 
=== Add AST nodes for keywords ===

We use a class KEYWORD_AS to represent a keyword AST node, and then, for every kind of AST structure that contains keywords, we add keyword attributes. For example, class IF_AS is the AST class for the ''if'' statement; its sketch looks like this:

<code>[eiffel, N]
class IF_AS

feature -- Attributes

    condition: EXPR_AS
            -- Conditional expression

    compound: EIFFEL_LIST [INSTRUCTION_AS]
            -- Compound, i.e., statement list in this if-clause.

end
</code>
  
After adding keyword attributes, it will look like this:

<code>[eiffel, N]
class IF_AS

feature -- Attributes

    condition: EXPR_AS
            -- Conditional expression

    compound: EIFFEL_LIST [INSTRUCTION_AS]
            -- Compound, i.e., statement list in this if-clause.

    if_keyword: KEYWORD_AS
            -- Keyword AST for "if"

    then_keyword: KEYWORD_AS
            -- Keyword AST for "then"

    end_keyword: KEYWORD_AS
            -- Keyword AST for "end"

end
</code>

=== Stuff the match list ===
  
For every terminal token we meet (including break tokens), we need to insert it into the match list. Also, we need to attach the index of that token to the corresponding AST node. For example, take the following snippet:

<code>[eiffel, N]
if expression then
    compound
end
</code>

We have the match list:

[[image:if_clause_match_list.jpg]]

and the AST node for the ''if'' statement:

[[image:if_clause_ast.jpg]]

From the above, we know that the '''if''' keyword is at position 100 in the match list, the space break after the '''if''' keyword is at position 101, and so on. Here we assume that the condition expression comprises only one terminal token, so it is at position 102. The compound part of the ''if'' statement can contain many terminal tokens, so it is marked as 106~n.

When the above AST for the ''if'' statement is visited for code regeneration, we know that the original text of the ''if'' keyword can be retrieved from the match list at position 100.

In an Eiffel file, separators are defined by the following regular expression:

    [ \t\r\n]+

To remember this break information, we add actions to the above regular expression in eiffel.l (the Eiffel lexer description file). For example:

    [ \t\r\n]+    { ast_factory.create_break_as (Current) }

This means that whenever separators are met, we create an AST node of type BREAK_AS:

<code>[eiffel, N]
create_break_as (a_scn: EIFFEL_SCANNER) is
        -- New BREAK_AS node.
    local
        b_as: BREAK_AS
    do
        increase_match_list_count
            -- Increase token index `match_list_count'.
        create b_as.make (a_scn.text, a_scn.line, a_scn.column, a_scn.position, a_scn.text_count)
            -- Create an AST node for the break.
        b_as.set_index (match_list_count)
            -- Attach the token index `match_list_count' to the newly created AST node `b_as'.
        extend_match_list (b_as)
            -- Insert the AST node `b_as' into the match list.
    end
</code>

The reason to go through '''ast_factory''' instead of creating the node directly is the roundtrip / non-roundtrip mode switch, e.g., if we are in non-roundtrip mode, '''ast_factory.create_break_as''' does nothing.

For keywords, similar code patterns are present. We need to do some extra work when matching the reserved word ''once'', because there are regular expressions embedded in the expression that matches ''once''.

Until now, all the information is stored for later use by the parser.
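The scanner-side bookkeeping can be sketched as follows. This is an illustrative Python model with invented names and a toy token grammar, not the actual eiffel.l code: every token, break or not, is appended to the match list, and the resulting leaf node remembers its own index there.

```python
# Illustrative sketch: the scanner inserts every token into the match list.
import re

SEPARATOR = re.compile(r"[ \t\r\n]+")
WORD = re.compile(r"[A-Za-z_]\w*")

class LeafNode:
    def __init__(self, text, index, is_break):
        self.text = text          # original literal, case preserved
        self.index = index        # position of this token in the match list
        self.is_break = is_break  # True for separators/comments

def scan_to_match_list(source):
    match_list, pos = [], 0
    while pos < len(source):
        m = SEPARATOR.match(source, pos) or WORD.match(source, pos)
        if m is None:                         # any other single character
            text, is_break = source[pos], False
        else:
            text, is_break = m.group(), m.re is SEPARATOR
        match_list.append(LeafNode(text, len(match_list), is_break))
        pos += len(text)
    return match_list

ml = scan_to_match_list("if expression then\n    compound\nend")
# non-break tokens: if, expression, then, compound, end
```

Note how concatenating the whole list gives back the source verbatim; the AST only needs to keep the indexes of the non-break entries.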
  
=== Switch between roundtrip and non-roundtrip mode ===

The involvement of the roundtrip function will definitely slow down the parsing process, because more time is spent creating new objects, copying text and maintaining the match list. To minimize the performance overhead, the Eiffel parser is designed to be able to work in either roundtrip mode or non-roundtrip mode.

Everything described above is roundtrip mode. In roundtrip mode, all work related to source code preservation is done, and at the end of parsing, we get two things:

(1) An AST annotated with token indexes into the match list

(2) A match list in which every token literal is stored

The performance penalties (compared with the non-roundtrip parser) are:

(1) creating AST nodes to represent keywords

(2) maintaining terminal token indexes

(3) creating nodes (including copying text from the parser) to be inserted into the match list

The major performance penalty of the roundtrip parser (compared to the original parser without roundtrip support) comes from (3). And for a normal compilation, roundtrip support is not necessary; the roundtrip facility is only needed in code transformation cases. Based on this, a non-roundtrip mode is introduced, which only does (1) and (2) described above. After a parse in non-roundtrip mode, we only get an AST annotated with token indexes into the match list.

To achieve this, all roundtrip related modifications are encapsulated in a deferred class AST_FACTORY. By giving the parser different AST_FACTORY implementations, namely a roundtrip factory and a non-roundtrip factory, we can switch between these two modes easily.

The reason why the keyword AST nodes are always created and the indexes into the match list are always maintained, even in non-roundtrip mode, is that generating a match list is easy: scanning through the source code is enough, which means a lexer can do all the work to create the match list; no parser is needed. And lexing is much faster than parsing. So by maintaining keyword AST nodes and the token indexes in non-roundtrip mode, we can get the full power of the roundtrip parser later by just scanning the source code again to regenerate the match list.
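The factory-based mode switch can be sketched like this. The Python names are invented for illustration; in the real compiler the base class is the deferred AST_FACTORY and the parser simply calls the factory it was given, never knowing which mode it runs in.

```python
# Illustrative sketch of the mode switch: the parser calls the factory for
# every roundtrip-related action; only the roundtrip factory does real work.
class AstFactory:
    """Non-roundtrip mode: roundtrip hooks are no-ops."""
    def create_break_as(self, match_list, text):
        pass                          # do nothing: no text copied, no node made

class RoundtripAstFactory(AstFactory):
    """Roundtrip mode: record every break token in the match list."""
    def create_break_as(self, match_list, text):
        match_list.append(text)       # keep the break literal for regeneration

match_list = []
RoundtripAstFactory().create_break_as(match_list, "-- Comment")
AstFactory().create_break_as(match_list, "-- Dropped")
# only the break recorded in roundtrip mode ends up in the match list
```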
  
== How to regenerate code ==

To regenerate code from the AST and the associated match list, we use a visitor. This visitor recursively processes the AST tree with the help of the match list and the index information stored in every terminal AST node. From the AST, we can reach every terminal token originally in the source code except for the break tokens. So for non-break tokens, we use the index stored in the AST node to get the original text of that token from the match list, and then we simply print out the text. But before we print the text of a non-break token, we need to check in the match list whether there are unprocessed break tokens before the current non-break token. If so, we print out those break tokens first.

Let's show how it works using the above ''if'' statement example.

Again, we have the match list:

[[image:if_clause_match_list.jpg]]

And the AST node:

[[image:if_clause_ast.jpg]]

Suppose that the condition expression node has been processed, which means that the expression sub-node has been visited, the text of the condition expression has been printed out, and the item at position 102 in the match list has been processed as well. The next sub-node to visit is the '''then''' keyword node, and the item at position 104 in the match list is associated with this '''then''' keyword node. But before we process this node, we check whether there are unprocessed items in the match list before position 104. In our example, the item at position 103 has not been processed, so we process that item first by printing out the break token directly, and then we continue with the '''then''' keyword node.

One thing to note is that after processing the last '''end''' keyword node in a class AST node, we also need to check whether there are still items left in the match list, because end-of-class comments and breaks can appear after the last '''end''' keyword. Consider the following code:

<code>[eiffel, N]
class A
feature
end -- Comment here.

  -- Yet another comment.
</code>
 
The last comment section needs to be taken care of after processing the '''end''' keyword node.

A quick way to generate source code is to simply print out every item in the match list, in order. This is suitable in cases where no code transformation is needed.

That's how the original code is regenerated.
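The visitor's break-flushing logic described above can be modelled in a few lines of Python (illustrative only; names, the tuple encoding, and the example index layout are assumptions, not the compiler's actual code):

```python
# Illustrative sketch of regeneration: before emitting each non-break token
# the visitor reaches, flush any unprocessed break tokens that precede it
# in the match list; flush trailing breaks after the last token.
def regenerate(match_list, visited_indexes):
    """match_list: list of (text, is_break) pairs, in source order.
    visited_indexes: indexes of non-break tokens, in AST visiting order."""
    out, cursor = [], 0
    for idx in visited_indexes:
        while cursor < idx:                  # unprocessed items before idx
            text, is_break = match_list[cursor]
            if is_break:
                out.append(text)             # print pending breaks first
            cursor += 1
        out.append(match_list[idx][0])       # then the token itself
        cursor = idx + 1
    for text, is_break in match_list[cursor:]:
        if is_break:
            out.append(text)                 # trailing breaks after "end"
    return "".join(out)

ml = [("if", False), (" ", True), ("expression", False), (" ", True),
      ("then", False), ("\n    ", True), ("compound", False),
      ("\n", True), ("end", False), (" -- Comment here.\n", True)]
source = regenerate(ml, [0, 2, 4, 6, 8])     # visit if, expression, then, ...
```

Because every break survives in the match list, the output is byte-for-byte the original source; a renaming refactoring would only swap the text of the affected identifier entries before regenerating.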

''Latest revision as of 23:49, 8 August 2007.''



To make roundtrip code regeneration possible, the following things have to be maintained in AST nodes so when those AST nodes are processed later, correct formatting information can be retrieved:

  • Case information of identifiers such as class/feature/local names
  • Case information of keywords
  • Comments
  • Spaces such as normal space, tab, newline characters. We define that comment and separator are called break

For identifiers, it is easy to store case information of identifiers into AST nodes because identifiers are already represented as nodes in AST.

For keywords, because originally, there is no AST nodes representing them, one easy solution is to add some keyword nodes into AST. For example, in AST node for if statement, some keyword sub nodes can be added for keyword "if", "then", "end". Keywords can only appear at fixed locations in an AST node, so there is no ambiguity of their order. By fixed location, I mean, for example, in an AST node for if statement, the keyword "if" can only appear as the first token, the "end" keyword can only appear as the last token.

For comments and spaces, it's even tricker for two reasons:

  • They are not representable in original AST nodes. This prevents the way we deal with keywords from being used here.
  • Their appearances are not in a normal form. for example, in a if statement, you can use a single space as delimiter between the "if" keyword, the conditional expression and the "then" keyword, but you can also use any combination of space, tab, newline as delimiters. Another example is that between two statement, there can be a comment, or not.

So, how to attach break information into AST nodes? A popular method is to attach break information to the AST node appear just before or after it. For example, in the following code, comments are attached to the AST node that appears just after:

feature -- Access
    item: STRING

comment `-- Access’ will be attached to token `item’. This is not only inappropriate but also makes it very difficult to extract information like “What is the tag for this feature clause?”.

And in the following example, comments are attached to AST nodes that appear just before:

statement1
    -- Comment 1
statement2
    -- Comment 2

"-- Comment 1" is attached to the AST node representing statement1, and "-- Comment 2" is attached to the AST node representing statment2. This method can be applied to any kind of break, i.e., we can consider that the newline character after statement1 and the indentation before comment 1 are also attached the AST node of statement1.


One issue of this method is that it may result in intentionally wrong attachments. Consider the follow snippet:

foo is
   do
       goo
           -- Check flag
       if flag = 3 then
          ...
       end
   end

If we are using a trailing-break mechanism, the comment `-- Check flag' will be attached to feature `goo'. But this is certainly wrong: the comment clearly describes the if statement that follows it.

In fact, how to deal with breaks in an AST is an open question, because it is sometimes difficult to tell which AST node a break should be attached to. Here is a quote from a parser design document:

   Comments, newlines, spaces, and indentation in code make it easier for humans to read, 
   but computers just throw these extras away. Virtually every compiler tool just discards
   comments and whitespace, because they're not really "part of the program" as far as the
   tool is concerned. The ANTLR C grammar that Jazillian uses is no exception. So we changed
   the grammar to keep the comments, but rather than have each comment converted to a token
   (and thus mess up all our other rules), each comment is associated with a particular token
   in the input. The problem then becomes "what if that token is deleted or changed?" 
   When a token that's got comments associated with it is deleted or replaced, a new token is
   chosen (using heuristics to choose a good one) to hold the comments. 
   The paper "Preserving the Documentary Structure of Source Code in Language-based 
   transformation Tools" by Michael L. Van De Vanter at Sun Laboratories explains why this
   approach sometimes doesn't work well.

The paper can be found at http://research.sun.com/projects/jackpot/COM.sun.mlvdv.doc.scam_nov01.abs.html

So, there is another way: store all the tokens (now comments and separators are considered tokens too) in a linked list. For the code:

foo is
        -- Comment
    do
    end

We have the linked-list:

[[Image:Match list.jpg]]


And we store the terminals in AST nodes just as in the method above, but we do not store breaks there. E.g., we store "foo", "is", "do" and "end", but we do not store the newline, the spaces or `-- Comment' in AST nodes. Breaks are maintained only in the linked list. A token in the linked list can be accessed by its position index, for example, 100 for "foo" and 106 for "end".


Every time a visitor or a pretty-printer needs to print the source code, it accesses this linked list to get the information. It is easy to spot the location of every single piece of break.


In case the AST hierarchy is static (we don't allow parts of the AST to be changed dynamically, which is how the parser works for the moment), we only need to use an ARRAYED_LIST and, to save space, store in each AST node the integer index of a terminal. From now on, we'll call this list the match list. For example, for the "do" keyword in the above code snippet, we create an AST node, but instead of storing the text of the keyword in this AST node, we only store the index of the "do" keyword token in the match list, which is 104.
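The match-list idea can be sketched as follows (a simplified Python illustration, not the Eiffel implementation; all names are hypothetical): every scanned token, break or not, is appended to one arrayed list, and a terminal AST node stores only its integer index back into that list.

```python
class Token:
    """One scanned token: its text and whether it is a break (space, comment...)."""
    def __init__(self, text, is_break):
        self.text = text
        self.is_break = is_break

class MatchList:
    """All scanned tokens, breaks included, in source order."""
    def __init__(self):
        self.items = []
    def extend(self, token):
        self.items.append(token)
        return len(self.items)  # 1-based index of the stored token

class LeafNode:
    """A terminal AST node stores only an index, not the token text."""
    def __init__(self, index):
        self.index = index
    def text(self, match_list):
        # The original text is recovered by looking the token up in the match list.
        return match_list.items[self.index - 1].text

ml = MatchList()
ml.extend(Token("foo", False))
ml.extend(Token(" ", True))
idx = ml.extend(Token("is", False))
is_node = LeafNode(idx)  # the node for "is" remembers only index 3
```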


The drawback is that time must be spent searching this long list, but the approach is more flexible when we want to do some refactoring.

== Implementation details ==

To implement the roundtrip facility described above, we need to modify the AST nodes and the Eiffel scanner/parser as follows:

  • Add nodes representing keywords in existing AST nodes
  • For every token recognized by the scanner, we insert the token (it can be either a normal token or a break token) into the match list. If it is a normal token, we also attach its index in the match list to the related AST node.

=== Add AST nodes for keywords ===

We use the class KEYWORD_AS to represent a keyword AST node, and then, for every kind of AST structure that contains keywords, we add keyword attributes. For example, IF_AS is the AST class for the if statement; its sketch looks like this:

class IF_AS
feature -- Attributes
 
    condition: EXPR_AS
        -- Conditional expression
 
    compound: EIFFEL_LIST [INSTRUCTION_AS]
        -- Compound, i.e., statement list in this if-clause.
end

After adding keyword attributes, it will look like this:

class IF_AS
feature -- Attributes
 
    condition: EXPR_AS
        -- Conditional expression
 
    compound: EIFFEL_LIST [INSTRUCTION_AS]
        -- Compound, i.e., statement list in this if-clause.
 
    if_keyword: KEYWORD_AS
        -- Keyword AST for "if"
 
    then_keyword: KEYWORD_AS
        -- Keyword AST for "then"
 
    end_keyword: KEYWORD_AS
        -- Keyword AST for "end"
end


=== Stuffing the match list ===

For every terminal token we meet (including break tokens), we insert it into the match list. Also, we attach the index of that token to the related AST node. For example, for the following snippet:

if expression then
    compound
end

We have the match list:

[[Image:If clause match list.jpg]]


and the AST node for the if statement:

[[Image:If clause ast.jpg]]

From the above, we know that the "if" keyword is at position 100 in the match list, the space break after the "if" keyword is at position 101, and so on. Here we assume that the condition expression consists of a single terminal token, so it is at position 102. The compound part of the if statement can contain many terminal tokens, so it is marked as 106~n.

When the above AST for the if statement is visited for code regeneration, we know that the original text of the "if" keyword can be retrieved from the match list at position 100.

In an Eiffel file, separators are defined by:

   [ \t\r\n]+

To remember this break information, we add an action to the above regular expression in eiffel.l (the Eiffel lexer description file). For example:

   [ \t\r\n]+    { ast_factory.create_break_as (Current)  }

This means that whenever a separator is matched, we create an AST node of type BREAK_AS:

create_break_as (a_scn: EIFFEL_SCANNER) is
		-- New BREAK_AS node
	local
		b_as: BREAK_AS
	do
		increase_match_list_count
			-- Increase token index `match_list_count'.
 
		create b_as.make (a_scn.text, a_scn.line, a_scn.column, a_scn.position, a_scn.text_count)
			-- Create an AST node for the break.
 
		b_as.set_index (match_list_count)
			-- Attach the token index `match_list_count' to the newly created AST node `b_as'.
 
		extend_match_list (b_as)
			-- Insert the AST node `b_as' into the match list.
	end

For keywords, similar code patterns apply.

We need to do some extra work when matching the reserved word `once', because regular expressions are embedded in the expression that matches `once'.

Up to this point, all the information is stored for later use by the parser.

== Switch between roundtrip and non-roundtrip mode ==

Enabling the roundtrip facility definitely slows down the parsing process, because more time is spent creating new objects, copying text and maintaining the match list. To minimize this performance overhead, the Eiffel parser is designed to work in either roundtrip mode or non-roundtrip mode.


Everything described above is the roundtrip mode. In roundtrip mode, all work related to preserving the source code is done, and at the end of parsing we get two things:

(1) An AST annotated with token indexes into the match list

(2) A match list in which every token literal is stored


The performance penalties (compared with the non-roundtrip parser) are:

(1) Creating AST nodes to represent keywords

(2) Maintaining terminal token indexes

(3) Creating the nodes (including copying text from the parser) that are inserted into the match list.


The major performance penalty of the roundtrip parser (compared to the original parser without roundtrip support) comes from (3). For a normal compilation, roundtrip support is not necessary; the roundtrip facility is only needed in code transformation cases. Based on this, a non-roundtrip mode is introduced, which only does (1) and (2) above. After a parse in non-roundtrip mode, we only get an AST annotated with token indexes into the match list.


To achieve this, all roundtrip-related modifications are encapsulated in a deferred class AST_FACTORY. By giving the parser different AST_FACTORY implementations, namely a roundtrip factory and a non-roundtrip factory, we can switch between the two modes easily.


The reason why the keyword AST nodes are always created and the match-list indexes are always maintained, even in non-roundtrip mode, is that generating a match list is easy: scanning through the source code is enough, so a lexer can do all the work of creating the match list; no parser is needed. And lexing is much faster than parsing. So by maintaining keyword AST nodes and token indexes in non-roundtrip mode, we can later get the full power of the roundtrip parser by just re-scanning the source code to generate the match list.
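The factory-based mode switch can be sketched like this (a hypothetical Python illustration mirroring the role of AST_FACTORY, not the Eiffel code): the base factory only counts tokens, so indexes stay valid, while the roundtrip factory additionally records every token's text.

```python
class AstFactory:
    """Non-roundtrip mode: count tokens so indexes stay valid, store no text."""
    def __init__(self):
        self.match_list_count = 0
    def create_break_as(self, text):
        self.match_list_count += 1  # keep indexes consistent with roundtrip mode
        return None                 # nothing is stored for breaks

class RoundtripAstFactory(AstFactory):
    """Roundtrip mode: additionally keep every token's text in the match list."""
    def __init__(self):
        super().__init__()
        self.match_list = []
    def create_break_as(self, text):
        self.match_list_count += 1
        self.match_list.append(text)
        return self.match_list_count  # index of the break in the match list

f = RoundtripAstFactory()
f.create_break_as(" ")
f.create_break_as("-- a comment\n")
# f.match_list now holds both break texts, in source order.
```

Since both factories keep `match_list_count` in step, a tree parsed in non-roundtrip mode stays compatible with a match list produced later by a plain scan.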

== How to regenerate code ==

To regenerate code from the AST nodes and the associated match list, we use a visitor. This visitor recursively processes the AST with the help of the match list and the index information stored in every terminal AST node. From the AST, we can reach every terminal token originally in the source code except for break tokens. So for non-break tokens, we use the index stored in the AST node to get the original text of that token from the match list, and then we simply print out the text. But before we print the text of a non-break token, we check in the match list whether there are break tokens before the current non-break token that have not yet been processed. If so, we print out those break tokens first.
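The rule "print any unprocessed breaks that precede a token, then print the token" can be sketched as follows (illustrative Python under the same simplified match-list representation as before, not the actual visitor):

```python
def regenerate(leaf_indexes, match_list):
    """Rebuild source text from the indexes stored in terminal AST nodes.

    leaf_indexes: 1-based match-list indexes of non-break tokens, in visit order.
    match_list:   (text, is_break) pairs for every scanned token, in source order.
    """
    out = []
    processed = 0  # everything up to this 1-based index has been emitted
    for index in leaf_indexes:
        # Emit breaks sitting between the last processed token and this one.
        for i in range(processed, index - 1):
            out.append(match_list[i][0])
        out.append(match_list[index - 1][0])
        processed = index
    # Trailing breaks and comments after the last token (e.g. end-of-class comments).
    for i in range(processed, len(match_list)):
        out.append(match_list[i][0])
    return "".join(out)

source = regenerate(
    [1, 3, 5],  # indexes of "if", "flag", "then", as an AST visit would supply them
    [("if", False), (" ", True), ("flag", False),
     (" ", True), ("then", False), ("\n", True)],
)
# source == "if flag then\n": the breaks at positions 2, 4 and 6 are emitted
# just before (or, for the last one, after) their neighbouring tokens.
```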


Let's show how it works using the above if statement example.

Again, we have the match list:

[[Image:If clause match list.jpg]]

And the AST node:

[[Image:If clause ast.jpg]]


Suppose that the condition expression node has been processed, which means that the expression sub-node has been visited, the text of the condition expression has been printed out, and the item at position 102 in the match list has been processed as well. The next sub-node to visit is the "then" keyword node, and the item at position 104 in the match list is associated with this node. But before we process this node, we check whether there are unprocessed items in the match list before position 104. In our example, the item at position 103 has not been processed, so we process that item first by printing out the break token directly, and then we continue with the "then" keyword node.

One thing to note is that after processing the last end keyword node in a class AST node, we also need to check whether there are items left in the match list, because end-of-class comments and breaks can appear after the last end keyword. Consider the following code:

class A
feature
end -- Comment here.
 
  -- Yet another comment.

The last comment section needs to be taken care of after processing the end keyword node.

A quick way to generate the source code is to simply print out every item in the match list. This is suitable in cases where no code transformation is needed.
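In the simplified Python representation used in the sketches above (hypothetical, not the compiler's code), this shortcut is a single concatenation: since the match list holds the text of every scanned token, breaks included, in source order, joining it reproduces the file verbatim.

```python
# match_list holds the text of every scanned token, breaks included, in order.
match_list = ["foo", " ", "is", "\n", "    ", "do", "\n", "    ", "end", "\n"]

# With no transformation applied, the original source is just the concatenation.
source = "".join(match_list)
```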

That's how the original code is generated.