Design of Roundtrip Parser

Revision as of 00:15, 9 August 2007 by Jasonw (Talk | contribs) (How to remember comments, separators and text)

Introduction to roundtrip parser

Source code transformation has a lot of applications, within which, refactoring and pretty-printing are included. Often, the basic function needed by a code transformation is to regenerate source code from abstract syntax tree (AST). For example, in a feature renaming refactoring, except for the names of the feature to be renamed, all other code formatting should be preserved, e.g., indentation, letter case, comments.

Originally, AST doesn't need to store tokens, for their abstract information is implicitly included in non-terminal AST nodes. In order to perform code transformation, we need to modify the parser to remember all the formatting information as well to enable regenerating the source code back, hence the name, roundtrip parser.

How to remember comments, separators and text

To make roundtrip code regeneration possible, the following things have to be maintained in AST nodes so when those AST nodes are processed later, correct formatting information can be retrieved:

  • Case information of identifiers such as class/feature/local names
  • Case information of keywords
  • Comments
  • Spaces such as normal space, tab, newline characters. We define that comment and separator are called break

For identifiers, it is easy to store case information of identifiers into AST nodes because identifiers are already represented as nodes in AST.

For keywords, because originally, there is no AST nodes representing them, one easy solution is to add some keyword nodes into AST. For example, in AST node for if statement, some keyword sub nodes can be added for keyword "if", "then", "end". Keywords can only appear at fixed locations in an AST node, so there is no ambiguity of their order. By fixed location, I mean, for example, in an AST node for if statement, the keyword "if" can only appear as the first token, the "end" keyword can only appear as the last token.

For comments and spaces, it's even tricker for two reasons:

  • They are not representable in original AST nodes. This prevents the way we deal with keywords from being used here.
  • Their appearances are not in a normal form. for example, in a if statement, you can use a single space as delimiter between the "if" keyword, the conditional expression and the "then" keyword, but you can also use any combination of space, tab, newline as delimiters. Another example is that between two statement, there can be a comment, or not.

So, how to attach break information into AST nodes? A popular method is to attach break information to the AST node appear just before (this kind of break is called a trailing break) or after (this kind of break is called a leading break) it. For example, in the following code, comments are attached to the AST node that appears just after:

feature -- Access
    item: STRING

comment `-- Access’ will be attached to token `item’. This is not only inappropriate but also makes it very difficult to extract information like “What is the tag for this feature clause?”.

And in the following example, comments are attached to AST nodes that appear just before:

    -- Comment 1
    -- Comment 2

"-- Comment 1" is attached to the AST node representing statement1, and "-- Comment 2" is attached to the AST node representing statment2. This method can be applied to any kind of break, i.e., we can consider that the newline character after statement1 and the indentation before comment 1 are also attached the AST node of statement1.

One issue of this method is that it may result in intentionally wrong attachments. Consider the follow snippet:

foo is
           -- Check flag
       if flag = 3 then

If we are using a trailing break mechanism, comment `-- Check flag' will be attached to feature `goo'. But this is certainly wrong.

In fact, how to deal with break in AST tree is an open question. Because sometimes it is difficult to tell which AST node should a break be attached to. Here is some words from a parser design:

   Comments, newlines, spaces, and indentation in code make it easier for humans to read, 
   but computers just throw these extras away. Virtually every compiler tool just discards
   comments and whitespace, because they're not really "part of the program" as far as the
   tool is concerned. The ANTLR C grammar that Jazillian uses is no exception. So we changed
   the grammar to keep the comments, but rather than have each comment converted to a token
   (and thus mess up all our other rules), each comment is associated with a particular token
   in the input. The problem then becomes "what if that token is deleted or changed?" 
   When a token that's got comments associated with it is deleted or replaced, a new token is
   chosen (using heuristics to choose a good one) to hold the comments. 
   The paper "Preserving the Documentary Structure of Source Code in Language-based 
   transformation Tools" by Michael L. Van De Vanter at Sun Laboratories explains why this
   approach sometimes doesn't work well.

The paper can be found at

So, there is another way: Store all the tokens (now comments and separators are considered as tokens too) in a linked-list. For code:

foo is
        -- Comment

We have the linked-list:

Match list.jpg

And store the terminals into AST nodes just like in the above method, but do not store breaks. e.g., store "foo", "item", "is", "do" and "end", but do not store new-line, space, `-- Comment' into AST nodes. Breaks are maintained only in the linked-list. A token in the linked-list can be accessed by its position index. For example, 100 for "foo" and 106 for "end".

Every time when a visitor or a pretty-printer needs to print the source code, it will access to this linked-list to get information. It is easy to spot the location of every single piece of break.

In case that the AST tree hierarchy is static (we don't allow to change part of the AST tree dynamically, just as the parser is for the moment), we only need to use an ARRAYED_LIST and store an integer index of a terminal in non-terminal AST node to save space. For example, for the "do" keyword in the above code snippet, we create an AST node, but, instead of storing the text of the keyword in this AST node, we only store an index of the "do" keyword token which is 104 in the linked-list.

The drawback is time must be spent in searching this long list.

But it is more flexible when we want to do some refactoring.

Implementation details

To implement the roundtrip function described above, we need to modify AST nodes and Eiffel scanner/parser such that:

  • Added nodes representing keywords in existing AST nodes
  • For every token recognized by the scanner, we insert this token (it can be either a normal token or a break token) into the linked-list. If it's a normal token, we attach the index of it in the linked-list to the related AST node.

Add AST nodes for keywords

We use a class KEYWORD_AS to represent a keyword AST node, and then, for every kind of AST structure which contains keywords, we add keyword attributes. For example, class IF_AS is the AST class for if statement, its sketch is like this:

class IF_AS
feature -- Attributes
    condition: EXPR_AS
        -- Conditional expression
        -- Compound, i.e., statement list in this if-clause.

After adding keyword attributes, it will look like this:

class IF_AS
feature -- Attributes
    condition: EXPR_AS
        -- Conditional expression
        -- Compound, i.e., statement list in this if-clause.
    if_keyword: KEYWORD_AS
        -- Keyword AST for "if"
    then_keyword: KEYWORD_AS
        -- Keyword AST for "then"
    end_keyword: KEYWORD_AS
        -- Keyword AST for "end"

Staff token linked-list

For every terminal token we meet (including break token), we need to insert it into the token linked-list. Also, we need to attach the index of that token into the AST node. For example, for the following snippet:

if expression then

We have the token linked-list:

If clause match list.jpg

and the AST node for the if statement:

If clause ast.jpg

From the above, we know that if keyword is at position 100 in the token linked-list, and the space break after if keyword is at position 101, and so on. Here we assume that the condition expression is only comprised of one terminal token, so it's at position 102. The compound part if the if statement can contain many terminal tokens, so it's marked as 106~n.

When the above AST for if statement is visited in flavor of code regeneration, we know that the original text of the if keyword can be retrieved from the token linked-list at position 100.

In an Eiffel file, separators are defined by:

   [ \t\r\n]+

To remember this break information, we add actions to the above regular expressions in eiffel.l (Eiffel lexer description file). For example:

   [ \t\r\n]+    { ast_factory.create_break_as (Current)  }

This means that whenever separators are met, we create an AST node of type BREAK_AS:

create_break_as (a_scn: EIFFEL_SCANNER) is
		-- New BREAK_AS node
		b_as: BREAK_AS
			-- Increase token index `match_list_count'.
		create b_as.make (a_scn.text, a_scn.line, a_scn.column, a_scn.position, a_scn.text_count)
			-- Create an AST node for the break.
		b_as.set_index (match_list_count)
			-- Attach the token index `match_list_count' to the newly created AST node `b_as'.
		extend_match_list (b_as)
			-- Insert the AST node `b_as' into the token linked-list.

For keywords, similar code patterns presents.

We need to do some extra work when matching reserved word once because there are regular expressions embedded in the expression to match once.

Until now, all the information are stored for later use by the parser.

Switch between roundtrip and non-roundtrip mode

The involvement of the roundtrip function will definitely slow down the parsing process because we more time is spent in creating new objects, copying text and maintaining the token linked-list. To minimize the performance overhead, the Eiffel parser is designed to be able to work on either roundtrip mode or non-roundtrip mode.

All the things described above is roundtrip mode. In roundtrip mode, all work related to source code preserving is done, and at the end of parsing, we get two things:

(1) An AST node annotated with token indexes in token linked-list

(2) A token linked-list in which every token literal is stored

The major performance penalty (compared with non roundtrip parser) are:

(1) creating of AST nodes to represent keywords

(2) Maintaining terminal token indexes

(3) creating nodes (including copying text from parser) which are to be inserted into token linked-list.

The major performance penalty of roundtrip parser (compiled to the original parser without roundtrip support) comes from (3). And because for a normal compilation, roundtrip support is not necessary. In fact, roundtrip facility is only needed in code transformation cases. Based on this, a non-roundtrip mode is introduced, which only does (1) and (2) described above. After a parse in non-roundtrip mode, we only get an AST node annotated with token indexes in token linked-list.

To achieve this, all roundtrip related modifications are encapsulated in a deferred class AST_FACTORY. By giving the parser different AST_FACTORY implementations, namely, roundtrip factory and non-roundtrip factory, we can switch between these two modes easily.

The reason why the keyword AST nodes are always created and the indexes in token linked-list are always maintained even in non-roundtrip mode is that to generate a token linked-list is easy -- only scanning through the source code is enough, which means a lexer can do all the job to create the token linked-list, no parser is needed. And lexing is much faster than parsing. So by maintaining keyword AST nodes and the token indexes in non-roundtrip mode, we can get the full power of roundtrip parser later by just scanning the source code to generate the token linked-list.

How to regenerate code

To regenerate code from AST nodes and associated token linked-list, we need to use a visitor. This visitor will recursively process the AST tree with the help of the token linked-list and index information stored in every terminal AST node. From the AST node, we can arrive every terminal tokens originally in the source code except for break tokens. So for non-break tokens, we use the index stored in AST nodes to get the original text of that token from the token linked-list, and then we just print out the text. But before we print the text of a non-break token, we need to check in the token linked-list if before current non-break token, there is some break tokens that are not processed. If so, we print out those break tokens first.

Let's show how it works using the above if statement example.

Again, we have the token linked-list:

If clause match list.jpg

And the AST node:

If clause ast.jpg

Suppose that the condition expression node has been processed, which means that the expression sub node has been visited and the the text of the condition expression has been printed out, as the item at position 102 in the token linked-list has been processed as well. The next sub node to visit is the then keyword node. And item at position 104 in the token linked-list is associated with this then keyword node. But before we process this node, we check if in the token linked-list, there are unprocessed items before position 104. And in our example, item at position 103 is not processed. So we process that item first, meaning printing out the break token directly. And then we continue to process the then keyword node.

One thing to note is that after processing the last end keyword node in a class AST node, we also need to check if there are still items left in the token linked-list because end class comments and breaks can appear after the last end keyword. Consider the following code:

class A
end -- Comment here.
  -- Yet another comment.

The last comment section needs to be taken care of after processing the end keyword node.

That's how the original code is generated.