Difference between revisions of "PEG Library"
(New page: Parsing Expression Library implementation for Eiffel.) |
m (→Internal DSL) |
||
(31 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
− | Parsing Expression Library implementation for Eiffel. | + | This page describes the Parsing Expression Library implementation for Eiffel. Information about PEGs can be found here [http://en.wikipedia.org/wiki/Parsing_expression_grammar]. |
+ | |||
+ | ==Basic classes== | ||
+ | All parsers inherit from PEG_ABSTRACT_PEG which defines the common functionalities. The parsers are the same as in the definition of Wikipedia with additional classes like whitespace support. | ||
+ | |||
+ | The parsers are combined to a object hierarchy which defines the grammar. A string can then be parsed via the feature | ||
+ | <code> | ||
+ | parser.parse_string ("Some source") | ||
+ | </code> | ||
+ | on the root object. | ||
+ | |||
+ | ==Internal DSL== | ||
+ | Objects can be combined via features, but the easier way is to use the defined operators. For instance if we want to define the simple grammar | ||
+ | <code> | ||
+ | 'a' 'b' 'c'* | ||
+ | </code> | ||
+ | we will simply write: | ||
+ | <code> | ||
+ | a + b + (-c) | ||
+ | </code> | ||
+ | Where a, b, c are already defined as character parsers parsing the right character (PEG_CHARACTER). The '+' operator concatenates the parsers to a sequence (PEG_SEQUENCE), while the prefix operator '-' wraps c into a one or more parser (PEG_ONE_OR_MORE). | ||
+ | All the operators are: | ||
+ | |||
+ | * binary '+': Sequence concatenation | ||
+ | * binary '|': Choice concatenation | ||
+ | * prefix '+': wraps one or more | ||
+ | * prefix '-': wraps zero or more | ||
+ | |||
+ | |||
+ | Additionally there is the operator '|+' and the '|&' operator which act like the binary '+' operator. In contrast to it, they insert a <code>whitespace*</code> (for '|+') or a <code>whitespace+</code> (for '|&') parser between the two operands. As it is often needed it makes sense to define them as operators. | ||
+ | |||
+ | Be aware of a common mistake in combination with the binary '|'/'+' operators. If for instance you define an identifier as: | ||
+ | <code> | ||
+ | identifier := a_to_z + (- (a_to_z | underscore)) | ||
+ | </code> | ||
+ | If you go on and define two new parsers based on the latter one: | ||
+ | <code> | ||
+ | identifier2 := identifier + a | ||
+ | identifier3 := identifier + a + b | ||
+ | </code> | ||
+ | .. then you won't get the expected result. <code>identifier2</code> reuses the Sequence instance of <code>identifier</code>. So by adding <code>a</code> to it we change the Sequence of identifier and do not generate a new Sequence. identifier3 will then use that corrupted identifier and modifies the <code>identifier</code> even more. Finally <code>identifier2</code> and <code>identifier3</code> point to the same instance: | ||
+ | <code> | ||
+ | (identifier a a b) | ||
+ | </code> | ||
+ | To prevent this problem identifier has to be fixated: | ||
+ | <code> | ||
+ | identifier.fixate | ||
+ | </code> | ||
+ | |||
+ | ==Building a domain model== | ||
+ | A domain model can be directly created while parsing most of the times and doesn't have to be derived from the AST. With this implementation it can be achieved by defining builder agents on the various parser fragments. | ||
+ | To show its workings we will look at an example grammar for a definition of a "list language": | ||
+ | <code> | ||
+ | '(' identifier (',' identifier)* ')' | ||
+ | </code> | ||
+ | We assume that identifier is already defined. | ||
+ | The corresponding parser in the parser syntax would be | ||
+ | |||
+ | <code> | ||
+ | list := open_parenthesis + identifier + (- (comma + identifier)) + close_parenthesis | ||
+ | </code> | ||
+ | |||
+ | No we can define a feature on identifier which builds a list item and one on list which creates a list with the items: | ||
+ | |||
+ | <code> | ||
+ | list.set_behaviour (agent build_list) | ||
+ | identifier.set_behaviour (agent build_list_item) | ||
+ | </code> | ||
+ | The implementation of those are the following: | ||
+ | <code> | ||
+ | build_list_item (a_result: PEG_PARSER_RESULT): PEG_PARSER_RESULT | ||
+ | -- Builds a value attribute | ||
+ | local | ||
+ | l_list_item: LIST_ITEM | ||
+ | do | ||
+ | Result := a_result | ||
+ | create l_list_item.make | ||
+ | if attached {STRING} Result.internal_result.first as l_name then | ||
+ | l_list_item.name := l_name | ||
+ | end | ||
+ | Result.replace_result (l_list_item) | ||
+ | end | ||
+ | </code> | ||
+ | <code> | ||
+ | build_list (a_result: PEG_PARSER_RESULT): PEG_PARSER_RESULT | ||
+ | -- Builds a value attribute | ||
+ | do | ||
+ | Result := a_result | ||
+ | Result.replace_result (Result.internal_result) | ||
+ | end | ||
+ | </code> | ||
+ | |||
+ | == Error Handling == | ||
+ | To fix source code we need to output syntax errors detected in the parsing process. After every parsing with the parse_string feature we get a PEG_PARSING_RESULT with all the information we need. | ||
+ | On the one hand there is the above mentioned internal_result which is a list of domain models (possibly empty). On the other hand the PEG_PARSING_RESULT object tells you if and where an error occurred. | ||
+ | There are two different cases of a failure: either the parser failed on a character expecting an other one and thus flagging the parsing result as a failure. Or it didn't fail but some of the source code has not been parsed, so only a part of the code is valid. | ||
+ | In the first case we can find out where the syntax lies by inquiring longest_match_line_row: | ||
+ | <code> | ||
+ | l_result.longest_match_line_row | ||
+ | </code> | ||
+ | This feature returns the line and row where the parser failed in a <code>TUPLE [line, row: INTEGER]</code>. | ||
+ | For the second scenario we have to check if the string left to parse is empty: | ||
+ | <code> | ||
+ | l_result.left_to_parse.is_empty | ||
+ | </code> | ||
+ | If it is empty everything could be parsed. Otherwise we will find the location where the parsing ended with the same <code>l_result.longest_match_line_row</code> as before. | ||
+ | ===Advanced error handling=== | ||
+ | Most of the times source code can have semantic errors which can't be detected directly via the syntax. There is the possibility of analyzing the domain model ''after'' it has been generated. With PEGs we can directly interfere with the parsing process and add errors/warnings to the output (<code>PEG_PARSER_RESULT</code>). | ||
+ | If we have the building agents defined we can validate them in the same feature. As an example we take the previously defined list grammar and expand it by raising an error and stopping the parsing process, if the name is not in the database (assuming there is a database of names): | ||
+ | <code> | ||
+ | build_list_item (a_result: PEG_PARSER_RESULT): PEG_PARSER_RESULT | ||
+ | -- Builds a value attribute | ||
+ | local | ||
+ | l_list_item: LIST_ITEM | ||
+ | do | ||
+ | Result := a_result | ||
+ | create l_list_item.make | ||
+ | if attached {STRING} Result.internal_result.first as l_name then | ||
+ | if database.has (l_name) | ||
+ | l_list_item.name := l_name | ||
+ | else | ||
+ | Result.put_error_message ("Name " + l_name + " is not in database!") -- Add an error | ||
+ | Result.success := False -- Stop the parsing process (of this branch) | ||
+ | end | ||
+ | else | ||
+ | Result.put_error_message ("Name is not valid!") -- This line should actually never be executed | ||
+ | end | ||
+ | Result.replace_result (l_list_item) | ||
+ | end | ||
+ | </code> | ||
+ | ==Debugging the parser== | ||
+ | Mostly, while building a parser, we want to know what exactly is being parsed and why for instance the parse process failed. For this purpose the debug option "peg" can be activated in Eiffel Studio. | ||
+ | Now when you try to parse a string you will always see a short serialization of the current parser, the current character and the string left to parse. | ||
+ | While this is useful, in big grammars it is difficult to keep an overview and we may not know where we are in the grammar. That's why parsers can have a 'name'. The name is display instead of the serialization and helps navigate the "parsing stack". | ||
+ | |||
+ | <code> | ||
+ | identifier.set_name ("identifier") | ||
+ | </code> |
Latest revision as of 06:58, 24 August 2009
This page describes the Parsing Expression Library implementation for Eiffel. Information about PEGs can be found here [1].
Contents
Basic classes
All parsers inherit from PEG_ABSTRACT_PEG which defines the common functionalities. The parsers are the same as in the definition of Wikipedia with additional classes like whitespace support.
The parsers are combined to a object hierarchy which defines the grammar. A string can then be parsed via the feature
parser.parse_string ("Some source")
on the root object.
Internal DSL
Objects can be combined via features, but the easier way is to use the defined operators. For instance if we want to define the simple grammar
'a' 'b' 'c'*
we will simply write:
a + b + (-c)
Where a, b, c are already defined as character parsers parsing the right character (PEG_CHARACTER). The '+' operator concatenates the parsers to a sequence (PEG_SEQUENCE), while the prefix operator '-' wraps c into a one or more parser (PEG_ONE_OR_MORE). All the operators are:
- binary '+': Sequence concatenation
- binary '|': Choice concatenation
- prefix '+': wraps one or more
- prefix '-': wraps zero or more
Additionally there is the operator '|+' and the '|&' operator which act like the binary '+' operator. In contrast to it, they insert a whitespace*
(for '|+') or a whitespace+
(for '|&') parser between the two operands. As it is often needed it makes sense to define them as operators.
Be aware of a common mistake in combination with the binary '|'/'+' operators. If for instance you define an identifier as:
identifier := a_to_z + (- (a_to_z | underscore))
If you go on and define two new parsers based on the latter one:
identifier2 := identifier + a identifier3 := identifier + a + b
.. then you won't get the expected result. identifier2
reuses the Sequence instance of identifier
. So by adding a
to it we change the Sequence of identifier and do not generate a new Sequence. identifier3 will then use that corrupted identifier and modifies the identifier
even more. Finally identifier2
and identifier3
point to the same instance:
(identifier a a b)
To prevent this problem identifier has to be fixated:
identifier.fixate
Building a domain model
A domain model can be directly created while parsing most of the times and doesn't have to be derived from the AST. With this implementation it can be achieved by defining builder agents on the various parser fragments. To show its workings we will look at an example grammar for a definition of a "list language":
'(' identifier (',' identifier)* ')'
We assume that identifier is already defined. The corresponding parser in the parser syntax would be
list := open_parenthesis + identifier + (- (comma + identifier)) + close_parenthesis
No we can define a feature on identifier which builds a list item and one on list which creates a list with the items:
list.set_behaviour (agent build_list) identifier.set_behaviour (agent build_list_item)
The implementation of those are the following:
build_list_item (a_result: PEG_PARSER_RESULT): PEG_PARSER_RESULT -- Builds a value attribute local l_list_item: LIST_ITEM do Result := a_result create l_list_item.make if attached {STRING} Result.internal_result.first as l_name then l_list_item.name := l_name end Result.replace_result (l_list_item) end
build_list (a_result: PEG_PARSER_RESULT): PEG_PARSER_RESULT -- Builds a value attribute do Result := a_result Result.replace_result (Result.internal_result) end
Error Handling
To fix source code we need to output syntax errors detected in the parsing process. After every parsing with the parse_string feature we get a PEG_PARSING_RESULT with all the information we need. On the one hand there is the above mentioned internal_result which is a list of domain models (possibly empty). On the other hand the PEG_PARSING_RESULT object tells you if and where an error occurred. There are two different cases of a failure: either the parser failed on a character expecting an other one and thus flagging the parsing result as a failure. Or it didn't fail but some of the source code has not been parsed, so only a part of the code is valid. In the first case we can find out where the syntax lies by inquiring longest_match_line_row:
l_result.longest_match_line_row
This feature returns the line and row where the parser failed in a TUPLE [line, row: INTEGER]
.
For the second scenario we have to check if the string left to parse is empty:
l_result.left_to_parse.is_empty
If it is empty everything could be parsed. Otherwise we will find the location where the parsing ended with the same l_result.longest_match_line_row
as before.
Advanced error handling
Most of the times source code can have semantic errors which can't be detected directly via the syntax. There is the possibility of analyzing the domain model after it has been generated. With PEGs we can directly interfere with the parsing process and add errors/warnings to the output (PEG_PARSER_RESULT
).
If we have the building agents defined we can validate them in the same feature. As an example we take the previously defined list grammar and expand it by raising an error and stopping the parsing process, if the name is not in the database (assuming there is a database of names):
build_list_item (a_result: PEG_PARSER_RESULT): PEG_PARSER_RESULT -- Builds a value attribute local l_list_item: LIST_ITEM do Result := a_result create l_list_item.make if attached {STRING} Result.internal_result.first as l_name then if database.has (l_name) l_list_item.name := l_name else Result.put_error_message ("Name " + l_name + " is not in database!") -- Add an error Result.success := False -- Stop the parsing process (of this branch) end else Result.put_error_message ("Name is not valid!") -- This line should actually never be executed end Result.replace_result (l_list_item) end
Debugging the parser
Mostly, while building a parser, we want to know what exactly is being parsed and why for instance the parse process failed. For this purpose the debug option "peg" can be activated in Eiffel Studio. Now when you try to parse a string you will always see a short serialization of the current parser, the current character and the string left to parse. While this is useful, in big grammars it is difficult to keep an overview and we may not know where we are in the grammar. That's why parsers can have a 'name'. The name is display instead of the serialization and helps navigate the "parsing stack".
identifier.set_name ("identifier")