PEG Madness

Tags: programming-languages, parser, pyret

Published on Sunday, October 8th, 2017

A part of my senior project is to create a framework to allow users to use their own grammar in Pyret. A good framework would need to prohibit ambiguous grammar that users might write.

One way to do this would be to require users to write a grammar that is accepted by LALR parser generator, since LALR parser generator will result in a conflict when the grammar is ambiguous. Note that the converse is not true: there could be conflict while the grammar is not ambiguous. This is exactly the situation that we are in because Pyret is currently using GLR parser generator which accept all context-free grammars. And Pyret does have some grammar that require unbounded and non-local lookahead, so LALR is not directly applicable.

There are two solutions that I am considering:

  • Rewrite Pyret grammar so that it can be accepted by LALR parser generator. This doesn’t seem applicable since Pyret is widely use by a lot of people
  • Use PEG

PEG at first glance seems really nice: linear time parsing, unambiguous by construction, and pretty expressive. However, I found that it could be very deceptive as well.

Consider a grammarTaken from Some Aspects of Parsing Expression Grammar, pg. 4:

1
2
3
4
A ::= B "x"

B ::= "x"
    | "xx"

which generates $\set{xx, xxx}$

In PEG, there’s no choice operator. Instead, we have a prioritized choice operator /, where, if the first branch is successful, the second branch will not be considered.

Because of this, xxx can’t be derived by

1
2
A <- B "x"
B <- "x" / "xx"

On the other hand, xx can’t be derived by

1
2
A <- B "x"
B <- "xx" / "x"

Some people really think that this is bad, which could be true. But this is still not too complicated for me. I can still understand easily why it would not derive some strings.

However, this prioritized choice operator has a lot of consequences. It could cause some unexpected parsing error message as follows:

1
A <- "foo" / "foobar"

If the input is empty string, the parser will say Line 1, column 1: Expected "foo" or "foobar" but end of input found.. However, when we enter foobar, it says Line 1, column 4: Expected end of input but "b" found..

But alright, perhaps A <- "foo" / "foobar" should be considered a bad grammar and I shouldn’t write it in the first place.

However, there is a much bigger problem. Consider:

1
2
A ::= "x" A "x"
    | "xx"

which generates a set of non-empty string of even length with character x, or, formally $\set{x^{2n}: n \ge 1}$

If we were to write:

1
A <- "x" A "x" / "xx"

This generates $\set{x^{2^n}: n \ge 1}$!!!!

This is madness! How could the result be so different!

In particular, the grammar cannot derive xxxxxx (6 x). Here, I will show what happens when the parser is trying to parse the string.


Step 1

Step 2

Step 3

Step 4

Step 5

Step 6

Step 7

Step 8

Step 9

Step 10

Step 11

Step 12

Step 13

Step 14

Step 15

Step 16

Step 17

Step 18

Step 19

Step 20

Step 21

Step 22

Step 23

Step 24

Step 25

Step 26

Step 27

Step 28

However, there are 6 x in the input string. This only matches 4, so it didn’t completely successfully parse.

The problem, in particular, is that at step 24, A3 could have changed to rule 2 which would then match the whole string. However, it did not because rule 1 is successful, so we backtrack to A2, completely ignoring the possibility.

Note, however, that it’s still possible to generate $\set{x^{2n}: n \ge 1}$.

1
A <- "xx" A / "xx"

will do the job.