Practical PLT Part 5: A Parser Generator

July 29th, 2008

At this point we’ve developed a complete, though very simple, interpreter and runtime environment.  To some degree, this interpreter is all we need.  It can express any function, it can easily be extended, it has a simple parser, and it has a working garbage collector.  What more could we ask for?  Life is good.

Life is good, that is, until we ask somebody else to use it.  Their first response will probably be annoyance at the syntax — it’s a little surprising to have to write (+ 1 1) after grade school has burned 1 + 1 into our minds — and, truth be told, the S-Expression syntax isn’t the most natural or compact for every problem.

Luckily we can, with a little work, entirely eliminate this complaint by creating a parser generator — a function that takes as input a grammar (a formal description of a language), and produces as output a parser for that language.

Philosophical Preliminaries

A reasonable question to consider first in attacking this problem is: what is a grammar?  For some people (myself included), the mere mention of the word “grammar” may evoke the mental image of a sharp-eyed schoolmarm, whose equally sharp-edged ruler is held poised at her side, ready to sting the flesh of any hand mapping erroneously the boundary of a prepositional phrase.  This brief and terrible image is enough to bring back the entire experience of High School English, with impenetrable sonnets and lengthy analyses of The Canterbury Tales (which you’re told is provocative and risque, but which seems about as titillating as your balding teacher playing “Moonlight Sonata” on his well-insulated abdomen).  It’s enough to make one ask whether it is nobler in the mind to suffer the slings and arrows of outrageous fortune, or by opposing, end them — to be or not to be?

Don’t worry, it’s not like that.

A grammar, for our purposes, is a set of rules (or "patterns"), each given some (not necessarily unique) name.  These rules can consist of terminals (individual characters like "a" or "9") and non-terminals (references to the set of rules sharing a particular name in the grammar).  Each named rule in the grammar is written A -> e1 e2 … en, where A represents the name of the non-terminal associated with the pattern, and each e may be a terminal or non-terminal.

For example, here’s a grammar for integers:

Num   -> Num Digit
Num   -> Digit
Digit -> "0"
Digit -> "1"
…
Digit -> "9"

Here, the Digit rules say that a Digit can be any numeric character from '0' to '9'.  On top of that, a Num can be either a single digit, or it can be a Num followed by a single digit.  The effect of this is that a Num will represent an integer of one or more digits.  People familiar with regular expressions may prefer to write this simply as '[0-9]+', but for our purposes the explicit (and admittedly more verbose) grammar syntax will make it easier for us to understand what a grammar does and, later, how to derive a parser from a grammar.
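Before going further, it may help to pin this definition down in code.  Here’s a minimal C++ sketch (the names Symbol, Rule, and Grammar are illustrative, not taken from the real implementation) of the shape that the sketches later in this article will assume:

#include <string>
#include <vector>

// One element of a rule body: either a terminal or the name of a non-terminal.
struct Symbol {
    bool isTerminal;
    std::string text;   // e.g. "0" for a terminal, "Digit" for a non-terminal
};

// One rule: A -> e1 e2 ... en
struct Rule {
    std::string lhs;            // the non-terminal this rule defines
    std::vector<Symbol> rhs;    // the sequence of terminals and non-terminals
};

// A grammar is the ordered list of rules; by convention, the first rule's
// left-hand side is the root non-terminal.
using Grammar = std::vector<Rule>;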

It’s worth noting, for those familiar with regular expressions, that the case of 0 or more repeating digits (i.e.: '[0-9]*') can also be expressed in a grammar as follows:

Num   -> Num Digit
Num   ->
Digit -> "0"
…
Digit -> "9"

Here, the second rule contains nothing, and is said to derive the empty string.  This feature is critical for typical grammars, and will require some extra thought when we consider generating parsers from a grammar.

It is often said that a grammar generates a sentence (or string).  One way to think of this concept is to start at the root non-terminal (by convention, the first written rule is considered to represent the root non-terminal) and, descending recursively, choose a particular rule for each non-terminal encountered.  Each terminal "found" by this process is generated.

For example, the integer grammar above (with root non-terminal Num) generates the sentence "1" by the rule Digit -> "1" and finally Num -> Digit.  It also generates the sentence "42" by the rule Digit -> "4", then Num -> Digit, then Digit -> "2", and finally Num -> Num Digit (first the "4", then the "2").  On the other hand, we can’t say that "abc" is generated by the integer grammar, as no rule matches the whole (or part) of that sequence of terminals.

This process of tracing how a grammar generates a sentence can become confusing, so instead we can use our existing tool of S-Expressions to denote the non-terminals that are matched, and the nesting of matches.  For example, in saying that the sentence "1" is generated by the grammar, we say that the sentence produces the parse tree (Num (Digit "1")).  Likewise, the sentence "42" produces the parse tree:

(Num (Num (Digit "4")) (Digit "2"))

The important thing here is that the structure of these S-Expressions matches exactly the structure of the rules they match, and the nesting of S-Expressions matches the order of following those rules.

A subtle, but very important point to make is that there isn’t necessarily one parse tree for a sentence matching an arbitrary grammar.  For example, consider a grammar for arithmetic expressions (building on the previous grammar for integers):

E  -> Num
E  -> E Op E
Op -> "+"
Op -> "-"

Now, it’s true that the sentence "1+1+1" is generated by this grammar, but what parse tree should that sentence produce?  The match for "1+1+1" must start at E -> E Op E, but what should the left and right Es match?  If we expand the left E first (so the left E will match "1+1" and the right "1"), we’ll produce a leftmost derivation (this is the choice made by the so-called LL parsers).  Likewise expanding the right E first produces a rightmost derivation (the LR parsers).

In other words, the sentence "1+1+1" with a leftmost derivation produces the parse tree:

(E
(E (E (Num "1")) (Op "+") (E (Num "1")))
(Op "+")
(E (Num "1")))

While with a rightmost derivation, it produces the parse tree:

(E
(E (Num "1"))
(Op "+")
(E (E (Num "1")) (Op "+") (E (Num "1"))))

When the same sentence, given the same grammar, can produce multiple parse trees, the grammar is said to be ambiguous.  When a parser is able to read sentences of an ambiguous grammar, it is often forced to inefficiently search the set of sentences that a grammar generates in order to find (or eliminate) the sentence that it is presented with.  Therefore, we seek a method of deriving parsers that accepts only unambiguous grammars (unfortunately we can’t hope to accept all unambiguous grammars, but we’ll try to get as close as possible).

Now, as a first step in deriving a parser from a grammar, we’ll look at augmenting the grammar in a few simple ways to elaborate concepts that may not be sufficiently clear when talking about sentences and parse trees.

The first important augmentation is the idea of an item (why such a generic and useless word was chosen for this idea is beyond me — but it’s stuck at this point, so that’s what we’re all left with).  An item is a grammar rule, with a little dot in front of the terminal or non-terminal that we want to read next.  For example, if we’re working with the integer grammar and we’ve not yet read anything, we can represent that idea by the following item:

Num -> •Num Digit

(Notice the little ‘•’ in front of the first Num in the rule.)

Similarly, if we want to say that we’ve just read a Num non-terminal, we can represent that idea with the following item:

Num -> Num •Digit

(Here the ‘•’ is in front of the Digit non-terminal, after having read the first Num.)

When we’ve completely matched a rule, we represent this by its item, with a ‘•’ at the very end.  For example, the following item indicates that we’ve completely matched the Num -> Num Digit rule:

Num -> Num Digit•
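As a data structure, an item is tiny.  Continuing the illustrative C++ sketch from earlier, it’s just a rule index plus the position of the '•':

// An item: a rule plus the position of the '•' within its body.
// dot == 0 means nothing has been read yet; dot == rhs.size() means the
// rule has been completely matched.
struct Item {
    int rule;   // index of the rule in the Grammar
    int dot;    // position of the '•', from 0 to rhs.size()
};

// Ordering and equality, so that items can live in a std::set.
bool operator<(const Item& a, const Item& b) {
    return a.rule != b.rule ? a.rule < b.rule : a.dot < b.dot;
}
bool operator==(const Item& a, const Item& b) {
    return a.rule == b.rule && a.dot == b.dot;
}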

Now, if we want the complete picture of the state of parsing that a set of items represents, we must compute its closure: find each non-terminal after the '•' in each item, add items (with a leading '•') for all rules of that non-terminal, and repeat this process on the result until no new items are added.

For example, let’s compute the closure of this (initially singular) item set:

Num -> •Num Digit

Because the ‘•’ is in front of a Num, we must add items (with a leading ‘•’) for each Num rule.  Starting with the first Num rule, this would give us:

Num -> •Num Digit
Num -> •Num Digit

However, the two items above are identical (not merely because they represent the same rule, but because the '•' is also in the same place in both).  So, erasing the duplicate item and moving on to the next rule of Num we get:

Num -> •Num Digit
Num -> •Digit

Now we’ve made progress, but we’re not finished because there’s a ‘•’ in front of the Digit non-terminal in the second rule, so we must add items for each Digit rule:

Num   -> •Num Digit
Num   -> •Digit
Digit -> •"0"
Digit -> •"1"
…
Digit -> •"9"

Finally, there are no new items with a ‘•’ in front of a non-terminal, so we’ve fully computed the closure.
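In code, this is a small fixpoint loop.  A sketch, continuing with the illustrative Grammar and Item types from earlier:

#include <set>

std::set<Item> Closure(const Grammar& g, std::set<Item> items) {
    bool changed = true;
    while (changed) {
        changed = false;
        std::set<Item> snapshot = items;               // iterate over a stable copy
        for (const Item& it : snapshot) {
            const Rule& r = g[it.rule];
            if (it.dot >= (int)r.rhs.size()) continue; // '•' at the very end
            const Symbol& next = r.rhs[it.dot];
            if (next.isTerminal) continue;             // only non-terminals expand
            for (int i = 0; i < (int)g.size(); ++i)    // add B -> •w for every rule of B
                if (g[i].lhs == next.text && items.insert(Item{i, 0}).second)
                    changed = true;
        }
    }
    return items;
}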

Now we’re equipped to understand the first critical phase of analyzing a grammar: deriving the parser-state graph.  The parser state graph is meant to represent every state a parser can be in (those states are the nodes of the graph), as well as the terminals and non-terminals that transition one state to another (the edges of the graph).  It should have an initial state (this is where parsing begins), and a final state (where we’re certain that we’ve read a complete sentence).

To make it easier to compute this graph, we’ll first add a unique rule to our grammar that expects just the root non-terminal followed by the special end-of-input terminal, which we’ll represent by ‘$’.  So, with the continuing example of our grammar for integers, we’d add this rule:

S -> Num $

Then we create an item from this rule, with its ‘•’ in the initial position:

S -> •Num $

Next, we compute its closure:

S     -> •Num $
Num   -> •Num Digit
Num   -> •Digit
Digit -> •"0"
Digit -> •"1"
…
Digit -> •"9"

Now, this item set closure represents the initial state of the parser.  That formal statement should agree with our intuition: we haven’t read anything yet (all items have the ‘•’ in the initial position) and we’re waiting to read the root non-terminal.

Next, we must collect the set of all terminals and non-terminals to the right of the '•' in each item.  In this case, that set will be { Num, Digit, "0", "1", … "9" }.  We must then compute the items we would have if we were to follow each of these in turn, from the initial state.  For example, when we’ve read a Num from the initial state, we should have an item set like this:

S   -> Num •$
Num -> Num •Digit

Notice that all items that weren’t expecting a Num have been dropped, and all items that were expecting a Num have had their '•' advanced past the Num to point at the next expected terminal/non-terminal.

Next we should compute this item set’s closure:

S     -> Num •$
Num   -> Num •Digit
Digit -> •"0"
Digit -> •"1"
…
Digit -> •"9"

Now we’ve computed another state of the parser, which the initial state can get to by a transition on the Num non-terminal.  This state happens to be the final state, since it is expecting the special end-of-input terminal (and if it gets it, the entire job of parsing will be over).  Still, that doesn’t mean that it’s the last state we must add to our graph!  We have to continue computing follow states in this way: for each state, find the set of terminals and non-terminals to transition on, and link those transitions either to new states (if no identical state exists yet) or to existing ones (if one does).
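A sketch of that procedure, continuing the illustrative C++ from earlier: the transition function advances the '•' past a symbol and closes the result, and a simple worklist discovers every reachable state.  (Recording the transition edges themselves is omitted here to keep the sketch short, but a real implementation would keep them, since they become the columns of the table below.)

#include <algorithm>
#include <set>
#include <vector>

// Follow symbol x out of 'state': advance the '•' past x in every item
// expecting x, then close the result. An empty result means no transition.
std::set<Item> Goto(const Grammar& g, const std::set<Item>& state, const Symbol& x) {
    std::set<Item> moved;
    for (const Item& it : state) {
        const Rule& r = g[it.rule];
        if (it.dot < (int)r.rhs.size() &&
            r.rhs[it.dot].isTerminal == x.isTerminal &&
            r.rhs[it.dot].text == x.text)
            moved.insert(Item{it.rule, it.dot + 1});
    }
    return moved.empty() ? moved : Closure(g, moved);
}

// Discover every parser state, starting from the closure of the item for
// the augmented rule S -> •Num $ (assumed to be rule 0).
std::vector<std::set<Item>> BuildStates(const Grammar& g) {
    std::vector<std::set<Item>> states;
    states.push_back(Closure(g, {Item{0, 0}}));
    for (size_t i = 0; i < states.size(); ++i) {       // the worklist grows as we go
        std::set<Item> current = states[i];            // copy: push_back below may
        for (const Item& it : current) {               //   reallocate 'states'
            const Rule& r = g[it.rule];
            if (it.dot >= (int)r.rhs.size()) continue;
            std::set<Item> next = Goto(g, current, r.rhs[it.dot]);
            if (std::find(states.begin(), states.end(), next) == states.end())
                states.push_back(next);                // a genuinely new state
        }
    }
    return states;
}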

If we continue in this way until no new states are added, we should produce a graph like the one below (thin edges represent terminal transitions, thick edges represent non-terminal transitions, the state with a thick border is the initial state, and the state with a black background is the final state):

[Figure: the parser state graph for the integer grammar.]

With this, we almost have something that a machine could interpret to parse (in this case) our language of integers.  If we give each rule a unique ID (assigned sequentially, 0 for S -> Num $, 1 for Num -> Num Digit, and so on) and each state a unique ID, then we can represent the graph above as a table.  The rows of the table represent states of the graph, and the columns of the table represent terminal and non-terminal transitions.  The cells of the table will denote one of four actions for the parser:

1. Shift n – Accept the current terminal/non-terminal and transition to state n, pushing n onto the stack.

2. Reduce r – Ignore the input terminal (leave it for a later Shift action), pop from the stack the number of elements (terminals and non-terminals) in the rule r, and make the current state whatever is on the top of the stack.  Then, where rule r represents non-terminal R, perform whatever action is at the current state in the column for R.

3. Accept – The input has completed and a valid sentence has been matched.

4. Error – The current terminal/non-terminal is not expected, and thus the sentence cannot be matched.

Additionally, the stack mentioned above will be initialized to [0] (i.e.: the first implicit action of the parser is Shift 0).  Where a state in the parser state graph contains a rule r that is completed, we’ll fill the entire row with Reduce r.  Where there are transitions from state a to state b (either by terminals or non-terminals) we’ll add Shift b actions.
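Mechanically, that filling procedure might look like the following sketch (continuing the illustrative C++ from earlier; the Action encoding is invented for this sketch, and the transitions argument is assumed to have been recorded while building the state graph):

#include <map>
#include <set>
#include <vector>

struct Action {
    enum Kind { Error, Shift, Reduce, Accept } kind = Error;
    int arg = 0;                 // target state for Shift, rule number for Reduce
};

// states:         the item-set closure for each parser state.
// transitions[s]: column index (terminal or non-terminal) -> successor state.
// endColumn:      the column of the end-of-input terminal $.
std::vector<std::vector<Action>> BuildTable(
    const Grammar& g,
    const std::vector<std::set<Item>>& states,
    const std::vector<std::map<int, int>>& transitions,
    int numColumns, int endColumn) {
    std::vector<std::vector<Action>> table(
        states.size(), std::vector<Action>(numColumns));   // every cell starts as Error
    for (size_t s = 0; s < states.size(); ++s) {
        for (const Item& it : states[s])
            if (it.dot == (int)g[it.rule].rhs.size())      // completed rule:
                for (int c = 0; c < numColumns; ++c)       //   fill the entire row
                    table[s][c] = {Action::Reduce, it.rule};
        for (const auto& [col, target] : transitions[s])   // graph edges become Shifts
            table[s][col] = {Action::Shift, target};
        for (const Item& it : states[s])                   // the item S -> Num •$
            if (it.rule == 0 && it.dot == (int)g[0].rhs.size() - 1)
                table[s][endColumn] = {Action::Accept, 0}; //   marks the accepting state
    }
    return table;
}

Note that blindly filling a whole row with Reduce is exactly what will collide with a Shift in the troublesome grammar later in this article; the lookahead machinery developed there exists to shrink those Reduce entries down to just the legitimate lookahead columns.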

Now, using this approach, we can first number the rules and states in our graph:

[Figure: the integer-grammar parser state graph, with rule and state numbers added.]

This will then give us (according to the procedure outlined above) the following table (where sn means Shift n, rr means Reduce r, A means Accept, and a blank cell means Error):

State   '0'  '1'  '2'  '3'  '4'  '5'  '6'  '7'  '8'  '9'   $    Num  Digit
0       s1   s2   s3   s4   s5   s6   s7   s8   s9   s10        s11  s13
1       r3   r3   r3   r3   r3   r3   r3   r3   r3   r3    r3
2       r4   r4   r4   r4   r4   r4   r4   r4   r4   r4    r4
3       r5   r5   r5   r5   r5   r5   r5   r5   r5   r5    r5
4       r6   r6   r6   r6   r6   r6   r6   r6   r6   r6    r6
5       r7   r7   r7   r7   r7   r7   r7   r7   r7   r7    r7
6       r8   r8   r8   r8   r8   r8   r8   r8   r8   r8    r8
7       r9   r9   r9   r9   r9   r9   r9   r9   r9   r9    r9
8       r10  r10  r10  r10  r10  r10  r10  r10  r10  r10   r10
9       r11  r11  r11  r11  r11  r11  r11  r11  r11  r11   r11
10      r12  r12  r12  r12  r12  r12  r12  r12  r12  r12   r12
11      s1   s2   s3   s4   s5   s6   s7   s8   s9   s10   A         s12
12      r1   r1   r1   r1   r1   r1   r1   r1   r1   r1    r1
13      r2   r2   r2   r2   r2   r2   r2   r2   r2   r2    r2

A quick test of this table is to again interpret the sentence "42":

1. In state 0 on input '4' we consume the terminal and shift to state 5 (the stack now equals [0, 5]).

2. In state 5 on input '2', the terminal is not consumed and we reduce by rule 7 (Digit -> "4").  This rule has only one element, so we pop the top of the stack, leaving us with the stack [0], which means that we return to state 0.

3. Back in state 0 (because we just reduced by rule 7), we process the non-terminal Digit, which puts us into state 13, with stack [0, 13].

4. There, still on input '2', we reduce by rule 2 (Num -> Digit).  Rule 2 has one element, so we pop one element from the stack, leaving us with [0], and we’re back in state 0.

5. Still with '2' as the input, we process the Num non-terminal we’ve just reduced, putting us in state 11 with a stack of [0, 11].

6. In state 11 we finally consume the terminal '2', as we shift to state 3, giving us the stack [0, 11, 3].

7. In state 3 on the terminal $ (the end of input) we reduce by rule 5 (Digit -> "2").  Again rule 5 has only one element, so we pop one element from the stack, giving us [0, 11], and we’re back at state 11.

8. In reducing by rule 5 we read the non-terminal Digit, and we’re told to shift to state 12, giving us the stack [0, 11, 12].

9. In state 12 (still on the terminal $) we reduce by rule 1 (Num -> Num Digit), which has two elements, so we pop two states from the stack, leaving us again with [0], back at state 0.

10. Here we’ve read the non-terminal Num, and we’re told to shift to state 11 (giving the stack [0, 11]).

11. Finally, on the terminal $ in state 11 we accept the string, and we’re done.

Whew!
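For all the tedium of tracing it by hand, the machine doing the work is tiny.  Here is a sketch of the whole table-driven loop, using the illustrative Action encoding from the earlier sketch (the input is assumed to be pre-encoded as a sequence of column indices, ending with the $ column):

#include <vector>

// ruleLen[r]: the number of symbols in rule r's body.
// ruleLhs[r]: the table column of rule r's non-terminal.
bool Parse(const std::vector<std::vector<Action>>& table,
           const std::vector<int>& ruleLen,
           const std::vector<int>& ruleLhs,
           const std::vector<int>& input) {
    std::vector<int> stack = {0};                    // the stack starts as [0], as above
    size_t pos = 0;
    for (;;) {
        Action a = table[stack.back()][input[pos]];
        switch (a.kind) {
        case Action::Shift:                          // consume the terminal, push the state
            stack.push_back(a.arg);
            ++pos;
            break;
        case Action::Reduce:                         // leave the input alone: pop the
            for (int i = 0; i < ruleLen[a.arg]; ++i) //   rule body off the stack, then
                stack.pop_back();                    //   take the non-terminal transition
            stack.push_back(                         //   from the uncovered state
                table[stack.back()][ruleLhs[a.arg]].arg);
            break;
        case Action::Accept:
            return true;                             // a complete sentence was read
        case Action::Error:
            return false;                            // the sentence cannot be matched
        }
    }
}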

We now seem to have a viable plan for producing parsers from grammars, but there’s still a major problem waiting for us.  Consider this grammar, based on the notion of C-style assignment statements dividing expressions into “left hand” and “right hand” values:

e -> l "=" r
e -> r
l -> "*" r
l -> "n"
r -> l

This will produce a parser state graph like the following (with state and rule numbers added):

[Figure: the parser state graph for the assignment grammar, with state and rule numbers.]

Now, when we attempt to derive the parser table from this graph, we find that we can’t do it, because state 6 contains both a Reduce 5 action (which should fill the entire row) and a Shift 7 action (on the terminal '=').  We need a clever way to escape from this predicament.

If we only carried out the Reduce r action when the current input character is one that can legitimately follow rule r in that state, we would eliminate the ambiguity in this case (and in many other cases)!  In other words, if we look ahead one character in the input to disambiguate the shift and reduce actions, our problem will be solved.  We now need to develop a method to determine the set of terminals that can legitimately follow a rule reduction, in a particular parser state.

Luckily, we can compute this lookahead set purely by analysis of the parser state graph.  The solution to this problem was first described thoroughly in the well-known paper Efficient Computation of LALR(1) Look-Ahead Sets by Frank DeRemer and Thomas Pennello.  This paper has served as the basis for the GNU Bison parser generator, among many others.  It’s worth reading for the full technical treatment, but we’ll cover a summarized version here.

The most important observation in deriving lookahead sets from the parser state graph is that, where there is a rule reduction, we must look back to all parser states that could possibly reach that reduction, and then determine (in each of those states) what terminals could possibly follow the non-terminal associated with that rule reduction.  Phrased more compactly:

LA(p, A -> w) = the union of Follow(q, A) | (p, A -> w) lookback (q, A)

Here, LA(p, A -> w) is exactly the function we’re looking for (it computes the lookahead set in state p for the rule reduction A -> w — where w is meant to represent 0 or more terminals or non-terminals).  This computation takes place by finding every edge of the parser state graph (q, A) where state q is waiting for an A and (critically) the path w (the body of the rule reduction A -> w) leads from state q to state p.  Once we’ve found all such edges, we compute Follow for each one (a function yet to be explained), and take the union of the results.

The lookback relation, then, essentially works by following terminal and non-terminal edges backward from the end of a rule definition to its beginning, starting from the state where a rule was completed.

To help understand the lookback relation described above, consider again the C-style assignment grammar above.  If we augment its parser state graph with dashed edges linking completed rules to the states that initiated them, we will have a more straightforward picture of what lookback means:

[Figure: the assignment-grammar parser state graph, augmented with dashed lookback edges.]

Notice that the completion of rule 5 in state 3 does not lookback to the same states that the completion of rule 5 in state 6 does, though it’s the same rule.  This is a crucial insight into understanding what lookback tells us.  For a particular completed rule in a particular state, we want to know what states led to it (not to any instance of that rule, but just that particular instance).

Also, notice that the reduction of rule 1 in state 8 reaches all the way back to state 0, where the e non-terminal was expected.  The way we say this is:

(8, e -> l '=' r) lookback (0, e)

The most important thing this tells us is that the reduction of rule 1 in state 8 should have a lookahead set containing just the end-of-stream terminal ($) because when we lookback to state 0 and then transition on the non-terminal e, we’re left in state 5, which can only read $.  Therefore we should only reduce by rule 1 in state 8 when $ is the next terminal to be read.

Intuitively, all we must do now to compute the lookahead for each rule reduction is follow the dashed lookback edges to each state that initiated that rule, and determine the set of terminals that could follow that rule’s non-terminal.

But how do we determine that set of terminals?  We just did it in an abbreviated way for the case of (0, e) — which leads to a state where only $ can be read.  How do we do it in general though?

The first step is to compute the direct-read set (which is exactly what we did in following (0, e)).  In general, we follow (p, A) — from state p along the non-terminal edge A — to another state, q.  We then look at all items in state q, and gather all terminals with a ‘•’ in front of them.

For example, if we look at (0, l) in the graph above (starting in state 0 and following the l non-terminal edge) we arrive at state 6, which looks like this:

e -> l •'=' r
r -> l•

Here, the first item shows a '=' terminal after the '•' and therefore we can say:

direct-read (0, l) is {'='}
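In the illustrative C++ sketch, direct-read is just a scan of the destination state: follow the (p, A) edge, then collect every terminal sitting immediately after a '•':

#include <set>
#include <string>

// 'target' is the state reached by following the (p, A) edge in the graph.
std::set<std::string> DirectRead(const Grammar& g, const std::set<Item>& target) {
    std::set<std::string> terminals;
    for (const Item& it : target) {
        const Rule& r = g[it.rule];
        if (it.dot < (int)r.rhs.size() && r.rhs[it.dot].isTerminal)
            terminals.insert(r.rhs[it.dot].text);
    }
    return terminals;
}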

But when we want to find the set of terminals following (p, A), direct-read alone isn’t enough, because it doesn’t account for the case where a state contains a rule that derives the empty string — we should also add the direct-read terminals of any states we can reach from the state we’re interested in by following non-terminals that derive the empty string.

Stated formally, we say:

Read (p, A) = the union of direct-read (q, B) | (p, A) reads (q, B)

And we say that (p, A) reads (q, B) if we can get from state p to state q by following an edge labeled A, and state q has a transition on a non-terminal B that generates the empty string.
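The reads relation therefore needs to know which non-terminals can generate the empty string.  That "nullable" set is a standard fixpoint over the grammar; here’s a sketch using the illustrative types from earlier:

#include <set>
#include <string>

// A non-terminal is nullable if some rule for it has a body made entirely
// of nullable non-terminals (in particular, an empty body).
std::set<std::string> Nullable(const Grammar& g) {
    std::set<std::string> nullable;
    bool changed = true;
    while (changed) {
        changed = false;
        for (const Rule& r : g) {
            if (nullable.count(r.lhs)) continue;
            bool bodyNullable = true;
            for (const Symbol& s : r.rhs)
                if (s.isTerminal || !nullable.count(s.text)) {
                    bodyNullable = false;
                    break;
                }
            if (bodyNullable) {
                nullable.insert(r.lhs);
                changed = true;
            }
        }
    }
    return nullable;
}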

A simple (if somewhat contrived) example of this idea is the following grammar (again, we’ll add an s -> a $ rule to provide a simple root and the end-of-input terminal):

s -> a $
a -> a b c
a -> "a"
b -> "b"
b ->
c -> "c"

This grammar produces a parser state graph that looks like this:

[Figure: the parser state graph for this grammar.]

Consider the problem of computing the lookahead set for the reduction of rule 2 in state 1.  It should be clear that (1, a -> "a") lookback (0, a) and so we must include direct-read (0, a) in the lookahead set.  However, the transition (0, a) leads to state 2, which has a transition on the non-terminal b that derives the empty string.  So (according to the reads relation detailed above) we must also include direct-read (2, b).  Therefore, the lookahead set won’t just be {'b'} (as it would be if only direct-read (0, a) were considered), but rather it is {'b', 'c'}.

Of course, in a larger (or even more contrived) grammar, this sequence of direct-read sets can get arbitrarily long (e.g.: a -> a b c d e f g, where b, c, d, e, and f all derive the empty string).  We’ll just need to follow them in whatever manner they appear (and soon we’ll cover an algorithm to efficiently do exactly that).

Another way to visualize this idea of the reads relation is with a derivative graph of the parser state graph.  That is, if we let non-terminal edges be nodes in the derivative graph, the reads relation can be exactly the edges that link those nodes.

[Figure: the reads derivative graph for the grammar above.]

Though it is one level of abstraction removed from the parser state graph, this derivative reads graph shows a simple picture of which parser state graph edges can spill into others.  The graph above shows us that the transition from state 4 on non-terminal c stands on its own — so if we lookback to (4, c) to compute the lookahead set, it’s enough to compute direct-read (4, c).  The graph also tells us that if we lookback to (0, a), we must compute not only direct-read (0, a) for the lookahead set, but also direct-read (2, b), which is exactly what we determined before.

We’re now almost to the final solution of our problem.  The Read function computes part of the lookahead set associated with an edge of the parser state graph, but there’s one last critical subtlety to consider.  Specifically, if we need to compute the lookahead of (p, A) for some state p and some non-terminal A then it’s also possible that there’s some other rule of the form:

B -> G A y, where y generates the empty string and state q follows the path G to state p.

Or (equivalently, but perhaps the more typical case), there’s no y at all and our non-terminal A is just the last thing in the definition of another non-terminal’s rule.  In that case, whatever lookahead set is associated with (q, B) must also be part of the lookahead for (p, A).  More formally, we say that in these cases, (p, A) includes (q, B).

Finally we can define the Follow function analogously to the Read function as:

Follow (p, A) = the union of Read (q, B) | (p, A) includes (q, B)

To illustrate this, consider a simple grammar:

s -> e $
e -> a ";"
a -> b
b -> "b"

Now, our aim here is to show that the rule a -> b will cause the rule b -> "b" to include the a transition in the rule e -> a ";", and therefore that in order to reduce b -> "b" we should find ";" in the lookahead set.  To prove that, let’s first look at the parser state graph:

[Figure: the parser state graph for this grammar.]

Now, by analyzing the graph, we can see that when we reduce rule 3 in state 1, we must look back to state 0 and then follow the b non-terminal edge as we Read the immediate lookahead.  Though there is no immediate lookahead there, we can see that in state 0, by the description of the includes relation, the rule a -> b subsumes the non-terminal b, and so we must also Read (0, a) — which yields just the terminal ";".  More simply, we can draw the derivative graph of the includes relation:

[Figure: the includes derivative graph for this grammar.]

Now with this knowledge of the parser state graph, and the reads and includes derivative graphs, we can generate our final picture of the parser state graph, which includes the lookahead set (in a gray box) for each completed rule:

[Figure: the final parser state graph, with the lookahead set for each completed rule shown in a gray box.]

Finally, we’ve come to a complete picture of how to compute lookahead, and the fundamental functions that compute it:

LA(p, A -> w) = the union of Follow(q, A) | (p, A -> w) lookback (q, A)
Follow (p, A) = the union of Read (q, B) | (p, A) includes (q, B)
Read (p, A) = the union of direct-read (q, B) | (p, A) reads (q, B)

Those three simple functions encapsulate all of the grief we’ve gone through to understand how to compute lookahead.  Now, all that’s left is to implement them!
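One implementation note before the real code: all three functions share the same shape, a set-valued function defined as a base set unioned with another function’s value across a relation.  The DeRemer and Pennello paper gives an efficient "digraph" algorithm for exactly this shape: a depth-first traversal that also collapses cycles in the relation.  Here is a simplified sketch of that traversal (without the cycle collapsing, so it is only guaranteed complete when the relation is acyclic; the names are illustrative):

#include <functional>
#include <set>
#include <vector>

// Solve F(x) = base[x] unioned with { F(y) : y in rel[x] } for every node x,
// by memoized depth-first traversal. The real algorithm in the paper also
// merges strongly connected components; on a cyclic relation this sketch
// still terminates, but may return incomplete sets.
std::vector<std::set<char>> SolveRelation(
    const std::vector<std::set<char>>& base,        // e.g. the direct-read sets
    const std::vector<std::vector<int>>& rel) {     // e.g. the reads relation
    std::vector<std::set<char>> result(base.size());
    std::vector<bool> started(base.size(), false);
    std::function<void(int)> visit = [&](int x) {
        if (started[x]) return;                     // already computed (or in progress)
        started[x] = true;
        result[x] = base[x];
        for (int y : rel[x]) {
            visit(y);
            result[x].insert(result[y].begin(), result[y].end());
        }
    };
    for (int x = 0; x < (int)base.size(); ++x) visit(x);
    return result;
}

The same skeleton serves twice: once to compute Read over the reads relation (with direct-read supplying the base sets), and once to compute Follow over the includes relation (with Read supplying the base sets).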

Actual Code

Our immediate goal in writing the code for this parser generator is to satisfy the (perhaps too general) type signature:

Value* MakeLALRTable(Value* egrm, EnvironmentFrame* env)

Essentially, this reads a grammar in an S-Expression form and returns the table of terminal and non-terminal shift/reduce actions.  This function can be bound to our existing interpreter.  Further, we’ll need functions to display representations of all of the important intermediate structures (like the various forms of parser state graph used in this article, the derivative reads and includes graphs, and sets of items generally).  And, critical to our practical interest in this problem in the first place, we’ll need a function to generate the C++ code to implement a parser, given the shift/reduce table.

All of this code is available from the code.lab49.com SVN module.  It is organized into interpreter bindings, the LALR table generator, and the final C++ code generator.

Additionally (and to continue this theme of "meta-circular code generators"), there is a short collection of Scheme scripts that define a grammar for a prettier form of grammars (including directives for constructing parse trees), so that, for example, we can define a complex C-like language, compile that grammar into the simpler form our parser generator expects, and ultimately derive an efficient C++ parser for it.

And once we can parse a more complex language, we can more easily compile it — which is exactly what we’ll do in the next article.

References

Compilers: Principles, Techniques and Tools by Aho, Sethi, and Ullman
Efficient Computation of LALR(1) Look-Ahead Sets by Frank DeRemer and Thomas Pennello
Parsing Techniques: A Practical Guide by Grune and Jacobs
LR Parser from Wikipedia
