Parse Earley, Parse Often

December 10th, 2006

The Earley Parser is a general method for parsing any context-free grammar. Although it avoids the significant faults of its narrower associates (e.g.: infinite loops on left recursive rules in top-down parsers, or reduce-reduce conflicts on ambiguous rules in table-based parsers), it’s a very inefficient algorithm.

A few years back I implemented Earley’s algorithm, and bound it to a small Scheme interpreter I’d made. The result is a simple environment for interactively creating parsers and experimenting with their output:

Download Earley.zip (227 KB)

An example of an interaction one might have:

> (define arith-parser
(make-parser
(concat
“E ::= Val | Val Op E;”
“Val ::= Num | \”(\” E \”)\”;”
“Op ::= \”+\” | \”-\” | \”*\” | \”/\”;”
“Num ::= ‘[0-9]+’;”
)))

> (arith-parser “(5+3)*2″)
(E (VAL “(” (E (VAL (NUM “5″)) (OP “+”) (E (VAL (NUM “3″)))) “)”) (OP “*”) (E (VAL (NUM “2″))))


The zip file also contains a couple of example files that demonstrate a more sophisticated grammar than the above grammar for simple arithmetic expressions. The example grammar, in a file named “lp.grm”, is the grammar for a programming language I’ve made called “Lambda Pascal” (a variant of Pascal with first-class procedures). The example source file, “hello.lpas”, is an example of an expression that can be parsed with this grammar.

In order to parse the example source file with the example grammar, unzip all files to a common directory, then run earley.exe and enter the following bit of code at the prompt:

(define lpp (make-parser (open-file “lp.grm”)))
(lpp (open-file “hello.lpas”))

If you take a look at the grammar in “lp.grm” you’ll see that there are a few strange operators/symbols used in the BNF syntax for describing the grammar. The character “$” is used to represent the “epsilon rule” (i.e.: a rule that matches the empty string). The character “`” preceding a rule element tells the parse-tree constructor to prevent the corresponding node from being placed in the parse tree (for example, there’s a rule to match whitespace that’s used everywhere, but it’s not very useful to save whitespace in the parse tree). The character “~” preceding a rule element means that terminals in the element should be matched without regard to case. Finally, the character “:” preceding a non-terminal rule element means that the element itself shouldn’t be placed in the parse tree, but that its children should be (for example, a rule like AList ::= ‘a’ AList | $; applied to “aaa” would produce (ALIST “a” (ALIST “a” (ALIST “a” (ALIST)))) but if we change the rule to AList ::= ‘a’ :AList | $; then for “aaa” we’ll get (ALIST “a” “a” “a”), which looks much better).

Comments are closed.