Practical PLT Part 2: An S-Expression Parser

May 1st, 2008

In the last article we saw how to build a simple interpreter.  We quickly ran into the problem that complex program terms (like the definition of the factorial function) were incredibly difficult to construct manually.  In this article we’ll address that problem by first establishing a shorthand for program terms: S-Expressions.  Then we’ll see how we can mechanically translate textual S-Expressions into the linked Pair structure that our interpreter requires.

Philosophical Preliminaries

We first got into trouble in our interpreter when we introduced compound expressions.  Primitive Number and Symbol values were still fairly easy to construct (and should be even simpler in our S-Expression form), but constructing a program to sum three numbers was still difficult.  It required a box-pointer diagram like this:

 

Further, we said that this corresponded to an S-Expression: (+ 1 (+ 1 1)).

This example demonstrates that we have three important cases to consider when parsing expressions: symbols (as “+” in the above example), numbers, and (trickiest of all) lists of expressions (everything between parentheses, broken by spaces).

Here we can significantly simplify the problem of parsing lists by flattening the representation.  In this case, that means recognizing that parsing (+ 1 (+ 1 1)) can be viewed as parsing (+ 1 b) where b is the result of parsing (+ 1 1).  So we need to be able to set aside a parsed-list mid-way, and resume it later with the result of a nested parsed-list.

Now we can use the “(” character as an indication that a new list is beginning, and “)” that the most current list has ended.

Finally, we have a strategy for parsing S-Expressions.  To parse an expression, we ignore characters up to the first non-whitespace character.  If it’s numeric, process the rest of the input up to a non-numeric character as data for a Number value, construct the Number, and return it.  Likewise (but for alphabetic characters), construct Symbol values.  If a “(” character is encountered, push a new Pair value onto the stack, then proceed to parse sub-expressions by allocating a new Pair for each one, linked to the end of the list on the top of the stack.  Finally, when a “)” character is found, pop the Pair from the top of the stack and either insert it in the outer list (if the stack isn’t empty), or return it as the final result.

This informal description essentially captures a push-down automaton.  By explicitly defining the set of states we can be in, rules for transitioning from one state to the next, and rules for constructing terms from the strings captured between states, we can completely define our S-Expression parser.

Actual Code

Just as we did last time, we’ll start with the basic type signature of the function we aim to define:

Value* ReadSExpression(std::istream& input);

This says that if we can provide a source of text (whether from the console, a file, the network, etc), the ReadSExpression function will produce a Value usable by our interpreter.

Using the general strategy of defining a push-down automaton outlined above, we ought to be able to create a class to encapsulate that process, which should be straightforward to use in our final implementation of ReadSExpression by the end of this article.

Following this trail of thought, we can immediately begin by making explicit the set of states we expect to be in:

enum SExprReadState
{
    LookingForValue,
    ReadingAtom
};

Here we’ll collapse the states for reading a Symbol or a Number into the ReadingAtom state (once we’ve read the “atom” we’ll determine which type to use).

Now we’ll need some machinery for understanding these states and the various transitions between them that we may define.  I’ve prepared this machinery ahead of time (which can be safely ignored until the end, when we’ve got to compile the whole thing) in a class, StreamFSM:

    StreamFSM.hpp

To use this class, we’ll need to provide it with a character type (char should be good enough, unless we’ll need to parse UNICODE), a state type (our SExprReadState above), and the implementation type (the type of the S-Expression reader we intend to define).  We can conveniently forward-declare the whole bundle like this:

class SExprReader;
typedef StreamFSM<char, SExprReadState, SExprReader> SERBase;

Now we can begin to define the actual SExprReader class.  Here we’ll inherit from the StreamFSM to import the basic machinery, which lets us define our initial state (LookingForValue), the procedures to invoke when reading characters in any of our defined states, and the procedures to invoke when transitioning between states:

class SExprReader : public SERBase
{
public:
  SExprReader() : SERBase(LookingForValue), output_value(0)
  {
   AddStateHandler(LookingForValue, &SExprReader::LookForValue);
   AddStateHandler(ReadingAtom,     &SExprReader::ReadAtom);

   AddTransitionHandler
    (ReadingAtom, AnyState, &SExprReader::AddAtom);
   AddTransitionHandler
    (AnyState, EndOfStream, &SExprReader::ValidateState);
  }

  // […]

This is the most important part of our parser, so it’s worth going over the exact meaning of the code above.  First, the SExprReader class inherits the parsing machinery from SERBase.  In its constructor, SExprReader must give SERBase its initial state (here that initial state is LookingForValue).  It also introduces the variable output_value — which we’ll see later is the accumulator for the final parse result.

The rest of the constructor here uses AddStateHandler to map states to procedures that should be invoked for each input character (while that state is active), and AddTransitionHandler to map a pair of meta-states to a procedure that should be invoked when a transition between the two meta-states occurs.  A meta-state can be the symbol AnyState (which stands in for any of our SExprReadState values), StartOfStream (the state that’s set when we first begin parsing), EndOfStream (the state that’s set after we’ve reached the end of input), or any of our standard SExprReadState values.

In English, these state and transition mappings can be read as: When in the LookingForValue state, call the LookForValue procedure.  When in the ReadingAtom state, call the ReadAtom procedure.  When transitioning between the ReadingAtom state and any other state, call the AddAtom procedure.  When transitioning between any state and the end of input, call the ValidateState procedure.

Continuing to fill out this class, we’ll need local data for building up parsed expressions:

private:
    typedef std::pair<Pair*, Pair*> ListState;
    typedef std::stack<ListState>   NestedExprs;
    NestedExprs nested_exprs;
    // […]

Here the nested_exprs stack will keep track of lists within lists, and the ListState type will hold the initial Pair node of a linked list, and the final node of the list (this is important because we’ll append new nodes to the final node, but we’ll return or insert the initial node).

Now, since it will be the first procedure to run while parsing, let’s fill out the LookForValue procedure:

void LookForValue(const char& c)
{
    switch (c)
    {
    case ‘(’:
        nested_exprs.push(ListState(0, 0));
        break;
    case ‘)’:
        if (nested_exprs.size() > 0)
        {
            ListState ne = nested_exprs.top();
            nested_exprs.pop();
            AddToResult(ne.first);
        }
        else
        {
            throw std::runtime_error(”Unexpected ‘)’.”);
        }
        break;
    default:
        if (!is_whitespace(c))
        {
            PutBack(c);
            SetState(ReadingAtom);
            break;
        }
    }
}

Again, there are a lot of things being done here, so let’s take them case by case.

If we receive the ‘(‘ character, we push a new ListState onto the stack (as described earlier).  Here we make the choice to push nil on the stack initially — this will allow us to use the (arguably more elegant) notion that nil should be equivalent to the empty list: ().  We’ll achieve this by only allocating Pair nodes for the list as they’re needed (i.e.: as we read actual Values within the list), so that if none are needed (i.e.: we read “()“) the result will be the null pointer, nil.

Next, if we receive the ‘)‘ character, we should complete the active list.  If nested_exprs is empty, that means there is no active list (so we raise an error).  Otherwise, we pop off the state for the active list, and use the AddToResult procedure (we’ll examine this later) to either set the final parse result or accumulate this newly-completed list into the next active list.  Notice the symmetry with the ‘(‘ case here, where we’ll call AddToResult(nil) if the active list has had nothing accumulated into it.

Finally, if we receive any character that isn’t whitespace (given the trivial is_whitespace function), we’ll put that character back (as the PutBack procedure makes the character available for the next state-handler) and set the state to ReadingAtom, which brings us directly to the ReadAtom state-handler:

void ReadAtom(const char& c)
{
    if (is_whitespace(c) || c == ‘(’ || c == ‘)’)
    {
        PutBack(c);
        SetState(LookingForValue);
    }
}

This state-handler is simple enough.  If we get a character that’s either whitespace or a command to begin/end an active list, the current atom must be ending.  Hopefully this jives with the notion that a number should only consist of a contiguous sequence of numeric characters, and a symbol should only consist of a contiguous sequence of alpha-numeric characters.

If we do get a character that indicates the end of the current atom, we just put it back, and switch back into the LookingForValue state (whose handler we’ve already covered).

You may wonder how the actual parsed text turns into a Value that’s either returned or inserted into the active list.  That’s the job of the transition handlers, which we’ll examine now.

void AddAtom(elem_iter begin, elem_iter end)
{
    std::string val(begin, end);
    if (is_numeric(val))
        AddToResult(allocate<Number>(from_string<double>(val)));
    else
        AddToResult(allocate<Symbol>(val));
}

Here our SERBase machinery gives our transition handlers the sequence of characters recognized between being in the first state and transitioning to the second.  We use it to construct an std::string, which is used either to construct a Number (given the trivial is_numeric function), or a Symbol.  Here we also use the allocate<T> function to allocate a garbage-collected value of type T, which in the last article we decided to defer to a later article (and which we’ll again defer to a later article).  Finally, we use the same AddToResult procedure as we used earlier to add this value (in either case) to the overall parse result. 

Additionally, if the from_string function doesn’t seem trivial, we can use some existing machinery provided by the STL to define it as easily as:

template <typename T>
    T from_string(const std::string& s)
    {
        T ret;
        std::istringstream ss(s);
        ss >> ret;
        return ret;
    }

We have one other transition handler to consider, which really just makes sure that we don’t finish parsing our input if we haven’t closed off all active lists:

void ValidateState(elem_iter begin, elem_iter end)
{
    if (nested_exprs.size() > 0)
        throw std::runtime_error(”Expected ‘)’.”);
}

Now, we’ve only left one procedure to define in order to complete our SExprReader class:

void AddToResult(Value* v)
{
    // if we’re not accumulating values on an active list
    //   return this value as the result value
    // otherwise add this value to the end of the active list

    if (nested_exprs.size() == 0)
    {
        output_value = v;
        Detach();
    }
    else
    {
        NestedExpr& ne       = nested_exprs.top();
        Pair*       cur_pair = allocate<Pair>(v, (Value*)0);

        // if we’ve already allocated the initial node,
        //   append the new node as the final node
        // else set the initial and final node to the new node
        if (ne.first != 0)
            ne.second->tail(cur_pair);
        else
            ne.first = cur_pair;

        ne.second = cur_pair;
    }
}

Finally we’ve wrapped the trickiest part of the whole problem into this procedure.  The purpose of this procedure is to add a single value to the result — if there are no active lists, the value should be the final result but if there is an active list, the value should be appended to the end of it.  Here that’s achieved essentially with two cases.

In the first case, when nested_exprs is empty that means that there are no active lists to append to.  So here we’ll set output_value to the input Value and use SERBase’s Detach procedure to stop processing input.  This is the most basic case, and in typical circumstances this is where the complete parse bottoms out.

In the next case, however, we’ve got to append the input Value to the active list.  Because we were a little clever in not allocating the initial node before any values are added to the list, we need to account for that decision here.  First we allocate a new list node containing just the input value.  Then, if the active list has no initial node set, we’ll make our new node the initial node (this will be the start of the list, and it will be used as the value for the entire list).  If the active list does have an initial node, we’ll append our new node to the active list’s final node.  In either case, this new node will be the new final node for the active list.

Now we just need to make the final result accessible:

Value* value() const
{
    return output_value;
}

And we’ll need to close off the definition of this SExprReader class:

};

Now we can bring this all together in a simple function that makes use of this class:

Value* ReadSExpression(std::istream& input)
{
    SExprReader reader;
    reader.Run(input);
    return reader.value();
}

There we have it!  We now have an easy way to parse S-Expressions from any input source (a file, the console, the network, etc), which can be used in conjunction with the Eval function we defined last time!  It’s so easy to use with Eval, in fact, that (provided we make a simple function PrintExpression to serialize a Value to an output stream) we can concisely define REPL, the procedure for running a Read-Eval-Print-Loop:

void REPL(std::istream& i, std::ostream& o, EnvironmentFrame* e)
{
    while (i)
    {
        try
        {
            // read
            Value* inexp = ReadSExpression(i);

            // eval
            Value* outexp = Eval(inexp, e);

            // print
            PrintExpression(outexp, o);

            // loop
            output << std::endl;
        }
        catch (std::runtime_error& e)
        {
            o << “*”               << std::endl
              << “** ” << e.what() << std::endl
              << “*”               << std::endl;
        }
    }
}

Which could be used just as easily to enable an RPC-like session (where input and output streams are network connections) as it could to interact with the console.  In the (maybe more typical) example of console use, all that’s necessary to run a REPL session is the following call (assuming that your root EnvironmentFrame is prepared in the variable root_env):

REPL(std::cin, std::cout, root_env);

Clearly we’ve come a long way toward making very simple what may have once seemed hopelessly complex!

In the next article, we’ll look at how to make it even more convenient to bind functions to our interpreter, and we’ll add some advanced functions that take us beyond the simple arithmetic we’ve done so far into list processing.

Comments are closed.