Josh Haberman: LL and LR Parsing Demystified

Jay Taylor's notes

Josh Haberman: LL and LR Parsing Demystified

Original source (blog.reverberate.org)

Tags: programming compilers parsers parse-trees LL LR blog.reverberate.org

Clipped on: 2013-10-11

LL and LR Parsing Demystified

My first adventures in parsing theory came when I was doing an independent study of programming languages in college. When I got to the part about algorithms such as LL, LR, and their many variations (Strong-LL, SLR, LALR, etc), I was fascinated. I felt like I was peering at some deep and powerful incantations, the significance of which I could not yet appreciate, but I was sure that someday terms like "left-to-right, rightmost derivation" would make perfect sense, and I looked forward to achieving this enlightenment.

I can say now, 10 years and a whole shelf of parsing books later, that I understand these algorithms pretty well. But the way I think about them is quite different than any of the literature that I have found. I think more from an implementation perspective than a mathematical one, which could have something to do with it. In any case, I want to explain how I think of these algorithms; hopefully some people will find this perspective intuitive, as I do.

This article will only address the parser from a black-box perspective: what are its inputs/outputs and what are its constraints? A planned future article will break open the black box for more details about the inner workings of these algorithms.

Parsing And Polish Notation

If you studied Computer Science at university, or ever owned an HP calculator, you likely came across Polish and Reverse Polish notation. These are ways of writing mathematical expressions that don't need parentheses or order-of-operations rules. We're used to writing expressions as infix, where the operators go in between the operands:

?

1	`1 + 2 * 3`

In this case, how do you know what order to do the operations in? You have to follow the conventional rules (PEDMAS), and if you want a different order you have to use parentheses, like:

?

1	`(1 + 2) * 3`

Polish and Reverse Polish notation let you write these expressions without needing arbitrary order-of-operations rules or parentheses to disambiguate. They work by putting the operators before (Polish) or after (Reverse Polish) the operands. These are also known as prefix and postfix notation.

?

// First example:
1 + 2 * 3  // Infix
+ 1 * 2 3  // Polish (prefix)
1 2 3 * +  // Reverse Polish (postfix)
 
// Second example:
(1 + 2) * 3  // Infix
* + 1 2 3    // Polish (prefix)
1 2 + 3 *    // Reverse Polish (postfix)

Besides not needing parentheses or a defined order of operations, Polish and Reverse Polish are much easier to write evaluators for (maybe the HP calculator's designers decided to use Reverse Polish so they could spend a week in the Bahamas). Here is a simple Reverse Polish evaluator in Python.

?

# Functions that define the operators and how to evaluate them.
# This example assumes binary operators, but this is easy to extend.
ops = {
  "+": (lambda a, b: a + b),
  "-": (lambda a, b: a - b)
}
  
def eval(tokens):
  stack = []
  
  for token in tokens:
    if token in ops:
      arg2 = stack.pop()
      arg1 = stack.pop()
      result = ops[token](arg1, arg2)
      stack.append(result)
    else:
      stack.append(int(token))
  
  return stack.pop()
  
print "Result:",  eval("7 2 3 + -".split())

Polish and Reverse Polish notation, as they are usually described, do require that all operators have a known arity. Arity is just the number of operands that the operator takes. This means, for example, that unary minus needs to be a different operator than subtraction. Otherwise we don't know how many operands to pop from the stack when we see an operator.

A similar formulation that avoids this problem is one like Lisp's s-expressions. S-expressions (and similar encodings like XML) avoid the need for fixed operator arity by explicitly marking both the beginning and end of each expression:

?

; Lisp-style prefix notation; the same operator can be used
; for different numbers of arguments.
(+ 1 2)
(+ 1 2 3 4 5)
 
; Lisp equivalents of our first two examples:
; Prefix: + 1 * 2 3
(+ 1 (* 2 3))
 
; Prefix: * + 1 2 3
(* (+ 1 2) 3)

This variant has slightly different tradeoffs (we traded fixed arity for required parentheses) but the underlying algorithms for parsing/processing these are quite similar, so generally we'll consider this a minor variant of prefix notation.

It might seem like I've veered off-topic, but I've been sneakily talking about LL and LR the whole time. Polish and Reverse Polish notation directly correspond, in my view, to LL and LR parsing, respectively. But to fully explore this idea we need to first describe what kind of output we expect from a parser.

For a fun exercise, try implementing an algorithm to convert Polish to Reverse Polish notation. See if you can do it without building the entire expression into a tree first; you can do it with a stack alone. Now say you wanted to do the opposite (Reverse Polish to Polish) -- you can just run the same algorithm on the input, but backwards! Of course you can build an intermediate tree too, but this takes O(input length) space, whereas the stack-only solution takes only O(tree depth) space. How about going infix to postfix? There is a very clever and efficient algorithm for that called the Shunting Yard Algorithm.

A Parser And Its Output

We can all agree that the input to a parser is a stream of tokens (which probably came from a lexer, but we can talk about that part another day). But what is a parser's output? You might be inclined to say "a parse tree," and while you can certainly use a parser to build a parse tree, it's also possible to consume a parser's output without ever actually building a tree at all. For example, this Bison example evaluates arithmetic expressions in-line with the parse. Every time a subexpression is recognized, it is immediately evaluated until the final result is just a single number. No parse tree is ever explicitly built.

For this reason, saying that the output of a parser is a parse tree is not general enough. Instead, I claim that the output of a parser, at least for the cases of LL and LR which we are discussing today, is a parse tree traversal.

Apologies if I've set off anyone's nonsense detectors. I can hear people protest that a tree traversal is an algorithm; an operation you perform on a tree. How can I say that a parser outputs a tree traversal? The answer lies in thinking back to Polish and Reverse Polish notation. These are usually thought of as just a notation for mathematical expressions, but we can think of them more generally as flat (serialized) encodings of tree traversals.

Think back to our first example of 1 + 2 * 3. Here is that expression written as a tree: