Jay Taylor's notes
back to listing indexJim McBeath: Scala Parser Combinators
[web search]
Original source (jim-mcbeath.blogspot.com)
Clipped on: 2012-08-20
Scala Parser Combinators
One of the reasons I chose to do my
StringArt applet
as my first
Scala applet
was because Scala includes a nifty capability called
parser combinators.
A parser is a function that converts a stream of input tokens into a data structure that can be more easily digested by the application. A combinator is a way of combing two functions to produce a composed function. A parser combinator is a way of combining two parsers to produce another parser.
The Parser classes in Scala include methods to combine Parsers in various ways and to add functions to be executed when a parse succeeds, to build the resulting datastructure.
Parser combinators simplify the creation of some parsers, including the simple expression parsing I wanted. When using Scala, the specification of the syntax to be parsed, which might normally be done in a separate file which is read by a separate parser (as when using a parser generator such as yacc), can be done directly in a Scala source file.
Scala's syntax has two features that make it possible to write a syntax specification directly in Scala that still looks a lot like a BNF description.
In the above code, the tilda character
Strings in the middle of these sequences of parsers, such as
You can cut and paste the above two code fragments into a file, load it into the Scala interpreter with the
We also add the
Now we can enter expressions with parentheses when we want to change the order of operations:
The minus character is already in our set of delimiters, so we don't need to change that list. We add a
Now we can parse expressions that include unary minus:
We replace the
For additional reading check out these other parser combinator pages:
A parser is a function that converts a stream of input tokens into a data structure that can be more easily digested by the application. A combinator is a way of combing two functions to produce a composed function. A parser combinator is a way of combining two parsers to produce another parser.
The Parser classes in Scala include methods to combine Parsers in various ways and to add functions to be executed when a parse succeeds, to build the resulting datastructure.
Parser combinators simplify the creation of some parsers, including the simple expression parsing I wanted. When using Scala, the specification of the syntax to be parsed, which might normally be done in a separate file which is read by a separate parser (as when using a parser generator such as yacc), can be done directly in a Scala source file.
Scala's syntax has two features that make it possible to write a syntax specification directly in Scala that still looks a lot like a BNF description.
- (Almost) any operator string can be used as a method name.
- A method that takes a single argument can be written without parentheses.
Contents
Basic Parser
We start with a simpleExpr
base class for our AST elements,
with case classes for constant values and an add operator.
The base Expr
class is a sealed class, so all of its
case classes, including whatever we add below, must go into the same
source file as the base class.
sealed abstract class Expr {
def eval():Double
}
case class EConst(value:Double) extends Expr {
def eval():Double = value
}
case class EAdd(left:Expr, right:Expr) extends Expr {
def eval():Double = left.eval + right.eval
}
Now we create a simple parser that understands expressions consisting
either of a single integer or two integers with a plus between them.
import scala.util.parsing.combinator.syntactical._
object ExprParser extends StandardTokenParsers {
lexical.delimiters ++= List("+")
def value = numericLit ^^ { s => EConst(s.toDouble) }
def sum = value ~ "+" ~ value ^^ { case left ~ "+" ~ right =>
EAdd(left, right) }
def expr = ( sum | value ) //top level expression
def parse(s:String) = {
val tokens = new lexical.Scanner(s)
phrase(expr)(tokens)
}
def apply(s:String):Expr = {
parse(s) match {
case Success(tree, _) => tree
case e: NoSuccess =>
throw new IllegalArgumentException("Bad syntax: "+s)
}
}
//Simplify testing
def test(exprstr: String) = {
parse(exprstr) match {
case Success(tree, _) =>
println("Tree: "+tree)
val v = tree.eval()
println("Eval: "+v)
case e: NoSuccess => Console.err.println(e)
}
}
//A main method for testing
def main(args: Array[String]) = test(args(0))
}
The parse
, apply
, test
and
main
methods are all helper methods so that we can run our
parser and see what it is doing.
So far our actual parser consists only of the
value
, sum
and expr
definitions.
In the above code, the tilda character
~
is a
Parser
method that composes two parsers into another parser
that sequentially calls the two composed parsers,
and the vertical bar character |
creates a parser which
calls the second parser only if the first parser fails.
The double carat ^^
passes the parsed value to a
function that
builds the AST element.
When we want to match against a parse sequence and extract more than
one value, we can use a case statement after the ^^
.
That case pattern should match the pattern of the parsers
(such as can be seen in the sum
parser and its case pattern).
Strings in the middle of these sequences of parsers, such as
"+"
in the sum
parser, are implicitly converted into
keyword
parsers that accept only that literal string.
You can cut and paste the above two code fragments into a file, load it into the Scala interpreter with the
:load
command, and test it by entering
commands such as the following:
ExprParser.test("123")
ExprParser.test("12 + 34")
Repeated Terms
Our parser only handles additions with two terms. In order to handle additions with three or more terms, we redefineExprParser.sum
using the repsep
method:
def sum = repsep(value,"+") ^^ { a:List[Expr] => ESum(a) }
We also add the ESum
case class:
case class ESum(a:List[Expr]) extends Expr {
def eval():Double = a.foldLeft(0.0)(_ + _.eval)
}
In order to add a subtract operator, we once again modify our
ExprParser.sum
method, this time using the *
operator.
We also add our subtract operator -
to the list of
delimiters:
lexical.delimiters ++= List("+", "-")
...
def sum = value * (
"+" ^^^ { (a:Expr, b:Expr) => EAdd(a,b) } |
"-" ^^^ { (a:Expr, b:Expr) => ESub(a,b) } )
Note that we are using a triple carat ^^^
here rather than
the double carat ^^
that we used earlier.
The triple carat operator does not use any data from the parser
in the return value.
In our use, we are returning a function which will be used by
the *
operator to collapse the values in the list.
We also add the
ESub
case class:
case class ESub(left:Expr, right:Expr) extends Expr {
def eval():Double = left.eval - right.eval
}
Now we can enter expressions like these:
ExprParser.test("2-1")
ExprParser.test("1+2+3-4")
Precedence
Next we add multiply and divide operators. As when we added the subtract operator, we add the*
and /
delimiters to ExprParser.lexical.delimiters
and we add case classes EMul
and EDiv
:
case class EMul(left:Expr, right:Expr) extends Expr {
def eval():Double = left.eval * right.eval
}
case class EDiv(left:Expr, right:Expr) extends Expr {
def eval():Double = left.eval / right.eval
}
class ExprParser {
....
lexical.delimiters ++= List("+", "-", "*", "/")
....
}
However, since multiplication and division have
higher precedence than add and subtract,
we can't just add those operators to the sum
parser
like we added subtract.
Instead, we create a product
parser to handle the
two new operators.
We insert this logically between the sum
and value
parsers:
class ExprParser {
....
def product = value * (
"*" ^^^ { (a:Expr, b:Expr) => EMul(a,b) } |
"/" ^^^ { (a:Expr, b:Expr) => EDiv(a,b) } )
def sum = product * ( ...
....
}
We can now enter expressions containing multiplication as well as addition
and have them properly evaluated:
ExprParser.test("3+4*5")
Parentheses
We would like to be able to use parentheses, so we add those two characters toExprParser.lexical.delimiters
.
We add a parens
parser, we add a term
parser that accepts either a number or a value in parentheses,
and we modify the product
parser to call for a
term
rather than a value
:
class ExprParser {
...
lexical.delimiters ++= List("+", "-", "*", "/", "(", ")")
...
def parens:Parser[Expr] = "(" ~> expr <~ ")"
def term = ( value | parens )
def product = term * ...
def expr = ( sum | term )
...
}
We don't need to provide a value expression for the term
parser because the default value for the |
operator is
the value of whichever expression was parsed, which is what we want.
We don't need to provide a value expression for the parens
parser because the ~>
operator throws away the value
on the left and the <~
operator throws away the value
on the right, so we are left with the value of the expr
in the middle.
Now we can enter expressions with parentheses when we want to change the order of operations:
ExprParser.test("(3+4)*5")
Unary Operators
We want to be able to specify negative numbers. We can't just make a leading negative sign an optional part of ournumericLit
token because then we can't parse a string
such as 2-1
.
Instead, we implement a unary minus operation with a precendence higher
than any of our binary operators.
The minus character is already in our set of delimiters, so we don't need to change that list. We add a
EUMinus
case class, add a unaryMinus
parser definition, and modify the term
parser to
include the unaryMinus
parser:
case class EUMinus(e:Expr) extends Expr {
def eval():Double = -e.eval
}
class ExprParser {
...
def unaryMinus:Parser[EUMinus] = "-" ~> term ^^ { EUMinus(_) }
def term = ( value | parens | unaryMinus)
...
}
We need to specify the return type for unaryMinus
because
it is recursive (through the call to term
),
so Scala does not automatically infer its type.
Now we can parse expressions that include unary minus:
ExprParser.test("3*-(2+2)")
Lexical
The standard Scala lexical analyzer implementation ofnumericLit
only parses integers, but we want floating point notation, so we
extend the standard lexical class to do that.
Note the first line of the class, where it overrides the definition
of the token
method to be either our own
floatingToken
or a token from our superclass.
The rest of ExprLexical
is the definition of the
floatingToken
token, complete with optional signed exponent.
import scala.util.parsing.combinator.lexical.StdLexical
class ExprLexical extends StdLexical {
override def token: Parser[Token] = floatingToken | super.token
def floatingToken: Parser[Token] =
rep1(digit) ~ optFraction ~ optExponent ^^
{ case intPart ~ frac ~ exp => NumericLit(
(intPart mkString "") :: frac :: exp :: Nil mkString "")}
def chr(c:Char) = elem("", ch => ch==c )
def sign = chr('+') | chr('-')
def optSign = opt(sign) ^^ {
case None => ""
case Some(sign) => sign
}
def fraction = '.' ~ rep(digit) ^^ {
case dot ~ ff => dot :: (ff mkString "") :: Nil mkString ""
}
def optFraction = opt(fraction) ^^ {
case None => ""
case Some(fraction) => fraction
}
def exponent = (chr('e') | chr('E')) ~ optSign ~ rep1(digit) ^^ {
case e ~ optSign ~ exp => e :: optSign :: (exp mkString "") :: Nil mkString ""
}
def optExponent = opt(exponent) ^^ {
case None => ""
case Some(exponent) => exponent
}
}
We then modify our ExprParser
class to use our own lexical
parser so that we can parse floats or ints:
object ExprParser extends StandardTokenParsers {
override val lexical = new ExprLexical
...
}
Now we can enter expressions with either integer or floating point numbers:
ExprParser.test("123.456")
ExprParser.test("1.2+3")
ExprParser.test("5e-1 + 2")
Precedence Revisited
If we wanted to add some more binary operators at a different level of precedence (such as the integer shift operators<<
and >>
),
we could handle that the same way as we did when we added
multiply and divide, by adding a new rule and modifying the
existing rules to insert it at the right location.
It would be nicer if we could factor out the precedence rules so
that we don't need multiple binary operator rules.
Scala's parser classes do not directly support precedence rules,
but we can add them relatively easily.
We replace the
product
and sum
parsers
with a binary
parser and we call that instead of sum
from our expr
parser.
We factor out the binary operations into a binaryOp
method
where for each precedence level
we list each binary operator and its corresponding AST node.
The binary
parser takes an argument which is the
current precedence level.
class ExprParser {
...
def binaryOp(level:Int):Parser[((Expr,Expr)=>Expr)] = {
level match {
case 1 =>
"+" ^^^ { (a:Expr, b:Expr) => EAdd(a,b) } |
"-" ^^^ { (a:Expr, b:Expr) => ESub(a,b) }
case 2 =>
"*" ^^^ { (a:Expr, b:Expr) => EMul(a,b) } |
"/" ^^^ { (a:Expr, b:Expr) => EDiv(a,b) }
case _ => throw new RuntimeException("bad precedence level "+level)
}
}
val minPrec = 1
val maxPrec = 2
def binary(level:Int):Parser[Expr] =
if (level>maxPrec) term
else binary(level+1) * binaryOp(level)
def expr = ( binary(minPrec) | term )
...
}
Now if we want to add binary operators at a new precedence level,
we only need to modify the binaryOp
method,
update the value of maxPrec
,
and add our case classes.
Wrap Up
Below is the complete example in its final form. You should be able to copy and paste this one code fragment and have a working four-function floating point expression parser.sealed abstract class Expr {
def eval():Double
}
case class EConst(value:Double) extends Expr {
def eval():Double = value
}
case class EAdd(left:Expr, right:Expr) extends Expr {
def eval():Double = left.eval + right.eval
}
case class ESub(left:Expr, right:Expr) extends Expr {
def eval():Double = left.eval - right.eval
}
case class EMul(left:Expr, right:Expr) extends Expr {
def eval():Double = left.eval * right.eval
}
case class EDiv(left:Expr, right:Expr) extends Expr {
def eval():Double = left.eval / right.eval
}
case class EUMinus(e:Expr) extends Expr {
def eval():Double = -e.eval
}
import scala.util.parsing.combinator.lexical.StdLexical
class ExprLexical extends StdLexical {
override def token: Parser[Token] = floatingToken | super.token
def floatingToken: Parser[Token] =
rep1(digit) ~ optFraction ~ optExponent ^^
{ case intPart ~ frac ~ exp => NumericLit(
(intPart mkString "") :: frac :: exp :: Nil mkString "")}
def chr(c:Char) = elem("", ch => ch==c )
def sign = chr('+') | chr('-')
def optSign = opt(sign) ^^ {
case None => ""
case Some(sign) => sign
}
def fraction = '.' ~ rep(digit) ^^ {
case dot ~ ff => dot :: (ff mkString "") :: Nil mkString ""
}
def optFraction = opt(fraction) ^^ {
case None => ""
case Some(fraction) => fraction
}
def exponent = (chr('e') | chr('E')) ~ optSign ~ rep1(digit) ^^ {
case e ~ optSign ~ exp => e :: optSign :: (exp mkString "") :: Nil mkString ""
}
def optExponent = opt(exponent) ^^ {
case None => ""
case Some(exponent) => exponent
}
}
import scala.util.parsing.combinator.syntactical._
object ExprParser extends StandardTokenParsers {
override val lexical = new ExprLexical
lexical.delimiters ++= List("+","-","*","/","(",")")
def value = numericLit ^^ { s => EConst(s.toDouble) }
def parens:Parser[Expr] = "(" ~> expr <~ ")"
def unaryMinus:Parser[EUMinus] = "-" ~> term ^^ { EUMinus(_) }
def term = ( value | parens | unaryMinus )
def binaryOp(level:Int):Parser[((Expr,Expr)=>Expr)] = {
level match {
case 1 =>
"+" ^^^ { (a:Expr, b:Expr) => EAdd(a,b) } |
"-" ^^^ { (a:Expr, b:Expr) => ESub(a,b) }
case 2 =>
"*" ^^^ { (a:Expr, b:Expr) => EMul(a,b) } |
"/" ^^^ { (a:Expr, b:Expr) => EDiv(a,b) }
case _ => throw new RuntimeException("bad precedence level "+level)
}
}
val minPrec = 1
val maxPrec = 2
def binary(level:Int):Parser[Expr] =
if (level>maxPrec) term
else binary(level+1) * binaryOp(level)
def expr = ( binary(minPrec) | term )
def parse(s:String) = {
val tokens = new lexical.Scanner(s)
phrase(expr)(tokens)
}
def apply(s:String):Expr = {
parse(s) match {
case Success(tree, _) => tree
case e: NoSuccess =>
throw new IllegalArgumentException("Bad syntax: "+s)
}
}
def test(exprstr: String) = {
parse(exprstr) match {
case Success(tree, _) =>
println("Tree: "+tree)
val v = tree.eval()
println("Eval: "+v)
case e: NoSuccess => Console.err.println(e)
}
}
//A main method for testing
def main(args: Array[String]) = test(args(0))
}
In my StringArt applet I also implemented variables and function calls,
but those did not demonstrate any additional features of Scala's
parser combinators,
so I have left that as an exercise for the reader.
For additional reading check out these other parser combinator pages:
- Debasish Ghosh blog entry in which he developers a parser for a stock (shares) trading language that looks very much like written English.
- Daniel Spiewak blog entry in which he builds a parser for a simple language. Look in the comments for a link to an on-line version of Wadler's 1985 paper.
-
Matt Malone blog entry in which be builds a parser for the
Subversion client-server message protocol syntax,
using lots of
RegexParsers
. - Part 4 of Stefan Zeiger blog on parser combinators, where he creates a parser for interger expressions using * and +. The three previous parts deal more generally with parser combinators, with the promise of additional parts going into more detail about Scala parser combinators.
- Philip Wadler's 1985 paper on parser combinators.
- Adriaan Moors's Parsing in Scala paper. One of Adriaan's examples is a parser that handles the four basic arithmetic functions. In that example, the evaluation of the expression is done by the parser, whereas in my example I create an AST for later evaluation. I did this because in my real usage, I evaluate the expression multiple times, each time with different values for the variables which can be referenced by the expression.
- The "Programming in Scala" book has a chapter on Parser Combinators.
- The scaladoc page for the Parsers.Parser class in Scala lists the parser operators, and the scaladoc for the Parsers trait lists the parser method names.
- This Parser combinator in Java was done by using parser combinators from Haskell and converting them to Java.
- A 2001 paper about the Haskell Parsec parser.