Jay Taylor's notes

back to listing index

Writing Parsers Like it is 2017 [pdf] | Hacker News

[web search]
Original source (news.ycombinator.com)
Tags: compilers parsers programming-language-design news.ycombinator.com
Clipped on: 2017-07-19

Image (Asset 1/2) alt=
Image (Asset 2/2) alt=
This should be titled more honestly: perhaps something like "Case studies in rapidly replacing dangerous C parser code with Rust using nom".

The examples are interesting and well presented. But the first sections trying to put a veneer of respectability on rust-all-the-things were a bit rough and got my cynic sense tingling.

Yes, you believe Rust will produce better results: don't try to justify that with facts you don't have ("Several languages were tested .." bullshit, unless you show some data. Likewise the assertions about type-safety and no-GC being essential properties). The data you do have (implementations produced and integrated and tested in a paper-like time frame) are valuable, unfortunately they're cheapened/buried under this false veneer.

We in fact tested multiple languages. I can even point you to various works done at the ANSSI like https://github.com/ANSSI-FR/bootcode_parser (python) or https://github.com/ANSSI-FR/caradoc (OCaml). I tried Haskell for VLC and it was not really suited for it (GC pauses in a synchronized media pipepline, and not meant to be called from C). But this was not a paper about comparing parser implementations.

Type-safety and lack of garbage collection are essential properties, could you tell me why you don't think that's the case?

Giving the reason for our language choice felt useful. Otherwise, it would have really looked like Rust developers steamrolling into projects :)

Is it possible to have "parser generators" (not necessarily in the formal sense of the term) that produce recursive descent parsers? Even if mathematically they can't be perfect, could we have "good enough" ones? Nobody uses parser generators because even though P.G.s "work" on the barest level of ingesting source code and spitting an AST, they don't do anything beyond that. For example, trying to get helpful error messages (think: from old GCC to Clang) from a P.G.s output is close So instead everyone writes custom parsers, and that introduces these problems.

Can't we have smarter parser generators that do make debugging nice, but are still formally verified?

> trying to get helpful error messages

Menhir is a parser generator that tries to do that: http://gallium.inria.fr/~fpottier/slides/fpottier-2015-11-ou... (Don't get scared by the first slide; the title is in French, the rest is in English.)

> formally verified

Menhir has been used for a formally verified parser for C: http://gallium.inria.fr/~xleroy/publi/validated-parser.pdf

The result is in CompCert: https://github.com/AbsInt/CompCert/tree/master/cparser


It generates recursive descent parsers, it even allows left-recursive grammars - it will rewrite the grammar on its own and wrangle it back into left-recursive form for your convenience, no need to shuffle a right-recursive AST back into left-recursive form by yourself.

It's a side project so it could use a lot of improvement, especially in the debugging messages side.

But it does prove left-recursive descent grammars can be generated automatically without getting into infinite loop and without hassle of manually reversing right-recursive AST's.

> Is it possible to have "parser generators" (not necessarily in the formal sense of the term) that produce recursive descent parsers?

Isn't that basically what PEG parser generators are?

For my parsing pet project(s), the following structure seems to serve me best:

- Recursive descent parser for statements

- "Canned" configurable expression parser for expressions: https://github.com/stefanhaustein/expressionparser

- Both use a simple regex based generic tokenizer

- In more complex cases, parse the expressions to an unresolved syntax tree first (constructed from the bottom), then resolve the tree in a secondary top-down step (potentially to a linear representation).

Are you just talking about making parser generators for LL(k) grammars?

It's fairly straightforward to make one. One exercise in Wirth's short book Compiler Construction is to make an LL(1) parser generator. You even make it check for ambiguities. (I've even found it useful on a couple occasions.)

I would claim ANTLR is such a tool.

As long as you're writing parsers, it is forever 1969. Knuth invented all you need to know just recently. Off you go and parse!

Nice overview of the pitfalls of C-parsers, their hardening, a presentation of Rust advantages, parser combinators, the nom crate, its usage, the application to VLC and an intrusion detector, the integration with those complex existing C codebase, fuzzing and a few ideas to improve rust for more security.

And I am happy to see the ANSSI here :)

No mentions to instaparse? https://github.com/Engelberg/instaparse

"Parse IS be something easy as regexp"

In my experience avoiding unlimited recursion/stackoverflows is one of the most annoying parts of writing a parser, so I find it surprising that the paper doesn't even mention that topic.

> In my experience avoiding unlimited recursion/stackoverflows is one of the most annoying parts of writing a parser

This, x10.

My take is that this is perhaps the main reason why everyone tends to rely on hand-written parsers instead of simply using code churned out by a parser generator. Parser generators are developed with a single use case in mind: pick up a grammar definition, perhaps with pre and post-conditions, and proceed to map it to some programming language. Yet, this fails very basic requirements for real world parser applications.

Are there any Context-sensitive algorithms/parsers/generators?

The list at https://en.wikipedia.org/wiki/Context-sensitive_grammar only contains two links, where one of them, "LuZc" seems completely dead with "lorem ipsum" under Downloads, and the other "bnf2xml" seems to be misplaced since BNF is not context-sensitive.

Monadic parser combinators are context-sensitive, because you can arbitrarily branch your parser based on previously parsed content.

Parser combinators are less powerful than what parser generators can do, in terms of expressiveness and efficiency. And we've known parser generators since the 70s.

So I'm not sure what the point is of the title of the article.

(one of the authors here): parser generators are generally good for one thing: parsing programming languages. For more complex formats, where you have to carry state around, or binary formats, they're extremely cumbersome to use. I often meet people that tell me they want the parser generator to end all parsers. But for most real world formats, you'll have to hack around the generator's limitations. Theoretically, parser combinators are less expressive, but they're just functions: you can write whatever you want inside a subparser, and integrate it with the rest. That approach works well, since we wrote a lot of complex parsers with nom.

I thought most real world compilers tend to be hand written rather than using a generator? By no means an expert but that's something I've heard and know to be true for many real world compilers.

They probably should have used this instead: http://scottmcpeak.com/elkhound/

For C and especially C++, this makes a lot of sense. There's so much context dependence in the latter when you get to templates.

Ya, heck, not just C++, but scala and C#. You can tune the heck out of error recovery when you write a parser by hand, as well as support IDE services.

Using a combinator or generator makes sense when you really care just about getting trees out as fast as possible; e.g. what a command line batch compiler or interpreter does.

I suppose that's because they take the standpoint that they have so much testing code available, that any conflicts with the grammar are easily found.

However, in general, you can easily shoot yourself in the foot with a handwritten parser, because you can't see the conflicts by looking at the code locally.

It's usually because getting proper (as in user-friendly) error reporting is a massively weak spot with ever parser generator I've seen to date.

Conflicts is generally one of the least interesting problems in writing a parser unless you're designing a new language, and especially when hand writing recursive descent parsers, potential conflicts/ambiguities tend to become obvious quickly.

For prototyping a new language, sure, use a parser generator to test the grammar.

I too fall in the category of loving the idea of parser generator (and having written a few) but always falling back on hand-written parsers because the generators I've seen have all been inadequate. I hope that chances some day.

> For more complex formats, where you have to carry state around, or binary formats, they're extremely cumbersome to use.

That's not a theoretical issue, but more a practical issue.

> you'll have to hack around the generator's limitations

You're probably thinking of LALR(1) class generators. But we have GLR parser generators for a long time now, and they are very flexible and have little limitations. See e.g. [1].

[1] http://scottmcpeak.com/elkhound/

How do you parse ASN.1 with this generator? Or JPEG?

Or especially Semantic Designs' tool that gets a ton of mileage out of GLR:


It may be true that parser generators are superior from a technical point of view (I can't comment), but parser combinators sure can be nice to use. I very recently used nom to implement a parser for a structured text format, it feels wonderfully expressive and you can knock together a reasonably sophisticated parser in a matter of hours. Performance-wise, it is zero-copy and chomps through a 100MB file in less than half a second on my crappy laptop, so plenty fast enough for most situations.

The nicest thing about parser generators, IMO, is that they can warn you if your grammar is ambiguous.

> So I'm not sure what the point is of the title of the article.

The rest of the paper elaborates on the title.

And if you really care about error recovery, error messages, incremental processing, and tooling integration, writing a parser by hand via recursive descent is always a sure bet. There are good reasons why many production PLs don't even use generators.

Parser combinators let you put a breakpoint on a rule and get a call trace. (Same with plain old recursive descent or anything just involving function calls.)

That alone pretty much wipes out any remaining advantages of parser generators, in which the "cow" of your rules has been turned into a "hamburger" state machine that is very difficult to follow, usually having very poor debug support compared to the maturity of the rest of the tooling. ("How come if I add a variant to ths rule, I get a reduce/reduce conflict in these five other rules elsewhere? Waaaah ...")

Lest there be any doubt: GCC uses a hand-written recursive descent parser for C++. (Meg and a half of code and increasing.)

That's virtually a proof that just parsing with functional decomposition is good enough for anything.

Another thing is that with functional parsing, you can use the exception handling (or other non-local, dynamic control transfers) of the programming language for recovery and speculative parsing. This parse didn't work? Chuck the whole damn branch of the parse with a dynamic return, and try going down another rabbit hole.

That's more a proof that the grammar of C++ is absolutely terrible. More modern languages such as rust have been carefully designed to be parsable by non-necronomicon-level code.

> That's more a proof that the grammar of C++ is absolutely terrible. More modern languages such as rust have been carefully designed to be parsable by non-necronomicon-level code.

I don't believe your comment is fair or correct, and even very naive. Considering GCC's case, what makes a parser complex is not the language grammar itself, but all the requirements set onto the compiler to enable it to churn out intelligible warnings and error messages, which means supporting typical errors as extensions of the language grammar.

Furthermore, GCC's C++ parser support half a dozen different versions of C++ and all the error and warning messages that come with targetting a language standard but using a feature not supported by it. Rust has no such requirement, nor it will have any time soon.

Your comment strikes me as the old and faded tendency to throw baseless complains about tried and true technology by pointing out how green and untested tech is somehow better because it's yet to satisfy real-world requirements.

You are incorrect: the grammar of both C and C++ is fundamentally hard to parse, let alone the extra complexity of giving nice error message and doing good recovery. E.g. they require context to parse https://stackoverflow.com/a/41331994/1256624 , https://en.m.wikipedia.org/wiki/Most_vexing_parse , and C++ is undecidable to parse correctly http://blog.reverberate.org/2013/08/parsing-c-is-literally-u... .

> That's more a proof that the grammar of C++ is absolutely terrible.

Oh, I took that for granted as the basis of my remarks: far from a random choice on my part.

File that next to the 'proof' that the grammar of English is absolutely terrible. More modern languages such as Esperanto have been carefully designed to be learnable by regular human beings.

I know which language I'd rather be fluent in.

What is this trying to say? If a crucial portion of your job depends on the ability to parse a language (and it does, if you're a programmer who uses an IDE), then that's a point in favor of a language that's LL(1) rather than context-dependent. Making an analogy to natural languages here isn't relevant.

In the context of the discussion, it has been pointed out that parser generators are not used in practice in multiple very successful compilers for very successful real world languages. The grandparent to my comment claimed that the fact that GCC uses a hand written recursive descent parser is proof that that approach to parsing is good enough for anything. The parent claimed that no, it's a proof that the grammar of C++ is 'terrible'.

My point was that grammatical purity doesn't appear to correlate particularly in the real world with the success/popularity of a language. My analogy to natural languages was relevant in that context. If parser generators are not well suited to some very successful existing languages, it's not particular useful to blame that on the languages and point to currently niche (and relatively young) languages that don't have that 'problem'. In the real world, the most used languages have and will continue to have for the foreseeable future 'terrible' grammars so there will continue to be a need for parsing techniques that can handle them.

I'd actually speculate further that the analogy to natural languages is relevant in that it may not be a coincidence that the most used languages have some of the most complex grammars. Why that might be the case is an interesting question to think about.

> that's a point in favor of a language that's LL(1) rather than context-dependent

and 5 points in favor of a language that's LL(0) rather than LL(1). The logical conclusion of your argument is to use Lisp everywhere.

Lisp is LL(1). If we have #, we don't know what we're looking at until we read the next character, like #s structure, #( vector, #= circle notation, etc.

Actually there can be an integer between the two; but that doesn't change what kind of syntax is being read so it arguably doesn't push things to LL(2).

Other examples: seeing (a . we don't know whether this is the consing dot notation, or the start of a floating-point token.

Speaking of tokens, the Lisp token conversion rules effectively add up to LL(k). 12345 could be an integer or symbol. If the next character is, say, "a" and the token ends, we get a symbol. Basically if we see a token constituent character then k more characters have to be scanned before we can decide what kind of token and consequently what object to reduce to.

That variety of Lisp with its particular brand of syntactic sugar is LL(1), and its lexing LL(k). Because of its circumfix notation, a Lisp language can be LL(0) when the prefix symbols and tokens are defined appropriately. That was the point of my original comment.

Are they? I thought parser combinators made parsing context sensitive grammars easier than when using parser generators.

I think it's more a response to the wave of attacks against unsafe parsers, such as those used to read subtitles in media players.

Well, we gladly trade power and efficiency all the time for more intuitiveness and ease of use in computing...

On top of that, we even have some that are formally verified or validating that could be tied into Rust.




One of those handled most of C99 standard.

Sorry for the low effort comment but Clever Cloud is an awesome name

Thanks, it's a pretty cool product too. (yeah I work there)

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact