Jay Taylor's notes

back to listing index

Implementing Regular Expressions

[web search]

Original source (swtch.com)

Tags: programming regular-expressions swtch.com

Clipped on: 2016-07-31

Implementing Regular Expressions

Russ Cox
rsc@swtch.com

This page collects resources about implementing regular expression search efficiently.

Articles and Notes

“Regular Expression Matching Can Be Simple And Fast”

An introduction to using finite automata to implement regular expression matching, and why the standard backtracking implementation is a bad idea.

Supporting programs: NFA | DFA | bounded-memory DFA | timing scripts

NFAs with submatch tracking: Perl rules | POSIX rules
Transliteration of Thompson's code for bytecode machine and x86, by Jan Burgy.

“Regular Expression Matching: the Virtual Machine Approach”

An introduction to submatch tracking during efficient (non-backtracking) NFA-based regular expression matching.

Supporting programs: http://code.google.com/p/re1/

“Regular Expression Matching in the Wild”

A tour of RE2, an efficient, production regular expression implementation.

Supporting programs: http://code.google.com/p/re2/

“Regular Expression Matching with a Trigram Index”

How Google Code Search worked.

Supporting programs: http://code.google.com/p/codesearch/

“IBM 7094 Cheat Sheet”

If you want to read Ken Thompson's original 1968 paper (see below), you'll want to take this with you.

“Regular Expressions: Languages, Algorithms, and Software”
by Brian W. Kernighan and Rob Pike

The cleanest, simplest, backtracking implementation you'll ever see.
(But still backtracking, so runs very slowly on some inputs.)

See also Chapter 9 of The Practice of Programming and Chapter 1 of Beautiful Code.

Efficient Implementations

RE2 regular expression library

Efficient automaton-based implementation of Perl-syntax regular expressions (excluding backreferences). Written by Russ Cox.

Plan 9 regular expression library

Efficient (non-backtracking) NFA implementation with submatch tracking. Accepts UTF-8 and wide-character Unicode input. Traditional egrep syntax only. Written by Rob Pike.

TRE regular expression library

Efficient (non-backtracking) NFA implementation with submatch tracking. POSIX-compliant, wide-character input. Written by Ville Laurikari.

Plan 9 grep

Very fast DFA-based grep(1) implementation. Accepts UTF-8 input. Traditional egrep syntax only. Written by Ken Thompson.

References

Michael Rabin and Dana Scott, “Finite automata and their decision problems,” IBM Journal of Research and Development 3 (1959), pp. 114–125. http://www.research.ibm.com/journal/rd/032/ibmrd0302C.pdf

Introduces the concept of NFAs, shows they are equivalent in power to DFAs. Won Rabin and Scott a Turing Award

Ken Thompson, “Regular expression search algorithm,” Communications of the ACM 11(6) (June 1968), pp. 419–422. http://doi.acm.org/10.1145/363347.363387 (PDF)

The first efficient regular expression search. Four pages but dense: every time I read it I learn something new. Take an IBM 7094 cheat sheet with you. See also old slides from a lecture about the paper.

Ville Laurikari, “NFAs with Tagged Transitions, their Conversion to Deterministic Automata and Application to Regular Expressions,” in Proceedings of the Symposium on String Processing and Information Retrieval, September 2000. http://laurikari.net/ville/spire2000-tnfa.ps

How to track submatches during efficient (non-backtracking) NFA-based regular expression search.

M. Douglas McIlroy, “Enumerating the strings of regular languages,” Journal of Functional Programming 14 (2004), pp. 503–518. http://www.cs.dartmouth.edu/~doug/nfa.ps.gz (preprint)

Not relevant to implementing regular expression matching, but a beautiful example of regular expression analysis and manipulation using Haskell.

Jay Taylor's notes