Implementing Regular Expressions[web search]
Implementing Regular Expressions
This page collects resources about implementing regular expression search efficiently.
Articles and Notes
An introduction to using finite automata to implement regular expression matching, and why the standard backtracking implementation is a bad idea.
An introduction to submatch tracking during efficient (non-backtracking) NFA-based regular expression matching.
Supporting programs: http://code.google.com/p/re1/
A tour of RE2, an efficient, production regular expression implementation.
Supporting programs: http://code.google.com/p/re2/
How Google Code Search worked.
Supporting programs: http://code.google.com/p/codesearch/
If you want to read Ken Thompson's original 1968 paper (see below), you'll want to take this with you.
“Regular Expressions: Languages, Algorithms, and Software”
by Brian W. Kernighan and Rob Pike
The cleanest, simplest, backtracking implementation
you'll ever see.
(But still backtracking, so runs very slowly on some inputs.)
Efficient automaton-based implementation of Perl-syntax regular expressions (excluding backreferences). Written by Russ Cox.
Efficient (non-backtracking) NFA implementation with submatch tracking. Accepts UTF-8 and wide-character Unicode input. Traditional egrep syntax only. Written by Rob Pike.
Efficient (non-backtracking) NFA implementation with submatch tracking. POSIX-compliant, wide-character input. Written by Ville Laurikari.
Very fast DFA-based grep(1) implementation. Accepts UTF-8 input. Traditional egrep syntax only. Written by Ken Thompson.
Michael Rabin and Dana Scott, “Finite automata and their decision problems,” IBM Journal of Research and Development 3 (1959), pp. 114–125. http://www.research.ibm.com/journal/rd/032/ibmrd0302C.pdf
Introduces the concept of NFAs, shows they are equivalent in power to DFAs. Won Rabin and Scott a Turing Award
The first efficient regular expression search. Four pages but dense: every time I read it I learn something new. Take an IBM 7094 cheat sheet with you. See also old slides from a lecture about the paper.
Ville Laurikari, “NFAs with Tagged Transitions, their Conversion to Deterministic Automata and Application to Regular Expressions,” in Proceedings of the Symposium on String Processing and Information Retrieval, September 2000. http://laurikari.net/ville/spire2000-tnfa.ps
How to track submatches during efficient (non-backtracking) NFA-based regular expression search.
M. Douglas McIlroy, “Enumerating the strings of regular languages,” Journal of Functional Programming 14 (2004), pp. 503–518. http://www.cs.dartmouth.edu/~doug/nfa.ps.gz (preprint)
Not relevant to implementing regular expression matching, but a beautiful example of regular expression analysis and manipulation using Haskell.
Copyright © 2007 Russ Cox. All Rights Reserved.