Jay Taylor's notes
back to listing indexImplementing Regular Expressions
[web search]Implementing Regular Expressions
Russ Cox
rsc@swtch.com
This page collects resources about implementing regular expression search efficiently.
Articles and Notes
“Regular Expression Matching Can Be Simple And Fast”
An introduction to using finite automata to implement regular expression matching, and why the standard backtracking implementation is a bad idea.
Supporting programs: NFA | DFA | bounded-memory DFA | timing scripts
NFAs with submatch tracking:
Perl rules |
POSIX rules
Transliteration of Thompson's code for bytecode machine and x86, by Jan Burgy.
“Regular Expression Matching: the Virtual Machine Approach”
An introduction to submatch tracking during efficient (non-backtracking) NFA-based regular expression matching.
Supporting programs: http://code.google.com/p/re1/
“Regular Expression Matching in the Wild”
A tour of RE2, an efficient, production regular expression implementation.
Supporting programs: http://code.google.com/p/re2/
“Regular Expression Matching with a Trigram Index”
How Google Code Search worked.
Supporting programs: http://code.google.com/p/codesearch/
If you want to read Ken Thompson's original 1968 paper (see below), you'll want to take this with you.
“Regular Expressions: Languages, Algorithms, and Software”
by Brian W. Kernighan and Rob Pike
The cleanest, simplest, backtracking implementation
you'll ever see.
(But still backtracking, so runs very slowly on some inputs.)
See also Chapter 9 of The Practice of Programming and Chapter 1 of Beautiful Code.
Efficient Implementations
RE2 regular expression library
Efficient automaton-based implementation of Perl-syntax regular expressions (excluding backreferences). Written by Russ Cox.
Plan 9 regular expression library
Efficient (non-backtracking) NFA implementation with submatch tracking. Accepts UTF-8 and wide-character Unicode input. Traditional egrep syntax only. Written by Rob Pike.
TRE regular expression library
Efficient (non-backtracking) NFA implementation with submatch tracking. POSIX-compliant, wide-character input. Written by Ville Laurikari.
Very fast DFA-based grep(1) implementation. Accepts UTF-8 input. Traditional egrep syntax only. Written by Ken Thompson.
References
Michael Rabin and Dana Scott, “Finite automata and their decision problems,” IBM Journal of Research and Development 3 (1959), pp. 114–125. http://www.research.ibm.com/journal/rd/032/ibmrd0302C.pdf
Introduces the concept of NFAs, shows they are equivalent in power to DFAs. Won Rabin and Scott a Turing Award
Ken Thompson, “Regular expression search algorithm,” Communications of the ACM 11(6) (June 1968), pp. 419–422. http://doi.acm.org/10.1145/363347.363387 (PDF)
The first efficient regular expression search. Four pages but dense: every time I read it I learn something new. Take an IBM 7094 cheat sheet with you. See also old slides from a lecture about the paper.
Ville Laurikari, “NFAs with Tagged Transitions, their Conversion to Deterministic Automata and Application to Regular Expressions,” in Proceedings of the Symposium on String Processing and Information Retrieval, September 2000. http://laurikari.net/ville/spire2000-tnfa.ps
How to track submatches during efficient (non-backtracking) NFA-based regular expression search.
M. Douglas McIlroy, “Enumerating the strings of regular languages,” Journal of Functional Programming 14 (2004), pp. 503–518. http://www.cs.dartmouth.edu/~doug/nfa.ps.gz (preprint)
Not relevant to implementing regular expression matching, but a beautiful example of regular expression analysis and manipulation using Haskell.
Copyright © 2007 Russ Cox. All Rights Reserved.
http://swtch.com/~rsc/regexp/