It turns out that finite state machines are useful for things other than
expressing computation. Finite state machines can also be used to compactly
represent ordered sets or maps of strings that can be searched very quickly.
In this article, I will teach you about finite state machines as a data
structure for representing ordered sets and maps. This includes introducing
an implementation written in Rust, which comes with complete API
documentation.
I will also show you how to build them using a simple command line tool.
Finally, I will discuss a few experiments culminating in indexing over
1,600,000,000 URLs (134 GB) from the
July 2015 Common Crawl Archive.
The technique presented in this article is also how
Lucene represents a part of its inverted index.
Along the way, we will talk about memory maps, automaton intersection with
regular expressions, fuzzy searching with Levenshtein distance, and streaming.
Target audience: Some familiarity with programming and fundamental data
structures. No experience with automata theory or Rust is required.
As a teaser to show where we’re headed, let’s take a quick look at an example.
We won’t look at 1,600,000,000 strings quite yet. Instead, consider ~16,000,000
Wikipedia article titles (384 MB). Here’s how to index them:
$ time fst set --sorted wiki-titles wiki-titles.fst
The resulting index, wiki-titles.fst, is 157 MB. By comparison, gzip takes
12 seconds and compresses to 91 MB. (For some data sets, our indexing
scheme can beat gzip in both speed and compression ratio.)
However, here’s something gzip cannot do: quickly find all article titles
starting with “Homer the”:
$ time fst grep wiki-titles.fst 'Homer the.*'
Homer the Clown
Homer the Father
Homer the Great
Homer the Happy Ghost
Homer the Heretic
Homer the Moe
Homer the Smithers
By comparison, grep takes 0.3 seconds on the original uncompressed data.
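To see why an ordered set lends itself to this kind of query, here is a minimal sketch using the standard library's `BTreeSet` rather than a finite state machine. It is not how the `fst` tool works internally, and the helper name `with_prefix` is hypothetical. It only illustrates that when keys are kept in sorted order, all keys sharing a prefix sit next to each other, so a prefix search can seek to the first candidate and stop at the first mismatch.

```rust
use std::collections::BTreeSet;
use std::ops::Bound;

/// Return every key in the sorted set that starts with `prefix`.
/// (Hypothetical helper for illustration; not part of any library.)
fn with_prefix<'a>(set: &'a BTreeSet<String>, prefix: &str) -> Vec<&'a str> {
    // Seek to the first key >= prefix, then scan forward while keys match.
    let bounds: (Bound<&str>, Bound<&str>) = (Bound::Included(prefix), Bound::Unbounded);
    set.range::<str, _>(bounds)
        .take_while(|k| k.starts_with(prefix))
        .map(|k| k.as_str())
        .collect()
}

/// A tiny made-up key set standing in for the article titles.
fn sample_titles() -> BTreeSet<String> {
    ["Homer", "Homer the Clown", "Homer the Great", "Homerus", "Marge"]
        .iter()
        .map(|s| s.to_string())
        .collect()
}

fn main() {
    let titles = sample_titles();
    for title in with_prefix(&titles, "Homer the") {
        println!("{}", title);
    }
}
```

The scan touches only the matching keys plus one extra, regardless of how many other keys the set holds.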
And finally, for something that even grep cannot do: quickly find all article
titles within a certain edit distance of “Homer Simpson”:
$ time fst fuzzy wiki-titles.fst --distance 2 'Homer Simpson'
Homer J Simpson
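For readers unfamiliar with edit distance, the following is a minimal sketch of the textbook dynamic programming computation of Levenshtein distance between two strings. It is an assumption-laden illustration only: the function name `levenshtein` is mine, and comparing a query against every key like this is not how the fuzzy search above is carried out. It simply shows what "distance 2" means for the result above.

```rust
/// Levenshtein distance between two strings, counted in Unicode scalar values:
/// the minimum number of single-character insertions, deletions, and
/// substitutions needed to turn `a` into `b`.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] is the distance between the first i-1 chars of `a`
    // and the first j chars of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for i in 1..=a.len() {
        // cur[0] = i: deleting i chars turns a prefix of `a` into "".
        let mut cur = vec![i; b.len() + 1];
        for j in 1..=b.len() {
            let substitution = prev[j - 1] + usize::from(a[i - 1] != b[j - 1]);
            let deletion = prev[j] + 1;
            let insertion = cur[j - 1] + 1;
            cur[j] = substitution.min(deletion).min(insertion);
        }
        prev = cur;
    }
    prev[b.len()]
}

fn main() {
    // Inserting "J" and a space turns one title into the other.
    println!("{}", levenshtein("Homer Simpson", "Homer J Simpson"));
}
```

This runs in time proportional to the product of the two string lengths, which is fine for one pair of strings but would be slow if repeated naively across millions of keys.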
This article is quite long, so if you only came for the fanfare, then you may
skip straight to the
section where we index 1,600,000,000 keys.
Table of Contents
This article is pretty long, so I’ve put together a table of contents in case
you want to skip around.
The first section discusses finite state machines and their use as data
structures in the abstract. This section is meant to give you a mental model
with which to reason about the data structure. There is no code in this
section.
The second section takes the abstraction developed in the first section and
demonstrates it with an implementation. This section is mostly intended to
be an overview of how to use my
library. This section contains code. We will discuss some implementation
details, but will avoid the weeds. It is okay to skip this section if you don’t
care about the code and instead only want to see experiments on real data.
The third and final section demonstrates use of a simple command line tool to
build indexes. We will look at some real data sets and attempt to reason about
the performance of finite state machines as a data structure.
Finite state machines as data structures
A finite state machine (FSM) is a collection of states and a collection of
transitions that move from one state to the next. One state is marked as the
start state and zero or more states are marked as final states. An FSM is
always in exactly one state at a time.
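The definition above can be written down directly. What follows is a minimal sketch under a few assumptions of my own: states are numbered, inputs are bytes, and an input is "accepted" when the machine ends in a final state after consuming it. The names `Fsm`, `accepts`, and `example` are hypothetical and chosen only for illustration; this is not the representation used by any particular library.

```rust
use std::collections::{HashMap, HashSet};

/// A deterministic finite state machine over bytes (illustrative layout only).
struct Fsm {
    /// The single start state.
    start: usize,
    /// Zero or more final states.
    finals: HashSet<usize>,
    /// (current state, input byte) -> next state
    transitions: HashMap<(usize, u8), usize>,
}

impl Fsm {
    /// Follow one transition per input byte, beginning at the start state.
    /// Returns true only if every byte has a transition and the machine
    /// ends in a final state.
    fn accepts(&self, input: &[u8]) -> bool {
        // The machine is in exactly one state at a time.
        let mut state = self.start;
        for &b in input {
            match self.transitions.get(&(state, b)) {
                Some(&next) => state = next,
                None => return false,
            }
        }
        self.finals.contains(&state)
    }
}

/// Build a three-state machine: 0 -h-> 1 -i-> 2, where 2 is final.
fn example() -> Fsm {
    let mut transitions = HashMap::new();
    transitions.insert((0, b'h'), 1);
    transitions.insert((1, b'i'), 2);
    Fsm { start: 0, finals: HashSet::from([2]), transitions }
}

fn main() {
    let fsm = example();
    println!("{}", fsm.accepts(b"hi"));
    println!("{}", fsm.accepts(b"h"));
}
```

A hash map of transitions is the simplest thing that works for a sketch; it makes no attempt at compactness.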
FSMs are rather general and can be used to model a number of processes. For
example, consider an approximation of the daily life of my cat Cauchy: