Index 1,600,000,000 Keys with Automata and Rust - Andrew Gallant's Blog
Nov 11, 2015
It turns out that finite state machines are useful for things other than expressing computation. Finite state machines can also be used to compactly represent ordered sets or maps of strings that can be searched very quickly.
In this article, I will teach you about finite state machines as a data structure for representing ordered sets and maps. This includes introducing an implementation written in Rust called the fst crate. It comes with complete API documentation. I will also show you how to build them using a simple command line tool. Finally, I will discuss a few experiments culminating in indexing over 1,600,000,000 URLs (134 GB) from the July 2015 Common Crawl Archive.
The technique presented in this article is also how Lucene represents a part of its inverted index.
Along the way, we will talk about memory maps, automaton intersection with regular expressions, fuzzy searching with Levenshtein distance and streaming set operations.
Target audience: Some familiarity with programming and fundamental data structures. No experience with automata theory or Rust is required.
Teaser
As a teaser to show where we’re headed, let’s take a quick look at an example.
We won’t look at 1,600,000,000 strings quite yet. Instead, consider ~16,000,000 Wikipedia article titles (384 MB). Here’s how to index them:
$ time fst set --sorted wiki-titles wiki-titles.fst
real 0m18.310s
The resulting index, wiki-titles.fst, is 157 MB. By comparison, gzip takes 12 seconds and compresses to 91 MB. (For some data sets, our indexing scheme can beat gzip in both speed and compression ratio.)
However, here’s something gzip cannot do: quickly find all article titles starting with Homer the:
$ time fst grep wiki-titles.fst 'Homer the.*'
Homer the Clown
Homer the Father
Homer the Great
Homer the Happy Ghost
Homer the Heretic
Homer the Moe
Homer the Smithers
...
real 0m0.023s
By comparison, grep takes 0.3 seconds on the original uncompressed data.
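To see why an ordered set makes a prefix query cheap, here is a minimal sketch that uses the standard library's BTreeSet as a stand-in for the index. It is not the fst crate and not how the fst tool is implemented; the function name keys_with_prefix and the sample titles are made up for illustration. Because the keys are sorted, every match sits in one contiguous run, so the search seeks to the first candidate and stops at the first non-match instead of scanning everything.

```rust
use std::collections::BTreeSet;
use std::ops::Bound;

/// Return every key in `set` that starts with `prefix`, in sorted order.
/// (Hypothetical helper; BTreeSet stands in for the real index.)
fn keys_with_prefix<'a>(set: &'a BTreeSet<String>, prefix: &str) -> Vec<&'a str> {
    // Seek to the first key >= prefix, then walk forward while keys still match.
    set.range::<str, _>((Bound::Included(prefix), Bound::Unbounded))
        .take_while(|key| key.starts_with(prefix))
        .map(|key| key.as_str())
        .collect()
}

fn main() {
    // Made-up sample data, not the Wikipedia title set.
    let titles: BTreeSet<String> = vec![
        "Homer Simpson",
        "Homer the Clown",
        "Homer the Great",
        "Homerton",
        "Iliad",
    ]
    .into_iter()
    .map(String::from)
    .collect();

    for title in keys_with_prefix(&titles, "Homer the") {
        println!("{}", title);
    }
}
```

A B-tree answers this query quickly too, but it stores every key in full. The data structure in this article aims to support the same kind of ordered lookup while also compressing the keys.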
And finally, for something that even grep cannot do: quickly find all article titles within a certain edit distance of Homer Simpson:
$ time fst fuzzy wiki-titles.fst --distance 2 'Homer Simpson'
Home Simpson
Homer J Simpson
Homer Simpson
Homer Simpsons
Homer simpson
Homer simpsons
Hope Simpson
Roger Simpson
real 0m0.094s
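For readers unfamiliar with edit distance, the sketch below computes Levenshtein distance with the textbook dynamic-programming recurrence and uses it to filter a small list. This is a brute-force illustration of what "within distance 2" means, assuming distance is counted in Unicode scalar values; it is not how the fst tool performs the query, and the function name and sample titles are made up.

```rust
/// Levenshtein distance: the minimum number of single-character insertions,
/// deletions and substitutions needed to turn `a` into `b`.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the prefix of `a` processed so far
    // and the first j characters of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let substitute = prev[j] + if ca == cb { 0 } else { 1 };
            let delete = prev[j + 1] + 1;
            let insert = cur[j] + 1;
            cur.push(substitute.min(delete).min(insert));
        }
        prev = cur;
    }
    prev[b.len()]
}

fn main() {
    // Made-up sample data; a linear scan like this does not scale to millions of keys.
    let titles = ["Homer Simpson", "Hope Simpson", "Roger Simpson", "Bart Simpson"];
    for title in titles.iter().filter(|t| levenshtein("Homer Simpson", t) <= 2) {
        println!("{}", title);
    }
}
```

Scanning every key and running this recurrence costs time proportional to the whole data set, which is what makes the 0.094 second result above notable.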
This article is quite long, so if you only came for the fanfare, then you may skip straight to the section where we index 1,600,000,000 keys.
Table of Contents
This article is pretty long, so I’ve put together a table of contents in case you want to skip around.
The first section discusses finite state machines and their use as data structures in the abstract. This section is meant to give you a mental model with which to reason about the data structure. There is no code in this section.
The second section takes the abstraction developed in the first section and demonstrates it with an implementation. This section is mostly intended to be an overview of how to use my fst library. This section contains code. We will discuss some implementation details, but will avoid the weeds. It is okay to skip this section if you don’t care about the code and instead only want to see experiments on real data.
The third and final section demonstrates use of a simple command line tool to build indexes. We will look at some real data sets and attempt to reason about the performance of finite state machines as a data structure.
- Finite state machines as data structures
- The FST library
- The FST command line tool
- Lessons and trade offs
- Conclusion
Finite state machines as data structures
A finite state machine (FSM) is a collection of states and a collection of transitions that move from one state to the next. One state is marked as the start state and zero or more states are marked as final states. An FSM is always in exactly one state at a time.
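The definition above can be written down almost directly. The following is a minimal sketch, assuming a deterministic machine whose transitions are labeled with bytes; that labeling, along with the Fsm type and its field names, is an illustrative choice and not taken from the fst crate.

```rust
use std::collections::{HashMap, HashSet};

/// A deterministic finite state machine with byte-labeled transitions.
/// (Hypothetical type for illustration only.)
struct Fsm {
    start: usize,
    finals: HashSet<usize>,
    /// (current state, input byte) -> next state
    transitions: HashMap<(usize, u8), usize>,
}

impl Fsm {
    /// Feed `input` to the machine one byte at a time. The machine is in
    /// exactly one state at each step; the input is accepted if the machine
    /// ends in a final state.
    fn accepts(&self, input: &[u8]) -> bool {
        let mut state = self.start;
        for &byte in input {
            match self.transitions.get(&(state, byte)) {
                Some(&next) => state = next,
                None => return false, // no transition: reject
            }
        }
        self.finals.contains(&state)
    }
}

fn main() {
    // A machine with four states that accepts only the string "jul":
    // 0 -j-> 1 -u-> 2 -l-> 3 (final)
    let mut transitions = HashMap::new();
    transitions.insert((0, b'j'), 1);
    transitions.insert((1, b'u'), 2);
    transitions.insert((2, b'l'), 3);
    let mut finals = HashSet::new();
    finals.insert(3);
    let fsm = Fsm { start: 0, finals, transitions };

    println!("{}", fsm.accepts(b"jul"));
    println!("{}", fsm.accepts(b"ju"));
}
```

Hash maps keep the sketch short; they say nothing about how a compact on-disk representation would lay out states and transitions.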
FSMs are rather general and can be used to model a number of processes. For example, consider an approximation of the daily life of my cat Cauchy: