Index 1,600,000,000 Keys with Automata and Rust

Nov 11, 2015

It turns out that finite state machines are useful for things other than expressing computation. Finite state machines can also be used to compactly represent ordered sets or maps of strings that can be searched very quickly.

In this article, I will teach you about finite state machines as a data structure for representing ordered sets and maps. This includes introducing an implementation written in Rust called the fst crate. It comes with complete API documentation. I will also show you how to build them using a simple command line tool. Finally, I will discuss a few experiments culminating in indexing over 1,600,000,000 URLs (134 GB) from the July 2015 Common Crawl Archive.

The technique presented in this article is also how Lucene represents a part of its inverted index.

Along the way, we will talk about memory maps, automaton intersection with regular expressions, fuzzy searching with Levenshtein distance and streaming set operations.

Target audience: Some familiarity with programming and fundamental data structures. No experience with automata theory or Rust is required.

Teaser

As a teaser to show where we’re headed, let’s take a quick look at an example. We won’t look at 1,600,000,000 strings quite yet. Instead, consider ~16,000,000 Wikipedia article titles (384 MB). Here’s how to index them:

$ time fst set --sorted wiki-titles wiki-titles.fst

real    0m18.310

The resulting index, wiki-titles.fst, is 157 MB. By comparison, gzip takes 12 seconds and compresses to 91 MB. (For some data sets, our indexing scheme can beat gzip in both speed and compression ratio.)

However, here’s something gzip cannot do: quickly find all article titles starting with Homer the:

$ time fst grep wiki-titles.fst 'Homer the.*'
Homer the Clown
Homer the Father
Homer the Great
Homer the Happy Ghost
Homer the Heretic
Homer the Moe
Homer the Smithers
...

real    0m0.023s

By comparison, grep takes 0.3 seconds on the original uncompressed data.

And finally, for something that even grep cannot do: quickly find all article titles within a certain edit distance of Homer Simpson:

$ time fst fuzzy wiki-titles.fst --distance 2 'Homer Simpson'
Home Simpson
Homer J Simpson
Homer Simpson
Homer Simpsons
Homer simpson
Homer simpsons
Hope Simpson
Roger Simpson

real    0m0.094s

This article is quite long, so if you only came for the fan fare, then you may skip straight to the section where we index 1,600,000,000 keys.

This article is pretty long, so I’ve put together a table of contents in case you want to skip around.

The first section discusses finite state machines and their use as data structures in the abstract. This section is meant to give you a mental model with which to reason about the data structure. There is no code in this section.

The second section takes the abstraction developed in the first section and demonstrates it with an implementation. This section is mostly intended to be an overview of how to use my fst library. This section contains code. We will discuss some implementation details, but will avoid the weeds. It is okay to skip this section if you don’t care about the code and instead only want to see experiments on real data.

The third and final section demonstrates use of a simple command line tool to build indexes. We will look at some real data sets and attempt to reason about the performance of finite state machines as a data structure.

Finite state machines as data structures

A finite state machine (FSM) is a collection of states and a collection of transitions that move from one state to the next. One state is marked as the start state and zero or more states are marked as final states. An FSM is always in exactly one state at a time.

FSM’s are rather general and can be used to model a number of processes. For example, consider an approximation of the daily life of my cat Cauchy:

Jay Taylor's notes