Jay Taylor's notes
back to listing indexdatacrypt-project/hitchhiker-tree
[web search]Hitchhiker tree
This document will attempt to sketch out the big ideas in the hitchhiker tree. It will also attempt to call out various locations in the implementation where features are built.
High-level understanding
The goal of the hitchhiker tree is to wed three things: the query performance of a B+ tree, the write performance of an append-only log, and convenience of a functional, persistent datastructure. Let’s look at each of these in some detail.
B+ trees win at queries
You might remember an important theorem from data structures: the best-performing data structure for looking up sorted keys cannot do those queries faster than O(log(n))
.
Since sorted trees provide a solution for this, we’ll start with them.
Now, a common sorted tree for this purpose is the Red-Black tree, whose actual query performance is between log2(n)
and 2*log2(n)
(the write performance is log2(n)
).
The factor of 2 comes from the partial imbalances (which are still asymptotically balanced) that the algorithm allows, and the base 2 of the log comes from the fact that it’s a binary search tree.
A less popular sorted tree is the AVL tree—this tree achieves log2(n)
query performance, at the cost of always paying 2*log2(n)
for inserts.
We can already see a pattern—although many trees reach the asymptotic bound, they differ in their constant factors.
The tree that the hitchhiker tree is based off of is the B+ tree, which achieves logb(n)
query performance.
Since b
can be very large (on the order of 100s or 1000s), these trees are especially great when each node is stored on higher latency media, like remote storage or a disk.
This is because each node can contain huge numbers of keys, meaning that by only keeping the index nodes in memory, we can access most keys with fewer, often just one, data accesses.
Unlike the above sorted trees (and B trees, which we won’t discuss), B+ trees only store their data (i.e. the values) in their leaves—internal nodes only need to store keys.
Event logs win at writing data
Do you know the fastest way to write data? Append it to the end of the file. There’s no pointers, no updating of data structures, no extra IO costs incurred.
Unfortunately, to perform a query on an event log, we need to replay all the data to figure out what happened.
That replay costs O(n)
, since it touches every event written.
So, how can we fix this?
Unifying B+ trees and event logs
The first idea to understand is this: how can we combine the write performance of an event log with the query performance of a B+ tree? The answer is that we’re going to "overlay" an event log on the B+ tree!
The idea of the overlay is this: each index node of the B+ tree will contain an event log. Whenever we write data, we’ll just append the operation (insert or delete) to the end of the root index node’s event log. In order to avoid the pitfall of appending every operation to an ever-growing event log (which would leave us stuck with linear queries), we’ll put a limit on the number of events that fit in the log. Once the log has overflowed in the root, we’ll split the events in that log towards their eventual destination, adding those events to the event logs of the children of that node. Eventually, the event log will overflow to a leaf node, at which point we’ll actually do the insertion into the B+ tree.
This process gives us several properties:
-
Most inserts are a single append to the root’s event log
-
Although there are a linear number of events, nodes are exponentially less likely to overflow the deeper they are in the tree
-
All data needed for a query exists along a path of nodes between the root and a specific leaf node. Since the logs are constant in size, queries still only read
log(n)
nodes.
Thus we dramatically improve the performance of insertions without hurting the IO cost of queries.
Functional Persistence
Now that we get the sketch of how to combine event logs and B+ trees, let’s see the beauty of making the whole thing functional and persistent! Since the combined B+/log data structure primarily only modifies nodes near the root, we can take advantage of the reduced modification to achieve reduced IO when persisting the tree. We can use the standard path-copying technique from functional, persistent data structures. This gives great performance, since the structure is designed to avoid needing to copy entire paths—most writes will only touch the root. Furthermore, we can batch many modifications together, and wait to flush the tree, in order to further batch IO.
Code Structure
The hitchhiker tree’s core implementation lives in 2 namespaces: hitchhiker.tree.core
and hitchhiker.tree.messaging
.
hitchhiker.tree.core
implements the B+ tree and its extensibility hooks; hitchhiker.tree.messaging
adds the messaging layer (aka log) to the B+ tree from core
.
Protocols
In hitchhiker.tree.core
, we have several important protocols:
hitchhiker.tree.core/IKeyCompare
-
This protocol should be extended to support custom key comparators. It’s just like
clojure.core/compare
. hitchhiker.tree.core/IResolve
-
This protocol is the functionality for a minimal node. Not only will every node implement this, but also backends will use this to implement stubs which automatically & lazily load the full node into memory during queries.
last-key
is used for searches, so that entire nodes can remain unloaded from memory when we only need their boundary key.dirty?
is used to determine whether the IO layer would need to flush this node, or whether it already exists in the backing storage.resolve
loads the node into memory—this could return itself (in the case of an already loaded node), or it could return a new object after waiting on some IO. hitchhiker.tree.core/INode
-
This protocol implements a node fully in-memory. Generally, this shouldn’t need to be re-implemented; however, if the hitchhiker was to be enhanced with key & value size awareness during splits and merges, you’d want to adjust the methods of this protocol.
hitchhiker.tree.core/Config
-
This structure must be passed to the
hitchhiker.tree.core/b-tree
constructor, which is the only way to get a new tree. Theindex-b
is the fanout on index nodes; thedata-b
is the key & value fanout on data nodes, and theop-buf-size
is the the size of the log at each index node. The internet told me that choosingsqrt(b)
for theop-buf-size
andb-sqrt(b)
for theindex-b
was a good idea, but who knows? hitchhiker.tree.core/IBackend
-
This protocol implements a backend for the tree.
new-session
returns a "session" object, which is a convenient way to capture backend-specific stats.write-node
will write a a node to storage, returning the stub object which implementsIResolve
for that backend. It can record stats by mutating or logging to the session.anchor-root
is called by the persistence functionality to ensure that the backend knows which nodes are roots; this is a hint to any sorts of garbage collectors.delete-addr
removes the given node from storage.TestingBackend
is a simple implementation of a backend which bypasses serialization and is entirely in memory. It can be a useful reference for the bare minimum implementation of a backend. hitchhiker.tree.messaging/IOperation
-
This protocol describes an operation to the tree. Currently, there are only
InsertOp
andDeleteOp
, but arbitrary mutation is supported by the data structure.
Useful APIs
hitchhiker.tree.core/flush-tree
-
This takes a tree, does a depth-first search to ensure each node’s children are durably persisted before flushing the node itself. It returns the updated tree & the session under which the IO was performed.
flush-tree
does block on the writing IO—a future improvement would be to make that non-blocking. hitchhiker.tree.messaging/enqueue
-
This is the fundamental operation for adding to the event log in a hitchhiker tree.
enqueue
will handle the appending, overflow, and correct propagation of operations through the tree. hitchhiker.tree.messaging/apply-ops-in-path
-
This is the fundamental operation for reading from the event log in a hitchhiker tree. This finds all the relevant operations on the path to a leaf node, and returns the data that leaf node would contain if all the operations along the path were fully committed. This is conveniently designed to work on entire leaf nodes, so that iteration is as easy as using the same logic as a non-augmented B+ tree, and simply expanding each leaf node from the standard iteration.
lookup
,insert
,delete
,lookup-fwd-iter
-
These are the basic operations on hitchhiker trees. There are implementations in
hitchhiker.tree.core
andhitchhiker.tree.messaging
which leverage their respective tree variants. They correspond toget
,assoc
,dissoc
, andsubseq
on sorted maps. hitchhiker.core.b-tree
-
This is how to make a new hitchhiker or B+ tree. You should either use the above mutation functions on it from one or the other namespace; it probably won’t work if you mix them.
Related Work
Hitchhiker trees are made persistent with the same method, path copying, as used by Okasaki The improved write performance is made possible thanks to the same buffering technique as a fractal tree index. As it turns out, after I implemented the fractal tree, I spoke with a former employee of Tokutek, a company that commercialized fractal tree indices. That person told me that we’d actually implemented fractal reads identically! This is funny because there’s no documentation anywhere about how exactly you should structure your code to compute the query.