Jay Taylor's notes

back to listing index

datacrypt-project/hitchhiker-tree

[web search]
Original source (github.com)
Tags: algorithms big-o data-structures trees computer-science hitchhiker-tree github.com
Clipped on: 2016-08-05

Skip to content
hitchhiker-tree / doc / hitchhiker.adoc
Open this file in GitHub Desktop
146 lines (108 sloc) 9.67 KB

Hitchhiker tree

This document will attempt to sketch out the big ideas in the hitchhiker tree. It will also attempt to call out various locations in the implementation where features are built.

High-level understanding

The goal of the hitchhiker tree is to wed three things: the query performance of a B+ tree, the write performance of an append-only log, and convenience of a functional, persistent datastructure. Let’s look at each of these in some detail.

B+ trees win at queries

You might remember an important theorem from data structures: the best-performing data structure for looking up sorted keys cannot do those queries faster than  O(log(n)) . Since sorted trees provide a solution for this, we’ll start with them. Now, a common sorted tree for this purpose is the Red-Black tree, whose actual query performance is between  log2(n)  and  2*log2(n)  (the write performance is  log2(n) ). The factor of 2 comes from the partial imbalances (which are still asymptotically balanced) that the algorithm allows, and the base 2 of the log comes from the fact that it’s a binary search tree.

A less popular sorted tree is the AVL tree—​this tree achieves  log2(n)  query performance, at the cost of always paying  2*log2(n)  for inserts. We can already see a pattern—​although many trees reach the asymptotic bound, they differ in their constant factors.

The tree that the hitchhiker tree is based off of is the B+ tree, which achieves  logb(n)  query performance. Since  b  can be very large (on the order of 100s or 1000s), these trees are especially great when each node is stored on higher latency media, like remote storage or a disk. This is because each node can contain huge numbers of keys, meaning that by only keeping the index nodes in memory, we can access most keys with fewer, often just one, data accesses.

Unlike the above sorted trees (and B trees, which we won’t discuss), B+ trees only store their data (i.e. the values) in their leaves—​internal nodes only need to store keys.

Event logs win at writing data

Do you know the fastest way to write data? Append it to the end of the file. There’s no pointers, no updating of data structures, no extra IO costs incurred.

Unfortunately, to perform a query on an event log, we need to replay all the data to figure out what happened. That replay costs  O(n) , since it touches every event written. So, how can we fix this?

Unifying B+ trees and event logs

The first idea to understand is this: how can we combine the write performance of an event log with the query performance of a B+ tree? The answer is that we’re going to "overlay" an event log on the B+ tree!

The idea of the overlay is this: each index node of the B+ tree will contain an event log. Whenever we write data, we’ll just append the operation (insert or delete) to the end of the root index node’s event log. In order to avoid the pitfall of appending every operation to an ever-growing event log (which would leave us stuck with linear queries), we’ll put a limit on the number of events that fit in the log. Once the log has overflowed in the root, we’ll split the events in that log towards their eventual destination, adding those events to the event logs of the children of that node. Eventually, the event log will overflow to a leaf node, at which point we’ll actually do the insertion into the B+ tree.

This process gives us several properties:

  • Most inserts are a single append to the root’s event log

  • Although there are a linear number of events, nodes are exponentially less likely to overflow the deeper they are in the tree

  • All data needed for a query exists along a path of nodes between the root and a specific leaf node. Since the logs are constant in size, queries still only read  log(n)  nodes.

Thus we dramatically improve the performance of insertions without hurting the IO cost of queries.

Functional Persistence

Now that we get the sketch of how to combine event logs and B+ trees, let’s see the beauty of making the whole thing functional and persistent! Since the combined B+/log data structure primarily only modifies nodes near the root, we can take advantage of the reduced modification to achieve reduced IO when persisting the tree. We can use the standard path-copying technique from functional, persistent data structures. This gives great performance, since the structure is designed to avoid needing to copy entire paths—​most writes will only touch the root. Furthermore, we can batch many modifications together, and wait to flush the tree, in order to further batch IO.

Code Structure

The hitchhiker tree’s core implementation lives in 2 namespaces:  hitchhiker.tree.core  and  hitchhiker.tree.messaging .  hitchhiker.tree.core  implements the B+ tree and its extensibility hooks;  hitchhiker.tree.messaging  adds the messaging layer (aka log) to the B+ tree from  core .

Protocols

In  hitchhiker.tree.core , we have several important protocols:

 hitchhiker.tree.core/IKeyCompare 

This protocol should be extended to support custom key comparators. It’s just like  clojure.core/compare .

 hitchhiker.tree.core/IResolve 

This protocol is the functionality for a minimal node. Not only will every node implement this, but also backends will use this to implement stubs which automatically & lazily load the full node into memory during queries.

 last-key  is used for searches, so that entire nodes can remain unloaded from memory when we only need their boundary key.

 dirty?  is used to determine whether the IO layer would need to flush this node, or whether it already exists in the backing storage.

 resolve  loads the node into memory—​this could return itself (in the case of an already loaded node), or it could return a new object after waiting on some IO.

 hitchhiker.tree.core/INode 

This protocol implements a node fully in-memory. Generally, this shouldn’t need to be re-implemented; however, if the hitchhiker was to be enhanced with key & value size awareness during splits and merges, you’d want to adjust the methods of this protocol.

 hitchhiker.tree.core/Config 

This structure must be passed to the  hitchhiker.tree.core/b-tree  constructor, which is the only way to get a new tree. The  index-b  is the fanout on index nodes; the  data-b  is the key & value fanout on data nodes, and the  op-buf-size  is the the size of the log at each index node. The internet told me that choosing  sqrt(b)  for the  op-buf-size  and  b-sqrt(b)  for the  index-b  was a good idea, but who knows?

 hitchhiker.tree.core/IBackend 

This protocol implements a backend for the tree.

 new-session  returns a "session" object, which is a convenient way to capture backend-specific stats.

 write-node  will write a a node to storage, returning the stub object which implements  IResolve  for that backend. It can record stats by mutating or logging to the session.

 anchor-root  is called by the persistence functionality to ensure that the backend knows which nodes are roots; this is a hint to any sorts of garbage collectors.

 delete-addr  removes the given node from storage.

 TestingBackend  is a simple implementation of a backend which bypasses serialization and is entirely in memory. It can be a useful reference for the bare minimum implementation of a backend.

 hitchhiker.tree.messaging/IOperation 

This protocol describes an operation to the tree. Currently, there are only  InsertOp  and  DeleteOp , but arbitrary mutation is supported by the data structure.

Useful APIs

 hitchhiker.tree.core/flush-tree 

This takes a tree, does a depth-first search to ensure each node’s children are durably persisted before flushing the node itself. It returns the updated tree & the session under which the IO was performed.  flush-tree  does block on the writing IO—​a future improvement would be to make that non-blocking.

 hitchhiker.tree.messaging/enqueue 

This is the fundamental operation for adding to the event log in a hitchhiker tree.  enqueue  will handle the appending, overflow, and correct propagation of operations through the tree.

 hitchhiker.tree.messaging/apply-ops-in-path 

This is the fundamental operation for reading from the event log in a hitchhiker tree. This finds all the relevant operations on the path to a leaf node, and returns the data that leaf node would contain if all the operations along the path were fully committed. This is conveniently designed to work on entire leaf nodes, so that iteration is as easy as using the same logic as a non-augmented B+ tree, and simply expanding each leaf node from the standard iteration.

 lookup ,  insert ,  delete ,  lookup-fwd-iter 

These are the basic operations on hitchhiker trees. There are implementations in  hitchhiker.tree.core  and  hitchhiker.tree.messaging  which leverage their respective tree variants. They correspond to  get ,  assoc ,  dissoc , and  subseq  on sorted maps.

 hitchhiker.core.b-tree 

This is how to make a new hitchhiker or B+ tree. You should either use the above mutation functions on it from one or the other namespace; it probably won’t work if you mix them.

Related Work

Hitchhiker trees are made persistent with the same method, path copying, as used by Okasaki The improved write performance is made possible thanks to the same buffering technique as a fractal tree index. As it turns out, after I implemented the fractal tree, I spoke with a former employee of Tokutek, a company that commercialized fractal tree indices. That person told me that we’d actually implemented fractal reads identically! This is funny because there’s no documentation anywhere about how exactly you should structure your code to compute the query.