Jay Taylor's notes

back to listing index

CruzDB architecture part 1: it's a log-structured database

[web search]
Original source (nwat.xyz)
Tags: database nwat.xyz
Clipped on: 2018-08-25
keyboard_backspace back home

CruzDB architecture part 1: it's a log-structured database

15 February 2018

CruzDB is a distributed shared-data key-value store that manages all its data in a single, high-performance distributed shared-log. It’s been one the most challenging and interesting projects I’ve hacked on, and this post is the first in a series that will explore the current implementation of the system, and critically, where things are headed.

Throughout this series we will dig into all the details, but here is a brief, high-level description of CruzDB. The system uses multi-version concurrency control for managing transactions, and each transaction in CruzDB reads from an immutable snapshot of the database. This design is a direct result of being built on top of an immutable log. When a transaction finishes it doesn’t commit immediately. All of the information required to replay the transaction—a reference to its snapshot, and a record of the its reads and writes—are packaged into an object called an intention which is then appended to the log. Any node with access to the log can append new transaction intentions, and replay the intentions in the order that they appear in the log. This process of replaying the log is what determines if a transaction commits or aborts, and because the log is totally ordered, every node in the system deterministically makes the same decision about every transaction.

This is a high-level architectural view of the system. At the top is CruzDB which can be linked directly into an application, or accessed through a proxy. It’s API is similar to RocksDB. In the middle is a distributed shared-log that stores all of the data from CruzDB. The implementation of a shared-log that we use is ZLog which is designed to run on top of a distributed object storage system. While ZLog is designed to run on a variety of distributed storage systems, we use the Ceph distributed object storage system, represented at the lowest level in the diagram.

Image (Asset 1/18) alt=

Back to top