Basic Time Series with Cassandra

By Kelley Reynolds on March 6, 2011

editors note: as of 7 Mar 2012, this info is still current and correct.

One of the most common use cases for Cassandra is tracking time-series data. Server log files, usage, sensor data, SIP packets, stuff that changes over time. For the most part this is a straight forward process but given that Cassandra has real-world limitations on how much data can or should be in a row, there are a few details to consider.

Basic Inserts and Queries

The most basic and intuitive way to go about storing time-series data is to use a column family that has TimeUUID columns (or Long if you know that no two entries willhappen at the same timestamp), use the name of the thing you are monitoring as the row_key (server1-load for example), column_name as the timestamp, and the column_value would be the actual value of the thing (0.75 for example):

Inserting data — {:key => ‘server1-load’, :column_name => TimeUUID(now), :column_value => 0.75}

Using this method, one uses a column_slice to get the data in question:

Load at Time X — {:key => ‘server1-load’, :start => TimeUUID(X), :count => 1}
Load between X and Y – {:key => ‘server1-load’, :start => TimeUUID(X), :end => TimeUUID(Y)}

This works well enough for a while, but over time, this row will get very large. If you are storing sensor data that updates hundreds of times per second, that row will quickly become gigantic and unusable. The answer to that is to shard the data up in some way. To accomplish this, the application has to have a little more intelligence about how to store and query the information.

For this example, we’ll pick a day as our shard interval (details on picking the right shard interval later). The only change we make when we insert our data is to add a day to the row-key:

Inserting data — {:key => ‘server1-load-20110306′, :column_name => TimeUUID(now), :column_value => 0.75}

Using this method, one still uses a column slice, but you have to then also specify a different row_key depending on what you are querying:

Load at Time X — {:key => ‘server1-load-<X.strftime>’, :start => TimeUUID(X), :count => 1}
Load between Time X and Y (if X and Y are on the same day) – {:key => ‘server1-load-<X.strftime>’, :start => TimeUUID(X), :end => TimeUUID(Y)}

Now what to do if X and Y are not on the same day? No worries! You can use a multi-get to fetch more than one key at a time (or issue parallel gets for maximum performance). If your X and Y span two days, you just need to generate keys for those two days and issue them in a multiget:

Load between Time X and Y – {:key => ['server1-load-<X.strftime>', 'server1-load-<Y.strftime>'], :start => TimeUUID(X), :end => TimeUUID(Y)}

Then in your application, you will need to aggregate/concatenate/iterate those two rows however you see fit. If your data spans 3 or more days, you’ll need to also generate every key in between. Don’t be tempted to use the Order-Preserving Partitioner here, it won’t save you that much typing and it’ll will make managing your cluster much more difficult.

Calculating Shard Size

Now on the topic of determining your shard interval .. that’s a complicated topic that is often application dependent but the single biggest issue is to make sure your rows don’t get too big. The better you are at the ops side of Cassandra, the larger you can let your rows get but if I have to tell you that, you should keep them small. A quick ballpark formula for determining shard size is as follows (yes rcoli, it ignores overhead):

shard_size_in_seconds / update_frequency * avg_data_size_in_bytes == row_size_in_bytes

Set your shard size so that the row_size doesn’t get much larger than 10MB (this number can move around for many reasons but I’d consider it safe). For example, if you are storing hits on a website that gets 10 hits/sec and each entry is about 200B, then we have:

Daily — 86400 / (1 / 10) * 200 = 172800000 (165MB)
Hourly — 3600 / (1 / 10) * 200 = 7200000 (6.9MB)

Looks like sharding this up on hours hits our target row size. Of course you can use any shard size you want, 6 hours, 2 hours, seconds, months, whatever. If you code up your application properly, it should be easy to adjust. Even if you decide to change your shard partway through the life if your application, you just have to know that before a certain point, use keys with one format, and after a certain point, use another, it’s that simple.

Indexing and Aggregation

Indexing and aggregation of time-series data is a more complicated topic as they are highly application dependent. Various new and upcoming features of Cassandra also change the best practices for how things like aggregation are done so I won’t go into that. For more details, hit #cassandra on irc.freenode and ask around. There is usually somebody there to help.

About Kelley Reynolds

A full-stack software engineer, an avid trail runner, and a bassoonist. Kelley occasionally writes about one of his many projects on this blog.

View all posts by Kelley Reynolds →

,cassandra

<Beware the supercolumn, its a trap for the unwary!

FFI Ruby Bindings for Phidget Library>

Basic Time Series with Cassandra March 6, 2011
Understanding the Cassandra Data Model from a SQL Perspective September 13, 2010
Beware the supercolumn, its a trap for the unwary! December 4, 2010
Cassandra Data Model from an SQL Perspective – Consistency Level and Replication Factor September 19, 2010
Modbus/RTU via TCP Serial Gateway with Ruby July 28, 2010

kreynolds @ GitHub
cassandra-cql
DBI-like CQL driver for Cassandra in Ruby
phidgets-ffi
FFI Bindings for the Phidgets Library
fusematic
Modified parts for the Maker's Tool Works Fusematic Printer
RaphaelJS-Infobox
Wrapper for placing dynamic HTML on a RaphaelJS Canvas
Priam
Simple Cassandra connection setup for Ruby on Rails
paperclip-cassandra
Cassandra Storage Backend for Paperclip

RubyScale: an Inside Systems joint.