A vendor-independent comparison of NoSQL databases: Cassandra, HBase, MongoDB, Riak

By Sergey Bushik, senior R&D engineer at Altoros Systems Inc., special to Network World
October 22, 2012 04:26 PM ET

What makes this research unique?

Often referred to as NoSQL, non-relational databases feature elasticity and scalability in combination with a capability to store big data and work with cloud computing systems, all of which make them extremely popular. NoSQL data management systems are inherently schema-free (with no obsessive complexity and a flexible data model) and eventually consistent (complying with BASE rather than ACID). They have a simple API, serve huge amounts of data and provide high throughput.

In 2012, the number of NoSQL products reached 120-plus and the figure is still growing. That variety makes it difficult to select the best tool for a particular case. Database vendors usually measure productivity of their products with custom hardware and software settings designed to demonstrate the advantages of their solutions. We wanted to do independent and unbiased research to complement the work done by the folks at Yahoo.

Using Amazon virtual machines to ensure verifiable results and research transparency (which also helped minimize errors due to hardware differences), we have analyzed and evaluated the following NoSQL solutions:

● Cassandra, a column family store
● HBase (column-oriented, too)
● MongoDB, a document-oriented database
● Riak, a key-value store

We also tested MySQL Cluster and sharded MySQL, taking them as benchmarks.

After some of the results had been presented to the public, some observers said MongoDB should not be compared to other NoSQL databases because it is more targeted at working with memory directly. We certainly understand this, but the aim of this investigation is to determine the best use cases for different NoSQL products. Therefore, the databases were tested under the same conditions, regardless of their specifics.

Tools, libraries and methods

For benchmarking, we used Yahoo Cloud Serving Benchmark, which consists of the following components:

● a framework with a workload generator
● a set of workload scenarios

We have measured database performance under certain types of workloads. A workload was defined by different distributions assigned to the two main choices:

• which operation to perform
• which record to read or write

Operations against a data store were randomly selected and could be of the following types:

• Insert: Inserts a new record.
• Update: Updates a record by replacing the value of one field.
• Read: Reads a record, either one randomly selected field, or all fields.
• Scan: Scans records in order, starting at a randomly selected record key. The number of records to scan is also selected randomly from the range between 1 and 100.

Each workload was targeted at a table of 100,000,000 records; each record was 1,000 bytes in size and contained 10 fields. A primary key identified each record, which was a string, such as "user234123." Each field was named field0, field1, and so on. The values in each field were random strings of ASCII characters, 100 bytes each.

Database performance was defined by the speed at which a database computed basic operations. A basic operation is an action performed by the workload executor, which drives multiple client threads. Each thread executes a sequential series of operations by making calls to the database interface layer both to load the database (the load phase) and to execute the workload (the transaction phase). The threads throttle the rate at which they generate requests, so that we may directly control the offered load against the database. In addition, the threads measure the latency and achieved throughput of their operations and report these measurements to the statistics module.

Workload B: Read mostly
Workload C: Read only
Workload D: Read latest
Workload E: Scan short ranges
Workload F: Read-modify-write
Workload G: Write heavily

Each workload was defined by:

1) The number of records manipulated (read or written)
2) The number of columns per each record
3) The total size of a record or the size of each column
4) The number of threads used to load the system

This research also specifies configuration settings for each type of the workloads. We used the following default settings:

1) 100,000,000 records manipulated
2) The total size of a record equal to 1Kb
3) 10 fields of 100 bytes each per record
4) Multithreaded communications with the system (100 threads)

Testing environment

To provide verifiable results, benchmarking was performed on Amazon Elastic Compute Cloud instances. Yahoo Cloud Serving Benchmark Client was deployed on one Amazon Large Instance:

• 7.5GB of memory
• four EC2 Compute Units (two virtual cores with two EC2 Compute Units each)
• 850GB of instance storage
• 64-bit platform
• high I/O performance
• EBS-Optimized (500Mbps)
• API name: m1.large

Each of the NoSQL databases was deployed on a four-node cluster in the same geographical region on Amazon Extra Large Instances:

• 15GB of memory
• eight EC2 Compute Units (four virtual cores with two EC2 Compute Units each)
• 1690GB of instance storage
• 64-bit platform
• high I/O performance
• EBS-Optimized (1000Mbps)
• API name: m1.xlarge

Amazon is often blamed for its high I/O wait time and comparatively slow EBS performance. To mitigate these drawbacks, EBS disks had been assembled in a RAID0 array with stripping and after that they were able to provide up to two times higher performance.

The results

When we started our research into NoSQL databases, we wanted to get unbiased results that would show which solution is best suitable for each particular task. That is why we decided to test performance of each database under different types of loads and let the users decide what product better suits their needs.

We started with measuring the load phase, during which 100 million records, each containing 10 fields of 100 randomly generated bytes, were imported to a four-node cluster.

1) Read/update ratio: 50/50
2) Zipfian request distribution

1) Read/update ratio: 95/5
2) Zipfian request distribution

1) Read/update ratio: 100/0
2) Zipfian request distribution

This read-only workload simulated a data caching system. The data was stored outside the system, while the application was only reading it. Thanks to B-tree indexes, sharded MySQL became the winner in this competition.

1) Read/update/insert ratio: 95/0/5
2) Latest request distribution
3) Max scan length: 100 records
4) Scan length distribution: uniform

HBase performed better than Cassandra in range scans. HBase scanning is a form of hierarchical fast merge-sort operation performed by HRegionScanner. It merges the results received from HStoreScanners (one per family), which, in their turn, merge the results received from HStoreFileScanners (one for each file in the family). If caching is turned on, the server will simply provide the number of specified records instead of bouncing back to the HRegionServer to process every record.

1) Insert/Read: 90/10
2) Latest request distribution

HBase showed the best results under a workload that included large volumes of writes. Cassandra was second. The NDB engine of MySQL Cluster also managed intensive writes perfectly well.