A vendor-independent comparison of NoSQL databases: Cassandra, HBase, MongoDB, Riak
October 22, 2012 04:26 PM ET
Network World - "The more alternatives, the more difficult the choice." -- Abbé D'Allainval
In 2010, when the world became enchanted by the capabilities of cloud systems and new databases designed to serve them, a group of researchers from Yahoo decided to look into NoSQL. They developed the YCSB framework to assess the performance of new tools and find the best cases for their use. The results were published in the paper, "Benchmarking Cloud Serving Systems with YCSB."
The Yahoo guys did a great job, but their paper, like any other, could not include everything:
● The research did not provide all the information we needed for our own analysis.
● Though Cassandra, HBase, Yahoo's PNUTS, and a simple sharded MySQL implementation were analyzed, some of the databases we often work with were not covered.
● Yahoo used high-performance hardware, while it would be more useful for most companies to see how these databases perform on average hardware.
As R&D engineers at Altoros Systems Inc., a big data specialist, we were inspired by Yahoo's endeavors and decided to add some effort of our own. This article is our vendor-independent analysis of NoSQL databases, based on performance measured under different system workloads.
What makes this research unique?
Often referred to as NoSQL, non-relational databases feature elasticity and scalability in combination with a capability to store big data and work with cloud computing systems, all of which make them extremely popular. NoSQL data management systems are inherently schema-free (with no excessive complexity and a flexible data model) and eventually consistent (complying with BASE rather than ACID). They have a simple API, serve huge amounts of data and provide high throughput.
In 2012, the number of NoSQL products reached 120-plus and the figure is still growing. That variety makes it difficult to select the best tool for a particular case. Database vendors usually measure the performance of their products with custom hardware and software settings designed to demonstrate the advantages of their solutions. We wanted to do independent and unbiased research to complement the work done by the folks at Yahoo.
Using Amazon virtual machines to ensure verifiable results and research transparency (which also helped minimize errors due to hardware differences), we have analyzed and evaluated the following NoSQL solutions:
● Cassandra, a column family store
● HBase (column-oriented, too)
● MongoDB, a document-oriented database
● Riak, a key-value store
We also tested MySQL Cluster and sharded MySQL, using them as baselines for comparison.
After some of the results had been presented to the public, some observers said MongoDB should not be compared to other NoSQL databases because it is more targeted at working with memory directly. We certainly understand this, but the aim of this investigation is to determine the best use cases for different NoSQL products. Therefore, the databases were tested under the same conditions, regardless of their specifics.
Tools, libraries and methods
For benchmarking, we used the Yahoo Cloud Serving Benchmark (YCSB), which consists of the following components:
● a framework with a workload generator
● a set of workload scenarios
We measured database performance under certain types of workloads. A workload was defined by different distributions assigned to the two main choices:
• which operation to perform
• which record to read or write
Operations against a data store were randomly selected and could be of the following types:
• Insert: Inserts a new record.
• Update: Updates a record by replacing the value of one field.
• Read: Reads a record, either one randomly selected field, or all fields.
• Scan: Scans records in order, starting at a randomly selected record key. The number of records to scan is also selected randomly from the range between 1 and 100.
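To make the two choices concrete, here is a minimal sketch of how a workload generator might pick the next operation and the next record. It is an illustration, not YCSB's own code: the operation mix in OPERATION_MIX is a hypothetical example, the helper names are ours, and the key is drawn uniformly, which is only one of several possible distributions.

```python
import random

# Hypothetical operation mix (proportions sum to 1.0); real workloads differ.
OPERATION_MIX = {"read": 0.95, "update": 0.05, "insert": 0.0, "scan": 0.0}


def next_operation(rng, mix):
    """Choice 1: which operation to perform, drawn according to the mix."""
    return rng.choices(list(mix), weights=list(mix.values()), k=1)[0]


def next_key(rng, record_count):
    """Choice 2: which record to touch (uniform here, as an assumption)."""
    return f"user{rng.randrange(record_count)}"


def next_scan_length(rng):
    """Scans read a random number of records between 1 and 100."""
    return rng.randint(1, 100)


if __name__ == "__main__":
    rng = random.Random(2012)
    for _ in range(5):
        print(next_operation(rng, OPERATION_MIX), next_key(rng, 100_000_000))
```

Changing the weights in the mix or swapping the key distribution is what turns one workload scenario into another.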
Each workload was targeted at a table of 100,000,000 records; each record was 1,000 bytes in size and contained 10 fields. Each record was identified by a primary key, which was a string such as "user234123." The fields were named field0, field1, and so on. The values in each field were random strings of ASCII characters, 100 bytes each.
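A record with this layout can be sketched in a few lines. This is only an illustration of the shape of the data: the function name make_record and the choice of letters and digits as the ASCII alphabet are our assumptions, not details of the benchmark.

```python
import random
import string

FIELD_COUNT = 10     # fields per record
FIELD_LENGTH = 100   # bytes per field (one ASCII character is one byte)
ALPHABET = string.ascii_letters + string.digits  # assumed character set


def make_record(rng, key_number):
    """Build one record: a string key plus field0..field9 of random ASCII."""
    key = f"user{key_number}"
    fields = {
        f"field{i}": "".join(rng.choices(ALPHABET, k=FIELD_LENGTH))
        for i in range(FIELD_COUNT)
    }
    return key, fields


key, fields = make_record(random.Random(42), 234123)
payload_bytes = sum(len(value.encode("ascii")) for value in fields.values())
print(key, len(fields), payload_bytes)
```

The payload adds up to 10 x 100 = 1,000 bytes per record, not counting the key.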
Database performance was defined by the speed at which a database performed basic operations. A basic operation is an action performed by the workload executor, which drives multiple client threads. Each thread executes a sequential series of operations by making calls to the database interface layer, both to load the database (the load phase) and to execute the workload (the transaction phase). The threads throttle the rate at which they generate requests, so that we can directly control the offered load against the database. In addition, the threads measure the latency and achieved throughput of their operations and report these measurements to the statistics module.
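The throttling and measurement loop described above can be sketched as follows. This is a simplified illustration under our own assumptions, not the YCSB client: the Stats class, client_thread and run_workload are hypothetical names, and do_operation stands in for a call to the database interface layer.

```python
import threading
import time


class Stats:
    """Stand-in for the statistics module: collects per-operation latencies."""

    def __init__(self):
        self._lock = threading.Lock()
        self.latencies = []

    def record(self, seconds):
        with self._lock:
            self.latencies.append(seconds)


def client_thread(do_operation, op_count, target_ops_per_sec, stats):
    """Run operations sequentially, never faster than the target rate."""
    interval = 1.0 / target_ops_per_sec
    next_start = time.perf_counter()
    for _ in range(op_count):
        delay = next_start - time.perf_counter()
        if delay > 0:
            time.sleep(delay)          # throttle: wait for the next slot
        started = time.perf_counter()
        do_operation()                 # placeholder for a database call
        stats.record(time.perf_counter() - started)
        next_start += interval


def run_workload(do_operation, threads, ops_per_thread, target_ops_per_sec):
    """Split the offered load across threads and report what was achieved."""
    stats = Stats()
    per_thread_rate = target_ops_per_sec / threads
    workers = [
        threading.Thread(
            target=client_thread,
            args=(do_operation, ops_per_thread, per_thread_rate, stats),
        )
        for _ in range(threads)
    ]
    started = time.perf_counter()
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    elapsed = time.perf_counter() - started
    total = len(stats.latencies)
    return {
        "operations": total,
        "throughput": total / elapsed,
        "avg_latency": sum(stats.latencies) / total,
    }


result = run_workload(lambda: None, threads=4, ops_per_thread=50,
                      target_ops_per_sec=2000)
print(result)
```

Because the offered load is capped, the achieved throughput can be compared with the target: a database that keeps up reports roughly the target rate, while one that falls behind reports less and shows rising latency.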
Each workload was defined by:
1) The number of records manipulated (read or written)
2) The number of columns per record
3) The total size of a record or the size of each column
4) The number of threads used to load the system
This research also specifies configuration settings for each type of workload. We used the following default settings:
1) 100,000,000 records manipulated
2) The total size of a record equal to 1KB
3) 10 fields of 100 bytes each per record
4) Multithreaded communications with the system (100 threads)
To provide verifiable results, benchmarking was performed on Amazon Elastic Compute Cloud instances. The Yahoo Cloud Serving Benchmark client was deployed on one Amazon Large Instance:
• 7.5GB of memory
• four EC2 Compute Units (two virtual cores with two EC2 Compute Units each)
• 850GB of instance storage
• 64-bit platform
• high I/O performance
• EBS-Optimized (500Mbps)
• API name: m1.large
Each of the NoSQL databases was deployed on a four-node cluster in the same geographical region on Amazon Extra Large Instances:
• 15GB of memory
• eight EC2 Compute Units (four virtual cores with two EC2 Compute Units each)
• 1690GB of instance storage
• 64-bit platform
• high I/O performance
• EBS-Optimized (1000Mbps)
• API name: m1.xlarge
Amazon is often blamed for its high I/O wait time and comparatively slow EBS performance. To mitigate these drawbacks, the EBS disks were assembled into a RAID0 array with striping, after which they were able to provide up to two times higher performance.
When we started our research into NoSQL databases, we wanted to get unbiased results that would show which solution is best suited for each particular task. That is why we decided to test the performance of each database under different types of loads and let the users decide which product best suits their needs.
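For readers unfamiliar with striping, the following sketch shows how RAID0 maps a logical byte offset to a disk and an offset on that disk. The disk count and chunk size are hypothetical values chosen for illustration; the array parameters used in the tests are not stated here.

```python
DISK_COUNT = 4           # hypothetical number of EBS volumes in the array
CHUNK_SIZE = 64 * 1024   # hypothetical stripe chunk size in bytes


def locate(offset, disk_count=DISK_COUNT, chunk_size=CHUNK_SIZE):
    """Return (disk index, byte offset on that disk) for a logical offset."""
    chunk_index, within_chunk = divmod(offset, chunk_size)
    stripe, disk = divmod(chunk_index, disk_count)
    return disk, stripe * chunk_size + within_chunk


# Consecutive chunks land on consecutive disks, so a large sequential
# read or write is spread across all volumes at once.
for chunk in range(6):
    print(chunk, locate(chunk * CHUNK_SIZE))
```

Spreading consecutive chunks across volumes is what allows the array to deliver more throughput than a single volume, at the cost of losing the whole array if any one volume fails.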
We started by measuring the load phase, during which 100 million records, each containing 10 fields of 100 randomly generated bytes, were imported into a four-node cluster.
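A quick back-of-the-envelope calculation shows how much raw data the load phase moves. The sketch below counts only the field payload and assumes the data is spread evenly over the nodes with no replication; keys, storage overhead and replication would all increase the real footprint.

```python
RECORDS = 100_000_000
FIELDS_PER_RECORD = 10
BYTES_PER_FIELD = 100
NODES = 4
NODE_MEMORY_GB = 15  # memory of one Extra Large instance

record_bytes = FIELDS_PER_RECORD * BYTES_PER_FIELD   # payload per record
total_gb = RECORDS * record_bytes / 1e9               # decimal gigabytes
per_node_gb = total_gb / NODES                        # assumes even spread

print(f"payload per record: {record_bytes} bytes")
print(f"total payload:      {total_gb:.0f} GB")
print(f"payload per node:   {per_node_gb:.0f} GB")
print(f"fits in node RAM:   {per_node_gb <= NODE_MEMORY_GB}")
```

Under these assumptions each node holds about 25GB of payload, which is more than its 15GB of memory, so the databases could not serve the whole data set from RAM.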
This read-only workload simulated a data caching system. The data was stored outside the system, while the application was only reading it. Thanks to B-tree indexes, sharded MySQL became the winner in this competition.
HBase performed better than Cassandra in range scans. HBase scanning is a form of hierarchical fast merge-sort operation performed by HRegionScanner. It merges the results received from HStoreScanners (one per family), which, in turn, merge the results received from HStoreFileScanners (one for each file in the family). If caching is turned on, the server returns the specified number of records in one batch instead of requiring a round trip to the HRegionServer for every record.
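The hierarchical merge can be illustrated with a small k-way merge sketch. This is not HBase code: the function names only mirror the scanner roles described above, cells are simplified to (row key, column, value) tuples, and versions, timestamps and deletes are ignored.

```python
import heapq


def store_scanner(store_files):
    """Merge the sorted cell streams of all files in one column family."""
    return heapq.merge(*store_files)


def region_scanner(families):
    """Merge the per-family streams into one sorted stream for the region."""
    return heapq.merge(*(store_scanner(files) for files in families))


# Each inner list plays the role of one store file, already sorted by key.
family_a = [
    [("user1", "a:field0", "x"), ("user3", "a:field0", "z")],
    [("user2", "a:field0", "y")],
]
family_b = [
    [("user1", "b:field1", "p"), ("user2", "b:field1", "q")],
]

cells = list(region_scanner([family_a, family_b]))
for cell in cells:
    print(cell)
```

Because every input is already sorted, each level only has to compare the heads of its inputs, so a scan streams rows in key order without re-sorting the data.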
HBase showed the best results under a workload that included large volumes of writes. Cassandra was second. The NDB engine of MySQL Cluster also managed intensive writes perfectly well.