Jay Taylor's notes
back to listing indexBitcask
[web search]|Start Here
|Installing
|Riak for Developers
|Riak for Operators
Building a Cluster
Tuning
Running a Cluster
Security
Advanced
Upgrading a Cluster
Multi-Datacenter
|Theory & Concepts
|Community
|APIs
Bitcask
Contents
Bitcask is an Erlang application that provides an API for storing and retrieving key/value data using log-structured hash tables that provide very fast access. The design of Bitcask was inspired, in part, by log-structured filesystems and log file merging.
Bitcask's Strengths
Low latency per item read or written
This is due to the write-once, append-only nature of Bitcask database files.
High throughput, especially when writing an incoming stream of random items
Write operations to Bitcask generally saturate I/O and disk bandwidth, which is a good thing from a performance perspective. This saturation occurs for two reasons: because (1) data that is written to Bitcask doesn't need to be ordered on disk, and (2) the log-structured design of Bitcask allows for minimal disk head movement during writes.
Ability to handle datasets larger than RAM without degradation
Access to data in Bitcask involves direct lookup from an in-memory hash table. This makes finding data very efficient, even when datasets are very large.
Single seek to retrieve any value
Bitcask's in-memory hash table of keys points directly to locations on disk where the data lives. Bitcask never uses more than one disk seek to read a value and sometimes even that isn't necessary due to filesystem caching done by the operating system.
Predictable lookup and insert performance
For the reasons listed above, read operations from Bitcask have fixed, predictable behavior. This is also true of writes to Bitcask because write operations require, at most, one seek to the end of the current open file followed by and append to that file.
Fast, bounded crash recovery
Crash recovery is easy and fast with Bitcask because Bitcask files are append only and write once. The only items that may be lost are partially written records at the tail of the last file that was opened for writes. Recovery operations need to review only the last record or two written and verify CRC data to ensure that the data is consistent.
Easy Backup
In most systems, backup can be very complicated. Bitcask simplifies this process due to its append-only, write-once disk format. Any utility that archives or copies files in disk-block order will properly back up or copy a Bitcask database.
Weaknesses
Keys must fit in memory
Bitcask keeps all keys in memory at all times, which means that your system must have enough memory to contain your entire keyspace, plus additional space for other operational components and operating- system-resident filesystem buffer space.
Installing Bitcask
Bitcask is the default storage engine for Riak. You can verify that
Bitcask is currently being used as the storage backend with the
riak
command interface:
riak config effective | grep backend
If this operation returns anything other than bitcask
, read
the next section for instructions on
switching the backend to Bitcask.
Enabling Bitcask
You can set Bitcask as the storage engine using each node's configuration files:
storage_backend = bitcask
Configuring Bitcask
Bitcask enables you to configure a wide variety of its behaviors, from filesystem sync strategy to merge settings and more.
Riak 2.0 enables you to use either the newer configuration
system based on a single riak.conf
file or the older system, based on an app.config
configuration file.
Instructions for both systems will be included below. Narrative descriptions of the various settings will be tailored to the newer configuration system, whereas instructions for the older system will largely be contained in the code tabs.
The default configuration values for Bitcask are as follows:
bitcask.data_root = ./data/bitcask
bitcask.io_mode = erlang
All of the other available settings listed below can be added to your configuration files.
Open Timeout
The open timeout setting specifies the maximum time Bitcask will block on startup while attempting to create or open the Bitcask data directory. The default is 4 seconds.
In general, you will not need to adjust this setting. If, however, you
begin to receive log messages of the form Failed to start bitcask
backend: ...
, you may want to consider using a longer timeout.
Open timeout is specified using the bitcask.sync.open_timeout
parameter, and can be set in terms of seconds, minutes, hours, etc.
The following example sets the parameter to 10 seconds:
bitcask.sync.open_timeout = 10s
Sync Strategy
Bitcask enables you to configure the durability of writes by specifying
when to synchronize data to disk, i.e. by choosing a sync strategy. The
default setting (none
) writes data into operating system buffers that
will be written to disk when those buffers are flushed by the operating
system. If the system fails before those buffers are flushed, e.g. due
to power loss, that data is lost. This possibility holds for any
database in which values are asynchronously flushed to disk.
Thus, using the default setting of none
protects against data loss in
the event of application failure, i.e. process death, but leaves open a
small window in which data could be lost in the event of a complete
system failure, e.g. hardware or OS failure.
This possibility can be prevented by choosing the o_sync
sync
strategy, which forces the operating system to flush to stable storage
at write time for every write. The effect of flushing each write is
better durability, although it should be noted that write throughput
will suffer because each write will have to wait for the write to
complete.
The following sync strategies are available:
none
— lets the operating system manage syncing writes (default)o_sync
— uses theO_SYNC
flag, which forces syncs on every write- Time interval — Riak will force Bitcask to sync at specified intervals
The following are possible configurations:
bitcask.sync.strategy = none
bitcask.sync.strategy = o_sync
bitcask.sync.strategy = interval
bitcask.sync.interval = 65s
Setting the sync interval to a value lower or equal to
riak_core.vnode_inactivity_timeout
(default: 60 seconds), will
prevent Riak from performing handoffs. A vnode must be inactive
(not receive any messages) for a certain amount of time before the
handoff process can start. The sync mechanism causes a message to be
sent to the vnode for every sync, thus preventing the vnode from ever
becoming inactive.
Max File Size
The max_file_size
setting describes the maximum permitted size for any
single data file in the Bitcask directory. If a write causes the current
file to exceed this size threshold then that file is closed, and a new
file is opened for writes. The default is 2 GB.
Increasing max_file_size
will cause Bitcask to create fewer, larger
files that are merged less frequently, while decreasing it will cause
Bitcask to create more numerous, smaller files that are merged more
frequently.
To give an example, if your ring size is 16, your servers could see as much as 32 GB of data in the bitcask directories before the first merge is triggered, irrespective of your working set size. You should plan storage accordingly and be aware that it is possible to see disk data sizes that are larger than the working set.
The max_file_size
setting can be specified using kilobytes, megabytes,
etc. The following example sets the max file size to 1 GB:
bitcask.max_file_size = 1GB
Hint File CRC Check
During startup, Bitcask will read from .hint
files in order to build
its in-memory representation of the key space, falling back to .data
files if necessary. This reduces the amount of data that must be read
from the disk during startup, thereby also reducing the time required to
start up. You can configure Bitcask to either disregard .hint
files
that don't contain a CRC value or to use them anyway.
If you are using the newer, riak.conf
-based configuration system, you
can instruct Bitcask to disregard .hint
files that do not contain a
CRC value by setting the hintfile_checksums
setting to strict
(the
default). To use Bitcask in a backward-compatible mode that allows for
.hint
files without CRC signatures, change the setting to
allow_missing
.
The following example sets the parameter to strict
:
bitcask.hintfile_checksums = strict
I/O Mode
The io_mode
setting specifies which code module Bitcask should use for
file access. The available settings are:
erlang
(default) — Writes are made via Erlang's built-in file APInif
— Writes are made via direct calls to the POSIX C API
The following example sets io_mode
to erlang
:
bitcask.io_mode = erlang
In general, the nif
IO mode provides higher throughput for certain
workloads, but it has the potential to negatively impact the Erlang VM,
leading to higher worst-case latencies and possible throughput collapse.
O_SYNC
on Linux
Synchronous file I/O via
o_sync
is
supported in Bitcask if io_mode
is set to nif
and is not supported
in the erlang
mode.
If you enable o_sync
by setting io_mode
to nif
, however, you will
still get an incorrect warning along the following lines:
[warning] <0.445.0>@riak_kv_bitcask_backend:check_fcntl:429 {sync_strategy,o_sync} not implemented on Linux
If you are using the older, app.config
-based configuration system, you
can disable the check that generates this warning by adding the
following to the riak_kv
section of your app.config
:
{riak_kv, [
...,
{o_sync_warning_logged, false},
...
]}
Disk Usage and Merging Settings
Riak stores each vnode of the ring as a separate Bitcask directory within the configured Bitcask data directory.
Each of these directories will contain multiple files with key/value data, one or more “hint” files that record where the various keys exist within the data files, and a write lock file. The design of Bitcask allows for recovery even when data isn't fully synchronized to disk (partial writes). This is accomplished by maintaining data files that are append-only (i.e. never modified in-place) and are never reopened for modification (i.e. they are only for reading).
This data management strategy trades disk space for operational efficiency. There can be a significant storage overhead that is unrelated to your working data set but can be tuned in a way that best fits your use case. In short, disk space is used until a threshold is met at which point unused space is reclaimed through a process of merging. The merge process traverses data files and reclaims space by eliminating out-of-date of deleted key/value pairs, writing only the current key/value pairs to a new set of files within the directory.
The merge process is affected by all of the settings described in the sections below. In those sections, “dead” refers to keys that no longer contain the most up-to-date values, while “live” refers to keys that do contain the most up-to-date value and have not been deleted.
Merge Policy
Bitcask enables you to select a merge policy, i.e. when during the day merge operations are allowed to be triggered. The valid options are:
always
— No restrictions on when merge operations can occur (default)never
— Merge will never be attemptedwindow
— Merge operations occur during specified hours
If you are using the newer, riak.conf
-based configuration system, you
can select a merge policy using the merge.policy
setting. The
following example sets the merge policy to never
:
bitcask.merge.policy = never
If you opt to specify start and end hours for merge operations, you can
do so with the merge.window.start
and merge.window.end
settings in addition to setting the merge policy to window
.
Each setting is an integer between 0 and 23 for hours on a 24h clock,
with 0 meaning midnight and 23 standing for 11 pm.
The merge window runs from the first minute of the merge.window.start
hour
to the last minute of the merge.window.end
hour.
The following example enables merging between 3 am and 4:59 pm:
bitcask.merge.policy = window
bitcask.merge.window.start = 3
bitcask.merge.window.end = 17
merge_window
and the Multi backendIf you are using the older configuration system and using Bitcask with
the Multi backend, please note that if you wish to use a merge
window, you must set it in the global bitcask
section of your configuration file. merge_window
settings
in per-backend sections are ignored.
If merging has a significant impact on performance of your cluster, or if your cluster has quiet periods in which little storage activity occurs, you may want to change this setting from the default.
A common way to limit the impact of merging is to create separate merge windows for each node in the cluster and ensure that these windows do not overlap. This ensures that at most one node at a time can be affected by merging, leaving the remaining nodes to handle requests. The main drawback of this approach is that merges will occur less frequently, leading to increased disk space usage.
Merge Triggers
Merge triggers determine the conditions under which merging will be invoked. These conditions fall into two basic categories:
Fragmentation — This describes the ratio of dead keys to total keys in a file that will trigger merging. The value of this setting is an integer percentage (0-100). For example, if a data file contains 6 dead keys and 4 live keys, a merge will be triggered by the default setting (60%). Increasing this value will cause merging to occur less often, whereas decreasing the value will cause merging to happen more often.
Dead Bytes — This setting describes how much data stored for dead keys in a single file will trigger merging. If a file meets or exceeds the trigger value for dead bytes, a merge will be triggered. Increasing the value will cause merging to occur less often, whereas decreasing the value will cause merging to happen more often. The default is 512 MB.
When either of these constraints are met by any file in the directory, Bitcask will attempt to merge files.
You can set the triggers described above using
merge.triggers.fragmentation
and merge.triggers.dead_bytes
,
respectively. The former is expressed as an integer between 0 and 100,
whereas the latter can be expressed in terms of kilobytes, megabytes,
gigabytes, etc. The following example sets the dead bytes threshold to
55% and the fragmentation threshold to 1 GB:
bitcask.merge.triggers.fragmentation = 55
bitcask.merge.triggers.dead_bytes = 1GB
Merge Thresholds
Merge thresholds determine which files will be chosen for inclusion in a merge operation.
Fragmentation — This setting describes which ratio of dead keys to total keys in a file will cause it to be included in the merge. The value of this setting is a percentage (0-100). For example, if a data file contains 4 dead keys and 6 live keys, it will be included in the merge at the default ratio (40%). Increasing the value will cause fewer files to be merged, while decreasing the value will cause more files to be merged.
Dead Bytes — This setting describes which ratio the minimum amount of data occupied by dead keys in a file to cause it to be included in the merge. Increasing this value will cause fewer files to be merged, while decreasing this value will cause more files to be merged. The default is 128 MB.
Small File — This setting describes the minimum size a file must be to be excluded from the merge. Files smaller than the threshold will be included. Increasing the value will cause more files to be merged, while decreasing the value will case fewer files to be merged. The default is 10 MB.
You can set the thresholds described above using the
merge.thresholds.fragmentation
, merge.thresholds.dead_bytes
, and
merge.threshold.small_file
settings, respectively.
The fragmentation
setting is expressed as an integer
between 0 and 100, and the dead_bytes
and small_file
settings can be
expressed in terms of kilobytes, megabytes, gigabytes, etc. The
following example sets the fragmentation threshold to 45%, the
dead bytes threshold to 200 MB, and the small file threshold to 25 MB:
bitcask.merge.thresholds.fragmentation = 45
bitcask.merge.thresholds.dead_bytes = 200MB
bitcask.merge.thresholds.small_file = 25MB
The values for the fragmentation and dead bytes thresholds must be equal to or less than their corresponding trigger values. If they are set higher, Bitcask will trigger merges in cases where no files meet the threshold, which means that Bitcask will never resolve the conditions that triggered merging in the first place.
Merge Interval
Bitcask periodically runs checks to determine whether merges are
necessary. You can determine how often those checks take place using
the bitcask.merge_check_interval
parameter. The default is 3 minutes.
bitcask.merge_check_interval = 3m
If merge check operations happen at the same time on different
vnodes on the same node, this can produce spikes
in I/O usage and undue latency. Bitcask makes it less likely that merge
check operations will occur at the same time on different vnodes by
applying a jitter to those operations. A jitter is a random
variation applied to merge times that you can alter using the
bitcask.merge_check_jitter
parameter. This parameter is expressed as a
percentage of bitcask.merge_check_interval
. The default is 30%.
bitcask.merge_check_jitter = 30%
For example, if you set the merge check interval to 4 minutes and the jitter to 25%, merge checks will occur at intervals between 3 and 5 minutes. With the default of 3 minutes and 30%, checks will occur at intervals between roughly 2 and 4 minutes.
Log Needs Merge
If you are using the older, app.config
-based configuration system, you
can use the log_needs_merge
setting to tune and troubleshoot Bitcask
merge settings. When set to true
(as in the example below), each time
a merge trigger is met, the partition/vnode ID and mergeable files will
be logged.
{bitcask, [
...,
{log_needs_merge, true},
...
]}
log_needs_merge
and the Multi
backendIf you are using Bitcask with the Multi backend in conjunction with
the older, app.config
-based configuration system, please
note that logneeds_merge
_must be set in the global
bitcask
section of your app.config
.
All log_needs_merge
settings in per-backend sections are
ignored.
Fold Keys Threshold
Fold keys thresholds will reuse the keydir (a) if another fold was started less than a specified time interval ago and (b) there were fewer than a specified number of updates. Otherwise, Bitcask will wait until all current fold keys complete and then start. The default time interval is 0, while the default number of updates is unlimited. Both thresholds can be disabled.
The conditions described above can be set using the fold.max_age
and
fold.max_puts
parameters, respectively. The former can be expressed in
terms of minutes, hours, days, etc., while the latter is expressed as an
integer. Each threshold can be disabled by setting the value to
unlimited
. The following example sets the max_age
to ½ second and
the max_puts
to 1000:
bitcask.max_age = 0.5s
bitcask.max_puts = 1000
Automatic Expiration
By default, Bitcask keeps all of your data. But if your data has limited time value or if you need to purge data for space reasons, you can configure object expiration, aka expiry. This feature is disabled by default.
You can enable and configure object expiry using the expiry
setting
and either specifying a time interval in seconds, minutes, hours, etc.,
or turning expiry off (off
). The following example configures objects
to expire after 1 day:
bitcask.expiry = 1d
Space occupied by stale data may not be reclaimed immediately, but the data will become immediately inaccessible to client requests. Writing to a key will set a new modification timestamp on the value and prevent it from being expired.
By default, Bitcask will trigger a merge whenever a data file contains an expired key. This may result in excessive merging under some usage patterns. You can prevent this by configuring an expiry grace time. Bitcask will defer trigger a merge solely for key expiry by the configured amount of time. The default is 0, signifying no grace time.
If you are using the newer, riak.conf
-based configuration system, you
can set an expiry grace time using the expiry.grace_time
setting and
in terms of minutes, hours, days, etc. The following example sets the
grace period to 1 hour:
bitcask.expiry.grace_time = 1h
Automatic expiration and Riak Search
If you are using Riak Search in conjunction with Bitcask, please be aware that automatic expiry does not apply to Search indexes. If objects are indexed using Search, those objects can be expired by Bitcask yet still registered in Search indexes, which means that Search queries may return keys that no longer exist. Riak's active anti-entropy (AAE) subsystem will eventually catch this discrepancy, but this depends on AAE being enabled (which is the default) and could take some time. If search queries returning expired keys is a problem for your use case, then we would recommend not using automatic expiration.
Tuning Bitcask
When tuning your environment, there are a number of things to bear in mind that can assist you in making Bitcask as stable and reliable as possible and to minimize latency and maximize throughput.
Tips & Tricks
Bitcask depends on filesystem caches
Some data storage layers implement their own page/block buffer cache in-memory, but Bitcask does not. Instead, it depends on the filesystem's cache. Adjusting the caching characteristics of your filesystem can impact performance.
Be aware of file handle limits
Review the documentation on open files limit.
Avoid the overhead of updating file metadata (such as last access time) on every read or write operation
You can achieve a substantial speed boost by adding the
noatime
mounting option to Linux's/etc/fstab
. This will disable the recording of the last accessed time for all files, which results in fewer disk head seeks. If you need last access times but you'd like some of the benefits of this optimization, you can tryrelatime
./dev/sda5 /data ext3 noatime 1 1 /dev/sdb1 /data/inno-log ext3 noatime 1 2
Small number of frequently changed keys
When keys are changed frequently, fragmentation rapidly increases. To counteract this, you should lower the fragmentation trigger and threshold.
Limited disk space
When disk space is limited, limiting the space occupied by dead keys is of paramount importance. Lower the dead bytes threshold and trigger to counteract wasted space.
Purging stale entries after a fixed period
To automatically purge stale values, set the object expiry value to the desired cutoff time. Keys that are not modified for a period equal to or greater than this time interval will become inaccessible.
High number of partitions per node
Because each cluster has many partitions running, Bitcask will have many open files. To reduce the number of open files, we suggest increasing the max file size so that larger files will be written. You could also decrease the fragmentation and dead-bytes settings and increase the small file threshold so that merging will keep the number of open files small in number.
High daytime traffic, low nighttime traffic
In order to cope with a high volume of writes without performance degradation during the day, you might want to limit merging to in non-peak periods. Setting the merge window to hours of the day when traffic is low will help.
Multi-cluster replication (Riak Enterprise)
If you are using Riak Enterprise with the replication feature enabled, your clusters might experience higher production of fragmentation and dead bytes. Additionally, because the fullsync feature operates across entire partitions, it will be made more efficient by accessing data as sequentially as possible (across fewer files). Lowering both the fragmentation and dead-bytes settings will improve performance.
FAQ
- Why does it seem that Bitcask merging is only triggered when a Riak node is restarted?
- If the size of key index exceeds the amount of memory, how does Bitcask handle it?
- Bitcask Capacity Planning
Bitcask Implementation Details
Riak will create a Bitcask database directory for each vnode in a cluster. In each of those directories, at most one database file will be open for writing at any given time. The file being written to will grow until it exceeds a specified size threshold, at which time it is closed and a new file is created for additional writes. Once a file is closed, whether purposely or due to server exit, it is considered immutable and will never again be opened for writing.
The file currently open for writes is only written by appending, which
means that sequential writes do not require disk seeking, which can
dramatically speed up disk I/O. Note that this effect can be hampered if
you have atime
enabled on your filesystem, because the disk head will
have to move to update both the data blocks and the file and directory
metadata blocks. The primary speed advantage from a log-based database
stems of its ability to minimize disk head seeks.
Deleting a value from Bitcask is a two-step process: first, a tombstone is recorded in the open file for writes, which indicates that a value was marked for deletion at that time, while references to that key are removed from the in-memory “keydir” information; later, during a merge operation, non-active data files are scanned, and only those values without tombstones are merged into the active data file. This effectively removes the obsolete data and reclaims disk space associated with it. This data management strategy may use up a lot of space over time, since Bitcask writes new values without touching the old ones.
The compaction process referred to as “merging” solves this problem. The merge process iterates over all non-active (i.e. immutable) files in a Bitcask database and produces as output a set of data files containing only the “live” or latest versions of each present key.
Bitcask Database Files
Below are two directory listings showing what you should expect to find on disk when using Bitcask. In this example, we use a 64-partition ring, which results in 64 separate directories, each holding its own Bitcask database.
ls ./data/bitcask
The result:
0
1004782375664995756265033322492444576013453623296
1027618338748291114361965898003636498195577569280
... etc ...
981946412581700398168100746981252653831329677312
Note that when starting up the directories are created for each vnode partition's data. At this point, however, there are not yet any Bitcask-specific files.
After performing one PUT (write) into the Riak cluster running Bitcask:
curl -XPUT http://localhost:8098/types/default/buckets/test/keys/test \
-H "Content-Type: text/plain" \
-d "hello"
The “N” value for this cluster is 3 (the default), so you'll see that the three vnode partitions responsible for this data now have Bitcask database files:
bitcask/
... etc ...
|-- 1118962191081472546749696200048404186924073353216-1316787078245894
| |-- 1316787252.bitcask.data
| |-- 1316787252.bitcask.hint
| `-- bitcask.write.lock
... etc ...
|-- 1141798154164767904846628775559596109106197299200-1316787078249065
| |-- 1316787252.bitcask.data
| |-- 1316787252.bitcask.hint
| `-- bitcask.write.lock
... etc ...
|-- 1164634117248063262943561351070788031288321245184-1316787078254833
| |-- 1316787252.bitcask.data
| |-- 1316787252.bitcask.hint
| `-- bitcask.write.lock
... etc ...
As more data is written to the cluster, more Bitcask files are created until merges are triggered.
bitcask/
|-- 0-1317147619996589
| |-- 1317147974.bitcask.data
| |-- 1317147974.bitcask.hint
| |-- 1317221578.bitcask.data
| |-- 1317221578.bitcask.hint
| |-- 1317221869.bitcask.data
| |-- 1317221869.bitcask.hint
| |-- 1317222847.bitcask.data
| |-- 1317222847.bitcask.hint
| |-- 1317222868.bitcask.data
| |-- 1317222868.bitcask.hint
| |-- 1317223014.bitcask.data
| `-- 1317223014.bitcask.hint
|-- 1004782375664995756265033322492444576013453623296-1317147628760580
| |-- 1317147693.bitcask.data
| |-- 1317147693.bitcask.hint
| |-- 1317222035.bitcask.data
| |-- 1317222035.bitcask.hint
| |-- 1317222514.bitcask.data
| |-- 1317222514.bitcask.hint
| |-- 1317223035.bitcask.data
| |-- 1317223035.bitcask.hint
| |-- 1317223411.bitcask.data
| `-- 1317223411.bitcask.hint
|-- 1027618338748291114361965898003636498195577569280-1317223690337865
|-- 1050454301831586472458898473514828420377701515264-1317223690151365
... etc ...
This is normal operational behavior for Bitcask.
Tutorial Nav: Choosing a Backend
- << Choosing a Backend
- | LevelDB >>
© 2011-2015 Basho Technologies, Inc.