The Loggly service utilizes Elasticsearch (ES) as the search engine underneath a lot of our core functionality. As Jon Gifford explained in his recent post on Elasticsearch vs Solr, log management imposes some tough requirements on search technology. To boil it down, it must be able to:
- Reliably perform near real-time indexing at huge scale – in our case, more than 100,000 log events per second
- Simultaneously handle high search volumes on the same index with solid performance and efficiency
When we were building our Gen2 log management service, we wanted to be sure that we were setting all configurations in the way that would optimize Elasticsearch performance for both indexing and search. Unfortunately, we found it very difficult to find this information in the Elasticsearch documentation because it’s not located in one place. This post summarizes our learnings and can serve as a checklist of configuration properties you can reference to optimize Elasticsearch (also referred to herein as ES) for your application.
Note: These tips were last updated in September 2016. Some of the comments below may reference older tips.
Tip #1: Planning for Elasticsearch index, shard, and cluster state growth: the biggest factor in management overhead is cluster state size.
Linux divides its physical RAM into chunks of memory called pages. Swapping is the process whereby a page of memory is copied to the preconfigured space on the hard disk, called swap space, to free up that page of memory. The combined sizes of the physical memory and the swap space is the amount of virtual memory available.
Swapping does have a downside. Compared to memory, disks are very slow. Memory speeds can be measured in nanoseconds, while disks are measured in milliseconds; so accessing the disk can be tens of thousands of times slower than accessing physical memory. The more swapping that occurs, the slower your process will be, so you should avoid swapping at all costs.
The mlockall property in ES prevents the ES node's memory from being swapped. (Note that it is available only on Linux/Unix systems.) This property can be set in the yaml file as follows.
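In elasticsearch.yml this was the property name in the pre-5.x (1.x/2.x) releases:

```yaml
# elasticsearch.yml (ES 1.x / 2.x): lock the process address space into RAM
bootstrap.mlockall: true
```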
In the 5.x releases, this has changed to
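The renamed 5.x setting, also in elasticsearch.yml:

```yaml
# elasticsearch.yml (ES 5.x+): same behavior, renamed property
bootstrap.memory_lock: true
```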
mlockall is set to false by default, meaning that the ES node will allow swapping. Once you add this value to the property file, you need to restart your ES node. You can verify whether the value is set properly by doing the following:
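One way to check, assuming the node is listening on localhost:9200, is to query the nodes info API and filter the response down to the mlockall flag:

```
curl "http://localhost:9200/_nodes?filter_path=**.mlockall&pretty"
```

If the property took effect, the process section of each node reports `"mlockall" : true`.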
If you are setting this property, make sure you are giving enough memory to the ES node via the -Xmx JVM option or the ES_HEAP_SIZE environment variable.
Tip #4: discovery.zen properties control the discovery protocol for Elasticsearch
Zen discovery is the default mechanism used by Elasticsearch to discover and communicate between the nodes in the cluster. Other discovery mechanisms exist for Azure, EC2, and GCE. Zen discovery is controlled by the discovery.zen.* properties.
In the 0.x and 1.x releases both unicast and multicast are available, and multicast is the default. To use unicast with these versions of ES, you need to set discovery.zen.ping.multicast.enabled to false.
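On the 0.x/1.x releases, that looks like this in elasticsearch.yml (the setting was removed entirely in 2.0):

```yaml
# Disable multicast so the node relies on the unicast host list instead
discovery.zen.ping.multicast.enabled: false
```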
From 2.0 onwards unicast is the only option available for Zen discovery.
To start with, you must specify the group of hosts that are used to communicate for discovery, using the property
discovery.zen.ping.unicast.hosts. To keep things simple, use the same value for this property on all hosts in your cluster. We define this list using the names of our master nodes.
discovery.zen.minimum_master_nodes controls the minimum number of master-eligible nodes that a node must “see” in order to operate within the cluster. It’s recommended that you set it to a value higher than 1 when running more than two nodes in the cluster. One way to calculate the value is N/2 + 1, where N is the number of master nodes; for example, with three master nodes, 3/2 + 1 = 2 (using integer division).
Data and master nodes detect each other in two different ways:
- By the master pinging all other nodes in the cluster to verify they are up and running
- By all other nodes pinging the master nodes to verify if they are up and running or if an election process needs to be initiated
The node detection process is controlled by the discovery.zen.fd.ping_timeout property. The default value is 30s, which determines how long a node will wait for a response. This property should be adjusted upward if you are operating on a slow or congested network: the higher the value, the smaller the chance of discovery failure.
Loggly has configured our discovery.zen properties as follows:
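Based on the values described below, the configuration would look roughly like this in elasticsearch.yml:

```yaml
discovery.zen.fd.ping_timeout: 30s
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.unicast.hosts: ["esmaster01", "esmaster02", "esmaster03"]
```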
The above properties say that node detection should happen within 30 seconds; this is done by setting
discovery.zen.fd.ping_timeout. In addition, two minimum master nodes should be detected by other nodes (we have 3 masters). Our unicast hosts are esmaster01, esmaster02, esmaster03.
Tip #5: Watch out for DELETE _all!
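One common safeguard, assuming the concern is an accidental `DELETE /_all` (or wildcard delete) wiping every index in the cluster, is Elasticsearch's standard option for requiring destructive operations to name their targets explicitly:

```yaml
# Reject DELETE /_all and wildcard deletes; requests must name indices explicitly
action.destructive_requires_name: true
```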
I would add setting the “refresh_interval” to a bigger number (the default is 1s) if you work in a “high indexing rate but few searches” scenario.
In the same way, setting that value to “-1” will help you when populating a new index from scratch with bulk operations (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-update-settings.html#bulk).
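As a sketch of the update-settings calls this refers to (the index name logs-sample is hypothetical):

```
# Raise the refresh interval during heavy indexing with few searches...
curl -XPUT 'http://localhost:9200/logs-sample/_settings' -d '{
  "index": { "refresh_interval": "30s" }
}'

# ...or disable refresh entirely for a bulk load, then restore it afterwards
curl -XPUT 'http://localhost:9200/logs-sample/_settings' -d '{
  "index": { "refresh_interval": "-1" }
}'
```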
Finally, of course I don’t know if this applies to your use case, but routing data by some criteria really improves search performance in general.