Monday, May 12, 2014
Choosing a fast unique identifier (UUID) for Lucene
Most search applications using Apache Lucene assign a unique id, or primary key, to each indexed document. While Lucene itself does not require this (it couldn't care less!), the application usually needs it to later replace, delete or retrieve that one document by its external id. Most servers built on top of Lucene, such as Elasticsearch and Solr, require a unique id and can auto-generate one if you do not provide it.
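For instance, with Lucene's own API a unique id field is what makes replace and delete one-liners. A minimal sketch (the field name "id" is arbitrary, not something Lucene mandates):

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

class IdOps {
  // Add the document, first deleting any previous document with the same id:
  static void indexOrReplace(IndexWriter writer, String id, Document doc) throws IOException {
    // StringField indexes the id as a single un-analyzed token, so it can be matched exactly:
    doc.add(new StringField("id", id, Field.Store.NO));
    writer.updateDocument(new Term("id", id), doc);
  }

  // Delete by external id:
  static void delete(IndexWriter writer, String id) throws IOException {
    writer.deleteDocuments(new Term("id", id));
  }
}
```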
Sometimes your id values are already pre-defined, for example if an external database or content management system assigned one, or if you must use a URI, but if you are free to assign your own ids then what works best for Lucene?
One obvious choice is Java's UUID class, which generates version 4 universally unique identifiers, but it turns out this is the worst choice for performance: it is 4X slower than the fastest. Understanding why requires some knowledge of how Lucene finds terms.
BlockTree terms dictionary
The purpose of the terms dictionary is to store all unique terms seen during indexing, and map each term to its metadata (docFreq, totalTermFreq, etc.), as well as the postings (documents, offsets, positions and payloads). When a term is requested, the terms dictionary must locate it in the on-disk index and return its metadata.
The default codec uses the BlockTree terms dictionary, which stores all terms for each field in sorted binary order, and assigns the terms into blocks sharing a common prefix. Each block contains between 25 and 48 terms by default. It uses an in-memory prefix-trie index structure (an FST) to quickly map each prefix to the corresponding on-disk block, and on lookup it first checks the index based on the requested term's prefix, and then seeks to the appropriate on-disk block and scans to find the term.
In certain cases, when the terms in a segment have a predictable pattern, the terms index can know that the requested term cannot exist on-disk. This fast-match test can be a sizable performance gain, especially when the index is cold (the pages are not cached by the OS's IO cache), since it avoids a costly disk seek. As Lucene is segment-based, a single id lookup must visit each segment until it finds a match, so quickly ruling out one or more segments can be a big win. It is also vital to keep your segment counts as low as possible!
Given this, fully random ids (like UUID V4) should perform worst, because they defeat the terms index fast-match test and require a disk seek for every segment. Ids with a predictable per-segment pattern, such as sequentially assigned values, or a timestamp, should perform best as they will maximize the gains from the terms index fast-match test.
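To make the per-segment cost concrete, here is a minimal sketch of an id lookup against the Lucene 4.x API (method names shift slightly across versions, and the "id" field name is just an example). Each seekExact first consults the in-memory terms index; only when the segment might contain the term does it touch the on-disk block:

```java
import java.io.IOException;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

class IdLookup {
  // Returns true if the id exists in any segment of the index.
  static boolean idExists(IndexReader reader, BytesRef id) throws IOException {
    for (AtomicReaderContext leaf : reader.leaves()) {
      Terms terms = leaf.reader().terms("id");
      if (terms == null) {
        continue;  // this segment has no "id" field at all
      }
      TermsEnum termsEnum = terms.iterator(null);
      if (termsEnum.seekExact(id)) {
        return true;  // found; no need to visit the remaining segments
      }
    }
    return false;
  }
}
```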
Testing Performance
I created a simple performance tester to verify this; the full source code is here. The test first indexes 100 million ids into an index with 7/7/8 segment structure (7 big segments, 7 medium segments, 8 small segments), and then searches for a random subset of 2 million of the IDs, recording the best time of 5 runs. I used Java 1.7.0_55, on Ubuntu 14.04, with a 3.5 GHz Ivy Bridge Core i7 3770K.
Since Lucene's terms are now fully binary as of 4.0, the most compact way to store any value is in binary form where all 256 values of every byte are used. A 128-bit id value then requires 16 bytes.
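For example (a sketch), a java.util.UUID can be packed into 16 raw bytes instead of its 36-character hex-and-dash string form:

```java
import java.nio.ByteBuffer;
import java.util.UUID;

import org.apache.lucene.util.BytesRef;

class IdEncoding {
  // Pack the 128 bits of a UUID into 16 bytes, using all 256 values of each byte:
  static BytesRef uuidToBytesRef(UUID uuid) {
    byte[] bytes = ByteBuffer.allocate(16)
        .putLong(uuid.getMostSignificantBits())
        .putLong(uuid.getLeastSignificantBits())
        .array();
    return new BytesRef(bytes);  // 16 bytes vs. 36 for uuid.toString()
  }
}
```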
I tested the following identifier sources:
- Sequential IDs (0, 1, 2, ...), binary encoded.
- Zero-padded sequential IDs (00000000, 00000001, ...), binary encoded (see the encoding sketch after this list).
- Nanotime, binary encoded. But remember that nanotime is tricky.
- UUID V1, derived from a timestamp, nodeID and sequence counter, using this implementation.
- UUID V4, randomly generated using Java's UUID.randomUUID().
- Flake IDs, using this implementation.
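As a concrete illustration of the zero-padded variant mentioned above (a sketch; the benchmark's own encoding may differ in width and base), a fixed-width big-endian long makes every term the same length, so terms sort exactly in assignment order:

```java
import java.nio.ByteBuffer;

import org.apache.lucene.util.BytesRef;

class SequentialIds {
  // Fixed-width (effectively zero-padded) binary encoding of a sequential id:
  // 8 big-endian bytes, so id N always sorts directly after id N-1.
  static BytesRef next(long counter) {
    return new BytesRef(ByteBuffer.allocate(8).putLong(counter).array());
  }
}
```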
[Bar chart: K lookups/sec, 1 thread, for each ID source; the same data appears in the table below.]
| ID Source | K lookups/sec, 1 thread |
|---|---|
| Zero-pad sequential | 593.4 |
| UUID v1 [binary] | 509.6 |
| Nanotime | 461.8 |
| UUID v1 | 430.3 |
| Sequential | 415.6 |
| Flake [binary] | 338.5 |
| Flake | 231.3 |
| UUID v4 [binary] | 157.8 |
| UUID v4 | 149.4 |
Zero-padded sequential ids, encoded in binary, are fastest, quite a bit faster than non-zero-padded sequential ids. UUID V4 (using Java's UUID.randomUUID()) is ~4X slower.
But for most applications, sequential ids are not practical. The 2nd fastest is UUID V1, encoded in binary. I was surprised this is so much faster than Flake IDs since Flake IDs use the same raw sources of information (time, node id, sequence) but shuffle the bits differently to preserve total ordering. I suspect the problem is the number of common leading digits that must be traversed in a Flake ID before you get to digits that differ across documents, since the high order bits of the 64-bit timestamp come first, whereas UUID V1 places the low order bits of the 64-bit timestamp first. Perhaps the terms index should optimize the case when all terms in one field share a common prefix.
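To make the bit-ordering point concrete, here is a small sketch using only java.util.UUID (whose timestamp() method reconstructs the 60-bit timestamp of a version-1 id):

```java
import java.util.UUID;

class V1Layout {
  // A version-1 UUID begins with time_low, the LOW-order 32 bits of the
  // 60-bit timestamp, so ids generated close together differ in their very
  // first bytes.  A Flake id begins with the HIGH-order timestamp bits, so
  // nearby ids share a long common prefix instead.
  static void show(UUID v1) {
    long timestamp = v1.timestamp();          // only valid for version-1 UUIDs
    long timeLow = timestamp & 0xFFFFFFFFL;   // low 32 bits of the timestamp
    long msb = v1.getMostSignificantBits();
    // time_low occupies the top 32 bits of the most-significant long:
    assert (msb >>> 32) == timeLow;
  }
}
```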
I also separately tested varying the base across 10, 16, 36, 64 and 256, and in general for the non-random ids, higher bases are faster. I was pleasantly surprised by this because I expected a base matching the BlockTree block size (25 to 48) to be best.
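For bases up to 36, java.math.BigInteger is a quick way to reproduce this kind of comparison (a sketch; bases 64 and 256 need a custom alphabet or simply the raw bytes):

```java
import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.util.UUID;

class BaseEncoding {
  // Render a 128-bit id at a given radix; BigInteger supports radix 2..36.
  // Higher radix -> shorter terms: radix 10 needs up to 39 digits, radix 36 up to 25.
  static String encodeAtBase(UUID id, int radix) {
    byte[] bytes = ByteBuffer.allocate(16)
        .putLong(id.getMostSignificantBits())
        .putLong(id.getLeastSignificantBits())
        .array();
    return new BigInteger(1, bytes).toString(radix);  // signum 1 = treat as unsigned
  }
}
```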
There are some important caveats to this test (patches welcome)! A real application would obviously be doing much more work than simply looking up ids, and the results may be different as hotspot must compile much more active code. The index is fully hot in my test (plenty of RAM to hold the entire index); for a cold index I would expect the results to be even more stark since avoiding a disk-seek becomes so much more important. In a real application, the ids using timestamps would be more spread apart in time; I could "simulate" this myself by faking the timestamps over a wider range. Perhaps this would close the gap between UUID V1 and Flake IDs? I used only one thread during indexing, but a real application with multiple indexing threads would spread out the ids across multiple segments at once.
I used Lucene's default TieredMergePolicy, but it is possible a smarter merge policy that favored merging segments whose ids were more "similar" might give better results. The test does not do any deletes/updates, which would require more work during lookup since a given id may be in more than one segment if it had been updated (just deleted in all but one of them).
Finally, I used Lucene's default Codec, but we have nice postings formats optimized for primary-key lookups when you are willing to trade RAM for faster lookups, such as this Google Summer of Code project from last year and MemoryPostingsFormat. Likely these would provide sizable performance gains!
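For instance, routing just the id field to MemoryPostingsFormat can be done with a per-field codec override; a sketch against the Lucene 4.x codec API (class names differ in later versions):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene46.Lucene46Codec;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

class PerFieldCodecSetup {
  static IndexWriterConfig config() {
    IndexWriterConfig config =
        new IndexWriterConfig(Version.LUCENE_48, new StandardAnalyzer(Version.LUCENE_48));
    config.setCodec(new Lucene46Codec() {
      @Override
      public PostingsFormat getPostingsFormatForField(String field) {
        // Hold the id field's terms in a RAM-resident FST; other fields stay at the default:
        return "id".equals(field)
            ? PostingsFormat.forName("Memory")
            : super.getPostingsFormatForField(field);
      }
    });
    return config;
  }
}
```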
Posted by Michael McCandless on 5/12/2014
I have an off-topic question (as usual). This post provides details about the codec internals: "The default codec uses the...". I'm really interested in it. Is there such a detailed explanation already published?
Nevertheless, it seems like this data-structure design exploits some sort of "block" pattern, or is it just common sense? Can you point to any materials about designing such efficient data structures? I need to design my own.
Thanks!
I go with ids applied at indexing gateways: a v1 UUID, sort64 encoded. https://code.google.com/p/guava-libraries/issues/detail?id=1415. I adapted Cassandra's UUID code, which uses timestamp plus sequence plus node id, for high-frequency events. Then for bucketing/partitioning, a murmur hash of the UUID ngram prefix.
Actually, the reason for not going binary with the ids is that the speed improvement wasn't worth not being able to easily email and share them.
Good post, useful insight, tnx.
It is a very informative post, and I learned a lot about UUIDs. But there is one place I couldn't understand about the efficiency difference between Flake and UUID v1. You point out that the reason for this performance difference is the common prefix in Flake ids, and that UUID v1 avoids this by reversing the high/low bytes of the timestamp. If this theory stands, zero-pad sequential ids should be slower than sequential ids. But your test result seems to tell a different story. I am a beginner in this field and I might not understand it correctly. Can you kindly point out what I may have misunderstood? Thank you.
Edwin.JH.Lee@qq.com
2014/7/25
That's a good point! I think the explanation is a bit tricky ... when you don't zero-pad the sequential IDs, you end up with blocks in the terms dictionary that mix terms and sub-blocks. Seen as a tree, it means terms can occur on the inside (non-leaf) nodes of the tree.
Whereas with zero-padding, all terms occur on leaf nodes, and the inner nodes just point to other blocks; those inner nodes don't contain their own terms.
This makes a difference in performance because in the terms index (the in-memory structure) we record whether a given block has terms, and if it doesn't have terms, we can avoid seeking to that block in certain cases, which can be a big savings...
I have some questions about which id implementation might have better performance for Solr. You know, in our Solr collection (Solr 4.8), we have the following unique key definition: id.
In our external Java program, in the past, we first generated a UUID V4 with UUID.randomUUID().toString(). Then we used a cryptographic hash to generate a 32-byte text and finally used that as the id. So, my first question is: will a 32-byte cryptographic hash have better performance than UUID V4?
For now, we might need to post more than 20k Solr docs per second, and UUID.randomUUID() or the cryptographic hashing might take time. We might have a simple workaround: share one cryptographic hash across many Solr docs, namely appending a sequence to the hash, such as 9AD0BB6DDD7AA9FE4D9EB1FF16B3BDFY000000, 9AD0BB6DDD7AA9FE4D9EB1FF16B3BDFY000001, 9AD0BB6DDD7AA9FE4D9EB1FF16B3BDFY000002, etc.
The second thing we want to know: if we use a 38-byte id or a 36-byte id (both appending a sequence to the 32-byte cryptographic hash), which one might be better? In my understanding, based on what you mentioned in the post, the 38-byte id is better. Right?
Thanks,
Eternal
I suspect the crypto hash output after processing the UUID v4 will have poor performance, the same as UUID v4 (though maybe a bit worse, since you have to spend some CPU to compute the crypto hash), since there will not be any predictability in how IDs are assigned to segments.
On your 2nd question, the length of the ID (36 vs. 38 bytes) likely won't affect performance much, but I would expect the shorter one to be a bit faster, assuming those 2 extra digits appended to the original crypto key are "enough" for your use case.
I want to get the "UUID v1 [binary]" performance; obviously the v1 algorithm is up to me, but I want to make sure I'm getting the binary performance. Looking at the UUIDField implementation, it looks like it's just going to be stored as a String, so it sounds like it would boil down to a UTF-8 encoded byte[]. I suppose I need to write my own UUIDField that stores a 16-byte byte[] under the covers but still operates on the same hex format that UUIDField does (so the field would be interchangeable). Can I just extend BinaryField? How does TermToBytesRefAttribute play into it, if at all?
Assuming I'm building an index to search, most of my queries will be searches, would they not? Could this affect searching somehow too, or is it only for the "lots of direct document id retrievals" case?
Good questions! "Lookup by ID" most affects application of deletes during indexing, or if you "update" a document in ES or Solr, which is delete + add to Lucene. Even if you append-only with ES, it asks Lucene to update, so Lucene must go and try to find the ID (which won't exist).
Searching itself doesn't normally do ID lookups, unless there's a second pass to retrieve documents by ID, which I think Solr does, whereas ES retrieves by Lucene's docID which is much more efficient since the user's ID has already been resolved to docID.
I didn't understand the answer you gave to Ryan Josal. Is it possible to use "UUID v1 [binary]" on Lucene 4.8? Should I adjust UUIDField to take BytesRef?
How will this change impact the segments' size/merging?
Thanks!
Do these findings still hold for Elasticsearch 2.3.0, i.e. Lucene 5.5.0? We are planning to use random IDs to avoid hotspots in our cluster, and our first choice was UUID v4.
Michael loves building software; he's been building search engines for more than a decade. In 1999 he co-founded iPhrase Technologies, a startup providing a user-centric enterprise search application, written primarily in Python and C. After IBM acquired iPhrase in 2005, Michael fell in love with Lucene, becoming a committer in 2006 and PMC member in 2008. Michael has remained an active committer, helping to push Lucene to new places in recent years. He's co-author of Lucene in Action, 2nd edition. In his spare time Michael enjoys building his own computers, writing software to control his house (mostly in Python), encoding videos and tinkering with all sorts of other things.