Auto-Generated Document IDs in Elasticsearch
When you index a document without specifying an ID, Elasticsearch automatically generates a unique ID for that document. This ID is a Base64-encoded UUID, which is composed of several parts, each serving a specific purpose.
The ID generation process is optimized for both indexing speed and storage efficiency. The code responsible for this process can be found in Elasticsearch’s TimeBasedUUIDGenerator class on GitHub (server/src/main/java/org/elasticsearch/common/TimeBasedUUIDGenerator.java in the elastic/elasticsearch repository).
How are the IDs Generated?
The first two bytes of the ID are derived from a sequence ID, which is incremented for each document that’s indexed. The first and third bytes of the sequence ID are used. These bytes change frequently, which helps with indexing speed: IDs that already differ in their first bytes are quick to sort.
The next four bytes are derived from the current timestamp. These bytes change less frequently, which helps with storage efficiency because it makes the IDs compress well. The timestamp is shifted by different amounts to generate these four bytes, which means they change at different rates.
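To make “different rates” concrete, here is a small sketch. It assumes a millisecond timestamp and shifts of 16, 24, 32, and 40 bits. Those shift amounts are an assumption based on a reading of TimeBasedUUIDGenerator, not something stated in this article, and the function name is made up for illustration.

```python
# Assumed shift amounts (in bits) for the four timestamp bytes.
ASSUMED_SHIFTS = (16, 24, 32, 40)


def change_interval_seconds(shift_bits: int) -> float:
    """Seconds between changes of the byte (timestamp_ms >> shift_bits) & 0xFF."""
    # The byte changes each time the timestamp advances by 2**shift_bits ms.
    return (1 << shift_bits) / 1000.0


for shift in ASSUMED_SHIFTS:
    seconds = change_interval_seconds(shift)
    days = seconds / 86400
    print(f"shift {shift:2d}: changes every {seconds:,.0f} s (about {days:,.2f} days)")
```

Under these assumptions the four bytes change roughly every 65 seconds, 4.7 hours, 50 days, and 35 years, which is why long runs of IDs share these bytes and compress well.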
The next six bytes are the MAC address of the machine where Elasticsearch is running. This helps ensure the uniqueness of the IDs across different machines.
The final three bytes are the remaining bytes of the timestamp and sequence ID. These bytes change so often that they are unlikely to compress at all.
The resulting byte array is then Base64-encoded to create the final ID. The Base64 encoding is URL-safe and does not include padding, which makes the IDs safe to use in URLs and efficient to store.
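The steps above can be sketched in a few lines of Python. This is an illustration of the byte layout, not Elasticsearch’s actual code. The shift amounts and the exact order of the last three bytes are assumptions based on a reading of TimeBasedUUIDGenerator. The names build_id and next_id, the counter, and the use of uuid.getnode() for the MAC address are hypothetical stand-ins.

```python
import base64
import itertools
import time
import uuid


def build_id(sequence_id: int, timestamp_ms: int, mac: bytes) -> str:
    """Lay out 15 bytes in the order described above and Base64-encode them."""
    if len(mac) != 6:
        raise ValueError("expected a 6-byte MAC address")
    seq = sequence_id & 0xFFFFFF  # keep three bytes of the sequence ID
    ts = timestamp_ms
    raw = bytes([
        seq & 0xFF,            # sequence ID, first byte (changes fastest)
        (seq >> 16) & 0xFF,    # sequence ID, third byte
        (ts >> 16) & 0xFF,     # four slowly changing timestamp bytes
        (ts >> 24) & 0xFF,     # (shift amounts are assumed)
        (ts >> 32) & 0xFF,
        (ts >> 40) & 0xFF,
        *mac,                  # six MAC address bytes
        (ts >> 8) & 0xFF,      # remaining timestamp and sequence bytes
        (seq >> 8) & 0xFF,     # (order is assumed)
        ts & 0xFF,
    ])
    # URL-safe alphabet, padding stripped; 15 bytes never need padding anyway.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


# Hypothetical stand-ins for the generator's state.
_counter = itertools.count()
_mac = uuid.getnode().to_bytes(6, "big")  # may be random if no MAC is found


def next_id() -> str:
    """Build an ID from the next counter value and the current time."""
    return build_id(next(_counter), int(time.time() * 1000), _mac)


print(next_id())
```

Because 15 bytes is a multiple of three, the encoded ID is always 20 characters long and never needs padding.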
Probability of Collision
The probability of Elasticsearch generating a duplicate ID for a document is extremely low, almost negligible. This is because each auto-generated ID combines a timestamp, a sequence ID, and the MAC address of the generating machine, as described above. Two IDs can only collide if all three components match, so uniqueness comes from the structure of the ID rather than from randomness. Note that the ID is 15 bytes (120 bits) long, not a standard 128-bit UUID.
Example of an Auto-Generated ID
Let’s consider an example auto-generated ID: “5PMM3nYBgTGA2v2S6qve”. Its 20 Base64 characters decode to 15 bytes: the first two are derived from the sequence ID, the next four from the current timestamp, the next six are the MAC address of the machine where Elasticsearch is running, and the final three are the remaining bytes of the timestamp and sequence ID.
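To see those parts in the example ID, you can decode it yourself. The sketch below assumes the byte positions used in TimeBasedUUIDGenerator as read at the time of writing (timestamp shifts of 16, 24, 32, and 40 bits, and a particular order for the last three bytes). The function name decode_id is hypothetical.

```python
import base64
from datetime import datetime, timezone


def decode_id(doc_id: str) -> dict:
    """Split an auto-generated ID into its assumed components."""
    padded = doc_id + "=" * (-len(doc_id) % 4)  # restore any stripped padding
    raw = base64.urlsafe_b64decode(padded)
    if len(raw) != 15:
        raise ValueError("expected an ID that decodes to 15 bytes")
    # Reassemble the timestamp from its six scattered bytes (assumed layout).
    timestamp_ms = (
        raw[5] << 40 | raw[4] << 32 | raw[3] << 24
        | raw[2] << 16 | raw[12] << 8 | raw[14]
    )
    # Reassemble the three sequence ID bytes (assumed layout).
    sequence_id = raw[1] << 16 | raw[13] << 8 | raw[0]
    return {
        "sequence_id": sequence_id,
        "timestamp_ms": timestamp_ms,
        "time_utc": datetime.fromtimestamp(timestamp_ms // 1000, tz=timezone.utc),
        "mac": raw[6:12].hex(":"),
    }


parts = decode_id("5PMM3nYBgTGA2v2S6qve")
for name, value in parts.items():
    print(f"{name}: {value}")
```

Under this assumed layout, the example ID decodes to a timestamp of 1610043157214 ms (2021-01-07 18:12:37 UTC), which is a plausible indexing time and suggests the layout is read correctly.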
Q&A
Q: Are auto-generated IDs unique across all indices in a cluster?
A: Document IDs are scoped to an index. Auto-generated IDs are unique within an index, but Elasticsearch does not enforce uniqueness across all indices in a cluster. If two documents in two different indices have the same ID, they are treated as two different documents.
Q: What is the probability of collision in auto-generated IDs?
A: The probability of Elasticsearch generating a duplicate ID for a document is extremely low, almost negligible. Each auto-generated ID combines a timestamp, a sequence ID, and a MAC address, so a duplicate would require all three to coincide.
To give you an idea of how low collision probabilities are even for purely random IDs, consider random version 4 UUIDs. (Elasticsearch’s auto-generated IDs are time-based rather than version 4, so this is a comparison, not a description of its generator.) The number of version 4 UUIDs that need to be generated in order to have a 50% probability of at least one collision is about 2.71 quintillion (2.71 × 10¹⁸). This number is so large that even if you were generating 1 billion UUIDs per second, it would take you over 85 years to generate this many UUIDs.
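The 2.71 quintillion figure can be checked with the usual birthday-bound approximation, n ≈ sqrt(2 · N · ln 2), where N is the number of possible values. The sketch below assumes N = 2¹²² (a version 4 UUID has 122 random bits) and a 365.25-day year. The function name is made up for illustration.

```python
import math

RANDOM_BITS = 122  # random bits in a version 4 UUID (128 minus 6 fixed bits)


def ids_for_half_collision_chance(bits: int) -> float:
    """Approximate number of random IDs needed for a 50% collision chance."""
    return math.sqrt(2 * (2 ** bits) * math.log(2))


n = ids_for_half_collision_chance(RANDOM_BITS)
seconds_per_year = 365.25 * 24 * 3600
years = n / 1e9 / seconds_per_year  # at 1 billion IDs per second

print(f"IDs needed: {n:.3e}")
print(f"Years at 1e9 IDs per second: {years:.1f}")
```

This gives roughly 2.71 × 10¹⁸ IDs and about 86 years, in line with the numbers quoted above.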
Conclusion
Elasticsearch’s approach to ID generation is a trade-off between indexing speed, storage efficiency, and lookup speed. It’s optimized for append-only workloads, where documents are continually being added to the index and rarely updated or deleted.