Overview

Elasticsearch sorting allows you to order your search results based on specific criteria. However, when it comes to case sensitivity, Elasticsearch treats uppercase and lowercase letters as different characters and sorts them separately. This follows the ASCII table order, in which digits come before uppercase letters, which in turn come before lowercase letters. By default, Elasticsearch therefore sorts strings in the following order: numbers first, then uppercase letters, and finally lowercase letters. For instance, the terms “Apple,” “apple,” “banana,” “Carrot,” and “1apple” would be sorted in ascending order as “1apple,” “Apple,” “Carrot,” “apple,” “banana.”

```
POST /test_casing/_bulk
{ "index" : {} }
{ "my_field" : "Apple" }
{ "index" : {} }
{ "my_field" : "apple" }
{ "index" : {} }
{ "my_field" : "banana" }
{ "index" : {} }
{ "my_field" : "Carrot" }
{ "index" : {} }
{ "my_field" : "1apple" }

GET /test_casing/_search
{
  "size": 0,
  "aggs": {
    "my_terms": {
      "terms": {
        "field": "my_field.keyword"
      }
    }
  }
}
```

This default behavior might not always be desirable. For example, if you have indexed the values “Apple,” “banana,” and “Carrot” and you sort in ascending order, you get “Apple,” “Carrot,” “banana,” when you might want “Apple,” “banana,” “Carrot” instead. To achieve that, you can use an Elasticsearch feature called a normalizer. A normalizer is used with the keyword field type and allows you to preprocess the input of keyword fields in a way that is similar to analyzing text. Unlike an analyzer, however, a normalizer does not break the input into tokens.
This makes it suitable for keyword field types, where the entire input needs to be indexed or sorted.

```
PUT /test_casing2
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "keyword",
        "normalizer": "my_normalizer"
      }
    }
  }
}

POST /test_casing2/_bulk
{ "index" : {} }
{ "my_field" : "Apple" }
{ "index" : {} }
{ "my_field" : "banana" }
{ "index" : {} }
{ "my_field" : "Carrot" }

GET /test_casing2/_search
{
  "size": 0,
  "aggs": {
    "my_terms": {
      "terms": {
        "field": "my_field"
      }
    }
  }
}
```

It’s important to note that using a normalizer changes the values stored in your index. If you want to keep the original values, such as “Apple” with a capital “A,” you can use sub-fields, which let you keep both the original and the normalized field values. In the aggregation results, Elasticsearch will only show the field that you used in the aggregation.

Unfortunately, Elasticsearch does not support case-insensitive sorting directly in the terms aggregation. Even with script aggregations and normalizers, it is not possible to sort in a case-insensitive manner while displaying the results with their original casing. This is a limitation that users should be aware of when working with Elasticsearch.

How can you add a normalizer to an existing index?

Let’s look at a practical example of adding a normalizer to an existing index in Elasticsearch. The process involves several steps: closing the index, updating the settings, reopening the index, updating the mapping, reindexing the data, and finally running a query.

First, you need to close the index:

```
POST test_casing/_close
```

Next, you update the settings of the index to add the normalizer.
In this case, we’re adding a custom normalizer that applies a lowercase filter:

```
PUT test_casing/_settings
{
  "analysis": {
    "normalizer": {
      "my_normalizer": {
        "type": "custom",
        "filter": ["lowercase"]
      }
    }
  }
}
```

After updating the settings, you can reopen the index:

```
POST test_casing/_open
```

Now, you need to update the mapping of the index to use the normalizer. Here, we’re adding a sub-field to “my_field” that uses it:

```
PUT test_casing/_mapping
{
  "properties": {
    "my_field": {
      "type": "text",
      "fields": {
        "normalized": {
          "type": "keyword",
          "normalizer": "my_normalizer"
        }
      }
    }
  }
}
```

Note that my_field.normalized is the full field name. Next, you can reindex the existing documents by running update_by_query, which populates the my_field.normalized sub-field:

```
POST test_casing/_update_by_query
```

Finally, you can run a search query on the index. In this case, we’re running an aggregation on the new normalized field:

```
GET /test_casing/_search
{
  "size": 0,
  "aggs": {
    "my_terms": {
      "terms": {
        "field": "my_field.normalized"
      }
    }
  }
}
```

This process demonstrates how you can add a normalizer to an existing index in Elasticsearch, giving you greater flexibility in handling case sensitivity.

Conclusion

In conclusion, while Elasticsearch provides powerful features for handling and manipulating data, it’s important to understand its limitations and how to work around them. Using features like normalizers and sub-fields can help you achieve the desired results.
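The ordering behavior discussed above can be mimicked outside Elasticsearch. Here is a minimal Python sketch (illustrative only, not Elasticsearch code) contrasting the default code-point ordering with the order you get once values are lowercased, as a lowercase normalizer does:

```python
terms = ["Apple", "apple", "banana", "Carrot", "1apple"]

# Default keyword ordering compares code points:
# digits < uppercase < lowercase, as in the examples above.
print(sorted(terms))
# ['1apple', 'Apple', 'Carrot', 'apple', 'banana']

# Lowercasing before comparison makes the sort effectively
# case-insensitive, like aggregating on the normalized sub-field.
print(sorted(terms, key=str.lower))
# ['1apple', 'Apple', 'apple', 'banana', 'Carrot']
```

Note that the second sort still displays the original casing only because Python keeps the original strings and sorts by a derived key; in Elasticsearch, the normalized sub-field itself holds the lowercased values, which is exactly the limitation described earlier.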
Enabling Elasticsearch X-Pack Security on an Unsecured Cluster
High-Level Steps

To enhance the security of your Elasticsearch cluster, you will need to perform a FULL CLUSTER RESTART, as well as make some changes on the client side. Once authentication is enabled, all requests to index and search data will require a username and password or a token. Here are the high-level steps to achieve this:

1. Create SSL Elastic Certificates

```
./bin/elasticsearch-certutil ca --days 3650
[Press Enter]
[Press Enter]
./bin/elasticsearch-certutil cert --ca elastic-stack-ca.p12 --days 3650
[Press Enter]
[Press Enter]
```

2. Copy the SSL Certificate to All Nodes

The ‘elastic-certificates.p12’ file is created under ‘/usr/share/elasticsearch’. After creating the SSL certificate (elastic-certificates.p12), copy it onto all nodes:

```
mv /usr/share/elasticsearch/elastic-certificates.p12 /etc/elasticsearch/
chown elasticsearch:elasticsearch /etc/elasticsearch/ -R
```

3. Update the elasticsearch.yml

```
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.client_authentication: required
xpack.security.transport.ssl.keystore.path: elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: elastic-certificates.p12
```

Note: The default path for the certificate is ‘/etc/elasticsearch/elastic-certificates.p12’.
Note 2: Ensure there is no duplication of settings.

4. Stop All Elasticsearch Nodes

```
service elasticsearch stop
```

5. Start All Elasticsearch Nodes

Start all Elasticsearch nodes, beginning with the master nodes:

```
service elasticsearch start
```

Starting with the master nodes ensures that the core of your Elasticsearch cluster is up and running before the data nodes come online. This sequence is important to maintain cluster stability and data integrity.

6. Create/Reset the Built-In Users’ Passwords

```
bin/elasticsearch-reset-password -u elastic -i
```

This command resets the password for the “elastic” user interactively, prompting you to enter a new password.
If you want Elasticsearch to set the password automatically instead, replace “-i” with auto mode (omit the interactive flag); the newly generated password for the “elastic” user will then be displayed in the terminal. Make sure to store this password securely, as it’s crucial for authentication.
Understanding Elasticsearch’s Auto-Generated document _id: Is Duplication a Concern?
Auto-Generated Document IDs in Elasticsearch

When you index a document without specifying an ID, Elasticsearch automatically generates a unique ID for that document. This ID is a Base64-encoded UUID composed of several parts, each serving a specific purpose, and the generation process is optimized for both indexing speed and storage efficiency. The code responsible for this process can be found in Elasticsearch’s TimeBasedUUIDGenerator class on GitHub (server/src/main/java/org/elasticsearch/common/TimeBasedUUIDGenerator.java in the elastic/elasticsearch repository).

How are the IDs Generated?

The first two bytes of the ID are derived from a sequence ID, which is incremented for each document that’s indexed; the first and third bytes of the sequence ID are used. These bytes change frequently, which helps with indexing speed because it makes the IDs sort quickly.

The next four bytes are derived from the current timestamp. These bytes change less frequently, which helps with storage efficiency because it makes the IDs compress well. The timestamp is shifted by different amounts to generate these four bytes, so they change at different rates.

The next six bytes are the MAC address of the machine where Elasticsearch is running. This helps ensure the uniqueness of the IDs across different machines.

The final three bytes are the remaining bytes of the timestamp and sequence ID. These bytes are likely not to compress at all.

The resulting byte array is then Base64-encoded to create the final ID. The Base64 encoding is URL-safe and does not include padding, which makes the IDs safe to use in URLs and efficient to store.

Probability of Collision

The probability of Elasticsearch generating a duplicate ID for a document is extremely low, almost negligible. This is because Elasticsearch uses a UUID (Universally Unique Identifier) for auto-generating IDs.
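The byte layout described above can be sketched in Python. This is a simplified illustration, not Elasticsearch’s actual Java implementation (the real bit shifts differ), but it shows how 15 raw bytes always yield a 20-character, padding-free, URL-safe ID:

```python
import base64
import os
import time

def sketch_time_based_id(sequence: int) -> str:
    """Illustrative sketch only: 2 sequence bytes + 4 timestamp bytes
    + 6 MAC bytes + 3 trailing bytes = 15 bytes, then URL-safe Base64."""
    ts = int(time.time() * 1000)          # millisecond timestamp
    mac = os.urandom(6)                   # stand-in for the real MAC address
    raw = (
        bytes([sequence & 0xFF, (sequence >> 16) & 0xFF])  # two sequence bytes
        + ts.to_bytes(8, "big")[2:6]                       # four timestamp bytes
        + mac                                              # six MAC-address bytes
        + bytes([(ts >> 4) & 0xFF, ts & 0xFF, (sequence >> 8) & 0xFF])  # remainder
    )
    # 15 bytes is a multiple of 3, so Base64 needs no '=' padding
    # and produces exactly 20 characters.
    return base64.urlsafe_b64encode(raw).decode("ascii")

print(sketch_time_based_id(42))  # a 20-character ID, similar in shape to "5PMM3nYBgTGA2v2S6qve"
```

Because the most frequently changing bytes come first and the slowly changing timestamp bytes follow, consecutive IDs sort quickly yet share long common runs that compress well, which is the trade-off the article describes.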
UUIDs are 128-bit values and are designed to be sufficiently random that the probability of collision (i.e., generating the same UUID more than once) is extremely low.

Example of an Auto-Generated ID

Let’s consider an example auto-generated ID: “5PMM3nYBgTGA2v2S6qve”. This ID is a Base64-encoded UUID. The first two bytes are derived from a sequence ID, the next four bytes from the current timestamp, the next six bytes are the MAC address of the machine where Elasticsearch is running, and the final three bytes are the remaining bytes of the timestamp and sequence ID.

Q&A

Q: Are auto-generated IDs unique across all indices in a cluster?
A: While the auto-generated IDs are unique within an index, they are not globally unique across all indices in a cluster. If you have two documents with the same auto-generated ID in two different indices, they are considered two different documents.

Q: What is the probability of collision in auto-generated IDs?
A: The probability of Elasticsearch generating a duplicate ID for a document is extremely low, almost negligible, because the IDs are designed to be unique. To give you an idea of the scale involved: the number of random version 4 UUIDs that need to be generated to reach a 50% probability of at least one collision is 2.71 quintillion (2.71 x 10¹⁸). (Elasticsearch’s auto-generated IDs are time-based rather than random version 4 UUIDs, but the figure conveys the order of magnitude.) This number is so large that even if you were generating 1 billion UUIDs per second, it would take you over 85 years to generate this many UUIDs.

Conclusion

Elasticsearch’s approach to ID generation is a trade-off between indexing speed, storage efficiency, and lookup speed. It’s optimized for append-only workloads, where documents are continually being added to the index and rarely updated or deleted.
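The 2.71 quintillion figure quoted above follows from the standard birthday-bound approximation applied to the 122 random bits of a version 4 UUID, and can be checked directly:

```python
import math

# Birthday bound: number of random 122-bit values needed for a ~50%
# chance of at least one collision is n ≈ sqrt(2 * 2**122 * ln 2).
n = math.sqrt(2 * 2**122 * math.log(2))
print(f"{n:.2e} UUIDs")  # ~2.71e+18, i.e. about 2.71 quintillion

# At 1 billion UUIDs per second:
years = n / 1e9 / (60 * 60 * 24 * 365)
print(f"{years:.0f} years")  # roughly 86 years of continuous generation
```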