Troubleshooting Elasticsearch ILM: Common issues and fixes

Hiya! Our Elasticsearch team is continually improving our Index Lifecycle Management (ILM) feature. When I first joined Elastic Support, I quickly got up to speed via our Automate rollover with ILM tutorial. After helping multiple users set up ILM, I noticed that escalations mainly stem from a handful of configuration issues.

In the following sections, I’d like to cover frequent tickets, diagnostic flow, and common error recoveries. All commands shown can be run via Kibana’s Dev Tools.

Configure

The ILM backend process runs by default but requires user configuration to affect indices. You can verify that ILM is running by checking the ILM status API (GET _ilm/status), which should return operation_mode: RUNNING.

Common issue 1: ILM is not running
ILM runs by default. If you’ve previously stopped ILM, you’ll need to start it again with the Start ILM API (POST _ilm/start).

ILM moves data through six consecutive phases. The “new” phase is implicit at index creation and is followed by five configurable phases: hot, warm, cold, frozen, and delete.

Common issue 2: Data doesn’t delete
It is a common misconception that configuring the hot phase’s rollover will automatically delete data. The delete phase must be explicitly configured for data to be deleted, and the same holds for every other configurable phase: a phase only runs if the policy defines it.

Each configurable phase has a set of allowed consecutive actions. These actions are configured at your discretion, but most users enable at least set priority, rollover, and delete actions. Policies and actions can be configured via the Kibana UI or via the Elasticsearch API. I frequently see and use policies (which can be accessed from Get ILM Policy) that look like

GET _ilm/policy/INDEX_POLICY_NAME
{
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {
                        "max_age": "30d",
                        "max_size": "50gb"
                    },
                    "set_priority": { "priority": 100 }
                },
                "min_age": "0ms"
            },
            "warm": {
                "min_age": "7d",
                "actions": {
                    "set_priority": { "priority": 50 },
                    "shrink": { "number_of_shards": 1},
                    "forcemerge": { "max_num_segments": 1 }
                }
            },
            "delete": {
                "min_age": "365d",
                "actions": {
                    "delete": {}
                }
            }
        }
    }
}

This policy tells the system to immediately place new indices in the hot phase and to roll over to a new index every 30 days or 50 GB, whichever comes first. Seven days after rollover, an index moves into the warm phase, where it is no longer expected to receive document updates. At this point, the policy runs two further actions: shrink, which reduces the index to a single primary shard, and force merge, which merges each shard down to one segment and purges deleted documents. The index stays in the warm phase until it is deleted 365 days after its rollover.

Common issue 3: min_age calculation clarification
When working with customers, I have seen confusion about how min_age works. The min_age must increase between subsequent phases. If rollover is used, min_age is calculated from the index’s rollover date, which is the moment the next index is created. Otherwise, min_age is calculated from the index’s own creation date.
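
To make the min_age arithmetic concrete, here is a minimal Python sketch (not ILM’s own code) that computes the earliest time an index could enter each phase of the example policy shown earlier. The helper names parse_age and phase_entry_times are hypothetical, only the ms, s, m, h, and d units are handled, and the phases are assumed to be passed in execution order.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical helper: supports only a subset of Elasticsearch time units.
_UNITS = {"ms": "milliseconds", "s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_age(value: str) -> timedelta:
    """Convert a string such as "7d" or "0ms" into a timedelta."""
    # Try "ms" before "s" and "m" so that "0ms" is not misread.
    for suffix in ("ms", "s", "m", "h", "d"):
        if value.endswith(suffix):
            number = value[: -len(suffix)]
            if number.isdigit():
                return timedelta(**{_UNITS[suffix]: int(number)})
    raise ValueError(f"unsupported age: {value!r}")


def phase_entry_times(origin: datetime, min_ages: dict) -> dict:
    """Earliest entry time per phase, measured from the origin.

    origin is the rollover time if the policy uses rollover,
    otherwise the index creation time.
    """
    ages = {phase: parse_age(age) for phase, age in min_ages.items()}
    # Reject a min_age that is smaller than the one of the preceding phase.
    values = list(ages.values())
    if any(later < earlier for earlier, later in zip(values, values[1:])):
        raise ValueError("min_age must increase between subsequent phases")
    return {phase: origin + age for phase, age in ages.items()}


rolled_over = datetime(2021, 6, 1, tzinfo=timezone.utc)
times = phase_entry_times(rolled_over, {"hot": "0ms", "warm": "7d", "delete": "365d"})
for phase, when in times.items():
    print(phase, when.isoformat())
```

With a rollover on 2021-06-01, the warm phase becomes eligible on 2021-06-08 and the delete phase on 2022-06-01, regardless of when the first document was written to the index.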

Once a policy is created, it needs to be explicitly attached to the index to take effect. 

Common issue 4: Explicitly connect policy to index
Giving your policy and index the same name does not connect the two together. For example, calling your policy filebeat-* does not connect it to your filebeat-* indices; you’ll still need to explicitly attach the indices to the policy.

You can manually attach a policy to an existing index, but usually you’ll set up templates that automatically attach the policy when the index is created, either via a Beat YAML configuration file (for example, Filebeat or Metricbeat) or via the index template configuration

PUT _index_template/TEMPLATE_NAME
{
    "index_patterns": ["INDEX_NAME-*"],
    "template": {
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 1,
            "index.lifecycle.name": "POLICY_NAME",
            "index.lifecycle.rollover_alias": "INDEX_ALIAS"
        }
    }
}

You can also configure the index template via the Kibana UI. More and more, users are switching to Data Streams, which automatically handle these configurations for you.

Common issue 5: Manually manage existing indices
A new policy doesn’t automatically apply to any existing indices. Index templates can attach a policy, but the templates only apply at index creation. If you’ve updated an index template to automatically attach a policy and alias, it’ll apply going forward but you’ll need to manually attach the policy to any existing indices.

You can check which policy is currently attached to an index by retrieving its index settings

GET INDEX_NAME-000001/_settings?filter_path=*.settings.index.lifecycle
{
    "INDEX_NAME-000001" : {
        "settings" : {
            "index" : {
                "lifecycle" : {
                    "name" : "INDEX_POLICY_NAME",
                    "rollover_alias" : "INDEX_ALIAS"
                }
            }
        }
    }
}

Common issue 6: Misconfiguring induces errors
If index.lifecycle.name or index.lifecycle.rollover_alias is missing or misconfigured, you’ll encounter rollover action errors. These are the most common ILM errors I see, and we’ll cover them below, because they depend on user configuration prerequisites rather than just backend system processing. You may consider using Data Streams instead, which do not require you to configure the rollover alias.

Policy updates are versioned with only the latest version stored.

Common issue 7: Only latest policy version stored
Users cannot revert to previous policy versions once overwritten. Policy PUT requests create or fully overwrite previous versions and do not partially update the policy’s JSON.
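
Because a PUT replaces the whole policy, one way to change a single value is to read the current definition, modify a copy, and send the complete body back. The Python sketch below covers only the body-building step and makes no HTTP calls. build_policy_update is a hypothetical helper, and the input assumes the Get ILM Policy response is keyed by policy name with the definition nested under "policy".

```python
import copy
import json


def build_policy_update(get_response: dict, name: str, phase: str, min_age: str) -> dict:
    """Return a full PUT body with one phase's min_age changed."""
    # Deep-copy so the fetched definition is left untouched.
    policy = copy.deepcopy(get_response[name]["policy"])
    policy["phases"][phase]["min_age"] = min_age
    # The PUT body carries the complete policy, not just the changed field.
    return {"policy": policy}


# Illustrative response for a policy named INDEX_POLICY_NAME.
current = {
    "INDEX_POLICY_NAME": {
        "version": 1,
        "policy": {
            "phases": {
                "hot": {"min_age": "0ms", "actions": {"rollover": {"max_age": "30d"}}},
                "delete": {"min_age": "365d", "actions": {"delete": {}}},
            }
        },
    }
}

body = build_policy_update(current, "INDEX_POLICY_NAME", "delete", "180d")
print(json.dumps(body, indent=2))
```

The printed body still contains the untouched hot phase, which is what keeps the rollover action from being dropped when the body is sent with PUT _ilm/policy/INDEX_POLICY_NAME.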

To maintain consistency when transitioning between actions, each index caches the definition of the policy phase it is currently executing in its phase_execution metadata. You can see which policy version is cached for an index, and which phase, action, and step it is in, by checking the ILM explain output

GET INDEX_NAME-000001/_ilm/explain
{
    "indices": {
        "INDEX_NAME-000001": {
            "index": "INDEX_NAME-000001",
            "managed": true,
            "policy": "INDEX_POLICY_NAME",
            "lifecycle_date_millis": 1622555121981,
            "lifecycle_date": "2021-06-01T13:45:21.981Z",
            "age": "25.14s",
            "phase": "hot",
            "phase_time_millis": 1622555122577,
            "phase_time": "2021-06-01T13:45:22.577Z",
            "action": "rollover",
            "action_time_millis": 1622555122577,
            "action_time": "2021-06-01T13:45:22.577Z",
            "step": "attempt-rollover",
            "step_time_millis": 1622555122577,
            "step_time": "2021-06-01T13:45:22.577Z",
            "phase_execution": {
                "policy": "INDEX_POLICY_NAME",
                "phase_definition": {
                    "min_age": "0ms",
                    "actions": {
                        "rollover": {
                            "max_age": "30m"
                        }
                    }
                },
                "version": 2,
                "modified_date": "2021-06-01T11:00:11.576Z",
                "modified_date_in_millis": 1622545211576
            }
        }
    }
}

The phase_execution shows that the index has cached the policy’s hot phase definition, which rolls over to a new index every 30 minutes. If the attached policy is updated in the future, the cached definition will be updated to the policy’s latest version when it is safe to do so.
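
When an explain response covers many indices, it can help to reduce it to the few fields discussed above. The following Python sketch does that for an already-parsed response. summarize_explain is a hypothetical helper, and the sample input is a trimmed, illustrative version of the output shown earlier.

```python
def summarize_explain(explain: dict) -> list:
    """Return one summary dict per index from an ILM explain response."""
    rows = []
    for name, info in explain.get("indices", {}).items():
        cached = info.get("phase_execution", {})
        rows.append({
            "index": name,
            "managed": info.get("managed", False),
            "policy": info.get("policy"),
            "phase": info.get("phase"),
            "action": info.get("action"),
            "step": info.get("step"),
            # Policy version whose phase definition the index has cached.
            "cached_version": cached.get("version"),
            # A failed step is reported with the step name ERROR.
            "in_error": info.get("step") == "ERROR",
        })
    return rows


sample = {
    "indices": {
        "INDEX_NAME-000001": {
            "index": "INDEX_NAME-000001",
            "managed": True,
            "policy": "INDEX_POLICY_NAME",
            "phase": "hot",
            "action": "rollover",
            "step": "attempt-rollover",
            "phase_execution": {"policy": "INDEX_POLICY_NAME", "version": 2},
        }
    }
}

for row in summarize_explain(sample):
    print(row)
```

Comparing cached_version with the version returned by Get ILM Policy is a quick way to tell whether an index is still running an older definition of its current phase.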

Common issue 8: Policy versions safely update
Some policy version updates will not reflect in the index’s phase_execution cache until the index moves into the policy’s next phase. This is for your data protection and everything is working as expected.

Sometimes users escalate tickets after inheriting a new system. Usually this is because

  • They need to either enable or disable a data tier as business/network requirements change.
  • Their newly set up Elasticsearch user had insufficient permissions when they edited a policy, which caused it to start failing.

By default, you’ll want the manage_ilm privilege on the cluster and the manage privilege on the associated indices, for example via the superuser role.

Common issue 9: ILM runs as last editing user
ILM performs its operations as the last editing user with permissions equal to that user at the time the policy was last edited. These errors will appear as action [x] is unauthorized for user [y]. Here’s an example Elastic Discuss issue.

Diagnose

If ILM explain reports an ERROR step, you may need to resolve the issue for ILM to continue. Here are the most common errors and their resolutions

  • rollover alias [x] can point to multiple indices, found duplicated alias [x] in index template [z]

  • index.lifecycle.rollover_alias [x] does not point to index [y]
    • You’ll need to check the index setting index.lifecycle.rollover_alias. Either the index is pointing to the wrong alias or the alias does not exist. You can check the latter by running Get Aliases. Here’s an example Elastic Discuss issue. You may consider Data Streams, which handle this for you.
  • setting [index.lifecycle.rollover_alias] for index [y] is empty or not defined

  • alias [x] has more than one write index [y,z]
    • When you run Get Aliases, you’ll notice that two indices are marked as is_write_index:true when only one should be per alias. You’ll want to toggle is_write_index:false on one of the indices via the Aliases API.


  • index name [x] does not match pattern ^.*-\d+$
    • Rollover only works if the index name matches this regex pattern. The most common mistake is not realizing that the index name needs to end with a hyphen followed by digits, e.g. my-index-000001; a plain my-index does not match the pattern. Here’s an example Elastic Discuss issue. You may consider Data Streams, which handle this for you.

  • CircuitBreakingException: [x] data too large, data for [y]

  • high disk watermark [x] exceeded on [y]
    • This resource warning indicates that your cluster’s disk storage in particular has hit its limits. It usually comes up when ILM has not been set up to move indices from hot to warm nodes. You’ll need to create some breathing room for your cluster by increasing resources or by deleting unneeded indices or documents (for example, via delete_by_query).
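
Two of the rollover errors above can be checked offline before retrying. The Python sketch below tests an index name against the rollover naming pattern (written here as ^.*-\d+$) and lists the indices flagged as write index in an already-parsed Get Aliases response. Both helper names are hypothetical, and the sample response is illustrative.

```python
import re

# Naming pattern from the rollover error message.
ROLLOVER_NAME = re.compile(r"^.*-\d+$")


def can_roll_over(index_name: str) -> bool:
    """True if the name ends with a hyphen followed by digits."""
    return ROLLOVER_NAME.match(index_name) is not None


def write_indices(alias_response: dict, alias: str) -> list:
    """List indices flagged is_write_index for the alias in a Get Aliases response."""
    return sorted(
        index
        for index, body in alias_response.items()
        if body.get("aliases", {}).get(alias, {}).get("is_write_index") is True
    )


# Illustrative response: both indices claim to be the write index.
aliases = {
    "INDEX_NAME-000001": {"aliases": {"INDEX_ALIAS": {"is_write_index": True}}},
    "INDEX_NAME-000002": {"aliases": {"INDEX_ALIAS": {"is_write_index": True}}},
}

print(can_roll_over("my-index-000001"), can_roll_over("my-index"))
print(write_indices(aliases, "INDEX_ALIAS"))
```

If write_indices returns more than one name, all but one of those indices need is_write_index set to false through the Aliases API before rollover can proceed.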

The system automatically retries failed steps every ten minutes. Once you’ve resolved the underlying issue, you can also trigger a retry manually via the Retry Policy Execution API

POST INDEX_NAME-000001/_ilm/retry

If you want to temporarily override this interval for testing, you can update the cluster settings to reduce indices.lifecycle.poll_interval. The default ILM cluster settings are

GET _cluster/settings?include_defaults=true&filter_path=*.indices.lifecycle*,*.xpack.ilm*
{
    "defaults" : {
        "indices.lifecycle.history_index_enabled" : "true",
        "indices.lifecycle.poll_interval" : "10m",
        "indices.lifecycle.step.master_timeout" : "30s"
    }
}

If the policy configuration is correct and no errors are reported but your action isn’t progressing, you’ll need to investigate if it’s waiting on prerequisite conditions to run.

Common issue 10: Good cluster hygiene helps ILM run smoothly
UNASSIGNED shards can cause the policy execution to be unable to proceed, as ILM waits for the index to be in the “green” state for some actions. For example, the migrate action can be caught by this.

Since we’ve already looked at current state configurations and are turning towards time series investigation, we’ll look towards the ILM history. ILM history is enabled by default via Elasticsearch cluster settings indices.lifecycle.history_index_enabled:true. Depending on your deployment version, you can view its data by creating a Kibana index pattern on either .ds-ilm-history-* or ilm-history-* system indices. In Kibana Discover, I prefer to browse the created index pattern by toggling the table columns: [index, policy, state.phase, state.action, state.step, success]
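
If you export ILM history documents rather than browsing them in Discover, the same columns can be filtered in a few lines. This Python sketch assumes each document is a dict with the fields named above (index, policy, success, and a nested state object). failed_steps is a hypothetical helper and the sample documents are made up for illustration.

```python
def failed_steps(history_docs: list) -> list:
    """Return (index, policy, phase, action, step) for unsuccessful entries."""
    failures = []
    for doc in history_docs:
        # Only entries explicitly marked unsuccessful are reported.
        if doc.get("success") is False:
            state = doc.get("state", {})
            failures.append((
                doc.get("index"),
                doc.get("policy"),
                state.get("phase"),
                state.get("action"),
                state.get("step"),
            ))
    return failures


# Made-up documents using the column names listed above.
docs = [
    {"index": "INDEX_NAME-000001", "policy": "INDEX_POLICY_NAME", "success": True,
     "state": {"phase": "hot", "action": "rollover", "step": "check-rollover-ready"}},
    {"index": "INDEX_NAME-000001", "policy": "INDEX_POLICY_NAME", "success": False,
     "state": {"phase": "hot", "action": "rollover", "step": "ERROR"}},
]

for failure in failed_steps(docs):
    print(failure)
```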

If ILM history doesn’t give you enough details, you can enable more verbose cluster logging via

PUT /_cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.xpack.core.indexlifecycle": "TRACE",
    "logger.org.elasticsearch.xpack.indexlifecycle": "TRACE"
  }
}

This is very heavy and should only be temporarily enabled. For on-prem clusters, you can see the more verbose logging in the Elasticsearch logs. For Elastic Cloud deployments, see My Elastic Cloud Setup on how to enable and view these.

Conclusion

We’ve covered ILM’s frequent questions, diagnostic flow, and common error recovery. At this point, if you’re stuck resolving an issue, feel free to reach out. We’re here and happy to help! You can contact us via Elastic Discuss, Elastic Community Slack, consulting, training, and support.

Source: Elastic