Diagnosing and Resolving Elasticsearch Index Lifecycle Management Rollover Failures
Stalled automated index rollover is a common production bottleneck in high-throughput search pipelines. When rollover fails, the write index keeps growing: write queues back up and query latency spikes because primary shards become oversized. Index Lifecycle Management (ILM) directly governs storage efficiency and routing performance, so teams scaling infrastructure should align ILM execution with broader Search Engine Selection & Architecture principles. This guide isolates the common failure modes and provides remediation steps.
Prerequisites for ILM Policy Execution
ILM requires a healthy cluster before any policy step can run. The cluster must report green or yellow status with an elected master node. The index template must set index.lifecycle.name and index.lifecycle.rollover_alias, and the alias named there must point to exactly one write index. Misconfigured aliases are a leading cause of blocked rollovers. Establish baseline routing standards by reviewing Elasticsearch Fundamentals for Engineers.
Apply the template below, then bootstrap the first index with the write alias. Do not declare the rollover alias inside the template itself: every new index would inherit it, and rollover rejects a template that duplicates the rollover alias.
PUT _index_template/app_logs_template
{
  "index_patterns": ["app_logs-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "app_logs_policy",
      "index.lifecycle.rollover_alias": "app_logs_write"
    }
  }
}
PUT app_logs-000001
{
  "aliases": {
    "app_logs_write": { "is_write_index": true }
  }
}
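Verify cluster health and disk headroom before proceeding. Keep disk usage below the low watermark (85% by default) so new shards can still be allocated. At the flood-stage watermark (95% by default), Elasticsearch applies a read-only-allow-delete block to indices with shards on the affected node.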
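As a rough illustration of the disk check above, the sketch below classifies a node's disk usage against Elasticsearch's default watermarks (85% low, 90% high, 95% flood stage). The function name and the returned labels are hypothetical, and the sketch assumes the defaults have not been overridden in the cluster settings.

```python
# Default disk watermarks (cluster.routing.allocation.disk.watermark.*).
# Assumption: the cluster still uses the defaults; adjust if overridden.
LOW, HIGH, FLOOD = 85.0, 90.0, 95.0


def classify_disk_usage(used_percent: float) -> str:
    """Return a hypothetical label for a node's disk usage percentage."""
    if not 0.0 <= used_percent <= 100.0:
        raise ValueError("used_percent must be between 0 and 100")
    if used_percent >= FLOOD:
        return "flood_stage"  # indices on the node become read-only-allow-delete
    if used_percent >= HIGH:
        return "high"         # Elasticsearch tries to relocate shards away
    if used_percent >= LOW:
        return "low"          # no new shards are allocated to the node
    return "ok"
```

Feeding it the disk percentage reported per node (for example from _cat/allocation) gives a quick pre-flight answer before a policy is attached.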
Structuring Production-Ready ILM Policies
Production policies must define explicit phase transitions with increasing min_age thresholds. Rollover fires as soon as any one of its conditions is met, and once an index has rolled over, the min_age of later phases is measured from the rollover time rather than from index creation. Use the JSON structure below for hot, warm, cold, and delete phases.
PUT _ilm/policy/app_logs_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "30d",
            "max_docs": 20000000
          },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "31d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "allocate": { "number_of_replicas": 1, "require": { "data": "warm" } }
        }
      },
      "cold": {
        "min_age": "90d",
        "actions": {
          "allocate": { "require": { "data": "cold" } }
        }
      },
      "delete": {
        "min_age": "180d",
        "actions": { "delete": {} }
      }
    }
  }
}
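Size max_size, max_age, and max_docs against cluster capacity; whichever condition is reached first triggers the rollover. The shrink action requires the target number_of_shards to be a factor of the source index's primary shard count (1 always qualifies), and a copy of every shard must fit on a single node before the shrink can run.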
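The two rules in this section can be sketched in a few lines of Python: rollover conditions are combined with OR, and a shrink target must be a factor of the source primary shard count. Both function names are hypothetical, the default limits are copied from the policy above, and sizes are simplified to whole gigabytes of primary storage.

```python
from typing import List


def should_rollover(size_gb: float, age_days: float, docs: int,
                    max_size_gb: float = 50, max_age_days: float = 30,
                    max_docs: int = 20_000_000) -> bool:
    """Rollover triggers when ANY single condition is met, not all of them."""
    return (size_gb >= max_size_gb
            or age_days >= max_age_days
            or docs >= max_docs)


def valid_shrink_targets(primary_shards: int) -> List[int]:
    """List smaller shard counts that are factors of the source count."""
    if primary_shards < 1:
        raise ValueError("primary_shards must be at least 1")
    return [n for n in range(1, primary_shards) if primary_shards % n == 0]
```

For example, an index with 6 primary shards can shrink to 1, 2, or 3 shards, while a small index that is 30 days old rolls over on age alone.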
Step-by-Step Diagnostic Workflow for Stuck Rollovers
Execute the following API sequence to isolate stuck indices. Start by querying the ILM state machine for the target index.
curl -X GET "localhost:9200/app_logs-000001/_ilm/explain?pretty"
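Parse the phase, step, failed_step, and step_info fields. When a step fails, step is ERROR and step_info carries the exception type and reason. An illegal_argument_exception whose reason says the rollover alias does not point to the index means the write alias is detached or missing. A reason stating that the policy does not exist means the policy was deleted or never created. A cluster_block_exception points to read-only indices, typically caused by a flood-stage disk watermark violation.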
Verify index metrics and alias routing simultaneously.
curl -s "localhost:9200/_cat/indices/app_logs-*?v&h=index,health,status,docs.count,store.size"
curl -s "localhost:9200/_cat/aliases/app_logs_write?v"
Use jq to extract critical failure codes from the explain response.
curl -s "localhost:9200/app_logs-000001/_ilm/explain" | jq '.indices[].step_info.type'
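For triage across many indices, the same extraction can be done in Python. The sketch below assumes the explain response has already been fetched and decoded into a dict; the helper name failed_steps and the SAMPLE payload are hypothetical, and the reason string is only an illustration of the kind of message a failed rollover check might carry.

```python
from typing import Any, Dict, List


def failed_steps(explain: Dict[str, Any]) -> List[Dict[str, str]]:
    """Collect index name, failed step, and error details for ERROR steps."""
    failures = []
    for name, info in explain.get("indices", {}).items():
        if info.get("step") != "ERROR":
            continue  # index is progressing or waiting normally
        step_info = info.get("step_info", {})
        failures.append({
            "index": name,
            "failed_step": info.get("failed_step", "unknown"),
            "type": step_info.get("type", "unknown"),
            "reason": step_info.get("reason", ""),
        })
    return failures


# Hypothetical, trimmed explain response for illustration only.
SAMPLE = {
    "indices": {
        "app_logs-000001": {
            "managed": True,
            "phase": "hot",
            "step": "ERROR",
            "failed_step": "check-rollover-ready",
            "step_info": {
                "type": "illegal_argument_exception",
                "reason": "rollover alias does not point to index",
            },
        },
        "app_logs-000002": {"managed": True, "phase": "hot",
                            "step": "check-rollover-ready"},
    }
}

for failure in failed_steps(SAMPLE):
    print(failure["index"], failure["failed_step"], failure["type"])
```

Only indices whose step is ERROR are reported, so healthy indices waiting in check-rollover-ready do not add noise.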
Resolution Paths and Production Rollout
After correcting the underlying issue, retry the failed step so the state machine can advance.
curl -X POST "localhost:9200/app_logs-000001/_ilm/retry"
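Reassign the write alias if it points to a stale index. All actions in a single _aliases request are applied atomically, so writes never see the alias without a write index. Note that removing the alias from the old index also drops that index from searches made through the alias; to keep it searchable, re-add it with is_write_index set to false instead of removing it.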
curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d '{
  "actions": [
    { "remove": { "index": "app_logs-000001", "alias": "app_logs_write" } },
    { "add": { "index": "app_logs-000002", "alias": "app_logs_write", "is_write_index": true } }
  ]
}'
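When the swap is scripted, building the request body programmatically avoids hand-editing index names. The following is a minimal sketch; build_alias_swap and its keep_old_searchable flag are hypothetical names, and the resulting dict still has to be sent to the _aliases endpoint by whatever HTTP client is in use.

```python
import json
from typing import Any, Dict


def build_alias_swap(alias: str, old_index: str, new_index: str,
                     keep_old_searchable: bool = False) -> Dict[str, Any]:
    """Build an _aliases body that moves the write alias in one atomic request."""
    if old_index == new_index:
        raise ValueError("old_index and new_index must differ")
    if keep_old_searchable:
        # Keep the old index behind the alias for reads, but stop writes to it.
        first = {"add": {"index": old_index, "alias": alias,
                         "is_write_index": False}}
    else:
        # Same shape as the curl example: detach the alias from the old index.
        first = {"remove": {"index": old_index, "alias": alias}}
    second = {"add": {"index": new_index, "alias": alias,
                      "is_write_index": True}}
    return {"actions": [first, second]}


body = build_alias_swap("app_logs_write", "app_logs-000001", "app_logs-000002")
print(json.dumps(body, indent=2))
```

With the default arguments the output matches the curl payload above; passing keep_old_searchable=True keeps the old index readable through the alias.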
Reattach a missing policy directly to the index settings.
curl -X PUT "localhost:9200/app_logs-000001/_settings" -H 'Content-Type: application/json' -d '{
  "index.lifecycle.name": "app_logs_policy",
  "index.lifecycle.rollover_alias": "app_logs_write"
}'
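Recover from a failed shrink by cleaning up its target. The source index keeps its original shard count, so delete the partially created shrink target index (ILM prefixes its name with shrink-) and retry the step only after confirming disk usage is back below the 85% low watermark.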
Post-Implementation Validation & Monitoring
Confirm successful rollover by verifying the new index exists and holds the write alias. Monitor shard allocation across node tiers to prevent hot node saturation.
curl -s "localhost:9200/_cat/shards/app_logs-*?v&h=index,shard,prirep,state,node"
Track disk usage trends with the cluster stats API, or per node with _cat/allocation. Alert at 70% disk utilization per node, well ahead of the 85% low watermark.
Confirm ILM is running, then check each index's current phase, step, and age to detect policy drift.
GET _ilm/status
GET app_logs-*/_ilm/explain?filter_path=indices.*.phase,indices.*.step,indices.*.age
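Integrate ILM metrics into your observability stack. Alert when the explain API reports step ERROR (failed_step and step_info are then populated), and when an index that should be transitioning shows a step_time_millis older than 15 minutes; ILM polls every 10 minutes by default (indices.lifecycle.poll_interval). Continuous validation prevents silent pipeline degradation.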
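One way to prototype such an alert rule is sketched below. It assumes a decoded explain response and relies on the step and step_time_millis fields; the function name, the WAITING_STEPS set, and the 15-minute default are illustrative assumptions, not part of any Elasticsearch client.

```python
import time
from typing import Any, Dict, List, Optional, Tuple

# Steps where a long wait is normal (assumption for this sketch).
WAITING_STEPS = {"check-rollover-ready", "complete"}


def find_alerts(explain: Dict[str, Any], now_millis: Optional[int] = None,
                max_stall_minutes: int = 15) -> List[Tuple[str, str]]:
    """Return (index, kind) pairs where kind is 'error' or 'stalled'."""
    if now_millis is None:
        now_millis = int(time.time() * 1000)
    limit_millis = max_stall_minutes * 60 * 1000
    alerts = []
    for name, info in explain.get("indices", {}).items():
        step = info.get("step")
        if step == "ERROR":
            alerts.append((name, "error"))
            continue
        started = info.get("step_time_millis")
        if step in WAITING_STEPS or started is None:
            continue  # waiting on conditions, or no timing data available
        if now_millis - started > limit_millis:
            alerts.append((name, "stalled"))
    return alerts
```

Passing now_millis explicitly keeps the check deterministic in tests; in production the default of the current wall-clock time is usually sufficient.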