Alerting on Indexing Lag with Freshness SLOs
A document committed to your primary database but not yet visible in the index is a silent correctness bug: the user searches, gets stale results, and never sees an error. Page-on-everything alerting drowns the on-call engineer; page-on-nothing ships staleness to production. This guide defines a freshness service level objective (SLO) on indexing lag and alerts on it with multi-window, multi-burn-rate rules, the same discipline that observability and SRE practice for search, and search engine selection and architecture more broadly, apply to availability. The goal is an alert that fires when, and only when, freshness is genuinely at risk of breaching its budget.
Prerequisites
- A Prometheus server (v2.40+) scraping your indexing pipeline.
- A way to read the source-of-truth updated_at clock and the indexed updated_at clock: either an exporter that queries both, or Kafka consumer-group lag for the sink connector.
- Alertmanager (v0.25+) wired to a paging channel and a ticketing channel.
- An agreed freshness target with product owners, expressed as a percentile and a threshold (for example, 99% of documents indexed within 30 seconds).
Diagnosis: what “indexing lag” actually measures
There are two defensible definitions of lag, and they answer different questions.
Wall-clock freshness is now() - max(indexed.updated_at): how old is the newest document the index has caught up to. This is what users feel. It captures the entire pipeline — capture, transform, sink, refresh — in one number, and it keeps climbing when the pipeline stalls, even if the message queue looks empty.
Consumer-group lag is the count of unconsumed offsets on the sink topic. It is cheaper to scrape and isolates the sink stage, but it reports zero during a stall that occurs upstream of the topic, and it counts messages, not seconds. Use it as a leading indicator and a stage-localizer, not as the SLI itself.
The SLI — the service level indicator you build the SLO on — should be wall-clock freshness, expressed as the fraction of measured intervals where lag stayed under threshold. A raw lag exporter emits something like this every scrape:
# HELP search_index_lag_seconds Freshness lag: now minus newest indexed updated_at
# TYPE search_index_lag_seconds gauge
search_index_lag_seconds{index="products"} 4.812
# orders is breaching the 30s threshold
search_index_lag_seconds{index="orders"} 41.307
That orders reading is the failure mode you are alerting on: a sustained gauge above threshold means users are searching stale data right now.
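As a rough illustration of how an exporter might derive that gauge, the sketch below computes wall-clock freshness from two timestamps and formats one sample line. The function names and the fixed timestamps are assumptions made for this example; a real exporter would query the index for its newest updated_at and would normally publish the value through a Prometheus client library rather than formatting text by hand.

```python
from datetime import datetime, timezone


def freshness_lag_seconds(now: datetime, newest_indexed_updated_at: datetime) -> float:
    """Wall-clock freshness: now minus the newest updated_at visible in the index."""
    return (now - newest_indexed_updated_at).total_seconds()


def render_gauge(index: str, lag_seconds: float) -> str:
    """Format one sample line for the search_index_lag_seconds gauge."""
    return f'search_index_lag_seconds{{index="{index}"}} {lag_seconds:.3f}'


# Hypothetical readings: "now" on the exporter host and the newest indexed document.
now = datetime(2024, 1, 1, 12, 0, 45, tzinfo=timezone.utc)
newest_indexed = datetime(2024, 1, 1, 12, 0, 3, 693000, tzinfo=timezone.utc)

line = render_gauge("orders", freshness_lag_seconds(now, newest_indexed))
print(line)
```

Both timestamps are timezone-aware on purpose: subtracting a naive datetime from an aware one raises a TypeError, which is a safer failure than a silently wrong lag.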
Solution Steps
1. Emit a boolean “good event” per scrape
Burn-rate math needs a ratio of good events to total events, not a gauge. Convert the lag gauge into a per-scrape boolean with a recording rule. A scrape interval counts as “good” when the measured lag is under the threshold.
# rules/search-freshness.recording.yml
groups:
  - name: search_freshness_sli
    interval: 15s # match or exceed your scrape interval
    rules:
      # 1 when fresh, 0 when stale: the per-scrape good/bad signal
      - record: search:freshness_good:ratio
        expr: |
          (search_index_lag_seconds < bool 30)
The < bool 30 operator returns 1 or 0 instead of filtering the series, so a stale sample stays in the timeseries as a zero rather than disappearing — disappearing samples would silently shrink your denominator and hide the breach.
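The difference between the two comparison forms can be imitated in plain Python. This is only an analogy, with made-up lag samples and a deliberately simplified view of the filtering case, not a reproduction of PromQL evaluation.

```python
THRESHOLD_S = 30.0

# Hypothetical lag readings from five consecutive scrapes, in seconds.
lag_samples = [4.8, 5.1, 41.3, 38.9, 6.0]

# Analogue of `< bool 30`: every scrape survives as 1.0 (fresh) or 0.0 (stale).
as_bool = [1.0 if lag < THRESHOLD_S else 0.0 for lag in lag_samples]
bool_ratio = sum(as_bool) / len(as_bool)

# Analogue of plain `< 30`: stale scrapes vanish, so only fresh ones are left
# to count and the denominator shrinks along with the numerator.
surviving = [lag for lag in lag_samples if lag < THRESHOLD_S]
filtered_ratio = len(surviving) / len(surviving)

print(bool_ratio, filtered_ratio)
```

With the boolean form, two stale scrapes out of five pull the ratio down to 0.6. With the filtering form, the same data still looks perfectly fresh because the stale scrapes are no longer counted.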
2. Aggregate the SLI over rolling windows
The SLO is “99% of intervals fresh.” Compute the achieved ratio over each alert window with avg_over_time. Define one recording rule per window you will reference in the alert; pre-computing them keeps alert evaluation cheap.
# rules/search-freshness.windows.yml
groups:
  - name: search_freshness_windows
    interval: 30s
    rules:
      - record: search:freshness_good:ratio_rate5m
        expr: avg_over_time(search:freshness_good:ratio[5m])
      - record: search:freshness_good:ratio_rate1h
        expr: avg_over_time(search:freshness_good:ratio[1h])
      - record: search:freshness_good:ratio_rate30m
        expr: avg_over_time(search:freshness_good:ratio[30m])
      - record: search:freshness_good:ratio_rate6h
        expr: avg_over_time(search:freshness_good:ratio[6h])
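To see why the window length matters, the following sketch averages the same boolean history over a short and a long window. It assumes a 15-second sample spacing and one hour of history whose last three minutes are stale; the helper is a simplified stand-in for avg_over_time, and real range selectors may include one sample more or less depending on alignment.

```python
from typing import List

SAMPLE_SPACING_S = 15  # assumed spacing between boolean samples


def avg_over_window(samples: List[float], window_s: int) -> float:
    """Average the most recent samples that fit in window_s seconds."""
    count = window_s // SAMPLE_SPACING_S
    recent = samples[-count:]
    return sum(recent) / len(recent)


# One hour of history (240 samples): 57 fresh minutes, then 3 stale minutes.
history = [1.0] * 228 + [0.0] * 12

ratio_5m = avg_over_window(history, 5 * 60)
ratio_1h = avg_over_window(history, 60 * 60)
print(ratio_5m, ratio_1h)
```

The same three stale minutes leave the 5-minute ratio at 0.4 but the 1-hour ratio at 0.95. The short window reacts quickly and recovers quickly, while the long window shows whether the problem is large enough to matter.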
3. Frame the error budget
With a 99% target, the error budget is 1 - 0.99 = 0.01 — 1% of intervals may be stale over the SLO window (say, 30 days) without breaching. Burn rate is how fast you are spending that budget relative to a steady linear spend. Burn rate 1 exhausts the budget exactly at the window’s end; burn rate 14.4 exhausts it in roughly 2 days; burn rate 6 in roughly 5 days. The 1 - ratio term below is the observed bad-event fraction, and dividing by the budget gives the multiplier:
# Burn rate over the last hour: how many budgets/period we are spending
(1 - search:freshness_good:ratio_rate1h) / (1 - 0.99)
A value of 14.4 means the error budget is being consumed 14.4 times faster than the steady rate that would spend it exactly over the SLO window.
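The arithmetic behind those figures can be checked with a few lines of Python. The sketch assumes the 99% target and the 30-day SLO window used as the example above; the function names are illustrative only.

```python
SLO_TARGET = 0.99
SLO_WINDOW_DAYS = 30.0  # assumed SLO window, as in the example above


def burn_rate(good_ratio: float, target: float = SLO_TARGET) -> float:
    """Observed bad-event fraction divided by the error budget."""
    return (1 - good_ratio) / (1 - target)


def days_to_exhaustion(rate: float, window_days: float = SLO_WINDOW_DAYS) -> float:
    """Days until the whole budget is gone if this burn rate is sustained."""
    return window_days / rate


def budget_spent(rate: float, alert_window_hours: float,
                 window_days: float = SLO_WINDOW_DAYS) -> float:
    """Fraction of the total budget consumed during one alert window at this rate."""
    return rate * alert_window_hours / (window_days * 24)


print(f"{days_to_exhaustion(14.4):.2f} days at 14.4x")
print(f"{days_to_exhaustion(6):.2f} days at 6x")
print(f"{budget_spent(14.4, 1):.0%} of budget in 1h at 14.4x")
print(f"{budget_spent(6, 6):.0%} of budget in 6h at 6x")
```

Under these assumptions, a 14.4x burn held for one hour spends about 2% of the monthly budget, and a 6x burn held for six hours spends about 5%.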
4. Page on fast burn, ticket on slow burn
A single threshold is either too twitchy or too slow. The multi-window, multi-burn-rate pattern requires a long window to confirm a sustained problem and a short window to confirm it is still happening, so a recovered blip stops paging within minutes instead of lingering until the long window clears. Fast burn pages; slow burn opens a ticket.
# rules/search-freshness.alerts.yml
groups:
  - name: search_freshness_slo
    rules:
      # FAST BURN, 14.4x: budget gone in ~2 days. Page.
      - alert: SearchFreshnessFastBurn
        expr: |
          (1 - search:freshness_good:ratio_rate1h) / (1 - 0.99) > 14.4
          and
          (1 - search:freshness_good:ratio_rate5m) / (1 - 0.99) > 14.4
        for: 2m
        labels:
          severity: page
          slo: search_freshness
        annotations:
          summary: "Index freshness burning budget fast ({{ $labels.index }})"
          description: "1h and 5m windows both >14.4x burn; users seeing stale results."
      # SLOW BURN, 6x: budget gone in ~5 days. Ticket, no page.
      - alert: SearchFreshnessSlowBurn
        expr: |
          (1 - search:freshness_good:ratio_rate6h) / (1 - 0.99) > 6
          and
          (1 - search:freshness_good:ratio_rate30m) / (1 - 0.99) > 6
        for: 15m
        labels:
          severity: ticket
          slo: search_freshness
        annotations:
          summary: "Index freshness slowly eroding ({{ $labels.index }})"
          description: "6h and 30m windows both >6x burn; investigate before it pages."
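The decision these two rules encode can be sketched as a small Python function. It is a simplification: it ignores the for: clauses, and it returns only the highest matching severity, whereas Prometheus evaluates the rules independently and both can be active at once. The function and argument names are assumptions for the example.

```python
from typing import Optional

ERROR_BUDGET = 1 - 0.99


def burn(good_ratio: float) -> float:
    """Burn rate for one window: bad fraction divided by the error budget."""
    return (1 - good_ratio) / ERROR_BUDGET


def classify(r5m: float, r30m: float, r1h: float, r6h: float) -> Optional[str]:
    """Return 'page', 'ticket', or None from the four windowed good ratios."""
    if burn(r1h) > 14.4 and burn(r5m) > 14.4:
        return "page"    # long window confirms it, short window says it is ongoing
    if burn(r6h) > 6 and burn(r30m) > 6:
        return "ticket"  # slower erosion, still ongoing
    return None


# Ongoing outage: the last 5 minutes are entirely stale.
print(classify(r5m=0.0, r30m=0.5, r1h=0.8, r6h=0.97))
# Recovered blip: the 1h window is still bad, but the last 5 minutes are clean.
print(classify(r5m=1.0, r30m=0.9, r1h=0.8, r6h=0.97))
# Slow erosion: never bad enough to page, but steadily over 6x.
print(classify(r5m=0.9, r30m=0.92, r1h=0.9, r6h=0.93))
```

The second case is the reason for the short window: the hour still looks bad, yet nothing is paged because the problem has already stopped.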
5. Route by severity in Alertmanager
The two severities must reach different places, or slow burn becomes noise that trains the on-call to ignore the page.
# alertmanager.yml (route excerpt)
route:
  receiver: search-tickets
  group_by: ['slo', 'index']
  routes:
    - matchers: [ 'slo="search_freshness"', 'severity="page"' ]
      receiver: search-pager
      group_wait: 30s
      repeat_interval: 1h
    - matchers: [ 'slo="search_freshness"', 'severity="ticket"' ]
      receiver: search-tickets
      repeat_interval: 12h
receivers:
  - name: search-pager
    pagerduty_configs:
      # Placeholder: Alertmanager does not expand environment variables in its
      # config file, so substitute this value (for example with envsubst) before loading.
      - routing_key: '${PD_ROUTING_KEY}'
  - name: search-tickets
    webhook_configs:
      - url: 'http://ticket-bot:8080/alert'
Verification
Load the rules and confirm Prometheus accepts them:
promtool check rules rules/search-freshness.*.yml
# Expected output resembles:
# Checking rules/search-freshness.alerts.yml
#   SUCCESS: 2 rules found
# Checking rules/search-freshness.recording.yml
#   SUCCESS: 1 rules found
# Checking rules/search-freshness.windows.yml
#   SUCCESS: 4 rules found
Confirm the SLI series exists and looks sane (a healthy pipeline sits near 1.0):
search:freshness_good:ratio_rate5m{index="products"}
# => 1 (all of the last 5m fresh)
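Force a breach in a dev environment by pausing the sink connector, then assert the alert pends and fires. Starting from a fully fresh hour, the 1h window crosses the 14.4x threshold once about 14.4% of it is stale (roughly 9 minutes of continuous staleness), and the for: 2m clause must then elapse before the alert fires:
# Pause the sink, wait roughly 10 to 12 minutes, then query the alert state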
curl -s 'http://localhost:9090/api/v1/alerts' \
| jq '.data.alerts[] | select(.labels.alertname=="SearchFreshnessFastBurn") | .state'
# Expected progression: "pending" -> after `for: 2m` -> "firing"
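If the check needs to live in a script rather than a shell one-liner, the same filter can be written in Python. The sketch below only parses a response body; the sample payload is hand-written to resemble the shape the jq filter above relies on (data.alerts[], each entry with labels and state), and fetching it from the server is left out.

```python
import json
from typing import List


def alert_states(payload: dict, alertname: str) -> List[str]:
    """Return the state of every active alert with the given alertname."""
    alerts = payload.get("data", {}).get("alerts", [])
    return [
        alert["state"]
        for alert in alerts
        if alert.get("labels", {}).get("alertname") == alertname
    ]


# Hand-written sample body; the values are made up for illustration.
sample_body = """
{"status": "success", "data": {"alerts": [
  {"labels": {"alertname": "SearchFreshnessFastBurn", "index": "orders", "severity": "page"},
   "state": "pending"},
  {"labels": {"alertname": "SearchFreshnessSlowBurn", "index": "orders", "severity": "ticket"},
   "state": "firing"}
]}}
"""

states = alert_states(json.loads(sample_body), "SearchFreshnessFastBurn")
print(states)
```

An empty list means the alert is inactive, which is the expected result before the breach and again after the sink is resumed and the windows recover.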
Unit-test the alert logic so a refactor cannot silently break paging:
# tests/freshness_alert_test.yml (run: promtool test rules tests/freshness_alert_test.yml)
rule_files: [ '../rules/search-freshness.alerts.yml' ]
tests:
  - interval: 1m
    input_series:
      - series: 'search:freshness_good:ratio_rate1h{index="orders"}'
        values: '0.5x10' # 50% bad -> burn rate 50, well over 14.4
      - series: 'search:freshness_good:ratio_rate5m{index="orders"}'
        values: '0.5x10'
    alert_rule_test:
      - eval_time: 3m
        alertname: SearchFreshnessFastBurn
        exp_alerts:
          - exp_labels: { severity: page, slo: search_freshness, index: orders }
            # promtool compares annotations as well as labels, so they must match the rule
            exp_annotations:
              summary: "Index freshness burning budget fast (orders)"
              description: "1h and 5m windows both >14.4x burn; users seeing stale results."
Common Pitfalls
Alerting on the raw lag gauge instead of a budget ratio
Root cause: a rule like search_index_lag_seconds > 30 with for: 5m has no memory of how much budget remains. With a short or missing for clause it pages on every transient spike (a single slow batch, a refresh-interval hiccup, a GC pause), so a pipeline that is 99.9% fresh still pages constantly. With a longer for clause, one sample back under the threshold resets the timer, so intermittent staleness can burn budget without ever firing. Remediation: alert on burn rate against the budget (Step 4), not on the instantaneous gauge. Keep the gauge for dashboards and post-incident drill-down only.
Clock skew between the source and the indexer poisons the SLI
Root cause: now() - max(indexed.updated_at) mixes two clocks: the one that stamped updated_at and the one that evaluates now(). If the stamping clock runs a few seconds ahead of the evaluating clock, lag reads negative or artificially low and the SLI reports “fresh” during a real stall; skew in the other direction inflates lag and reports staleness that users never saw. Remediation: compute lag from a single clock (stamp source.ts_ms at capture time, as a CDC pipeline already does, and compare against now() on the same host that read the source), or enforce NTP sync and apply clamp_min(search_index_lag_seconds, 0) before the boolean rule as a guard against negative readings.
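A small worked example shows how skew hides a stall and what clamping does and does not fix. The numbers are invented, and the helper names are illustrative; clamp_min here merely imitates the PromQL function of the same name.

```python
THRESHOLD_S = 30.0


def measured_lag(true_lag_s: float, stamp_clock_ahead_s: float) -> float:
    """Lag as observed when the clock that stamped updated_at runs ahead
    of the clock that evaluates now() by stamp_clock_ahead_s seconds."""
    return true_lag_s - stamp_clock_ahead_s


def clamp_min(value: float, floor: float = 0.0) -> float:
    """Imitation of PromQL clamp_min: never report less than floor."""
    return max(value, floor)


# A real 40s stall with the stamping clock 15s ahead reads as 25s: "fresh".
stalled = measured_lag(true_lag_s=40.0, stamp_clock_ahead_s=15.0)
# A healthy 5s lag under the same skew reads as -10s.
negative = measured_lag(true_lag_s=5.0, stamp_clock_ahead_s=15.0)

print(stalled, stalled < THRESHOLD_S)
print(negative, clamp_min(negative))
```

Clamping tidies the negative reading but leaves the 25-second reading untouched, so the stall is still misclassified as fresh. That is why the single-clock approach is the stronger remedy and the clamp is only a guard.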
An idle index reports zero lag and masks a dead pipeline
Root cause: when lag is derived by comparing the newest source updated_at with the newest indexed updated_at, a quiet source means neither side advances and lag looks fine, yet a fully crashed consumer produces the same flat reading. Remediation: emit a heartbeat document on a timer, or alert separately on consumer-group liveness, so a stalled pipeline is distinguishable from a quiet one. Add an absent(search_index_lag_seconds) alert as the dead-man’s switch for the case where the lag metric itself stops being reported.
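The heartbeat idea can be sketched as a check on how old the newest indexed heartbeat is. Everything here is an assumption for illustration: the 10-second heartbeat period, the tolerance of three missed beats, and the function name.

```python
def pipeline_state(now_s: float, newest_heartbeat_indexed_s: float,
                   heartbeat_period_s: float = 10.0, missed_allowed: int = 3) -> str:
    """Classify the pipeline from the age of the newest heartbeat seen in the index.

    A quiet source still produces heartbeats, so an old heartbeat means the
    pipeline itself has stopped, not that there was nothing to index.
    """
    heartbeat_age_s = now_s - newest_heartbeat_indexed_s
    if heartbeat_age_s > heartbeat_period_s * missed_allowed:
        return "stalled"
    return "healthy"


# Quiet but alive: the last heartbeat reached the index 5 seconds ago.
print(pipeline_state(now_s=1000.0, newest_heartbeat_indexed_s=995.0))
# Dead consumer: no heartbeat has reached the index for 100 seconds.
print(pipeline_state(now_s=1000.0, newest_heartbeat_indexed_s=900.0))
```

Because the heartbeat travels the same path as real documents, its age also gives the lag gauge a steady stream of fresh timestamps during quiet periods.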
Related
- Observability and SRE for search — the parent area where freshness SLOs sit alongside availability and latency objectives.
- Instrumenting search with OpenTelemetry — emit the lag and trace spans that feed these recording rules.
- Change data capture setup — the pipeline whose consumer-group lag and capture timestamps power the freshness SLI.