Alerting on Indexing Lag with Freshness SLOs

A document committed to your primary database but not yet visible in the index is a silent correctness bug: the user searches, gets stale results, and never sees an error. Page-on-everything alerting drowns the on-call engineer; page-on-nothing ships staleness to production. This guide defines a freshness service level objective (SLO) on indexing lag and alerts on it with multi-window, multi-burn-rate rules, the same discipline applied to availability elsewhere in observability and SRE practice for search. The goal is an alert that fires when, and only when, freshness is genuinely at risk of exhausting its error budget.

Prerequisites

  1. A Prometheus server (v2.40+) scraping your indexing pipeline.
  2. A way to read the source-of-truth updated_at clock and the indexed updated_at clock — either an exporter that queries both, or Kafka consumer-group lag for the sink connector.
  3. Alertmanager (v0.25+) wired to a paging channel and a ticketing channel.
  4. An agreed freshness target with product owners, expressed as a percentile and a threshold (for example, 99% of documents indexed within 30 seconds).

Diagnosis: what “indexing lag” actually measures

There are two defensible definitions of lag, and they answer different questions.

Wall-clock freshness is now() - max(indexed.updated_at): how old is the newest document the index has caught up to. This is what users feel. It captures the entire pipeline — capture, transform, sink, refresh — in one number, and it keeps climbing when the pipeline stalls, even if the message queue looks empty.

Consumer-group lag is the count of unconsumed offsets on the sink topic. It is cheaper to scrape and isolates the sink stage, but it reports zero during a stall that occurs upstream of the topic, and it counts messages, not seconds. Use it as a leading indicator and a stage-localizer, not as the SLI itself.

The SLI — the service level indicator you build the SLO on — should be wall-clock freshness, expressed as the fraction of measured intervals where lag stayed under threshold. A raw lag exporter emits something like this every scrape:

# HELP search_index_lag_seconds Freshness lag: now minus newest indexed updated_at
# TYPE search_index_lag_seconds gauge
search_index_lag_seconds{index="products"} 4.812
# orders is breaching the 30s threshold:
search_index_lag_seconds{index="orders"} 41.307

That orders reading is the failure mode you are alerting on: a sustained gauge above threshold means users are searching stale data right now.
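As a rough illustration of where such a sample could come from, the sketch below computes wall-clock freshness from two timestamps and formats it as an exposition line. The function names and the fixed timestamps are hypothetical; a real exporter would read the newest indexed updated_at from the search engine and would normally use a Prometheus client library rather than string formatting.

```python
from datetime import datetime, timezone


def freshness_lag_seconds(now: datetime, newest_indexed_updated_at: datetime) -> float:
    """Wall-clock freshness: now minus the newest updated_at visible in the index."""
    lag = (now - newest_indexed_updated_at).total_seconds()
    # Guard: never report a negative lag if the two clocks disagree slightly.
    return max(lag, 0.0)


def exposition_line(index: str, lag: float) -> str:
    """Format one gauge sample the way the exporter output above shows it."""
    return f'search_index_lag_seconds{{index="{index}"}} {lag:.3f}'


# Hypothetical timestamps chosen to reproduce the orders reading.
now = datetime(2024, 1, 1, 12, 0, 41, 307000, tzinfo=timezone.utc)
newest = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(exposition_line("orders", freshness_lag_seconds(now, newest)))
# search_index_lag_seconds{index="orders"} 41.307
```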

[Figure: Freshness SLI to burn-rate alert flow. Two lag clocks (now() minus indexed updated_at, and consumer-group lag on the sink as a leading indicator) feed a recording rule that computes a good-event ratio, the SLI. A multi-window, multi-burn-rate alert evaluates that ratio against the error budget (1 minus the SLO target): fast burn pages, slow burn tickets.]

Solution Steps

1. Emit a boolean “good event” per scrape

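Burn-rate math needs a ratio of good events to total events, not a gauge. Convert the lag gauge into a per-scrape boolean with a recording rule: an evaluation interval is “good” when the measured lag is under the threshold. Note that this counts time intervals rather than individual documents, so it approximates a document-level target rather than measuring it directly.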

# rules/search-freshness.recording.yml
groups:
  - name: search_freshness_sli
    interval: 15s            # match or exceed your scrape interval
    rules:
      # 1 when fresh, 0 when stale — the per-scrape good/bad signal
      - record: search:freshness_good:ratio
        expr: |
          (search_index_lag_seconds < bool 30)

The < bool 30 operator returns 1 or 0 instead of filtering the series, so a stale sample stays in the timeseries as a zero rather than disappearing — disappearing samples would silently shrink your denominator and hide the breach.
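The difference between filtering and the bool modifier can be mimicked outside Prometheus. The following is a minimal sketch in plain Python, not PromQL; the lag values and function names are made up for illustration.

```python
THRESHOLD_S = 30.0


def good_ratio_filtered(lags):
    """Mimics `lag < 30`: stale samples vanish, so the denominator shrinks with them."""
    survivors = [lag for lag in lags if lag < THRESHOLD_S]
    if not survivors:
        return None  # no samples left at all: the series has a gap
    return len(survivors) / len(survivors)


def good_ratio_bool(lags):
    """Mimics `lag < bool 30`: every sample survives as 1.0 (fresh) or 0.0 (stale)."""
    flags = [1.0 if lag < THRESHOLD_S else 0.0 for lag in lags]
    return sum(flags) / len(flags)


lags = [4.8, 5.1, 41.3, 55.0, 6.2]  # hypothetical scrapes, two of them stale
print(good_ratio_filtered(lags))  # 1.0
print(good_ratio_bool(lags))      # 0.6
```

The filtered variant reports a perfect ratio even though two of five scrapes were stale, which is exactly the hidden breach described above.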

2. Aggregate the SLI over rolling windows

The SLO is “99% of intervals fresh.” Compute the achieved ratio over each alert window with avg_over_time. Define one recording rule per window you will reference in the alert; pre-computing them keeps alert evaluation cheap.

# rules/search-freshness.windows.yml
groups:
  - name: search_freshness_windows
    interval: 30s
    rules:
      - record: search:freshness_good:ratio_rate5m
        expr: avg_over_time(search:freshness_good:ratio[5m])
      - record: search:freshness_good:ratio_rate1h
        expr: avg_over_time(search:freshness_good:ratio[1h])
      - record: search:freshness_good:ratio_rate30m
        expr: avg_over_time(search:freshness_good:ratio[30m])
      - record: search:freshness_good:ratio_rate6h
        expr: avg_over_time(search:freshness_good:ratio[6h])

3. Frame the error budget

With a 99% target, the error budget is 1 - 0.99 = 0.01 — 1% of intervals may be stale over the SLO window (say, 30 days) without breaching. Burn rate is how fast you are spending that budget relative to a steady linear spend. Burn rate 1 exhausts the budget exactly at the window’s end; burn rate 14.4 exhausts it in roughly 2 days; burn rate 6 in roughly 5 days. The 1 - ratio term below is the observed bad-event fraction, and dividing by the budget gives the multiplier:

# Burn rate over the last hour: how many budgets/period we are spending
(1 - search:freshness_good:ratio_rate1h) / (1 - 0.99)

A value of 14.4 means the freshness budget is being spent 14.4 times faster than the steady rate that would use it up exactly at the end of the SLO window.

4. Page on fast burn, ticket on slow burn

A single threshold is either too twitchy or too slow. The multi-window, multi-burn-rate pattern requires a long window to confirm a sustained problem and a short window to confirm it is still happening, so the alert clears within minutes of recovery instead of firing until the long window drains. Fast burn pages; slow burn opens a ticket.

# rules/search-freshness.alerts.yml
groups:
  - name: search_freshness_slo
    rules:
      # FAST BURN — 14.4x: budget gone in ~2 days. Page.
      - alert: SearchFreshnessFastBurn
        expr: |
          (1 - search:freshness_good:ratio_rate1h) / (1 - 0.99) > 14.4
          and
          (1 - search:freshness_good:ratio_rate5m) / (1 - 0.99) > 14.4
        for: 2m
        labels:
          severity: page
          slo: search_freshness
        annotations:
          summary: "Index freshness burning budget fast ({{ $labels.index }})"
          description: "1h and 5m windows both >14.4x burn; users seeing stale results."
      # SLOW BURN — 6x: budget gone in ~5 days. Ticket, no page.
      - alert: SearchFreshnessSlowBurn
        expr: |
          (1 - search:freshness_good:ratio_rate6h) / (1 - 0.99) > 6
          and
          (1 - search:freshness_good:ratio_rate30m) / (1 - 0.99) > 6
        for: 15m
        labels:
          severity: ticket
          slo: search_freshness
        annotations:
          summary: "Index freshness slowly eroding ({{ $labels.index }})"
          description: "6h and 30m windows both >6x burn; investigate before it pages."
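To see why both windows are required, the sketch below replays the fast-burn condition over a made-up series with one sample per minute. It is a simplified model in Python: it ignores the `for:` clause and rule evaluation timing, and the helper names are hypothetical.

```python
BUDGET = 1 - 0.99  # error budget for a 99% target


def burn(samples, window):
    """Burn rate over the trailing `window` samples (1.0 = fresh, 0.0 = stale)."""
    recent = samples[-window:]
    bad_fraction = 1 - sum(recent) / len(recent)
    return bad_fraction / BUDGET


def fast_burn_firing(samples, long_window=60, short_window=5, threshold=14.4):
    """Both the long and the short window must exceed the burn threshold."""
    return burn(samples, long_window) > threshold and burn(samples, short_window) > threshold


# 40 fresh minutes followed by a 20-minute stall that is still ongoing.
during_stall = [1.0] * 40 + [0.0] * 20
# The same incident viewed 6 minutes after the pipeline recovered.
after_recovery = [1.0] * 34 + [0.0] * 20 + [1.0] * 6

print(fast_burn_firing(during_stall))    # True
print(fast_burn_firing(after_recovery))  # False
print(burn(after_recovery, 60) > 14.4)   # True
```

After recovery the 1h window is still far above 14.4x, so a long-window-only rule would keep paging; the 5m window is what lets the alert clear.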

5. Route by severity in Alertmanager

The two severities must reach different places, or slow burn becomes noise that trains the on-call to ignore the page.

# alertmanager.yml (route excerpt)
route:
  receiver: search-tickets
  group_by: ['slo', 'index']
  routes:
    - matchers: [ 'slo="search_freshness"', 'severity="page"' ]
      receiver: search-pager
      group_wait: 30s
      repeat_interval: 1h
    - matchers: [ 'slo="search_freshness"', 'severity="ticket"' ]
      receiver: search-tickets
      repeat_interval: 12h
receivers:
  - name: search-pager
    pagerduty_configs:
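      # Alertmanager does not expand environment variables; render this
      # placeholder with your config templating before loading the file.
      - routing_key: '${PD_ROUTING_KEY}'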
  - name: search-tickets
    webhook_configs:
      - url: 'http://ticket-bot:8080/alert'

Verification

Load the rules and confirm Prometheus accepts them:

promtool check rules rules/search-freshness.*.yml
# Expected:
#   Checking rules/search-freshness.alerts.yml
#     SUCCESS: 2 rules found
#   Checking rules/search-freshness.recording.yml
#     SUCCESS: 1 rules found
#   Checking rules/search-freshness.windows.yml
#     SUCCESS: 4 rules found

Confirm the SLI series exists and looks sane (a healthy pipeline sits near 1.0):

search:freshness_good:ratio_rate5m{index="products"}
# => 1   (all of the last 5m fresh)

Force a breach in a dev environment by pausing the sink connector, then assert the alert pends and fires:

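# Pause the sink and keep it paused. Starting from a fully fresh hour, the 1h window
# needs about 9 minutes of stale samples (14.4% of 60 min) to cross 14.4x. Then query: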
curl -s 'http://localhost:9090/api/v1/alerts' \
  | jq '.data.alerts[] | select(.labels.alertname=="SearchFreshnessFastBurn") | .state'
# Expected progression:  "pending"  -> after `for: 2m` ->  "firing"
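How long the forced breach takes to show up follows from the window sizes. The sketch below estimates it in Python, assuming the 99% target, that every sample before the pause was fresh, and ignoring the 30 seconds the lag needs to cross the threshold plus scrape and evaluation delays; treat the result as a lower bound.

```python
SLO_TARGET = 0.99


def minutes_to_cross(window_minutes, burn_threshold, target=SLO_TARGET):
    """Minutes of continuous staleness before a window's burn rate reaches the
    threshold, assuming every earlier sample in that window was fresh."""
    bad_fraction_needed = burn_threshold * (1 - target)
    return window_minutes * bad_fraction_needed


long_window = minutes_to_cross(60, 14.4)  # the 1h window
short_window = minutes_to_cross(5, 14.4)  # the 5m window
pending_after = max(long_window, short_window)  # both conditions must hold
firing_after = pending_after + 2                # plus `for: 2m`
print(round(pending_after, 2), round(firing_after, 2))  # 8.64 10.64
```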

Unit-test the alert logic so a refactor cannot silently break paging:

# tests/freshness_alert_test.yml  (run: promtool test rules tests/freshness_alert_test.yml)
rule_files: [ '../rules/search-freshness.alerts.yml' ]
tests:
  - interval: 1m
    input_series:
      - series: 'search:freshness_good:ratio_rate1h{index="orders"}'
        values: '0.5x10'        # 50% bad -> burn rate 50, well over 14.4
      - series: 'search:freshness_good:ratio_rate5m{index="orders"}'
        values: '0.5x10'
    alert_rule_test:
      - eval_time: 3m
        alertname: SearchFreshnessFastBurn
        exp_alerts:
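          - exp_labels: { severity: page, slo: search_freshness, index: orders }
            # promtool compares annotations too, so they must match the rendered rule
            exp_annotations:
              summary: "Index freshness burning budget fast (orders)"
              description: "1h and 5m windows both >14.4x burn; users seeing stale results."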

Common Pitfalls

Alerting on the raw lag gauge instead of a budget ratio

Root cause: a rule like search_index_lag_seconds > 30 with for: 5m knows nothing about the budget. With a short for: it pages on every transient spike (a single slow batch, a refresh-interval hiccup, a GC pause); with a long for: it resets whenever one sample dips under the threshold, so a flapping pipeline never fires. Either way it has no memory of how much budget remains, so a pipeline that is 99.9% fresh can still page repeatedly. Remediation: alert on burn rate against the budget (Step 4), not on the instantaneous gauge. Keep the gauge for dashboards and post-incident drill-down only.

Clock skew between the source and the indexer poisons the SLI

Root cause: now() - max(indexed.updated_at) mixes two clocks: now() comes from the host that computes the lag, updated_at from the database. If the measuring host runs a few seconds behind the database, lag reads negative or artificially low and the SLI reports “fresh” during a real stall; if it runs ahead, lag reads high and burns budget that was never actually spent. Remediation: compute lag from a single clock by stamping source.ts_ms at capture time (as a CDC pipeline already does) and comparing against now() on the same host that read the source, or enforce NTP sync. As a guard, apply clamp_min(search_index_lag_seconds, 0) before the boolean rule; it removes negative readings but does not correct an understated lag.

An idle index reports zero lag and masks a dead pipeline

Root cause: when no documents change, an exporter that compares the newest source updated_at with the newest indexed updated_at reads zero lag, and a fully crashed consumer behind a quiet source produces the same flat reading. An exporter that uses now() - max(indexed.updated_at) fails the other way: lag climbs on a quiet source even though nothing is stale. Remediation: emit a heartbeat document on a timer so a stalled pipeline is distinguishable from a quiet one, and alert separately on consumer-group liveness. Add absent(search_index_lag_seconds) as a dead-man’s switch for the exporter itself: it fires when the lag metric stops being reported at all.
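One way to picture the heartbeat check is as a staleness test on the newest heartbeat document that has reached the index. The sketch below is a hypothetical illustration in Python; the 10-second period and the tolerance of three missed beats are assumptions, not recommendations.

```python
def pipeline_state(now_s, newest_indexed_heartbeat_s,
                   heartbeat_period_s=10.0, missed_allowed=3):
    """Classify the pipeline from the age of the newest heartbeat seen in the index.

    A quiet source still produces heartbeats, so only a stalled pipeline
    lets the newest indexed heartbeat grow old.
    """
    age = now_s - newest_indexed_heartbeat_s
    if age > heartbeat_period_s * missed_allowed:
        return "stalled"
    return "alive"


print(pipeline_state(1000.0, 995.0))  # alive
print(pipeline_state(1000.0, 940.0))  # stalled
```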