Canary Deploying a Relevance Model Update

A ranking model that wins offline can still lose in production: an NDCG gain on a frozen judgment set says nothing about live click-through, latency tails, or the rare-query distribution the labels never covered. Shipping it to 100% of traffic on faith risks a relevance regression that no unit test catches and that users feel as “search got worse.” This guide canary-deploys a relevance model in stages: shadow first, then interleaving, then a small live percentage gated by guardrail metrics and automated rollback. It treats the canary as a routine part of observability and SRE practice for search, within the wider discipline of search engine selection and architecture. The objective is to detect a regression on a 1% slice before it reaches everyone.

Prerequisites

  1. A ranking layer that can load two model versions concurrently and route per-request (a feature-store or model-server abstraction, not a hardcoded scorer).
  2. Query-level logging of impressions, clicks, and result-set fingerprints, with a stable request_id.
  3. A guardrail metric pipeline emitting CTR, zero-result rate, latency percentiles, and an online relevance proxy.
  4. A traffic-splitting mechanism keyed on a hashed, sticky identity (user or session) so a user never flickers between models mid-session.

Diagnosis: why offline wins do not guarantee online wins

A learning-to-rank model is trained against historical judgments, but production traffic drifts — new products, seasonal queries, a different head/tail mix. Offline NDCG is computed on the queries you happened to label; the model can regress badly on the unlabeled tail while improving the head, and the aggregate metric hides it. Worse, a model can win on relevance and lose on latency, pushing the p99 over the budget and degrading every query, ranked well or not.
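To make the head/tail point concrete, here is a minimal sketch that reports NDCG per query segment instead of one aggregate. It assumes graded judgments are available per query in ranked order; the segment labels and the helper names (`ndcg_by_segment` and friends) are hypothetical, not part of any existing pipeline.

```python
import math
from collections import defaultdict

def dcg(gains):
    """Discounted cumulative gain for gains listed in ranked order."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains):
    """DCG normalized by the ideal ordering; 0.0 when nothing is relevant."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

def ndcg_by_segment(rows):
    """rows: iterable of (segment, gains_in_ranked_order) pairs.

    Returns the mean NDCG per segment so a tail regression cannot hide
    behind a head improvement in a single averaged number.
    """
    buckets = defaultdict(list)
    for segment, gains in rows:
        buckets[segment].append(ndcg(gains))
    return {seg: sum(vals) / len(vals) for seg, vals in buckets.items()}
```

Comparing the per-segment output for both model versions, rather than the overall mean, is one way to surface a regression confined to the unlabeled or thinly labeled tail.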

The failure you are guarding against looks like this in the guardrail log — relevance nudges up while a guardrail quietly breaks:

model=ltr-v7  ctr=0.182  zero_result_rate=0.031  p99_ms=141  ndcg_online=0.612
model=ltr-v8  ctr=0.179  zero_result_rate=0.067  p99_ms=388  ndcg_online=0.628
#                                         ^^^^^         ^^^
#                                         2x worse      over budget

ltr-v8 improved online NDCG yet doubled the zero-result rate and blew the latency budget — a net loss that a single relevance number would have green-lit. Canarying exists to catch exactly this trade.

[Figure: Canary traffic split with guardrail gate. Live queries pass through a sticky hash split: 99% of traffic goes to the stable model and 1% to the canary model; guardrail metrics gate promotion or trigger automated rollback.]

Solution Steps

1. Shadow the new model first — no user impact

Before any user sees v8, run it in shadow: score the same live queries with both models, log both result sets, serve only v7. This catches latency regressions and crashes against real query distribution at zero risk.

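# ranking_handler.py (shadow stage)
import time
from concurrent.futures import ThreadPoolExecutor

# stable_model, canary_model, metrics, log_interleave and kendall_tau are
# assumed to be provided by the serving stack; they are not defined here.
executor = ThreadPoolExecutor(max_workers=4)  # pool size is illustrative

def rank(query, ctx):
    stable = stable_model.score(query, ctx)          # served to the user
    # Fire-and-forget; never block the response or surface canary errors.
    executor.submit(_shadow_eval, query, ctx, stable)
    return stable

def _shadow_eval(query, ctx, stable):
    t0 = time.monotonic()
    try:
        canary = canary_model.score(query, ctx)
    except Exception:
        # Count the failure and stop; a canary crash must never reach the user.
        metrics.incr("canary.error", tags={"model": "ltr-v8"})
        return
    # Latency is recorded only for successful canary scores.
    metrics.timing("canary.latency_ms", (time.monotonic() - t0) * 1000)
    # Rank-correlation between served and shadow rankings; 1.0 == identical
    log_interleave(query, stable, canary, kendall_tau(stable, canary))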
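The handler above calls `kendall_tau` without defining it. One possible sketch follows, assuming both rankings are lists of unique document ids; it computes Kendall tau-a over the documents the two rankings share, which is an assumption about how partial overlap should be handled rather than the only reasonable choice.

```python
from itertools import combinations

def kendall_tau(ranking_a, ranking_b):
    """Kendall tau-a over documents present in both rankings.

    Returns 1.0 for identical relative order and -1.0 for fully reversed
    order. Documents that appear in only one ranking are ignored.
    """
    pos_b = {doc: i for i, doc in enumerate(ranking_b)}
    shared = [doc for doc in ranking_a if doc in pos_b]
    if len(shared) < 2:
        return 1.0  # fewer than two shared docs: no pair can disagree
    concordant = discordant = 0
    for x, y in combinations(shared, 2):
        # x precedes y in ranking_a by construction of `shared`.
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)
```

The pairwise loop is quadratic in list length, which is acceptable for top-k result lists of a few dozen documents.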

Decision gate: only proceed past shadow when canary p99 sits inside the latency budget and the error rate is zero across a full traffic cycle (typically 24h to cover daily seasonality).
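The decision gate can be sketched as a small check over the shadow-stage measurements. This assumes the latency samples and error count have already been collected for the full traffic cycle; the 250 ms default mirrors the budget used in the guardrail table later in this guide, and the function names are hypothetical.

```python
def p99(samples_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(samples_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]

def shadow_gate_passes(latencies_ms, error_count, budget_ms=250.0):
    """True only when the canary had zero errors and p99 is within budget."""
    if not latencies_ms:
        return False  # no evidence collected yet: do not proceed
    return error_count == 0 and p99(latencies_ms) <= budget_ms
```

Returning False on an empty sample set is deliberate: a gate with no data should block, not pass.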

2. Interleave to compare relevance without a population split

Interleaving (team-draft) is more sensitive than an A/B test because both models compete within the same result list for the same user, removing population variance. Each click is attributed to the model that contributed the clicked document.

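# team_draft_interleave.py
import random

def team_draft(list_a, list_b, k=10):
    """Blend two rankings; track which model placed each doc."""
    out, credit, used = [], {}, set()
    ia = ib = 0            # next unread position in each list
    picks_a = picks_b = 0  # documents each team has placed so far
    while len(out) < k:
        # Team-draft rule: the team with fewer picks goes next; when the
        # teams are level, a coin flip decides who opens the round.
        if picks_a != picks_b:
            pick_a = picks_a < picks_b
        else:
            pick_a = random.random() < 0.5
        src, lst, idx = ("a", list_a, ia) if pick_a else ("b", list_b, ib)
        # Skip documents the other team already placed.
        while idx < len(lst) and lst[idx] in used:
            idx += 1
        if idx >= len(lst):
            break  # this team has run out of candidates; stop blending
        doc = lst[idx]
        out.append(doc)
        used.add(doc)
        credit[doc] = src
        if src == "a":
            ia, picks_a = idx + 1, picks_a + 1
        else:
            ib, picks_b = idx + 1, picks_b + 1
    return out, credit   # credit[clicked_doc] tells you which model won that click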

A reliable preference signal (model B wins clicks with p < 0.05 over a few thousand impressions) is the green light to put real traffic on the canary.
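One way to turn interleaving clicks into the p-value mentioned above is an exact two-sided sign test over per-impression winners. The sketch below assumes each impression has already been reduced to a win for model A, a win for model B, or a tie (ties are dropped before counting); that reduction and the function name are assumptions, not something the logging described earlier provides out of the box.

```python
from math import comb

def sign_test_p_value(wins_b, wins_a):
    """Two-sided exact binomial test of H0: either model is equally likely to win.

    wins_a / wins_b count impressions where that model earned more clicks.
    """
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # no decisive impressions, no evidence either way
    k = max(wins_a, wins_b)
    # Probability of a result at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, 60 wins against 40 over 100 decisive impressions gives a p-value just above 0.05, which is why a few thousand impressions are usually needed before a modest preference clears the threshold.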

3. Split a small live percentage on a sticky hash

Route 1% of users to the canary using a hash of a stable identity, so the same user always lands in the same arm — flickering models mid-session corrupts the guardrail comparison and confuses users.

# traffic_split.py
import hashlib

CANARY_PCT = 1  # start at 1, ramp 1 -> 5 -> 25 -> 50 -> 100 only on clean gates

def model_for(user_id: str) -> str:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "ltr-v8" if bucket < CANARY_PCT else "ltr-v7"
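Before rollout it is worth checking that the bucketing is deterministic and that the canary share lands near the configured percentage. The following sketch re-implements the same SHA-256 bucketing so it can run on its own; `bucket_for` and `canary_share` are illustrative helper names, and the synthetic user ids are made up.

```python
import hashlib

def bucket_for(user_id: str, buckets: int = 100) -> int:
    """Map a stable identity to a bucket in [0, buckets)."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % buckets

def canary_share(user_ids, canary_pct: int) -> float:
    """Fraction of the given users that would be routed to the canary."""
    hits = sum(1 for u in user_ids if bucket_for(u) < canary_pct)
    return hits / len(user_ids)

# Synthetic population for a quick sanity check.
sample_users = [f"user-{i}" for i in range(100_000)]
share_at_1_pct = canary_share(sample_users, canary_pct=1)
```

Because the bucket depends only on the identity, raising the percentage from 1 to 5 keeps every user who was already in the canary there and adds new ones, so nobody flips back to stable during a ramp.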

4. Define guardrail metrics and rollback thresholds

Pick a small set of guardrails; trip any one and the canary rolls back automatically. Latency and zero-result rate are pure guardrails: they can only block a promotion, never justify one. CTR and online NDCG are the win metrics that justify promotion, and each also carries a floor so that a sharp drop stops the ramp.

Metric            Threshold          Direction  Action on breach
zero_result_rate  +0.5pp vs stable   worse      auto-rollback
p99_latency_ms    > 250 (budget)     worse      auto-rollback
ctr               -1.0pp vs stable   worse      auto-rollback
ndcg_online       -0.5pp vs stable   worse      hold ramp, alert
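The table can be expressed as a small evaluation function, shown here as a sketch rather than a production gate. It assumes rates are stored as fractions (so 0.5pp is 0.005) and reads the NDCG threshold as an absolute drop of 0.005; `ArmMetrics` and `evaluate_guardrails` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ArmMetrics:
    ctr: float
    zero_result_rate: float
    p99_ms: float
    ndcg_online: float

def evaluate_guardrails(stable, canary, p99_budget_ms=250.0):
    """Return (decision, breached) where decision is 'rollback', 'hold' or 'ok'."""
    rollback, hold = [], []
    if canary.zero_result_rate - stable.zero_result_rate > 0.005:
        rollback.append("zero_result_rate")
    if canary.p99_ms > p99_budget_ms:
        rollback.append("p99_latency_ms")
    if canary.ctr - stable.ctr < -0.010:
        rollback.append("ctr")
    if canary.ndcg_online - stable.ndcg_online < -0.005:
        hold.append("ndcg_online")
    if rollback:
        return "rollback", rollback
    if hold:
        return "hold", hold
    return "ok", []

# The ltr-v7 / ltr-v8 readings from the guardrail log in the diagnosis section.
v7 = ArmMetrics(ctr=0.182, zero_result_rate=0.031, p99_ms=141, ndcg_online=0.612)
v8 = ArmMetrics(ctr=0.179, zero_result_rate=0.067, p99_ms=388, ndcg_online=0.628)
decision, breached = evaluate_guardrails(v7, v8)
```

With those readings the decision is a rollback on both zero-result rate and latency, even though online NDCG improved.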

5. Wire the automated rollback trigger

Encode the guardrails as alerting rules that flip the canary percentage to zero. A breach should page and act — humans are too slow to catch a fast relevance regression.

# rules/canary-guardrails.yml
groups:
  - name: relevance_canary
    rules:
      - alert: CanaryZeroResultRegression
        expr: |
          (ranking_zero_result_rate{model="ltr-v8"}
           - ranking_zero_result_rate{model="ltr-v7"}) > 0.005
        for: 5m
        labels: { severity: page, action: rollback }
        annotations:
          summary: "Canary ltr-v8 zero-result rate +{{ $value | humanizePercentage }}"
      - alert: CanaryLatencyRegression
        # 0.25 assumes ranking_latency_bucket is recorded in seconds (250 ms budget).
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(ranking_latency_bucket{model="ltr-v8"}[5m]))) > 0.25
        for: 5m
        labels: { severity: page, action: rollback }
        annotations:
          summary: "Canary ltr-v8 p99 over latency budget"

A webhook receiver on action="rollback" sets CANARY_PCT = 0 and redeploys config, so traffic drains to stable within one config-reload cycle.
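The decision logic of such a receiver can be sketched as a pure function, separate from whatever HTTP framework hosts it. This assumes the alerts arrive in the Alertmanager webhook JSON shape (a top-level `alerts` list whose entries carry `status` and `labels`) and that the split percentage lives in a config dict under a hypothetical `canary_pct` key.

```python
def should_roll_back(payload: dict) -> bool:
    """True when any firing alert in the payload is labeled action=rollback."""
    return any(
        alert.get("status") == "firing"
        and alert.get("labels", {}).get("action") == "rollback"
        for alert in payload.get("alerts", [])
    )

def apply_rollback(payload: dict, config: dict) -> dict:
    """Return a new config with the canary drained if the payload demands it."""
    if should_roll_back(payload):
        return {**config, "canary_pct": 0}
    return config
```

Resolved notifications are ignored on purpose: a guardrail that recovers should not re-enable the canary automatically, so ramping back up stays a deliberate step.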

Verification

Confirm the split sends roughly the configured fraction to the canary:

curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (model) (rate(ranking_requests_total[5m]))' \
  | jq -r '.data.result[] | "\(.metric.model) \(.value[1])"'
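# Expected (1% canary). Values are requests per second, here at roughly
# 100 req/s in total; the ratio between the arms is what matters:
#   ltr-v7  98.9
#   ltr-v8  1.07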
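Since the query returns per-arm request rates rather than percentages, a small helper can convert the response into the canary's share of traffic. This sketch assumes the standard Prometheus instant-query response shape (`data.result[]` entries with `metric` and `value`); the function name is hypothetical.

```python
def observed_canary_share(response: dict, canary_model: str = "ltr-v8") -> float:
    """Canary fraction of total request rate from an instant-query response."""
    rates = {
        r["metric"]["model"]: float(r["value"][1])  # value is [timestamp, "string"]
        for r in response["data"]["result"]
    }
    total = sum(rates.values())
    return rates.get(canary_model, 0.0) / total if total else 0.0

# Response mirroring the example figures above (timestamp is arbitrary).
example = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"model": "ltr-v7"}, "value": [0, "98.9"]},
            {"metric": {"model": "ltr-v8"}, "value": [0, "1.07"]},
        ],
    },
}
share = observed_canary_share(example)
```

A share that drifts far from the configured percentage usually points at a bucketing problem or at a population whose identities are not evenly hashed.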

Verify the guardrail comparison is populated for both arms before trusting it:

ranking_zero_result_rate{model="ltr-v8"} - ranking_zero_result_rate{model="ltr-v7"}
# => 0.0003   (well under the 0.005 rollback threshold — safe to ramp)

Force a rollback in staging by injecting bad results and assert traffic drains:

# After tripping the guardrail, the alert should fire with action=rollback
curl -s 'http://localhost:9090/api/v1/alerts' \
  | jq '.data.alerts[] | select(.labels.action=="rollback") | .state'
# Expected:  "firing"   then ranking_requests_total{model="ltr-v8"} -> 0

Common Pitfalls

Reading CTR as relevance when the result set shrank

Root cause: a model that returns fewer, safer results can show higher CTR per impression while serving worse coverage — users click more often on a tiny list but find less. Remediation: always pair CTR with zero_result_rate and result-set depth as guardrails, and prefer interleaving over raw CTR comparison, since interleaving controls for the population and the query mix that confound a naive CTR delta.

Non-sticky bucketing corrupts the experiment

Root cause: hashing per-request instead of per-user means the same session bounces between models, so clicks cannot be attributed and the user sees results re-shuffle on every keystroke in a search-as-you-type flow. Remediation: hash a stable identity (logged-in user id, else a long-lived session cookie), never the request_id, and assert the bucket function is deterministic in a unit test before rollout.

Promoting on a metric that never reached significance

Root cause: a 1% canary on a low-traffic index may take days to accumulate enough impressions for the CTR delta to clear noise; ramping on an early, lucky reading promotes a regression. Remediation: gate each ramp step on a minimum impression count and a confidence interval that excludes zero, not on elapsed time alone. Before the next iteration, refresh the model’s training judgments with production interactions captured during the canary.
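A ramp gate along those lines might look like the sketch below. It uses a normal-approximation (Wald) interval for the CTR difference and only allows a ramp when the whole interval sits above zero; the 10,000-impression floor and the function names are illustrative assumptions, and a team may prefer a different interval or a non-inferiority margin.

```python
import math

def ctr_delta_ci(clicks_s, imps_s, clicks_c, imps_c, z=1.96):
    """Approximate 95% interval for canary CTR minus stable CTR."""
    p_s, p_c = clicks_s / imps_s, clicks_c / imps_c
    se = math.sqrt(p_s * (1 - p_s) / imps_s + p_c * (1 - p_c) / imps_c)
    delta = p_c - p_s
    return delta - z * se, delta + z * se

def may_ramp(clicks_s, imps_s, clicks_c, imps_c, min_impressions=10_000):
    """Allow the next ramp step only with enough data and a clear CTR win."""
    if imps_s < min_impressions or imps_c < min_impressions:
        return False  # too little traffic to trust any delta
    low, _high = ctr_delta_ci(clicks_s, imps_s, clicks_c, imps_c)
    return low > 0
```

The impression floor is checked first so that a lucky early reading on a few hundred impressions can never pass, regardless of how wide the apparent gap is.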