Query-Time Boosting Strategies

Query-time boosting is where business intent meets the retrieval score. A clean BM25 Tuning & Weights baseline answers “how well does this document match the words,” but it cannot answer “should fresh inventory outrank a stale exact match” or “should sponsored results sit above organic ones.” That gap is closed at query time by layering deterministic boosts, decay functions, and demotions on top of the lexical score. This guide, part of Ranking Algorithms & Relevance Tuning, resolves a recurring engineering decision: when to inject relevance signals as hand-tuned query-time boosts versus pushing them into a learned model, and how to bound the cost and non-determinism that aggressive boosting introduces.

The decision is rarely binary. Most production systems run hand-tuned boosts for high-confidence business rules (recency, geo, stock status, editorial pins) and reserve learned models for the sparse tail of interactions they cannot enumerate by hand. The mechanics below are written against Elasticsearch and OpenSearch syntax, but the principles — multiplicative versus additive composition, saturation, and rescore windows — transfer directly to Solr, Vespa, and Typesense.

There are four levers worth naming up front, because nearly every boosting decision is a choice among them. First, where the signal enters: a clause-level boost, a function_score function, or a rescore pass. Second, how it composes: multiplicatively (scaling relevance) or additively (offsetting it). Third, how much it is allowed to move a result: a bounded multiplier versus an open-ended one that can dominate the score. Fourth, whether it is a constraint or a preference: a pinned query and a hard negative_boost express rules that must hold, while a decay function expresses a soft preference that relevance can override. Getting the first two right prevents the majority of production incidents; getting the last two right is what separates a search that feels intentional from one that feels arbitrary.

Prerequisites

Elasticsearch 8.x or OpenSearch 2.x reachable at localhost:9200, with a populated index you can issue real queries against.
A baseline multi_match or bool query that already returns relevant results ordered by BM25 — boosting on top of a broken baseline only hides the breakage.
Numeric, date, and geo fields you intend to boost on (e.g. published_at, a geo_point location, popularity, in_stock) are mapped and indexed; function_score cannot read a field that is not in the mapping.
explain: true available in your client so you can read per-clause score contributions during tuning.
A held-out query set with judged results (or click logs) so every boost change is evaluated against nDCG@10 / MRR rather than eyeballed.

Concept Deep-Dive

A query-time boost transforms the raw relevance score S_bm25 of each hit into a final score S_final before the result list is sorted. The two composition modes matter more than any single function:

Multiplicative (function_score with score_mode/boost_mode: "multiply"): the boost scales the relevance score. A document that matches poorly stays low even with a strong recency boost, because low * 1.5 is still low. This preserves the integrity of the lexical match and is the safe default for recency and geo.
Additive (boost_mode: "sum", or bool/function_score score_mode: "sum"): the boost adds a flat term. This can promote a near-zero-relevance document purely on freshness, which is usually a bug, not a feature. Reserve additive composition for cases where the boost itself is the ranking signal (e.g. a curated popularity prior).
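
The difference between the two modes is easiest to see with two documents. The sketch below uses invented scores and a hypothetical function weight of 10 on the recency boost; it is plain arithmetic, not a call into any search engine.

```python
# Hypothetical weight applied to the recency function's output.
RECENCY_WEIGHT = 10.0

def final_score(bm25, boost, boost_mode):
    """Compose a relevance score with a boost, multiplicatively or additively."""
    if boost_mode == "multiply":
        return bm25 * boost
    if boost_mode == "sum":
        return bm25 + boost
    raise ValueError(f"unsupported boost_mode: {boost_mode}")

# Invented documents: a strong but old match and a weak but fresh one.
strong_old = {"bm25": 8.0, "recency": 0.2}
weak_fresh = {"bm25": 0.5, "recency": 1.0}

for mode in ("multiply", "sum"):
    old = final_score(strong_old["bm25"], strong_old["recency"] * RECENCY_WEIGHT, mode)
    fresh = final_score(weak_fresh["bm25"], weak_fresh["recency"] * RECENCY_WEIGHT, mode)
    print(mode, "strong_old:", old, "weak_fresh:", fresh)
```

Under multiply the strong match stays ahead; under sum the weak but fresh document overtakes it, which is the failure the additive bullet warns about.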

The worked example below shows the multiplicative path: BM25 produces a base score per hit, each function_score function emits a multiplier (a field_value_factor on popularity, a gauss decay on published_at), the multipliers combine, and the list re-sorts. Note how document D, a strong lexical match but old, is overtaken by B, a slightly weaker match that is recent and popular — exactly the reorder a recency boost is meant to produce, without burying D entirely.
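
A minimal sketch of that reorder, with four hypothetical documents A to D. The base scores and multipliers are invented for illustration and do not come from a real index; in a live query the multipliers would be produced by field_value_factor and gauss.

```python
# (doc id, BM25 base score, popularity multiplier, recency multiplier)
# All values are invented for illustration.
HITS = [
    ("A", 9.0, 1.10, 0.90),
    ("B", 7.5, 1.30, 1.00),  # slightly weaker match, recent and popular
    ("C", 1.0, 1.40, 1.00),  # barely matches, fresh and popular
    ("D", 8.0, 1.05, 0.55),  # strong lexical match, but old
]

def multiplicative_ranking(hits):
    """Multiply the function outputs together, then multiply into BM25."""
    scored = [(bm25 * pop * recency, doc_id) for doc_id, bm25, pop, recency in hits]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]

baseline = [h[0] for h in sorted(HITS, key=lambda h: h[1], reverse=True)]
boosted = multiplicative_ranking(HITS)
print("baseline:", baseline)
print("boosted: ", boosted)
```

With these numbers B moves ahead of D, D still ranks above C, and C stays at the bottom despite its strong boosts because its base score is too low for a multiplier to rescue.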

The decay functions (gauss, linear, exp) share an interface of origin, scale, offset, and decay. origin is the ideal value (now, or the user’s location), offset is a flat zone around the origin where no penalty applies, and scale is the distance beyond that flat zone at which the multiplier drops to decay (default 0.5). The curve shape differs: gauss is smooth and forgiving near the origin then falls off, linear is constant-slope, and exp punishes deviation hardest near the origin. Recency and geo are the canonical uses; the narrow recency case is covered in depth in boosting recent documents by recency.
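
The three curves can be compared numerically. The sketch below follows the decay formulas as documented for Elasticsearch function_score, to the best of my reading; it works on plain numbers (document age in days, with origin 0) rather than date strings, and the function names are my own.

```python
import math

def _distance(value, origin, offset):
    """Distance past the flat offset zone; zero inside it."""
    return max(0.0, abs(value - origin) - offset)

def gauss_decay(value, origin, scale, offset=0.0, decay=0.5):
    sigma_sq = -(scale ** 2) / (2.0 * math.log(decay))
    d = _distance(value, origin, offset)
    return math.exp(-(d ** 2) / (2.0 * sigma_sq))

def exp_decay(value, origin, scale, offset=0.0, decay=0.5):
    lam = math.log(decay) / scale
    return math.exp(lam * _distance(value, origin, offset))

def linear_decay(value, origin, scale, offset=0.0, decay=0.5):
    s = scale / (1.0 - decay)  # distance at which the line reaches zero
    return max(0.0, (s - _distance(value, origin, offset)) / s)

# Age in days, scale 30, offset 7: all three hit `decay` at age 37.
for age in (5, 10, 37, 100):
    print(age,
          round(gauss_decay(age, 0, 30, offset=7), 4),
          round(linear_decay(age, 0, 30, offset=7), 4),
          round(exp_decay(age, 0, 30, offset=7), 4))
```

Just past the offset, gauss gives the highest multiplier and exp the lowest. All three agree at offset + scale, and only linear clips to exactly zero.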

Choosing among the three curves is a product decision, not a mathematical one. Use gauss when there is a comfortable middle zone — most recency and geo problems, where “a few days old” or “a few kilometers away” should barely matter and the penalty should accelerate only as distance grows. Use exp when proximity to the origin is everything and you want to punish even small deviations hard, for example a flash-sale feed where anything older than hours is nearly worthless. Use linear when you want a predictable, explainable slope that stakeholders can reason about without a calculus refresher; its constant gradient makes “every extra week costs X” trivial to communicate, at the cost of a kink at the point where it clips to zero. In practice gauss is the default for 80% of decay boosts because its forgiving shoulder near the origin is exactly what keeps near-fresh documents from being penalized against each other.

The contrast between hand-tuned boosts and learned models is the strategic axis of this entire topic. Hand-tuned boosts are transparent, debuggable, and ship in an afternoon: every multiplier is visible in an explain payload, a product manager can read the query JSON, and a regression is traced to a single function. Their weakness is combinatorial: each new signal interacts non-linearly with every existing one, and beyond a handful of functions nobody can predict what changing scale from 30d to 45d does to the geo boost three clauses down. Learned models invert this: a gradient-boosted ranker trained on click data via Learning to Rank (LTR) discovers feature interactions you would never hand-encode and adapts as behavior shifts, but it is opaque, needs judged or logged training data, and fails in ways that are far harder to attribute. The mature pattern is layered: hand-tuned boosts express the non-negotiable business rules (an out-of-stock item must never outrank an in-stock one; a paid placement must sit above organic) as hard, auditable constraints, while a learned model orders everything inside those constraints.

Boosting is also where determinism quietly breaks. origin: "now" changes every millisecond, score ties resolve by internal doc id, which can differ between copies of the same shard, and floating-point multiplication is not associative, so the order in which multipliers combine can change the last bits of a score. Any system that asserts on exact ordering must therefore round time origins, pin an explicit tie-breaker, and avoid chains of unbounded multipliers.
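
One way to apply the first two safeguards is to round the time origin on the client and add a secondary sort key. This is a sketch under assumptions: the sku tie-breaker field is hypothetical and must be a unique keyword field with doc values in your mapping, and the one-hour bucket is an arbitrary choice.

```python
from datetime import datetime, timezone

def rounded_origin(now, granularity_minutes=60):
    """Floor `now` to a fixed bucket so repeated queries share one origin."""
    bucket = granularity_minutes * 60
    floored = int(now.timestamp()) // bucket * bucket
    return datetime.fromtimestamp(floored, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

def deterministic_query(text, now, tie_breaker_field="sku"):
    """Build a function_score query with a stable origin and an explicit tie-breaker."""
    return {
        "query": {
            "function_score": {
                "query": {"match": {"title": text}},
                "functions": [
                    {"gauss": {"published_at": {
                        "origin": rounded_origin(now),
                        "scale": "30d", "offset": "7d", "decay": 0.5}}}
                ],
                "boost_mode": "multiply",
            }
        },
        # Equal scores fall through to the hypothetical unique keyword field.
        "sort": ["_score", {tie_breaker_field: "asc"}],
    }

print(rounded_origin(datetime(2024, 5, 1, 13, 47, 12, tzinfo=timezone.utc)))
```

Two requests issued within the same hour now produce byte-identical query bodies, which also makes them friendlier to request caching.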

Step-by-Step Implementation

1. Establish the baseline query

Start from a plain multi_match. Confirm it returns sensible ordering before adding any boost. Everything that follows wraps this query.

{
  "query": {
    "multi_match": {
      "query": "wireless headphones",
      "fields": ["title^3", "description^1", "brand^1.5"],
      "type": "best_fields",
      "tie_breaker": 0.3
    }
  }
}

Verify: run it and capture the top scores as your reference list.

curl -s "localhost:9200/products/_search?size=5" \
  -H 'Content-Type: application/json' \
  -d @baseline_query.json | jq '.hits.hits[] | {id: ._id, score: ._score}'

2. Wrap in function_score with a recency decay and a popularity factor

function_score multiplies the baseline by each function’s output. field_value_factor reads a numeric field; modifier: "log1p" and factor keep popular documents from dominating. The gauss decay over published_at supplies freshness. boost_mode: "multiply" composes them with the BM25 score.

{
  "query": {
    "function_score": {
      "query": { "multi_match": { "query": "wireless headphones",
        "fields": ["title^3", "description^1", "brand^1.5"], "tie_breaker": 0.3 } },
      "functions": [
        { "field_value_factor": { "field": "popularity", "modifier": "log1p",
          "factor": 1.2, "missing": 1 } },
        { "gauss": { "published_at": { "origin": "now", "scale": "30d",
          "offset": "7d", "decay": 0.5 } } }
      ],
      "score_mode": "multiply",
      "boost_mode": "multiply"
    }
  }
}

Verify: diff the new order against the baseline; recent/popular hits should rise without near-misses vanishing.

curl -s "localhost:9200/products/_search?size=5&explain=true" \
  -H 'Content-Type: application/json' -d @boosted_query.json \
  | jq '.hits.hits[] | {id: ._id, score: ._score}'
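
To make the diff concrete, the id lists captured from the two curl commands can be compared with a small helper. The function names and the sample ids below are hypothetical.

```python
def rank_changes(baseline_ids, boosted_ids):
    """Map each boosted id to (old_rank, new_rank); old_rank is None for new entrants."""
    old_rank = {doc_id: pos for pos, doc_id in enumerate(baseline_ids, start=1)}
    return {
        doc_id: (old_rank.get(doc_id), new_pos)
        for new_pos, doc_id in enumerate(boosted_ids, start=1)
    }

def dropped(baseline_ids, boosted_ids):
    """Ids that were in the baseline page but vanished after boosting."""
    present = set(boosted_ids)
    return [doc_id for doc_id in baseline_ids if doc_id not in present]

# Invented id lists standing in for the two jq outputs.
baseline = ["a", "b", "c", "d"]
boosted = ["c", "a", "e", "b"]
print(rank_changes(baseline, boosted))
print("dropped:", dropped(baseline, boosted))
```

A long dropped list is the signal to look at first: it means the boost is replacing the baseline page rather than reordering it.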

3. Add clause-level boosts and dis_max for OR-style matching

A boost on a clause multiplies that clause’s score, and a bool query then sums the scores of its matching clauses. dis_max (or multi_match with type best_fields and a tie_breaker) instead takes the score of the single best-matching clause and adds only tie_breaker times the scores of the others, which prevents a document that matches several weak fields from outranking one strong exact match.

{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "title": { "query": "noise cancelling", "boost": 2.0 } } },
        { "match": { "description": { "query": "noise cancelling", "boost": 1.0 } } }
      ],
      "tie_breaker": 0.3
    }
  }
}

Verify: confirm a title-only match outranks a description-only match of the same term.
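
The scoring difference between summing clauses and dis_max can be sketched in a few lines. The per-clause scores are invented; the dis_max formula is the best clause score plus tie_breaker times the remaining clause scores.

```python
def bool_should_score(clause_scores):
    """A bool query sums the scores of its matching should clauses."""
    return sum(clause_scores)

def dis_max_score(clause_scores, tie_breaker=0.0):
    """Best clause score plus tie_breaker times the sum of the other clause scores."""
    if not clause_scores:
        return 0.0
    best = max(clause_scores)
    return best + tie_breaker * (sum(clause_scores) - best)

# Invented per-clause scores for two documents.
one_strong_match = [6.0]               # exact hit in a single field
several_weak_matches = [2.5, 2.5, 2.5]  # partial hits in three fields

print("bool:   ", bool_should_score(one_strong_match), bool_should_score(several_weak_matches))
print("dis_max:", dis_max_score(one_strong_match, 0.3), dis_max_score(several_weak_matches, 0.3))
```

Summation ranks the document with three weak matches first, while dis_max with a tie_breaker of 0.3 keeps the single strong match on top.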

4. Pin business rules and demote with negative boosts

Editorial pinning is a hard override: a pinned query (Elasticsearch) forces specific ids to the top regardless of score. Demotion uses boosting with a negative_boost between 0 and 1 to push down — never fully remove — documents matching an undesirable signal (e.g. out-of-stock).

{
  "query": {
    "pinned": {
      "ids": ["sku-9001", "sku-9002"],
      "organic": {
        "boosting": {
          "positive": { "match": { "title": "wireless headphones" } },
          "negative": { "term": { "in_stock": false } },
          "negative_boost": 0.4
        }
      }
    }
  }
}

Verify: the pinned ids appear first; out-of-stock items sink but remain present.
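
The combined behavior can be modeled as a demotion step followed by a pin step. This is a simplified simulation with invented scores; it assumes every pinned id exists in the index and ignores filters.

```python
def boosting_score(positive_score, matches_negative, negative_boost):
    """Score of a boosting query: demote, never exclude."""
    if not 0.0 <= negative_boost <= 1.0:
        raise ValueError("negative_boost must be between 0 and 1")
    return positive_score * negative_boost if matches_negative else positive_score

def pinned_then_organic(pinned_ids, organic_scores):
    """Pinned ids first, in the order given, then organic hits by descending score."""
    organic = sorted(organic_scores.items(), key=lambda kv: kv[1], reverse=True)
    pinned = list(pinned_ids)
    return pinned + [doc_id for doc_id, _ in organic if doc_id not in pinned]

# Invented organic hits: (positive score, in_stock).
hits = {"sku-1": (5.0, True), "sku-2": (6.0, False), "sku-9001": (1.0, True)}
organic = {
    doc_id: boosting_score(score, matches_negative=not in_stock, negative_boost=0.4)
    for doc_id, (score, in_stock) in hits.items()
}
print(pinned_then_organic(["sku-9001", "sku-9002"], organic))
```

The out-of-stock sku-2 had the highest positive score but lands below sku-1 after demotion, and both pinned ids lead the list regardless of their scores.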

5. Bound cost with a rescore window

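Decay functions and field_value_factor are cheap per document, but applying them across millions of candidates is wasteful when only the head matters. rescore runs the boosting on only the top window_size candidates per shard, leaving the cheap BM25 to select them. Set the rescore score_mode to "multiply" to keep the composition multiplicative: the default, "total", adds the weighted rescore score to the weighted query score, which turns the decay into an additive term. Under "multiply", query_weight and rescore_query_weight only rescale the final score and do not change the order within the window.

{
  "query": { "multi_match": { "query": "wireless headphones",
    "fields": ["title^3", "description"] } },
  "rescore": {
    "window_size": 200,
    "query": {
      "score_mode": "multiply",
      "rescore_query": { "function_score": {
        "functions": [ { "gauss": { "published_at": { "origin": "now",
          "scale": "30d", "decay": 0.5 } } } ], "boost_mode": "multiply" } },
      "query_weight": 0.7,
      "rescore_query_weight": 1.3
    }
  }
}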

Verify: latency drops while the top 10 ordering matches, or nearly matches, the full function_score result; a hit that BM25 ranks outside the window on a shard can no longer be promoted.

curl -s "localhost:9200/products/_search" -H 'Content-Type: application/json' \
  -d @rescore_query.json | jq '.took, (.hits.hits[0:10] | map(._id))'
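
How the rescore phase blends the two scores depends on its score_mode, which defaults to total (a weighted sum). The simulation below models a single shard with invented scores and returns only the rescored window; how hits outside the window are merged back is deliberately left out.

```python
def rescore_window(first_pass, second_pass, window_size,
                   query_weight=1.0, rescore_query_weight=1.0, score_mode="total"):
    """Re-rank the top `window_size` first-pass hits of one shard.

    first_pass: list of (doc_id, score); second_pass: dict of doc_id -> rescore score.
    """
    head = sorted(first_pass, key=lambda h: h[1], reverse=True)[:window_size]
    rescored = []
    for doc_id, score in head:
        primary = score * query_weight
        secondary = second_pass[doc_id] * rescore_query_weight
        if score_mode == "total":
            combined = primary + secondary
        elif score_mode == "multiply":
            combined = primary * secondary
        else:
            raise ValueError(f"unsupported score_mode: {score_mode}")
        rescored.append((doc_id, combined))
    rescored.sort(key=lambda h: h[1], reverse=True)
    return [doc_id for doc_id, _ in rescored]

# Invented BM25 scores and recency multipliers.
bm25 = [("old-strong", 10.0), ("fresh-mid", 6.0), ("fresh-weak", 1.0)]
recency = {"old-strong": 0.2, "fresh-mid": 1.0, "fresh-weak": 1.0}

for mode in ("total", "multiply"):
    print(mode, rescore_window(bm25, recency, 2, 0.7, 1.3, mode))
```

With total, a decay value between 0 and 1 barely moves a BM25 score near 10, so the old document stays first; with multiply the fresh document takes the lead. In both cases fresh-weak never enters the window.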

Configuration Reference

Name	Default	Type	Effect
`boost_mode`	`multiply`	enum (`multiply`, `replace`, `sum`, `avg`, `max`, `min`)	How `function_score` combines the functions’ result with the query score. `multiply` preserves match integrity; `sum` can promote weak matches.
`score_mode`	`multiply`	enum (`multiply`, `sum`, `avg`, `first`, `max`, `min`)	How multiple functions combine with each other before `boost_mode` is applied.
`decay`	`0.5`	float (0–1)	The multiplier value reached at a distance of `offset` + `scale` from the origin. Lower = steeper falloff.
`offset`	`0`	string/number	Flat zone around `origin` where the decay multiplier stays `1.0`. Keeps near-fresh docs unpenalized.
`factor`	`1`	float	Multiplier applied inside `field_value_factor` before the `modifier`. Amplifies or dampens the field’s influence.
`modifier`	`none`	enum (`none`, `log1p`, `sqrt`, `ln1p`, `reciprocal`, …)	Shapes the `field_value_factor` curve; `log1p`/`sqrt` tame heavy-tailed fields like view counts.
`negative_boost`	(required)	float (0–1)	Multiplier applied to documents matching the `negative` clause in a `boosting` query. Demotes without excluding.
`window_size`	`10`	integer	Per-shard candidate count the `rescore` phase re-ranks. Larger = more thorough, more CPU.
`query_weight`	`1.0`	float	Weight of the original score when blended with the rescore score.
`min_score`	(unset)	float	Drops hits below this final score. Use with care — combined with boosting it silently changes recall.

Failure Modes & Debugging

Symptom: irrelevant fresh documents flood the top of results

Root cause: the recency boost is composed additively (boost_mode: "sum"), or a heavily weighted function is allowed to outweigh lexical relevance, for example a field_value_factor with no modifier or factor ceiling. A document that barely matches the query then gets promoted purely for being new or popular.

Remediation: switch to multiplicative composition and inspect a flooded hit’s score breakdown to confirm the decay term is dominating.

curl -s "localhost:9200/products/_explain/sku-bad" -H 'Content-Type: application/json' \
  -d @boosted_query.json | jq '.explanation.details'

Set boost_mode: "multiply" and bound the factor with modifier: "log1p".
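
When the jq output is too deep to read, the explain tree can be flattened into its leaf contributions. The sketch assumes the nested value, description, details shape of an explain response; the sample tree and its descriptions are invented.

```python
def leaf_contributions(explanation, path=()):
    """Yield (path_of_descriptions, value) for every leaf in an explain tree."""
    here = path + (explanation.get("description", ""),)
    details = explanation.get("details") or []
    if not details:
        yield here, explanation.get("value")
        return
    for child in details:
        yield from leaf_contributions(child, here)

# Invented explain payload for illustration.
SAMPLE = {
    "value": 4.2,
    "description": "function score, product of:",
    "details": [
        {"value": 6.0, "description": "weight(title:headphones)", "details": []},
        {"value": 0.7, "description": "gauss decay on published_at", "details": []},
    ],
}

for path, value in leaf_contributions(SAMPLE):
    print(value, " <- ", " / ".join(path))
```

Sorting the leaves by value makes it obvious when one boost term, rather than the lexical match, is carrying the score.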

Symptom: scores blow up to huge or NaN values, ordering looks random

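Root cause: field_value_factor with a logarithmic modifier (log, ln) on a field containing zero or negative values (e.g. ln of 0), modifier: "none" on an unbounded field, or chained multiplicative functions producing unbounded products. Elasticsearch 7 and later reject negative or NaN function scores with a query error instead of returning them, so the same root cause can also show up as failed searches. Random-looking order also arises when equal scores are broken by internal doc id, which is not stable across shard copies.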

Remediation: clamp inputs and add a tie-breaker. Verify no negative/zero values reach an unguarded modifier.

curl -s "localhost:9200/products/_search" -H 'Content-Type: application/json' -d '{
  "size": 0, "aggs": { "min_pop": { "min": { "field": "popularity" } } } }' | jq '.aggregations'

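Set "missing": 1 and modifier: "log1p", and add a deterministic sort tie-breaker after _score. Prefer a unique keyword field with doc values (for example a copy of the document id) over _id itself, since sorting on _id relies on fielddata that Elasticsearch 8.x disables by default.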
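
Which inputs are safe for which modifier can be checked offline before a query ships. The sketch below mirrors a subset of the field_value_factor modifiers as I understand them (factor applied before the modifier, log meaning base 10) and returns None wherever the result would be undefined, infinite, or negative.

```python
import math

MODIFIERS = {
    "none": lambda v: v,
    "log": lambda v: math.log10(v),
    "log1p": lambda v: math.log10(v + 1.0),
    "ln": lambda v: math.log(v),
    "ln1p": lambda v: math.log1p(v),
    "sqrt": lambda v: math.sqrt(v),
    "reciprocal": lambda v: 1.0 / v,
}

def field_value_factor(value, factor=1.0, modifier="none", missing=None):
    """Return the multiplier, or None if the input is unsafe for this modifier."""
    if value is None:
        if missing is None:
            raise ValueError("field has no value and no `missing` default")
        value = missing
    try:
        result = MODIFIERS[modifier](value * factor)
    except (ValueError, ZeroDivisionError):
        return None  # undefined, e.g. log of 0 or 1/0
    if math.isnan(result) or math.isinf(result) or result < 0:
        return None
    return result

for value in (-5, 0, 1, 1000):
    print(value, {name: field_value_factor(value, modifier=name) for name in ("none", "ln", "log1p")})
```

One caveat the output makes visible: log1p of a zero popularity is 0, which under multiplicative composition zeroes the whole score, so a floor on the field value may still be needed.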

Symptom: boosting works in staging but p99 latency spikes in production

Root cause: function_score runs over every matching candidate, not just the head. On a high-recall query matching millions of documents, the decay math is evaluated millions of times per shard.

Remediation: move the expensive functions into a rescore block with a bounded window_size, leaving cheap BM25 to do candidate selection.

curl -s "localhost:9200/products/_search?size=10" -H 'Content-Type: application/json' \
  -d @rescore_query.json | jq '.took'

Tune window_size down until took is acceptable and top-10 ordering still matches the full-scan result.

Symptom: pinned/editorial results disappear after a relevance model change

Root cause: pinning was implemented as a large additive boost rather than a pinned query, so a sufficiently high organic score now overtakes the intended pin — boost arithmetic is fragile across baseline shifts.

Remediation: replace additive pin hacks with the pinned query, which guarantees position independent of score, and keep demotion in a separate boosting clause.

A subtle but important distinction sits inside function_score: score_mode controls how the individual functions combine with each other, while boost_mode controls how that combined function result merges with the query’s relevance score. They are tuned independently. A common production shape is score_mode: "sum" across several field_value_factor functions (so a popularity signal and a quality signal accumulate) combined with boost_mode: "multiply" against BM25 (so that accumulated prior still cannot rescue an irrelevant match). Reaching for score_mode: "max" is the right move when several decay functions overlap and you want the single strongest signal — say, the most relevant of “near the user” or “recently updated” — rather than a product that double-penalizes a document that is merely average on both axes.
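
The two-stage composition can be written out directly. This is a simplified model that assumes default function weights, so avg is a plain mean; the input values are invented.

```python
import math

def combine_functions(values, score_mode):
    """Stage 1: combine the function outputs with each other."""
    modes = {
        "multiply": math.prod,
        "sum": sum,
        "avg": lambda v: sum(v) / len(v),  # plain mean; assumes default weights
        "first": lambda v: v[0],
        "max": max,
        "min": min,
    }
    return modes[score_mode](values)

def apply_boost_mode(query_score, function_result, boost_mode):
    """Stage 2: merge the combined function result with the query score."""
    modes = {
        "multiply": lambda q, f: q * f,
        "replace": lambda q, f: f,
        "sum": lambda q, f: q + f,
        "avg": lambda q, f: (q + f) / 2.0,
        "max": max,
        "min": min,
    }
    return modes[boost_mode](query_score, function_result)

def function_score(query_score, values, score_mode="multiply", boost_mode="multiply"):
    return apply_boost_mode(query_score, combine_functions(values, score_mode), boost_mode)

# Popularity 1.4 and quality 1.2 accumulate, then scale BM25.
print(function_score(0.1, [1.4, 1.2], "sum", "multiply"))
print(function_score(5.0, [1.4, 1.2], "sum", "multiply"))
# Two overlapping decays: strongest signal versus their product.
print(combine_functions([0.6, 0.7], "max"), combine_functions([0.6, 0.7], "multiply"))
```

The accumulated prior of 2.6 lifts the relevant document far more than the irrelevant one, and max keeps 0.7 where the product would drop to about 0.42.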

Performance & Scale Notes

Hand-tuned boosts are effectively free per document (a gauss decay and a field_value_factor add single-digit microseconds each), but cost scales with candidate count, not result count. On a corpus of ~10M documents, a high-recall function_score query matching ~2M candidates routinely adds 40–120ms to p99 versus the same query under rescore with window_size: 200, which keeps the boost cost flat at roughly 200 × shard_count evaluations. Set window_size to 2–5× your page size and measure with "profile": true in the request body; the profile output breaks query time down per query component, separate from fetch.
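
The evaluation counts behind that comparison are simple arithmetic. The shard count of 5 and the even spread of matches across shards are assumptions for illustration.

```python
def function_score_evaluations(matching_candidates):
    """A top-level function_score evaluates its functions once per matching document."""
    return matching_candidates

def rescore_evaluations(window_size, shard_count, matches_per_shard):
    """Rescore evaluates at most window_size documents on each shard."""
    return shard_count * min(window_size, matches_per_shard)

candidates = 2_000_000
shards = 5  # hypothetical shard count
full = function_score_evaluations(candidates)
windowed = rescore_evaluations(200, shards, candidates // shards)
print(full, windowed, full // windowed)
```

The windowed count stays fixed as the corpus grows, which is why the rescore cost is flat while the full function_score cost tracks recall.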

Decay scale is the highest-leverage knob: a scale that is too small pushes the multiplier below decay and toward zero for almost every document, flattening relevance into pure recency. Validate every boost change against nDCG@10 on a judged query set rather than spot checks, because boosting is the single most common source of silent relevance regression. When the number of interacting signals exceeds what you can reason about (typically 4–6 hand-tuned functions), the maintenance cost crosses over and a learned approach via Learning to Rank (LTR) becomes cheaper to operate than the boost stack. Until then, encode signals as reusable Custom Scoring Functions so the same recency and geo logic is shared across query templates.
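
A minimal evaluation helper for that check, assuming graded judgments where a higher number means more relevant. It uses the linear-gain form of DCG and, as a simplification, derives the ideal ordering from the judgments of the returned hits only.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain with linear gain and log2 position discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the same judgments in ideal order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def reciprocal_rank(relevances):
    """1 / rank of the first relevant hit, or 0.0 if there is none."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

# Invented judgments for the top hits of one query, before and after a boost change.
before = [3, 2, 1, 0]
after = [0, 3, 2, 1]
print(ndcg_at_k(before), ndcg_at_k(after))
print(reciprocal_rank(before), reciprocal_rank(after))
```

Averaging these per-query values over the whole judged set, before and after a change, gives the number to gate a boost rollout on.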

BM25 Tuning & Weights — the lexical baseline that every query-time boost multiplies against.
Custom Scoring Functions — packaging recency, geo, and popularity logic as reusable scoring scripts.
Learning to Rank (LTR) — when hand-tuned boosts hit their maintenance ceiling and a model should take over.
Boosting recent documents by recency — the focused recipe for Gaussian decay over a date field.
Ranking Algorithms & Relevance Tuning — the broader relevance architecture this topic sits within.