Boosting Recent Documents by Recency

In time-sensitive corpora — news, listings, support tickets, release notes — a perfect lexical match from three years ago is often the wrong answer, yet a naive recency sort destroys relevance entirely by surfacing fresh-but-irrelevant noise. The precise decision this guide resolves: apply a function_score Gaussian decay over a date field and multiply it into the BM25 score so freshness reorders the head of the result list without burying old-but-highly-relevant documents. This is the focused recipe behind the broader query-time boosting strategies area, which itself sits inside Ranking Algorithms & Relevance Tuning. The whole game is in tuning origin, scale, offset, and decay so the curve is gentle enough that a strong textual match from last year still beats a weak match from this morning.

Prerequisites

Elasticsearch 8.x or OpenSearch 2.x at localhost:9200 with a working BM25 query.
A date-mapped field on every document (e.g. published_at); a decay function returns 1 for a document that lacks the field, so undated documents score as fully fresh unless you backfill the field or handle them explicitly.
A judged or click-logged query set to measure nDCG@10 before and after the boost.

Diagnosis / Context

The failure shows up two ways. Without any recency signal, a query for an evergreen term returns the oldest authoritative document first, and users complain the results feel stale. The over-correction — sorting by date or using an additive recency boost — looks like this in an explain payload, where a barely-matching fresh document outscores a strong match:

{
  "_id": "fresh-noise",
  "_score": 8.9,
  "_explanation": {
    "value": 8.9,
    "description": "sum of: bm25 (0.4) + recency_boost (8.5)",
    "details": [
      { "value": 0.4, "description": "weight(title:term)" },
      { "value": 8.5, "description": "additive recency term — dominates the match" }
    ]
  }
}

The recency term (8.5) swamps the lexical score (0.4). The fix is multiplicative composition: a fresh document gets a multiplier near 1.0, a document one scale past the offset gets exactly decay, and older documents fall further toward zero, but the BM25 score still decides ordering among documents of similar age. A weak match scaled by a high freshness multiplier is still a weak match (0.4 * 1.0 = 0.4, well below 9.0 * 0.6 = 5.4), so the strong older document wins, which is exactly the behavior you want.
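The difference between the two compositions can be sketched with two made-up documents. The scores, the freshness values, and the additive weight of 8.5 are illustrative assumptions chosen to echo the payload above; they are not output from a real cluster.

```python
# Hypothetical scores: a solid older match versus a barely-matching fresh one.
docs = [
    {"id": "strong-old", "bm25": 6.0, "freshness": 0.25},
    {"id": "fresh-noise", "bm25": 0.4, "freshness": 1.0},
]

RECENCY_WEIGHT = 8.5  # assumed additive weight, echoing the 8.5 in the payload


def additive(doc):
    """Recency added to BM25: the recency term can outweigh the match."""
    return doc["bm25"] + RECENCY_WEIGHT * doc["freshness"]


def multiplicative(doc):
    """Recency multiplied into BM25: freshness only rescales the match."""
    return doc["bm25"] * doc["freshness"]


def ranked_ids(documents, scorer):
    """Return document ids ordered from highest to lowest score."""
    return [d["id"] for d in sorted(documents, key=scorer, reverse=True)]


print(ranked_ids(docs, additive))        # fresh-noise first (8.9 vs 8.125)
print(ranked_ids(docs, multiplicative))  # strong-old first (1.5 vs 0.4)
```

Under addition the fresh document wins on the recency term alone; under multiplication its weak match caps how far freshness can lift it.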

Why Gaussian rather than the linear or exponential decay variants? The gauss curve is flat near the origin and only begins to fall off as the age approaches scale, which gives it a forgiving shoulder: documents that are all “recent enough” are treated as roughly equivalent, so freshness does not micro-shuffle this week’s results against each other. exp decays hardest right at the origin, which makes it overreact to small age differences and is rarely what you want for general recency. linear is occasionally useful when stakeholders need an explainable “every week costs a fixed amount of score,” but its constant slope penalizes near-fresh documents more aggressively than gauss and clips abruptly to zero past its range. For the vast majority of recency boosts, gauss is the correct default precisely because its shape matches how users actually perceive freshness — a wide band of “current,” then a gradual fade.
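The shape argument can be checked numerically. The sketch below assumes the decay formulas as given in the function_score documentation, with ages expressed in days; the function names are hypothetical helpers, not part of any client library.

```python
import math


def _excess(age_days, offset):
    """Distance from origin that lies beyond the offset grace window."""
    return max(0.0, abs(age_days) - offset)


def gauss(age_days, scale, offset=0.0, decay=0.5):
    # sigma^2 is chosen so the curve equals `decay` at distance `scale`.
    sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
    return math.exp(-_excess(age_days, offset) ** 2 / (2.0 * sigma_sq))


def exp_decay(age_days, scale, offset=0.0, decay=0.5):
    lam = math.log(decay) / scale
    return math.exp(lam * _excess(age_days, offset))


def linear(age_days, scale, offset=0.0, decay=0.5):
    # The line hits `decay` at `scale` and zero at scale / (1 - decay).
    s = scale / (1.0 - decay)
    return max((s - _excess(age_days, offset)) / s, 0.0)


for age in (1, 3, 7, 15, 30, 60, 90):
    print(f"{age:>3}d  gauss={gauss(age, 30):.3f}  "
          f"exp={exp_decay(age, 30):.3f}  linear={linear(age, 30):.3f}")
```

All three curves meet at decay when the age equals scale; the difference is in how they get there. In the first few days gauss stays closest to 1.0 and exp drops the most, which is the "forgiving shoulder" described above.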

It also matters that recency is almost never the only signal. A bare decay on published_at will, on its own, treat a fresh listicle and a fresh authoritative reference as equal. In production you typically multiply the recency decay against a popularity or quality field_value_factor as well, so that the head of the result list is “relevant AND fresh AND trusted” rather than merely “fresh.” Keeping each of those as a separate multiplicative function — rather than folding them into one additive term — is what preserves the property that any single weak axis drags the whole score down, which is the behavior that keeps junk out of the top results.
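A minimal sketch of such a multi-signal query, built as a Python dict so it can be serialized into a request body. The quality_score field, the ln2p modifier, and the missing value are assumptions for illustration; substitute whatever popularity or quality signal your index actually carries.

```python
import json

# Recency: the same Gaussian decay used elsewhere in this guide.
recency = {
    "gauss": {
        "published_at": {
            "origin": "now",
            "scale": "30d",
            "offset": "7d",
            "decay": 0.5,
        }
    }
}

# Hypothetical quality signal; field name and modifier are assumptions.
quality = {
    "field_value_factor": {
        "field": "quality_score",
        "modifier": "ln2p",  # ln(value + 2), stays positive for value >= 0
        "missing": 0,        # value used when a document lacks the field
    }
}

query = {
    "query": {
        "function_score": {
            "query": {"match": {"title": "kubernetes networking"}},
            "functions": [recency, quality],
            "score_mode": "multiply",  # combine the two functions by product
            "boost_mode": "multiply",  # then multiply the product into BM25
        }
    }
}

print(json.dumps(query, indent=2))
```

score_mode controls how the functions combine with each other and boost_mode controls how that combined value meets the query score; setting both to multiply keeps every axis multiplicative.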

Solution Steps

1. Apply a Gaussian decay multiplied into BM25

{
  "query": {
    "function_score": {
      "query": {
        "match": { "title": "kubernetes networking" }
      },
      "functions": [
        {
          "gauss": {
            "published_at": {
              "origin": "now",
              "scale": "30d",
              "offset": "7d",
              "decay": 0.5
            }
          }
        }
      ],
      "boost_mode": "multiply"
    }
  }
}

Read the four knobs deliberately:

origin: "now" — the ideal date; documents here get multiplier 1.0.
offset: "7d" — a grace window; anything within 7 days of now is treated as fully fresh, so today and last-Tuesday are not penalized against each other.
scale: "30d" — at 30 days past the offset, the multiplier reaches decay.
decay: 0.5 — the multiplier value at one scale out. With this curve a 37-day-old doc keeps half its score, not a tenth.

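boost_mode: "multiply" is the load-bearing flag: it is what keeps relevance dominant. It is also the default, but stating it explicitly protects the query from a template or copy-paste that sets sum or replace.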

A useful way to sanity-check your knobs is to compute the multiplier at a few representative ages by hand. With scale: "30d", offset: "7d", and decay: 0.5, a document published today sits inside the offset window and scores a multiplier of 1.0; one published 37 days ago (7-day offset + one 30-day scale) scores exactly 0.5; one published 67 days ago (two scales out) scores roughly 0.06 under a Gaussian curve. That last number is the warning sign — Gaussian decay falls off fast past two scales, so if your corpus contains valuable documents older than offset + 2*scale, they are effectively erased unless you widen the curve in the next step.
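The hand computation above can be reproduced in a few lines. This sketch assumes the documented Gaussian decay formula, which simplifies to decay raised to the square of (age past the offset divided by scale); gauss_multiplier is a hypothetical helper name and ages are in days.

```python
def gauss_multiplier(age_days, scale_days, offset_days, decay):
    """Gaussian decay multiplier for a document `age_days` old.

    Equivalent closed form: decay ** (((age - offset) / scale) ** 2),
    with ages inside the offset window clamped to a multiplier of 1.0.
    """
    excess = max(0.0, age_days - offset_days)
    return decay ** ((excess / scale_days) ** 2)


# Parameters from step 1: scale 30d, offset 7d, decay 0.5.
for age in (0, 7, 37, 67, 97):
    print(f"{age:>3} days old -> {gauss_multiplier(age, 30, 7, 0.5):.4f}")
```

Writing the curve this way makes the falloff easy to reason about: each additional scale squares the exponent, so one scale out gives decay, two scales out gives decay to the fourth power, and three scales out gives decay to the ninth.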

2. Soften the curve so old relevant docs survive

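If genuinely relevant old documents still get buried, raise decay toward 0.7–0.9 and widen scale. A higher decay means a document one scale past the offset keeps most of its score, and a wider scale pushes that point further back in time, so freshness mostly breaks ties among comparably relevant hits rather than overriding relevance. The Gaussian multiplier still approaches zero for documents many scales old, so size scale against the oldest documents you need to keep visible.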

{
  "gauss": {
    "published_at": {
      "origin": "now",
      "scale": "90d",
      "offset": "14d",
      "decay": 0.8
    }
  }
}
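To compare the two parameter sets, it can help to invert the curve and ask at what age the multiplier drops to a given value. The sketch below assumes the documented Gaussian decay formula; age_at_multiplier is a hypothetical helper and ages are in days.

```python
import math


def age_at_multiplier(target, scale_days, offset_days, decay):
    """Age in days at which the Gaussian multiplier has fallen to `target`."""
    if not (0.0 < target <= 1.0 and 0.0 < decay < 1.0):
        raise ValueError("target must be in (0, 1] and decay in (0, 1)")
    # Invert decay ** (((age - offset) / scale) ** 2) = target for age.
    scales_out = math.sqrt(math.log(target) / math.log(decay))
    return offset_days + scale_days * scales_out


steep = age_at_multiplier(0.1, scale_days=30, offset_days=7, decay=0.5)
soft = age_at_multiplier(0.1, scale_days=90, offset_days=14, decay=0.8)
print(f"step 1 curve keeps 10% of the score until ~{steep:.0f} days")
print(f"step 2 curve keeps 10% of the score until ~{soft:.0f} days")
```

With the step 1 parameters a document is down to a tenth of its score at roughly 62 days; with the softened parameters that point moves out to roughly 303 days. Neither curve is flat forever, which is worth checking against the age of the oldest documents you still want to surface.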

3. Handle missing dates and multi-valued fields explicitly

A decay function returns 1 for a document that has no value in the date field, so documents without published_at are scored as fully fresh rather than demoted. If that is wrong for your data, backfill the field or guard the query so undated documents are not silently promoted above dated ones. When the date field is multi-valued (a document with several updated_at timestamps, for instance), control which one drives the decay with multi_value_mode (min, max, avg, or sum). The mode applies to the distance from origin, not to the raw field value: min, the default, uses the timestamp closest to origin, which with origin at now is the most recent one and is usually the intended “last touched” semantics; max would use the oldest.

{
  "gauss": {
    "updated_at": { "origin": "now/h", "scale": "30d", "decay": 0.5 },
    "multi_value_mode": "min"
  }
}
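For the undated-document case, one way to make the handling explicit instead of relying on default behavior is to give each function a filter: the decay applies only to documents that have the date field, and a fixed weight applies to those that do not. This is a sketch under assumptions; the 0.5 multiplier for undated documents is an arbitrary policy choice, not a recommendation from the source.

```python
import json

DATE_FIELD = "published_at"
UNDATED_MULTIPLIER = 0.5  # assumed policy: treat undated docs as moderately stale

functions = [
    {
        # Dated documents: apply the Gaussian decay as usual.
        "filter": {"exists": {"field": DATE_FIELD}},
        "gauss": {
            DATE_FIELD: {
                "origin": "now/h",
                "scale": "30d",
                "offset": "7d",
                "decay": 0.5,
            }
        },
    },
    {
        # Undated documents: apply a fixed, explicitly chosen multiplier.
        "filter": {"bool": {"must_not": {"exists": {"field": DATE_FIELD}}}},
        "weight": UNDATED_MULTIPLIER,
    },
]

query = {
    "query": {
        "function_score": {
            "query": {"match": {"title": "kubernetes networking"}},
            "functions": functions,
            "score_mode": "multiply",
            "boost_mode": "multiply",
        }
    }
}

print(json.dumps(query, indent=2))
```

Because the two filters are complementary, every document matches exactly one function, so the multiplier an undated document receives is a decision recorded in the query rather than a side effect.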

Note the origin: "now/h" in the multi-valued example: rounding to the hour means every query within the same hour evaluates an identical curve, which the pitfalls below cover. For more elaborate freshness logic (blending recency with popularity or geo), promote it into reusable custom scoring functions rather than repeating the decay block in every query template. And calibrate the underlying lexical score first: if BM25 itself is mis-weighted, recency boosting only masks the problem, so settle the fine-tuning of BM25 b and k1 parameters before layering decay on top.

Verification

Run the boosted query with explain and confirm the recency multiplier composes multiplicatively with BM25, not additively:

curl -s "localhost:9200/articles/_search?size=5&explain=true" \
  -H 'Content-Type: application/json' -d @recency_query.json \
  | jq '.hits.hits[] | {id: ._id, score: ._score,
        why: (._explanation.description)}'

Expected output shows a product of: (not sum of:) at the top of each explanation, with the BM25 value and a decay multiplier between 0 and 1.0:

{
  "id": "strong-old-match",
  "score": 5.4,
  "why": "product of: bm25 (9.0) * gauss decay (0.6)"
}

A strong old match at 5.4 should still sit above a weak fresh match (0.4 * 1.0 = 0.4). If the explanation says sum of:, your boost_mode is wrong. Then confirm nDCG@10 did not regress versus the unboosted baseline on your judged set — recency boosting that improves perceived freshness while quietly dropping relevance is the most common silent failure.
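For the nDCG@10 comparison, a minimal sketch follows. It assumes graded relevance judgments with linear gain and, as a simplification, computes the ideal ordering from the grades of the returned documents only; a full evaluation would take the ideal from every judged document for the query. The function names and the sample grades are hypothetical.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top k grades (linear gain)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the same grades in ideal order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical judged grades (0-3) in ranked order for one query.
baseline = [3, 2, 3, 0, 1, 2, 0, 0, 1, 0]
boosted = [3, 3, 2, 2, 0, 1, 0, 1, 0, 0]

print(f"baseline nDCG@10: {ndcg_at_k(baseline):.3f}")
print(f"boosted  nDCG@10: {ndcg_at_k(boosted):.3f}")
```

Average the per-query values across the judged set and compare the two means; a boosted mean below the baseline is the silent regression described above.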

For a sharper validation, build a two-document probe directly: index one strongly-matching document with an old date and one weakly-matching document with today’s date, then assert the strong-old document still outranks the weak-fresh one. If it does not, your curve is too steep: the recency multiplier on the old document has fallen below the ratio of the weak document's BM25 score to the strong document's. Run this probe as a fixture on every change to the decay parameters; it is far cheaper than re-judging a full query set and catches the “freshness ate relevance” regression immediately.

curl -s "localhost:9200/articles/_search?size=2" \
  -H 'Content-Type: application/json' -d @recency_query.json \
  | jq '[.hits.hits[] | {id: ._id, score: ._score}]'
# Expect: strong-old-match ranked above weak-fresh-match
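The expectation in that comment can be turned into an automated check. The sketch below assumes the search response has already been parsed into a dict (for example from the curl output) and that the probe documents use the ids strong-old-match and weak-fresh-match; the helper name is hypothetical.

```python
def assert_strong_old_outranks(response,
                               strong_id="strong-old-match",
                               weak_id="weak-fresh-match"):
    """Fail if the weak fresh probe document ranks above the strong old one."""
    ids = [hit["_id"] for hit in response["hits"]["hits"]]
    missing = {strong_id, weak_id} - set(ids)
    if missing:
        raise AssertionError(f"probe documents missing: {sorted(missing)}")
    if ids.index(strong_id) > ids.index(weak_id):
        raise AssertionError("freshness ate relevance: curve is too steep")


# Hand-written sample response, shaped like a search response body.
sample = {
    "hits": {
        "hits": [
            {"_id": "strong-old-match", "_score": 5.4},
            {"_id": "weak-fresh-match", "_score": 0.4},
        ]
    }
}

assert_strong_old_outranks(sample)
print("probe passed")
```

Checking positions rather than raw scores keeps the fixture stable when BM25 values shift slightly after reindexing.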

Common Pitfalls

scale too small collapses ranking into pure recency

A scale of 1d–2d drives the multiplier of nearly every document toward zero, so it stops discriminating among them and order is dominated by the handful of documents inside the offset window. Widen scale to weeks or months so the curve actually grades across your corpus’s age distribution, and re-check nDCG@10.

using boost_mode sum instead of multiply

Additive composition lets the recency term dominate BM25, promoting fresh-but-irrelevant documents — the exact failure shown in the diagnosis above. Always use boost_mode: "multiply" for recency so freshness scales relevance rather than replacing it.

"now" is not cacheable and breaks shard-level query caching

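origin: "now" is evaluated per request and changes every millisecond, so two identical requests never evaluate exactly the same curve; that rules out serving them from the shard request cache and can produce tiny ordering jitter between identical queries. Round it, e.g. origin: "now/h", to snap the curve to the current hour so ordering is deterministic within that window. Measure cache hit rates after the change rather than assuming them: by default the shard request cache stores only size=0 requests, so a query that returns hits is not cached unless you opt in.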

Related

Query-time boosting strategies: the parent area covering decay, demotion, rescore, and pinning.
Fine-tuning BM25 b and k1 parameters: calibrate the base score your recency multiplier scales against.
Custom scoring functions: package recency, popularity, and geo decay as reusable scoring logic.