Learning to Rank (LTR) for Search

Learning to rank replaces hand-tuned scoring with a model trained on relevance judgments. Instead of guessing the right field boosts, you log features for query-document pairs, attach graded labels, and let a gradient-boosted tree learn the weighting that maximizes a ranking metric. This guide resolves a concrete engineering decision: when your ranking signals exceed roughly five interacting factors — text match, recency, popularity, category affinity, price — manual tuning stops converging and a learned model earns its operational cost. It sits inside Ranking Algorithms & Relevance Tuning as the step beyond static weighting, taking the baseline scores from BM25 Tuning & Weights and treating them as one input feature among many rather than the final answer.

The mechanism is a two-phase retrieve-then-rerank: a cheap query collects a candidate window, and an sltr rescore re-orders only that window using the model. This bounds latency because the expensive feature computation and tree traversal touch only the top-N documents, not the whole index. The rest of this guide covers feature engineering, judgment-list construction, model training, deployment to the feature store, and both offline and online evaluation.

The decision to adopt a learned model is not free. You take on a feature pipeline that must produce identical values at training and serving time, a judgment-collection process that needs continuous refresh as the catalog and query mix drift, and a retraining cadence with its own evaluation gates. The payoff is that the model captures interactions between signals — title weight conditioned on freshness, popularity conditioned on category — that no flat set of boosts can express, and it does so by fitting to observed relevance rather than to an engineer’s intuition. Treat this guide as the operational contract for that trade: every section below names a thing that can silently break feature parity, label quality, or rollout safety, because those three are where learned ranking systems fail in production rather than in the math.

Prerequisites

Elasticsearch 7.x/8.x with the ltr plugin installed (elasticsearch-plugin install http://...ltr-...zip), or OpenSearch 2.x with the opensearch-learning-to-rank-base plugin.
A populated production-like index with realistic term statistics — IDF and document length distributions must match production or feature values will not transfer.
A relevance judgment source: explicit graded labels, or an interaction log (impressions, clicks, conversions) to derive labels from a click model.
Python 3.10+ with xgboost>=1.7, numpy, and pandas for feature-matrix assembly and training.
An offline evaluation harness (ranx or ir-measures) and an A/B framework able to split traffic by rendered ranking.
Write access to the .ltrstore feature-store index and cluster permission to register feature sets and models.

Concept Deep-Dive

A learning-to-rank system has three artifacts: a feature set (named queries that each emit a score for a query-document pair), a judgment list (query-document pairs with graded relevance labels), and a model (the trained tree ensemble, uploaded in a format the plugin can evaluate at query time). At training time you join judgments against logged feature values to build a matrix. At query time the sltr rescorer recomputes those same features for the candidate window and feeds them to the model.
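The training-time join described above can be sketched in a few lines. This is a minimal illustration, not the plugin's or XGBoost's own tooling: the function name `build_training_matrix`, the tuple layout of the judgments, and the choice to fill an absent feature with 0.0 are all assumptions made for the example.

```python
from collections import defaultdict

def build_training_matrix(judgments, logged_features, feature_names):
    """Join graded judgments with logged feature values, grouped by query.

    judgments: iterable of (query_id, doc_id, grade)
    logged_features: dict mapping (query_id, doc_id) -> {feature_name: value}
    Returns (X, y, group_sizes); rows of the same query are contiguous.
    """
    by_query = defaultdict(list)
    for query_id, doc_id, grade in judgments:
        feats = logged_features.get((query_id, doc_id))
        if feats is None:
            continue  # no logged vector for this pair: skip it rather than invent values
        # Assumption: a feature absent from the log is treated as 0.0.
        row = [float(feats.get(name, 0.0)) for name in feature_names]
        by_query[query_id].append((row, grade))

    X, y, group_sizes = [], [], []
    for query_id in sorted(by_query):
        rows = by_query[query_id]
        X.extend(row for row, _ in rows)
        y.extend(grade for _, grade in rows)
        group_sizes.append(len(rows))  # one entry per query, in row order
    return X, y, group_sizes
```

The `group_sizes` list is what a ranking trainer needs to know where one query's documents end and the next begin; with XGBoost it would typically be passed to `DMatrix.set_group`.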

Features fall into three families. Query-dependent features depend on both the query and the document — BM25 score on title, BM25 on body, phrase-match flags, match_phrase on shingles. Document-only features depend only on the document — popularity (log of view count), recency (a gauss decay over published_at), in-stock flags, average rating. Query-only features depend only on the query — token count, detected intent class. Each feature is expressed as a templated Elasticsearch query with a {{keywords}} parameter; the feature store stores the template, and feature logging executes all templates against known judged documents to capture the numeric values.

Labels are graded, not binary. A common scale is 0 (irrelevant), 1 (marginal), 2 (relevant), 3 (perfect). Explicit editorial judgments are accurate but expensive and sparse. Most production systems instead derive labels from clicks using a click model (a Dynamic Bayesian Network or the simpler Position-Based Model) to correct for position bias: a click at rank 1 says far less about relevance than the same click at rank 8, because nearly every user examines rank 1 and few reach rank 8. The output is a graded label per query-document pair plus a propensity weight.

Position bias is the central hazard of click-derived labels. Users click what they see, and they see the top results first, so a naive “clicked = relevant” rule simply re-teaches the model the ranking that produced the clicks — a feedback loop that freezes whatever bias the current system already has. A click model estimates the probability that a document at a given rank was examined at all, then attributes a click to relevance only after discounting that examination probability. The Position-Based Model assumes examination depends solely on rank; a Dynamic Bayesian Network additionally models that users stop scanning after a satisfying click. Whichever you choose, the practical output is the same shape the trainer needs: a grade and a weight per pair. Mixing a small, trusted editorial set with a large click-derived set — editorial labels as anchors, click labels as volume — tends to outperform either alone, because the editorial anchors keep the click model honest in the head queries where bias is strongest.
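A rough sketch of the Position-Based Model idea follows. It assumes the examination probabilities per rank are already known; the numbers in `EXAMINATION`, the grade thresholds, and the use of the expected examination count as the per-pair weight are illustrative assumptions, not values from any real system. In practice the propensities would be estimated from randomized traffic or fitted, not hard-coded.

```python
from collections import defaultdict

# Assumed examination probabilities by 1-based rank (illustrative only).
EXAMINATION = {1: 1.0, 2: 0.7, 3: 0.5, 4: 0.4, 5: 0.3}
DEFAULT_EXAMINATION = 0.2  # assumed value for ranks beyond the table

def pbm_labels(impressions, thresholds=(0.05, 0.15, 0.35)):
    """Derive (grade, weight) per query-document pair from click logs.

    impressions: iterable of (query, doc_id, rank, clicked)
    Returns {(query, doc_id): (grade, weight)} where grade is 0..3 and
    weight is the expected number of examinations behind the estimate.
    """
    clicks = defaultdict(float)
    exams = defaultdict(float)
    for query, doc_id, rank, clicked in impressions:
        key = (query, doc_id)
        # Each impression counts only as much as the chance it was examined.
        exams[key] += EXAMINATION.get(rank, DEFAULT_EXAMINATION)
        clicks[key] += 1.0 if clicked else 0.0

    labels = {}
    for key, expected_exams in exams.items():
        # Clicks per expected examination, capped at 1.0.
        relevance = min(1.0, clicks[key] / expected_exams)
        grade = sum(relevance >= t for t in thresholds)
        labels[key] = (grade, expected_exams)
    return labels
```

With these assumed propensities, two clicks in ten impressions at rank 1 yield a lower grade than two clicks in ten impressions at rank 8, which is the discounting the paragraph above describes.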

Coverage matters as much as label quality. A judgment list skewed toward head queries trains a model that excels on common searches and ignores the rare-query tail where most zero-result and abandonment pain actually lives. Sample judged queries to mirror the production frequency distribution, and deliberately over-sample a slice of tail queries so the model sees enough rare-intent examples to generalize. The number of judged documents per query is equally load-bearing: a query with only one judged document contributes nothing to a within-group ranking objective, so aim for several graded documents per query spanning the grade scale, including hard negatives that look textually relevant but are not.
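One way to act on the per-query coverage rule is to screen the judgment list before training. The sketch below is an assumption-laden illustration: the function name `usable_groups` and the extra requirement of at least two distinct grades are choices made here, the latter because a group whose documents all share one grade produces no preference pairs either.

```python
from collections import defaultdict

def usable_groups(judgments, min_docs=2):
    """Split judged queries into those that can train a ranking objective and those that cannot.

    judgments: iterable of (query, doc_id, grade)
    Returns (kept, dropped): kept maps query -> {doc_id: grade},
    dropped is a sorted list of queries that were filtered out.
    """
    by_query = defaultdict(dict)
    for query, doc_id, grade in judgments:
        by_query[query][doc_id] = grade  # a repeated pair keeps the last grade seen

    kept, dropped = {}, []
    for query, docs in by_query.items():
        enough_docs = len(docs) >= min_docs
        mixed_grades = len(set(docs.values())) >= 2
        if enough_docs and mixed_grades:
            kept[query] = docs
        else:
            dropped.append(query)
    return kept, sorted(dropped)
```

The `dropped` list is worth reviewing rather than discarding: a long list of tail queries there is a sign that the judgment collection needs more documents per query, not fewer queries.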

The training objective determines what the model optimizes. Pointwise treats each query-document pair as an independent regression or classification problem — simple but blind to the fact that ranking is relative. Pairwise learns from ordered pairs within a query, minimizing inversions; RankNet is the canonical example. Listwise optimizes a list-level metric directly. LambdaMART — the default for search — is listwise in spirit: it is gradient-boosted regression trees (the MART part) driven by lambda gradients weighted by the change in NDCG that swapping two documents would cause. In practice you train it with XGBoost’s rank:ndcg or rank:pairwise objective, or with RankLib. The detailed mechanics live in the guide on training an LTR model with XGBoost.
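The quantity that weights each lambda gradient, the change in NDCG from swapping two documents, can be computed directly. The sketch below shows only that quantity, not the full LambdaMART update, and it assumes the common exponential gain 2^grade - 1 with a log2 position discount; the function names are illustrative.

```python
import math

def dcg(grades):
    """Discounted cumulative gain for grades listed in ranked order."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades))

def ndcg(grades):
    """DCG normalized by the DCG of the ideal (descending-grade) order."""
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0

def delta_ndcg(grades, i, j):
    """Absolute NDCG change if the documents at positions i and j swapped places."""
    swapped = list(grades)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(ndcg(swapped) - ndcg(grades))

# A perfect document stuck at rank 2 behind an irrelevant one at rank 1:
ranked_grades = [0, 3, 1]
top_swap = delta_ndcg(ranked_grades, 0, 1)    # fixing the top of the list
lower_swap = delta_ndcg(ranked_grades, 1, 2)  # reordering further down
```

Here `top_swap` is larger than `lower_swap`, which is why the objective pushes hardest on mistakes near the top of the list.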

A worked example: a query keywords = "running shoes" retrieves 1000 candidates from a bool query. The rescorer computes, per candidate, BM25 on title, BM25 on description, a recency decay, and log1p(views). The LambdaMART model — which learned from thousands of judged ("running shoes", doc) pairs that title matches matter more than description matches but that very old documents should be demoted regardless of text score — outputs a final score, and the top 24 are returned. The model captured an interaction no single boost expresses: title weight is high unless the document is stale, in which case recency dominates.

Feature parity is the invariant that makes this work. The numeric value of title_bm25 at training time must equal its value at serving time for the same query-document pair, which means the analyzer, the index’s term statistics, and the feature template must all be identical across logging and querying. The most common silent failure in production LTR is logging features against a staging index whose IDF distribution differs from production: the model learns weights tuned to feature magnitudes it will never see again, and offline NDCG looks fine because it was computed against those same staging features. Log against a production-equivalent index, version your feature set alongside the model, and treat any change to an analyzer or a feature template as a retraining trigger. Document-only features such as recency and popularity deserve special care — if you index them as precomputed fields to save query-time cost, the precomputation must run on the same schedule in both environments, or a document’s recency will read differently in training than at serving.
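A parity check can be as simple as re-logging a sample of training pairs against the serving index and diffing the vectors. The helper below is a hypothetical sketch: the name `parity_violations` and the tolerance values are assumptions, and it takes the two vectors as plain dictionaries however you obtain them.

```python
import math

def parity_violations(training_vec, serving_vec, rel_tol=1e-6, abs_tol=1e-9):
    """Compare two feature vectors for the same query-document pair.

    Both arguments map feature name -> numeric value.
    Returns a list of (feature_name, training_value, serving_value) for every
    feature that is missing on one side or differs beyond the tolerance.
    """
    problems = []
    for name in sorted(set(training_vec) | set(serving_vec)):
        if name not in training_vec or name not in serving_vec:
            problems.append((name, training_vec.get(name), serving_vec.get(name)))
            continue
        a, b = training_vec[name], serving_vec[name]
        if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
            problems.append((name, a, b))
    return problems
```

Running this over a few hundred sampled pairs before each training run, and failing the run on any non-empty result, turns a silent parity break into a loud one.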

Step-by-Step Implementation

1. Initialize the feature store

The plugin keeps feature sets and models in a hidden .ltrstore index. Create it once.

curl -X PUT "localhost:9200/_ltr"
# Verify: the default feature store exists
curl -s "localhost:9200/_ltr" | jq .

Verify: the response reports "acknowledged": true (or 200 OK listing the store) and a .ltrstore index now appears in GET _cat/indices?v.

2. Define and register a feature set

Each feature is a templated query. Register them as a named feature set the model will reference by ordinal.

curl -X POST "localhost:9200/_ltr/_featureset/product_features" \
  -H 'Content-Type: application/json' -d '{
  "featureset": {
    "features": [
      { "name": "title_bm25", "params": ["keywords"],
        "template": { "match": { "title": "{{keywords}}" } } },
      { "name": "desc_bm25", "params": ["keywords"],
        "template": { "match": { "description": "{{keywords}}" } } },
      { "name": "recency", "params": ["keywords"],
        "template": { "function_score": { "functions": [
          { "gauss": { "published_at": { "origin": "now", "scale": "30d", "decay": 0.5 } } }
        ] } } },
      { "name": "popularity", "params": ["keywords"],
        "template": { "function_score": {
          "field_value_factor": { "field": "views", "modifier": "log1p", "missing": 1 } } } }
    ]
  }
}'

Verify: fetch it back and confirm four features registered in order.

# the plugin returns the stored document, so the feature set sits under _source
curl -s "localhost:9200/_ltr/_featureset/product_features" | jq '._source.featureset.features | length'
# expected: 4

3. Log features for judged documents

For every query-document pair in your judgment list, run a query filtered to the judged document IDs and attach a named sltr clause together with the ltr_log extension, which emits the feature vector for each hit. Turning these vectors into a matrix and group file is covered in training an LTR model with XGBoost; the additive and multiplicative signal shaping described in Custom Scoring Functions can be reused as feature templates.

import requests

def log_features(query_text, doc_ids):
    """Fetch the judged documents with their logged feature values attached."""
    body = {
        "query": {"bool": {
            # Restrict the search to the judged documents.
            "filter": [{"terms": {"_id": doc_ids}}],
            # Named sltr clause; the log spec below refers to it by this name.
            "should": [{"sltr": {
                "_name": "logged_featureset",
                "featureset": "product_features",
                "params": {"keywords": query_text}}}]}},
        "ext": {"ltr_log": {"log_specs": {
            "name": "log_entry",
            "named_query": "logged_featureset"}}},
        "size": len(doc_ids)
    }
    r = requests.post("http://localhost:9200/products/_search", json=body)
    r.raise_for_status()  # surface HTTP errors instead of failing on a missing "hits" key
    return r.json()["hits"]["hits"]

Verify: each returned hit carries fields._ltrlog, whose log_entry list holds one name/value entry per feature; assert the vector length equals the feature-set size.
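The verification can be folded into the code that flattens each hit into a feature vector. This sketch assumes the hit shape the plugin documents for logging, with the values under a `_ltrlog` field keyed by the log spec name; both are exposed as parameters in case your version differs. The function name `feature_vector` and the choice to read an entry without a value as 0.0 are assumptions for illustration.

```python
def feature_vector(hit, feature_names, log_name="log_entry", log_field="_ltrlog"):
    """Flatten one search hit's logged features into an ordered list of floats.

    hit: a single element of hits.hits from a logging query
    feature_names: the feature-set order the model will be trained on
    Raises ValueError if any expected feature is absent from the log.
    """
    entries = hit["fields"][log_field][0][log_name]
    values = {entry["name"]: entry.get("value") for entry in entries}

    missing = [name for name in feature_names if name not in values]
    if missing:
        raise ValueError(f"features missing from log: {missing}")

    # Assumption: an entry logged without a value (no match) counts as 0.0.
    return [0.0 if values[name] is None else float(values[name])
            for name in feature_names]
```

Ordering the output by `feature_names` rather than by log order keeps the training matrix columns stable even if the log entries arrive in a different sequence.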

4. Train and upload the model

Assemble the matrix, train with rank:ndcg, convert to the plugin’s model format, and POST it under the feature set. The full conversion script is in the XGBoost training guide.

curl -X POST "localhost:9200/_ltr/_featureset/product_features/_createmodel" \
  -H 'Content-Type: application/json' -d @model_payload.json
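The model_payload.json file posted above could be assembled along these lines. This is a sketch under assumptions: it presumes the plugin's `model/xgboost+json` model type and a tree dump produced by XGBoost's JSON dump format (for example by joining the per-tree strings from `booster.get_dump(dump_format="json")` into one JSON array). The helper name `build_model_payload` is hypothetical.

```python
import json

def build_model_payload(model_name, tree_dump_json):
    """Wrap an XGBoost JSON tree dump in the body expected by _createmodel.

    tree_dump_json: a string holding a JSON array with one object per tree
    Returns a dict ready to be serialized to model_payload.json.
    """
    trees = json.loads(tree_dump_json)  # fail early on a malformed dump
    if not isinstance(trees, list) or not trees:
        raise ValueError("expected a non-empty JSON array of trees")
    return {
        "model": {
            "name": model_name,
            "model": {
                "type": "model/xgboost+json",
                # The definition is carried as a JSON string.
                "definition": json.dumps(trees),
            },
        }
    }

# A single-leaf tree stands in for a real trained ensemble here.
payload = build_model_payload("product_ltr_v1", '[{"nodeid": 0, "leaf": 0.5}]')
payload_text = json.dumps(payload, indent=2)  # contents for model_payload.json
```

Parsing the dump before wrapping it catches a truncated or empty export locally, which is cheaper than discovering it from a rejected upload.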

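Verify: fetch the model back by name to confirm it registered; the rescore in step 5 then confirms that it returns scores.

# the plugin returns the stored document, so the name sits under _source
curl -s "localhost:9200/_ltr/_model/product_ltr_v1" | jq '._source.name'
# expected: "product_ltr_v1"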

5. Apply the sltr rescore at query time

Retrieve a candidate window with a cheap query, then rescore only the top-N with the model. Keep window_size bounded — it is the number of documents the model scores per shard.

{
  "query": { "multi_match": {
    "query": "running shoes", "fields": ["title^2", "description"] } },
  "rescore": {
    "window_size": 100,
    "query": {
      "rescore_query": { "sltr": {
        "params": { "keywords": "running shoes" },
        "model": "product_ltr_v1" } },
      "query_weight": 0,
      "rescore_query_weight": 1
    }
  }
}

Verify: compare top-10 ordering with and without the rescore clause; the order must differ on queries where the model adds signal.
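The comparison can be scripted so it runs over a batch of queries. The sketch below builds the same rescore request as a Python dict and reports which positions changed between two result lists; the helper names `rescore_body` and `changed_positions` are hypothetical, and sending the request is left to whatever client you already use.

```python
def rescore_body(query_text, model_name, window_size=100, size=10):
    """Build a search body that reranks the top window with an LTR model."""
    return {
        "size": size,
        "query": {"multi_match": {
            "query": query_text, "fields": ["title^2", "description"]}},
        "rescore": {
            "window_size": window_size,  # documents rescored per shard
            "query": {
                "rescore_query": {"sltr": {
                    "params": {"keywords": query_text},
                    "model": model_name}},
                "query_weight": 0,          # discard the base score in the window
                "rescore_query_weight": 1,  # use the model score alone
            },
        },
    }

def changed_positions(base_ids, rescored_ids):
    """Return the 0-based ranks at which the two orderings disagree."""
    return [i for i, (a, b) in enumerate(zip(base_ids, rescored_ids)) if a != b]
```

An empty list from `changed_positions` on every test query is the symptom described under Failure Modes: the model is either not applied or not adding signal.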

Configuration Reference

Name	Default	Type	Effect
`rescore.window_size`	`10`	integer	Documents per shard the model scores; larger improves recall of the rerank but raises per-query CPU linearly. Set to 100–500 for product search.
`rescore_query_weight`	`1`	float	Multiplier on the model score; set to `1` and `query_weight` to `0` to use the model score alone for the window.
`query_weight`	`1`	float	Multiplier on the original base-query score before combining; set to `0` to discard base scores in the rescore.
`objective` (XGBoost)	`reg:squarederror`	string	Training objective; use `rank:ndcg` for listwise LambdaMART-style optimization of the ranking metric.
`max_depth` (XGBoost)	`6`	integer	Tree depth; 4–8 is typical for LTR, deeper risks overfitting sparse judgment lists.
`eta` (XGBoost)	`0.3`	float	Learning rate; lower (0.05–0.1) with more rounds generalizes better on small judgment sets.
`feature_store.index`	`.ltrstore`	string	Index holding feature sets and models; use a named store per environment to isolate staging from production.

Failure Modes & Debugging

Symptom: rescored results are identical to the base query order

Root cause: the model is being applied but query_weight still dominates, or the model returns a near-constant score because feature values at query time differ from training (feature parity break). Confirm the rescore actually runs and inspect the model’s per-feature contribution.

curl -s "localhost:9200/products/_search" -H 'Content-Type: application/json' -d '{
  "query": {"sltr": {"params": {"keywords": "running shoes"},
    "model": "product_ltr_v1"}}, "explain": true, "size": 1}' | jq '.hits.hits[0]._explanation'

Remediation: set query_weight to 0, and verify logged training features were captured against the same analyzer and index that serves queries.

Symptom: NDCG improves offline but online CTR is flat or worse

Root cause: position bias was not corrected when deriving labels, so the model learned to reproduce the old ranking rather than true relevance; or the offline judgment set is not representative of live query distribution.

Remediation: derive labels through a click model with propensity weighting, and re-weight the offline eval query sample to match the production head-and-tail mix. Gate every model behind an A/B test regardless of offline gains — see canary deploying relevance models.

Symptom: query latency spikes after enabling the rescore

Root cause: window_size is too large, or a feature template runs an expensive function_score/script_score per document across the whole window on every shard.

curl -s "localhost:9200/products/_search?pretty" -H 'Content-Type: application/json' -d '{
  "profile": true, "query": {...}, "rescore": {...}}' | jq '.profile.shards[].searches[].rewrite_time'

Remediation: lower window_size to the smallest value that preserves NDCG, and precompute document-only features (popularity, recency band) as indexed fields instead of computing them at query time.

Symptom: createmodel rejects the payload with a feature-count mismatch

Root cause: the model’s feature ordinals do not match the registered feature set — features were reordered, renamed, or added after the model was trained.

Remediation: regenerate the model payload from the exact feature set version used for logging, and pin the feature-set name in the model metadata so retraining cannot silently desync ordinals.

Performance & Scale Notes

The cost of LTR is the rescore window, not the index size. Scoring is roughly window_size × num_features × tree_traversal_cost per shard. With 100 features and a 100-document window, a LambdaMART ensemble of 100 trees adds single-digit milliseconds per shard; pushing window_size to 1000 turns that into tens of milliseconds and is rarely justified — recall gains from the reranker flatten well before then because the base query already surfaces the relevant candidates. Benchmark by replaying production query logs through both the base query and the rescored query, measuring p95/p99 with the profile API and comparing NDCG@10 on a held-out judgment set; promote only when NDCG gain clears statistical significance (p < 0.05). Document-only features moved to indexed fields cut per-query work materially. For judgment sets under ~5,000 graded pairs, keep trees shallow (max_depth 4–6) and eta low to avoid overfitting; learned ranking begins to clearly beat a well-tuned static model once interacting signals exceed five.
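One way to check the significance gate is a paired randomization test over per-query NDCG@10 scores. The sketch below is one reasonable choice of test, not the only one; the function name, the trial count, and the fixed seed are assumptions made so the example is reproducible.

```python
import random

def paired_randomization_test(base_scores, model_scores, trials=10000, seed=0):
    """Two-sided p-value for the mean per-query difference between two rankers.

    base_scores, model_scores: per-query metric values (e.g. NDCG@10),
    aligned so index i refers to the same query in both lists.
    """
    if len(base_scores) != len(model_scores) or not base_scores:
        raise ValueError("score lists must be non-empty and the same length")

    diffs = [m - b for b, m in zip(base_scores, model_scores)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)

    rng = random.Random(seed)
    extreme = 0
    for _ in range(trials):
        # Under the null hypothesis the sign of each difference is arbitrary.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total / n) >= observed:
            extreme += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    return (extreme + 1) / (trials + 1)
```

Because the test is paired by query, it stays sensitive to a small but consistent gain that an unpaired comparison of the two means would miss.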

Offline and online evaluation answer different questions and you need both. Offline NDCG@10 on a held-out judgment set is fast, deterministic, and cheap, so it gates which candidate model is worth shipping — but it can only measure relevance against labels you already trust, and those labels may carry the very bias the model should correct. Online A/B testing measures what offline cannot: real user behavior on the live query distribution, including the tail and the novel intents your judgment set never captured. Run them in series. Offline eval kills bad models before they cost traffic; the survivors go to an A/B test that compares the model arm against the current production ranking on CTR, click-through at depth, conversion, and zero-result rate, sized for the effect you expect to detect. A model that wins offline but loses online is the signal that your labels diverged from reality, and that result is information, not waste. Roll the winner out gradually rather than flipping all traffic at once, so a regression in a query segment you did not test surfaces on a small slice — the guarded rollout path is covered in canary deploying relevance models. Retraining is not a one-time event: catalog turnover, seasonal intent shifts, and the slow drift of the click distribution all degrade a frozen model, so schedule periodic retraining on refreshed judgments and re-run the same offline-then-online gate each time.
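Sizing the A/B test for the expected effect can be approximated with the standard two-proportion formula. The sketch below assumes the metric is a rate such as CTR, a two-sided test, and equal traffic per arm; the function name and the default alpha and power are conventional choices, not values taken from this guide.

```python
import math
from statistics import NormalDist

def sessions_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate sessions needed in each arm to detect a relative lift in a rate.

    baseline_rate: current rate in the control arm, e.g. 0.10 for 10% CTR
    relative_lift: smallest lift worth detecting, e.g. 0.05 for +5% relative
    """
    if not 0 < baseline_rate < 1 or relative_lift == 0:
        raise ValueError("baseline_rate must be in (0, 1) and relative_lift non-zero")

    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

small_lift = sessions_per_arm(0.10, 0.02)
large_lift = sessions_per_arm(0.10, 0.05)
```

The required sample grows roughly with the inverse square of the lift, so `small_lift` is several times `large_lift`; that is the practical reason to let offline evaluation filter out marginal candidates before they reach live traffic.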

BM25 Tuning & Weights — the base relevance scores LTR consumes as input features.
Custom Scoring Functions — signal-shaping functions you can register as recency and popularity features.
Training an LTR Model with XGBoost — the matrix, group file, and model export workflow in detail.
Vector Search Integration Strategies — add embedding similarity as an LTR feature for hybrid ranking.
Canary Deploying Relevance Models — the safe online rollout path for a newly trained model.