Training an LTR Model with XGBoost

You have logged feature vectors and graded judgments, and now need a model the Elasticsearch LTR plugin can evaluate at query time. The specific failure this guide prevents is the silent feature-ordinal desync: XGBoost indexes features positionally (f0, f1, …), the plugin references them by name, and if the order in your training matrix does not exactly match the registered feature set, the model scores nonsense without erroring. This guide covers building the matrix and group file, training with the rank:ndcg objective, and exporting to the plugin format. It is the training step of Learning to Rank (LTR) for Search, which itself lives under Ranking Algorithms & Relevance Tuning.

The work breaks into four mechanical steps with one invariant threading through all of them: feature order. The matrix you feed XGBoost, the group file that defines query boundaries, and the names you stamp onto the exported splits all reference features by position, and position is the thing most likely to drift between the moment you log features and the moment you upload a model. Everything below is structured to make that drift impossible to commit silently — you fetch the canonical feature order once, derive matrix columns and split names from the same list, and verify at the end that scores are non-constant. Get the order right and the math is routine; get it wrong and you ship a model that scores every document plausibly but ranks them by a feature it thinks is recency and is actually popularity.

Prerequisites

Python 3.10+ with xgboost>=1.7, numpy, and pandas installed.
A registered feature set (e.g. product_features) and logged feature vectors per judged query-document pair.
A judgment list mapping (qid, doc_id) -> grade on a graded scale (0–3).
The exact ordered list of feature names from the feature set, fetched from _ltr/_featureset/<name>.
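One way to satisfy the last prerequisite is to read the ordered names straight out of the stored feature set document. The sketch below assumes the GET response nests the features under _source.featureset.features; confirm that against a real response from your cluster before relying on it. The helper name feature_names_from_response and the sample document are illustrative, not part of the plugin.

```python
import json

def feature_names_from_response(doc):
    """Return feature names in registered order from a parsed feature set response.

    Assumes the layout {"_source": {"featureset": {"features": [{"name": ...}]}}}.
    """
    features = doc["_source"]["featureset"]["features"]
    names = [f["name"] for f in features]
    if len(set(names)) != len(names):
        raise ValueError("duplicate feature names in feature set")
    return names

# Illustrative response body; in practice this is the JSON returned by
# GET _ltr/_featureset/product_features
sample = json.loads("""
{"_source": {"featureset": {"name": "product_features", "features": [
  {"name": "title_bm25"}, {"name": "desc_bm25"},
  {"name": "recency"}, {"name": "popularity"}]}}}
""")

feature_names = feature_names_from_response(sample)
print(feature_names)
```

Deriving both the matrix columns and the exported split names from this one list keeps a single source of truth for feature order.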

Diagnosis / Context

Ranking models train on grouped data: rows are query-document pairs, and a group file tells XGBoost which contiguous rows belong to the same query so the rank:ndcg objective compares documents only within their own query. Get the grouping wrong and the model optimizes comparisons across unrelated queries, producing high training NDCG that collapses in production.

The matrix is dense and ordinal. Every row is one judged document; columns are feature values in feature-set order; the label is the grade. A typical SVMRank-style line looks like this, and any gap in ordinals or mismatch against the registered set is the root of most failures:

2 qid:101 1:8.21 2:3.04 3:0.88 4:2.30   # grade=2, title_bm25=8.21, desc_bm25=3.04, recency=0.88, popularity=2.30
0 qid:101 1:0.00 2:1.12 3:0.40 4:0.10   # same query, irrelevant doc
3 qid:102 1:9.90 2:4.51 3:0.95 4:3.80   # new query group begins
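To catch ordinal gaps before they reach training, lines in this format can be parsed strictly. The following is a minimal sketch, assuming dense lines where every ordinal from 1 to the feature count is present; parse_svmrank_line is a hypothetical helper name.

```python
def parse_svmrank_line(line, n_features):
    """Parse 'label qid:Q 1:v 2:v ...' into (label, qid, values).

    Raises ValueError if ordinals are not exactly 1..n_features in order.
    """
    body = line.split("#", 1)[0].split()          # drop the trailing comment
    label = float(body[0])
    if not body[1].startswith("qid:"):
        raise ValueError("second token must be qid:<id>")
    qid = body[1][4:]
    values = []
    for expected, token in enumerate(body[2:], start=1):
        ordinal, value = token.split(":", 1)
        if int(ordinal) != expected:
            raise ValueError(f"expected ordinal {expected}, got {ordinal}")
        values.append(float(value))
    if len(values) != n_features:
        raise ValueError(f"expected {n_features} features, got {len(values)}")
    return label, qid, values

lines = [
    "2 qid:101 1:8.21 2:3.04 3:0.88 4:2.30   # grade=2",
    "0 qid:101 1:0.00 2:1.12 3:0.40 4:0.10",
    "3 qid:102 1:9.90 2:4.51 3:0.95 4:3.80",
]
parsed = [parse_svmrank_line(line, n_features=4) for line in lines]
```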

Why rank:ndcg rather than rank:pairwise or a plain regression? Regression (reg:squarederror) treats the grade as an absolute number to predict, which wastes the model’s capacity on matching label magnitudes you do not care about; only the order within a query matters at serving time. rank:pairwise learns from ordered pairs but treats a mis-ordered pair near the top of the ranking the same as one near the bottom. rank:ndcg is the listwise objective: it weights each pairwise correction by the change in NDCG that swapping the two documents would produce, so the model spends its effort on the swaps that move the metric you actually serve. This is the LambdaMART formulation, and it is why you should evaluate at the same NDCG cutoff the application returns. The grouping is what makes the objective coherent: without a correct group file, “within a query” has no meaning and the gradients are computed across documents that were never alternatives for the same search.
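The swap weighting can be made concrete with a small calculation. This sketch assumes the common exponential gain (2^grade - 1) with a log2 position discount; the function names are illustrative and this is not XGBoost's internal code.

```python
import math

def dcg(grades, k):
    """DCG@k with exponential gain and log2 position discount."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades[:k]))

def ndcg(grades, k):
    """NDCG@k of a ranked list of grades; 0.0 when no document is relevant."""
    ideal = dcg(sorted(grades, reverse=True), k)
    return dcg(grades, k) / ideal if ideal > 0 else 0.0

def swap_delta(grades, i, j, k):
    """Absolute change in NDCG@k if the documents at ranks i and j traded places."""
    swapped = list(grades)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(ndcg(swapped, k) - ndcg(grades, k))

ranked = [0, 3, 2, 1]                     # grades in the order the model ranked them
top = swap_delta(ranked, 0, 1, k=10)      # fixing the irrelevant doc at rank 1
tail = swap_delta(ranked, 2, 3, k=10)     # reordering two docs further down
print(top > tail)
```

Moving the grade-3 document above the grade-0 document at the top changes NDCG far more than reordering the tail, so that pair receives the larger weight during training.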

Solution Steps

1. Assemble the feature matrix and group file

Join judgments to logged vectors, sort by qid so each query’s rows are contiguous, and emit the group sizes.

import numpy as np
import xgboost as xgb

# feature_names MUST match the registered feature set order exactly;
# hardcoded here for brevity, fetch it from _ltr/_featureset/<name> in practice
feature_names = ["title_bm25", "desc_bm25", "recency", "popularity"]

# rows: list of dicts {qid, label, features: {name: value}}
rows = load_logged_judgments()                      # your loader
rows.sort(key=lambda r: r["qid"])                   # contiguous query groups

X = np.array([[r["features"][f] for f in feature_names] for r in rows], dtype=np.float32)
y = np.array([r["label"] for r in rows], dtype=np.float32)

# group file: number of rows per qid, in sorted (first-seen) order
group_counts = {}                                   # dicts preserve insertion order
for r in rows:
    group_counts[r["qid"]] = group_counts.get(r["qid"], 0) + 1
groups = list(group_counts.values())
assert sum(groups) == len(rows)                     # sizes must cover every row

dtrain = xgb.DMatrix(X, label=y)
dtrain.set_group(groups)                            # critical for rank objectives

2. Train with the rank:ndcg objective

rank:ndcg is the listwise, LambdaMART-style objective. Keep trees shallow and the learning rate low for sparse judgment sets.

params = {
    "objective": "rank:ndcg",     # listwise NDCG optimization
    "eval_metric": ["ndcg@10"],   # report NDCG at the cutoff you serve
    "eta": 0.1,                   # low rate generalizes on small judgment sets
    "max_depth": 5,               # 4-6 avoids overfitting LTR judgments
    "min_child_weight": 1,
}
model = xgb.train(params, dtrain, num_boost_round=200,
                  evals=[(dtrain, "train")], verbose_eval=50)

Hold out a validation set split by query, not by row: randomly splitting rows leaks documents from the same query into both train and validation and inflates NDCG. Add the held-out DMatrix to evals and watch the gap between train and validation NDCG; a widening gap is your overfitting signal. If it widens early, lower eta and max_depth or raise min_child_weight before adding rounds. With a few thousand judged pairs, 100–300 rounds at eta 0.1 is a reasonable starting envelope.
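A query-level split can be sketched as follows, assuming rows shaped like the dicts used in step 1. The helper name split_by_query, the 20% fraction, and the toy data are illustrative choices.

```python
import random

def split_by_query(rows, valid_fraction=0.2, seed=0):
    """Split rows into (train, valid) so that no qid appears in both."""
    qids = sorted({r["qid"] for r in rows})
    random.Random(seed).shuffle(qids)               # deterministic for a given seed
    n_valid = max(1, int(len(qids) * valid_fraction))
    valid_qids = set(qids[:n_valid])
    train = [r for r in rows if r["qid"] not in valid_qids]
    valid = [r for r in rows if r["qid"] in valid_qids]
    return train, valid

# Toy data: 10 queries with 5 judged documents each
toy_rows = [{"qid": q, "label": d % 4} for q in range(10) for d in range(5)]
train_rows, valid_rows = split_by_query(toy_rows)
```

Each side then needs its own sort by qid and its own group array before it becomes a DMatrix, and the validation matrix goes into evals alongside the training one, for example evals=[(dtrain, "train"), (dvalid, "valid")].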

3. Export to the Elasticsearch LTR format

The plugin reads XGBoost ensembles as JSON with named splits. Dump the booster, then rewrite the f<n> split identifiers to feature names so ordinals can never desync. Wrap the result in the plugin’s model definition payload. Naming the splits is the single most important defensive step in the whole pipeline: an XGBoost dump references features positionally as f0, f1, and so on, and the plugin will happily evaluate that ensemble against whatever feature ordinals the feature set currently exposes. If anyone has since reordered or inserted a feature, f3 now points at a different signal and the model scores confidently wrong. Rewriting every split to its feature name binds the model to meaning rather than position, so a mismatch fails loudly at _createmodel instead of silently at query time.

import json

raw = json.loads("[" + ",".join(model.get_dump(dump_format="json")) + "]")

def rename(node):
    if "split" in node:                              # f3 -> recency
        node["split"] = feature_names[int(node["split"][1:])]
    for child in node.get("children", []):
        rename(child)
    return node

trees = [rename(t) for t in raw]

payload = {
    "model": {
        "name": "product_ltr_v1",
        "model": {
            "type": "model/xgboost+json",
            "definition": json.dumps({"objective": "rank:ndcg", "splits": trees})
        }
    }
}
with open("model_payload.json", "w") as fh:
    json.dump(payload, fh)
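Before uploading, it can help to confirm that every split in the renamed trees refers to a known feature name. The check below is a sketch over the nested dict shape of the JSON dump (split and children keys); the helper names and the sample tree are illustrative.

```python
def collect_split_names(node, found):
    """Walk one tree and add every split name to the found set."""
    if "split" in node:
        found.add(node["split"])
    for child in node.get("children", []):
        collect_split_names(child, found)
    return found

def unknown_splits(trees, feature_names):
    """Return split names used by the ensemble that are not registered features."""
    used = set()
    for tree in trees:
        collect_split_names(tree, used)
    return sorted(used - set(feature_names))

# Hand-written sample tree in the shape of a renamed dump
sample_trees = [{
    "nodeid": 0, "split": "recency", "split_condition": 0.5,
    "yes": 1, "no": 2, "missing": 1,
    "children": [
        {"nodeid": 1, "leaf": 0.12},
        {"nodeid": 2, "split": "popularity", "split_condition": 1.0,
         "yes": 3, "no": 4, "missing": 3,
         "children": [{"nodeid": 3, "leaf": -0.05}, {"nodeid": 4, "leaf": 0.30}]},
    ],
}]
names = ["title_bm25", "desc_bm25", "recency", "popularity"]
leftover = unknown_splits(sample_trees, names)
```

An empty result means every split resolved to a feature name; a leftover such as f3 means the rename step missed a node.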

4. Upload via the feature-store API

POST the payload under the feature set the model was trained against. The plugin binds the model to that set’s ordinals.

curl -X POST "localhost:9200/_ltr/_featureset/product_features/_createmodel" \
  -H 'Content-Type: application/json' \
  -d @model_payload.json   # registers product_ltr_v1 against product_features

Verification

Confirm the model registered and produces non-constant scores when applied as a query.

curl -s "localhost:9200/_ltr/_model/product_ltr_v1" | jq '._source.model.name'
# expected output:
# "product_ltr_v1"

curl -s "localhost:9200/products/_search" -H 'Content-Type: application/json' -d '{
  "size": 3,
  "query": { "sltr": { "params": { "keywords": "running shoes" },
    "model": "product_ltr_v1" } }
}' | jq '[.hits.hits[]._score]'
# expected: three DISTINCT scores, e.g. [ 4.31, 2.07, 0.55 ]

If the three scores are identical, the model is not seeing varying feature values. Check that the splits were exported with feature names as in step 3 and that the sltr params match what the feature templates expect. Beyond a smoke test, confirm the model actually learned the signal you intended by inspecting feature importance from the trained booster (model.get_score(importance_type="gain")): a model where title_bm25 carries near-zero gain while popularity dominates is either telling you something real about your traffic or telling you that a feature was logged as a constant. Cross-check a single explained query so each feature reads a plausible value before you let the model touch any production-shaped evaluation, and only then carry it into the offline NDCG comparison and the guarded online rollout described in the parent guide.
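The "logged as a constant" case can be checked directly on the training matrix before looking at gain. This is a small sketch that works on a list of rows or a 2-D array; constant_features is a hypothetical helper and the sample matrix is made up.

```python
def constant_features(matrix, feature_names):
    """Return names of columns whose value never changes across rows."""
    constant = []
    for col, name in enumerate(feature_names):
        distinct = {float(row[col]) for row in matrix}
        if len(distinct) <= 1:
            constant.append(name)
    return constant

names = ["title_bm25", "desc_bm25", "recency", "popularity"]
sample_matrix = [
    [8.21, 3.04, 0.88, 0.0],
    [0.00, 1.12, 0.40, 0.0],
    [9.90, 4.51, 0.95, 0.0],   # popularity is 0.0 everywhere: suspicious
]
suspects = constant_features(sample_matrix, names)
print(suspects)
```

A feature that shows up here can never be chosen for a split, so it will also be absent from the gain report.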

Common Pitfalls

Group sizes do not sum to the number of rows

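Recent XGBoost releases reject a group array whose sizes do not sum to X.shape[0], but a group array that sums correctly while cutting across query boundaries (for example because rows were not sorted by qid) is accepted silently, and the NDCG objective then compares the wrong documents. Assert sum(groups) == len(rows) before set_group so the failure points at your own code, and always sort rows by qid first so each group is contiguous.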
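A stricter check than the sum alone verifies that each group covers exactly one query and that no query is split across groups. The sketch below assumes you have the per-row qid list in matrix order; check_groups is a hypothetical helper.

```python
def check_groups(qids, groups):
    """Raise ValueError unless groups partition qids into one contiguous block per query."""
    if sum(groups) != len(qids):
        raise ValueError(f"group sizes sum to {sum(groups)}, expected {len(qids)}")
    seen = set()
    start = 0
    for size in groups:
        block = set(qids[start:start + size])
        if len(block) != 1:                      # empty group or mixed queries
            raise ValueError(f"rows {start}..{start + size - 1} do not hold exactly one qid")
        qid = block.pop()
        if qid in seen:
            raise ValueError(f"qid {qid} is split across non-adjacent groups")
        seen.add(qid)
        start += size

check_groups([101, 101, 102], [2, 1])            # passes silently
```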

Feature order in the matrix differs from the registered set

The matrix columns and the feature-set ordinals must be in exactly the same order, name for name. Fetch feature_names from _ltr/_featureset/<name> at training time rather than hardcoding, and rename splits to feature names on export so a positional mismatch cannot go unnoticed.

Missing feature values default to zero and distort splits

A document with no views logs popularity as missing; if your loader fills it with 0 instead of the feature’s missing default, the tree learns a false boundary. Mirror the feature template’s missing value in the loader, and verify with a single explained query that no feature reads 0.0 unexpectedly.
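One way to keep the loader honest is to make missing values explicit rather than letting them fall through to 0. In this sketch the MISSING_DEFAULTS mapping and build_row helper are hypothetical. NaN is used because it is the default missing marker for an XGBoost DMatrix, but whether it is the right choice depends on how your feature templates and the plugin fill missing values at query time.

```python
import math

# Hypothetical per-feature defaults; mirror whatever your feature templates use.
MISSING_DEFAULTS = {"popularity": float("nan")}

def build_row(features, feature_names, missing_defaults):
    """Build one matrix row in feature-set order, refusing to guess a missing value."""
    row = []
    for name in feature_names:
        value = features.get(name)
        if value is None:
            if name not in missing_defaults:
                raise KeyError(f"no logged value and no missing default for {name!r}")
            value = missing_defaults[name]
        row.append(float(value))
    return row

row = build_row({"title_bm25": 1.5}, ["title_bm25", "popularity"], MISSING_DEFAULTS)
print(row[0], math.isnan(row[1]))
```

A feature with no logged value and no declared default raises instead of becoming a silent 0.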

Learning to Rank (LTR) for Search — the feature store, judgment lists, and sltr rescore this model plugs into.
Fine-Tuning BM25 b and k1 Parameters — calibrate the BM25 features before logging them for training.
Canary Deploying Relevance Models — roll the trained model out behind a guarded online test.