Result Highlighting & Snippet Generation

Intro

A result list that shows a matched document but no matched text forces the user to open every hit to judge relevance. Highlighting closes that gap: it returns the specific passage that satisfied the query, with the query terms wrapped in markup the frontend can style. Done well, the snippet is the single most persuasive element on the results page — it is the evidence that this result answers the query, and it is what the eye lands on before the title. Done badly, it leaks scripts, ships kilobytes of unneeded body text, or returns the wrong passage and quietly erodes trust in the search. The engineering decision this guide resolves is which highlighter to run and how to feed it, because the three Elasticsearch highlighters differ by an order of magnitude in latency, memory cost, and fidelity to the original query. This guide sits within Search Frontend & UX Patterns, and the choices here interact directly with how you score documents in BM25 tuning and field weights and how you lay out fields in your index mapping.

Prerequisites

  • Elasticsearch reachable at localhost:9200, with a running cluster you can reindex into.
  • Highlighted fields mapped as text (not keyword): highlighters re-analyze field content and need the analyzer chain.
  • A frontend that renders the default <em>…</em> tags, or custom pre_tags/post_tags.
  • For offset-backed highlighting, term_vector: with_positions_offsets or index_options: offsets, which requires a reindex.

Concept Deep-Dive

Highlighting is a two-phase problem. First, locate the positions in a field where query terms matched. Second, cut the field into candidate fragments, score each by how many (and which) query terms it contains, and return the top number_of_fragments passages with the terms wrapped in markup. Elasticsearch ships three highlighter implementations that solve phase one differently.

The unified highlighter (the default since 6.0) breaks the field into sentences using a BreakIterator, then uses the Lucene UnifiedHighlighter to score passages with its own BM25-like model. It supports all query types (phrases, prefixes, wildcards, match_phrase proximity) accurately, because it re-runs the query against the extracted text. It can use postings (index_options: offsets), term vectors, or, as a fallback, re-analyze the field on the fly. It is the right default for almost every field.

The plain highlighter re-analyzes the field content in memory for every hit and replays the query token-by-token. It is faithful to query semantics but pays the analysis cost per document per field, so on large bodies across many hits it dominates query latency. Use it only for small fields where you have not stored offsets.

The fvh (fast vector highlighter, fast-vector-highlighter) reads precomputed offsets straight from term_vector: with_positions_offsets, skipping re-analysis entirely. On multi-kilobyte fields it is the fastest option, and it is the only highlighter that supports matched_fields (combining several analyzed variants of one field into a single highlighted result) and per-token boosting. Its cost is storage: term vectors with positions and offsets can add 30–100% to a field’s on-disk size. It also handles some complex queries less precisely than the unified highlighter.

To make the engine choice concrete, the trade space across the three implementations looks like this. The unified highlighter is the default because it is correct for every query type and needs no special mapping; you reach past it only when latency on large fields forces the issue (fvh) or when you are stuck on a legacy index with neither offsets nor term vectors and want exact query semantics on a tiny field (plain).

| Engine | Reads offsets from | Best for | Notable limitation |
| --- | --- | --- | --- |
| unified | postings, term vectors, or re-analysis | the default for all fields | slower than fvh on very large fields without stored offsets |
| plain | re-analysis only | tiny legacy fields, exact query replay | re-analyzes every hit; dominates latency on long fields |
| fvh | term_vector: with_positions_offsets | large fields, matched_fields, per-token boost | needs a reindex; less precise on some complex queries |

Phase two, fragment selection, follows the same outline in every engine, although the segmentation details differ, and it is where most snippet-quality complaints originate. The field text is divided into candidate windows. The unified highlighter cuts at sentence boundaries. The plain highlighter uses a fragmenter: span (the default) nudges the window boundaries so that a phrase or proximity match is not split across two fragments, while simple chops the field at a fixed character count regardless of where matches fall. Each candidate window is then scored: a passage containing two distinct query terms outranks one containing the same term twice, and a passage where the terms appear adjacently (satisfying a phrase query) outranks one where they are scattered. The top number_of_fragments windows are returned, and order: score versus the default order of appearance controls whether the strongest passage or the earliest passage leads.

A worked example: a 4,000-character body field, query match_phrase: "vector search". With fragment_size: 150 and number_of_fragments: 3, the highlighter slices the body into roughly 150-character windows, scores each by phrase proximity, and returns the three best non-overlapping passages. The single passage containing the exact adjacent phrase scores highest and, with order: score, is returned first, with both terms wrapped in <em>. Had the query been a plain match rather than match_phrase, a window containing only the more selective term vector (higher IDF) would often outscore a window containing only search, because the passage scorer weights rarer terms more heavily. This is the same intuition that drives document ranking, applied at passage granularity.
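The window-then-score idea can be sketched in a few lines. This is a toy illustration only: it uses fixed-size windows, counts distinct terms, and adds a bonus for the adjacent phrase. It is not the scoring model of any Elasticsearch highlighter (there is no IDF weighting here), and the function name pickFragments is hypothetical.

```javascript
// Toy passage selector: fixed-size windows scored by distinct query terms.
// Illustration only; NOT the scorer used by the unified, plain, or fvh highlighter.
function pickFragments(text, terms, fragmentSize, numberOfFragments) {
  const wanted = terms.map((t) => t.toLowerCase());
  const windows = [];
  for (let start = 0; start < text.length; start += fragmentSize) {
    const slice = text.slice(start, start + fragmentSize);
    const lower = slice.toLowerCase();
    // One point per distinct query term present in the window.
    const distinct = wanted.filter((t) => lower.includes(t)).length;
    // One bonus point when the terms appear adjacently, in query order.
    const phrase = lower.includes(wanted.join(" ")) ? 1 : 0;
    windows.push({ start, text: slice, score: distinct + phrase });
  }
  return windows
    .filter((w) => w.score > 0)
    // Highest score first; ties fall back to order of appearance.
    .sort((a, b) => b.score - a.score || a.start - b.start)
    .slice(0, numberOfFragments);
}

// Usage: three 20-character windows, only the second holds the full phrase.
const sampleText = "only vector is here." + "fast vector search! " + "nothing relevant....";
const best = pickFragments(sampleText, ["vector", "search"], 20, 2);
console.log(best.map((w) => `${w.score}: ${w.text}`));
```

Note that a fixed-size cut can split a phrase across two windows, which is exactly the failure that sentence-boundary and span-style segmentation are meant to avoid.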

The require_field_match flag is the single setting most teams get wrong. Left at its default true, a field is highlighted only if a query clause targeted that field. A match query on title with a highlight requested on body returns an empty body highlight even when body contains the term, because as far as the highlighter is concerned no clause matched body. Setting require_field_match: false tells the highlighter to surface a query term wherever it appears, even in fields the query did not name. That is useful for a unified preview field, but it can highlight incidental term collisions, so scope it deliberately.
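To see the flag's effect in isolation, it helps to send two requests that differ only in require_field_match, with the query aimed at one field and the highlight at another. The sketch below only builds the request bodies; the helper name buildFieldScopedSearch is hypothetical, and the field names follow the title/body example used in this guide.

```javascript
// Build a search body whose query targets one field and whose highlight
// targets another, so the effect of require_field_match is visible.
function buildFieldScopedSearch(queryField, highlightField, text, requireFieldMatch) {
  return {
    query: { match: { [queryField]: text } },
    highlight: {
      require_field_match: requireFieldMatch,
      fields: { [highlightField]: {} },
    },
  };
}

// Usage: identical requests except for the flag.
const strict = buildFieldScopedSearch("title", "body", "vector", true);
const relaxed = buildFieldScopedSearch("title", "body", "vector", false);
console.log(JSON.stringify(strict));
console.log(JSON.stringify(relaxed));
```

Sending both bodies to the same index and comparing the highlight section of each hit shows whether an empty snippet is caused by the flag or by the term simply being absent from the field.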

[Figure: Fragment selection and highlight marking. A body field (one long analyzed text field) is split into four fragments, each scored by query-term matches: fragment 1 scores 0.4, fragment 2 scores 1.9 (best), fragment 3 scores 0.7, fragment 4 scores 0.2. The best-scoring fragment is selected and returned as the snippet, with the matched terms wrapped in em tags: run a fast <em>vector search</em> over]

The frontend never receives raw query terms — it receives the field text with markup already inserted, which is why pre_tags/post_tags and require_field_match matter so much: they decide exactly what string crosses the wire and how it must be rendered safely.

One subtlety worth internalizing before you choose an engine: the three highlighters can return different snippets for the same query and document. The unified highlighter’s sentence-boundary segmentation produces grammatically clean passages that often read better in a UI; the fvh’s offset-driven windows can begin mid-sentence; the plain highlighter mirrors the unified output for simple queries but diverges on phrase and span queries because it lacks the unified scorer. If you A/B test highlighters in production, treat the snippet text as part of the experiment, not just the latency — users notice ragged passages even when they cannot name why a result feels lower quality.
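One way to make snippet differences visible before an A/B test is to run the same query once per engine and group the engines by the fragments they return. The sketch below covers only the pure parts (building the three request bodies and grouping results); the function names are hypothetical, and sending the requests is left to whatever HTTP client you already use.

```javascript
const ENGINES = ["unified", "plain", "fvh"];

// Build one request body per engine; everything except `type` is identical.
function requestsPerEngine(query, field, options = {}) {
  const requests = {};
  for (const type of ENGINES) {
    requests[type] = {
      query,
      highlight: { fields: { [field]: { ...options, type } } },
    };
  }
  return requests;
}

// Group engines that returned exactly the same fragments for one hit.
function groupBySnippet(snippetsByEngine) {
  const groups = new Map();
  for (const [engine, fragments] of Object.entries(snippetsByEngine)) {
    const key = JSON.stringify(fragments);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(engine);
  }
  return [...groups.values()];
}

// Usage with made-up fragments: unified and plain agree, fvh differs.
console.log(
  groupBySnippet({
    unified: ["run a fast <em>vector</em> <em>search</em>"],
    plain: ["run a fast <em>vector</em> <em>search</em>"],
    fvh: ["fast <em>vector search</em> over millions"],
  })
);
```

More than one group for a given hit means the engines disagree on the passage, which is worth reviewing by eye before attributing any metric change to latency alone.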

Step-by-Step Implementation

1. Map the fields you intend to highlight

For most fields, a plain text mapping is enough — the unified highlighter will re-analyze on demand. Reserve term vectors for large fields you highlight on every query.

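PUT /articles
{
  "mappings": {
    "properties": {
      "title":   { "type": "text" },
      "body":    {
        "type": "text",
        "analyzer": "english",
        "term_vector": "with_positions_offsets",
        "fields": {
          "exact": {
            "type": "text",
            "analyzer": "standard",
            "term_vector": "with_positions_offsets"
          }
        }
      }
    }
  }
}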

Verify: confirm the term vector setting took effect.

curl -s "localhost:9200/articles/_mapping" | jq '.articles.mappings.properties.body'

2. Index a document with both a short and a long field

curl -s -X POST "localhost:9200/articles/_doc/1?refresh=true" \
  -H 'Content-Type: application/json' \
  -d '{
    "title": "Tuning vector search",
    "body": "Before you run a fast vector search over millions of documents, profile the index. A vector search pipeline batches embeddings and caches hot vectors."
  }'

Verify: the document is searchable.

curl -s "localhost:9200/articles/_count?q=vector" | jq '.count'

3. Request the unified highlighter (the safe default)

POST /articles/_search
{
  "query": { "match": { "body": "vector search" } },
  "highlight": {
    "fields": {
      "body": {
        "type": "unified",
        "fragment_size": 150,
        "number_of_fragments": 2,
        "require_field_match": true
      }
    }
  }
}

Verify: the response carries a highlight.body array with <em> tags around matched terms.

curl -s "localhost:9200/articles/_search" -H 'Content-Type: application/json' \
  -d '{"query":{"match":{"body":"vector search"}},"highlight":{"fields":{"body":{}}}}' \
  | jq '.hits.hits[0].highlight'

4. Switch the long field to fvh for offset-backed speed

Because body carries with_positions_offsets, the fvh highlighter reads positions directly and skips re-analysis. The fragmenter option (span or simple) applies only to the plain highlighter, so it is omitted here; fvh picks fragment boundaries with its boundary_scanner setting instead.

POST /articles/_search
{
  "query": { "match_phrase": { "body": "vector search" } },
  "highlight": {
    "fields": {
      "body": {
        "type": "fvh",
        "fragment_size": 120,
        "number_of_fragments": 3,
        "pre_tags": ["<mark class=\"hl\">"],
        "post_tags": ["</mark>"]
      }
    }
  }
}

Verify: confirm the custom <mark> tags appear and the phrase stays whole.

curl -s "localhost:9200/articles/_search" -H 'Content-Type: application/json' \
  -d '{"query":{"match_phrase":{"body":"vector search"}},
       "highlight":{"fields":{"body":{"type":"fvh","pre_tags":["<mark>"],"post_tags":["</mark>"]}}}}' \
  | jq -r '.hits.hits[0].highlight.body[]'

5. Combine analyzed variants with matched_fields

When you index a field two ways, a stemmed body and an exact body.exact, matched_fields lets a single highlight reflect matches from both. This is fvh-only.

POST /articles/_search
{
  "query": { "query_string": { "query": "search", "fields": ["body","body.exact"] } },
  "highlight": {
    "fields": {
      "body": {
        "type": "fvh",
        "matched_fields": ["body", "body.exact"]
      }
    }
  }
}

Verify: matches from either analyzer are wrapped in the highlighted body output.

curl -s "localhost:9200/articles/_search" -H 'Content-Type: application/json' \
  -d '{"query":{"query_string":{"query":"search","fields":["body","body.exact"]}},
       "highlight":{"fields":{"body":{"type":"fvh","matched_fields":["body","body.exact"]}}}}' \
  | jq '.hits.hits[0].highlight'

The matched_fields mechanism deserves a closer look because it solves a real and common problem: you want a single highlighted snippet that reflects both a stemmed match and an exact match. You index body with a stemming analyzer (for example english) and a sub-field body.exact with a non-stemming analyzer (for example standard), query both, and ask the fvh highlighter to merge their offsets into one result on the parent field. The fields must share the same underlying text and both carry with_positions_offsets, because the highlighter overlays their position maps. The payoff is a snippet where a stemmed hit on searching and an exact hit on search both light up in one coherent passage rather than producing two competing fragments.
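Since every field listed in matched_fields needs with_positions_offsets, a quick pre-flight check against the mapping can catch a missing setting before a query fails. The sketch below assumes you pass it the properties object from the _mapping response and that the variants are multi-fields (parent.fields.child); the helper name fieldsMissingOffsets is hypothetical, and nested object properties are not handled.

```javascript
// Return the field paths that do NOT carry term_vector: with_positions_offsets.
// `properties` is the "properties" object from GET /<index>/_mapping.
// Assumption: dotted paths refer to multi-fields, e.g. body.exact -> body.fields.exact.
function fieldsMissingOffsets(properties, fieldPaths) {
  return fieldPaths.filter((path) => {
    const [parent, ...subs] = path.split(".");
    let node = properties[parent];
    for (const sub of subs) {
      node = node && node.fields ? node.fields[sub] : undefined;
    }
    return !node || node.term_vector !== "with_positions_offsets";
  });
}

// Usage: body is fine, body.exact was mapped without term vectors.
const exampleProps = {
  title: { type: "text" },
  body: {
    type: "text",
    term_vector: "with_positions_offsets",
    fields: { exact: { type: "text" } },
  },
};
console.log(fieldsMissingOffsets(exampleProps, ["body", "body.exact"]));
```

An empty result means every listed variant is ready for fvh; anything returned needs a mapping change and a reindex first.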

6. Render the snippet XSS-safe on the frontend

By default, Elasticsearch does not escape document content before inserting your tags. If the body contains <script>, that script string is in the snippet verbatim. The default <em> tags compound the danger: document text can itself contain a literal <em>, so you cannot distinguish “tag the highlighter added” from “text that happened to look like a tag” after the fact. A safe client-side pattern is to escape everything, then re-introduce only your known highlight tags, and to choose pre_tags/post_tags sentinels rare enough that no document realistically contains them. The server-side alternative is encoder: html in the highlight request, which HTML-escapes the snippet text before the tags are inserted.

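// Escape all HTML, then unescape ONLY our sentinel highlight tags.
// Use rare sentinel tags as pre_tags/post_tags so document text can never forge them.
// The values below are illustrative: any two strings work if they differ from each
// other, contain no &, <, or >, and never occur in your documents.
const PRE = "@@HL_START@@";   // sent as pre_tags
const POST = "@@HL_END@@";    // sent as post_tags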

function renderSnippet(raw) {
  const escaped = raw
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
  return escaped
    .split(PRE).join("<mark class=\"hl\">")
    .split(POST).join("</mark>");
}

// element.innerHTML = renderSnippet(hit.highlight.body[0]);
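If your frontend builds DOM nodes or framework components rather than assigning innerHTML, an alternative is to split the highlight string into plain and highlighted segments and let the renderer set text content, which never parses HTML. The sketch below is one possible shape for that; the function name toSegments and the sentinel strings in the usage example are hypothetical.

```javascript
// Split a highlight string into segments using sentinel pre/post tags.
// Each segment is { text, hit }: `hit` is true for text the highlighter wrapped.
// The text is returned raw; render it via textContent (or a framework's text
// binding), never via innerHTML.
function toSegments(raw, pre, post) {
  const segments = [];
  let cursor = 0;
  while (cursor < raw.length) {
    const open = raw.indexOf(pre, cursor);
    const close = open === -1 ? -1 : raw.indexOf(post, open + pre.length);
    if (open === -1 || close === -1) {
      // No further (balanced) highlight: the rest is plain text.
      segments.push({ text: raw.slice(cursor), hit: false });
      break;
    }
    if (open > cursor) {
      segments.push({ text: raw.slice(cursor, open), hit: false });
    }
    segments.push({ text: raw.slice(open + pre.length, close), hit: true });
    cursor = close + post.length;
  }
  return segments;
}

// Usage: the hostile markup stays ordinary text in a non-hit segment.
console.log(
  toSegments("a @@HL_START@@vector@@HL_END@@ <script>alert(1)</script>", "@@HL_START@@", "@@HL_END@@")
);

// In a browser, each segment could then be appended as
// document.createTextNode(seg.text), wrapped in a <mark> element when seg.hit is true.
```

The design choice here is that no string is ever interpreted as markup, so the escaping step disappears entirely; the trade-off is a little more rendering code.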

Verify: index a hostile document and confirm the script tag renders inert.

curl -s -X POST "localhost:9200/articles/_doc/2?refresh=true" -H 'Content-Type: application/json' \
  -d '{"title":"xss","body":"a vector search <script>alert(1)</script> example"}'
# The raw highlight string still contains <script>; after renderSnippet the output should contain &lt;script&gt;, never an executable <script>.

Configuration Reference

| Name | Default | Type | Effect |
| --- | --- | --- | --- |
| type | unified | enum (unified/plain/fvh) | Selects the highlighter engine; fvh requires term_vector: with_positions_offsets. |
| fragment_size | 100 | integer (chars) | Target length of each returned passage; ignored when number_of_fragments is 0. |
| number_of_fragments | 5 | integer | Max passages returned per field; 0 returns the whole field highlighted instead of fragments. |
| fragmenter | span | enum (span/simple) | span keeps phrase/proximity matches inside one fragment; simple splits on fixed size (plain only). |
| require_field_match | true | boolean | When true, only fields named in the query are highlighted; set false to highlight matches found in any field. |
| matched_fields | [] | array (fvh only) | Merges highlights from several analyzed variants of one field into a single result. |
| pre_tags / post_tags | ["<em>"] / ["</em>"] | array of strings | Markup wrapped around matched terms; use rare sentinels for XSS-safe rendering. |
| no_match_size | 0 | integer (chars) | Chars of the field to return as a snippet when nothing matched; 0 returns no snippet. |
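With no_match_size left at 0, a hit can come back with no highlight for a field, so the frontend needs a fallback. A minimal sketch of that guard follows, assuming the standard hit shape with highlight and _source; the function name snippetFor and the fragment separator are hypothetical choices.

```javascript
// Return the highlighted fragments when present, otherwise a truncated
// slice of the stored field. The result is still untrusted text in both
// cases and must go through the same safe-rendering step as any snippet.
function snippetFor(hit, field, maxChars) {
  const fragments = hit.highlight && hit.highlight[field];
  if (fragments && fragments.length > 0) {
    return { text: fragments.join(" … "), highlighted: true };
  }
  const source = (hit._source && hit._source[field]) || "";
  const text = source.length > maxChars ? source.slice(0, maxChars) + "…" : source;
  return { text, highlighted: false };
}

// Usage: one hit with a highlight, one without.
console.log(snippetFor({ highlight: { body: ["a <em>vector</em> search"] } }, "body", 80));
console.log(snippetFor({ _source: { body: "Before you run a fast vector search" } }, "body", 10));
```

Setting no_match_size on the server achieves a similar result without shipping _source, so this guard is mainly useful when you already fetch the field for other reasons.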

Failure Modes & Debugging

Symptom: highlight array is empty even though the document matched

Root cause: require_field_match is true (the default) and the query targeted a different field than the one you asked to highlight, for example a match on title while you only highlight body. Remediation: either highlight the field that the query actually targets, or relax the match requirement.

curl -s "localhost:9200/articles/_search" -H 'Content-Type: application/json' \
  -d '{"query":{"match":{"title":"vector"}},
       "highlight":{"require_field_match":false,"fields":{"body":{}}}}' \
  | jq '.hits.hits[0].highlight'

Symptom: highlight query latency spikes on long documents

Root cause: the plain highlighter (or unified without stored offsets) re-analyzes every field of every hit at query time, which scales with field length times hit count. Remediation: store offsets and switch the field to fvh, then re-measure p95.

# Confirm whether offsets exist for the field; if not, reindex with term_vector.
curl -s "localhost:9200/articles/_mapping" | jq '.articles.mappings.properties.body.term_vector // "none"'

Symptom: matched_fields throws "matched_fields is not supported"

Root cause: matched_fields is implemented only by the fvh highlighter, and all listed fields must share the same term_vector settings and analysis offsets. Remediation: set type: fvh and ensure each variant carries with_positions_offsets.

curl -s "localhost:9200/articles/_search" -H 'Content-Type: application/json' \
  -d '{"query":{"match":{"body":"search"}},
       "highlight":{"fields":{"body":{"type":"fvh","matched_fields":["body","body.exact"]}}}}' \
  | jq '.error.reason // "ok"'

Symptom: rendered snippets show raw <em> text instead of styled marks

Root cause: the frontend HTML-escaped the highlight string after the tags were inserted, so the tags themselves became visible text. Remediation: escape document content first, then re-inject only your sentinel tags as real markup, never the reverse.

Performance & Scale Notes

Highlighting is a per-hit, per-field cost, so it scales with size (hits per page) multiplied by the number of highlighted fields multiplied by field length. The single biggest lever is whether the highlighter re-analyzes text at query time or reads stored offsets. The plain and offset-less unified paths re-tokenize each field of each returned hit; the fvh path and the unified path backed by index_options: offsets or term vectors read precomputed positions and skip tokenization entirely.

Storing term_vector: with_positions_offsets typically inflates a text field’s on-disk footprint by 30–100% depending on token cardinality; budget for it before enabling on every field, and prefer index_options: offsets (cheaper than full term vectors) when you only need unified-highlighter offsets rather than matched_fields. The payoff is concrete: on a 4 KB body highlighted across 20 hits per query, switching from plain to fvh removed roughly 60–80% of the highlighting time in our local benchmark (measured as the delta in took with and without the highlight block, averaged over 200 queries against a single-node Docker cluster, size: 20, warm filesystem cache). The win shrinks toward zero on short fields like title, where re-analysis is cheap — there, the storage overhead of term vectors rarely pays for itself, so leave title on the unified highlighter.
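The took-delta measurement described above can be sketched as follows. This is not the harness behind the figures quoted in this section; it is a minimal illustration that assumes a runtime with a global fetch (a browser or Node 18+) and an unauthenticated cluster. The function names are hypothetical.

```javascript
// Mean difference in `took` (milliseconds) between runs with and without
// the highlight block: an estimate of the highlighting cost per query.
function highlightBudgetMs(tookWith, tookWithout) {
  const mean = (xs) => xs.reduce((sum, x) => sum + x, 0) / xs.length;
  return mean(tookWith) - mean(tookWithout);
}

// Collect `took` samples for one request body. Assumes global fetch and an
// open cluster; add authentication headers if yours requires them.
async function measureTook(url, body, runs) {
  const samples = [];
  for (let i = 0; i < runs; i++) {
    const res = await fetch(url, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
    });
    const json = await res.json();
    samples.push(json.took);
  }
  return samples;
}

// Usage with made-up samples: mean 14 ms with highlighting, 6 ms without.
console.log(highlightBudgetMs([12, 14, 16], [4, 6, 8]));

// Against a live cluster (not run here):
// const withHl = await measureTook("http://localhost:9200/articles/_search", bodyWithHighlight, 200);
// const without = await measureTook("http://localhost:9200/articles/_search", bodyWithoutHighlight, 200);
// console.log(highlightBudgetMs(withHl, without));
```

Because took is reported in whole milliseconds, differences on short fields can disappear into rounding; more runs or percentile comparisons give a steadier picture.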

Keep number_of_fragments at 2–3 for result lists, since each additional fragment is more passage-scoring work and more bytes on the wire, and resist highlighting more than two or three fields per hit. A common regression is enabling highlighting with a wildcard field pattern such as fields: { "*": {} }, which on a wide document re-analyzes every analyzed field of every hit and can multiply highlight time tenfold. When you measure, isolate highlighting from retrieval by issuing the identical query with and without the highlight block and comparing took; the difference is your highlighting budget. For the per-field details of fragment_size, no_match_size, and choosing fvh for large fields, see configuring highlight fragments. Highlighting interacts with scoring, so validate snippets after any change to BM25 weights, and surface snippets alongside the counts produced by faceted navigation and filtering.