Synonym & Stopword Management for Relevance
Synonym and stopword configuration is where lexical recall is won or lost. The decision this guide resolves is concrete: should you expand synonyms at index time or search time, which token filter handles multi-word terms correctly, and how aggressively should you strip stopwords before they distort phrase matching and inverse document frequency. These choices sit inside the broader discipline of Ranking Algorithms & Relevance Tuning, and they interact directly with BM25 Tuning & Weights: every term you inject or delete shifts the IDF table that BM25 scores against. Get this layer wrong and no amount of k1/b calibration will recover the lost recall. The default answer for almost every production system is search-time synonym_graph with a small, surgical stopword list — but the reasoning behind that default is what makes it safe to deviate from.
Prerequisites
- … _close/_open an index, or to call _reload_search_analyzers for updateable filters.
- … synonyms_path), or inline rules small enough to embed.
- … standard and whitespace differ in ways that matter for multi-word synonyms.
Concept Deep-Dive
A synonym filter rewrites the token stream after tokenization. The legacy synonym filter is positional but flat: it cannot represent a single input term expanding into a multi-token phrase without corrupting downstream phrase and span queries, because it stacks tokens onto existing positions and loses the notion that “new york” occupies two consecutive positions as one alternative to ny. The flat filter gives every token it emits a positionLength of 1, so when it injects york it has nowhere to place it except on top of the next real token, where it reads as an alternative to that token rather than as the second half of new york. Phrase queries, which depend on exact position deltas between terms, then either miss or match the wrong spans. The synonym_graph filter solves this by emitting a token graph: a structure in which a single token and a multi-token path run in parallel between the same two points. Each token carries its own positionLength, so ny can be marked as spanning the same two positions that new and york occupy one after the other, and the parser can reconcile both branches against the surrounding tokens. This graph is what lets a phrase query over the expanded stream still behave correctly.
Consider the user query ny apartment against a rule ny, new york. With synonym_graph, the analyzer produces a graph: at position 0, the token ny and the multi-token path new → york both originate; apartment follows at the next available position. The query parser consumes the graph and builds a disjunction so that documents containing either ny apartment or new york apartment match, with phrase integrity preserved. The diagram below shows the resulting positions.
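The graph for this query can be sketched in plain Python. This is a toy model, not Lucene's implementation: the Token class and graph_paths helper are hypothetical names, and the positions are assigned by hand, with ny given a positionLength of 2 so that both branches rejoin before apartment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    position: int             # node where the token starts
    position_length: int = 1  # number of positions the token spans


def graph_paths(tokens):
    """Enumerate every path through the token graph, from node 0 to the last node."""
    end = max(t.position + t.position_length for t in tokens)
    by_start = {}
    for t in tokens:
        by_start.setdefault(t.position, []).append(t)

    def walk(node):
        if node == end:
            return [[]]
        paths = []
        for t in by_start.get(node, []):
            for rest in walk(t.position + t.position_length):
                paths.append([t.text] + rest)
        return paths

    return walk(0)


# Hand-written graph for "ny apartment" under the rule "ny, new york".
tokens = [
    Token("ny", 0, 2),       # spans both positions of the parallel path
    Token("new", 0),
    Token("york", 1),
    Token("apartment", 2),
]

paths = graph_paths(tokens)
for path in paths:
    print(" ".join(path))
```

Each printed path corresponds to one phrase the query parser can match, which is why the disjunction keeps phrase integrity on both branches.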
The critical design decision is when this expansion runs. Index-time synonyms bake the expanded tokens into the inverted index: every document containing ny also indexes new and york. This makes queries cheap — no expansion happens at query time because the alternatives already live in the postings lists — but couples your synonym set to your data. Changing a rule means reindexing, and the injected tokens distort document frequencies, which feeds back into IDF and skews BM25 scoring across the whole corpus, not just the queries that use the rule. There is also no graph at index time: Lucene cannot store a multi-position graph in segments, so multi-word index-time synonyms rely on a flattening step that can subtly misalign positions. Search-time synonyms expand only the query, leaving the index pristine. The cost is a slightly larger query and the requirement that synonym_graph (not the flat synonym filter) sit in the search analyzer, because only the query side needs graph-aware parsing and the parser is the component that knows how to consume a graph. For nearly all systems, search-time is the correct default: synonym rules evolve faster than corpora, search-time keeps IDF honest, multi-word rules behave correctly, and updates can be applied without a full rebuild as covered in updating synonyms without reindexing. The narrow case for index-time expansion is a frozen, rarely changing synonym set paired with extreme query-latency constraints where you cannot afford even a few milliseconds of expansion per request.
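The IDF effect of index-time expansion can be illustrated with a few lines of arithmetic. The sketch below uses the IDF form found in Lucene's BM25 similarity; the corpus size and document frequencies are made-up numbers chosen only to show the direction of the shift.

```python
import math


def bm25_idf(doc_freq, num_docs):
    """IDF in the form used by Lucene's BM25 similarity."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


num_docs = 100_000  # hypothetical corpus size
df_york = 2_000     # hypothetical: documents that literally contain "york"
df_ny_only = 3_000  # hypothetical: documents that contain "ny" but not "york"

# Search-time expansion leaves the index alone, so "york" keeps its real document frequency.
idf_search_time = bm25_idf(df_york, num_docs)

# Index-time expansion also writes "york" into every "ny" document.
idf_index_time = bm25_idf(df_york + df_ny_only, num_docs)

print(f"search-time IDF(york) = {idf_search_time:.3f}")
print(f"index-time  IDF(york) = {idf_index_time:.3f}")
```

The index-time figure is lower for every query that contains york, including queries that never touch the ny rule.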
Stopwords compound the picture. Removing high-frequency words like the, a, of shrinks the index and speeds matching, but every removed token leaves a hole in the stream: the stop filter keeps the position gap, yet the term that filled it is gone. A phrase query for "king of england" will silently match "king for england" or "king in england" if of is a stopword, because any token can now occupy the gap. Worse, multi-word synonyms that contain stopwords can break entirely: a rule like bank of america, bofa loses its anchor when of is stripped before the synonym filter even sees the stream, because filter order matters and a stop filter placed ahead of the synonym filter mutates its input. The interaction with IDF is the deeper reason to be conservative. In the original information-retrieval era, stopword lists existed because term-frequency models had no principled way to discount ubiquitous words, so they were removed wholesale. BM25 changed that: its IDF component already assigns near-zero weight to terms that appear in almost every document, so common words contribute almost nothing to the score whether or not you strip them. Removing them buys a little storage and speed but throws away the term information phrase and proximity queries rely on. The production posture is therefore a minimal stopword list: strip only true noise, and let scoring handle the rest. Many high-relevance systems run with no stopword filter at all and accept the marginal storage cost in exchange for intact phrase behavior.
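The filter-order problem can be shown with a toy pipeline. Everything here is a simplification for illustration: stop_filter and synonym_filter are hypothetical stand-ins that work on flat token lists and ignore positions, so they model only the ordering effect, not Elasticsearch's actual behavior.

```python
STOPWORDS = {"of"}

# Toy form of the rule "bank of america, bofa": left side as a token tuple.
RULES = {("bank", "of", "america"): ["bofa"]}


def stop_filter(tokens):
    """Drop stopwords from a flat token list."""
    return [t for t in tokens if t not in STOPWORDS]


def synonym_filter(tokens):
    """Append the synonym wherever a rule's left side appears as a contiguous run."""
    out = []
    i = 0
    while i < len(tokens):
        for phrase, synonyms in RULES.items():
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                out.extend(phrase)
                out.extend(synonyms)
                i += len(phrase)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out


query = "bank of america branch".split()

stop_first = synonym_filter(stop_filter(query))     # rule never sees "of"
synonym_first = stop_filter(synonym_filter(query))  # rule fires, then "of" is dropped

print("stop first:   ", stop_first)
print("synonym first:", synonym_first)
```

With the stop filter first, the rule has nothing to match and bofa never appears in the stream.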
Solr vs WordNet formats
Two synonym file formats exist. The Solr format is line-oriented and explicit: ny, new york (equivalent, expanded in both directions) or i-pod => ipod (one-directional replacement, where everything on the left collapses to the right). The distinction matters for relevance: equivalent rules grow recall symmetrically, while replacement rules normalize variants onto a canonical form, which is what you want for spelling variants and brand normalization where you do not want the noisy variant to remain searchable. The WordNet format ingests prolog-style WordNet databases for broad lexical coverage, mapping synsets to synonym groups automatically. The Solr format is what almost every hand-curated production set uses because the rules are auditable and diff-able in version control, and because each line is a deliberate editorial decision a relevance engineer can defend. WordNet is reserved for cases where you want dictionary-scale synonymy without manual curation — and it comes with a cost, since dictionary synonyms are frequently wrong in domain context (a furniture site does not want chair expanded to professorship), so most teams that start with WordNet end up pruning it back toward a curated Solr file.
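The difference between equivalent and replacement rules is easier to see once the rules are parsed. The function below is a minimal sketch of a Solr-format reader, not the parser Elasticsearch uses; parse_solr_rules and its expand argument are illustrative names that mirror the behavior described above.

```python
def parse_solr_rules(lines, expand=True):
    """Map each input term to the list of terms it is rewritten to."""
    mapping = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comment lines
        if "=>" in line:
            # Replacement rule: everything on the left collapses to the right.
            left, right = line.split("=>", 1)
            sources = [t.strip() for t in left.split(",") if t.strip()]
            targets = [t.strip() for t in right.split(",") if t.strip()]
        else:
            # Equivalent rule: all terms map to all terms, or to the first if expand is off.
            terms = [t.strip() for t in line.split(",") if t.strip()]
            sources = terms
            targets = terms if expand else terms[:1]
        for source in sources:
            bucket = mapping.setdefault(source, [])
            for target in targets:
                if target not in bucket:
                    bucket.append(target)
    return mapping


rules = parse_solr_rules(["ny, new york", "i-pod => ipod"])
for term, outputs in rules.items():
    print(f"{term!r} -> {outputs}")
```

Note that ipod never appears as an input term: the replacement rule only points one way, which is what keeps the noisy variant from remaining searchable.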
Step-by-Step Implementation
1. Define a search-time synonym_graph analyzer
Create an index with a separate search_analyzer carrying the graph filter, while the index analyzer stays clean. This keeps the inverted index free of injected synonym tokens.
PUT /listings
{
  "settings": {
    "analysis": {
      "filter": {
        "geo_synonyms": {
          "type": "synonym_graph",
          "synonyms": [
            "ny, new york",
            "sf, san francisco",
            "apt, apartment, flat"
          ]
        },
        "minimal_stop": {
          "type": "stop",
          "stopwords": ["a", "an", "the"]
        }
      },
      "analyzer": {
        "index_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "minimal_stop"]
        },
        "search_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "minimal_stop", "geo_synonyms"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "index_analyzer",
        "search_analyzer": "search_analyzer"
      }
    }
  }
}
Verify: confirm the graph expansion runs only on the search side.
curl -s "localhost:9200/listings/_analyze" -H 'Content-Type: application/json' -d '{
"analyzer": "search_analyzer",
"text": "ny apartment"
}' | jq '.tokens[] | {token, position, positionLength}'
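You should see ny at position 0 with a positionLength of 2, new at position 0 and york at position 1 forming the parallel path, and apartment at position 2. Tokens that span a single position may come back without a positionLength field, which jq prints as null.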
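If the jq output is hard to read, the same check can be done in Python. The helper below is a sketch that assumes a missing positionLength means 1; describe_tokens is a hypothetical name, and the sample response is written by hand to match the shape of an _analyze reply rather than captured from a live cluster.

```python
def describe_tokens(analyze_response):
    """Return (token, position, positionLength) rows, treating a missing length as 1."""
    rows = []
    for tok in analyze_response["tokens"]:
        rows.append((tok["token"], tok["position"], tok.get("positionLength", 1)))
    return rows


# Hand-written sample in the shape of an _analyze response (assumption, not real output).
sample = {
    "tokens": [
        {"token": "new", "position": 0},
        {"token": "ny", "position": 0, "positionLength": 2},
        {"token": "york", "position": 1},
        {"token": "apartment", "position": 2},
    ]
}

rows = describe_tokens(sample)
multi_position = [token for token, _, length in rows if length > 1]

for row in rows:
    print(row)
print("tokens spanning more than one position:", multi_position)
```

An empty multi_position list for a query that should hit a multi-word rule is a quick signal that the graph filter is not in the analyzer being tested.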
2. Externalize rules to a synonyms file
Inline rules are fine for a dozen entries, but real sets grow into thousands. Move them to a file deployed to every data node and reference it with synonyms_path (relative to the Elasticsearch config directory). Mark the filter updateable so it can be reloaded without a _close/_open cycle.
PUT /listings
{
  "settings": {
    "analysis": {
      "filter": {
        "geo_synonyms": {
          "type": "synonym_graph",
          "synonyms_path": "analysis/geo_synonyms.txt",
          "updateable": true
        }
      },
      "analyzer": {
        "search_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "geo_synonyms"]
        }
      }
    }
  }
}
The file uses Solr format, one rule per line:
# config/analysis/geo_synonyms.txt
ny, new york
sf, san francisco
nyc => new york city
laptop, notebook
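Verify: unless lenient is set, index creation fails when the file is missing or a rule does not parse, so a successful PUT is the first check. Then confirm the filter definition is registered on the index.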
curl -s "localhost:9200/listings/_settings" | jq '.listings.settings.index.analysis.filter.geo_synonyms'
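Once the filter is marked updateable, a changed file is picked up by calling _reload_search_analyzers on the index. The sketch below only builds the request so it can be inspected without a running cluster; build_reload_request is a hypothetical helper, and the host and index name are the ones assumed throughout this guide.

```python
import urllib.request


def build_reload_request(base_url, index):
    """Build (but do not send) a POST to the index's _reload_search_analyzers endpoint."""
    url = f"{base_url.rstrip('/')}/{index}/_reload_search_analyzers"
    return urllib.request.Request(url, method="POST")


req = build_reload_request("http://localhost:9200", "listings")
print(req.get_method(), req.full_url)

# To send it against a running cluster, uncomment:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode())
```

Deploy the updated file to every node before sending the request, otherwise nodes reload whatever copy they have locally.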
3. Query through the search analyzer
Because the synonym filter lives in the search analyzer, a normal match query expands automatically — no special query syntax needed.
import requests

resp = requests.post(
    "http://localhost:9200/listings/_search",
    json={
        "query": {
            # search_analyzer applies geo_synonyms; "ny" matches "new york"
            "match": {"title": {"query": "ny apartment"}}
        }
    },
)
hits = resp.json()["hits"]["hits"]
print([h["_source"]["title"] for h in hits])
Verify: documents containing new york should appear for the query ny. Run the search and confirm recall against your known-good set.
Configuration Reference
| Name | Default | Type | Effect |
|---|---|---|---|
| type (filter) | none | string | synonym_graph enables multi-word/graph-aware expansion; synonym is flat and breaks multi-word phrases. Always prefer synonym_graph. |
| synonyms_path | unset | string | Path (relative to config/) to a rules file. Mutually exclusive with inline synonyms. Required for large, file-managed sets. |
| updateable | false | boolean | When true, the filter may be reloaded via _reload_search_analyzers without closing the index. Only valid in a search-time analyzer. |
| expand | true | boolean | For equivalent-style rules, true expands all terms to all others; false makes the first term the canonical replacement. |
| lenient | false | boolean | When true, skips rules that fail to parse instead of failing analyzer creation. Useful for large noisy files; risky because it hides errors. |
| format | solr | string | solr for line-oriented rules, wordnet to ingest WordNet prolog files. |
| stopwords (stop filter) | _english_ | array/string | The stopword set. Override with a minimal explicit list; the broad _english_ set strips terms that phrase queries depend on. |
| remove_trailing (stop) | true | boolean | When true, a stopword at the end of the stream (e.g. the) is removed like any other. Set it to false to keep the last token even if it is a stopword, which helps autocomplete on partially typed queries. |
Failure Modes & Debugging
Symptom: phrase query for "king of england" matches "king for england"
Root cause: of is in your stopword list, so it is removed from the token stream at both index and search time. The stop filter keeps the position gap but not the term, so the phrase query only requires that some token sit between king and england.
Remediation: trim the stopword list to true noise, reindex so that of is stored for existing documents, and re-analyze to confirm of survives.
curl -s "localhost:9200/listings/_analyze" -H 'Content-Type: application/json' -d '{
"analyzer": "search_analyzer", "text": "king of england"
}' | jq '.tokens[].token'
Symptom: multi-word synonym "new york" returns no extra hits
Root cause: the filter is the flat synonym type, not synonym_graph. The flat filter cannot emit a multi-position path, so multi-word expansions are dropped or stacked incorrectly and the query parser ignores them.
Remediation: change the filter type to synonym_graph and re-create or reload the analyzer, then confirm the multi-token path appears.
curl -s "localhost:9200/listings/_analyze" -H 'Content-Type: application/json' -d '{
"analyzer": "search_analyzer", "text": "ny"
}' | jq '.tokens[] | {token, position, positionLength}'
Symptom: BM25 scores shifted after adding synonyms
Root cause: synonyms were added at index time, injecting tokens into the corpus and inflating document frequencies. IDF for the injected terms dropped, perturbing every score — not just queries that use the synonym.
Remediation: move the filter to the search analyzer so the index stays clean, reindex once, and revalidate scoring against your BM25 Tuning & Weights baseline.
curl -s "localhost:9200/listings/_search" -H 'Content-Type: application/json' -d '{
"explain": true, "query": {"match": {"title": "apartment"}}, "size": 1
}' | jq '.hits.hits[0]._explanation.description'
Symptom: analyzer update silently dropped some rules
Root cause: lenient: true is set, so malformed lines in the synonyms file are skipped without error. A typo (e.g. a stray => or a stopword inside a multi-word rule) removed the rule from the active set.
Remediation: temporarily set lenient: false in staging and re-create the analyzer to surface the failing line, fix it, and redeploy the file.
curl -s -X PUT "localhost:9200/listings-staging" -H 'Content-Type: application/json' \
-d @analyzer_strict.json | jq '.error.reason'
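A pre-deploy lint step can catch these lines before the file ever reaches a node. The checker below is a rough sketch with hypothetical names; it covers only the two typos mentioned above plus empty terms, and it is not a substitute for creating the analyzer with lenient: false.

```python
def lint_synonym_lines(lines, stopwords=frozenset()):
    """Return (line_number, message) pairs for rules that look suspicious."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.count("=>") > 1:
            problems.append((number, "more than one '=>'"))
            continue
        for side in line.split("=>"):
            terms = [t.strip() for t in side.split(",")]
            if any(not t for t in terms):
                problems.append((number, "empty term"))
                break
            for term in terms:
                words = term.lower().split()
                if len(words) > 1 and stopwords.intersection(words):
                    problems.append((number, f"stopword inside multi-word term '{term}'"))
    return problems


# Hypothetical file contents with three deliberately broken lines.
lines = [
    "ny, new york",
    "nyc => => new york city",
    "bank of america, bofa",
    "laptop,, notebook",
]

problems = lint_synonym_lines(lines, stopwords={"of"})
for number, message in problems:
    print(f"line {number}: {message}")
```

Pass the same stopword list the analyzer uses, so the check reflects what the synonym filter will actually see.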
Performance & Scale Notes
Search-time expansion adds query-side cost proportional to the number of terms a token expands into; a query expanding into a 5-way disjunction costs roughly 5x the clause evaluation of the base term. In practice, with synonym sets in the low thousands of rules and queries of 2–4 tokens, the added latency is single-digit milliseconds at p95 and well within budgets for systems already targeting sub-50ms p95 (the same envelope used for autocomplete and suggestions). Index-time expansion has zero query cost but inflates index size by 10–40% depending on synonym density, and every rule change forces a full reindex — measure both against your reindex throughput before choosing it.
Stopword removal is nearly free at query time and reduces postings list sizes, but the relevance cost is asymmetric: a list of 5–15 true-noise tokens captures almost all the storage benefit, while the broad _english_ set of roughly 30 terms, which includes of, in, and, not and to, starts breaking phrase and synonym behavior. Benchmark recall and zero-result rate on a held-out query set after any stopword change; a recall drop greater than 1–2% on phrase-heavy queries is the signal that your list is too aggressive. For large file-managed sets across many data nodes, prefer updateable: true and reload analyzers rather than reindexing, since reloading thousands of rules completes in seconds versus minutes-to-hours for a rebuild.
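The two benchmark numbers are simple to compute once you have judged queries. The sketch below assumes results are already collected as lists of document IDs per query; the function names, query strings, and IDs are all hypothetical.

```python
def recall(returned_ids, relevant_ids):
    """Fraction of the relevant documents that were returned."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0
    return len(relevant.intersection(returned_ids)) / len(relevant)


def mean_recall(results_by_query, judged):
    """Average recall over every judged query."""
    scores = [recall(results_by_query.get(q, []), ids) for q, ids in judged.items()]
    return sum(scores) / len(scores)


def zero_result_rate(results_by_query):
    """Share of queries that returned nothing."""
    if not results_by_query:
        return 0.0
    empty = sum(1 for ids in results_by_query.values() if not ids)
    return empty / len(results_by_query)


# Hypothetical held-out set and two result snapshots.
judged = {"ny apartment": ["d1", "d2", "d3", "d4"], "king of england": ["d7", "d8"]}
before = {"ny apartment": ["d1", "d2", "d3", "d4"], "king of england": ["d7", "d8"]}
after = {"ny apartment": ["d1", "d2", "d3"], "king of england": []}

recall_drop = mean_recall(before, judged) - mean_recall(after, judged)
print(f"recall drop: {recall_drop:.3f}")
print(f"zero-result rate after change: {zero_result_rate(after):.2f}")
```

Run the comparison on phrase-heavy queries separately from the full set, since an aggregate number can hide the regression described above.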
Related
- Updating synonyms without reindexing: apply rule changes live with _reload_search_analyzers.
- BM25 Tuning & Weights: why injected or stripped tokens move your IDF table and scores.
- Custom Scoring Functions: layer business logic on top of synonym-expanded recall.
- Query Autocomplete & Suggestions: where stopword and synonym choices surface to users typing queries.
- Elasticsearch Fundamentals for Engineers: the analyzer pipeline these filters plug into.