[main] Fix an OOM error when creating to many chained synonym graph token filter. (#140026) by tomerqodo · Pull Request #11 · qodo-benchmark/elasticsearch-combined-qodo

tomerqodo · 2026-01-14T16:17:47Z

…lter. (elastic#140026) * Create YAML REST test for chained synonym filters. * Adding some unit tests for chained synonyms graph filters creation. * Ensure previous synonyms set from the chain aren't applied again. (cherry picked from commit e6bd70d)

qodo-code-review · 2026-01-14T16:25:26Z

Code Review by Qodo (Alpha)

🐞 Bugs (3) 📘 Rule Violations (0) 📎 Requirement Gaps (0) 💡 Suggestions (0)

1. No-op synonym parsing override 🐞 Bug

Description

• The new getSynonymFilter() override in the chain-aware synonym_graph factory returns this,
  which is also the TokenFilterFactory default, so it does not actually remove synonym filters from
  the synonym-map-building analyzer.
• SynonymTokenFilterFactory explicitly returns IDENTITY_FILTER for synonym parsing to support
  chaining without applying synonyms to the synonym map; synonym_graph should do the same if it aims
  to prevent recursive expansion.
• As written, chaining synonym_graph filters can still build later synonym maps using analyzers that
  apply earlier synonym_graph filters, contradicting the comment and intended OOM prevention.

correctness

Code

modules/analysis-common/src/main/java/org/elasticsearch/analysis/common/SynonymGraphTokenFilterFactory.java[R69-85]

+            @Override
+            public TokenFilterFactory getSynonymFilter() {
+                // When building a synonym filter, we must prevent previous synonym filters in the chain
+                // from being active, as this would cause recursive synonym expansion during the building phase.
+                //
+                // Without this fix, when chaining multiple synonym filters (e.g., synonym_A → synonym_B → synonym_C),
+                // building synonym_C would use an analyzer containing active synonym_A and synonym_B filters.
+                // This causes:
+                // 1. Recursive synonym expansion when parsing synonym rules (e.g., synonyms are expanded via previous filters)
+                // 2. Each SynonymMap inflates since it applies all previous synonym rules again
+                // 3. Triggering O(n²) operations in SynonymGraphFilter.bufferOutputTokens()
+                // 4. Massive memory allocation during analyzer reload → OutOfMemoryError
+                //
+                // This matches the behavior of SynonymTokenFilterFactory and prevents OOM with chained
+                // synonym filters (critical for users with many chained synonym sets).
+                return this;
+            }

Evidence
Synonym analyzers used to build synonym maps are constructed from prior filters’
getSynonymFilter() results. The synonym (non-graph) implementation returns IDENTITY_FILTER
specifically to avoid applying synonyms while parsing/building synonyms. The synonym_graph
chain-aware factory returns this, which is identical to the interface default, so it does not
implement the same protection mechanism the comment claims.
modules/analysis-common/src/main/java/org/elasticsearch/analysis/common/SynonymGraphTokenFilterFactory.java[45-85]
modules/analysis-common/src/main/java/org/elasticsearch/analysis/common/SynonymTokenFilterFactory.java[197-227]
server/src/main/java/org/elasticsearch/index/analysis/TokenFilterFactory.java[69-114]

Agent Prompt

### Issue description
`SynonymGraphTokenFilterFactory`’s chain-aware factory overrides `getSynonymFilter()` but returns `this`, which is the same as the `TokenFilterFactory` default. This does not prevent earlier synonym filters from being applied during synonym-map parsing/building for later synonym filters.

### Issue Context
`SynonymTokenFilterFactory` enables chaining by returning `IDENTITY_FILTER` for synonym parsing; `buildSynonymAnalyzer()` builds the analyzer for parsing synonym rules by mapping all prior filters through `getSynonymFilter()`.

### Fix Focus Areas
- modules/analysis-common/src/main/java/org/elasticsearch/analysis/common/SynonymGraphTokenFilterFactory.java[54-85]
- modules/analysis-common/src/main/java/org/elasticsearch/analysis/common/SynonymTokenFilterFactory.java[197-227]
- server/src/main/java/org/elasticsearch/index/analysis/TokenFilterFactory.java[69-114]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

2. Huge randomized test load 🐞 Bug

Description

• testManyChainedSynonymGraphFilters generates extremely large synonym inputs (50k–100k vocab, 10
  filters × 10k rules) and then runs 1000 analysis iterations.
• This is likely to be very slow and memory intensive in standard CI/unit test runs, causing
  timeouts/OOM/flakiness.
• The test should be scaled down by default and/or gated (e.g., @Nightly or TEST_NIGHTLY-based
  scaling) while still preserving regression coverage.

performance

Code

modules/analysis-common/src/test/java/org/elasticsearch/analysis/common/SynonymsAnalysisTests.java[R380-436]

+        String[] vocab = randomArray(50_000, 100_000, String[]::new, () -> randomAlphanumericOfLength(20));
+        int synonymsPerFilter = 10_000;
+        int synonymSets = 10;
+        List<String> filterNames = new ArrayList<>();
+        filterNames.add("lowercase");
+
+        for (int i = 1; i <= synonymSets; i++) {
+            String filterName = "synonyms_" + i;
+            StringBuilder sb = new StringBuilder();
+
+            for (int j = 0; j < synonymsPerFilter; j++) {
+                if (j > 0) {
+                    sb.append("\n");
+                }
+                for (int k = 0; k < between(1, 3); k++) {
+                    if (k > 0) {
+                        sb.append(", ");
+                    }
+                    for (int l = 0; l < between(1, 3); l++) {
+                        if (l > 0) {
+                            sb.append(" ");
+                        }
+                        sb.append(randomFrom(vocab));
+                    }
+                }
+
+                sb.append(" => ");
+                sb.append("syn").append(i * (j + 1)); // Shared ID appears in ALL filters
+            }
+
+            filterNames.add(filterName);
+            settingsBuilder.put("index.analysis.filter." + filterName + ".type", "synonym_graph")
+                .put("index.analysis.filter." + filterName + ".lenient", true)
+                .putList("index.analysis.filter." + filterName + ".synonyms", sb.toString());
+        }
+
+        settingsBuilder.put("index.analysis.analyzer.many_syn.tokenizer", "standard")
+            .putList("index.analysis.analyzer.many_syn.filter", filterNames);
+
+        Settings settings = settingsBuilder.build();
+        IndexSettings idxSettings = IndexSettingsModule.newIndexSettings("index", settings);
+
+        // This would OOM without the SynonymGraphTokenFilterFactory::getSynonymFilter() fix (filters built sequentially)
+        indexAnalyzers = createTestAnalysis(idxSettings, settings, new CommonAnalysisPlugin()).indexAnalyzers;
+
+        // Verify the analyzer was built successfully and can analyze text
+        // With cross-referencing synonyms, the exact output is complex, so just verify it works
+        Analyzer analyzer = indexAnalyzers.get("many_syn");
+        assertNotNull("Analyzer should be created", analyzer);
+
+        for (int i = 0; i < 1000; i++) {
+            // Test that it can analyze without throwing exceptions
+            TokenStream ts = analyzer.tokenStream("test", randomFrom(vocab));
+            ts.reset();
+            assertTrue("Should produce at least one token", ts.incrementToken());
+            ts.close();
+        }

Evidence
The test constructs very large random synonym rule sets and then repeatedly analyzes random inputs.
This is unboundedly expensive for a normal unit test suite and will vary run-to-run due to
randomness.
modules/analysis-common/src/test/java/org/elasticsearch/analysis/common/SynonymsAnalysisTests.java[375-436]

Agent Prompt

### Issue description
`testManyChainedSynonymGraphFilters` is extremely expensive (large random vocab + 100k+ rules + 1000 iterations). This can destabilize CI/unit test runs.

### Issue Context
The test’s goal is valuable (regression for chained synonym_graph OOM), but it should not run at max scale in every unit test execution.

### Fix Focus Areas
- modules/analysis-common/src/test/java/org/elasticsearch/analysis/common/SynonymsAnalysisTests.java[375-436]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

3. Synonym order tests disagree 🐞 Bug

Description

• The unit test expects synonym_graph output ordering that differs between the single-token and
  multi-token cases (e.g., foo → [foo, bar] but foo baz hello → [bar, foo, ...]).
• The REST YAML test asserts a different ordering (foo then bar for position 0), which conflicts
  with the unit test’s multi-token expectation.
• These order-specific assertions are brittle; tests should assert per-position token sets (or avoid
  ordering assertions in REST YAML), otherwise they may fail despite semantically correct synonym
  expansion.

reliability

Code

modules/analysis-common/src/test/java/org/elasticsearch/analysis/common/SynonymsAnalysisTests.java[R354-372]

+        // Test single word - synonym_graph produces both original and synonym at same position
+        BaseTokenStreamTestCase.assertAnalyzesTo(
+            indexAnalyzers.get("syn"),
+            "foo",
+            new String[] { "foo", "bar" },
+            new int[] { 0, 0 }, // start offsets
+            new int[] { 3, 3 }, // end offsets
+            new int[] { 1, 0 }  // position increments
+        );
+
+        // Test multi-word query with all three filters active
+        BaseTokenStreamTestCase.assertAnalyzesTo(
+            indexAnalyzers.get("syn"),
+            "foo baz hello",
+            new String[] { "bar", "foo", "qux", "baz", "world", "hello" },
+            new int[] { 0, 0, 4, 4, 8, 8 },     // start offsets
+            new int[] { 3, 3, 7, 7, 13, 13 },  // end offsets
+            new int[] { 1, 0, 1, 0, 1, 0 }     // position increments: each synonym pair at same position
+        );

Evidence
The unit test encodes inconsistent ordering assumptions for synonym_graph output across cases, and
the REST YAML test encodes a different ordering assumption. This makes test expectations fragile and
potentially conflicting across suites.
modules/analysis-common/src/test/java/org/elasticsearch/analysis/common/SynonymsAnalysisTests.java[354-372]
modules/analysis-common/src/yamlRestTest/resources/rest-api-spec/test/analysis-common/50_chained_synonym_filters.yml[48-81]

Agent Prompt

### Issue description
The new tests encode conflicting assumptions about synonym_graph token ordering. This can make tests brittle or inconsistent across suites.

### Issue Context
Synonym_graph produces stacked tokens at the same position; ordering among stacked tokens is often an implementation detail and may change.

### Fix Focus Areas
- modules/analysis-common/src/test/java/org/elasticsearch/analysis/common/SynonymsAnalysisTests.java[354-372]
- modules/analysis-common/src/yamlRestTest/resources/rest-api-spec/test/analysis-common/50_chained_synonym_filters.yml[48-81]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

qodo-code-review · 2026-01-14T16:25:27Z

+            @Override
+            public TokenFilterFactory getSynonymFilter() {
+                // When building a synonym filter, we must prevent previous synonym filters in the chain
+                // from being active, as this would cause recursive synonym expansion during the building phase.
+                //
+                // Without this fix, when chaining multiple synonym filters (e.g., synonym_A → synonym_B → synonym_C),
+                // building synonym_C would use an analyzer containing active synonym_A and synonym_B filters.
+                // This causes:
+                // 1. Recursive synonym expansion when parsing synonym rules (e.g., synonyms are expanded via previous filters)
+                // 2. Each SynonymMap inflates since it applies all previous synonym rules again
+                // 3. Triggering O(n²) operations in SynonymGraphFilter.bufferOutputTokens()
+                // 4. Massive memory allocation during analyzer reload → OutOfMemoryError
+                //
+                // This matches the behavior of SynonymTokenFilterFactory and prevents OOM with chained
+                // synonym filters (critical for users with many chained synonym sets).
+                return this;
+            }


1. No-op synonym parsing override 🐞 Bug

• The new getSynonymFilter() override in the chain-aware synonym_graph factory returns this, which is also the TokenFilterFactory default, so it does not actually remove synonym filters from the synonym-map-building analyzer. • SynonymTokenFilterFactory explicitly returns IDENTITY_FILTER for synonym parsing to support chaining without applying synonyms to the synonym map; synonym_graph should do the same if it aims to prevent recursive expansion. • As written, chaining synonym_graph filters can still build later synonym maps using analyzers that apply earlier synonym_graph filters, contradicting the comment and intended OOM prevention.

Agent Prompt

### Issue description `SynonymGraphTokenFilterFactory`’s chain-aware factory overrides `getSynonymFilter()` but returns `this`, which is the same as the `TokenFilterFactory` default. This does not prevent earlier synonym filters from being applied during synonym-map parsing/building for later synonym filters. ### Issue Context `SynonymTokenFilterFactory` enables chaining by returning `IDENTITY_FILTER` for synonym parsing; `buildSynonymAnalyzer()` builds the analyzer for parsing synonym rules by mapping all prior filters through `getSynonymFilter()`. ### Fix Focus Areas - modules/analysis-common/src/main/java/org/elasticsearch/analysis/common/SynonymGraphTokenFilterFactory.java[54-85] - modules/analysis-common/src/main/java/org/elasticsearch/analysis/common/SynonymTokenFilterFactory.java[197-227] - server/src/main/java/org/elasticsearch/index/analysis/TokenFilterFactory.java[69-114]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

qodo-code-review · 2026-01-14T16:25:27Z

+        String[] vocab = randomArray(50_000, 100_000, String[]::new, () -> randomAlphanumericOfLength(20));
+        int synonymsPerFilter = 10_000;
+        int synonymSets = 10;
+        List<String> filterNames = new ArrayList<>();
+        filterNames.add("lowercase");
+
+        for (int i = 1; i <= synonymSets; i++) {
+            String filterName = "synonyms_" + i;
+            StringBuilder sb = new StringBuilder();
+
+            for (int j = 0; j < synonymsPerFilter; j++) {
+                if (j > 0) {
+                    sb.append("\n");
+                }
+                for (int k = 0; k < between(1, 3); k++) {
+                    if (k > 0) {
+                        sb.append(", ");
+                    }
+                    for (int l = 0; l < between(1, 3); l++) {
+                        if (l > 0) {
+                            sb.append(" ");
+                        }
+                        sb.append(randomFrom(vocab));
+                    }
+                }
+
+                sb.append(" => ");
+                sb.append("syn").append(i * (j + 1)); // Shared ID appears in ALL filters
+            }
+
+            filterNames.add(filterName);
+            settingsBuilder.put("index.analysis.filter." + filterName + ".type", "synonym_graph")
+                .put("index.analysis.filter." + filterName + ".lenient", true)
+                .putList("index.analysis.filter." + filterName + ".synonyms", sb.toString());
+        }
+
+        settingsBuilder.put("index.analysis.analyzer.many_syn.tokenizer", "standard")
+            .putList("index.analysis.analyzer.many_syn.filter", filterNames);
+
+        Settings settings = settingsBuilder.build();
+        IndexSettings idxSettings = IndexSettingsModule.newIndexSettings("index", settings);
+
+        // This would OOM without the SynonymGraphTokenFilterFactory::getSynonymFilter() fix (filters built sequentially)
+        indexAnalyzers = createTestAnalysis(idxSettings, settings, new CommonAnalysisPlugin()).indexAnalyzers;
+
+        // Verify the analyzer was built successfully and can analyze text
+        // With cross-referencing synonyms, the exact output is complex, so just verify it works
+        Analyzer analyzer = indexAnalyzers.get("many_syn");
+        assertNotNull("Analyzer should be created", analyzer);
+
+        for (int i = 0; i < 1000; i++) {
+            // Test that it can analyze without throwing exceptions
+            TokenStream ts = analyzer.tokenStream("test", randomFrom(vocab));
+            ts.reset();
+            assertTrue("Should produce at least one token", ts.incrementToken());
+            ts.close();
+        }


2. Huge randomized test load 🐞 Bug

• testManyChainedSynonymGraphFilters generates extremely large synonym inputs (50k–100k vocab, 10 filters × 10k rules) and then runs 1000 analysis iterations. • This is likely to be very slow and memory intensive in standard CI/unit test runs, causing timeouts/OOM/flakiness. • The test should be scaled down by default and/or gated (e.g., @Nightly or TEST_NIGHTLY-based scaling) while still preserving regression coverage.

Agent Prompt

### Issue description `testManyChainedSynonymGraphFilters` is extremely expensive (large random vocab + 100k+ rules + 1000 iterations). This can destabilize CI/unit test runs. ### Issue Context The test’s goal is valuable (regression for chained synonym_graph OOM), but it should not run at max scale in every unit test execution. ### Fix Focus Areas - modules/analysis-common/src/test/java/org/elasticsearch/analysis/common/SynonymsAnalysisTests.java[375-436]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

qodo-code-review · 2026-01-14T16:25:27Z

+        // Test single word - synonym_graph produces both original and synonym at same position
+        BaseTokenStreamTestCase.assertAnalyzesTo(
+            indexAnalyzers.get("syn"),
+            "foo",
+            new String[] { "foo", "bar" },
+            new int[] { 0, 0 }, // start offsets
+            new int[] { 3, 3 }, // end offsets
+            new int[] { 1, 0 }  // position increments
+        );
+
+        // Test multi-word query with all three filters active
+        BaseTokenStreamTestCase.assertAnalyzesTo(
+            indexAnalyzers.get("syn"),
+            "foo baz hello",
+            new String[] { "bar", "foo", "qux", "baz", "world", "hello" },
+            new int[] { 0, 0, 4, 4, 8, 8 },     // start offsets
+            new int[] { 3, 3, 7, 7, 13, 13 },  // end offsets
+            new int[] { 1, 0, 1, 0, 1, 0 }     // position increments: each synonym pair at same position
+        );


3. Synonym order tests disagree 🐞 Bug

• The unit test expects synonym_graph output ordering that differs between the single-token and multi-token cases (e.g., foo → [foo, bar] but foo baz hello → [bar, foo, ...]). • The REST YAML test asserts a different ordering (foo then bar for position 0), which conflicts with the unit test’s multi-token expectation. • These order-specific assertions are brittle; tests should assert per-position token sets (or avoid ordering assertions in REST YAML), otherwise they may fail despite semantically correct synonym expansion.

Agent Prompt

### Issue description The new tests encode conflicting assumptions about synonym_graph token ordering. This can make tests brittle or inconsistent across suites. ### Issue Context Synonym_graph produces stacked tokens at the same position; ordering among stacked tokens is often an implementation detail and may change. ### Fix Focus Areas - modules/analysis-common/src/test/java/org/elasticsearch/analysis/common/SynonymsAnalysisTests.java[354-372] - modules/analysis-common/src/yamlRestTest/resources/rest-api-spec/test/analysis-common/50_chained_synonym_filters.yml[48-81]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

afoucret and others added 4 commits December 31, 2025 10:59

Merge branch 'main' into backport/main/pr-140026

3538330

Merge branch 'main' into backport/main/pr-140026

e907faa

Apply changes for benchmark PR

8aaa34f

qodo-code-review Bot reviewed Jan 14, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[main] Fix an OOM error when creating to many chained synonym graph token filter. (#140026)#11

tomerqodo commented Jan 14, 2026

Uh oh!

qodo-code-review Bot commented Jan 14, 2026

Uh oh!

qodo-code-review Bot Jan 14, 2026

Uh oh!

qodo-code-review Bot Jan 14, 2026

Uh oh!

qodo-code-review Bot Jan 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

tomerqodo commented Jan 14, 2026

Uh oh!

qodo-code-review Bot commented Jan 14, 2026

Code Review by Qodo (Alpha)

Uh oh!

qodo-code-review Bot Jan 14, 2026

Choose a reason for hiding this comment

Uh oh!

qodo-code-review Bot Jan 14, 2026

Choose a reason for hiding this comment

Uh oh!

qodo-code-review Bot Jan 14, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants