Skip to content

[main] Fix an OOM error when creating to many chained synonym graph token filter. (#140026)#11

Open
tomerqodo wants to merge 4 commits into
qodo_only-issues-20260113-qodo-grep-copilot_base_main_fix_an_oom_error_when_creating_to_many_chained_synonym_graph_token_filter_140026_pr68from
qodo_only-issues-20260113-qodo-grep-copilot_head_main_fix_an_oom_error_when_creating_to_many_chained_synonym_graph_token_filter_140026_pr68
Open

[main] Fix an OOM error when creating to many chained synonym graph token filter. (#140026)#11
tomerqodo wants to merge 4 commits into
qodo_only-issues-20260113-qodo-grep-copilot_base_main_fix_an_oom_error_when_creating_to_many_chained_synonym_graph_token_filter_140026_pr68from
qodo_only-issues-20260113-qodo-grep-copilot_head_main_fix_an_oom_error_when_creating_to_many_chained_synonym_graph_token_filter_140026_pr68

Conversation

@tomerqodo
Copy link
Copy Markdown

Benchmark PR from qodo-benchmark#68

afoucret and others added 4 commits December 31, 2025 10:59
…lter. (elastic#140026)

* Create YAML REST test for chained synonym filters.

* Adding some unit tests for chained synonyms graph filters creation.

* Ensure previous synonyms set from the chain aren't applied again.

(cherry picked from commit e6bd70d)
@qodo-code-review
Copy link
Copy Markdown

Code Review by Qodo (Alpha)

🐞 Bugs (3) 📘 Rule Violations (0) 📎 Requirement Gaps (0) 💡 Suggestions (0)

Grey Divider


Action Required

1. No-op synonym parsing override 🐞 Bug
Description
• The new getSynonymFilter() override in the chain-aware synonym_graph factory returns this,
  which is also the TokenFilterFactory default, so it does not actually remove synonym filters from
  the synonym-map-building analyzer.
• SynonymTokenFilterFactory explicitly returns IDENTITY_FILTER for synonym parsing to support
  chaining without applying synonyms to the synonym map; synonym_graph should do the same if it aims
  to prevent recursive expansion.
• As written, chaining synonym_graph filters can still build later synonym maps using analyzers that
  apply earlier synonym_graph filters, contradicting the comment and intended OOM prevention.

correctness

Code

modules/analysis-common/src/main/java/org/elasticsearch/analysis/common/SynonymGraphTokenFilterFactory.java[R69-85]

+            @Override
+            public TokenFilterFactory getSynonymFilter() {
+                // When building a synonym filter, we must prevent previous synonym filters in the chain
+                // from being active, as this would cause recursive synonym expansion during the building phase.
+                //
+                // Without this fix, when chaining multiple synonym filters (e.g., synonym_A → synonym_B → synonym_C),
+                // building synonym_C would use an analyzer containing active synonym_A and synonym_B filters.
+                // This causes:
+                // 1. Recursive synonym expansion when parsing synonym rules (e.g., synonyms are expanded via previous filters)
+                // 2. Each SynonymMap inflates since it applies all previous synonym rules again
+                // 3. Triggering O(n²) operations in SynonymGraphFilter.bufferOutputTokens()
+                // 4. Massive memory allocation during analyzer reload → OutOfMemoryError
+                //
+                // This matches the behavior of SynonymTokenFilterFactory and prevents OOM with chained
+                // synonym filters (critical for users with many chained synonym sets).
+                return this;
+            }
Evidence
Synonym analyzers used to build synonym maps are constructed from prior filters’
getSynonymFilter() results. The synonym (non-graph) implementation returns IDENTITY_FILTER
specifically to avoid applying synonyms while parsing/building synonyms. The synonym_graph
chain-aware factory returns this, which is identical to the interface default, so it does not
implement the same protection mechanism the comment claims.

modules/analysis-common/src/main/java/org/elasticsearch/analysis/common/SynonymGraphTokenFilterFactory.java[45-85]
modules/analysis-common/src/main/java/org/elasticsearch/analysis/common/SynonymTokenFilterFactory.java[197-227]
server/src/main/java/org/elasticsearch/index/analysis/TokenFilterFactory.java[69-114]

Agent Prompt
### Issue description
`SynonymGraphTokenFilterFactory`’s chain-aware factory overrides `getSynonymFilter()` but returns `this`, which is the same as the `TokenFilterFactory` default. This does not prevent earlier synonym filters from being applied during synonym-map parsing/building for later synonym filters.

### Issue Context
`SynonymTokenFilterFactory` enables chaining by returning `IDENTITY_FILTER` for synonym parsing; `buildSynonymAnalyzer()` builds the analyzer for parsing synonym rules by mapping all prior filters through `getSynonymFilter()`.

### Fix Focus Areas
- modules/analysis-common/src/main/java/org/elasticsearch/analysis/common/SynonymGraphTokenFilterFactory.java[54-85]
- modules/analysis-common/src/main/java/org/elasticsearch/analysis/common/SynonymTokenFilterFactory.java[197-227]
- server/src/main/java/org/elasticsearch/index/analysis/TokenFilterFactory.java[69-114]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation Recommended

2. Huge randomized test load 🐞 Bug
Description
testManyChainedSynonymGraphFilters generates extremely large synonym inputs (50k–100k vocab, 10
  filters × 10k rules) and then runs 1000 analysis iterations.
• This is likely to be very slow and memory intensive in standard CI/unit test runs, causing
  timeouts/OOM/flakiness.
• The test should be scaled down by default and/or gated (e.g., @Nightly or TEST_NIGHTLY-based
  scaling) while still preserving regression coverage.

performance

Code

modules/analysis-common/src/test/java/org/elasticsearch/analysis/common/SynonymsAnalysisTests.java[R380-436]

+        String[] vocab = randomArray(50_000, 100_000, String[]::new, () -> randomAlphanumericOfLength(20));
+        int synonymsPerFilter = 10_000;
+        int synonymSets = 10;
+        List<String> filterNames = new ArrayList<>();
+        filterNames.add("lowercase");
+
+        for (int i = 1; i <= synonymSets; i++) {
+            String filterName = "synonyms_" + i;
+            StringBuilder sb = new StringBuilder();
+
+            for (int j = 0; j < synonymsPerFilter; j++) {
+                if (j > 0) {
+                    sb.append("\n");
+                }
+                for (int k = 0; k < between(1, 3); k++) {
+                    if (k > 0) {
+                        sb.append(", ");
+                    }
+                    for (int l = 0; l < between(1, 3); l++) {
+                        if (l > 0) {
+                            sb.append(" ");
+                        }
+                        sb.append(randomFrom(vocab));
+                    }
+                }
+
+                sb.append(" => ");
+                sb.append("syn").append(i * (j + 1)); // Shared ID appears in ALL filters
+            }
+
+            filterNames.add(filterName);
+            settingsBuilder.put("index.analysis.filter." + filterName + ".type", "synonym_graph")
+                .put("index.analysis.filter." + filterName + ".lenient", true)
+                .putList("index.analysis.filter." + filterName + ".synonyms", sb.toString());
+        }
+
+        settingsBuilder.put("index.analysis.analyzer.many_syn.tokenizer", "standard")
+            .putList("index.analysis.analyzer.many_syn.filter", filterNames);
+
+        Settings settings = settingsBuilder.build();
+        IndexSettings idxSettings = IndexSettingsModule.newIndexSettings("index", settings);
+
+        // This would OOM without the SynonymGraphTokenFilterFactory::getSynonymFilter() fix (filters built sequentially)
+        indexAnalyzers = createTestAnalysis(idxSettings, settings, new CommonAnalysisPlugin()).indexAnalyzers;
+
+        // Verify the analyzer was built successfully and can analyze text
+        // With cross-referencing synonyms, the exact output is complex, so just verify it works
+        Analyzer analyzer = indexAnalyzers.get("many_syn");
+        assertNotNull("Analyzer should be created", analyzer);
+
+        for (int i = 0; i < 1000; i++) {
+            // Test that it can analyze without throwing exceptions
+            TokenStream ts = analyzer.tokenStream("test", randomFrom(vocab));
+            ts.reset();
+            assertTrue("Should produce at least one token", ts.incrementToken());
+            ts.close();
+        }
Evidence
The test constructs very large random synonym rule sets and then repeatedly analyzes random inputs.
This is unboundedly expensive for a normal unit test suite and will vary run-to-run due to
randomness.

modules/analysis-common/src/test/java/org/elasticsearch/analysis/common/SynonymsAnalysisTests.java[375-436]

Agent Prompt
### Issue description
`testManyChainedSynonymGraphFilters` is extremely expensive (large random vocab + 100k+ rules + 1000 iterations). This can destabilize CI/unit test runs.

### Issue Context
The test’s goal is valuable (regression for chained synonym_graph OOM), but it should not run at max scale in every unit test execution.

### Fix Focus Areas
- modules/analysis-common/src/test/java/org/elasticsearch/analysis/common/SynonymsAnalysisTests.java[375-436]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. Synonym order tests disagree 🐞 Bug
Description
• The unit test expects synonym_graph output ordering that differs between the single-token and
  multi-token cases (e.g., foo[foo, bar] but foo baz hello[bar, foo, ...]).
• The REST YAML test asserts a different ordering (foo then bar for position 0), which conflicts
  with the unit test’s multi-token expectation.
• These order-specific assertions are brittle; tests should assert per-position token sets (or avoid
  ordering assertions in REST YAML), otherwise they may fail despite semantically correct synonym
  expansion.

reliability

Code

modules/analysis-common/src/test/java/org/elasticsearch/analysis/common/SynonymsAnalysisTests.java[R354-372]

+        // Test single word - synonym_graph produces both original and synonym at same position
+        BaseTokenStreamTestCase.assertAnalyzesTo(
+            indexAnalyzers.get("syn"),
+            "foo",
+            new String[] { "foo", "bar" },
+            new int[] { 0, 0 }, // start offsets
+            new int[] { 3, 3 }, // end offsets
+            new int[] { 1, 0 }  // position increments
+        );
+
+        // Test multi-word query with all three filters active
+        BaseTokenStreamTestCase.assertAnalyzesTo(
+            indexAnalyzers.get("syn"),
+            "foo baz hello",
+            new String[] { "bar", "foo", "qux", "baz", "world", "hello" },
+            new int[] { 0, 0, 4, 4, 8, 8 },     // start offsets
+            new int[] { 3, 3, 7, 7, 13, 13 },  // end offsets
+            new int[] { 1, 0, 1, 0, 1, 0 }     // position increments: each synonym pair at same position
+        );
Evidence
The unit test encodes inconsistent ordering assumptions for synonym_graph output across cases, and
the REST YAML test encodes a different ordering assumption. This makes test expectations fragile and
potentially conflicting across suites.

modules/analysis-common/src/test/java/org/elasticsearch/analysis/common/SynonymsAnalysisTests.java[354-372]
modules/analysis-common/src/yamlRestTest/resources/rest-api-spec/test/analysis-common/50_chained_synonym_filters.yml[48-81]

Agent Prompt
### Issue description
The new tests encode conflicting assumptions about synonym_graph token ordering. This can make tests brittle or inconsistent across suites.

### Issue Context
Synonym_graph produces stacked tokens at the same position; ordering among stacked tokens is often an implementation detail and may change.

### Fix Focus Areas
- modules/analysis-common/src/test/java/org/elasticsearch/analysis/common/SynonymsAnalysisTests.java[354-372]
- modules/analysis-common/src/yamlRestTest/resources/rest-api-spec/test/analysis-common/50_chained_synonym_filters.yml[48-81]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

Comment on lines +69 to +85
@Override
public TokenFilterFactory getSynonymFilter() {
// When building a synonym filter, we must prevent previous synonym filters in the chain
// from being active, as this would cause recursive synonym expansion during the building phase.
//
// Without this fix, when chaining multiple synonym filters (e.g., synonym_A → synonym_B → synonym_C),
// building synonym_C would use an analyzer containing active synonym_A and synonym_B filters.
// This causes:
// 1. Recursive synonym expansion when parsing synonym rules (e.g., synonyms are expanded via previous filters)
// 2. Each SynonymMap inflates since it applies all previous synonym rules again
// 3. Triggering O(n²) operations in SynonymGraphFilter.bufferOutputTokens()
// 4. Massive memory allocation during analyzer reload → OutOfMemoryError
//
// This matches the behavior of SynonymTokenFilterFactory and prevents OOM with chained
// synonym filters (critical for users with many chained synonym sets).
return this;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action Required

1. No-op synonym parsing override 🐞 Bug

• The new getSynonymFilter() override in the chain-aware synonym_graph factory returns this,
  which is also the TokenFilterFactory default, so it does not actually remove synonym filters from
  the synonym-map-building analyzer.
• SynonymTokenFilterFactory explicitly returns IDENTITY_FILTER for synonym parsing to support
  chaining without applying synonyms to the synonym map; synonym_graph should do the same if it aims
  to prevent recursive expansion.
• As written, chaining synonym_graph filters can still build later synonym maps using analyzers that
  apply earlier synonym_graph filters, contradicting the comment and intended OOM prevention.
Agent Prompt
### Issue description
`SynonymGraphTokenFilterFactory`’s chain-aware factory overrides `getSynonymFilter()` but returns `this`, which is the same as the `TokenFilterFactory` default. This does not prevent earlier synonym filters from being applied during synonym-map parsing/building for later synonym filters.

### Issue Context
`SynonymTokenFilterFactory` enables chaining by returning `IDENTITY_FILTER` for synonym parsing; `buildSynonymAnalyzer()` builds the analyzer for parsing synonym rules by mapping all prior filters through `getSynonymFilter()`.

### Fix Focus Areas
- modules/analysis-common/src/main/java/org/elasticsearch/analysis/common/SynonymGraphTokenFilterFactory.java[54-85]
- modules/analysis-common/src/main/java/org/elasticsearch/analysis/common/SynonymTokenFilterFactory.java[197-227]
- server/src/main/java/org/elasticsearch/index/analysis/TokenFilterFactory.java[69-114]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Comment on lines +380 to +436
String[] vocab = randomArray(50_000, 100_000, String[]::new, () -> randomAlphanumericOfLength(20));
int synonymsPerFilter = 10_000;
int synonymSets = 10;
List<String> filterNames = new ArrayList<>();
filterNames.add("lowercase");

for (int i = 1; i <= synonymSets; i++) {
String filterName = "synonyms_" + i;
StringBuilder sb = new StringBuilder();

for (int j = 0; j < synonymsPerFilter; j++) {
if (j > 0) {
sb.append("\n");
}
for (int k = 0; k < between(1, 3); k++) {
if (k > 0) {
sb.append(", ");
}
for (int l = 0; l < between(1, 3); l++) {
if (l > 0) {
sb.append(" ");
}
sb.append(randomFrom(vocab));
}
}

sb.append(" => ");
sb.append("syn").append(i * (j + 1)); // Shared ID appears in ALL filters
}

filterNames.add(filterName);
settingsBuilder.put("index.analysis.filter." + filterName + ".type", "synonym_graph")
.put("index.analysis.filter." + filterName + ".lenient", true)
.putList("index.analysis.filter." + filterName + ".synonyms", sb.toString());
}

settingsBuilder.put("index.analysis.analyzer.many_syn.tokenizer", "standard")
.putList("index.analysis.analyzer.many_syn.filter", filterNames);

Settings settings = settingsBuilder.build();
IndexSettings idxSettings = IndexSettingsModule.newIndexSettings("index", settings);

// This would OOM without the SynonymGraphTokenFilterFactory::getSynonymFilter() fix (filters built sequentially)
indexAnalyzers = createTestAnalysis(idxSettings, settings, new CommonAnalysisPlugin()).indexAnalyzers;

// Verify the analyzer was built successfully and can analyze text
// With cross-referencing synonyms, the exact output is complex, so just verify it works
Analyzer analyzer = indexAnalyzers.get("many_syn");
assertNotNull("Analyzer should be created", analyzer);

for (int i = 0; i < 1000; i++) {
// Test that it can analyze without throwing exceptions
TokenStream ts = analyzer.tokenStream("test", randomFrom(vocab));
ts.reset();
assertTrue("Should produce at least one token", ts.incrementToken());
ts.close();
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remediation Recommended

2. Huge randomized test load 🐞 Bug

testManyChainedSynonymGraphFilters generates extremely large synonym inputs (50k–100k vocab, 10
  filters × 10k rules) and then runs 1000 analysis iterations.
• This is likely to be very slow and memory intensive in standard CI/unit test runs, causing
  timeouts/OOM/flakiness.
• The test should be scaled down by default and/or gated (e.g., @Nightly or TEST_NIGHTLY-based
  scaling) while still preserving regression coverage.
Agent Prompt
### Issue description
`testManyChainedSynonymGraphFilters` is extremely expensive (large random vocab + 100k+ rules + 1000 iterations). This can destabilize CI/unit test runs.

### Issue Context
The test’s goal is valuable (regression for chained synonym_graph OOM), but it should not run at max scale in every unit test execution.

### Fix Focus Areas
- modules/analysis-common/src/test/java/org/elasticsearch/analysis/common/SynonymsAnalysisTests.java[375-436]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Comment on lines +354 to +372
// Test single word - synonym_graph produces both original and synonym at same position
BaseTokenStreamTestCase.assertAnalyzesTo(
indexAnalyzers.get("syn"),
"foo",
new String[] { "foo", "bar" },
new int[] { 0, 0 }, // start offsets
new int[] { 3, 3 }, // end offsets
new int[] { 1, 0 } // position increments
);

// Test multi-word query with all three filters active
BaseTokenStreamTestCase.assertAnalyzesTo(
indexAnalyzers.get("syn"),
"foo baz hello",
new String[] { "bar", "foo", "qux", "baz", "world", "hello" },
new int[] { 0, 0, 4, 4, 8, 8 }, // start offsets
new int[] { 3, 3, 7, 7, 13, 13 }, // end offsets
new int[] { 1, 0, 1, 0, 1, 0 } // position increments: each synonym pair at same position
);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remediation Recommended

3. Synonym order tests disagree 🐞 Bug

• The unit test expects synonym_graph output ordering that differs between the single-token and
  multi-token cases (e.g., foo[foo, bar] but foo baz hello[bar, foo, ...]).
• The REST YAML test asserts a different ordering (foo then bar for position 0), which conflicts
  with the unit test’s multi-token expectation.
• These order-specific assertions are brittle; tests should assert per-position token sets (or avoid
  ordering assertions in REST YAML), otherwise they may fail despite semantically correct synonym
  expansion.
Agent Prompt
### Issue description
The new tests encode conflicting assumptions about synonym_graph token ordering. This can make tests brittle or inconsistent across suites.

### Issue Context
Synonym_graph produces stacked tokens at the same position; ordering among stacked tokens is often an implementation detail and may change.

### Fix Focus Areas
- modules/analysis-common/src/test/java/org/elasticsearch/analysis/common/SynonymsAnalysisTests.java[354-372]
- modules/analysis-common/src/yamlRestTest/resources/rest-api-spec/test/analysis-common/50_chained_synonym_filters.yml[48-81]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants