[main] Fix an OOM error when creating to many chained synonym graph token filter. (#140026)#11
[main] Fix an OOM error when creating to many chained synonym graph token filter. (#140026)#11tomerqodo wants to merge 4 commits into
Conversation
…lter. (elastic#140026) * Create YAML REST test for chained synonym filters. * Adding some unit tests for chained synonyms graph filters creation. * Ensure previous synonyms set from the chain aren't applied again. (cherry picked from commit e6bd70d)
Code Review by Qodo (Alpha)
1. No-op synonym parsing override
|
| @Override | ||
| public TokenFilterFactory getSynonymFilter() { | ||
| // When building a synonym filter, we must prevent previous synonym filters in the chain | ||
| // from being active, as this would cause recursive synonym expansion during the building phase. | ||
| // | ||
| // Without this fix, when chaining multiple synonym filters (e.g., synonym_A → synonym_B → synonym_C), | ||
| // building synonym_C would use an analyzer containing active synonym_A and synonym_B filters. | ||
| // This causes: | ||
| // 1. Recursive synonym expansion when parsing synonym rules (e.g., synonyms are expanded via previous filters) | ||
| // 2. Each SynonymMap inflates since it applies all previous synonym rules again | ||
| // 3. Triggering O(n²) operations in SynonymGraphFilter.bufferOutputTokens() | ||
| // 4. Massive memory allocation during analyzer reload → OutOfMemoryError | ||
| // | ||
| // This matches the behavior of SynonymTokenFilterFactory and prevents OOM with chained | ||
| // synonym filters (critical for users with many chained synonym sets). | ||
| return this; | ||
| } |
There was a problem hiding this comment.
1. No-op synonym parsing override 🐞 Bug
• The new getSynonymFilter() override in the chain-aware synonym_graph factory returns this, which is also the TokenFilterFactory default, so it does not actually remove synonym filters from the synonym-map-building analyzer. • SynonymTokenFilterFactory explicitly returns IDENTITY_FILTER for synonym parsing to support chaining without applying synonyms to the synonym map; synonym_graph should do the same if it aims to prevent recursive expansion. • As written, chaining synonym_graph filters can still build later synonym maps using analyzers that apply earlier synonym_graph filters, contradicting the comment and intended OOM prevention.
Agent Prompt
### Issue description
`SynonymGraphTokenFilterFactory`’s chain-aware factory overrides `getSynonymFilter()` but returns `this`, which is the same as the `TokenFilterFactory` default. This does not prevent earlier synonym filters from being applied during synonym-map parsing/building for later synonym filters.
### Issue Context
`SynonymTokenFilterFactory` enables chaining by returning `IDENTITY_FILTER` for synonym parsing; `buildSynonymAnalyzer()` builds the analyzer for parsing synonym rules by mapping all prior filters through `getSynonymFilter()`.
### Fix Focus Areas
- modules/analysis-common/src/main/java/org/elasticsearch/analysis/common/SynonymGraphTokenFilterFactory.java[54-85]
- modules/analysis-common/src/main/java/org/elasticsearch/analysis/common/SynonymTokenFilterFactory.java[197-227]
- server/src/main/java/org/elasticsearch/index/analysis/TokenFilterFactory.java[69-114]
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
| String[] vocab = randomArray(50_000, 100_000, String[]::new, () -> randomAlphanumericOfLength(20)); | ||
| int synonymsPerFilter = 10_000; | ||
| int synonymSets = 10; | ||
| List<String> filterNames = new ArrayList<>(); | ||
| filterNames.add("lowercase"); | ||
|
|
||
| for (int i = 1; i <= synonymSets; i++) { | ||
| String filterName = "synonyms_" + i; | ||
| StringBuilder sb = new StringBuilder(); | ||
|
|
||
| for (int j = 0; j < synonymsPerFilter; j++) { | ||
| if (j > 0) { | ||
| sb.append("\n"); | ||
| } | ||
| for (int k = 0; k < between(1, 3); k++) { | ||
| if (k > 0) { | ||
| sb.append(", "); | ||
| } | ||
| for (int l = 0; l < between(1, 3); l++) { | ||
| if (l > 0) { | ||
| sb.append(" "); | ||
| } | ||
| sb.append(randomFrom(vocab)); | ||
| } | ||
| } | ||
|
|
||
| sb.append(" => "); | ||
| sb.append("syn").append(i * (j + 1)); // Shared ID appears in ALL filters | ||
| } | ||
|
|
||
| filterNames.add(filterName); | ||
| settingsBuilder.put("index.analysis.filter." + filterName + ".type", "synonym_graph") | ||
| .put("index.analysis.filter." + filterName + ".lenient", true) | ||
| .putList("index.analysis.filter." + filterName + ".synonyms", sb.toString()); | ||
| } | ||
|
|
||
| settingsBuilder.put("index.analysis.analyzer.many_syn.tokenizer", "standard") | ||
| .putList("index.analysis.analyzer.many_syn.filter", filterNames); | ||
|
|
||
| Settings settings = settingsBuilder.build(); | ||
| IndexSettings idxSettings = IndexSettingsModule.newIndexSettings("index", settings); | ||
|
|
||
| // This would OOM without the SynonymGraphTokenFilterFactory::getSynonymFilter() fix (filters built sequentially) | ||
| indexAnalyzers = createTestAnalysis(idxSettings, settings, new CommonAnalysisPlugin()).indexAnalyzers; | ||
|
|
||
| // Verify the analyzer was built successfully and can analyze text | ||
| // With cross-referencing synonyms, the exact output is complex, so just verify it works | ||
| Analyzer analyzer = indexAnalyzers.get("many_syn"); | ||
| assertNotNull("Analyzer should be created", analyzer); | ||
|
|
||
| for (int i = 0; i < 1000; i++) { | ||
| // Test that it can analyze without throwing exceptions | ||
| TokenStream ts = analyzer.tokenStream("test", randomFrom(vocab)); | ||
| ts.reset(); | ||
| assertTrue("Should produce at least one token", ts.incrementToken()); | ||
| ts.close(); | ||
| } |
There was a problem hiding this comment.
2. Huge randomized test load 🐞 Bug
• testManyChainedSynonymGraphFilters generates extremely large synonym inputs (50k–100k vocab, 10 filters × 10k rules) and then runs 1000 analysis iterations. • This is likely to be very slow and memory intensive in standard CI/unit test runs, causing timeouts/OOM/flakiness. • The test should be scaled down by default and/or gated (e.g., @Nightly or TEST_NIGHTLY-based scaling) while still preserving regression coverage.
Agent Prompt
### Issue description
`testManyChainedSynonymGraphFilters` is extremely expensive (large random vocab + 100k+ rules + 1000 iterations). This can destabilize CI/unit test runs.
### Issue Context
The test’s goal is valuable (regression for chained synonym_graph OOM), but it should not run at max scale in every unit test execution.
### Fix Focus Areas
- modules/analysis-common/src/test/java/org/elasticsearch/analysis/common/SynonymsAnalysisTests.java[375-436]
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
| // Test single word - synonym_graph produces both original and synonym at same position | ||
| BaseTokenStreamTestCase.assertAnalyzesTo( | ||
| indexAnalyzers.get("syn"), | ||
| "foo", | ||
| new String[] { "foo", "bar" }, | ||
| new int[] { 0, 0 }, // start offsets | ||
| new int[] { 3, 3 }, // end offsets | ||
| new int[] { 1, 0 } // position increments | ||
| ); | ||
|
|
||
| // Test multi-word query with all three filters active | ||
| BaseTokenStreamTestCase.assertAnalyzesTo( | ||
| indexAnalyzers.get("syn"), | ||
| "foo baz hello", | ||
| new String[] { "bar", "foo", "qux", "baz", "world", "hello" }, | ||
| new int[] { 0, 0, 4, 4, 8, 8 }, // start offsets | ||
| new int[] { 3, 3, 7, 7, 13, 13 }, // end offsets | ||
| new int[] { 1, 0, 1, 0, 1, 0 } // position increments: each synonym pair at same position | ||
| ); |
There was a problem hiding this comment.
3. Synonym order tests disagree 🐞 Bug
• The unit test expects synonym_graph output ordering that differs between the single-token and multi-token cases (e.g., foo → [foo, bar] but foo baz hello → [bar, foo, ...]). • The REST YAML test asserts a different ordering (foo then bar for position 0), which conflicts with the unit test’s multi-token expectation. • These order-specific assertions are brittle; tests should assert per-position token sets (or avoid ordering assertions in REST YAML), otherwise they may fail despite semantically correct synonym expansion.
Agent Prompt
### Issue description
The new tests encode conflicting assumptions about synonym_graph token ordering. This can make tests brittle or inconsistent across suites.
### Issue Context
Synonym_graph produces stacked tokens at the same position; ordering among stacked tokens is often an implementation detail and may change.
### Fix Focus Areas
- modules/analysis-common/src/test/java/org/elasticsearch/analysis/common/SynonymsAnalysisTests.java[354-372]
- modules/analysis-common/src/yamlRestTest/resources/rest-api-spec/test/analysis-common/50_chained_synonym_filters.yml[48-81]
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
Benchmark PR from qodo-benchmark#68