[main] Fix an OOM error when creating to many chained synonym graph token filter. (#140026)#11
[main] Fix an OOM error when creating to many chained synonym graph token filter. (#140026)#11tomerqodo wants to merge 4 commits into
Conversation
…lter. (elastic#140026) * Create YAML REST test for chained synonym filters. * Adding some unit tests for chained synonyms graph filters creation. * Ensure previous synonyms set from the chain aren't applied again. (cherry picked from commit e6bd70d)
There was a problem hiding this comment.
Pull request overview
This PR fixes an OutOfMemoryError that occurs when creating multiple chained synonym graph token filters. The issue stems from recursive synonym expansion during the building phase, where each filter in the chain would reprocess synonyms through all previous filters, leading to exponential memory growth.
Changes:
- Override
getSynonymFilter()inSynonymGraphTokenFilterFactoryto return the current filter instance, preventing recursive synonym expansion - Add unit test (
testManyChainedSynonymGraphFilters) that creates 10 chained synonym filters with 10,000 synonyms each to verify the fix - Add YAML REST test to verify basic functionality of chained synonym graph filters
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
SynonymGraphTokenFilterFactory.java |
Implements the fix by overriding getSynonymFilter() to prevent recursive synonym expansion during filter chain building |
SynonymsAnalysisTests.java |
Adds comprehensive unit tests for chained synonym graph filters, including an OOM regression test with many filters |
50_chained_synonym_filters.yml |
Adds integration test verifying that chained synonym graph filters work correctly in actual index/search scenarios |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| BaseTokenStreamTestCase.assertAnalyzesTo( | ||
| indexAnalyzers.get("syn"), | ||
| "foo baz hello", | ||
| new String[] { "bar", "foo", "qux", "baz", "world", "hello" }, |
There was a problem hiding this comment.
The expected token order appears to contradict the comment on line 375 which states 'synonym_graph produces synonym first, then original'. The test expects 'bar' before 'foo', but this ordering should be explicitly verified and documented as the actual behavior, especially since it differs from the comment pattern in the YAML test file (lines 76-81) which shows original tokens first in position.
| } | ||
|
|
||
| sb.append(" => "); | ||
| sb.append("syn").append(i * (j + 1)); // Shared ID appears in ALL filters |
There was a problem hiding this comment.
The comment 'Shared ID appears in ALL filters' is misleading. The synonym target is actually unique per filter (multiplied by filter index i) and per synonym rule (multiplied by j+1). The IDs are not shared across filters but rather uniquely generated using i * (j + 1) which produces different values for each filter-synonym combination.
Benchmark PR from qodo-benchmark#68