Fix circuit breaker leak in percolator query construction#144827
Fix circuit breaker leak in percolator query construction#144827drempapis merged 21 commits intoelastic:mainfrom
Conversation
|
Pinging @elastic/es-search-foundations (Team:Search Foundations) |
|
Hi @drempapis, I've created a changelog YAML for you. |
| .put("indices.breaker.request.overhead", "1.0") | ||
| .build(); | ||
| ClusterSettings clusterSettings = new ClusterSettings(breakerSettings, ClusterSettings.BUILT_IN_CLUSTER_SETTINGS); | ||
| CircuitBreaker circuitBreaker = new HierarchyCircuitBreakerService( |
There was a problem hiding this comment.
Could you use the circuit breaker from the ESTestCase class again, that automatically tracks if it is freed again at test end? Might also make sense in the other unit test here? Could they be unified or too different?
There was a problem hiding this comment.
Done, both tests now use newLimitedBreaker(ByteSizeValue.ofMb(100)) from ESTestCase
The two tests are kept separate, one verifies round-trip serialization with TermQueryBuilder while the other requires automaton-based queries to exercise the CB code path.
spinscale
left a comment
There was a problem hiding this comment.
LGTM, do you see any other use-cases, where a similar pattern could occur that went unnoticed?
| return queryBuilder.toQuery(percolateShardContext); | ||
| } finally { | ||
| percolateShardContext.releaseQueryConstructionMemory(); | ||
| } |
There was a problem hiding this comment.
@drempapis, the same pattern exists in PhraseSuggester:138, where the circuit breaker accumulates bytes from all iterations, even though only one query is alive at any given time.
There was a problem hiding this comment.
Percolator queries are unique in that they are executed in multiple threads. More than one query can be alive at any time.
There was a problem hiding this comment.
@reungn, thanks for spotting that.
In PhraseSuggester, wrapped the toQuery() call in try/finally with releaseQueryConstructionMemory() after each correction iteration, since only one query is alive at a time there (sequential loop), releasing per-iteration is safe.
For ExpressionQueryList, the same try/finally pattern applies. It is called once per row in an ESQL lookup join, sequentially. The built query is used and completed before the next row starts.
For InnerHitContextBuilder, the query is stored inside InnerHitsContext and remains live throughout the fetch phase, so releasing immediately after toQuery() would be too early. The outer context is already registered with SearchService.addReleasable, which handles cleanup.
| return queryBuilder.toQuery(percolateShardContext); | ||
| } finally { | ||
| percolateShardContext.releaseQueryConstructionMemory(); | ||
| } |
There was a problem hiding this comment.
Percolator queries are unique in that they are executed in multiple threads. More than one query can be alive at any time.
| queryBuilder = Rewriteable.rewrite(queryBuilder, percolateShardContext); | ||
| return queryBuilder.toQuery(percolateShardContext); | ||
| } finally { | ||
| percolateShardContext.releaseQueryConstructionMemory(); |
There was a problem hiding this comment.
Doesn't this release the memory too early while it is still in use?
PercolateQueryBuilder::newPercolateSearchContext() overrides some methods on the new SearchExecutionContext specific to percolate. If the created context delegated the calls of addCircuitBreakerMemory, getQueryConstructionMemoryUsed and releaseQueryConstructionMemory to the source context then you wouldn't have the problem where the copy was being updated and you can rely on the usual release mechanism.
There was a problem hiding this comment.
Thank you @davidkyle.
Instead of releasing in a finally block, newPercolateSearchContext() now overrides addCircuitBreakerMemory, getQueryConstructionMemoryUsed, and releaseQueryConstructionMemory as suggested, to delegate back to the source context. This means all CB accounting flows through the outer request-scoped context, which SearchService already releases via addReleasable at the request end covering the full lifetime of all concurrent queries correctly.
There was a problem hiding this comment.
Thanks @drempapis looks good but have you pushed some code you didn't mean to?
I see some ESQL changes https://github.com/elastic/elasticsearch/pull/144827/changes#diff-44aa949d9c458930323c63793c06bd6c7a57e306dfab6cdfbb36cf1730cb4ad2
There was a problem hiding this comment.
Those are tests covering the examined classes that can be found in the comment #144827 (comment)
|
I think the leak could also occur at toQuery()
|
…search into fix/cb_leak_percolator
|
@davidkyle In |
💔 Backport failed
You can use sqren/backport to manually backport by running |
💚 All backports created successfully
Questions ?Please refer to the Backport tool documentation |
* upstream/main: (146 commits) Revert "[Native] Gradle-related tweaks to improve handling of the simdvec native library (elastic#144539)" Fix ArrayIndexOutOfBoundsException in fetch phase with partial results (elastic#144385) ESQL: Correctly manage NULL data type for SUM (elastic#144942) [ESQL] Fixes GroupedTopNBenchmark not executing (elastic#144944) Fix reader context leak when query response serialization fails (elastic#144708) Validate individual offset values in BULK_OFFSETS bounds checks (elastic#144643) Merge main21 source set into main in simdvec (elastic#144921) [TEST] Unmute TsidExtractingIdFieldMapperTests (elastic#144848) [Native] Gradle-related tweaks to improve handling of the simdvec native library (elastic#144539) Fix `ThreadedActionListenerTests#testRejectionHandling` (elastic#144795) Add new DLM Frozen Tier Transition execution plugin and service (elastic#144595) Prometheus: execute query_range via parsed EsqlStatement plan (elastic#144416) Investigate `testBulkIndexingRequestSplitting` failure (elastic#144766) Add test utility for wrapping directories in FilterDirectory layer (elastic#143563) Fix ES|QL decay tests with negative scale (elastic#144657) Fix circuit breaker leak in percolator query construction (elastic#144827) Use XPerFieldDocValuesFormat in AbstractTSDBSyntheticIdCodec (elastic#144744) [DOCS] Document how reindex work in CPS (elastic#144016) Fix Int4 vector library tests failing on Java 21 (elastic#144830) [DiskBBQ] Fix index sorting on flush (elastic#144938) ...
) (#144998) * Fix circuit breaker leak in percolator query construction (#144827) (cherry picked from commit b36fdbc) # Conflicts: # modules/percolator/src/test/java/org/elasticsearch/percolator/QueryBuilderStoreTests.java # x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/enrich/ExpressionQueryList.java * revert code * Delete x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/enrich/ExpressionQueryListCircuitBreakerTests.java * Update QueryBuilderStoreTests.java * Rename variable for search execution context * update after review
) (#144999) * Fix circuit breaker leak in percolator query construction (#144827) (cherry picked from commit b36fdbc) # Conflicts: # modules/percolator/src/test/java/org/elasticsearch/percolator/QueryBuilderStoreTests.java # x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/enrich/ExpressionQueryList.java * [CI] Auto commit changes from spotless * Update ExpressionQueryList.java * Delete x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/enrich/ExpressionQueryListCircuitBreakerTests.java * update code * update * update * [CI] Auto commit changes from spotless --------- Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
…4827) (#145000) * Fix circuit breaker leak in percolator query construction (#144827) (cherry picked from commit b36fdbc) # Conflicts: # modules/percolator/src/test/java/org/elasticsearch/percolator/QueryBuilderStoreTests.java # x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/enrich/ExpressionQueryList.java * [CI] Auto commit changes from spotless * Delete x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/enrich/ExpressionQueryListCircuitBreakerTests.java * update * Remove accidental .claude files * restore files * update code * [CI] Auto commit changes from spotless --------- Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
Problem
PR #142150 added CB accounting for automaton-based queries (wildcard, regexp, fuzzy, prefix), but introduced a leak for percolator workloads (#144748).
For each percolator document,
createStore()creates a shallow copy of the outerSearchExecutionContext. The copy shares the same physicalCircuitBreakerbut has its own freshqueryConstructionMemoryUsedcounter. WhentoQuery()reserves CB bytes on the copy, the outer context's release (called at request end) drains its own counter, which is always zero. The copy's bytes are never freed.Fix
Add a
try/finallyincreateStore()that callspercolateShardContext.releaseQueryConstructionMemory()after eachdocument's query is built. CB protection during construction is unaffected.
I used the repository
https://github.com/NikolajLeischner/elasticsearch-9.3.2-circuit-breaker-bugto validate the behavior after applying the changes. Prior to the changes, the leak is reproducible.However, when using a Docker image built locally from the latest snapshot version, the leak no longer occurs.