Fix reader context leak when query response serialization fails#144708

Merged
drempapis merged 40 commits into elastic:main from drempapis:fix/reader_context_leak
Mar 26, 2026

@drempapis
Contributor

@drempapis drempapis commented Mar 22, 2026

Solves #144598

Problem

PR #143136 introduced NetworkPathListener in SearchTransportService to defer circuit breaker release until the Netty transport write completes. As part of this, query response serialization was moved inside NetworkPathListener.onResponse().

If serialization threw an exception, the exception was caught inside NetworkPathListener and forwarded directly to the transport channel's failure handler, bypassing SearchService's wrapFailureListener. That listener is responsible for calling processFailure(context, exc), which frees the ReaderContext associated with the in-flight search.

The result: the ReaderContext was never freed, leaking in-flight search contexts. This surfaced as intermittent failures in TransportSearchIT.testCircuitBreakerReduceFail with the error:

There are still [14] in-flight contexts.
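
The failure mode can be reproduced in miniature. The following is a simplified sketch, not the Elasticsearch code: the nested `Listener` interface, `activeContexts`, `channelFailures`, `cleanupListener`, and `respondBypassingCleanup` are hypothetical stand-ins for ActionListener, the in-flight context registry, the transport channel, wrapFailureListener, and the old NetworkPathListener behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class ReaderContextLeakSketch {
    // Hypothetical stand-in for a response listener; not the Elasticsearch ActionListener.
    interface Listener<T> {
        void onResponse(T result);
        void onFailure(Exception e);
    }

    // Stand-in for the registry of in-flight reader contexts.
    static final Map<Long, String> activeContexts = new ConcurrentHashMap<>();
    // Stand-in for failures written to the transport channel.
    static final List<String> channelFailures = new ArrayList<>();

    // Cleanup listener: frees the context on failure, then notifies the channel.
    static Listener<String> cleanupListener(long contextId) {
        return new Listener<>() {
            @Override public void onResponse(String result) { /* success path not modeled */ }
            @Override public void onFailure(Exception e) {
                activeContexts.remove(contextId);
                channelFailures.add(e.getMessage());
            }
        };
    }

    // Assumed to always fail, to model the serialization error.
    static void serialize(String result) {
        throw new IllegalStateException("serialization failed");
    }

    // Buggy response path: the serialization failure is caught and sent straight
    // to the channel, so cleanup.onFailure(e) never runs.
    static void respondBypassingCleanup(Listener<String> cleanup, String result) {
        try {
            serialize(result);
            cleanup.onResponse(result);
        } catch (Exception e) {
            channelFailures.add(e.getMessage()); // bypasses the cleanup listener
        }
    }

    // Registers one context, fails serialization, and reports how many contexts remain.
    static int runScenario() {
        activeContexts.clear();
        channelFailures.clear();
        activeContexts.put(1L, "reader-context");
        respondBypassingCleanup(cleanupListener(1L), "query-result");
        return activeContexts.size();
    }

    public static void main(String[] args) {
        System.out.println("in-flight contexts after failure: " + runScenario());
    }
}
```

In this sketch the channel does see the failure, but the context entry is still registered afterwards, which mirrors the "in-flight contexts" assertion above.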

Solution

The fix ensures serialization failures are propagated through the full listener chain rather than short-circuiting to the channel directly.

SearchTransportService: Instead of catching the serialization exception and forwarding it straight to the channel, the exception is now rethrown. It therefore propagates to listener.onFailure(), where the listener is the one built by wrapFailureListener and knows how to clean up the ReaderContext.
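
A minimal sketch of this propagation, under stated assumptions: `Listener`, `channelListener`, `withCleanup`, and `complete` are hypothetical names, and `complete` models a caller that routes any exception thrown while delivering a response to the same listener's onFailure. The real call chain in SearchTransportService may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class RethrowPropagationSketch {
    // Hypothetical stand-in for a response listener; not the Elasticsearch ActionListener.
    interface Listener<T> {
        void onResponse(T result);
        void onFailure(Exception e);
    }

    static final Map<Long, String> activeContexts = new ConcurrentHashMap<>();
    static final List<String> channelFailures = new ArrayList<>();

    // Assumed to always fail, to model the serialization error.
    static void serialize(String result) {
        throw new IllegalStateException("serialization failed");
    }

    // Innermost listener: serializes and writes to the channel. It lets the
    // serialization exception escape instead of handling it itself.
    static Listener<String> channelListener() {
        return new Listener<>() {
            @Override public void onResponse(String result) {
                serialize(result); // may throw; deliberately not caught here
            }
            @Override public void onFailure(Exception e) {
                channelFailures.add(e.getMessage());
            }
        };
    }

    // Cleanup wrapper: on failure, free the context first, then delegate.
    static Listener<String> withCleanup(Listener<String> delegate, long contextId) {
        return new Listener<>() {
            @Override public void onResponse(String result) { delegate.onResponse(result); }
            @Override public void onFailure(Exception e) {
                activeContexts.remove(contextId);
                delegate.onFailure(e);
            }
        };
    }

    // Caller: an exception thrown during onResponse is routed to the outermost onFailure.
    static <T> void complete(Listener<T> listener, T result) {
        try {
            listener.onResponse(result);
        } catch (Exception e) {
            listener.onFailure(e);
        }
    }

    // Registers one context, fails serialization, and reports how many contexts remain.
    static int runScenario() {
        activeContexts.clear();
        channelFailures.clear();
        activeContexts.put(1L, "reader-context");
        complete(withCleanup(channelListener(), 1L), "query-result");
        return activeContexts.size();
    }

    public static void main(String[] args) {
        System.out.println("in-flight contexts after failure: " + runScenario());
    }
}
```

Because the exception leaves the innermost listener, the cleanup wrapper sees it before the channel does, so the context is freed and the channel is still notified.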

SearchService: Refactored wrapFailureListener to use ActionListener.releaseAfter() for cleaner resource management, and extracted executeQueryPhaseAsync() to make the async execution boundary explicit. The onFailure path now correctly invokes processFailure(context, exc) regardless of whether the failure originated in the search execution phase or the serialization phase.
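
To make the releaseAfter(wrap(...)) composition easier to follow, here is a hedged sketch of the ordering it is meant to give. The `wrap` and `releaseAfter` helpers below are simplified re-implementations written for illustration, not the Elasticsearch ActionListener methods, and the event names are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class ReleaseAfterWrapSketch {
    // Hypothetical stand-in for a response listener; not the Elasticsearch ActionListener.
    interface Listener<T> {
        void onResponse(T result);
        void onFailure(Exception e);
    }

    // Simplified wrap: builds a listener from two callbacks.
    static <T> Listener<T> wrap(Consumer<T> responseHandler, Consumer<Exception> failureHandler) {
        return new Listener<>() {
            @Override public void onResponse(T result) { responseHandler.accept(result); }
            @Override public void onFailure(Exception e) { failureHandler.accept(e); }
        };
    }

    // Simplified releaseAfter: runs the delegate, then always releases the resource.
    static <T> Listener<T> releaseAfter(Listener<T> delegate, Runnable release) {
        return new Listener<>() {
            @Override public void onResponse(T result) {
                try { delegate.onResponse(result); } finally { release.run(); }
            }
            @Override public void onFailure(Exception e) {
                try { delegate.onFailure(e); } finally { release.run(); }
            }
        };
    }

    // Records the order of events for either the success or the failure path.
    static List<String> run(boolean fail) {
        List<String> events = new ArrayList<>();
        Listener<String> downstream = wrap(
            r -> events.add("listener.onResponse"),
            e -> events.add("listener.onFailure"));
        Listener<String> wrapped = releaseAfter(
            wrap(downstream::onResponse, e -> {
                events.add("processFailure"); // stands in for processFailure(context, e)
                downstream.onFailure(e);
            }),
            () -> events.add("release"));     // stands in for closing the markAsUsed Releasable
        if (fail) {
            wrapped.onFailure(new IllegalStateException("boom"));
        } else {
            wrapped.onResponse("ok");
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(run(true));
        System.out.println(run(false));
    }
}
```

On the failure path the sketch records cleanup, then the downstream notification, then the release; on the success path only the downstream notification and the release.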

@drempapis added the >test, Team:Search Foundations, :Search Foundations/Search, and v9.4.0 labels on Mar 22, 2026
@drempapis
Contributor Author

@elasticmachine test this please

@drempapis
Contributor Author

@elasticmachine run elasticsearch-ci/pr-upgrade-part-1

@drempapis
Contributor Author

@elasticmachine test this please

@drempapis
Contributor Author

@elasticmachine run elasticsearch-ci/part-3

@drempapis drempapis marked this pull request as ready for review March 25, 2026 10:54
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

final Releasable markAsUsed = readerContext.markAsUsed(getKeepAlive(request));
completionListenerRef.set(wrapFailureListener(listener, readerContext, markAsUsed));
return executeQueryPhase(request, task, readerContext);
}, ActionListener.wrap(result -> completionListenerRef.get().onResponse(result), e -> completionListenerRef.get().onFailure(e)));
Contributor

This code sequence is hard for me to read and understand, with the AtomicReference being set in one argument and accessed in the next. Do you have any ideas for simplifying it?

Contributor Author

I've added a comment that clarifies the code and explains why completionListenerRef is needed.

Contributor

@spinscale spinscale left a comment


Left one comment to think about, but LGTM otherwise.

Contributor

@andreidan andreidan left a comment


Thanks for fixing this Dimi. Left some minor comments.


private <T> ActionListener<T> wrapFailureListener(ActionListener<T> listener, ReaderContext context, Releasable releasable) {
return ActionListener.releaseAfter(ActionListener.wrap(listener::onResponse, e -> {
processFailure(context, e);
Contributor

Are we in trouble if processFailure fails?

Contributor Author

In practice, processFailure cannot throw: freeReaderContext is just a ConcurrentHashMap.remove, and failShard is already wrapped in its own try/catch. But I agree the code shouldn't silently depend on that. I've added a try/finally so listener.onFailure(e) is called unconditionally, regardless of what processFailure does.
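
A minimal sketch of the try/finally guard described in this reply, under stated assumptions: `Listener` and `guardedFailure` are hypothetical names, and the cleanup step is forced to throw purely to exercise the guard. How the real code treats an exception escaping processFailure is not shown in this thread.

```java
import java.util.ArrayList;
import java.util.List;

final class FailureNotificationSketch {
    // Hypothetical stand-in for a response listener; not the Elasticsearch ActionListener.
    interface Listener<T> {
        void onResponse(T result);
        void onFailure(Exception e);
    }

    // Runs the cleanup step, and notifies the delegate even if the cleanup throws.
    static <T> Listener<T> guardedFailure(Listener<T> delegate, Runnable processFailure) {
        return new Listener<>() {
            @Override public void onResponse(T result) { delegate.onResponse(result); }
            @Override public void onFailure(Exception e) {
                try {
                    processFailure.run();
                } finally {
                    delegate.onFailure(e); // always reached, with the original failure
                }
            }
        };
    }

    // Exercises the guard with a cleanup step that throws, and records what happens.
    static List<String> run() {
        List<String> events = new ArrayList<>();
        Listener<String> downstream = new Listener<>() {
            @Override public void onResponse(String result) { events.add("listener.onResponse"); }
            @Override public void onFailure(Exception e) {
                events.add("listener.onFailure: " + e.getMessage());
            }
        };
        Listener<String> guarded = guardedFailure(downstream, () -> {
            throw new IllegalStateException("cleanup failed");
        });
        try {
            guarded.onFailure(new RuntimeException("original failure"));
        } catch (IllegalStateException cleanupError) {
            // With a plain try/finally, the cleanup exception still propagates afterwards.
            events.add("cleanup error propagated: " + cleanupError.getMessage());
        }
        return events;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

The downstream listener receives the original failure first; the cleanup exception only surfaces to the caller after that notification has happened.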

}

private <T> ActionListener<T> wrapFailureListener(ActionListener<T> listener, ReaderContext context, Releasable releasable) {
return ActionListener.releaseAfter(ActionListener.wrap(listener::onResponse, e -> {
Contributor

Perhaps a personal flaw, but I find it very difficult to understand what we mean by releaseAfter(wrap(...)).

Can you please document the method and the code a bit? Would it make sense to make this method package-visible and unit test it?

Contributor Author

Thank you, Andrei, for the feedback.

Agreed. I documented the code, made the method package-visible, and added unit tests.

@drempapis
Contributor Author

@elasticmachine run elasticsearch-ci/part-3

@drempapis drempapis merged commit 15a99da into elastic:main Mar 26, 2026
36 checks passed