Skip to content

Generalize testClientCancellation test#143586

Merged
DaveCTurner merged 10 commits intoelastic:mainfrom
DaveCTurner:2026/03/04/chunked-response-closed-channel-tests
Mar 6, 2026
Merged

Generalize testClientCancellation test#143586
DaveCTurner merged 10 commits intoelastic:mainfrom
DaveCTurner:2026/03/04/chunked-response-closed-channel-tests

Conversation

@DaveCTurner
Copy link
Copy Markdown
Member

In #143252 we saw a case where a chunked response appears to have
computed far too many chunks at once, with some indication that it might
relate to client cancellation. We have a test for this case already but
it's a little weak. This commit reworks the test to use a Netty HTTP
client giving much deeper control over the client's behaviour, allowing
to close the connection with either a FIN or a RST and to do so after
starting to receive the response body. It also randomizes the chunk size
and allows for a delay to simulate computation happening while yielding
the chunk.

In elastic#143252 we saw a case where a chunked response appears to have
computed far too many chunks at once, with some indication that it might
relate to client cancellation. We have a test for this case already but
it's a little weak. This commit reworks the test to use a Netty HTTP
client giving much deeper control over the client's behaviour, allowing
to close the connection with either a FIN or a RST and to do so after
starting to receive the response body. It also randomizes the chunk size
and allows for a delay to simulate computation happening while yielding
the chunk.
@DaveCTurner DaveCTurner added >test Issues or PRs that are addressing/adding tests :Distributed/Network Http and internode communication implementations v9.4.0 labels Mar 4, 2026
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team. label Mar 4, 2026
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@inespot inespot self-requested a review March 5, 2026 17:07
Copy link
Copy Markdown
Contributor

@inespot inespot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good! A couple questions/clarifications before I plusone

ctx.close();
}
} else {
assertThat(msg, instanceOf(HttpResponse.class));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this else condition practically reachable? I also don't think I fully understand why we are checking the message is instance of org.elasticsearch.http.HttpResponse in that case. But I might be missing a piece of the puzzle

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we get a single HttpResponse instance representing the response headers, followed by a sequence of HttpContent objects representing the body. Try it: add an assert false here and run the test :)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, I thought this was the io.netty.handler.codec.http.HttpResponse branch. This cannot be a org.elasticsearch.http.HttpResponse indeed.

releasables.add(() -> eventLoopGroup.shutdownGracefully(0, 0, TimeUnit.SECONDS).awaitUninterruptibly());

final var gracefulClose = randomBoolean();
final var chunkSizeBytes = between(1, ByteSizeUnit.KB.toIntBytes(512));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the max intentionally Netty4WriteThrottlingHandler.MAX_BYTES_PER_WRITE * 2 to potentially trigger multiple slices of write?
If so, could we make this explicit by using the constant?
If not, is this value intentional or arbitrary?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's kind of arbitrary, the case in #143252 happened to have chunks that looked to be in the 300kiB-450kiB range so I rounded that up to the next power of two. But yes Netty4WriteThrottlingHandler.MAX_BYTES_PER_WRITE * 2 seems like a reasonable way to express the limit as well - in practice this isn't a max, we flush a write every time we have queued up at least MAX_BYTES_PER_WRITE bytes but almost always we're going to overflow that because we only hint to encodeChunk to stop at that point, we do not enforce it.

@Override
public BytesReference next() {
return CHUNK;
logger.info("--> yielding chunk of size [{}]", chunkSize);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect this will be quite noisy!
Is that intentional? Is the idea to keep this info log in case we manage to capture the situation described in issue # 143252 ? To check that the node continues processing chunks after the client has closed?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We stop after at most 5 chunk lengths (plus some buffering) so this is fine. Unless we hit the infinite loop in which case the noise is a feature not a bug.

final var releasables = new ArrayList<Releasable>(4);

try {
releasables.add(withResourceTracker());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So to verify my understanding, the current hypothesis for issue # 143252 is that when the client drops the connection, the channel becomes inactive but not yet fully closed, and something in the Netty pipeline starts silently discarding writes. So those writes never reach the ChannelOutboundBuffer, and the buffer stays empty, isWritable() keeps returning true, and the Netty4HttpPipeliningHandler.doFlush() loop keeps encoding chunks that go nowhere.
If this test manages to capture that, we would then see a timeout because sendChunksResponse never releases the ref that resource tracker is waiting for.
Is that all correct?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes exactly.

@DaveCTurner DaveCTurner requested a review from inespot March 6, 2026 11:04
Copy link
Copy Markdown
Contributor

@inespot inespot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm!

@DaveCTurner DaveCTurner enabled auto-merge (squash) March 6, 2026 13:50
@DaveCTurner DaveCTurner merged commit 29b5062 into elastic:main Mar 6, 2026
37 checks passed
szybia added a commit to szybia/elasticsearch that referenced this pull request Mar 6, 2026
…locations

* upstream/main: (153 commits)
  ES|QL: Update docs for TOP_SNIPPETS and DECAY (elastic#143739)
  Correctly include endpoint id in log msg in AuthorizationPoller (elastic#143743)
  Bar searching or sorting on _seq_no when disabled (elastic#143600)
  Generalize `testClientCancellation` test (elastic#143586)
  JSON_EXTRACT: zero-copy byte slicing for object, array, and number extraction (elastic#143702)
  Track recycler pages in circuit breaker (elastic#143738)
  [ESQL] Enable distributed pipeline breakers for external sources via FragmentExec (elastic#143696)
  Adding 'mode' and 'codec' fields to ES monitoring template (elastic#143673)
  [ESQL] Columnar I/O and vectorized block conversion for external sources (elastic#143703)
  Fix flaky MMR diversification YAML tests (elastic#143706)
  ES|QL codegen: check builder arguments for vector support (elastic#143724)
  Add Views Security Model (elastic#141050)
  ESQL: Prevent pushdown of unmapped fields in filters and sorts (elastic#143460)
  Don't run seq_no pruning tests in release CI (elastic#143725)
  ESQL: Support intra-row field references in ROW command (elastic#140217)
  ES|QL: Remove implicit limit in FORK branches in CSV tests (elastic#143601)
  IndexRoutingTests with and without synthetic id (elastic#143566)
  Synthetic id upgrade test in serverless (elastic#142471)
  Disable "Review skipped" comments for PRs without specified labels (elastic#143728)
  Cleanup ES|QL T-Digest code duplication, add memory accounting (elastic#143662)
  ...
sidosera pushed a commit to sidosera/elasticsearch that referenced this pull request Mar 6, 2026
In elastic#143252 we saw a case where a chunked response appears to have
computed far too many chunks at once, with some indication that it might
relate to client cancellation. We have a test for this case already but
it's a little weak. This commit reworks the test to use a Netty HTTP
client giving much deeper control over the client's behaviour, allowing
to close the connection with either a FIN or a RST and to do so after
starting to receive the response body. It also randomizes the chunk size
and allows for a delay to simulate computation happening while yielding
the chunk.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed/Network Http and internode communication implementations Team:Distributed Meta label for distributed team. >test Issues or PRs that are addressing/adding tests v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants