Generalize `testClientCancellation` test (#143586)
Conversation
In elastic#143252 we saw a case where a chunked response appeared to have computed far too many chunks at once, with some indication that it might relate to client cancellation. We already have a test for this case but it's a little weak. This commit reworks the test to use a Netty HTTP client, giving much deeper control over the client's behaviour: it can close the connection with either a FIN or an RST, and can do so after starting to receive the response body. It also randomizes the chunk size and allows for a delay to simulate computation happening while yielding each chunk.
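The FIN-vs-RST distinction the test exercises can be reproduced with plain JDK sockets, independent of Netty: enabling SO_LINGER with a zero timeout makes close() abort the connection with an RST instead of the usual orderly FIN. A minimal sketch (class and helper names are illustrative, not from this PR):

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class FinVsRst {
    // Returns "FIN" if the server saw an orderly EOF after the client closed,
    // or "RST" if the server's read failed because the connection was reset.
    static String closeAndObserve(boolean graceful) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            Socket client = new Socket("127.0.0.1", server.getLocalPort());
            try (Socket accepted = server.accept()) {
                if (!graceful) {
                    // SO_LINGER with timeout 0 makes close() abort with an RST
                    client.setSoLinger(true, 0);
                }
                client.close();
                try {
                    // Orderly FIN: read() observes EOF and returns -1.
                    return accepted.getInputStream().read() == -1 ? "FIN" : "DATA";
                } catch (IOException e) {
                    // RST: the blocked read fails with "Connection reset".
                    return "RST";
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(closeAndObserve(true));
        System.out.println(closeAndObserve(false));
    }
}
```

The same idea carries over to a Netty client: a normal channel close sends a FIN, while setting the SO_LINGER channel option to 0 before closing produces an RST.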
Pinging @elastic/es-distributed (Team:Distributed)
inespot left a comment:

Looks really good! A couple of questions/clarifications before I +1.
        ctx.close();
    }
} else {
    assertThat(msg, instanceOf(HttpResponse.class));
Is this else condition practically reachable? I also don't think I fully understand why we are checking the message is instance of org.elasticsearch.http.HttpResponse in that case. But I might be missing a piece of the puzzle
Yes, we get a single HttpResponse instance representing the response headers, followed by a sequence of HttpContent objects representing the body. Try it: add an assert false here and run the test :)
My bad, I thought this was the io.netty.handler.codec.http.HttpResponse branch. This cannot be a org.elasticsearch.http.HttpResponse indeed.
releasables.add(() -> eventLoopGroup.shutdownGracefully(0, 0, TimeUnit.SECONDS).awaitUninterruptibly());

final var gracefulClose = randomBoolean();
final var chunkSizeBytes = between(1, ByteSizeUnit.KB.toIntBytes(512));
Is the max intentionally Netty4WriteThrottlingHandler.MAX_BYTES_PER_WRITE * 2, to potentially trigger multiple slices per write?
If so, could we make this explicit by using the constant?
If not, is this value intentional or arbitrary?
It's kind of arbitrary: the case in #143252 happened to have chunks that looked to be in the 300kiB-450kiB range, so I rounded that up to the next power of two. But yes, Netty4WriteThrottlingHandler.MAX_BYTES_PER_WRITE * 2 seems like a reasonable way to express the limit as well. In practice this isn't a max: we flush a write every time we have queued up at least MAX_BYTES_PER_WRITE bytes, but we almost always overflow that because we only hint to encodeChunk to stop at that point; we do not enforce it.
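The "hint, not a limit" behaviour described here can be illustrated with a tiny standalone model (an illustrative sketch, not the actual Netty4WriteThrottlingHandler code): the stop condition is only checked between chunks, so a flushed write almost always overshoots the threshold.

```java
public class FlushModel {
    // Toy model: whole chunks accumulate until the queued size reaches the
    // threshold. Because the check happens only between chunks, the result
    // can exceed the threshold by up to (chunkSize - 1) bytes.
    static int bytesPerFlush(int chunkSize, int maxBytesPerWrite) {
        int queued = 0;
        while (queued < maxBytesPerWrite) {
            queued += chunkSize; // the encoder is only hinted to stop here, not forced
        }
        return queued;
    }

    public static void main(String[] args) {
        // e.g. 300 KiB chunks against a 512 KiB threshold queue 600 KiB per flush
        System.out.println(bytesPerFlush(300 * 1024, 512 * 1024));
    }
}
```

This is why a chunk size bound of roughly MAX_BYTES_PER_WRITE * 2 still exercises multi-slice writes: a single chunk near the bound already overflows the per-write hint.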
@Override
public BytesReference next() {
    return CHUNK;
    logger.info("--> yielding chunk of size [{}]", chunkSize);
I expect this will be quite noisy!
Is that intentional? Is the idea to keep this info log in case we manage to capture the situation described in issue #143252, to check that the node continues processing chunks after the client has closed?
We stop after at most 5 chunk lengths (plus some buffering) so this is fine. Unless we hit the infinite loop in which case the noise is a feature not a bug.
final var releasables = new ArrayList<Releasable>(4);

try {
    releasables.add(withResourceTracker());
So to verify my understanding: the current hypothesis for issue #143252 is that when the client drops the connection, the channel becomes inactive but not yet fully closed, and something in the Netty pipeline starts silently discarding writes. Those writes never reach the ChannelOutboundBuffer, so the buffer stays empty, isWritable() keeps returning true, and the Netty4HttpPipeliningHandler.doFlush() loop keeps encoding chunks that go nowhere.
If this test manages to capture that, we would then see a timeout because sendChunksResponse never releases the ref that resource tracker is waiting for.
Is that all correct?
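That suspected feedback loop can be sanity-checked with a few lines of simulation (a toy model under the assumptions above, not Elasticsearch or Netty code): if writes are silently discarded, the pending-byte count never grows, the channel never becomes unwritable, and the encode loop only stops at some external bound.

```java
public class DiscardedWritesModel {
    // Toy model of a doFlush()-style loop: keep encoding chunks while the
    // channel is "writable", i.e. while pending bytes stay below the
    // high-water mark. maxIterations stands in for the test's safety cap.
    static int chunksEncoded(boolean discardWrites, int highWaterMark, int chunkSize, int maxIterations) {
        int pendingBytes = 0;
        int encoded = 0;
        while (pendingBytes < highWaterMark && encoded < maxIterations) {
            encoded++;
            if (!discardWrites) {
                pendingBytes += chunkSize; // healthy: write queued in the outbound buffer
            }
            // broken case: the write is dropped, pendingBytes stays 0,
            // so the "writable" check never turns false
        }
        return encoded;
    }

    public static void main(String[] args) {
        System.out.println(chunksEncoded(false, 64, 16, 1_000)); // healthy channel stops quickly
        System.out.println(chunksEncoded(true, 64, 16, 1_000));  // broken channel runs to the cap
    }
}
```

In the healthy case the loop stops as soon as the buffer fills; in the broken case it runs until the external cap, which matches the "far too many chunks at once" symptom.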
…locations

* upstream/main: (153 commits)
  - ES|QL: Update docs for TOP_SNIPPETS and DECAY (elastic#143739)
  - Correctly include endpoint id in log msg in AuthorizationPoller (elastic#143743)
  - Bar searching or sorting on _seq_no when disabled (elastic#143600)
  - Generalize `testClientCancellation` test (elastic#143586)
  - JSON_EXTRACT: zero-copy byte slicing for object, array, and number extraction (elastic#143702)
  - Track recycler pages in circuit breaker (elastic#143738)
  - [ESQL] Enable distributed pipeline breakers for external sources via FragmentExec (elastic#143696)
  - Adding 'mode' and 'codec' fields to ES monitoring template (elastic#143673)
  - [ESQL] Columnar I/O and vectorized block conversion for external sources (elastic#143703)
  - Fix flaky MMR diversification YAML tests (elastic#143706)
  - ES|QL codegen: check builder arguments for vector support (elastic#143724)
  - Add Views Security Model (elastic#141050)
  - ESQL: Prevent pushdown of unmapped fields in filters and sorts (elastic#143460)
  - Don't run seq_no pruning tests in release CI (elastic#143725)
  - ESQL: Support intra-row field references in ROW command (elastic#140217)
  - ES|QL: Remove implicit limit in FORK branches in CSV tests (elastic#143601)
  - IndexRoutingTests with and without synthetic id (elastic#143566)
  - Synthetic id upgrade test in serverless (elastic#142471)
  - Disable "Review skipped" comments for PRs without specified labels (elastic#143728)
  - Cleanup ES|QL T-Digest code duplication, add memory accounting (elastic#143662)
  - ...