Fix use-after-free at event-loop shutdown #105486

Merged

elasticsearchmachine merged 13 commits into elastic:main from DaveCTurner:2024/02/14/use-after-free-netty-shutdown on Feb 15, 2024

Conversation

@DaveCTurner
Member

We could still be manipulating a network message when the event loop
shuts down, causing us to close the message while it's still in use.
This is at best going to be a little surprising to the caller, and at
worst could be an outright use-after-free bug.

This commit moves the double-check for a leaked promise to happen
strictly after the event loop has fully terminated, so that we can be
sure we've finished using it by this point.

Relates #105306, #97301

@elasticsearchmachine
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

elasticsearchmachine added the Team:Distributed (Meta label for distributed team) label on Feb 14, 2024
@DaveCTurner
Member Author

I'm pretty much certain this has no effect in production - we only get into this state at shutdown, after disconnecting from all other nodes and all connected clients, and I don't see how we could end up re-using a buffer for something that could corrupt any data persisted to disk either. But it still seems a little neater not to have to wrap every listener in a notifyOnce().

@DaveCTurner
Member Author

Hmph https://gradle-enterprise.elastic.co/s/v7bfpzgudlomq indicates this doesn't work, but I'm not seeing why.

DaveCTurner marked this pull request as draft on February 14, 2024 at 10:38
@DaveCTurner
Member Author

OK, got it: it's a ChannelPromise, so it simply cannot be completed once the executor is gone.
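
For illustration, a minimal sketch of the failure mode (assuming Netty 4.1 semantics; the class and names below are made up, not from this PR). The promise's result is still recorded, but the listener-notification task is rejected by the terminated executor, so listeners never fire:

import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.util.concurrent.DefaultPromise;

public class ShutdownPromiseDemo {
    public static void main(String[] args) throws Exception {
        NioEventLoopGroup group = new NioEventLoopGroup(1);

        // A promise bound to the event loop, as a ChannelPromise normally is
        DefaultPromise<Void> promise = new DefaultPromise<>(group.next());
        promise.addListener(f -> System.out.println("listener ran: " + f.isSuccess()));

        // Fully terminate the event loop
        group.shutdownGracefully().sync();

        // The result is recorded, but the notification task is rejected by
        // the dead executor (Netty logs and swallows it), so the listener
        // above never runs.
        promise.trySuccess(null);
        System.out.println("isDone=" + promise.isDone());
    }
}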

DaveCTurner marked this pull request as ready for review on February 14, 2024 at 18:01
@DaveCTurner
Member Author

Right this seems to work, but maybe there's a good reason why @original-brownbear didn't do it this way to start with?

@elasticsearchmachine
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

@original-brownbear
Contributor

I'm a little hesitant to do this. This will definitely fix the problem if we also apply it to PromiseCombiner(ctx.executor()), I think, but at the cost of risking stack overflows and changing the scheduling of IO vs compute on transport threads. Not sure it's worth taking this risk to fix a test-only issue? Maybe we should rather look into ways of fixing this by detecting leaks coming out of this issue in tests specifically and not failing on them? (a tricky task, admittedly)
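
For context, this is roughly the pattern in question - a hypothetical sketch of a PromiseCombiner(ctx.executor()) usage, not code from this PR:

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelPromise;
import io.netty.util.concurrent.PromiseCombiner;

class ChunkedWriter {
    // Hypothetical handler method: several write futures aggregated into one
    // promise, with listener notification bound to the channel's event loop
    // via ctx.executor(). PromiseCombiner must be used on that executor's thread.
    void writeBoth(ChannelHandlerContext ctx, Object first, Object second, ChannelPromise aggregate) {
        PromiseCombiner combiner = new PromiseCombiner(ctx.executor());
        combiner.add(ctx.write(first));
        combiner.add(ctx.write(second));
        combiner.finish(aggregate); // completes 'aggregate' once both writes complete
        ctx.flush();
    }
}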

// Use ImmediateEventExecutor.INSTANCE since we want to be able to complete this promise, and any waiting listeners, even if the
// channel's event loop has shut down. Normally this completion will happen on the channel's event loop anyway because the write op
// can only be completed by some network event from this point on. However...
final var promise = new DefaultChannelPromise(channel, ImmediateEventExecutor.INSTANCE);
Contributor

I mean, this does fundamentally change how things work, doesn't it? Normally, Netty will queue the result handling of the operation on its task queue; now this gets executed inline with the write. I have a very hard time understanding the precise effects of this, but it makes some sense to assume that it will make IO event handling less predictable?
Also, I wonder if this even provides a full fix? We use this thing in a couple of spots: new PromiseCombiner(ctx.executor()). Aren't we still subject to the same shortcomings, just less likely to run into them now? The point of the ctx.executor() executor isn't really forking but rather queueing.

Member Author

Normally, Netty will queue the result handling of the operation on its task queue, now this gets executed inline with the write.

I don't think that's true. If we're 8+ levels deep then it'll do that, but we don't seem to nest these things that much. The rest of the time, if you complete the promise on the event loop (which is how we do it pretty much everywhere anyway), it executes the listeners inline with the write.
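
A quick sketch of that inline behaviour (assuming Netty's DefaultPromise semantics; the demo class is made up): with ImmediateEventExecutor.INSTANCE the completing thread runs the listeners directly, and DefaultPromise caps re-entrant inline notification at a stack depth of 8 by default (io.netty.defaultPromise.maxListenerStackDepth).

import io.netty.util.concurrent.DefaultPromise;
import io.netty.util.concurrent.ImmediateEventExecutor;

public class InlineNotifyDemo {
    public static void main(String[] args) {
        DefaultPromise<Void> promise = new DefaultPromise<>(ImmediateEventExecutor.INSTANCE);
        // With ImmediateEventExecutor the listener runs on whichever thread
        // completes the promise, inline rather than via a task queue.
        promise.addListener(f -> System.out.println("notified on " + Thread.currentThread().getName()));
        promise.trySuccess(null); // prints the main thread's name
    }
}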

If I can't convince you that stack overflows aren't a big deal here, we could reasonably make safeWriteAndFlush fork its listener onto a proper shutdown-sensitive threadpool. That still seems better than what we have today.
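
Roughly this shape, purely illustrative - the helper name and executor are made up, and this isn't safeWriteAndFlush's real signature:

import java.util.concurrent.Executor;
import io.netty.channel.ChannelFutureListener;

class ListenerForking {
    // Hypothetical: re-dispatch the completion listener onto a threadpool
    // that outlives the event loop during shutdown, instead of running it
    // inline on the completing thread.
    static ChannelFutureListener forked(Executor shutdownSafeExecutor, ChannelFutureListener inner) {
        return future -> shutdownSafeExecutor.execute(() -> {
            try {
                inner.operationComplete(future);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
    }
}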

We use this thing in a couple spots: new PromiseCombiner(ctx.executor()), aren't we still subject to the same short-comings

I don't think so; none of those spots look like they'll do anything outside the event loop anyway. We're not relying on all those subsidiary promises completing if we have the protection at the outer level.

Contributor

Hmm, you're right, this should actually be OK: all our deep nesting from the combiner is indeed on the event loop, so we only really add one extra level in the worst case :) Sorry, slow to wake up today :)

DaveCTurner added the auto-merge-without-approval (Automatically merge pull request when CI checks pass; NB doesn't wait for reviews!) and auto-backport-and-merge labels on Feb 15, 2024
elasticsearchmachine merged commit d8e6625 into elastic:main on Feb 15, 2024
DaveCTurner deleted the 2024/02/14/use-after-free-netty-shutdown branch on February 15, 2024 at 19:20
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request on Feb 15, 2024
@elasticsearchmachine
Collaborator

💚 Backport successful

Branch: 8.13 — backport successful

elasticsearchmachine pushed a commit that referenced this pull request on Feb 15, 2024

Comment on lines -37 to -41
var wrapped = ActionListener.notifyOnce(listener);
channel.writeAndFlush(response, Netty4TcpChannel.addPromise(wrapped, channel));
if (channel.eventLoop().isShutdown()) {
wrapped.onFailure(new TransportException("Cannot send HTTP response, event loop is shutting down."));
}
Member

Sorry for being slow on this issue. Could you please help me some more to understand what the issue was with the previous solution? You said

We could still be manipulating a network message when the event loop shuts down, causing us to close the message while it's still in use.

What do you mean by "manipulating a network message"? Is it releasing the response (or "closing the message")? I fail to see how we can "close the message while it's still in use" given that the listener is wrapped in notifyOnce?

Member Author

The problem is that isShutdown() returns true while the event loop is still running its last few tasks, and one of those tasks might still be doing something with the message while we release it. To be sure we're releasing the message after all possible uses on the event loop we need to subscribe to the terminationFuture instead.

Replacing the releaseOnce with Promise#tryFailure is not strictly necessary to fix this.
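
Roughly this shape (an illustrative sketch, not the exact code in this PR; the helper name and exception message are made up, and it assumes the promise was created with ImmediateEventExecutor so its listeners can still be notified here):

import io.netty.channel.Channel;
import io.netty.channel.ChannelPromise;
import org.elasticsearch.transport.TransportException;

class PromiseCleanup {
    // Hypothetical helper: defer the leaked-promise double-check until the
    // event loop has fully terminated. isShutdown() returns true while the
    // loop is still draining its final tasks, so it is too early to release
    // the message; terminationFuture() fires only after the last task.
    static void failPromiseAfterTermination(Channel channel, ChannelPromise promise) {
        channel.eventLoop().terminationFuture().addListener(ignored -> {
            // No-op if the write already completed. Because the promise was
            // created with ImmediateEventExecutor, its listeners are still
            // notified here even though the event loop is gone.
            promise.tryFailure(new TransportException("Cannot send message, event loop terminated"));
        });
    }
}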

Member

Ah OK. That makes sense. I am curious how you found out about it. Did you just read into the Netty source code?

Member Author

Yes, your review comment here prompted me to look a little deeper at the shutdown behaviour, which made me realise we were doing it wrong in these places too.

DaveCTurner restored the 2024/02/14/use-after-free-netty-shutdown branch on June 17, 2024 at 06:16

Labels

auto-merge-without-approval (Automatically merge pull request when CI checks pass; NB doesn't wait for reviews!), >bug, :Distributed/Network (Http and internode communication implementations), Team:Distributed (Meta label for distributed team), v8.13.0, v8.14.0
