Fix use-after-free at event-loop shutdown#105486
Fix use-after-free at event-loop shutdown#105486elasticsearchmachine merged 13 commits intoelastic:mainfrom
Conversation
We could still be manipulating a network message when the event loop shuts down, causing us to close the message while it's still in use. This is at best going to be a little surprising to the caller, and at worst could be an outright use-after-free bug. This commit moves the double-check for a leaked promise to happen strictly after the event loop has fully terminated, so that we can be sure we've finished using it by this point. Relates elastic#105306, elastic#97301
|
Hi @DaveCTurner, I've created a changelog YAML for you. |
|
Pinging @elastic/es-distributed (Team:Distributed) |
|
I'm pretty much certain this has no effects in production - we only get into this state at shutdown after disconnecting from all other nodes and all connected clients, and I don't see that we could end up re-using a buffer for something that could corrupt any data persisted to disk either. But still seems a little neater not to have to wrap every listener in a |
|
Hmph https://gradle-enterprise.elastic.co/s/v7bfpzgudlomq indicates this doesn't work, but I'm not seeing why. |
|
Ok got it, it's a |
|
Right this seems to work, but maybe there's a good reason why @original-brownbear didn't do it this way to start with? |
|
Hi @DaveCTurner, I've created a changelog YAML for you. |
original-brownbear
left a comment
There was a problem hiding this comment.
I'm a little hesitant to do this. This will definitely fix the problem if we also apply it to PromiseCombiner(ctx.executor()) I think but at the cost of risking stack overflows and changing the scheduling of IO vs compute on transport threads. Not sure it's worth taking this risk to fix a test only issue? Maybe we should rather look into ways of fixing this by detecting leaks coming out of this issue in tests specifically and not failing on them? (a tricky task admittedly)
| // Use ImmediateEventExecutor.INSTANCE since we want to be able to complete this promise, and any waiting listeners, even if the | ||
| // channel's event loop has shut down. Normally this completion will happen on the channel's event loop anyway because the write op | ||
| // can only be completed by some network event from this point on. However... | ||
| final var promise = new DefaultChannelPromise(channel, ImmediateEventExecutor.INSTANCE); |
There was a problem hiding this comment.
I mean this does fundamentally change how things work doesn't it? Normally, Netty will queue the result handling of the operation on its task queue, now this gets executed inline with the write. I have a very hard time understanding the precise effects of this but it makes some sense to assume that it will make IO event handling less predictable?
Also, I wonder if this even provides a full fix? We use this thing in a couple spots: new PromiseCombiner(ctx.executor()), aren't we still subject to the same short-comings just less likely to run into this now? The point of ctx.executor() executor isn't really forking but rather queuing.
There was a problem hiding this comment.
Normally, Netty will queue the result handling of the operation on its task queue, now this gets executed inline with the write.
I don't think that's true. If we're 8+ levels deep then it'll do that, but we don't seem to nest these things that much. The rest of the time if you complete the promise on the event loop (which is how we do it pretty much everywhere anyway) then it executes the listeners inline with the write.
If I can't convince you that stack overflows aren't a big deal here, we could reasonably make safeWriteAndFlush fork its listener onto a proper shutdown-sensitive threadpool. That still seems better than what we have today.
We use this thing in a couple spots:
new PromiseCombiner(ctx.executor()), aren't we still subject to the same short-comings
I don't think so, all those spots don't look like they'll do anything outside the event loop anyway. We're not relying on all those subsidiary promises completing if we have the protection at the outer level.
There was a problem hiding this comment.
Hmm you're right this should actually be ok because all our deep nesting from the combiner is on the event loop indeed so we only really add one more extra level worst case :) Sorry slow to wake up today :)
We could still be manipulating a network message when the event loop shuts down, causing us to close the message while it's still in use. This is at best going to be a little surprising to the caller, and at worst could be an outright use-after-free bug. This commit moves the double-check for a leaked promise to happen strictly after the event loop has fully terminated, so that we can be sure we've finished using it by this point. Relates elastic#105306, elastic#97301
💚 Backport successful
|
We could still be manipulating a network message when the event loop shuts down, causing us to close the message while it's still in use. This is at best going to be a little surprising to the caller, and at worst could be an outright use-after-free bug. This commit moves the double-check for a leaked promise to happen strictly after the event loop has fully terminated, so that we can be sure we've finished using it by this point. Relates #105306, #97301
| var wrapped = ActionListener.notifyOnce(listener); | ||
| channel.writeAndFlush(response, Netty4TcpChannel.addPromise(wrapped, channel)); | ||
| if (channel.eventLoop().isShutdown()) { | ||
| wrapped.onFailure(new TransportException("Cannot send HTTP response, event loop is shutting down.")); | ||
| } |
There was a problem hiding this comment.
Sorry for being slow on this issue. Could you please help me some more to understand what the issue was with the previous solution? You said
We could still be manipulating a network message when the event loop shuts down, causing us to close the message while it's still in use.
What do you mean by "manipulating a network message"? Is it releasing the response (or "close the message")? I fail to see how we can "close the message while it's still in use" given the listener is notifyOnce?
There was a problem hiding this comment.
The problem is that isShutdown() returns true while the event loop is still running its last few tasks, and one of those tasks might still be doing something with the message while we release it. To be sure we're releasing the message after all possible uses on the event loop we need to subscribe to the terminationFuture instead.
Replacing the releaseOnce with Promise#tryFailure is not strictly necessary to fix this.
There was a problem hiding this comment.
Ah OK. That makes sense. I am curious how you found out about it. Did you just read into the Netty source code?
There was a problem hiding this comment.
Yes, your review comment here prompted me to look a little deeper at the shutdown behaviour which made me realise we were doing it wrong in these places too.
We could still be manipulating a network message when the event loop
shuts down, causing us to close the message while it's still in use.
This is at best going to be a little surprising to the caller, and at
worst could be an outright use-after-free bug.
This commit moves the double-check for a leaked promise to happen
strictly after the event loop has fully terminated, so that we can be
sure we've finished using it by this point.
Relates #105306, #97301