Fix use-after-free at event-loop shutdown #105486

Merged

elasticsearchmachine merged 13 commits into elastic:main from DaveCTurner:2024/02/14/use-after-free-netty-shutdown on Feb 15, 2024

Conversation

@DaveCTurner
Member

We could still be manipulating a network message when the event loop
shuts down, causing us to close the message while it's still in use.
This is at best going to be a little surprising to the caller, and at
worst could be an outright use-after-free bug.

This commit moves the double-check for a leaked promise to happen
strictly after the event loop has fully terminated, so that we can be
sure we've finished using it by this point.

Relates #105306, #97301

@elasticsearchmachine
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

elasticsearchmachine added the Team:Distributed (Meta label for distributed team) label on Feb 14, 2024
@DaveCTurner
Member Author

I'm pretty much certain this has no effect in production - we only get into this state at shutdown, after disconnecting from all other nodes and all connected clients, and I don't see how we could end up re-using a buffer for something that could corrupt any data persisted to disk either. But it still seems a little neater not to have to wrap every listener in a notifyOnce().

@DaveCTurner
Member Author

Hmph https://gradle-enterprise.elastic.co/s/v7bfpzgudlomq indicates this doesn't work, but I'm not seeing why.

DaveCTurner marked this pull request as draft on February 14, 2024 at 10:38
@DaveCTurner
Member Author

OK, got it: it's a ChannelPromise, so it simply cannot be completed once the executor is gone.
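
For illustration, a minimal sketch of the failure mode (assuming Netty 4.1 semantics; the class and names below are made up, not from this PR). The promise's result is still recorded, but the listener-notification task is rejected by the terminated executor, so listeners never fire:

import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.util.concurrent.DefaultPromise;

public class ShutdownPromiseDemo {
    public static void main(String[] args) throws Exception {
        NioEventLoopGroup group = new NioEventLoopGroup(1);

        // A promise bound to the event loop, as a ChannelPromise normally is
        DefaultPromise<Void> promise = new DefaultPromise<>(group.next());
        promise.addListener(f -> System.out.println("listener ran: " + f.isSuccess()));

        // Fully terminate the event loop
        group.shutdownGracefully().sync();

        // The result is recorded, but the notification task is rejected by
        // the dead executor (Netty logs and swallows it), so the listener
        // above never runs.
        promise.trySuccess(null);
        System.out.println("isDone=" + promise.isDone());
    }
}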

DaveCTurner marked this pull request as ready for review on February 14, 2024 at 18:01
@DaveCTurner
Member Author

Right this seems to work, but maybe there's a good reason why @original-brownbear didn't do it this way to start with?

@elasticsearchmachine
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

@original-brownbear
Contributor

I'm a little hesitant to do this. This will definitely fix the problem if we also apply it to PromiseCombiner(ctx.executor()), I think, but at the cost of risking stack overflows and changing the scheduling of IO vs compute on transport threads. Not sure it's worth taking this risk to fix a test-only issue? Maybe we should rather look into ways of fixing this by detecting leaks coming out of this issue in tests specifically and not failing on them? (a tricky task, admittedly)
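
For context, this is roughly the pattern in question - a hypothetical sketch of a PromiseCombiner(ctx.executor()) usage, not code from this PR:

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelPromise;
import io.netty.util.concurrent.PromiseCombiner;

class ChunkedWriter {
    // Hypothetical handler method: several write futures aggregated into one
    // promise, with listener notification bound to the channel's event loop
    // via ctx.executor(). PromiseCombiner must be used on that executor's thread.
    void writeBoth(ChannelHandlerContext ctx, Object first, Object second, ChannelPromise aggregate) {
        PromiseCombiner combiner = new PromiseCombiner(ctx.executor());
        combiner.add(ctx.write(first));
        combiner.add(ctx.write(second));
        combiner.finish(aggregate); // completes 'aggregate' once both writes complete
        ctx.flush();
    }
}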

// Use ImmediateEventExecutor.INSTANCE since we want to be able to complete this promise, and any waiting listeners, even if the
// channel's event loop has shut down. Normally this completion will happen on the channel's event loop anyway because the write op
// can only be completed by some network event from this point on. However...
final var promise = new DefaultChannelPromise(channel, ImmediateEventExecutor.INSTANCE);
Contributor

I mean, this does fundamentally change how things work, doesn't it? Normally, Netty will queue the result handling of the operation on its task queue; now this gets executed inline with the write. I have a very hard time understanding the precise effects of this, but it makes some sense to assume that it will make IO event handling less predictable?
Also, I wonder if this even provides a full fix? We use this thing in a couple of spots: new PromiseCombiner(ctx.executor()). Aren't we still subject to the same shortcomings, just less likely to run into them now? The point of the ctx.executor() executor isn't really forking but rather queueing.

Member Author

Normally, Netty will queue the result handling of the operation on its task queue, now this gets executed inline with the write.

I don't think that's true. If we're 8+ levels deep then it'll do that, but we don't seem to nest these things that much. The rest of the time, if you complete the promise on the event loop (which is how we do it pretty much everywhere anyway), it executes the listeners inline with the write.
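
A quick sketch of that inline behaviour (assuming Netty's DefaultPromise semantics; the demo class is made up): with ImmediateEventExecutor.INSTANCE the completing thread runs the listeners directly, and DefaultPromise caps re-entrant inline notification at a stack depth of 8 by default (io.netty.defaultPromise.maxListenerStackDepth).

import io.netty.util.concurrent.DefaultPromise;
import io.netty.util.concurrent.ImmediateEventExecutor;

public class InlineNotifyDemo {
    public static void main(String[] args) {
        DefaultPromise<Void> promise = new DefaultPromise<>(ImmediateEventExecutor.INSTANCE);
        // With ImmediateEventExecutor the listener runs on whichever thread
        // completes the promise, inline rather than via a task queue.
        promise.addListener(f -> System.out.println("notified on " + Thread.currentThread().getName()));
        promise.trySuccess(null); // prints the main thread's name
    }
}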

If I can't convince you that stack overflows aren't a big deal here, we could reasonably make safeWriteAndFlush fork its listener onto a proper shutdown-sensitive threadpool. That still seems better than what we have today.
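
Roughly this shape, purely illustrative - the helper name and executor are made up, and this isn't safeWriteAndFlush's real signature:

import java.util.concurrent.Executor;
import io.netty.channel.ChannelFutureListener;

class ListenerForking {
    // Hypothetical: re-dispatch the completion listener onto a threadpool
    // that outlives the event loop during shutdown, instead of running it
    // inline on the completing thread.
    static ChannelFutureListener forked(Executor shutdownSafeExecutor, ChannelFutureListener inner) {
        return future -> shutdownSafeExecutor.execute(() -> {
            try {
                inner.operationComplete(future);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
    }
}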

We use this thing in a couple spots: new PromiseCombiner(ctx.executor()), aren't we still subject to the same short-comings

I don't think so; none of those spots look like they'll do anything outside the event loop anyway. We're not relying on all those subsidiary promises completing if we have the protection at the outer level.

Contributor

Hmm, you're right, this should actually be OK: all our deep nesting from the combiner is indeed on the event loop, so we only really add one extra level in the worst case :) Sorry, slow to wake up today :)

DaveCTurner added the auto-merge-without-approval (Automatically merge pull request when CI checks pass; NB doesn't wait for reviews!) and auto-backport-and-merge labels on Feb 15, 2024
elasticsearchmachine merged commit d8e6625 into elastic:main on Feb 15, 2024
DaveCTurner deleted the 2024/02/14/use-after-free-netty-shutdown branch on February 15, 2024 at 19:20
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request on Feb 15, 2024
@elasticsearchmachine
Collaborator

💚 Backport successful

Branch: 8.13 — backport successful

elasticsearchmachine pushed a commit that referenced this pull request on Feb 15, 2024

Comment on lines -37 to -41
var wrapped = ActionListener.notifyOnce(listener);
channel.writeAndFlush(response, Netty4TcpChannel.addPromise(wrapped, channel));
if (channel.eventLoop().isShutdown()) {
wrapped.onFailure(new TransportException("Cannot send HTTP response, event loop is shutting down."));
}
Member

Sorry for being slow on this issue. Could you please help me some more to understand what the issue was with the previous solution? You said

We could still be manipulating a network message when the event loop shuts down, causing us to close the message while it's still in use.

What do you mean by "manipulating a network message"? Is it releasing the response (or "closing the message")? I fail to see how we can "close the message while it's still in use" given that the listener is wrapped in notifyOnce?

Member Author

The problem is that isShutdown() returns true while the event loop is still running its last few tasks, and one of those tasks might still be doing something with the message while we release it. To be sure we're releasing the message after all possible uses on the event loop we need to subscribe to the terminationFuture instead.

Replacing the releaseOnce with Promise#tryFailure is not strictly necessary to fix this.
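
Roughly this shape (an illustrative sketch, not the exact code in this PR; the helper name and exception message are made up, and it assumes the promise was created with ImmediateEventExecutor so its listeners can still be notified here):

import io.netty.channel.Channel;
import io.netty.channel.ChannelPromise;
import org.elasticsearch.transport.TransportException;

class PromiseCleanup {
    // Hypothetical helper: defer the leaked-promise double-check until the
    // event loop has fully terminated. isShutdown() returns true while the
    // loop is still draining its final tasks, so it is too early to release
    // the message; terminationFuture() fires only after the last task.
    static void failPromiseAfterTermination(Channel channel, ChannelPromise promise) {
        channel.eventLoop().terminationFuture().addListener(ignored -> {
            // No-op if the write already completed. Because the promise was
            // created with ImmediateEventExecutor, its listeners are still
            // notified here even though the event loop is gone.
            promise.tryFailure(new TransportException("Cannot send message, event loop terminated"));
        });
    }
}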

Member

Ah OK. That makes sense. I am curious how you found out about it. Did you just read into the Netty source code?

Member Author

Yes, your review comment here prompted me to look a little deeper at the shutdown behaviour, which made me realise we were doing it wrong in these places too.

DaveCTurner restored the 2024/02/14/use-after-free-netty-shutdown branch on June 17, 2024 at 06:16

Labels

auto-merge-without-approval (Automatically merge pull request when CI checks pass; NB doesn't wait for reviews!), >bug, :Distributed/Network (Http and internode communication implementations), Team:Distributed (Meta label for distributed team), v8.13.0, v8.14.0
