-
Notifications
You must be signed in to change notification settings - Fork 25.9k
Fix use-after-free at event-loop shutdown #105486
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
d47b27d
637aa0d
14f0d57
ad3f278
e12f03a
da03ddc
d485c22
9b3af6e
a6703fb
a6f8f3e
1b7d6e7
4e13420
d516812
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| pr: 105486 | ||
| summary: Fix use-after-free at event-loop shutdown | ||
| area: Network | ||
| type: bug | ||
| issues: [] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -11,22 +11,30 @@ | |
| import io.netty.buffer.ByteBuf; | ||
| import io.netty.buffer.CompositeByteBuf; | ||
| import io.netty.buffer.Unpooled; | ||
| import io.netty.channel.Channel; | ||
| import io.netty.channel.DefaultChannelPromise; | ||
| import io.netty.util.NettyRuntime; | ||
| import io.netty.util.concurrent.Future; | ||
| import io.netty.util.concurrent.ImmediateEventExecutor; | ||
|
|
||
| import org.apache.lucene.util.BytesRef; | ||
| import org.apache.lucene.util.BytesRefIterator; | ||
| import org.elasticsearch.ExceptionsHelper; | ||
| import org.elasticsearch.action.ActionListener; | ||
| import org.elasticsearch.common.bytes.BytesArray; | ||
| import org.elasticsearch.common.bytes.BytesReference; | ||
| import org.elasticsearch.common.recycler.Recycler; | ||
| import org.elasticsearch.common.settings.Settings; | ||
| import org.elasticsearch.common.util.concurrent.EsExecutors; | ||
| import org.elasticsearch.core.Booleans; | ||
| import org.elasticsearch.transport.TransportException; | ||
|
|
||
| import java.io.IOException; | ||
| import java.nio.ByteBuffer; | ||
| import java.util.ArrayList; | ||
| import java.util.List; | ||
| import java.util.Locale; | ||
| import java.util.concurrent.RejectedExecutionException; | ||
| import java.util.concurrent.atomic.AtomicBoolean; | ||
|
|
||
| public class Netty4Utils { | ||
|
|
@@ -121,4 +129,57 @@ public static Recycler<BytesRef> createRecycler(Settings settings) { | |
| setAvailableProcessors(EsExecutors.allocatedProcessors(settings)); | ||
| return NettyAllocator.getRecycler(); | ||
| } | ||
|
|
||
| /** | ||
| * Calls {@link Channel#writeAndFlush} to write the given message to the given channel, but ensures that the listener is completed even | ||
| * if the event loop is concurrently shutting down since Netty does not offer this guarantee. | ||
| */ | ||
| public static void safeWriteAndFlush(Channel channel, Object message, ActionListener<Void> listener) { | ||
| // Use ImmediateEventExecutor.INSTANCE since we want to be able to complete this promise, and any waiting listeners, even if the | ||
| // channel's event loop has shut down. Normally this completion will happen on the channel's event loop anyway because the write op | ||
| // can only be completed by some network event from this point on. However... | ||
| final var promise = new DefaultChannelPromise(channel, ImmediateEventExecutor.INSTANCE); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I mean this does fundamentally change how things work doesn't it? Normally, Netty will queue the result handling of the operation on its task queue, now this gets executed inline with the write. I have a very hard time understanding the precise effects of this but it makes some sense to assume that it will make IO event handling less predictable?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I don't think that's true. If we're 8+ levels deep then it'll do that, but we don't seem to nest these things that much. The rest of the time if you complete the promise on the event loop (which is how we do it pretty much everywhere anyway) then it executes the listeners inline with the write. If I can't convince you that stack overflows aren't a big deal here, we could reasonably make
I don't think so, all those spots don't look like they'll do anything outside the event loop anyway. We're not relying on all those subsidiary promises completing if we have the protection at the outer level.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm you're right this should actually be ok because all our deep nesting from the combiner is on the event loop indeed so we only really add one more extra level worst case :) Sorry slow to wake up today :) |
||
| addListener(promise, listener); | ||
| assert assertCorrectPromiseListenerThreading(channel, promise); | ||
| channel.writeAndFlush(message, promise); | ||
| if (channel.eventLoop().isShuttingDown()) { | ||
| // ... if we get here then the event loop may already have terminated, and https://github.com/netty/netty/issues/8007 means that | ||
| // we cannot know if the preceding writeAndFlush made it onto its queue before shutdown or whether it will just vanish without a | ||
| // trace, so to avoid a leak we must double-check that the final listener is completed. | ||
| channel.eventLoop().terminationFuture().addListener(ignored -> | ||
| // NB the promise executor is ImmediateEventExecutor.INSTANCE which means this call to tryFailure() will ensure its completion, | ||
| // and the completion of any waiting listeners, without forking away from the current thread. The current thread might be the | ||
| // thread that was running the event loop since that's where the terminationFuture is completed, or it might be a thread which | ||
| // called (and is still calling) safeWriteAndFlush. | ||
| promise.tryFailure(new TransportException("Cannot send network message, event loop is shutting down."))); | ||
| } | ||
| } | ||
|
|
||
| private static boolean assertCorrectPromiseListenerThreading(Channel channel, Future<?> promise) { | ||
| final var eventLoop = channel.eventLoop(); | ||
| promise.addListener(future -> { | ||
| assert eventLoop.inEventLoop() || future.cause() instanceof RejectedExecutionException || channel.eventLoop().isTerminated() | ||
| : future.cause(); | ||
| }); | ||
| return true; | ||
| } | ||
|
|
||
| /** | ||
| * Subscribes the given {@link ActionListener} to the given {@link Future}. | ||
| */ | ||
| public static void addListener(Future<Void> future, ActionListener<Void> listener) { | ||
| future.addListener(f -> { | ||
| if (f.isSuccess()) { | ||
| listener.onResponse(null); | ||
| } else { | ||
| final Throwable cause = f.cause(); | ||
| ExceptionsHelper.maybeDieOnAnotherThread(cause); | ||
| if (cause instanceof Exception exception) { | ||
| listener.onFailure(exception); | ||
| } else { | ||
| listener.onFailure(new Exception(cause)); | ||
| } | ||
| } | ||
| }); | ||
| } | ||
| } | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for being slow on this issue. Could you please help me some more to understand what the issue was with the previous solution? You said
What do you mean by "manipulating a network message"? Is it releasing the response (or "close the message")? I fail to see how we can "close the message while it's still in use" given the listener is notifyOnce?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that
isShutdown()returns true while the event loop is still running its last few tasks, and one of those tasks might still be doing something with the message while we release it. To be sure we're releasing the message after all possible uses on the event loop we need to subscribe to theterminationFutureinstead.Replacing the
releaseOncewithPromise#tryFailureis not strictly necessary to fix this.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah OK. That makes sense. I am curious how you found out about it. Did you just read into the Netty source code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, your review comment here prompted me to look a little deeper at the shutdown behaviour which made me realise we were doing it wrong in these places too.