Skip to content

Remove redundant event loop shutdown check in Netty4TcpChannel#96856

Merged
original-brownbear merged 1 commit intoelastic:mainfrom
original-brownbear:95759
Jun 16, 2023
Merged

Remove redundant event loop shutdown check in Netty4TcpChannel#96856
original-brownbear merged 1 commit intoelastic:mainfrom
original-brownbear:95759

Conversation

@original-brownbear
Copy link
Copy Markdown
Contributor

@original-brownbear original-brownbear commented Jun 15, 2023

This check isn't necessary any longer. We safely close all channels before shutting down the even loop groups so we can not run into the situation that a listener passed to writeAndFlush isn't completed. We also don't have this check on the http channel sending. -> remove the check that can cause a bug by double invoking the listener when the event loop concurrently closes.

closes #95759

This check isn't necessary any longer. We safely close all channels
before shutting down the even loop groups so we can not run into
the situation that a listener passed to `writeAndFlush` isn't completed.
We also don't have this check on the http channel sending.
-> remove the check that can cause a bug by double invoking the listener
closes #95759
@original-brownbear original-brownbear added >test Issues or PRs that are addressing/adding tests :Distributed/Network Http and internode communication implementations v8.9.0 labels Jun 15, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team. label Jun 15, 2023
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

Copy link
Copy Markdown
Member

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep seems better to shut these things down gracefully indeed. Can we assert channel.eventLoop().isShutdown() == false to make sure our shutdown process actually works as described?

@original-brownbear
Copy link
Copy Markdown
Contributor Author

Can we assert channel.eventLoop().isShutdown() == false to make sure our shutdown process actually works as described?

We can't, the fact that we can run into this situation is exactly what's causing the linked test failure.
We can have a non-transport thread send a message and either run into a closed channel (which will fail the listener right away) or less have the event loop shut down concurrently so that it was still open when calling writeAndFlush but then got closed before we do the check.
We don't have the guarantee that the transport threads are stopped last is what it comes down to.

@DaveCTurner
Copy link
Copy Markdown
Member

DaveCTurner commented Jun 15, 2023

Ah ok sorry I missed that this could be running outside the event loop. Does assert (channel.eventLoop().inEventLoop() && channel.eventLoop().isShutdown()) == false; hold? Eh forget that, how could we be in the event loop if it's shut down?

@original-brownbear
Copy link
Copy Markdown
Contributor Author

original-brownbear commented Jun 15, 2023

I assume you mean:

assert (channel.eventLoop().inEventLoop() == false || channel.eventLoop().isShutdown()) == false;

? I As far as I can tell there's no guarantee that that's true. Only if we checked for isTerminated would it be true but it also seems redundant to assert that?

how could we be in the event loop if it's shut down?

I guess shut-down just means no new tasks are accepted. Terminated means all tasks are done.

Copy link
Copy Markdown
Contributor

@Tim-Brooks Tim-Brooks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@original-brownbear
Copy link
Copy Markdown
Contributor Author

Thanks Tim + David! Merging this since you're on PTO David :)

@original-brownbear original-brownbear merged commit 74a594d into elastic:main Jun 16, 2023
@original-brownbear original-brownbear deleted the 95759 branch June 16, 2023 08:23
original-brownbear added a commit that referenced this pull request Jul 2, 2023
Follow up to #96856, turns out this wasn't safe to remove after all.
We can't go back to the previous solution though since that had double invocation issues
=> use notify-once for now until this is fixed in Netty.
@original-brownbear original-brownbear restored the 95759 branch November 30, 2024 10:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed/Network Http and internode communication implementations Team:Distributed Meta label for distributed team. >test Issues or PRs that are addressing/adding tests v8.9.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI] SearchScrollIT testRestartDataNodesDuringScrollSearch failing

4 participants