
Restarter issues with slow to start kernels #715

Closed
vidartf opened this issue Oct 27, 2021 · 2 comments · Fixed by #717

vidartf (Contributor) commented Oct 27, 2021

We've been having some quite bad issues with the restarter when a slow-to-start kernel fails (due to an "address already in use" error, which the slow startup also makes more likely, since there is more time for another process to take the port in question). I've tracked the restarter problems down to a few points:

  • We use a PeriodicCallback to poll the state of the kernel. If the callback takes longer to complete than the period at which it is called, it will prevent any subsequent calls until the running one completes. However, this only holds true for synchronous callbacks, so it breaks down for the AsyncIOLoopKernelRestarter, which is now the default. Consequently, the poll method is basically reentrant for slow-starting kernels (the default polling period is 3 s); see the sketch after this list.
  • The poll method only sets the _restarting flag after the restart_kernel method has returned, which means that when it is reentered due to the poll not being awaited, it will trigger a new restart while the previous restart is still running.
  • The reentrancy also causes _restart_count to be reset to 1 every time, resulting in an effectively infinite restart loop.
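
To make the reentrancy concrete, here is a minimal asyncio sketch (not jupyter_client code) of an async poll fired on a fixed 3 s period without being awaited; the `_polling` flag is a hypothetical guard showing one way to make the poll non-reentrant.

```python
# Minimal sketch of the failure mode: an async poll fired every 3 s is not
# awaited before the next tick, so a poll that takes ~10 s (slow kernel
# start + restart) overlaps with itself. The `_polling` flag is a
# hypothetical guard that makes the poll non-reentrant.
import asyncio

_polling = False  # hypothetical guard, set before the first await


async def poll():
    global _polling
    if _polling:
        print("poll re-entered: previous restart still in progress")
        return
    _polling = True
    try:
        await asyncio.sleep(10)  # stands in for is_alive() + restart_kernel()
        print("poll completed")
    finally:
        _polling = False


async def periodic(period=3.0, ticks=4):
    # Fire poll() on a fixed period without awaiting it, mimicking how an
    # unawaited async callback can overlap with itself.
    for _ in range(ticks):
        asyncio.ensure_future(poll())
        await asyncio.sleep(period)


asyncio.run(periodic())
```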

Separately:

  • The slow kernel startup means that the first self.kernel_manager.is_alive() call succeeds before the "address already in use" error causes the kernel to fail startup. As a result, the _initial_startup flag is set to False before the kernel dies, so the restarter keeps trying the same ports again and again, even if the random_ports_until_alive flag is set. Ideally, _initial_startup would only be toggled after the connections have all been established.
  • If the KernelManager.autorestart flag is set to False, clients never receive any information when the kernel dies, since it is the restarter that monitors for and sends the dead status. Ideally, the dead status would still be sent even when we have said we don't want the kernel to restart automatically when it dies (see the sketch below).
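
A sketch of the second point, assuming a standalone monitor class (the KernelMonitor name and its methods are invented for illustration, and it expects an AsyncKernelManager-like object whose is_alive() and restart_kernel() are awaitable): it keeps polling even when automatic restart is disabled, so a "dead" notification is still delivered to registered callbacks.

```python
# Hypothetical monitor (a sketch, not the jupyter_client restarter): even
# when automatic restart is disabled, keep polling so a 'dead' event is
# still delivered when the kernel exits.
import asyncio


class KernelMonitor:
    def __init__(self, kernel_manager, autorestart=False, period=3.0):
        self.kernel_manager = kernel_manager  # assumed AsyncKernelManager-like
        self.autorestart = autorestart
        self.period = period
        self.callbacks = {"restart": [], "dead": []}

    def add_callback(self, f, event="restart"):
        self.callbacks[event].append(f)

    def _fire_callbacks(self, event):
        for f in self.callbacks[event]:
            f()

    async def poll_forever(self):
        while True:
            await asyncio.sleep(self.period)
            if await self.kernel_manager.is_alive():
                continue
            if self.autorestart:
                self._fire_callbacks("restart")
                await self.kernel_manager.restart_kernel(now=True)
            else:
                # The key difference from today's behavior: 'dead' is sent
                # even though no restart will happen.
                self._fire_callbacks("dead")
                return
```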
vidartf changed the title from "Restarter issues" to "Restarter issues with slow to start kernels" on Oct 27, 2021

vidartf (Contributor, Author) commented Oct 27, 2021

* Ideally, `_initial_startup` would only be toggled after the connections have all been established.

I'm wondering if this assessment is 100% accurate, so it might need to be double-checked.

kevin-bates (Member) commented

Thanks for opening this issue @vidartf - I agree with your assessments.

Aside from tweaking some of the boolean usages (and the incorrect grammar of _restarting), I'm not sure how we can distinguish between the kernel process startup itself (as is the case today) and the startup of a fully-working kernel. Since that is difficult given we're not privy to the kernel's socket usage at this level, it would be nice to at least address the inability to use new ports on initial startup - although I think the same issue is present there (i.e., the setting of _initial_startup to False is premature). I wonder if tracking some number of successful (3-second) iterations would help solve most cases in determining initial startup completion? (Although you could always encounter a slower-starting kernel.)
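
A rough sketch of that idea (hypothetical, not merged code; the StartupTracker name and the threshold of three consecutive polls are assumptions): treat initial startup as complete only after the kernel has been seen alive for several consecutive polls, and reset the count if it dies first so that new random ports can still be picked on the next attempt.

```python
# Hypothetical helper: only consider initial startup finished after the
# kernel has been alive for N consecutive polls, so a single early "alive"
# reading during a slow start does not lock in the original ports.
STABLE_POLLS_REQUIRED = 3  # assumed threshold; tune as needed


class StartupTracker:
    def __init__(self, required=STABLE_POLLS_REQUIRED):
        self.required = required
        self._consecutive_alive = 0
        self.initial_startup_done = False

    def record_poll(self, is_alive: bool) -> None:
        if self.initial_startup_done:
            return
        if is_alive:
            self._consecutive_alive += 1
            if self._consecutive_alive >= self.required:
                self.initial_startup_done = True
        else:
            # The kernel died during startup: reset so random_ports_until_alive
            # can still pick new ports on the next launch attempt.
            self._consecutive_alive = 0
```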

Might there be a way to interleave some of this with the pending kernels changes (and the ready flag)? Hmm - looks like it would have the same issue since ready is marked as done immediately following post-start but the process may still fail with a port issue.
