Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gateway manager retry kernel updates #1256

Merged

Conversation

ojarjur
Copy link
Contributor

@ojarjur ojarjur commented Apr 7, 2023

This change updates the GatewayMappingKernelManager class to double check when it believes kernels may have been culled in the Gateway server.

Previously, if a known kernel was missing from the list of kernels returned by the Gateway server, then the GatewayMappingKernelManager class assumed it was culled. With this change, it will first attempt to fetch
the kernel from the Gateway server and only treat it as culled if this second check also fails.

This increases reliability in the scenario of a Gateway server erroneously returning a partial list of kernels due to things like bugs around race conditions for newly created kernels, or transient network errors when combining results from multiple backends.

The GatewayMappingKernelManager keeps an internal record of all the remote kernels, and periodically syncs it with the Gateway server.

When it finds that a kernel it previously knew about is no longer in the Gateway server's list of kernels, it has to decide how to reconcile that.

Previously, it was assuming that the kernel was probably culled by the Gateway server, and thus removed it from its internal records.

However, it is conceivable that the list from the upstream Gateway server might have been incomplete due to any combination of bugs, race conditions, or transient network connectivity issues, etc.

If one of those such scenarios occurred, then the previous logic would have declared the kernel as lost.

This change makes the GatewayMappingKernelManager more resilient to such issues by double checking whether or not the kernel was actually removed in the upstream Gateway server.

It does this by attempting to update the GatewayKernelManager instance's model before deciding that the kernel has been culled.
This change extends the test_gateway.py suite to simulate kernels
being transiently missing from kernel list responses.

The new test fails without the update to the GatewayMappingKernelManager
to double check if kernels have been culled, and passes with it.
@codecov
Copy link

codecov bot commented Apr 7, 2023

Codecov Report

Patch coverage: 77.77% and project coverage change: +0.01 🎉

Comparison is base (cd8010e) 80.39% compared to head (5c36160) 80.41%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1256      +/-   ##
==========================================
+ Coverage   80.39%   80.41%   +0.01%     
==========================================
  Files          68       68              
  Lines        8263     8270       +7     
  Branches     1600     1601       +1     
==========================================
+ Hits         6643     6650       +7     
+ Misses       1198     1197       -1     
- Partials      422      423       +1     
Impacted Files Coverage Δ
jupyter_server/gateway/managers.py 85.48% <77.77%> (+1.87%) ⬆️

... and 3 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@ojarjur
Copy link
Contributor Author

ojarjur commented Apr 7, 2023

I don't think the @codecov report is correctly accounting for the included test change.

Edit: It changed after I posted this comment. I'm still not 100% sure that it is right, but it does seem to be more believable now.

@ojarjur
Copy link
Contributor Author

ojarjur commented Apr 7, 2023

@kevin-bates FYI, this issue was something I hit while (using my hacky prototype version of) multiplexing local and remote kernels like described in #1187

This might seem like a questionable benefit when using a single gateway server, but if/when we get support for multiplexing between multiple ones I think this will be an important reliability detail.

@kevin-bates
Copy link
Member

Hi @ojarjur,

This might seem like a questionable benefit when using a single gateway server, but if/when we get support for multiplexing between multiple ones I think this will be an important reliability detail.

I think this change makes perfect sense - thank you! I really apologize we haven't made progress on Kernel Servers!

Copy link
Member

@kevin-bates kevin-bates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @ojarjur!

@kevin-bates kevin-bates merged commit 87b2158 into jupyter-server:main Apr 7, 2023
@ojarjur
Copy link
Contributor Author

ojarjur commented Apr 7, 2023

@kevin-bates

I really apologize we haven't made progress on Kernel Servers!

There's nothing to apologize about. Thanks for all of your guidance and feedback!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants