Slow Starting Kernels Proposal #592

Closed
blink1073 opened this issue Oct 14, 2021 · 8 comments · Fixed by #593

Comments

@blink1073
Contributor

Problem

Jupyter Notebook was originally built on the assumption that kernels would start quickly. This turns out
not to be true for some local kernels and most remote kernels.

Proposed Solution

  • We previously proposed changing the REST API to reflect kernels/sessions that were "pending". The downside to a REST API change is that the server would need to advertise capability through a versioned API or some other status, and clients would need to be updated to accommodate the changes.

  • An alternative method is to leave the current REST APIs intact and instead introduce the concept of a "pending" kernel that
    acts like a regular kernel from the client's perspective.

  • A POST to /api/sessions or /api/kernels would create a "pending" kernel and return immediately before starting the kernel.

  • It remains to be seen during implementation what changes need to be made to handlers and managers, but at the very least we will use a scheduled callback to actually start the kernels when we are handling the POST.
    The MappingKernelManager will also need to be updated to handle pending kernels internally in its public methods.

  • We should use the kernel manager to get the kernel id.

  • We need to think about how a kernel's failure to start is reported to the user. Previously, the failure could be returned in the response to the POST; with a pending kernel, the POST has already returned by the time the start fails.

  • We might even be able to add the pending logic to the handlers without needing to affect the managers (e.g., by calling save_state on the managers directly).

  • We might also want to address slow-stopping kernels as part of these changes.
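
As a rough illustration of the bullets above, here is a minimal sketch of a "pending" kernel using plain asyncio. Everything in it is hypothetical (`PendingKernelRegistry`, `create`, `wait_started`, the `reason` field, and the injected `launch` coroutine); it is not the jupyter_server or jupyter_client API. It only shows the shape of the idea: register the kernel, schedule the slow start on the event loop, return the model right away, and record a start failure on the model because the POST response is no longer available.

```python
import asyncio
import uuid


async def slow_launch() -> None:
    # Stand-in for a slow local or remote kernel launch.
    await asyncio.sleep(0.05)


async def failing_launch() -> None:
    # Stand-in for a launch that fails after the POST has returned.
    raise RuntimeError("no capacity")


class PendingKernelRegistry:
    """Toy stand-in for a kernel manager (hypothetical, for illustration)."""

    def __init__(self, launch=slow_launch):
        self._launch = launch
        self.kernels = {}  # kernel_id -> model dict
        self._tasks = {}   # kernel_id -> background start task

    async def _start(self, kernel_id: str) -> None:
        model = self.kernels[kernel_id]
        try:
            await self._launch()
            model["execution_state"] = "idle"
        except Exception as exc:
            # The POST has already returned, so keep the error on the model.
            model["execution_state"] = "dead"
            model["reason"] = str(exc)

    def create(self, name: str = "python3") -> dict:
        """Register the kernel and schedule its start without waiting for it.

        Must be called from inside a running event loop.
        """
        kernel_id = str(uuid.uuid4())
        model = {"id": kernel_id, "name": name, "execution_state": "starting"}
        self.kernels[kernel_id] = model
        loop = asyncio.get_running_loop()
        self._tasks[kernel_id] = loop.create_task(self._start(kernel_id))
        return dict(model)  # snapshot that would go into the POST response

    async def wait_started(self, kernel_id: str) -> None:
        await self._tasks[kernel_id]


async def demo(launch) -> tuple:
    registry = PendingKernelRegistry(launch)
    created = registry.create()
    state_at_post = created["execution_state"]
    await registry.wait_started(created["id"])
    return state_at_post, registry.kernels[created["id"]]["execution_state"]
```

Running `asyncio.run(demo(slow_launch))` shows the model reported as "starting" at POST time and "idle" once the background start completes; with `failing_launch` the model ends up "dead" and carries the error text.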

@echarles
Member

A POST to /api/sessions or /api/kernels would create a "pending" kernel and return immediately before starting the kernel.

Currently, the response to POST /api/sessions (see https://petstore.swagger.io/?url=https://raw.githubusercontent.com/jupyter/jupyter_server/master/jupyter_server/services/api/api.yaml#/sessions/post_api_sessions) looks like this, for example:

{
  "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "path": "string",
  "name": "string",
  "type": "string",
  "kernel": {
    "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
    "name": "string",
    "last_activity": "string",
    "connections": 0,
    "execution_state": "string"
  }
}

Is the intent to add an additional status field with value pending for those cases?

@blink1073
Contributor Author

Thanks for the suggestion about state machines, but I agree that we should make this change as small as possible for consumers. Our thought yesterday was that the kernel should still be considered "starting" from the point of view of the REST client; we're extending the "starting" phase to include starting the process.
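
To make the client's side of this concrete, here is a small sketch of a client that treats "starting" as "not ready yet" and polls the kernel model until the state changes. The helper names (`wait_until_started`, `make_fake_fetch`) are made up for illustration, and the fetch function is injected so the example runs without a server; in practice it would wrap a GET of the kernel model.

```python
import time
from typing import Callable


def wait_until_started(
    fetch_model: Callable[[], dict],
    timeout: float = 5.0,
    interval: float = 0.01,
) -> dict:
    """Poll a kernel model until execution_state is no longer "starting"."""
    deadline = time.monotonic() + timeout
    while True:
        model = fetch_model()
        if model.get("execution_state") != "starting":
            return model
        if time.monotonic() >= deadline:
            raise TimeoutError("kernel is still starting")
        time.sleep(interval)


def make_fake_fetch(polls_before_idle: int) -> Callable[[], dict]:
    """Fake server: reports "starting" for the first N polls, then "idle"."""
    calls = {"n": 0}

    def fetch() -> dict:
        calls["n"] += 1
        state = "starting" if calls["n"] <= polls_before_idle else "idle"
        return {"id": "fake-kernel", "execution_state": state}

    return fetch
```

Since the existing "starting" value is reused, a client written this way would not need to know whether the server supports pending kernels.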

@blink1073
Contributor Author

Following up from a suggestion @vidartf had during the meeting, there is in fact a _starting_kernels property in MultiKernelManager in jupyter_client, but since it only stores a future for the kernel id, we can't use it to create a model for the GET response. An option is to push this logic down to jupyter_client and allow it to use _starting_kernels in more of its public functions. What do folks think?

@blink1073
Contributor Author

blink1073 commented Oct 16, 2021

We could make this an opt-in behavior at the level of MultiKernelManager in jupyter_client: whether to use KernelManager objects before waiting for them to start. For consumers opting into this behavior, we add a KernelManager.ready property that is a future which resolves when the kernel process has started. We may even have that wait until the "nudge" is complete. Then, a consumer like jupyter_server's websocket handler could wait for the ready future before attempting to send/receive messages to the kernel.

When the opt-in behavior is selected, we do not wait for the future in _async_start_kernel and we instead add the kernel to our internal map of kernels immediately.
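
A toy sketch of that opt-in flow, assuming plain asyncio and made-up classes (`ToyKernelManager`, `ToyMultiKernelManager`, and the `use_pending_kernels` flag are illustrative names, not the actual jupyter_client code): with the flag on, the manager is added to the map immediately and consumers await its `ready` future; with the flag off, `start_kernel` waits for the start as before.

```python
import asyncio


class ToyKernelManager:
    """Stand-in for a KernelManager with a `ready` future (illustrative)."""

    def __init__(self):
        # Must be constructed inside a running event loop.
        self.ready = asyncio.get_running_loop().create_future()

    async def start_kernel(self, delay: float = 0.05) -> None:
        try:
            await asyncio.sleep(delay)  # placeholder for the slow launch
        except Exception as exc:
            self.ready.set_exception(exc)
        else:
            self.ready.set_result(None)


class ToyMultiKernelManager:
    def __init__(self, use_pending_kernels: bool = False):
        self.use_pending_kernels = use_pending_kernels
        self._kernels = {}
        self._tasks = set()  # keep references so tasks are not garbage collected

    async def start_kernel(self, kernel_id: str) -> str:
        km = ToyKernelManager()
        if self.use_pending_kernels:
            # Opt-in: register immediately and start in the background.
            self._kernels[kernel_id] = km
            task = asyncio.ensure_future(km.start_kernel())
            self._tasks.add(task)
            task.add_done_callback(self._tasks.discard)
        else:
            # Default: wait for the start before registering the kernel.
            await km.start_kernel()
            self._kernels[kernel_id] = km
        return kernel_id

    def get_kernel(self, kernel_id: str) -> ToyKernelManager:
        return self._kernels[kernel_id]


async def demo(use_pending_kernels: bool) -> tuple:
    mkm = ToyMultiKernelManager(use_pending_kernels=use_pending_kernels)
    kernel_id = await mkm.start_kernel("k1")
    km = mkm.get_kernel(kernel_id)
    was_pending = not km.ready.done()
    await km.ready  # what a websocket handler would do before messaging
    return was_pending, km.ready.done()
```

With the flag on, the kernel is visible in the map while still pending; with it off, `ready` is already resolved by the time `start_kernel` returns.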

@blink1073
Contributor Author

I opened jupyter/jupyter_client#712 to explore the ideas from the previous comment.

@blink1073 blink1073 mentioned this issue Oct 19, 2021
8 tasks
@mlucool

mlucool commented Oct 21, 2021

This is a great idea!

We worked with @Carreau to tackle slow kernels from another angle. For remote kernels, scheduling is a somewhat large fixed cost (which this proposal seems like it'll bring down). Restarting a remote kernel need not be slow since a user typically means "restart my kernel" and not "reschedule me". With https://github.com/Carreau/inplace_restarter, you can run a restart magic to just restart the kernel, which ends up being very fast for this use case.

If others like this idea, maybe this can be included as one way to help tackle the slow starting kernel problem for remote kernels. I understand if you feel this is far enough from the rest of this issue to warrant a separate discussion.

@blink1073
Contributor Author

blink1073 commented Oct 21, 2021

Interesting! Yes, I think that warrants its own discussion. There's also @echarles's recent efforts in https://github.com/datalayer/jupyterpool.

@echarles
Member

Interesting! Yes, I think that warrants its own discussion. There's also @echarles's recent efforts in https://github.com/datalayer/jupyterpool.

The jupyterpool effort came from my frustration as a user at waiting 30s (sometimes more) to get an up-and-running Spark on Hadoop (big data) kernel. BTW, things are much better nowadays in that specific area with faster Spark kernels, but if you extrapolate a bit, you can say: hey, I want a kernel preloaded with that 30TB dataset in a ready-to-use dataframe within a second.

Having ready-to-be-used Jupyter kernels to which a user/notebook can bind is something I am working on, and it is part of making the server more microservice-like, where the security, the code content, the kernel, the datasets... are separate concerns.

Having such a pool of kernels can be simple for Python kernels, but it raises interesting questions in terms of user impersonation when you want to bind user foo to a running kernel and assign that kernel the permissions of user foo (thinking of, e.g., a pod running on a Kubernetes cluster).

To wrap up, I think this specific issue is a great quick win to build a better user interaction (say you show a message to the user like "Your kernel is starting, we'll keep you updated"), but it is just the very first step on a long road that we need to discuss and address in many other issues and PRs.


3 participants