[jobs] allow one jobs controller process to manage many jobs #7051
/quicktest-core
/smoke-test
/smoke-test --kubernetes
/quicktest-core
This reverts commit 6f98299. Co-authored-by: Christopher Cooper <[email protected]>
This is a relic of a previous implementation that would claim the job without its schedule_state being moved back to WAITING.
D: await write(row2)
E: cursor = await conn.execute(read_row2)
F: await cursor.fetchall()
The A -> B -> D -> E -> C time sequence will cause B and D read at the
why not use AsyncAdaptedQueuePool?
At the moment, this is only used for the jobs controller, so I just wanted to get this working. Thanks for the hint though, I'll add a TODO to add this.
/quicktest-core --base-branch v0.9.3
After skypilot-org#7051, job launch will call sky.launch on the API server under the hood. In consolidation mode, this means the sky.launch requests running to manage the jobs may block other jobs.launch requests that are still submitting. You will definitely see this if you are trying to launch significantly more jobs than you have long executors.
…7127)
* [jobs] in consolidation mode, use a short executor for jobs.launch
  After #7051, job launch will call sky.launch on the API server under the hood. In consolidation mode, this means the sky.launch requests running to manage the jobs may block other jobs.launch requests that are still submitting. You will definitely see this if you are trying to launch significantly more jobs than you have long executors.
* add comment
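The blocking described in this commit message can be illustrated with a small standalone sketch. It does not use SkyPilot code: the two thread pools, their sizes, and the `manage_job` / `submit_job` functions are hypothetical stand-ins for the API server's long and short executors, the long-lived sky.launch requests, and the quick jobs.launch submissions.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pool sizes, chosen only for illustration.
long_pool = ThreadPoolExecutor(max_workers=2)
short_pool = ThreadPoolExecutor(max_workers=2)

# Event that keeps the "managing" requests alive until we release them.
release = threading.Event()


def manage_job(job_id: int) -> str:
    """Stand-in for a long-lived sky.launch request that manages a job."""
    release.wait()
    return f"managed-{job_id}"


def submit_job(job_id: int) -> str:
    """Stand-in for a quick jobs.launch submission."""
    return f"submitted-{job_id}"


# Two managing requests occupy every worker in the long pool.
managing = [long_pool.submit(manage_job, i) for i in range(2)]

# A submission routed to the long pool has to wait behind them...
stuck = long_pool.submit(submit_job, 99)
# ...while a submission routed to the short pool finishes right away.
quick = short_pool.submit(submit_job, 100)
quick_result = quick.result(timeout=5)
stuck_done_before_release = stuck.done()

# Once the managing requests end, the queued submission finally runs.
release.set()
stuck_result = stuck.result(timeout=5)
managing_results = [f.result(timeout=5) for f in managing]

long_pool.shutdown()
short_pool.shutdown()
```

In this sketch the submission on the long pool cannot start until a managing request exits, which mirrors the case of launching more jobs than there are long executors; routing submissions to a separate pool avoids that wait.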
Previously, after skypilot-org#7051, we allowed the number of jobs launching/running to scale purely based on the controller resources. After this change, we set maximum values for the number that can run and launch, since there are some bottlenecks (e.g. DB) that will not necessarily scale with the instance resources.
* [jobs] limit the max number of jobs that can run/launch
  Previously, after #7051, we allowed the number of jobs launching/running to scale purely based on the controller resources. After this change, we set maximum values for the number that can run and launch, since there are some bottlenecks (e.g. DB) that will not necessarily scale with the instance resources.
* address comments
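As a rough sketch of the idea in this commit message, a resource-based estimate can be clamped by a fixed ceiling. Every constant and function name below is hypothetical and chosen only for illustration; the actual values and formulas used by the PR are not shown in this conversation.

```python
# Hypothetical ceilings and per-job costs, not taken from the PR.
MAX_RUNNING_JOBS = 2000
MAX_LAUNCHING_JOBS = 64
MEM_PER_JOB_MB = 400
LAUNCHES_PER_CPU = 4


def running_limit(controller_mem_mb: int) -> int:
    """Scale with controller memory, but never exceed the fixed ceiling."""
    resource_based = controller_mem_mb // MEM_PER_JOB_MB
    # Always allow at least one job, even on a very small controller.
    return max(1, min(resource_based, MAX_RUNNING_JOBS))


def launching_limit(controller_cpus: int) -> int:
    """Scale with controller CPUs, but never exceed the fixed ceiling."""
    resource_based = controller_cpus * LAUNCHES_PER_CPU
    return max(1, min(resource_based, MAX_LAUNCHING_JOBS))
```

With these assumed numbers, a small controller is still limited by its own resources, while a very large one stops at the ceiling, reflecting bottlenecks such as the DB that do not grow with the instance.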
Diff on top of the previous PR (#6459): https://github.com/skypilot-org/skypilot/pull/7051/files/79880fa2862d9c3ddfe5323b2520a66094ce7c5d..70f58e11cc2ad5d2bb743466600693e55092f478#diff-53a58c3336a3ceca1c6dbdeb15d780a4c28a00efd0cc9af79096e4e6b79385de
Tested (run the relevant ones):
- bash format.sh
- tests/smoke_tests/backward_compat/test_backward_compat.py::TestBackwardCompatibility::test_managed_jobs - with both current and base branches AFTER this PR (test the upgrade path after this PR for future versions)
- /smoke-test (CI) or pytest tests/test_smoke.py (local)
- /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
- /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)