Conversation

@cg505 cg505 commented Sep 6, 2025

Diff on top of Previous PR (#6459): https://github.com/skypilot-org/skypilot/pull/7051/files/79880fa2862d9c3ddfe5323b2520a66094ce7c5d..70f58e11cc2ad5d2bb743466600693e55092f478#diff-53a58c3336a3ceca1c6dbdeb15d780a4c28a00efd0cc9af79096e4e6b79385de

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • tests/smoke_tests/backward_compat/test_backward_compat.py::TestBackwardCompatibility::test_managed_jobs - run with both the current branch and the base branch after this PR, to test the upgrade path for future versions
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

cg505 commented Sep 6, 2025

/quicktest-core

cg505 commented Sep 8, 2025

/smoke-test
/smoke-test --kubernetes
/quicktest-core

@cg505 cg505 marked this pull request as ready for review September 8, 2025 23:23
cg505 commented Sep 8, 2025

/smoke-test --kubernetes

cg505 commented Sep 8, 2025

/quicktest-core

D: await write(row2)
E: cursor = await conn.execute(read_row2)
F: await cursor.fetchall()
The A -> B -> D -> E -> C time sequence will cause B and D read at the
Contributor commented:
why not use AsyncAdaptedQueuePool?

cg505 (Collaborator Author) replied:

At the moment, this is only used for the jobs controller, so I just wanted to get this working. Thanks for the hint though, I'll add a TODO to add this.
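For context on the pool suggestion above, here is a minimal asyncio sketch of the check-out/check-in idea behind a connection pool such as SQLAlchemy's AsyncAdaptedQueuePool. `FakeConnection`, `ConnectionPool`, and `query` are hypothetical stand-ins for illustration, not SkyPilot or SQLAlchemy code:

```python
import asyncio


class FakeConnection:
    """Illustrative stand-in for an async DB connection (hypothetical;
    real code would use a driver like aiosqlite or asyncpg). A single
    connection must not be used by two coroutines at once."""

    def __init__(self, conn_id: int):
        self.conn_id = conn_id
        self.in_use = False

    async def execute(self, stmt: str) -> str:
        # If two coroutines shared this connection, their execute/fetch
        # steps could interleave, as in the A->B->D->E->C sequence above.
        assert not self.in_use, 'connection shared across coroutines!'
        self.in_use = True
        await asyncio.sleep(0)  # yield to the event loop, as a driver would
        self.in_use = False
        return f'result of {stmt} (conn {self.conn_id})'


class ConnectionPool:
    """Minimal check-out/check-in pool: the core idea behind pools such
    as SQLAlchemy's AsyncAdaptedQueuePool."""

    def __init__(self, size: int):
        self._free: asyncio.Queue = asyncio.Queue()
        for i in range(size):
            self._free.put_nowait(FakeConnection(i))

    async def acquire(self) -> FakeConnection:
        return await self._free.get()  # waits if all connections are busy

    def release(self, conn: FakeConnection) -> None:
        self._free.put_nowait(conn)


async def query(pool: ConnectionPool, stmt: str) -> str:
    conn = await pool.acquire()  # exclusive use while checked out
    try:
        return await conn.execute(stmt)
    finally:
        pool.release(conn)


async def main() -> list:
    pool = ConnectionPool(size=2)
    return await asyncio.gather(
        *(query(pool, f'SELECT {i}') for i in range(5)))


results = asyncio.run(main())
print(len(results))  # 5
```

Because each coroutine checks a connection out for exclusive use, the interleaved read problem cannot occur; a real pool adds sizing, overflow, and connection recycling on top of this idea.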

@cg505 cg505 changed the title [jobs] allow one jobs controller to manage many jobs [jobs] allow one jobs controller process to manage many jobs Sep 9, 2025
@Michaelvll (Collaborator) commented:
/quicktest-core --base-branch v0.9.3

@cg505 cg505 merged commit eafb271 into skypilot-org:master Sep 9, 2025
18 checks passed
cg505 added a commit to cg505/skypilot that referenced this pull request Sep 10, 2025
After skypilot-org#7051, job launch will call sky.launch on the API server
under the hood. In consolidation mode, this means the sky.launch requests
that manage the jobs may block other jobs.launch requests that are still
being submitted. You will definitely see this if you try to launch
significantly more jobs than you have long executors.
cg505 added a commit that referenced this pull request Sep 16, 2025
…7127)

* [jobs] in consolidation mode, use a short executor for jobs.launch

After #7051, job launch will call sky.launch on the API server
under the hood. In consolidation mode, this means the sky.launch requests
that manage the jobs may block other jobs.launch requests that are still
being submitted. You will definitely see this if you try to launch
significantly more jobs than you have long executors.

* add comment
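The fix this commit describes can be sketched generically: keep long-running launch work on one bounded pool and short submission handling on another, so a backlog of launches cannot block new submissions. The names and pool sizes below are illustrative, not SkyPilot's actual executor code:

```python
import concurrent.futures
import time

# Hypothetical split (sizes are illustrative): long-running launch work
# goes to one bounded pool, short submission handling to another.
long_executor = concurrent.futures.ThreadPoolExecutor(max_workers=2)
short_executor = concurrent.futures.ThreadPoolExecutor(max_workers=8)


def launch_job(job_id: int) -> str:
    time.sleep(0.2)  # stand-in for the long-lived sky.launch call
    return f'launched {job_id}'


def submit_job(job_id: int) -> str:
    return f'submitted {job_id}'  # short: just record the request


# Saturate the long pool with more launches than it has workers...
launch_futs = [long_executor.submit(launch_job, i) for i in range(6)]
# ...submissions still run promptly on their own pool.
submit_futs = [short_executor.submit(submit_job, i) for i in range(6)]

submitted = [f.result() for f in submit_futs]
launched = [f.result() for f in launch_futs]
print(len(submitted), len(launched))  # 6 6
```

With a single shared pool, the six queued launches would occupy every worker and the submissions would wait behind them; the split keeps submission latency independent of launch backlog.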
cg505 added a commit to cg505/skypilot that referenced this pull request Oct 6, 2025
Previously, after skypilot-org#7051, we allowed the number of jobs
launching/running to scale purely based on the controller resources.

After this change, we set maximum values for the number of jobs that can
run and launch concurrently, since some bottlenecks (e.g. the DB) will
not necessarily scale with the instance resources.
cg505 added a commit that referenced this pull request Oct 7, 2025
* [jobs] limit the max number of jobs that can run/launch

Previously, after #7051, we allowed the number of jobs
launching/running to scale purely based on the controller resources.

After this change, we set maximum values for the number of jobs that can
run and launch concurrently, since some bottlenecks (e.g. the DB) will
not necessarily scale with the instance resources.

* address comments

5 participants