Collaborator

@cblmemo cblmemo commented Oct 31, 2025

In the jobs controller, the file ~/.sky/job_controller_pid keeps track of the PIDs of all alive controller processes. However, in a remote API server deployment this file is persisted in a PVC. When the API server is upgraded, the controller processes are killed, and the OS may later reuse their PIDs for unrelated processes.

Hence, we need to check whether the command that the PID runs is actually the jobs controller. Otherwise, a stale PID that now belongs to an unrelated process is treated as a live controller, no new job controller is created, and all jobs are stuck in pending.

# This is what a correct jobs controller process cmdline looks like
root@skypilot-api-server-7bbfcd6ff7-whxc6:/# cat /proc/30346/cmdline 
/usr/local/bin/python-u-msky.jobs.controllere74da74f-9c6b-40a8-b78a-91bf0cdbe10eroot@skypilot-api-server-7bbfcd6ff7-whxc
root@skypilot-api-server-7bbfcd6ff7-whxc6:/# cat ~/.sky/job_controller_pid
476
478
480
482
484
# Instead, this is what we see on a buggy server.
root@skypilot-api-server-7bbfcd6ff7-whxc6:/# cat /proc/476/cmdline 
/usr/local/bin/python-cfrom multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=20, pipe_handle=104)--multiprocessing-fork
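
The check described above could be sketched roughly as follows. This is a minimal illustration, not necessarily how this PR implements it: the function names, the `proc_root` parameter, and the choice to read `/proc/<pid>/cmdline` directly are all assumptions made for the example. The only detail taken from the output above is the module name `sky.jobs.controller`. Note that `/proc/<pid>/cmdline` separates arguments with NUL bytes, which is why `cat` prints them run together.

```python
from pathlib import Path
from typing import List

# Module name as it appears in the healthy cmdline shown above.
CONTROLLER_MODULE = 'sky.jobs.controller'


def parse_cmdline(raw: bytes) -> List[str]:
    """Split the NUL-separated contents of /proc/<pid>/cmdline into argv."""
    return [arg.decode(errors='replace') for arg in raw.split(b'\0') if arg]


def is_jobs_controller(argv: List[str]) -> bool:
    """Return True if argv looks like `python ... -m sky.jobs.controller ...`.

    Accepts both `-m sky.jobs.controller` and `-msky.jobs.controller`.
    """
    if '-m' + CONTROLLER_MODULE in argv:
        return True
    # Look for '-m' immediately followed by the module name.
    return any(flag == '-m' and value == CONTROLLER_MODULE
               for flag, value in zip(argv, argv[1:]))


def pid_is_jobs_controller(pid: int, proc_root: str = '/proc') -> bool:
    """Check that `pid` exists and is running the jobs controller.

    `proc_root` is a hypothetical parameter that exists only to make the
    function testable without a real /proc entry.
    """
    try:
        raw = Path(proc_root, str(pid), 'cmdline').read_bytes()
    except OSError:
        # Process is gone (or unreadable): treat it as not a controller.
        return False
    return is_jobs_controller(parse_cmdline(raw))
```

With a check like this, a PID from ~/.sky/job_controller_pid would count as an alive controller only if `pid_is_jobs_controller(pid)` returns `True`. A reused PID, such as the `multiprocessing` spawn worker in the buggy output above, would be rejected.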

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR: Tested on the buggy server and it fixes the problem.
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)
