Collaborator

@cg505 cg505 commented Nov 4, 2025

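This uses the same technique psutil uses to detect PID reuse: a process is identified by its PID together with its creation time, so a matching PID with a different start time is treated as a different process. The check also stays correct across machine reboots.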
psutil code:
https://github.com/giampaolo/psutil/blob/055ad0f1eb87280d32bd33e30fd21405866e83e6/psutil/__init__.py#L362-L395
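
A minimal sketch of the idea, not the code in this PR: a process is identified by its PID plus its creation time, so a reused PID shows up as a creation-time mismatch. `ProcessRecord`, `is_same_process`, and the injected `get_create_time` lookup are hypothetical names used only for illustration. The lookup is passed in so the sketch runs without psutil installed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ProcessRecord:
    """What gets persisted for a controller (hypothetical shape)."""
    pid: int
    started_at: Optional[float] = None  # None for records saved without it


def is_same_process(
        record: ProcessRecord,
        get_create_time: Callable[[int], Optional[float]]) -> bool:
    """Return True if `record` still refers to the process it was saved for.

    `get_create_time(pid)` returns the creation time of the live process
    with that PID, or None if no such process exists.
    """
    current = get_create_time(record.pid)
    if current is None:
        return False  # nothing is running under this PID
    if record.started_at is None:
        return True  # old record: only the PID can be checked
    # Same PID but a different creation time means the PID was reused.
    return current == record.started_at


# Fake process table standing in for the OS: PID -> creation time.
live = {4242: 1762214400.25}
print(is_same_process(ProcessRecord(4242, 1762214400.25), live.get))  # True
print(is_same_process(ProcessRecord(4242, 1762000000.0), live.get))  # False
```

With psutil, the lookup could wrap `psutil.Process(pid).create_time()` and return `None` when `psutil.NoSuchProcess` is raised.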

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • manually killed controllers
    • killed controllers, restarted controller and made sure the PIDs were reused
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@cg505 cg505 requested review from SeungjinYang and cblmemo November 4, 2025 02:31
Collaborator Author

cg505 commented Nov 4, 2025

/smoke-test --managed-jobs

Collaborator Author

cg505 commented Nov 4, 2025

/smoke-test --jobs-consolidation --managed-jobs

Collaborator Author

cg505 commented Nov 4, 2025

/smoke-test --managed-jobs
/smoke-test --jobs-consolidation --managed-jobs

Collaborator Author

cg505 commented Nov 4, 2025

/quicktest-core

Collaborator Author

cg505 commented Nov 4, 2025

/smoke-test --kubernetes

Collaborator

@cblmemo cblmemo left a comment

Thanks for fixing this @cg505! IIUC this can close #7803, as they resolve the same issue. Left some discussion on backward compatibility ;)

except ValueError:
    return None

started_at: typing.Optional[float] = None
Collaborator

Suggested change
- started_at: typing.Optional[float] = None
+ started_at: Optional[float] = None

entry = entry.strip()
if not entry:
    return None
raw_pid, _, raw_started_at = entry.partition(',')
Collaborator

Can we add a comment describing the format of entry here? Also, why use partition rather than split?
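
On the format question, here is a hedged sketch of a parser built around the diff fragment above. It assumes an entry looks like `pid,started_at`, with older entries holding only `pid`; that format is inferred from the snippet, and `ControllerProcessRecord` and `parse_entry` are hypothetical names. It also shows one practical difference from `split`: `partition` always returns three parts, so an entry without a comma still unpacks cleanly.

```python
from typing import NamedTuple, Optional


class ControllerProcessRecord(NamedTuple):
    pid: int
    started_at: Optional[float] = None


def parse_entry(entry: str) -> Optional[ControllerProcessRecord]:
    """Parse 'pid' or 'pid,started_at' (assumed format); None if invalid."""
    entry = entry.strip()
    if not entry:
        return None
    # partition() always yields a 3-tuple; with no comma present the
    # last two parts are empty strings, so no length check is needed.
    raw_pid, _, raw_started_at = entry.partition(',')
    try:
        pid = int(raw_pid)
        started_at = (float(raw_started_at)
                      if raw_started_at.strip() else None)
    except ValueError:
        return None
    return ControllerProcessRecord(pid, started_at)


print(parse_entry('1234,1762214400.25'))
print(parse_entry('1234'))      # entry without a start time
print(parse_entry('not-a-pid'))  # malformed entry
```

By contrast, `'1234'.split(',')` returns a one-element list, so unpacking it into two names would raise `ValueError` unless the length is checked first.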

Comment on lines 246 to 252
process = psutil.Process(record.pid)
if record.started_at is not None:
    if process.create_time() != record.started_at:
        logger.debug(f'Controller process {record.pid} has started '
                     f'at {record.started_at} but process has '
                     f'started at {process.create_time()}')
        continue
Collaborator

Can we also check that the cmdline contains "controller", like #7803? I think started_at should be sufficient after this PR, but for backward compatibility this might be an issue?

"""
controller_pid = state.get_job_controller_pid(job_id)
if controller_pid is not None:
controller_process = state.get_job_controller_pid(job_id)
Collaborator

Suggested change
- controller_process = state.get_job_controller_pid(job_id)
+ controller_process = state.get_job_controller_process(job_id)

@cg505 cg505 requested a review from cblmemo November 7, 2025 00:21
Collaborator Author

cg505 commented Nov 7, 2025

/smoke-test

Collaborator Author

cg505 commented Nov 7, 2025

/quicktest-core --base-branch v0.10.3

Collaborator Author

cg505 commented Nov 7, 2025

/smoke-test --kubernetes
/quicktest-core --base-branch v0.10.3

Collaborator

@cblmemo cblmemo left a comment

Thanks for fixing this @cg505! Spotted several suspicious places; could you help confirm?

try:
    waiting_job = await managed_job_state.get_waiting_job_async(
-       pid=-os.getpid())
+       pid=self._pid, pid_started_at=self._pid_started_at)
Collaborator

Seems like we originally used a negative pid, but now it is always positive. Is this expected? Does any backward-compatibility handling need to be done?

Collaborator Author

Backwards compatibility is handled in the other places in the PR that use the PID.

Comment on lines +1201 to +1205
if pid < 0:
    # Between #7051 and #7847, the controller pid was negative to
    # indicate a non-legacy multi-job controller process.
    return False
return True
Collaborator

Suggested change
- if pid < 0:
-     # Between #7051 and #7847, the controller pid was negative to
-     # indicate a non-legacy multi-job controller process.
-     return False
- return True
+ if pid < 0:
+     # Between #7051 and #7847, the controller pid was negative to
+     # indicate a non-legacy multi-job controller process.
+     return True
+ return False

I think the two bools are inverted..? A negative pid should indicate legacy?

Collaborator Author

"Legacy" here means "before #7051", so a negative pid is not legacy.
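
To make the sign convention concrete, here is a rough sketch of how a stored value could be classified. The `pid < 0` branch mirrors the diff snippet above. The `started_at` branch is an assumption (that only records written after #7847 carry a start time), and both function names are hypothetical.

```python
from typing import Optional


def is_legacy_controller_process(pid: int,
                                 started_at: Optional[float] = None) -> bool:
    """True only for records written before #7051 (hypothetical helper)."""
    if started_at is not None:
        # ASSUMPTION: only records written after #7847 carry a start time.
        return False
    if pid < 0:
        # Between #7051 and #7847 the stored pid was negated
        # (the old call site passed pid=-os.getpid()), so not legacy.
        return False
    return True


def os_pid(pid: int) -> int:
    """Recover the real OS PID from a stored value of either sign."""
    return abs(pid)


print(is_legacy_controller_process(1234))             # True
print(is_legacy_controller_process(-1234))            # False
print(is_legacy_controller_process(1234, 1762214400.25))  # False
print(os_pid(-1234))                                  # 1234
```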

Collaborator Author

cg505 commented Nov 13, 2025

/quicktest-core
/quicktest-core --base-branch v0.10.3
/quicktest-core --base-branch v0.9.3

Collaborator Author

cg505 commented Nov 13, 2025

/smoke-test --managed-jobs
/smoke-test --managed-jobs --kubernetes

@cg505 cg505 requested a review from cblmemo November 13, 2025 23:52
Collaborator Author

cg505 commented Nov 14, 2025

/smoke-test

Collaborator

@cblmemo cblmemo left a comment

Thanks for fixing this @cg505 ! LGTM.

@cg505 cg505 merged commit b765b33 into skypilot-org:master Nov 14, 2025
25 of 26 checks passed
@coopslarhette
Contributor

woohoo thanks for the fix guys!
