-
Notifications
You must be signed in to change notification settings - Fork 840
[jobs] record process timestamp to protect against reboot/pid reuse #7847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
/smoke-test --managed-jobs |
|
/smoke-test --jobs-consolidation --managed-jobs |
|
/smoke-test --managed-jobs |
|
/quicktest-core |
|
/smoke-test --kubernetes |
cblmemo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sky/jobs/scheduler.py
Outdated
| except ValueError: | ||
| return None | ||
|
|
||
| started_at: typing.Optional[float] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| started_at: typing.Optional[float] = None | |
| started_at: Optional[float] = None |
sky/jobs/scheduler.py
Outdated
| entry = entry.strip() | ||
| if not entry: | ||
| return None | ||
| raw_pid, _, raw_started_at = entry.partition(',') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we have a comment for the format of entry here? also, why use partition (but not split)?
sky/jobs/scheduler.py
Outdated
| process = psutil.Process(record.pid) | ||
| if record.started_at is not None: | ||
| if process.create_time() != record.started_at: | ||
| logger.debug(f'Controller process {record.pid} has started ' | ||
| f'at {record.started_at} but process has ' | ||
| f'started at {process.create_time()}') | ||
| continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we also check the cmdline contains "controller" like #7803 ? i think started at should be sufficient after this pr. but for backward compatibility this might be an issue?
sky/jobs/scheduler.py
Outdated
| """ | ||
| controller_pid = state.get_job_controller_pid(job_id) | ||
| if controller_pid is not None: | ||
| controller_process = state.get_job_controller_pid(job_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| controller_process = state.get_job_controller_pid(job_id) | |
| controller_process = state.get_job_controller_process(job_id) |
|
/smoke-test |
|
/quicktest-core --base-branch v0.10.3 |
|
/smoke-test --kubernetes |
cblmemo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing this @cg505! spotted several suspicious places. Could you help confirm?
| try: | ||
| waiting_job = await managed_job_state.get_waiting_job_async( | ||
| pid=-os.getpid()) | ||
| pid=self._pid, pid_started_at=self._pid_started_at) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like we originally use negative pid but now it is all positive. is this expected? any backward compatibility that needs to be done?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Backwards compatibility is handled in the other places in the PR that use the PID.
| if pid < 0: | ||
| # Between #7051 and #7847, the controller pid was negative to | ||
| # indicate a non-legacy multi-job controller process. | ||
| return False | ||
| return True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if pid < 0: | |
| # Between #7051 and #7847, the controller pid was negative to | |
| # indicate a non-legacy multi-job controller process. | |
| return False | |
| return True | |
| if pid < 0: | |
| # Between #7051 and #7847, the controller pid was negative to | |
| # indicate a non-legacy multi-job controller process. | |
| return True | |
| return False |
i think the two bool is reverted..? negative pid should indicate legacy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Legacy" here means "before #7051". So negative is not legacy
|
/quicktest-core |
|
/smoke-test --managed-jobs |
|
/smoke-test |
cblmemo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing this @cg505 ! LGTM.
|
woohoo thanks for the fix guys! |
This uses the same technique used by psutil to notice when a PID has been reused. It is also resilient across machine reboots.
psutil code:
https://github.com/giampaolo/psutil/blob/055ad0f1eb87280d32bd33e30fd21405866e83e6/psutil/__init__.py#L362-L395
Tested (run the relevant ones):
bash format.sh/smoke-test(CI) orpytest tests/test_smoke.py(local)/smoke-test -k test_name(CI) orpytest tests/test_smoke.py::test_name(local)/quicktest-core(CI) orpytest tests/smoke_tests/test_backward_compat.py(local)