
[Core] Avoid job scheduling race condition #4310

Merged: 4 commits merged into master from job-race on Nov 10, 2024
Conversation

@Michaelvll (Collaborator) commented Nov 9, 2024

This PR fixes another race condition in job scheduling, in addition to the one addressed in #4264.

The pending jobs should be queried inside the pending loop; otherwise the same job can be submitted to ray job twice under the following conditions (a sketch of the fix follows the list):

  1. Two threads call schedule_step() and both fetch the list of pending jobs.
  2. The first thread submits the first pending job.
  3. The second thread looks at its stale pending-job list, sees the first job as not yet submitted, and submits the job again.
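A minimal, self-contained sketch of the idea, not the actual SkyPilot scheduler code: the in-memory job list, the lock, and _submit_to_ray below are stand-ins for the real jobs table, the scheduler's locking, and the ray job submission call.

import threading

# Toy stand-ins for illustration only: the real scheduler reads pending
# jobs from the SkyPilot jobs database and submits them via `ray job`.
_pending_jobs = [288, 289, 290]
_schedule_lock = threading.Lock()

def _submit_to_ray(job_id: int) -> None:
    print(f'submitting job {job_id}')

def schedule_step() -> None:
    """Submit pending jobs one at a time.

    The pending-job list is consulted inside the loop and under a lock,
    so a concurrent schedule_step() call never acts on a stale snapshot
    and never double-submits a job.
    """
    while True:
        with _schedule_lock:
            if not _pending_jobs:          # Re-checked every iteration.
                return
            job_id = _pending_jobs.pop(0)  # Atomically claim the job.
        _submit_to_ray(job_id)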

Tested with 23 jobs submitted in parallel, a bit more than in #4264:

290  sky-cmd  8 mins ago      -               -         1x[CPU:1+]  PENDING    ~/sky_logs/sky-2024-11-09-04-39-25-046022  
289  sky-cmd  8 mins ago      a few secs ago  2s        1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-23-183791  
288  sky-cmd  8 mins ago      a few secs ago  5s        1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-22-702691  
287  sky-cmd  8 mins ago      a few secs ago  7s        1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-22-167590  
286  sky-cmd  8 mins ago      a few secs ago  10s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-21-982999  
285  sky-cmd  8 mins ago      12 secs ago     12s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-21-824679  
284  sky-cmd  8 mins ago      15 secs ago     15s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-20-840780  
283  sky-cmd  8 mins ago      18 secs ago     18s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-19-944765  
282  sky-cmd  8 mins ago      20 secs ago     20s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-18-955162  
281  sky-cmd  8 mins ago      23 secs ago     23s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-17-037321  
280  sky-cmd  8 mins ago      25 secs ago     25s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-16-274941  
279  sky-cmd  8 mins ago      28 secs ago     28s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-15-639855  
278  sky-cmd  8 mins ago      30 secs ago     30s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-15-495491  
277  sky-cmd  8 mins ago      33 secs ago     33s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-15-271224  
276  sky-cmd  8 mins ago      36 secs ago     36s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-14-332769  
275  sky-cmd  8 mins ago      38 secs ago     38s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-13-681088  
274  sky-cmd  8 mins ago      41 secs ago     41s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-13-161247  
273  sky-cmd  8 mins ago      44 secs ago     44s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-11-060571  
272  sky-cmd  8 mins ago      46 secs ago     46s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-10-313379  
271  sky-cmd  8 mins ago      49 secs ago     49s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-09-829889  
270  sky-cmd  8 mins ago      52 secs ago     52s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-08-895334  
269  sky-cmd  8 mins ago      54 secs ago     54s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-08-861447  
268  sky-cmd  8 mins ago      57 secs ago     57s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-08-065965  
267  sky-cmd  8 mins ago      59 secs ago     59s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-07-076732  
266  sky-cmd  8 mins ago      1 min ago       1m        1x[CPU:1+]  SUCCEEDED  ~/sky_logs/sky-2024-11-09-04-39-06-624532

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@cblmemo (Collaborator) left a comment:

Thanks for identifying this @Michaelvll ! LGTM

@cblmemo (Collaborator) commented Nov 9, 2024

Actually, do you think we should add a job-specific lock for all job-related actions?

@Michaelvll (Collaborator, Author) replied:

> Actually, do you think we should add a job-specific lock for all job-related actions?

Do you have any examples where we should add the lock?

@cblmemo (Collaborator) commented Nov 9, 2024

> Do you have any examples where we should add the lock?

e.g. in this function we fetch all of the jobs' information at once and then cancel them one by one. I am not sure whether some of that job information can become stale, but adding a lock in every such place looks safer to me:

def cancel_jobs_encoded_results(jobs: Optional[List[int]],
                                cancel_all: bool = False) -> str:
    """Cancel jobs.

    Args:
        jobs: Job IDs to cancel. (See `cancel_all` for special semantics.)
        cancel_all: Whether to cancel all jobs. If True, asserts `jobs` is
            set to None. If False and `jobs` is None, cancel the latest
            running job.

    Returns:
        Encoded job IDs that are actually cancelled. Caller should use
        common_utils.decode_payload() to parse.
    """

This is the only one I can find, but I am not sure if I missed any place.
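For illustration, a minimal sketch of what such a per-job lock could look like, using the filelock package; cancel_job_with_lock, get_status, and do_cancel are hypothetical names standing in for the real SkyPilot helpers:

from typing import Callable

import filelock

def cancel_job_with_lock(
    job_id: int,
    get_status: Callable[[int], str],
    do_cancel: Callable[[int], None],
    lock_dir: str = '/tmp',
) -> bool:
    """Cancel one job while holding a per-job file lock.

    The job status is re-read under the lock, so the decision is never
    based on a snapshot taken before the lock was acquired.
    """
    lock_path = f'{lock_dir}/.sky_job_{job_id}.lock'
    with filelock.FileLock(lock_path):
        if get_status(job_id) in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            return False  # Already in a terminal state; nothing to cancel.
        do_cancel(job_id)
        return True

A caller would loop over the requested job IDs and invoke this once per job, so each cancellation decision sees fresh state.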

@Michaelvll (Collaborator, Author) replied:

> This is the only one I can find, but I am not sure if I missed any place.

Ahh, good catch! I fixed that in #4318, but did not adopt it in this one. Let me do it now.

@cblmemo (Collaborator) commented Nov 9, 2024

> Ahh, good catch! I fixed that in #4318, but did not adopt it in this one. Let me do it now.

Thanks! LGTM.

@Michaelvll added this pull request to the merge queue on Nov 9, 2024.
Merged via the queue into master with commit 2944014 on Nov 10, 2024; 20 checks passed.
@Michaelvll deleted the job-race branch on November 10, 2024 at 00:01.