-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core] Guard concurrent access to generator IDs with a mutex #50740
Conversation
714f2fd
to
b875ecc
Compare
This reverts commit 7fae1f9. Signed-off-by: Edward Oakes <[email protected]>
…n Python thread (ray-project#49547)" This reverts commit 41d15db. Signed-off-by: Edward Oakes <[email protected]>
… non-main Python thread (ray-project#49547)"" This reverts commit 97a0127. Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
release/release_tests.yaml
Outdated
- __suffix__: aws0 | ||
- __suffix__: aws1 | ||
- __suffix__: aws2 | ||
- __suffix__: aws3 | ||
- __suffix__: aws4 | ||
- __suffix__: aws5 | ||
- __suffix__: aws6 | ||
- __suffix__: aws7 | ||
- __suffix__: aws8 | ||
- __suffix__: aws9 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will remove this before merging, it's there for verifying that it fixes the issue :)
@stephanie-wang PTAL Also -- is there a difference between an "object ref stream" and an "object ref generator"? They seem to be used mostly interchangeably in the code which was a bit confusing. For example: |
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
src/ray/core_worker/core_worker.h
Outdated
ABSL_GUARDED_BY(mu_); | ||
|
||
// Additionally tracks the sub-states of RUNNING_IN_RAY_GET/WAIT. The counters here | ||
// overlap with those of counter_. | ||
CounterMap<std::pair<std::string, bool>> running_in_get_counter_ ABSL_GUARDED_BY(&mu_); | ||
CounterMap<std::pair<std::string, bool>> running_in_wait_counter_ ABSL_GUARDED_BY(&mu_); | ||
CounterMap<std::pair<std::string, bool>> running_in_get_counter_ ABSL_GUARDED_BY(mu_); | ||
CounterMap<std::pair<std::string, bool>> running_in_wait_counter_ ABSL_GUARDED_BY(mu_); | ||
|
||
std::string job_id_ ABSL_GUARDED_BY(&mu_); | ||
std::string job_id_ ABSL_GUARDED_BY(mu_); | ||
// Used for actor state tracking. | ||
std::string actor_name_ ABSL_GUARDED_BY(&mu_); | ||
int64_t num_tasks_running_ ABSL_GUARDED_BY(&mu_) = 0; | ||
std::string actor_name_ ABSL_GUARDED_BY(mu_); | ||
int64_t num_tasks_running_ ABSL_GUARDED_BY(mu_) = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just removed &
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Release tests passed 20/20: https://buildkite.com/ray-project/release/builds/33907 |
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
…ject#50740) The serve microbenchmark has been sporadically failing due to memory corruption issues (see the linked issue). One of the tracebacks captured pointed to the fact that the `deleted_generator_ids_` map was being accessed concurrently by multiple threads. Fixed by adding a mutex. Verified that it at least dramatically reduces the frequency of the crashes. I've also renamed a few fields for clarity. ## Related issue number ray-project#50802 --------- Signed-off-by: Edward Oakes <[email protected]>
…ject#50740) The serve microbenchmark has been sporadically failing due to memory corruption issues (see the linked issue). One of the tracebacks captured pointed to the fact that the `deleted_generator_ids_` map was being accessed concurrently by multiple threads. Fixed by adding a mutex. Verified that it at least dramatically reduces the frequency of the crashes. I've also renamed a few fields for clarity. ray-project#50802 --------- Signed-off-by: Edward Oakes <[email protected]>
Cherry-pick #50740, fixes some serve test flakiness Signed-off-by: Edward Oakes <[email protected]>
Why are these changes needed?
The serve microbenchmark has been sporadically failing due to memory corruption issues (see the linked issue). One of the tracebacks captured pointed to the fact that the
deleted_generator_ids_
map was being accessed concurrently by multiple threads. Fixed by adding a mutex.Verified that it at least dramatically reduces the frequency of the crashes.
I've also renamed a few fields for clarity.
Related issue number
#50802