Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Guard concurrent access to generator IDs with a mutex #50740

Merged
merged 26 commits into from
Feb 23, 2025

Conversation

edoakes
Copy link
Contributor

@edoakes edoakes commented Feb 19, 2025

Why are these changes needed?

The serve microbenchmark has been sporadically failing due to memory corruption issues (see the linked issue). One of the tracebacks captured pointed to the fact that the deleted_generator_ids_ map was being accessed concurrently by multiple threads. Fixed by adding a mutex.

Verified that it at least dramatically reduces the frequency of the crashes.

I've also renamed a few fields for clarity.

Related issue number

#50802

@edoakes edoakes force-pushed the eoakes/test branch 2 times, most recently from 714f2fd to b875ecc Compare February 22, 2025 13:40
@edoakes edoakes added the go add ONLY when ready to merge, run all tests label Feb 22, 2025
This reverts commit 7fae1f9.

Signed-off-by: Edward Oakes <[email protected]>
…n Python thread (ray-project#49547)"

This reverts commit 41d15db.

Signed-off-by: Edward Oakes <[email protected]>
… non-main Python thread (ray-project#49547)""

This reverts commit 97a0127.

Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
@edoakes edoakes changed the title [WIP] bisect segfault [core] Guard concurrent access to generator IDs with a mutex Feb 22, 2025
@edoakes edoakes requested a review from a team February 22, 2025 15:42
Comment on lines 2059 to 2068
- __suffix__: aws0
- __suffix__: aws1
- __suffix__: aws2
- __suffix__: aws3
- __suffix__: aws4
- __suffix__: aws5
- __suffix__: aws6
- __suffix__: aws7
- __suffix__: aws8
- __suffix__: aws9
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove this before merging, it's there for verifying that it fixes the issue :)

@edoakes
Copy link
Contributor Author

edoakes commented Feb 22, 2025

@stephanie-wang PTAL

Also -- is there a difference between an "object ref stream" and an "object ref generator"? They seem to be used mostly interchangeably in the code which was a bit confusing. For example: CoreWorker::AsyncDelObjectRefStream(const ObjectID &generator_id)

Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Comment on lines 125 to 135
ABSL_GUARDED_BY(mu_);

// Additionally tracks the sub-states of RUNNING_IN_RAY_GET/WAIT. The counters here
// overlap with those of counter_.
CounterMap<std::pair<std::string, bool>> running_in_get_counter_ ABSL_GUARDED_BY(&mu_);
CounterMap<std::pair<std::string, bool>> running_in_wait_counter_ ABSL_GUARDED_BY(&mu_);
CounterMap<std::pair<std::string, bool>> running_in_get_counter_ ABSL_GUARDED_BY(mu_);
CounterMap<std::pair<std::string, bool>> running_in_wait_counter_ ABSL_GUARDED_BY(mu_);

std::string job_id_ ABSL_GUARDED_BY(&mu_);
std::string job_id_ ABSL_GUARDED_BY(mu_);
// Used for actor state tracking.
std::string actor_name_ ABSL_GUARDED_BY(&mu_);
int64_t num_tasks_running_ ABSL_GUARDED_BY(&mu_) = 0;
std::string actor_name_ ABSL_GUARDED_BY(mu_);
int64_t num_tasks_running_ ABSL_GUARDED_BY(mu_) = 0;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just removed &

Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
@edoakes
Copy link
Contributor Author

edoakes commented Feb 23, 2025

Release tests passed 20/20: https://buildkite.com/ray-project/release/builds/33907

Signed-off-by: Edward Oakes <[email protected]>
@edoakes edoakes enabled auto-merge (squash) February 23, 2025 14:41
Signed-off-by: Edward Oakes <[email protected]>
@github-actions github-actions bot disabled auto-merge February 23, 2025 14:49
@edoakes edoakes enabled auto-merge (squash) February 23, 2025 14:51
@edoakes edoakes merged commit d5b51d9 into ray-project:master Feb 23, 2025
6 checks passed
edoakes added a commit to edoakes/ray that referenced this pull request Feb 23, 2025
…ject#50740)

The serve microbenchmark has been sporadically failing due to memory
corruption issues (see the linked issue). One of the tracebacks captured
pointed to the fact that the `deleted_generator_ids_` map was being
accessed concurrently by multiple threads. Fixed by adding a mutex.

Verified that it at least dramatically reduces the frequency of the
crashes.

I've also renamed a few fields for clarity.

## Related issue number

ray-project#50802

---------

Signed-off-by: Edward Oakes <[email protected]>
edoakes added a commit to edoakes/ray that referenced this pull request Feb 23, 2025
…ject#50740)

The serve microbenchmark has been sporadically failing due to memory
corruption issues (see the linked issue). One of the tracebacks captured
pointed to the fact that the `deleted_generator_ids_` map was being
accessed concurrently by multiple threads. Fixed by adding a mutex.

Verified that it at least dramatically reduces the frequency of the
crashes.

I've also renamed a few fields for clarity.

ray-project#50802

---------

Signed-off-by: Edward Oakes <[email protected]>
aslonnie pushed a commit that referenced this pull request Feb 23, 2025
Cherry-pick #50740, fixes some serve test flakiness


Signed-off-by: Edward Oakes <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants