[core] Minor cpp changes around core worker #48262
Signed-off-by: dayshah <[email protected]>
    auto remaining_timeout_ms = timeout_ms;
    auto timeout_timestamp = current_time_ms() + timeout_ms;
    while (!is_ready_) {
      // TODO (dayshah): see if using cv condition function instead of busy while helps.
Is this still relevant?
Yes, I think it could still be relevant. I'm pretty sure using `cv.wait_for(lock, timeout, [this]() { return is_ready_; });` could lead to less CPU usage than the busy while loop, but I want to look more into how it would affect performance.
OK, let's do it in another PR, if needed.
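For reference, a minimal sketch of the predicate form of `wait_for` discussed above. Only `is_ready_` comes from the diff; the `ReadyFlag` class, `WaitFor`, and `MarkReady` are hypothetical names for illustration, not the actual Ray code. Note that the predicate states the condition being waited for, so it returns `is_ready_` rather than its negation.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the class under review; only is_ready_ is from the diff.
class ReadyFlag {
 public:
  // Returns true if the flag became ready before the timeout expired.
  bool WaitFor(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mutex_);
    // wait_for re-checks the predicate after every wakeup, so spurious wakeups
    // and the remaining-timeout bookkeeping are handled by the library.
    return cv_.wait_for(lock, timeout, [this] { return is_ready_; });
  }

  void MarkReady() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      is_ready_ = true;
    }
    cv_.notify_all();
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool is_ready_ = false;
};

// Usage check: an unset flag times out, a set flag returns immediately.
inline void ReadyFlagExample() {
  ReadyFlag pending;
  assert(!pending.WaitFor(std::chrono::milliseconds(10)));

  ReadyFlag done;
  done.MarkReady();
  assert(done.WaitFor(std::chrono::milliseconds(10)));
}
```

The predicate overload behaves like a loop around the plain wait, so whether it actually reduces CPU usage here would still need to be measured.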
Test failures in C++ and Python.
      RAY_LOG(INFO) << "Cancelling a task: " << task_spec.TaskId()
                    << " force_kill: " << force_kill << " recursive: " << recursive;
    - const SchedulingKey scheduling_key(
    + SchedulingKey scheduling_key(
Why was `const` removed?
So it can be moved into the function capture below.
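A small sketch of why `const` gets in the way here: `std::move` on a `const` object yields a `const T&&`, which cannot bind to the move constructor, so the capture silently copies. The `Key` type below is a hypothetical stand-in for `SchedulingKey` that just records how it was constructed.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical type that records whether it was copy- or move-constructed.
struct Key {
  std::string name;
  bool was_copied = false;
  bool was_moved = false;
  explicit Key(std::string n) : name(std::move(n)) {}
  Key(const Key &other) : name(other.name), was_copied(true) {}
  Key(Key &&other) noexcept : name(std::move(other.name)), was_moved(true) {}
};

// const source: std::move(key) is const Key&&, so the copy constructor is chosen.
inline bool CaptureFromConstCopies() {
  const Key key("a");
  auto fn = [k = std::move(key)] { return k.was_copied; };
  return fn();
}

// non-const source: std::move(key) is Key&&, so the move constructor is chosen.
inline bool CaptureFromNonConstMoves() {
  Key key("a");
  auto fn = [k = std::move(key)] { return k.was_moved; };
  return fn();
}

inline void ConstMoveExample() {
  assert(CaptureFromConstCopies());
  assert(CaptureFromNonConstMoves());
}
```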
    if (!keep_executing) {
      RAY_UNUSED(task_finisher_->FailOrRetryPendingTask(
          task_spec.TaskId(), rpc::ErrorType::TASK_CANCELLED, nullptr));
    RequestNewWorkerIfNeeded(scheduling_key);
We removed a call to FailOrRetryPendingTask here. Is this expected?
Yes, this is the call that handled the case where the task has been removed and its dependencies have only just been resolved. We removed it because the task is now failed earlier, at the point where cancel is actually called.
jjyao left a comment
In the PR description, could you add details on how the hang happens? Something like #47861.
      << "Cancel an actor task";
      CancelActorTaskOnExecutor(
    -     caller_worker_id, task_id, force_kill, recursive, on_cancel_callback);
    +     caller_worker_id, task_id, force_kill, recursive, std::move(on_cancel_callback));
Do we need to move since the parameter is const &?
Yes, my bad, I left it in even after changing the parameter.
Wait, actually CancelActorTaskOnExecutor takes it by value; CancelTaskOnExecutor takes it by const ref.
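To illustrate the distinction in this thread, here is a sketch that counts copies when passing to a by-value parameter versus a `const &` parameter. `Callback`, `TakeByValue`, and `TakeByConstRef` are hypothetical stand-ins, not the real signatures of the two functions named above.

```cpp
#include <cassert>
#include <utility>

// Hypothetical callback type that counts how often it is copied.
struct Callback {
  int *copies;
  explicit Callback(int *counter) : copies(counter) {}
  Callback(const Callback &other) : copies(other.copies) { ++*copies; }
  Callback(Callback &&other) noexcept : copies(other.copies) {}
};

// By-value sink: the argument is copied or moved into the parameter.
inline void TakeByValue(Callback cb) { (void)cb; }
// const-ref parameter: the argument is only bound, never copied or moved.
inline void TakeByConstRef(const Callback &cb) { (void)cb; }

inline int CopiesByValue(bool use_move) {
  int copies = 0;
  Callback cb(&copies);
  if (use_move) {
    TakeByValue(std::move(cb));  // move constructor, no copy
  } else {
    TakeByValue(cb);  // copy constructor, one copy
  }
  return copies;
}

inline int CopiesByConstRef() {
  int copies = 0;
  Callback cb(&copies);
  TakeByConstRef(std::move(cb));  // std::move changes nothing here
  return copies;
}

inline void PassingExample() {
  assert(CopiesByValue(false) == 1);
  assert(CopiesByValue(true) == 0);
  assert(CopiesByConstRef() == 0);
}
```

So the `std::move` pays off only at the call site whose callee takes the callback by value.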
    auto it = submissible_tasks_.find(task_id);
    RAY_CHECK(it != submissible_tasks_.end())
        << "Tried to fail task that was not pending " << task_id;
    RAY_CHECK(it->second.IsPending())
Why change this? The function name indicates that it operates on a pending task.
The issue was that a Ray Data backpressure test and a Train test failed here. It seems the task cancel can happen after the task finishes, so by the time it acquires the mutex in FailPendingTask the task status is already finished. The change here is basically to no-op if the task has already finished at this point.
IsPending checks that the status is neither failed nor finished; here we only want to check for failed, to make sure we're not double failing.
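A rough sketch of the behavior described above, under the assumption that a task has three states. `TaskStatus`, `TaskEntry`, and `FailTaskUnlessFinished` are hypothetical names for illustration; the real code uses `RAY_CHECK` and the task manager's own types.

```cpp
#include <cassert>

// Hypothetical status enum and entry type; not the actual Ray definitions.
enum class TaskStatus { kPending, kFinished, kFailed };

struct TaskEntry {
  TaskStatus status = TaskStatus::kPending;
  // Mirrors the description: pending means neither finished nor failed.
  bool IsPending() const {
    return status != TaskStatus::kFinished && status != TaskStatus::kFailed;
  }
};

// Returns true if this call moved the task to the failed state.
inline bool FailTaskUnlessFinished(TaskEntry &entry) {
  // Failing a task twice is still treated as a bug.
  assert(entry.status != TaskStatus::kFailed);
  if (entry.status == TaskStatus::kFinished) {
    // The cancel lost the race with task completion: nothing to do.
    return false;
  }
  entry.status = TaskStatus::kFailed;
  return true;
}
```

Checking only for the failed state keeps the double-fail guard while letting a late cancel on a finished task pass through harmlessly.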
Updated the PR description to list the 3 main changes and why they're needed.
      /// The maximum number of requests in flight per client.
    - const int64_t kMaxBytesInFlight = 16 * 1024 * 1024;
    + constexpr int64_t kMaxBytesInFlight = 16L * 1024 * 1024;
Randomly found this PR, two things:
- We should have a human-readable unit library, something like `16_MiB`;
- `inline` is needed if we declare constants in a header file
  - The C++ spec doesn't guide the behavior on `constexpr`, when included by multiple translation units, whether the symbol appears once or multiple times, so it's a compiler-based UB
Do you think it looks ok to you? #48638
Yes, but let's not block this PR. We can always update later.
inline constexpr is needed, yes
Yes, good point, inlined. We can integrate it into the library afterwards; there are other instances like this that can be handled at once.
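A sketch of what the two suggestions could look like together, assuming C++17. The `_MiB` literal below is a hypothetical illustration and not necessarily what #48638 defines. As far as I understand, a namespace-scope `constexpr` variable without `inline` has internal linkage, so each translation unit that includes the header gets its own copy; adding `inline` makes it a single entity across translation units.

```cpp
#include <cstdint>

// Hypothetical user-defined literal for mebibytes (illustration only).
constexpr std::uint64_t operator""_MiB(unsigned long long n) {
  return static_cast<std::uint64_t>(n) * 1024 * 1024;
}

// In a header: inline (C++17) gives one definition shared by all translation units.
inline constexpr std::int64_t kMaxBytesInFlight = 16_MiB;

// Same value as the original 16L * 1024 * 1024.
static_assert(kMaxBytesInFlight == 16LL * 1024 * 1024, "16 MiB in bytes");
```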
Signed-off-by: Dhyey Shah <[email protected]>
      if (spec->TaskId() == task_spec.TaskId()) {
        scheduling_tasks.erase(spec);

    -   if (scheduling_tasks.empty()) {
We shouldn't remove this if check, we should only cancel worker lease if there is no scheduling task.
Actually, I removed this because CancelWorkerLeaseIfNeeded already checks whether it's empty and returns early if it's not.
As titled, I think having `MB` explicitly called out is more readable than `1024 * 1024` or `1<<20` Intended use case: #48262 (comment) Signed-off-by: dentiny <[email protected]>
Why are these changes needed?
Minor C++ changes around the core worker. These were originally part of #48661 and have been factored out into this PR.