Skip to content

Conversation

@dayshah
Copy link
Contributor

@dayshah dayshah commented Oct 24, 2024

Why are these changes needed?

Minor cpp changes around core worker, was part of #48661, but factored out those changes.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@dayshah dayshah marked this pull request as ready for review October 28, 2024 22:26
@dayshah dayshah requested a review from a team as a code owner October 28, 2024 22:26
@dayshah dayshah changed the title [WIP][core] Fix hang when getting cancelled task with dependencies in progress [core] Fix hang when getting cancelled task with dependencies in progress Oct 28, 2024
auto remaining_timeout_ms = timeout_ms;
auto timeout_timestamp = current_time_ms() + timeout_ms;
while (!is_ready_) {
// TODO (dayshah): see if using cv condition function instead of busy while helps.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this still relevant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya i think it could still be relevant, pretty sure using cv.wait_for(lock, timeout, []() { return !is_ready; }); could lead to less cpu usage vs the busy while loop but want to look more into how it would affect performance

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, let's do it in another PR, if needed

@rynewang
Copy link
Contributor

test failures in cpp and python

@jjyao jjyao added the go add ONLY when ready to merge, run all tests label Oct 29, 2024
RAY_LOG(INFO) << "Cancelling a task: " << task_spec.TaskId()
<< " force_kill: " << force_kill << " recursive: " << recursive;
const SchedulingKey scheduling_key(
SchedulingKey scheduling_key(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why removed const

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so it can be moved into function capture below

if (!keep_executing) {
RAY_UNUSED(task_finisher_->FailOrRetryPendingTask(
task_spec.TaskId(), rpc::ErrorType::TASK_CANCELLED, nullptr));
RequestNewWorkerIfNeeded(scheduling_key);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we removed a call to FailOrRetryPendingTask here. is this expected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya this is the call we removed for when the task is removed and the task dependencies have just now been resolved, since now we're failing task before when cancel is actually called

Copy link
Collaborator

@jjyao jjyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the PR description, could you add details on how the hang happens. Something like #47861

<< "Cancel an actor task";
CancelActorTaskOnExecutor(
caller_worker_id, task_id, force_kill, recursive, on_cancel_callback);
caller_worker_id, task_id, force_kill, recursive, std::move(on_cancel_callback));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to move since the parameter is const &?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya, my bad left it in even after changing param

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait actually CancelActorTaskOnExecucutor takes by value, CancelTaskOnExecutor takes by const ref

auto it = submissible_tasks_.find(task_id);
RAY_CHECK(it != submissible_tasks_.end())
<< "Tried to fail task that was not pending " << task_id;
RAY_CHECK(it->second.IsPending())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why changing this? The function name indicating that it's a pending task.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the issue was there was a ray data backpressure test and a train test that failed here, it seems that the task cancel would happen after the task finishes and by the time it acquires that mutex in FailPendingTask it's already at a point where the task status is finished. So change here is basically just to no-op on if the task is finished at this point.

and IsPending checks for status != fail and finish, only want to check for fail here to make sure we're not double failing

@dayshah
Copy link
Contributor Author

dayshah commented Nov 7, 2024

In the PR description, could you add details on how the hang happens. Something like #47861

Updated pr description to list the 3 main changes and why they're needed

@dayshah dayshah requested a review from jjyao November 7, 2024 22:09

/// The maximum number of requests in flight per client.
const int64_t kMaxBytesInFlight = 16 * 1024 * 1024;
constexpr int64_t kMaxBytesInFlight = 16L * 1024 * 1024;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Randomly find this PR, two things:

  • We should have a human-readable unit library, something like 16_MiB;
  • inline is needed if we declare constants at header file
    • C++ spec doesn't guide the behavior on constexpr, when included by multiple translation units, whether the symbol appears once or multiple times, so it's a compiler-based UB

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it looks ok to you? #48638

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes but let's not block this pr. we can always update later

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inline constexpr is needed, yes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya good point inlined, can integrate into library after, there's other instances of stuff like this too can go at once

@dayshah dayshah changed the title [core] Fix hang when getting cancelled task with dependencies in progress [core] Minor cpp changes around core worker Nov 8, 2024
if (spec->TaskId() == task_spec.TaskId()) {
scheduling_tasks.erase(spec);

if (scheduling_tasks.empty()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't remove this if check, we should only cancel worker lease if there is no scheduling task.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually removed this because CancelWorkerLeaseIfNeeded already checks if its empty and early returns if it's not empty

@jjyao jjyao merged commit e2dfcc8 into ray-project:master Nov 9, 2024
@dayshah dayshah deleted the cancel-task-with-deps branch November 9, 2024 18:56
rynewang pushed a commit that referenced this pull request Nov 12, 2024
As titled, I think having `MB` explicitly called out is more readable
than `1024 * 1024` or `1<<20`
Intended use case:
#48262 (comment)


Signed-off-by: dentiny <[email protected]>
JP-sDEV pushed a commit to JP-sDEV/ray that referenced this pull request Nov 14, 2024
JP-sDEV pushed a commit to JP-sDEV/ray that referenced this pull request Nov 14, 2024
As titled, I think having `MB` explicitly called out is more readable
than `1024 * 1024` or `1<<20`
Intended use case:
ray-project#48262 (comment)


Signed-off-by: dentiny <[email protected]>
mohitjain2504 pushed a commit to mohitjain2504/ray that referenced this pull request Nov 15, 2024
mohitjain2504 pushed a commit to mohitjain2504/ray that referenced this pull request Nov 15, 2024
As titled, I think having `MB` explicitly called out is more readable
than `1024 * 1024` or `1<<20`
Intended use case:
ray-project#48262 (comment)

Signed-off-by: dentiny <[email protected]>
Signed-off-by: mohitjain2504 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants