[Core]: surface error to user when task scheduling is delayed due to out of memory. #25448
Comments
@iycheng can you take a look or re-assign? Thank you.
@iycheng please take a look or re-assign. This is a critical issue for Modin.
Escalating this to @scv119.
@mvashishtha can you rerun your script and share the output? I think most likely it's because the object store is serializing the object fetches for unknown reasons (i.e. the object store can only fit one object at a time).
@scv119 sure, here is the result from running the repro script in the first post on my mac.
@mvashishtha so what happens is that your plasma store has 2.147 GB of capacity, and each of your task arguments takes 800 MB. Currently we have a limitation that task arguments may only occupy 70% of the capacity (1.5 GB), which means it can't fit two tasks (each with an 800 MB argument) at once. The solution would be to increase the plasma store size (recommended), or to increase the 70% capacity limit (`max_task_args_memory_fraction`) (not recommended, as it might have cascading effects on task return values).
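For concreteness, a minimal sketch of those two knobs. The values are illustrative, and passing `max_task_args_memory_fraction` through `_system_config` is an assumption about how that internal setting is exposed, not a documented API:

```python
import ray

# Recommended: give the plasma store enough headroom that both 800 MB
# task arguments fit under the 70% task-argument capacity limit.
ray.init(object_store_memory=4 * 1024**3)  # 4 GiB object store

# Not recommended: raise the task-argument fraction instead. This assumes the
# internal max_task_args_memory_fraction setting can be passed via
# _system_config; it may have cascading effects on task return values.
# ray.init(_system_config={"max_task_args_memory_fraction": 0.9})
```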
@scv119 thank you for the explanation. That makes sense and I think it explains the slowness we originally saw in Modin here. I do think that Ray should surface this warning to the user, along with the recommendations you gave about the object store size and `max_task_args_memory_fraction`. By the way, why have a limit on task argument sizes? Is it because the return values of the tasks might increase the size of the object store too much?
@mvashishtha that's a great suggestion. We will surface this warning info back to the driver (similar to what we did for scheduling or spilling), but let me know if you think there are better ways.
Yup, I think that's the main motivation: the object store might not fit the task return values. cc @stephanie-wang @rkooo567, with fallback allocation maybe this is no longer necessary? Quoting the PR that introduced this feature:
Yes, this is intended behavior to avoid overloading the object store. Fallback allocation ensures liveness, but performance will suffer compared to storing return values directly in the object store. @mvashishtha can you explain why this is a high-severity issue for you? I'm a bit unclear on why it's a blocker.
@stephanie-wang This was originally flagged as high severity because it was causing crashes and prohibitively bad performance on large datasets (larger than what is reported here). We narrowed it down as much as we could to help you all debug and to rule out as much as possible, but the original issue is a bit more complicated than what is reported here. Since we didn't see anything relevant in the logs/timeline/stderr, we ended up assuming it was some kind of scheduler or object store bug. There is also the issue of the object store on macOS being prohibitively small (for other reasons), and combining that with the 70% rule described above makes certain types of workloads impossible on Mac machines altogether. I think this issue can probably be de-escalated since it's working as intended, but I would prefer if some warning could be surfaced to the user when scheduling is blocked based on the size of an input. Do you think that's possible? Also, are these task scheduling requirements/limits documented somewhere? Sorry if I missed it!
Starting with Ray 2.1, this information will be available from the Ray metrics dashboard.
What happened + What you expected to happen
In my reproduction script, I deploy two slow remote functions on two different OIDs obtained by `ray.put`-ting the same large data. I then wait for the results. From the logs I printed (see below), I see that the second task waits for the first to complete before executing. I expect the two tasks to start executing in parallel instead.

Here's the output of that script from executing on my mac. You can see that the second "slow_deploy start" doesn't occur until the first remote function is done.
Two things that independently made the execution parallel:

- Passing `first_data_oid` to both remote functions

I also ran the same script on an EC2 Ubuntu instance with more RAM (specs in "Versions / Dependencies"). There I saw parallel execution:
I even saw parallel execution when I increased the size of the data from 4 million rows to 10 million rows:
but at 20 million rows, I got the serial execution again:
I am also attaching two `ray timeline` outputs collected after running the reproduction script. Both are from my mac. One shows the parallel execution when using 1 million rows of data, and the other shows the serial execution when using 4 million rows of data.

ray_bug_parallel_execution_for_1m_rows_mac.json.zip
ray_bug_serial_exeuction_for_4m_row_mac.json.zip
Versions / Dependencies
My mac:
The Ubuntu EC2 instance:
Reproduction script
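The original script is not reproduced above, so here is a minimal sketch of the setup it describes. The function name matches the "slow_deploy" log lines, and the data shape is chosen to give roughly the 800 MB per-argument size quoted earlier; the sleep time and the `second_data_oid` name are illustrative:

```python
import time

import numpy as np
import pandas as pd
import ray

ray.init()


@ray.remote
def slow_deploy(df):
    # Simulate a slow task so overlapping execution is easy to observe.
    print(f"slow_deploy start {time.time()}")
    time.sleep(10)
    print(f"slow_deploy end {time.time()}")
    return len(df)


# About 4 million rows (~800 MB), ray.put twice so each task has its own
# large argument in the object store.
data = pd.DataFrame(np.random.rand(4_000_000, 25))
first_data_oid = ray.put(data)
second_data_oid = ray.put(data)

# Serial behavior observed on the mac: the second task waits for the first.
results = ray.get([slow_deploy.remote(first_data_oid),
                   slow_deploy.remote(second_data_oid)])

# Variant that ran in parallel: pass the same object reference to both tasks.
# results = ray.get([slow_deploy.remote(first_data_oid),
#                    slow_deploy.remote(first_data_oid)])
```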
Issue Severity
High: it blocks me from completing my task