
[Core]: surface error to user when task scheduling is delayed due to out of memory. #25448

Open
mvashishtha opened this issue Jun 3, 2022 · 13 comments
Labels
core (Issues that should be addressed in Ray Core), docs (An issue or change related to documentation), enhancement (Request for new feature and/or capability), observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling), P1 (Issue that should be fixed within a few weeks)

Comments

@mvashishtha

mvashishtha commented Jun 3, 2022

What happened + What you expected to happen

In my reproduction script, I submit two slow remote functions, each taking a different OID obtained by ray.put of the same large data, and then wait for the results. From the logs I printed (see below), I see that the second task waits for the first to complete before executing. I expect the two tasks to start executing in parallel instead.

Here's the output of that script when executed on my Mac. You can see that the second "slow_deploy start" doesn't occur until the first remote function is done.

both remote functions submitted at 1654274857.2931788
(slow_deploy pid=18620) slow_deploy start at 1654274857.298433...
(slow_deploy pid=18620) slow_deploy done at 1654274860.3035252.
(slow_deploy pid=18620) slow_deploy start at 1654274860.305422...
total time: 6.021677255630493
(slow_deploy pid=18620) slow_deploy done at 1654274863.310048.

Two things that independently made the execution parallel:

  • passing the same OID to both remote functions, i.e. passing first_data_oid to both calls (see the sketch after this list)
  • decreasing the size of the data. If I use 1 million rows instead of 4 million, I get parallel execution.
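
For reference, this is the variant of the submission code I mean in the first bullet (not part of the original script; it just reuses first_data_oid for both calls):

# Variant of the repro script below: pass the same object ref to both tasks.
# With this change both tasks start immediately on my Mac.
first_result = slow_deploy.remote(first_data_oid)
second_result = slow_deploy.remote(first_data_oid)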

I also ran the same script on an EC2 Ubuntu instance with more RAM (specs in "Versions / Dependencies"). There I saw parallel execution:

both remote functions submitted at 1654273975.9030573
(slow_deploy pid=1175) slow_deploy start at 1654273975.9124284...
(slow_deploy pid=1174) slow_deploy start at 1654273975.912305...
total time: 3.017226457595825

(slow_deploy pid=1175) slow_deploy done at 1654273978.9154541.
(slow_deploy pid=1174) slow_deploy done at 1654273978.9154496.

I even saw parallel execution when I increased the size of the data from 4 million rows to 10 million rows:

both remote functions submitted at 1654274330.798718
(slow_deploy pid=2427) slow_deploy start at 1654274330.8032472...
(slow_deploy pid=2423) slow_deploy start at 1654274330.803071...
total time: 3.0110809803009033

(slow_deploy pid=2427) slow_deploy done at 1654274333.805099.
(slow_deploy pid=2423) slow_deploy done at 1654274333.8051064.

but at 20 million rows, I got the serial execution again:

both remote functions submitted at 1654275302.0839524
(slow_deploy pid=3010) slow_deploy start at 1654275302.0881972...
(slow_deploy pid=3010) slow_deploy done at 1654275305.0890934.
(slow_deploy pid=3010) slow_deploy start at 1654275305.0910187...
total time: 6.013562202453613

(slow_deploy pid=3010) slow_deploy done at 1654275308.093094.

I am also attaching two Ray timelines collected after running the reproduction script. Both are from my Mac: one shows the parallel execution when using 1 million rows of data, and the other shows the serial execution when using 4 million rows.
ray_bug_parallel_execution_for_1m_rows_mac.json.zip
ray_bug_serial_exeuction_for_4m_row_mac.json.zip

Versions / Dependencies

My Mac:

  • macOS Monterey 12.2.1
  • MacBook Pro (16-inch, 2019) with 2.3 GHz 8-Core Intel Core i9
  • Memory: 16 GB 2667 MHz DDR4
  • Ray: 1.12.1
  • Python: 3.9.12

The Ubuntu EC2 instance:

  • AMI name: ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20211129
  • instance type: t2.2xlarge
  • Ubuntu 20.04.3 LTS
  • 8 physical CPUs
  • Memory: 32 GiB
  • Ray: 1.12.0
  • Python: 3.9.7

Reproduction script

import ray
import numpy as np
import time


@ray.remote
def slow_deploy(*args, **kwargs):
    print(f"slow_deploy start at {time.time()}...")
    time.sleep(3)
    result = 0
    print(f"slow_deploy done at {time.time()}.")
    return result


data = np.ones(shape=(4_000_000, 25))
first_data_oid = ray.put(data)
second_data_oid = ray.put(data)
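# Each ray.put above stores a separate ~800 MB copy of the array (4M rows x 25 float64 columns) in the object store.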

start = time.time()
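# Submit both tasks; with enough object store headroom for both arguments they should start in parallel.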
first_result = slow_deploy.remote(first_data_oid)
second_result = slow_deploy.remote(second_data_oid)
print(f"both remote functions submited at {time.time()}")
ray.wait([first_result, second_result], num_returns=2)
print(f"total time: {time.time() - start}")

Issue Severity

High: it blocks me from completing my task

@mvashishtha mvashishtha added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jun 3, 2022
@prutskov

prutskov commented Jun 8, 2022

cc @simon-mo, @richardliaw

@simon-mo
Contributor

simon-mo commented Jun 8, 2022

@iycheng can you take a look or re-assign? Thank you.

@mvashishtha
Author

@iycheng please take a look or re-assign. This is a critical issue for Modin.

@simon-mo
Contributor

simon-mo commented Jun 9, 2022

Escalating this to @scv119

@scv119
Contributor

scv119 commented Jun 9, 2022

@mvashishtha can you run your script with RAY_BACKEND_LOG_LEVEL=debug on your Mac, and share your raylet.out from /tmp/ray/session_latest/logs/?

I think it's most likely that the object store is serializing the object fetches for some reason (i.e. the object store can only fit one object at a time).

@scv119 scv119 added P1 Issue that should be fixed within a few weeks core Issues that should be addressed in Ray Core and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jun 9, 2022
@mvashishtha
Author

@scv119 sure, here is the result from running the repro script in the first post on my mac.
raylet.out.zip

@scv119
Contributor

scv119 commented Jun 9, 2022

[2022-06-09 15:35:27,302 I 31703 3812441] (raylet) store_runner.cc:32: Allowing the Plasma store to use up to 2.14748GB of memory.
[2022-06-09 15:35:27,302 I 31703 3812441] (raylet) store_runner.cc:48: Starting object store with directory /tmp, fallback /tmp/ray, and huge page support disabled
[2022-06-09 15:35:27,302 D 31703 3812471] (raylet) store_runner.cc:88: starting server listening on /tmp/ray/session_2022-06-09_15-35-24_504793_31696/sockets/plasma_store
[2022-06-09 15:35:27,302 D 31703 3812471] (raylet) plasma_allocator.cc:85: allocating 2147481600
[2022-06-09 15:35:27,302 I 31703 3812471] (raylet) dlmalloc.cc:154: create_and_mmap_buffer(2147483656, /tmp/plasmaXXXXXX)

...

[2022-06-09 15:35:29,831 D 31703 3812441] (raylet) local_task_manager.cc:649: Size of pinned task args is now 1600000520
[2022-06-09 15:35:29,831 D 31703 3812441] (raylet) local_task_manager.cc:663: Cannot dispatch task 9d172512a06e513526abaf75f5a71b612dd74bb601000000 with arguments of size 800000260 current pinned bytes is 800000260
[2022-06-09 15:35:29,831 D 31703 3812441] (raylet) local_task_manager.cc:178: Dispatching task 9d172512a06e513526abaf75f5a71b612dd74bb601000000 would put this node over the max memory allowed for arguments of executing tasks (1503238528). Waiting to dispatch task until other tasks complete
[2022-06-09 15:35:29,831 D 31703 3812441] (raylet) local_task_manager.cc:647: RayTask 9d172512a06e513526a
...
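
Reading off the numbers in this excerpt: the budget for pinned task arguments is 1503238528 bytes, and each task argument is 800000260 bytes, so the two arguments together (1600000520 bytes) do not fit, and the second task cannot be dispatched until the first finishes. A rough sketch of the check these lines describe (hypothetical names, numbers copied from the log, not the actual raylet code):

# Sketch of the admission check implied by the raylet log above; names are hypothetical.
ARGS_BUDGET = 1_503_238_528        # max memory allowed for arguments of executing tasks
pinned_args_bytes = 800_000_260    # argument of the task already executing
new_task_args_bytes = 800_000_260  # argument of the second task

if pinned_args_bytes + new_task_args_bytes > ARGS_BUDGET:
    # 1_600_000_520 > 1_503_238_528, so dispatch of the second task is delayed.
    print("Waiting to dispatch task until other tasks complete")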

@scv119
Contributor

scv119 commented Jun 9, 2022

@mvashishtha so what happens is that your plasma store has 2.147 GB of capacity, and each of your task arguments takes 800 MB. Currently we have a limitation that task arguments may only occupy 70% of the capacity (about 1.5 GB), which means it can't fit the arguments of two tasks (800 MB each).

The solution would be to increase the plasma store size (recommended), or to raise the 70% capacity limit (max_task_args_memory_fraction) (not recommended, as it might have cascading effects on task return values).
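
For example, something along these lines (a minimal sketch; the 4 GB value is just an example, and max_task_args_memory_fraction is an internal system config, so how it is set may vary by Ray version):

import ray

# Recommended: give the plasma store more memory (object_store_memory is in bytes).
ray.init(object_store_memory=4 * 1024**3)

# Not recommended: raise the 70% task-argument limit instead. This is an internal
# raylet config and has to be passed when the cluster is started, e.g.:
# ray.init(_system_config={"max_task_args_memory_fraction": 0.8})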

@mvashishtha
Author

mvashishtha commented Jun 9, 2022

@scv119 thank you for the explanation. That makes sense, and I think it explains the slowness we originally saw in Modin here. I do think that Ray should surface this warning to the user, along with the recommendations you gave about the object store size and max_task_args_memory_fraction. It seems like a pretty serious problem when tasks that the user expects to execute in parallel are executing serially.

By the way, why have a limit on the task arguments sizes? Is it because the return values of the tasks might increase the size of the object store too much?

@scv119
Contributor

scv119 commented Jun 9, 2022

@mvashishtha that's a great suggestion. We will surface this warning back to the driver (similar to what we did for scheduling or spilling), but let me know if you think there are better ways.

By the way, why have a limit on the task arguments sizes? Is it because the return values of the tasks might increase the size of the object store too much?

Yup, I think that's the main motivation: the object store might not fit the task return value. cc @stephanie-wang @rkooo567: with fallback allocation, maybe this is no longer necessary?

Quote from the PR introducing this feature:

Task arguments must be pinned in memory while the task is running. If too many tasks with arguments run at the same time, there may not be enough memory to store a single task output. Object spilling does not help because the task arguments cannot be spilled while the task is still running.

@scv119 scv119 changed the title [Core]: Remote functions that take large objects as inputs execute serially [Core]: surface error to user when task scheduling is delayed due to out of memory. Jun 9, 2022
@scv119 scv119 added the enhancement Request for new feature and/or capability label Jun 9, 2022
@stephanie-wang
Contributor

Yes, this is intended behavior to avoid overloading the object store. Fallback allocation ensures liveness, but the performance will suffer compared to storing return values directly in the object store.

@mvashishtha can you explain why this is a high severity issue for you? I'm a bit unclear on why it's a blocker.

@devin-petersohn
Member

@stephanie-wang This was originally flagged as high severity because it was causing crashes and prohibitively bad performance on large datasets (larger than what is reported). We narrowed it down here as much as we could to help you all debug and rule things out, but the original issue is a bit more complicated than what is reported here.

Since we didn't see anything relevant in the logs/timeline/stderr we ended up assuming it was some kind of scheduler or object store bug.

There is also the issue of the object store on macOS being prohibitively small (for other reasons); combining that with the 70% rule described above makes certain types of workloads impossible on Mac machines altogether.

I think this issue can probably be de-escalated since it's working as intended, but I would prefer if there could be some warning surfaced to the user when scheduling is blocked based on the size of an input. Do you think that's possible?

Also, are these task scheduling requirements/limits documented somewhere? Sorry if I missed it!

@scottsun94 scottsun94 added the observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling label Oct 13, 2022
@rkooo567 rkooo567 added this to the Core Backlog milestone Oct 24, 2022
@rkooo567 rkooo567 added docs An issue or change related to documentation and removed bug Something that is supposed to be working; but isn't labels Oct 24, 2022
@rkooo567
Contributor

rkooo567 commented Oct 24, 2022

I think this issue can probably be de-escalated since it's working as intended, but I would prefer if there could be some warning surfaced to the user when scheduling is blocked based on the size of an input. Do you think that's possible?

From Ray 2.1, this information will be available in the Ray metrics dashboard as PENDING_OBJ_STORE_MEM_AVAIL. From 2.2, it will also be available from the ray get tasks API. cc @rickyyx
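
For reference, a rough sketch of how one might look for such tasks with the state API once it is available (the module path and output format differ between Ray versions, so treat this as illustrative only):

# Illustrative only: in newer Ray versions the state API lives under ray.util.state;
# the CLI equivalent is `ray list tasks`.
from ray.util.state import list_tasks

for task in list_tasks():
    # Look for tasks whose state is something like PENDING_OBJ_STORE_MEM_AVAIL.
    print(task)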

Also, are these task scheduling requirements/limits documented somewhere? Sorry if I missed it!

I will document this (i.e. when task scheduling can be delayed). There are many cases in which tasks can be delayed, so printing a warning seems a little drastic (maybe too spammy).
