
[Core]: surface error to user when task scheduling is delayed due to out of memory. #25448

Open
mvashishtha opened this issue Jun 3, 2022 · 13 comments
Labels
core (Issues that should be addressed in Ray Core), docs (An issue or change related to documentation), enhancement (Request for new feature and/or capability), observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling), P1 (Issue that should be fixed within a few weeks)

Comments

@mvashishtha

mvashishtha commented Jun 3, 2022

What happened + What you expected to happen

In my reproduction script, I submit two slow remote functions, each taking a different OID obtained by ray.put of the same large data, and then wait for the results. From the logs I printed (see below), I see that the second task waits for the first to complete before executing. I expect the two tasks to start executing in parallel instead.

Here's the output of that script when executed on my Mac. You can see that the second "slow_deploy start" doesn't occur until the first remote function is done.

both remote functions submitted at 1654274857.2931788
(slow_deploy pid=18620) slow_deploy start at 1654274857.298433...
(slow_deploy pid=18620) slow_deploy done at 1654274860.3035252.
(slow_deploy pid=18620) slow_deploy start at 1654274860.305422...
total time: 6.021677255630493
(slow_deploy pid=18620) slow_deploy done at 1654274863.310048.

Two things that independently made the execution parallel:

  • passing the same OID to both remote functions, i.e. passing first_data_oid to both calls (see the sketch after this list)
  • decreasing the size of the data. If I use 1 million rows instead of 4 million, I get parallel execution.
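
For reference, this is the variant of the submission code I mean in the first bullet (not part of the original script; it just reuses first_data_oid for both calls):

# Variant of the repro script below: pass the same object ref to both tasks.
# With this change both tasks start immediately on my Mac.
first_result = slow_deploy.remote(first_data_oid)
second_result = slow_deploy.remote(first_data_oid)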

I also ran the same script on an EC2 Ubuntu instance with more RAM (specs in "Versions / Dependencies"). There I saw parallel execution:

both remote functions submitted at 1654273975.9030573
(slow_deploy pid=1175) slow_deploy start at 1654273975.9124284...
(slow_deploy pid=1174) slow_deploy start at 1654273975.912305...
total time: 3.017226457595825

(slow_deploy pid=1175) slow_deploy done at 1654273978.9154541.
(slow_deploy pid=1174) slow_deploy done at 1654273978.9154496.

I even saw parallel execution when I increased the size of the data from 4 million rows to 10 million rows:

both remote functions submitted at 1654274330.798718
(slow_deploy pid=2427) slow_deploy start at 1654274330.8032472...
(slow_deploy pid=2423) slow_deploy start at 1654274330.803071...
total time: 3.0110809803009033

(slow_deploy pid=2427) slow_deploy done at 1654274333.805099.
(slow_deploy pid=2423) slow_deploy done at 1654274333.8051064.

but at 20 million rows, I got the serial execution again:

both remote functions submitted at 1654275302.0839524
(slow_deploy pid=3010) slow_deploy start at 1654275302.0881972...
(slow_deploy pid=3010) slow_deploy done at 1654275305.0890934.
(slow_deploy pid=3010) slow_deploy start at 1654275305.0910187...
total time: 6.013562202453613

(slow_deploy pid=3010) slow_deploy done at 1654275308.093094.

I am also attaching two Ray timelines collected after running the reproduction script. Both are from my Mac: one shows the parallel execution when using 1 million rows of data, and the other shows the serial execution when using 4 million rows.
ray_bug_parallel_execution_for_1m_rows_mac.json.zip
ray_bug_serial_exeuction_for_4m_row_mac.json.zip

Versions / Dependencies

My Mac:

  • macOS Monterey 12.2.1
  • MacBook Pro (16-inch, 2019) with 2.3 GHz 8-Core Intel Core i9
  • Memory: 16 GB 2667 MHz DDR4
  • Ray: 1.12.1
  • Python: 3.9.12

The Ubuntu EC2 instance:

  • AMI name: ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20211129
  • instance type: t2.2xlarge
  • Ubuntu 20.04.3 LTS
  • 8 physical CPUs
  • Memory: 32 GiB
  • Ray: 1.12.0
  • Python: 3.9.7

Reproduction script

import ray
import numpy as np
import time


@ray.remote
def slow_deploy(*args, **kwargs):
    print(f"slow_deploy start at {time.time()}...")
    time.sleep(3)
    result = 0
    print(f"slow_deploy done at {time.time()}.")
    return result


data = np.ones(shape=(4_000_000, 25))
first_data_oid = ray.put(data)
second_data_oid = ray.put(data)
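# Each ray.put above stores a separate ~800 MB copy of the array (4M rows x 25 float64 columns) in the object store.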

start = time.time()
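# Submit both tasks; with enough object store headroom for both arguments they should start in parallel.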
first_result = slow_deploy.remote(first_data_oid)
second_result = slow_deploy.remote(second_data_oid)
print(f"both remote functions submited at {time.time()}")
ray.wait([first_result, second_result], num_returns=2)
print(f"total time: {time.time() - start}")

Issue Severity

High: it blocks me from completing my task

@mvashishtha mvashishtha added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jun 3, 2022
@prutskov

prutskov commented Jun 8, 2022

cc @simon-mo, @richardliaw

@simon-mo
Contributor

simon-mo commented Jun 8, 2022

@iycheng can you take a look or re-assign? Thank you.

@mvashishtha
Author

@iycheng please take a look or re-assign. This is a critical issue for Modin.

@simon-mo
Contributor

simon-mo commented Jun 9, 2022

Escalating this to @scv119

@scv119
Contributor

scv119 commented Jun 9, 2022

@mvashishtha can you run your script with RAY_BACKEND_LOG_LEVEL=debug on your Mac, and share your raylet.out from /tmp/ray/session_latest/logs/?

I think it's most likely that the object store is serializing the object fetches for some reason (i.e. the object store can only fit one object at a time).

@scv119 scv119 added P1 Issue that should be fixed within a few weeks core Issues that should be addressed in Ray Core and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jun 9, 2022
@mvashishtha
Author

@scv119 sure, here is the result from running the repro script in the first post on my mac.
raylet.out.zip

@scv119
Contributor

scv119 commented Jun 9, 2022

[2022-06-09 15:35:27,302 I 31703 3812441] (raylet) store_runner.cc:32: Allowing the Plasma store to use up to 2.14748GB of memory.
[2022-06-09 15:35:27,302 I 31703 3812441] (raylet) store_runner.cc:48: Starting object store with directory /tmp, fallback /tmp/ray, and huge page support disabled
[2022-06-09 15:35:27,302 D 31703 3812471] (raylet) store_runner.cc:88: starting server listening on /tmp/ray/session_2022-06-09_15-35-24_504793_31696/sockets/plasma_store
[2022-06-09 15:35:27,302 D 31703 3812471] (raylet) plasma_allocator.cc:85: allocating 2147481600
[2022-06-09 15:35:27,302 I 31703 3812471] (raylet) dlmalloc.cc:154: create_and_mmap_buffer(2147483656, /tmp/plasmaXXXXXX)

...

[2022-06-09 15:35:29,831 D 31703 3812441] (raylet) local_task_manager.cc:649: Size of pinned task args is now 1600000520
[2022-06-09 15:35:29,831 D 31703 3812441] (raylet) local_task_manager.cc:663: Cannot dispatch task 9d172512a06e513526abaf75f5a71b612dd74bb601000000 with arguments of size 800000260 current pinned bytes is 800000260
[2022-06-09 15:35:29,831 D 31703 3812441] (raylet) local_task_manager.cc:178: Dispatching task 9d172512a06e513526abaf75f5a71b612dd74bb601000000 would put this node over the max memory allowed for arguments of executing tasks (1503238528). Waiting to dispatch task until other tasks complete
[2022-06-09 15:35:29,831 D 31703 3812441] (raylet) local_task_manager.cc:647: RayTask 9d172512a06e513526a
...
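
Reading off the numbers in this excerpt: the budget for pinned task arguments is 1503238528 bytes, and each task argument is 800000260 bytes, so the two arguments together (1600000520 bytes) do not fit, and the second task cannot be dispatched until the first finishes. A rough sketch of the check these lines describe (hypothetical names, numbers copied from the log, not the actual raylet code):

# Sketch of the admission check implied by the raylet log above; names are hypothetical.
ARGS_BUDGET = 1_503_238_528        # max memory allowed for arguments of executing tasks
pinned_args_bytes = 800_000_260    # argument of the task already executing
new_task_args_bytes = 800_000_260  # argument of the second task

if pinned_args_bytes + new_task_args_bytes > ARGS_BUDGET:
    # 1_600_000_520 > 1_503_238_528, so dispatch of the second task is delayed.
    print("Waiting to dispatch task until other tasks complete")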

@scv119
Contributor

scv119 commented Jun 9, 2022

@mvashishtha so what happens is that your plasma store has 2.147 GB of capacity, and each of your task arguments takes 800 MB. Currently we have a limitation that task arguments may only occupy 70% of the capacity (about 1.5 GB), which means it can't fit the arguments of two tasks (800 MB each).

The solution would be to increase the plasma store size (recommended), or to raise the 70% capacity limit (max_task_args_memory_fraction) (not recommended, as it might have cascading effects on task return values).
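
For example, something along these lines (a minimal sketch; the 4 GB value is just an example, and max_task_args_memory_fraction is an internal system config, so how it is set may vary by Ray version):

import ray

# Recommended: give the plasma store more memory (object_store_memory is in bytes).
ray.init(object_store_memory=4 * 1024**3)

# Not recommended: raise the 70% task-argument limit instead. This is an internal
# raylet config and has to be passed when the cluster is started, e.g.:
# ray.init(_system_config={"max_task_args_memory_fraction": 0.8})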

@mvashishtha
Author

mvashishtha commented Jun 9, 2022

@scv119 thank you for the explanation. That makes sense, and I think it explains the slowness we originally saw in Modin here. I do think that Ray should surface this warning to the user, along with the recommendations you gave about the object store size and max_task_args_memory_fraction. It seems like a pretty serious problem when tasks that the user expects to execute in parallel are executing serially.

By the way, why have a limit on the task arguments sizes? Is it because the return values of the tasks might increase the size of the object store too much?

@scv119
Contributor

scv119 commented Jun 9, 2022

@mvashishtha that's a great suggestion. We will surface this warning back to the driver (similar to what we did for scheduling or spilling), but let me know if you think there are better ways.

By the way, why have a limit on the task arguments sizes? Is it because the return values of the tasks might increase the size of the object store too much?

Yup, I think that's the main motivation: the object store might not fit the task return value. cc @stephanie-wang @rkooo567: with fallback allocation, maybe this is no longer necessary?

Quote from the PR introducing this feature:

Task arguments must be pinned in memory while the task is running. If too many tasks with arguments run at the same time, there may not be enough memory to store a single task output. Object spilling does not help because the task arguments cannot be spilled while the task is still running.

@scv119 scv119 changed the title [Core]: Remote functions that take large objects as inputs execute serially [Core]: surface error to user when task scheduling is delayed due to out of memory. Jun 9, 2022
@scv119 scv119 added the enhancement Request for new feature and/or capability label Jun 9, 2022
@stephanie-wang
Contributor

Yes, this is intended behavior to avoid overloading the object store. Fallback allocation ensures liveness, but the performance will suffer compared to storing return values directly in the object store.

@mvashishtha can you explain why this is a high severity issue for you? I'm a bit unclear on why it's a blocker.

@devin-petersohn
Member

@stephanie-wang This was originally flagged as high severity because it was causing crashes and prohibitively bad performance on large datasets (larger than what is reported). We narrowed it down here as much as we could to help you all debug and rule things out, but the original issue is a bit more complicated than what is reported here.

Since we didn't see anything relevant in the logs/timeline/stderr we ended up assuming it was some kind of scheduler or object store bug.

There is also the issue of the object store on macOS being prohibitively small (for other reasons); combining that with the 70% rule described above makes certain types of workloads impossible on Mac machines altogether.

I think this issue can probably be de-escalated since it's working as intended, but I would prefer if there could be some warning surfaced to the user when scheduling is blocked based on the size of an input. Do you think that's possible?

Also, are these task scheduling requirements/limits documented somewhere? Sorry if I missed it!

@scottsun94 scottsun94 added the observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling label Oct 13, 2022
@rkooo567 rkooo567 added this to the Core Backlog milestone Oct 24, 2022
@rkooo567 rkooo567 added docs An issue or change related to documentation and removed bug Something that is supposed to be working; but isn't labels Oct 24, 2022
@rkooo567
Contributor

rkooo567 commented Oct 24, 2022

I think this issue can probably be de-escalated since it's working as intended, but I would prefer if there could be some warning surfaced to the user when scheduling is blocked based on the size of an input. Do you think that's possible?

From Ray 2.1, this information will be available in the Ray metrics dashboard as PENDING_OBJ_STORE_MEM_AVAIL. From 2.2, it will also be available from the ray get tasks API. cc @rickyyx
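
For reference, a rough sketch of how one might look for such tasks with the state API once it is available (the module path and output format differ between Ray versions, so treat this as illustrative only):

# Illustrative only: in newer Ray versions the state API lives under ray.util.state;
# the CLI equivalent is `ray list tasks`.
from ray.util.state import list_tasks

for task in list_tasks():
    # Look for tasks whose state is something like PENDING_OBJ_STORE_MEM_AVAIL.
    print(task)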

Also, are these task scheduling requirements/limits documented somewhere? Sorry if I missed it!

I will document this (i.e. when task scheduling can be delayed). There are many cases in which tasks can be delayed, so printing a warning seems a little drastic (maybe too spammy).
