
Conversation

@stephanie-wang stephanie-wang commented Jan 18, 2021

Why are these changes needed?

Previously, we would fetch all requested objects simultaneously, including queued tasks' arguments and ray.get or ray.wait requests from local workers. If the total size was greater than the node's capacity, this could result in starvation.

This adds admission control when choosing which objects to fetch to the local node. It makes the following changes:

  1. Pull requests are served in FIFO order.
  2. The total size of objects actively fetched is kept under the node's current capacity (defined as the object store's total capacity - size of pinned objects).
  3. We do not start pulling an object until its size is known. This is to prevent flooding the object manager with incoming objects when many requests are made simultaneously for different objects.

The algorithm is implemented by finding the longest contiguous prefix of the current request queue whose total size is known and under the current capacity. Object size is now attached to all object table replies. The total set of objects needed by the chosen requests will be actively pulled or restored. The current capacity is dynamically and asynchronously updated at every ObjectManager tick.
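For illustration, here is a minimal sketch of the prefix rule described above. The names (PullBundle, NumBundlesToActivate) are stand-ins rather than the actual PullManager API, and a size of -1 models an object whose size is not yet known.

#include <cstdint>
#include <map>
#include <vector>

// Illustrative sketch of the admission rule above (not the real PullManager).
struct PullBundle {
  std::vector<int64_t> object_sizes;  // -1 means the size is not yet known.
};

// Returns how many bundles at the front of the FIFO queue can be activated:
// the longest contiguous prefix whose sizes are all known and whose running
// total fits within the currently available bytes.
size_t NumBundlesToActivate(const std::map<uint64_t, PullBundle> &queue,
                            int64_t bytes_available) {
  int64_t bytes_being_pulled = 0;
  size_t num_active = 0;
  for (const auto &entry : queue) {
    int64_t bundle_size = 0;
    bool sizes_known = true;
    for (int64_t size : entry.second.object_sizes) {
      if (size < 0) {
        sizes_known = false;  // Change 3: wait until every size is known.
        break;
      }
      bundle_size += size;
    }
    if (!sizes_known || bytes_being_pulled + bundle_size > bytes_available) {
      break;  // Changes 1 and 2: stop at the first bundle that does not fit.
    }
    bytes_being_pulled += bundle_size;
    ++num_active;
  }
  return num_active;
}

Stopping at the first bundle that cannot be admitted, rather than skipping ahead to smaller bundles, is what preserves the FIFO ordering in change 1.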

Related issue number

Closes #12663.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

To test this, the Python tests submit tasks with several arguments, and there is only enough memory to run one task at a time. On master, this timed out, but the test finished in ~5s when the Python code was modified to manually submit and get one task at a time. With this PR, the run time now matches the version with manual admission control.

@rkooo567 rkooo567 (Contributor) left a comment

IIRC, ray.get and ray.wait are prioritized over fetching task dependencies. (Lmk if I am wrong?) Should we add test cases for this?



@pytest.mark.timeout(30)
def test_pull_bundles_admission_control(shutdown_only):
Contributor:

Why don't we just use ray_start_cluster?

Contributor Author:

I was too lazy to figure out how to parametrize it properly :D Also, I was running into trouble where the non-head node would connect first, so the rest of the test wouldn't run properly.

Contributor:

oh lol. I think you can just do like

cluster = ray_start_cluster
cluster.add_node()
cluster.wait_for_nodes()
ray.init(address=cluster.address)
cluster.add_node...

// addition or deletion.
bool isUpdated = false;
for (const auto &update : location_updates) {
if (update.size() > 0) {
Contributor:

When is update.size() not greater than 0?

Contributor Author:

I think it's 0 for deletion. We can add a flag instead if there are cases where the object size really is 0.
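To make the convention in this thread concrete, here is a hedged sketch; the LocationUpdate type and its field are illustrative, not the real schema. The only point is that a reported size of 0 is being used to mark a deletion.

#include <cstdint>
#include <vector>

// Illustrative only: each location update carries the object's size, and by
// the convention discussed above a size of 0 marks a deletion rather than a
// real zero-byte object. A dedicated flag would be needed if zero-sized
// objects were possible.
struct LocationUpdate {
  int64_t size = 0;
};

void ApplyLocationUpdates(const std::vector<LocationUpdate> &location_updates) {
  for (const auto &update : location_updates) {
    if (update.size > 0) {
      // Addition: record the new location and remember the object's size.
    } else {
      // Deletion: drop the location; no size information is attached.
    }
  }
}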

Contributor:

Do you mind writing a comment here?


// Request the current available memory from the object
// store.
plasma::plasma_store_runner->GetAvailableMemoryAsync([this](size_t available_memory) {
Contributor:

It seems like when we collect stats, we use config_.object_store_memory - used_memory_ to calculate available memory (inside the object manager). Are we directly querying it from the plasma store here because we need to take pinned objects into account?

Contributor Author:

Yes, I don't think we currently report the available memory metric that I'm using (but maybe we should!).
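As a rough, self-contained illustration of the flow being discussed (PullManager and UpdatePullsBasedOnAvailableMemory are assumed names for the sketch; the real plasma callback runs on the store's own thread):

#include <cstdint>
#include <functional>
#include <iostream>

// Toy model of the asynchronous capacity report: on every tick the store is
// asked for its current availability, and the pull manager re-runs admission
// control against that number when the reply arrives.
struct PullManager {
  int64_t num_bytes_available = 0;
  void UpdatePullsBasedOnAvailableMemory(int64_t bytes) {
    num_bytes_available = bytes;
    // ...activate or deactivate queued pull requests to fit the new capacity...
  }
};

// Stand-in for the store's async query: total capacity minus pinned bytes,
// reported through a callback.
void GetAvailableMemoryAsync(int64_t capacity, int64_t pinned_bytes,
                             const std::function<void(int64_t)> &callback) {
  callback(capacity - pinned_bytes);
}

int main() {
  PullManager pull_manager;
  // One ObjectManager tick: refresh the capacity seen by admission control.
  GetAvailableMemoryAsync(/*capacity=*/1000, /*pinned_bytes=*/300,
                          [&](int64_t available) {
                            pull_manager.UpdatePullsBasedOnAvailableMemory(available);
                          });
  std::cout << pull_manager.num_bytes_available << std::endl;  // 700
  return 0;
}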

if (request_it == pull_request_bundles_.begin()) {
highest_req_id_being_pulled_ = 0;
} else {
highest_req_id_being_pulled_ = std::prev(request_it)->first;
Contributor:

Q: Is the map ordered?

Contributor Author:

Yes.
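For reference, the property being relied on, assuming pull_request_bundles_ is a std::map (or another container ordered by request ID): iteration follows key order, and std::prev(it) yields the entry with the next-lower request ID.

#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>

int main() {
  // Keys play the role of request IDs; std::map keeps them sorted.
  std::map<uint64_t, int> bundles = {{1, 0}, {2, 0}, {5, 0}};
  auto it = bundles.find(5);
  // The entry before request 5 is request 2, i.e. the next-lower ID.
  assert(std::prev(it)->first == 2);
  // If `it` were bundles.begin(), there would be no previous entry, which is
  // why the code above falls back to 0 in that branch.
  return 0;
}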

ASSERT_FALSE(IsUnderCapacity((num_requests_expected + 1) * num_oids_per_request *
object_size));
}
}
Contributor:

Can you also test the case where all of the current pull requests have been processed and new pull requests are then queued, to cover this part?

if (highest_req_id_being_pulled_ == request_it->first) {
    if (request_it == pull_request_bundles_.begin()) {
      highest_req_id_being_pulled_ = 0;
    } else {
      highest_req_id_being_pulled_ = std::prev(request_it)->first;

Contributor Author:

Hmm I think this will get covered by the cancellation tests, but let me know if there is something else you're looking for.

Contributor:

Ah, if it is covered by that test, it is fine. (It was a bit hard to tell whether this was tested.)


@pytest.mark.timeout(30)
def test_pull_bundles_admission_control_dynamic(shutdown_only):
# This test is the same as test_pull_bundles_admission_control, except that
Contributor:

This is a nice test!

num_bytes_available_ = num_bytes_available;

std::unordered_set<ObjectID> object_ids_to_pull;
// While there is available capacity, activate the next pull request.
Contributor:

Should we always pull at least the first bundle in the queue? Otherwise I can see the workload stalling when it is possible we can make space via object spilling.

Contributor Author:

Hmm that's a good point, I'll change that and see if I can write a regression test for it.

Contributor Author:

Actually, it turns out this doesn't work because the object store just keeps evicting other objects needed by the first bundle, so it never triggers the OOM handling. So I'm going to change this to directly trigger OOM instead.
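A hedged sketch of the behavior described here; the trigger_oom_handler callback is illustrative, not Ray's actual hook. The idea is that when even the head of the queue cannot be activated, the pull manager explicitly asks for space rather than waiting for evictions that keep hitting the wrong objects.

#include <cstdint>
#include <functional>

// If admission control cannot activate even the first (oldest) bundle, ask
// for enough space to be freed (e.g. via spilling) so that the head of the
// queue can always make progress.
void MaybeTriggerOutOfMemoryHandling(
    int64_t head_bundle_size, int64_t num_bytes_available,
    const std::function<void(int64_t)> &trigger_oom_handler) {
  if (head_bundle_size > num_bytes_available) {
    trigger_oom_handler(head_bundle_size - num_bytes_available);
  }
}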

std::unordered_set<ObjectID> object_ids_to_cancel;
// While the total bytes requested is over the available capacity, deactivate
// the last pull request, ordered by request ID.
while (num_bytes_being_pulled_ > num_bytes_available_) {
Contributor:

Nice.

// NOTE(swang): We could also just wait for the next tick to pull the
// objects, but this would add a delay of up to one tick for any bundles
// of multiple objects, even when we are not under memory pressure.
TryToMakeObjectLocal(obj_id);
Contributor:

Could we instead trigger the tick?

Contributor Author:

Ah, good idea!

@ericl ericl (Contributor) left a comment

Looks good at a high level; some questions about the corner-case handling under memory pressure.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 18, 2021
@stephanie-wang stephanie-wang removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 19, 2021
@rkooo567 rkooo567 (Contributor) left a comment

LGTM. A few follow up comments, but they are minor.




absl::optional<absl::flat_hash_set<NodeID>> GetObjectLocations(
const ObjectID &object_id) LOCKS_EXCLUDED(mutex_);

size_t GetObjectSize(const ObjectID &object_id) const;
Contributor:

Can you write a comment? (and explain when 0 is returned?)
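One possible shape for that comment, assuming 0 doubles as the "size not known yet" sentinel; the exact wording would need to be checked against the implementation.

/// Get the size of the given object, if known.
///
/// \param object_id The object to look up.
/// \return The object's size in bytes, or 0 if the size is not yet known,
/// e.g. because no location update carrying a size has been received yet.
size_t GetObjectSize(const ObjectID &object_id) const;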


if (!it->second.object_size_set) {
RAY_LOG(DEBUG) << "No size for " << obj_id << ", canceling activation for pull "
<< next_request_it->first;
return false;
Contributor:

Isn't it something we should improve in the future? (Like re-queue until sizes are known or something)?

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 19, 2021
@stephanie-wang stephanie-wang (Contributor Author) commented:

@ericl, @rkooo567, FYI:

I had to change a couple things to get the tests working:

  • removed the tests under "run_object_manager_tests.sh". These tests have been useless for a while, and we now have Python and unit tests that cover the code better anyway.
  • disabled //:core_worker_test since it relies on the legacy plasma store and I couldn't figure out how to disable that. This test is also pretty useless since the Python tests are a superset.
  • disabled the new object spilling test added in this PR. It also hangs on master, so I don't think this is a big deal.

It turns out there is a race condition causing the new object spilling test to fail that I think is non-trivial to fix. Here is the issue:

  1. Task 1 requires A, task 2 requires B. We only have room for 1 object.
  2. We pull A and lease the worker for task 1. At this point, we cancel the pull request for A and start the pull request for B.
  3. We pull B, which evicts A. The worker executing task 1 now requests A again. We hang because we're trying to pull B first.

To break the deadlock, we need to prioritize running workers' dependencies over queued task dependencies (as @rkooo567 mentioned above). Unfortunately, this won't fix the problem of evicting A before the worker gets it. To fix that, we'd need to wait to cancel the pull manager request for A until after we were sure the worker got a ref to A, which seems pretty complicated to do right now.

So for now, I can open a second PR to break the deadlock and at least report some metrics on how often such thrashing is happening.
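As a hedged sketch of that prioritization (queue names are illustrative, not the follow-up PR's design): requests from already-running workers would be admitted before queued tasks' arguments, so a leased worker can never be starved by queued work.

#include <cstdint>
#include <deque>

// Illustrative only: two FIFO queues, with requests from already-running
// workers (ray.get / ray.wait) always admitted before queued task arguments.
struct Bundle {
  uint64_t request_id;
  int64_t total_size;
};

struct PrioritizedQueues {
  std::deque<Bundle> worker_requests;  // ray.get / ray.wait from local workers.
  std::deque<Bundle> task_arguments;   // arguments of queued tasks.

  // Pop the next bundle to admit, preferring worker requests.
  bool NextBundle(Bundle *out) {
    if (!worker_requests.empty()) {
      *out = worker_requests.front();
      worker_requests.pop_front();
      return true;
    }
    if (!task_arguments.empty()) {
      *out = task_arguments.front();
      task_arguments.pop_front();
      return true;
    }
    return false;
  }
};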


ericl commented Jan 21, 2021 via email


ericl commented Jan 21, 2021 via email

@stephanie-wang stephanie-wang (Contributor Author) commented:

> Could we pin the object A in memory until the task 1 has finished execution? When the task 1 finishes it could do a double decrement of the refcount to release the object.

I think that is the right approach long-term, but it's pretty complicated to do right now. The problem is that we're calculating the PullManager's available memory based on total_capacity - pinned_objects_size. If we pin pulled objects, we'll have to make sure to account for that in the available memory. Plus the available memory is reported asynchronously right now, since the object store is running in a different thread.

It seems like we need a larger refactor to really fix this problem properly.


ericl commented Jan 21, 2021 via email

@stephanie-wang stephanie-wang (Contributor Author) commented:

After some offline discussion, the decision is to merge this and fix the deadlock problem by pinning pulled objects until they're no longer needed. We'll account for the pinned memory by asking for the current availability synchronously and subtracting the size of the objects pinned by the PullManager.

@stephanie-wang stephanie-wang merged commit 0998d69 into ray-project:master Jan 22, 2021
@stephanie-wang stephanie-wang deleted the admission-control branch January 22, 2021 00:46
fishbone pushed a commit to fishbone/ray that referenced this pull request Feb 16, 2021
…roject#13514)

* Admission control, TODO: tests, object size

* Unit tests for admission control and some bug fixes

* Add object size to object table, only activate pull if object size is known

* Some fixes, reset timer on eviction

* doc

* update

* Trigger OOM from the pull manager

* don't spam

* doc

* Update src/ray/object_manager/pull_manager.cc

Co-authored-by: Eric Liang <[email protected]>

* Remove useless tests

* Fix test

* osx build

* Skip broken test

* tests

* Skip failing tests

Co-authored-by: Eric Liang <[email protected]>
fishbone added a commit to fishbone/ray that referenced this pull request Feb 16, 2021
@kfstorm kfstorm mentioned this pull request Dec 31, 2021

Labels

@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.


Development

Successfully merging this pull request may close these issues.

[Object Spilling] Thrashing when there are large number of dependencies for many tasks
