Updates to scheduling objects to support dynamic custom resources #4465

romilbhardwaj · 2019-03-23T00:00:26Z

What do these changes do?

These changes let the scheduling objects (ResourceSet, ResourceIdSet) add, update, and remove resources at runtime. This PR is split out of the larger PR #3742.

robertnishihara · 2019-03-23T00:38:01Z

src/ray/raylet/scheduling_resources.cc

If we're adding an extra total_capacity_ field, then do we need the TotalQuantity method? Can't we just have TotalQuantity return total_capacity_?

I think the name TotalQuantity() is a bit misleading, since it returns the number of resources currently in the whole_ids_ vector. The size of whole_ids_ changes whenever resources are acquired or released, so TotalQuantity() reflects the available resources rather than the total. I therefore added total_capacity_ to keep track of the total count of resources in ResourceIds.
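The distinction between the two values can be illustrated with a minimal sketch. ResourceIdsSketch and its Acquire, Release, and TotalCapacity methods are hypothetical stand-ins written for this illustration, not the actual ResourceIds class; only the names whole_ids_, total_capacity_, and TotalQuantity come from the discussion.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for ResourceIds (not the Ray class).
class ResourceIdsSketch {
 public:
  explicit ResourceIdsSketch(int64_t quantity) : total_capacity_(quantity) {
    assert(quantity >= 0);
    for (int64_t i = 0; i < quantity; ++i) {
      whole_ids_.push_back(i);
    }
  }

  // Hand n whole IDs to a task; they leave whole_ids_ until released.
  std::vector<int64_t> Acquire(int64_t n) {
    assert(n >= 0 && n <= static_cast<int64_t>(whole_ids_.size()));
    std::vector<int64_t> taken(whole_ids_.end() - n, whole_ids_.end());
    whole_ids_.resize(whole_ids_.size() - n);
    return taken;
  }

  // Put previously acquired IDs back.
  void Release(const std::vector<int64_t> &ids) {
    whole_ids_.insert(whole_ids_.end(), ids.begin(), ids.end());
  }

  // Counts only the IDs currently held here, i.e. the available ones.
  int64_t TotalQuantity() const { return static_cast<int64_t>(whole_ids_.size()); }

  // Unaffected by Acquire and Release.
  double TotalCapacity() const { return total_capacity_; }

 private:
  std::vector<int64_t> whole_ids_;
  double total_capacity_;
};
```

With four resources and three of them acquired, TotalQuantity() in this sketch reports 1 while TotalCapacity() still reports 4.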

AmplabJenkins · 2019-03-23T01:50:02Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13212/
Test FAILed.

AmplabJenkins · 2019-03-23T17:33:08Z

Can one of the admins verify this patch?

AmplabJenkins · 2019-03-23T18:25:58Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13215/
Test FAILed.

robertnishihara · 2019-03-24T20:23:37Z

@romilbhardwaj the tests seem to be failing. I can reproduce this locally by running e.g.,

python -m pytest -v --durations=10 python/ray/tune/tests/test_commands.py::test_ls

raulchen

Thanks, I left some comments.

raulchen · 2019-03-25T10:30:39Z

src/ray/raylet/scheduling_resources.cc

Why do we need to constrain the upper limits now, when we didn't need to before?

Also, it seems that AddResourcesStrict is not used anymore. If so, please remove it.

We need to constrain Release now because resources may get removed while they are acquired by a task, and a subsequent release would incorrectly re-add those resources to resources_available_.

I have removed AddResourcesStrict.
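A minimal sketch of a constrained release, under stated assumptions: ResourceMap and ReleaseConstrainedSketch are hypothetical names used only for illustration, and a plain label-to-capacity map stands in for ResourceSet. The sketch skips labels that no longer exist in the total set and caps each label at its total, which is the behavior the reply above describes.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for ResourceSet: label -> capacity.
using ResourceMap = std::map<std::string, double>;

// Hypothetical helper, not the Ray implementation. Returns `released` to
// `available`, but ignores labels that are absent from `total` (they were
// deleted while the task held them) and never exceeds the total per label.
void ReleaseConstrainedSketch(ResourceMap &available, const ResourceMap &released,
                              const ResourceMap &total) {
  for (const auto &entry : released) {
    assert(entry.second >= 0);
    auto total_it = total.find(entry.first);
    if (total_it == total.end()) {
      continue;  // Resource was removed at runtime; do not re-add it.
    }
    double &slot = available[entry.first];
    slot = std::min(slot + entry.second, total_it->second);
  }
}
```

In this sketch, releasing a deleted custom resource alongside a CPU restores only the CPU.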

AmplabJenkins · 2019-03-26T04:39:55Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13237/
Test FAILed.

romilbhardwaj · 2019-03-26T04:45:59Z

Thanks for the comments, @robertnishihara and @raulchen. I had a closer look at SubtractResourcesStrict and realized the delete_zero_capacity flag wasn't needed for dynamic resources anymore and in fact was related to the bug that was causing the tune tests to fail. I've removed it and reverted SubtractResourcesStrict to its original implementation.

raulchen · 2019-03-26T12:28:52Z

Thanks, I'll take another look tomorrow. Note, the linter is failing. https://travis-ci.com/ray-project/ray/jobs/187695736

AmplabJenkins · 2019-03-26T23:51:49Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13271/
Test PASSed.

robertnishihara · 2019-03-27T06:28:06Z

greatest_id_ should no longer be necessary. Let's remove that and get rid of the concept of IDs for dynamic resources.

romilbhardwaj · 2019-03-27T20:18:23Z

greatest_id_ should no longer be necessary. Let's remove that and get rid of the concept of IDs for dynamic resources.

Okay, but we would still need to assign the newly created resources some id that can be pushed into ResourceIds::whole_ids_. In the latest commit, I assign -1 as the id of all dynamically updated resources. Would that be okay?

This would be the new behavior when the user calls ray.get_resource_ids in a task that has acquired dynamically updated resources:

>>> ray.experimental.create_resource("a",1)	# This creates 1 resource with id 0 
>>> ray.experimental.create_resource("a",5)	# This creates 4 more resources with ids -1.
>>> @ray.remote(resources={"a":5})
... def f():  
...  return ray.get_resource_ids()
... 
>>> print(ray.get(f.remote()))
{'a': [(-1, 1.0), (-1, 1.0), (-1, 1.0), (-1, 1.0), (0, 1.0)], 'CPU': [(7, 1.0)]}
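On the C++ side, the id assignment described above could look roughly like the following sketch. GrowWithPlaceholderIds and kDynamicResourceId are hypothetical names introduced for illustration; the only detail taken from the thread is that every dynamically added resource gets the id -1.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, not the Ray implementation: grows the pool of whole IDs
// from current_capacity to new_capacity, tagging every added unit with -1.
void GrowWithPlaceholderIds(std::vector<int64_t> &whole_ids, int64_t current_capacity,
                            int64_t new_capacity) {
  assert(current_capacity >= 0 && new_capacity >= current_capacity);
  const int64_t kDynamicResourceId = -1;  // Assumed sentinel for dynamic resources.
  for (int64_t i = current_capacity; i < new_capacity; ++i) {
    whole_ids.push_back(kDynamicResourceId);
  }
}
```

Starting from a single resource with id 0 and growing the capacity to 5 leaves one id 0 and four ids -1, matching the ids in the output above.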

AmplabJenkins · 2019-03-27T23:12:29Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13305/
Test PASSed.

AmplabJenkins · 2019-03-29T06:52:07Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13350/
Test FAILed.

robertnishihara · 2019-03-29T06:53:17Z

src/ray/raylet/scheduling_resources.cc

+void ResourceIds::UpdateCapacity(int64_t new_capacity) {
+  // Assert the new capacity is positive for sanity
+  RAY_CHECK(new_capacity >= 0);
+  int64_t capacity_delta = new_capacity - total_capacity_;


I'm confused about this. Is total_capacity_ always an int? It's currently declared as a double.

total_capacity_ is a double because, by definition, it should include fractional resources. I could change it to int to reflect its current usage, but I think it's better to leave it as a double so it stays true to its definition. Thanks for pointing this out though; I've updated this line to int64_t capacity_delta = new_capacity - (int64_t)total_capacity_; to truncate any fractional resources from this calculation.

total_capacity_ is initialized to an integer, right? And it's only updated in UpdateCapacity, which requires an integer value, right? So when could it be not an integer?

Also, the line int64_t capacity_delta = new_capacity - (int64_t)total_capacity_; really concerns me because if total_capacity_ has some fractional value, then we are truncating it and losing that value. Wouldn't that lead to incorrect behavior?

Also, if we do need to cast it, please write it as int64_t capacity_delta = new_capacity - static_cast<int64_t>(total_capacity_); to avoid C-style casts.

total_capacity_ is not necessarily initialized as an integer - it may be initialized as a double when a ResourceIds object is instantiated with fractional resources.

I think the cast to int is required in int64_t capacity_delta = new_capacity - (int64_t)total_capacity_; for correct behavior. Consider the two possible cases:

total_capacity_ is a whole number -> there are no fractional resources. The cast has no effect.

total_capacity_ is not a whole number -> there exist fractional resources. Without the explicit int64_t cast, the subtraction new_capacity - total_capacity_ followed by the implicit conversion to int64_t for capacity_delta would give an incorrect result. For instance, consider new_capacity = 8 and total_capacity_ = 7.2. Evaluating int64_t capacity_delta = new_capacity - total_capacity_; gives 0.8, which truncates to 0, whereas we want 1 (so that the resultant capacity is 8.2). With the cast, the expression evaluates to 8 - 7 = 1.
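The arithmetic in the two cases can be checked with a small sketch. DeltaWithoutCast and DeltaWithCast are hypothetical helper names used only for illustration; they isolate the two expressions being compared and are not part of the PR.

```cpp
#include <cassert>
#include <cstdint>

// Subtract in floating point first, then truncate the difference.
// This is what happens when total_capacity is not cast before subtracting.
int64_t DeltaWithoutCast(int64_t new_capacity, double total_capacity) {
  assert(new_capacity >= 0);
  return static_cast<int64_t>(new_capacity - total_capacity);
}

// Truncate total_capacity to its whole part first, then subtract integers.
int64_t DeltaWithCast(int64_t new_capacity, double total_capacity) {
  assert(new_capacity >= 0);
  return new_capacity - static_cast<int64_t>(total_capacity);
}
```

For new_capacity = 8 and total_capacity = 7.2, the first helper truncates 0.8 to 0 while the second computes 8 - 7 = 1. For a whole-number capacity such as 7.0, both return the same value.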

AmplabJenkins · 2019-03-29T06:53:26Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13352/
Test FAILed.

AmplabJenkins · 2019-03-29T09:52:53Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13351/
Test FAILed.

AmplabJenkins · 2019-03-29T16:42:06Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13359/
Test FAILed.

raulchen · 2019-03-30T06:28:44Z

src/ray/raylet/scheduling_resources.h

-  /// \return True if the resource set was added successfully. False otherwise.
-  bool AddResourcesStrict(const ResourceSet &other);
+  /// \param total_resources: Total resource set which sets upper limits on capacity for
+  /// each label. \return True if the resource set was added successfully. False


Suggested change

/// each label. \return True if the resource set was added successfully. False

/// each label.

/// \return True if the resource set was added successfully. False

raulchen · 2019-03-30T06:29:00Z

src/ray/raylet/scheduling_resources.h

-  void Release(const ResourceIdSet &resource_id_set);
+  /// \param add_new_resources If set to to true, creates any resources that do not
+  /// already exist in the ResourceIdSet. Else ignores any new resources and does not add
+  /// them back to available_resources_. \return Void.


Suggested change

/// them back to available_resources_. \return Void.

/// them back to available_resources_.

/// \return Void.

AmplabJenkins · 2019-03-30T07:18:35Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13374/
Test FAILed.

robertnishihara · 2019-03-30T19:41:20Z

src/ray/raylet/scheduling_resources.h

+  /// A double to track the total capacity of the resource, since the whole_ids_ vector
+  /// keeps changing
+  double total_capacity_;
+  /// A double to track any pending decrements in capacity that weren't executed because


this is not a double

robertnishihara · 2019-03-30T19:43:02Z

src/ray/raylet/scheduling_resources.cc

-  whole_ids_.insert(whole_ids_.end(), whole_ids_to_return.begin(),
-                    whole_ids_to_return.end());
+  int64_t return_resource_count = whole_ids_to_return.size();
+  if (return_resource_count > decrement_backlog_) {


We're only clearing the decrement_backlog_ when we return whole IDs. So if each task only uses fractional resources, then we will never clear the decrement_backlog_, right?
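The concern can be made concrete with a minimal sketch. The diff excerpt shows only the comparison against decrement_backlog_, so the bookkeeping below is an assumption about what the branches do, and AbsorbBacklog is a hypothetical helper name rather than code from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper, not the Ray implementation. Given the number of whole
// IDs a task returns, use them to pay off pending capacity decrements first.
// Returns how many IDs should actually go back into whole_ids_.
int64_t AbsorbBacklog(int64_t returned_count, int64_t &decrement_backlog) {
  assert(returned_count >= 0 && decrement_backlog >= 0);
  const int64_t absorbed = std::min(returned_count, decrement_backlog);
  decrement_backlog -= absorbed;
  return returned_count - absorbed;
}
```

Under these assumptions, a task that releases only fractional resources calls this with returned_count equal to 0, so the backlog never shrinks. That is the case the question above raises.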

AmplabJenkins · 2019-04-08T07:47:16Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13631/
Test FAILed.

AmplabJenkins · 2019-04-16T00:36:58Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13809/
Test FAILed.

raulchen · 2019-04-22T06:09:57Z

@romilbhardwaj @robertnishihara Is this PR ready to merge now? And what's the status of the 3rd PR? Thanks.

robertnishihara · 2019-04-22T06:13:01Z

@raulchen I don't think it is ready. We should merge #4533 first since there will be conflicts. cc @williamma12

raulchen · 2019-04-22T12:28:29Z

@robertnishihara Thanks. If #4533 still takes a long time to merge, I think we can also merge this PR first, because this PR doesn't introduce new problems.

robertnishihara · 2019-04-24T00:12:48Z

#4533 has been merged, so I think this can be rebased and merged.

Fix typo
More fixes.
Updates to python functions debug statements
Rounding error fixes, removing cpu addition in cython and test fixes.
linting
Fix worker pool test
python linting
Update tune to work with zero capacity == deletion
Add test for zero capacity resource semantics.
python linting
more linting
more linting
more linting
Bad linting
undo python linting
linting and review comments.
Add more epsilon checks
Use functions rather than macros for epsilon compare.
linting
Fixing reconstruction tasks getting stuck in a local resubmission loop
Linting
Fix test_uses_resources
Rounding error fixes, removing cpu addition in cython and test fixes.
Updates to methods and classes to support dynamic custom resources.
Fixes.
Add comment why zero capacity resources must be deleted from queue load.
Review comments, updating SubtractResourcesStrict() to be a single method instead of two signatures.
Removing SubtractResourcesStrict and incorporating review comments.
Incorporating more review comments.
linting
Removing _greatest_id and now allocating id -1 to any new dynamic resource.
Review comments rebase and fix std::max call changing update capacity to int.
Linting
Refactoring fixes
Changing type to int64t
linting
updates to UpdateCapacity
Change to static cast
decerement backlog check in fractional release.
rebase fixes
Remove add_new_resources param from release to abide by new resource zero semantics.
Fix release and delete resource with new semantics.
Add ReleaseConstrained to account for resource release when resource has been deleted.
Updates for zero cap checks
Linting
Linting
rebase fixes
Linting

AmplabJenkins · 2019-04-26T00:46:07Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13973/
Test FAILed.

AmplabJenkins · 2019-04-26T22:20:20Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13984/
Test FAILed.

robertnishihara

I pushed some small changes. LGTM assuming tests pass.

AmplabJenkins · 2019-04-28T01:19:58Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13995/
Test FAILed.

AmplabJenkins · 2019-04-28T01:30:48Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13998/
Test FAILed.

This PR contains changes that help with memory issues ray-project#4465

robertnishihara requested a review from raulchen March 23, 2019 00:31