@romilbhardwaj
Member

What do these changes do?

These changes let the scheduling objects (ResourceSet, ResourceIdSet) add, update, and remove resources at runtime. This is split out of the larger PR #3742.

Collaborator

If we're adding an extra total_capacity_ field, then do we need the TotalQuantity method? Can't we just have TotalQuantity return total_capacity_?

Member Author

I think the name TotalQuantity() is a bit misleading, since it returns the number of resources currently in the whole_ids_ vector. The size of whole_ids_ changes whenever resources are acquired or released, so TotalQuantity() actually reflects the available resources. That is why I added total_capacity_: to keep track of the total count of resources in ResourceIds.
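
To make the distinction concrete, here is a minimal Python sketch of the idea, not Ray's actual C++ class. The class name ResourceIdsSketch and its methods are hypothetical; the sketch assumes only what the comment above describes, namely that the free-ID list shrinks and grows while the total capacity stays fixed.

```python
class ResourceIdsSketch:
    """Toy model of the distinction discussed above; not Ray's actual class."""

    def __init__(self, num_whole_ids):
        # IDs that are currently free to hand out (stands in for whole_ids_).
        self.whole_ids = list(range(num_whole_ids))
        # Fixed total, independent of how many IDs are checked out
        # (stands in for total_capacity_).
        self.total_capacity = float(num_whole_ids)

    def total_quantity(self):
        # Counts only the free IDs, as described for TotalQuantity().
        return len(self.whole_ids)

    def acquire(self, count):
        # Hand out the first `count` free IDs.
        acquired, self.whole_ids = self.whole_ids[:count], self.whole_ids[count:]
        return acquired

    def release(self, ids):
        # Return previously acquired IDs to the free list.
        self.whole_ids.extend(ids)


ids = ResourceIdsSketch(4)
held = ids.acquire(3)
quantity_while_held = ids.total_quantity()      # only one ID is still free
capacity_while_held = ids.total_capacity        # unchanged by the acquire
ids.release(held)
quantity_after_release = ids.total_quantity()   # all four IDs are free again
```

In this toy model, the quantity drops while a task holds IDs, whereas the capacity does not, which is the reason given for tracking the two separately.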

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13212/
Test FAILed.

@AmplabJenkins

Can one of the admins verify this patch?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13215/
Test FAILed.

@robertnishihara
Collaborator

@romilbhardwaj the tests seem to be failing. I can reproduce this locally by running e.g.,

python -m pytest -v --durations=10 python/ray/tune/tests/test_commands.py::test_ls

Contributor

@raulchen left a comment

Thanks, I left some comments.

Contributor

Why do we need to constrain the upper limits now, when we didn't need to before?

Also, it seems that AddResourcesStrict is no longer used. If so, please remove it.

Member Author

We need to constrain Release now because resources may get removed while they are acquired by a task, and a subsequent release would incorrectly re-add those resources to resources_available_.
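
The scenario above can be sketched in Python as follows. This is an illustration under stated assumptions, not the PR's C++ code: the function name release_constrained, the dict-based resource sets, and the rule that a label missing from the total set counts as deleted are all hypothetical.

```python
def release_constrained(available, released, total):
    """Return `released` to `available`, capped per label by `total`.

    Assumption for this sketch: a label that is absent from `total`
    (or has zero capacity there) has been deleted and is not re-added.
    """
    result = dict(available)
    for label, amount in released.items():
        cap = total.get(label, 0.0)
        if cap <= 0.0:
            # The resource was removed while a task held it: drop it.
            result.pop(label, None)
            continue
        # Never let the available amount exceed the total capacity.
        result[label] = min(result.get(label, 0.0) + amount, cap)
    return result


total = {"CPU": 4.0}               # resource "a" was deleted while a task held it
available = {"CPU": 3.0}
released = {"CPU": 1.0, "a": 2.0}  # the finishing task returns both resources
after = release_constrained(available, released, total)
```

Here the released "a" units are discarded rather than re-added, while the CPU is returned as usual. An unconstrained release would have put "a" back into the available set.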

I have removed AddResourcesStrict.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13237/
Test FAILed.

@romilbhardwaj
Member Author

Thanks for the comments, @robertnishihara and @raulchen. I had a closer look at SubtractResourcesStrict and realized the delete_zero_capacity flag wasn't needed for dynamic resources anymore, and in fact was related to the bug that was causing the tune tests to fail. I've removed it and reverted SubtractResourcesStrict to its original implementation.

@raulchen
Contributor

Thanks, I'll take another look tomorrow. Note, the linter is failing. https://travis-ci.com/ray-project/ray/jobs/187695736

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13271/
Test PASSed.

@robertnishihara
Collaborator

greatest_id_ should no longer be necessary. Let's remove that and get rid of the concept of IDs for dynamic resources.

@romilbhardwaj
Member Author

romilbhardwaj commented Mar 27, 2019

greatest_id_ should no longer be necessary. Let's remove that and get rid of the concept of IDs for dynamic resources.

Okay, we would still need to assign the newly created resources some ID that can be pushed into ResourceIds::whole_ids_. In the latest commit, I assign -1 as the ID to all dynamically updated resources - would that be okay?

This would be the new behavior when the user calls ray.get_resource_ids in a task that has acquired dynamically updated resources:

>>> ray.experimental.create_resource("a", 1)  # This creates 1 resource with id 0.
>>> ray.experimental.create_resource("a", 5)  # This creates 4 more resources with id -1.
>>> @ray.remote(resources={"a": 5})
... def f():
...     return ray.get_resource_ids()
...
>>> print(ray.get(f.remote()))
{'a': [(-1, 1.0), (-1, 1.0), (-1, 1.0), (-1, 1.0), (0, 1.0)], 'CPU': [(7, 1.0)]}

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13305/
Test PASSed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13350/
Test FAILed.

void ResourceIds::UpdateCapacity(int64_t new_capacity) {
// Sanity check: the new capacity must be non-negative.
RAY_CHECK(new_capacity >= 0);
int64_t capacity_delta = new_capacity - total_capacity_;
Collaborator

I'm confused about this. Is total_capacity_ always an int? It's currently declared as a double.

Member Author

total_capacity_ is a double because, by definition, it should include fractional resources. I could change it to int to reflect its current usage, but I think it's better to leave it as a double so it stays true to its definition. Thanks for pointing this out, though. I've updated this line to int64_t capacity_delta = new_capacity - (int64_t)total_capacity_; to truncate any fractional resources from this calculation.

Collaborator

total_capacity_ is initialized to an integer, right? And it's only updated in UpdateCapacity, which requires an integer value, right? So when could it be not an integer?

Also, the line int64_t capacity_delta = new_capacity - (int64_t)total_capacity_; really concerns me because if total_capacity_ has some fractional value, then we are truncating it and losing that value. Wouldn't that lead to incorrect behavior?

Also, if we do need to cast it, please write it as int64_t capacity_delta = new_capacity - static_cast<int64_t>(total_capacity_); to avoid C-style casts.

Member Author

total_capacity_ is not necessarily initialized as an integer - it may be initialized as a double when a ResourceIds object is instantiated with fractional resources.

I think the cast to int is required in int64_t capacity_delta = new_capacity - (int64_t)total_capacity_; for correct behavior. Consider the two possible cases:

  1. total_capacity_ is a whole number -> there are no fractional resources. The cast has no effect.
  2. total_capacity_ is not a whole number -> there exist fractional resources. If we do not do the explicit int64_t cast, the subtraction new_capacity - total_capacity_ followed by an int64_t cast for capacity_delta would produce an incorrect result. For instance, consider new_capacity = 8 and total_capacity_ = 7.2. Without the explicit cast, int64_t capacity_delta = new_capacity - total_capacity_; would evaluate to 0 (0.8 truncated), whereas we want it to be 1 (and thus the resultant capacity would be 8.2).
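
The arithmetic in the two cases can be checked with a small Python model. It assumes only that Python's int() truncates toward zero, as the C++ double-to-int64_t conversion does for these values. The function names are hypothetical.

```python
def delta_with_cast(new_capacity, total_capacity):
    # Models: int64_t delta = new_capacity - static_cast<int64_t>(total_capacity_);
    # The fractional part is dropped before the subtraction.
    return new_capacity - int(total_capacity)


def delta_without_cast(new_capacity, total_capacity):
    # Models: int64_t delta = new_capacity - total_capacity_;
    # The subtraction happens in floating point and is truncated afterwards.
    return int(new_capacity - total_capacity)


with_cast = delta_with_cast(8, 7.2)         # 8 - 7
without_cast = delta_without_cast(8, 7.2)   # roughly 0.8, truncated

# When total_capacity_ is a whole number, the two agree (case 1).
whole_with_cast = delta_with_cast(8, 7.0)
whole_without_cast = delta_without_cast(8, 7.0)
```

With a fractional total_capacity_, truncating before the subtraction gives 1 and truncating after it gives 0, which matches the example in case 2.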

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13352/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13351/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13359/
Test FAILed.

/// \return True if the resource set was added successfully. False otherwise.
bool AddResourcesStrict(const ResourceSet &other);
/// \param total_resources: Total resource set which sets upper limits on capacity for
/// each label. \return True if the resource set was added successfully. False
Contributor

Suggested change
- /// each label. \return True if the resource set was added successfully. False
+ /// each label.
+ /// \return True if the resource set was added successfully. False

void Release(const ResourceIdSet &resource_id_set);
/// \param add_new_resources If set to true, creates any resources that do not
/// already exist in the ResourceIdSet. Else ignores any new resources and does not add
/// them back to available_resources_. \return Void.
Contributor

Suggested change
- /// them back to available_resources_. \return Void.
+ /// them back to available_resources_.
+ /// \return Void.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13374/
Test FAILed.

/// A double to track the total capacity of the resource, since the whole_ids_ vector
/// keeps changing
double total_capacity_;
/// A double to track any pending decrements in capacity that weren't executed because
Collaborator

this is not a double

whole_ids_.insert(whole_ids_.end(), whole_ids_to_return.begin(),
whole_ids_to_return.end());
int64_t return_resource_count = whole_ids_to_return.size();
if (return_resource_count > decrement_backlog_) {
Collaborator

We're only clearing the decrement_backlog_ when we return whole IDs. So if each task only uses fractional resources, then we will never clear the decrement_backlog_, right?
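
The concern can be illustrated with a toy Python model. This is a guess at the shape of the logic, based only on the quoted hunk and the comment above, not the PR's actual implementation. The class BacklogSketch, its fields, and the assumption that returned whole IDs pay off the backlog first are all hypothetical.

```python
class BacklogSketch:
    """Toy model of a deferred capacity decrement; names are hypothetical."""

    def __init__(self, decrement_backlog):
        self.whole_ids = []        # whole IDs currently available
        self.fractional = {}       # resource ID -> available fraction
        self.decrement_backlog = decrement_backlog

    def release_whole(self, ids):
        # Assumed: returned whole IDs pay off the backlog first,
        # and only the remainder become available again.
        absorbed = min(len(ids), self.decrement_backlog)
        self.decrement_backlog -= absorbed
        self.whole_ids.extend(ids[absorbed:])

    def release_fraction(self, resource_id, fraction):
        # Assumed behavior under review: the backlog is never consulted here.
        current = self.fractional.get(resource_id, 0.0)
        self.fractional[resource_id] = current + fraction


# Tasks that only ever return fractions leave the backlog untouched.
fractional_only = BacklogSketch(decrement_backlog=1)
fractional_only.release_fraction(0, 0.5)
fractional_only.release_fraction(0, 0.5)
backlog_after_fractions = fractional_only.decrement_backlog

# Returning whole IDs clears the backlog.
whole = BacklogSketch(decrement_backlog=1)
whole.release_whole([3, 4])
backlog_after_whole = whole.decrement_backlog
```

In this model, two half-unit releases add up to a full unit of the resource, yet the pending decrement is still outstanding. Only the whole-ID path clears it.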

@romilbhardwaj force-pushed the dynamic-res-schedres branch from bb9ae5f to dc33a03 on April 8, 2019 05:49
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13631/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13809/
Test FAILed.

@raulchen
Contributor

@romilbhardwaj @robertnishihara Is this PR ready to merge now? And what's the status of the third PR? Thanks.

@robertnishihara
Collaborator

@raulchen I don't think it is ready. We should merge #4533 first since there will be conflicts. cc @williamma12

@raulchen
Contributor

@robertnishihara Thanks. If #4533 will still take a long time to merge, I think we can merge this PR first, because it doesn't introduce new problems.

@robertnishihara
Collaborator

#4533 has been merged, so I think this can be rebased and merged.

Fix typo

More fixes.

Updates to python functions

debug statements

Rounding error fixes, removing cpu addition in cython and test fixes.

linting

Fix worker pool test

python linting

Update tune to work with zero capacity == deletion

Add test for zero capacity resource semantics.

python linting

more linting

more linting

more linting

Bad linting undo

python linting

linting and review comments.

Add more epsilon checks

Use functions rather than macros for epsilon compare.

linting

Fixing reconstruction tasks getting stuck in a local resubmission loop

Linting

Fix test_uses_resources

Rounding error fixes, removing cpu addition in cython and test fixes.

Updates to methods and classes to support dynamic custom resources.

Fixes.

Add comment why zero capacity resources must be deleted from queue load.

Review comments, updating SubtractResourcesStrict() to be a single method instead of two signatures.

Removing SubtractResourcesStrict and incorporating review comments.

Incorporating more review comments.

linting

Removing _greatest_id and now allocating id -1 to any new dynamic resource.

Review comments

rebase and fix std::max call

changing update capacity to int.

Linting

Refactoring fixes

Changing type to int64_t

linting

updates to UpdateCapacity

Change to static cast

decrement backlog check in fractional release.

rebase fixes

Remove add_new_resources param from release to abide by new resource zero semantics.

Fix release and delete resource with new semantics.

Add ReleaseConstrained to account for resource release when resource has been deleted.

Updates for zero cap checks

Linting

Linting

rebase fixes

Linting
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13973/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13984/
Test FAILed.

Collaborator

@robertnishihara left a comment

I pushed some small changes. LGTM assuming tests pass.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13995/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13998/
Test FAILed.

@robertnishihara merged commit 686d4ca into ray-project:master on Apr 28, 2019
Edilmo added a commit to BonsaiAI/ray that referenced this pull request Feb 10, 2020
This PR contains changes that help with memory issues
ray-project#4465