-
Notifications
You must be signed in to change notification settings - Fork 7k
Updates to scheduling objects to support dynamic custom resources #4465
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updates to scheduling objects to support dynamic custom resources #4465
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're adding an extra total_capacity_ field, then do we need the TotalQuantity method? Can't we just have TotalQuantity return total_capacity_?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the name TotalQuantity() is a bit misleading, since it returns the number of resources currently in the whole_ids_ vector. The whole_ids_ vector size keeps changing when resources are acquired or released, thus TotalQuantity() reflects the available resources. Thus I added total_capacity_ to keep a track of the total count of resources in ResourceIds.
|
Test FAILed. |
|
Can one of the admins verify this patch? |
|
Test FAILed. |
|
@romilbhardwaj the tests seem to be failing. I can reproduce this locally by running e.g., |
raulchen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I left some comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to constrain the upper limits now? And didn't need it before?
Also, it seems that the AddResourcesStrict is not used any more. Remove it if so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to constrain Release now because resources may get removed while they are acquired by a task, and a subsequent release would incorrectly re-add those resources to resources_available_.
I have removed AddResourcesStrict.
|
Test FAILed. |
|
Thanks for the comments, @robertnishihara and @raulchen. I had a closer look at |
|
Thanks, I'll take another look tomorrow. Note, the linter is failing. https://travis-ci.com/ray-project/ray/jobs/187695736 |
|
Test PASSed. |
|
|
Okay, we would still need to allocate the newly created resources some id which can be pushed in to This would be the new behavior when the user calls |
|
Test PASSed. |
|
Test FAILed. |
| void ResourceIds::UpdateCapacity(int64_t new_capacity) { | ||
| // Assert the new capacity is positive for sanity | ||
| RAY_CHECK(new_capacity >= 0); | ||
| int64_t capacity_delta = new_capacity - total_capacity_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused about this. Is total_capacity_ always an int? It's currently declared as a double.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
total_capacity_ is double because by definition, it should include fractional resources. I could change it to int to reflect it's current usage, but I think it's better to leave it as a double so it stays true to it's definition. Thanks for pointing this out though, I've updated this line to int64_t capacity_delta = new_capacity - (int64_t)total_capacity_; to truncate any fractional resources from this calculation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
total_capacity_ is initialized to an integer, right? And it's only updated in UpdateCapacity, which requires an integer value, right? So when could it be not an integer?
Also, the line int64_t capacity_delta = new_capacity - (int64_t)total_capacity_; really concerns me because if total_capacity_ has some fractional value, then we are truncating it and losing that value. Wouldn't that lead to incorrect behavior?
Also, if we do need to cast it, please right it as int64_t capacity_delta = new_capacity - static_cast<int64_t>(total_capacity_); to avoid C-style casts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
total_capacity_ is not necessarily initialized as an integer - it may be initialized as a double when a ResourceIds object is instantiated with fractional resources.
I think the cast to int is required in int64_t capacity_delta = new_capacity - (int64_t)total_capacity_; for correct behavior. Consider the two possible cases:
total_capacity_is a whole number -> there are no fractional resources. The cast has no effect.total_capacity_is not a whole number -> there exist fractional resources. If we do not do the explicitint64_tcast, the subtractionnew_capacity - total_capacity_followed by aint64_tcast forcapacity_deltawould result in an incorrect result. For instance, consider new_capacity = 8 and total_capacity_ = 7.2. The value ofcapacity_deltaafter evaluatingint64_t capacity_delta = new_capacity - (int64_t)total_capacity_;would be 0, whereas we would want it to be 1 (and thus the resultant capacity would be 8.2).
|
Test FAILed. |
|
Test FAILed. |
|
Test FAILed. |
| /// \return True if the resource set was added successfully. False otherwise. | ||
| bool AddResourcesStrict(const ResourceSet &other); | ||
| /// \param total_resources: Total resource set which sets upper limits on capacity for | ||
| /// each label. \return True if the resource set was added successfully. False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| /// each label. \return True if the resource set was added successfully. False | |
| /// each label. | |
| /// \return True if the resource set was added successfully. False |
| void Release(const ResourceIdSet &resource_id_set); | ||
| /// \param add_new_resources If set to to true, creates any resources that do not | ||
| /// already exist in the ResourceIdSet. Else ignores any new resources and does not add | ||
| /// them back to available_resources_. \return Void. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| /// them back to available_resources_. \return Void. | |
| /// them back to available_resources_. | |
| /// \return Void. |
|
Test FAILed. |
| /// A double to track the total capacity of the resource, since the whole_ids_ vector | ||
| /// keeps changing | ||
| double total_capacity_; | ||
| /// A double to track any pending decrements in capacity that weren't executed because |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not a double
| whole_ids_.insert(whole_ids_.end(), whole_ids_to_return.begin(), | ||
| whole_ids_to_return.end()); | ||
| int64_t return_resource_count = whole_ids_to_return.size(); | ||
| if (return_resource_count > decrement_backlog_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're only clearing the decrement_backlog_ when we return whole IDs. So if each task only uses fractional resources, then we will never clear the decrement_backlog_, right?
bb9ae5f to
dc33a03
Compare
|
Test FAILed. |
dc33a03 to
e208e7b
Compare
|
Test FAILed. |
|
@romilbhardwaj @robertnishihara Is this PR ready for merge now? And what's status of the 3rd PR? thanks. |
|
@raulchen I don't think it is ready. We should merge #4533 first since there will be conflicts. cc @williamma12 |
|
@robertnishihara Thanks. In case #4533 still takes long time to merge, I think we can also merge this PR first. Because this PR doesn't introduce new problems. |
|
#4533 has been merged, so I think this can be rebased and merged. |
Fix typo More fixes. Updates to python functions debug statements Rounding error fixes, removing cpu addition in cython and test fixes. linting Fix worker pool test python linting Update tune to work with zero capacity == deletion Add test for zero capacity resource semantics. python linting more linting more linting more linting Bad linting undo python linting linting and review comments. Add more epsilon checks Use functions rather than macros for epsilon compare. linting Fixing reconstruction tasks getting stuck in a local resubmission loop Linting Fix test_uses_resources Rounding error fixes, removing cpu addition in cython and test fixes. Updates to methods and classes to support dynamic custom resources. Fixes. Add comment why zero capacity resources must be deleted from queue load. Review comments, updating SubtractResourcesStrict() to be a single method instead of two signatures. Removing SubtractResourcesStrict and incorporating review comments. Incorporating more review comments. linting Removing _greatest_id and now allocating id -1 to any new dynamic resource. Review comments rebase and fix std::max call changing update capacity to int. Linting Refactoring fixes Changing type to int64t linting updates to UpdateCapacity Change to static cast decerement backlog check in fractional release. rebase fixes Remove add_new_resources param from release to abide by new resource zero semantics. Fix release and delete resource with new semantics. Add ReleaseConstrained to account for resource release when resource has been deleted. Updates for zero cap checks Linting Linting rebase fixes Linting
e208e7b to
24cfe1f
Compare
|
Test FAILed. |
|
Test FAILed. |
robertnishihara
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed some small changes. LGTM assuming tests pass.
|
Test FAILed. |
|
Test FAILed. |
This PR contains changes that help with memory issues ray-project#4465
What do these changes do?
These changes add support to scheduling objects (ResourceSet, ResourceIdSet) to support adding, updating and removing resources at runtime. This is a split of the larger PR #3742.