-
Notifications
You must be signed in to change notification settings - Fork 7.1k
Change resource bookkeeping to account for machine precision. #4533
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Can one of the admins verify this patch? |
|
Test FAILed. |
|
I'm skeptical this is enough to fix the problems. Should we just always `std::round`` resources to some finite level (1/10000 of a unit)? |
|
Yes, I discussed this with @williamma12 today. We're going to try to just make the arithmetic exact. This |
|
Test FAILed. |
|
Test FAILed. |
|
This fixes some bugs, right? Is there a reliable way to trigger the error (so that we can add a test for it)? |
|
How about something like this? import numpy as np
import ray
ray.init(num_cpus=2, num_gpus=2, resources={"Custom": 2})
@ray.remote
def f():
return 1
ray.get([f._remote(num_cpus=np.random.uniform()) for _ in range(100)])
ray.get([f._remote(num_gpus=np.random.uniform()) for _ in range(100)])
ray.get([f._remote(resources={"Custom": np.random.uniform()}) for _ in range(100)])
# NOTE: You'll need to try the check below in a loop because available_resources is updated asynchronously.
assert ray.global_state.available_resources() == {'CPU': 2.0, 'Custom': 2.0, 'GPU': 2.0} |
|
@robertnishihara Thanks! Is the loop check needed even though the @romilbhardwaj Does this fix work with #4555? |
|
Test FAILed. |
|
Test FAILed. |
e049ebf to
35d9f44
Compare
|
Test FAILed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should probably be a separate class (which has an int fieldor evenuint8_t` since it only goes from 0 to 1000 or 10000). That will prevent accidental usage errors. It will mean that you need to add some methods for doing addition and subtraction and maybe checking if it's equal to 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually let's use int instead of uint8_t. That way we can check that it never goes below 0 and stays in the appropriate range.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can set the upper bound to 1*conversion_factor_ since the NodeManager.local_available_resources_ uses ResourceIdSet, which uses ResourceSets and does not differentiate between whole and fractional resource, and, as a result, FractionalResourceQuantity ends up representing resources beyond 1.
db53ccb to
06214f2
Compare
|
Test FAILed. |
|
Test FAILed. |
|
Test FAILed. |
|
Test FAILed. |
|
Test FAILed. |
|
Test FAILed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this constructor? Can we get rid of it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to have an empty constructor because when we add values to resource_capacity_ in ResourceSets, std::unordered_map makes use of the default constructor of zero arguments.
python/ray/tests/test_basic.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this going to be really slow since we're doing ray.get inside of the loop? One option is just to put the assert inside of the remote function so you don't need to check the return value.
79279fa to
c239322
Compare
|
Test FAILed. |
|
Test FAILed. |
|
Test FAILed. |
|
Test FAILed. |
|
Test FAILed. |
This PR contains changes that help with memory issues ray-project#4533
What do these changes do?
Changes the internal representation of resource to integers that represent 1/10000 of a unit of each resource to avoid machine precision of doubles
Related issue number
Closes #4503
Linter
scripts/format.shto lint the changes in this PR.