@williamma12 (Contributor) commented Apr 2, 2019

What do these changes do?

Changes the internal representation of resources to integers, each counting 1/10000 of a unit of a resource, to avoid the rounding errors of double-precision arithmetic.
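
To illustrate the motivation, here is a minimal Python sketch (not code from this PR) of how floating-point updates can drift while integer counts of 1/10000 of a unit stay exact. The helper `to_fixed` and the constant `UNITS_PER_RESOURCE` are names assumed for this example only.

```python
# Illustrative only: the names below are hypothetical, not Ray's API.
UNITS_PER_RESOURCE = 10000  # one integer step = 1/10000 of a resource unit


def to_fixed(amount):
    """Convert a float resource amount to an integer number of 1/10000 units."""
    return int(round(amount * UNITS_PER_RESOURCE))


# Two tasks release 0.1 and 0.2 of a resource whose capacity is 0.3.
float_available = 0.0
float_available += 0.1
float_available += 0.2

fixed_available = 0
fixed_available += to_fixed(0.1)
fixed_available += to_fixed(0.2)

print(float_available == 0.3)            # False: the double sum is 0.30000000000000004
print(fixed_available == to_fixed(0.3))  # True: 1000 + 2000 == 3000
```

The float is rounded once, at conversion time; every later addition or subtraction is plain integer arithmetic, so releasing everything that was acquired returns exactly to the starting value.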

Related issue number

Closes #4503

Linter

  • I've run scripts/format.sh to lint the changes in this PR.

@williamma12 changed the title from "Issues/4503" to "Account for machine precision in resource allocation" Apr 2, 2019
@AmplabJenkins

Can one of the admins verify this patch?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13421/

@ericl (Contributor) commented Apr 3, 2019

I'm skeptical this is enough to fix the problems. Should we just always `std::round` resources to some finite precision (1/10000 of a unit)?

@robertnishihara (Collaborator)

Yes, I discussed this with @williamma12 today. We're going to try to just make the arithmetic exact. This epsilon would have to be put in too many places, which would be error prone.

@williamma12 changed the title from "Account for machine precision in resource allocation" to "Change internal representation of resource to account for machine precision" Apr 7, 2019

@robertnishihara (Collaborator)

This fixes some bugs, right? Is there a reliable way to trigger the error (so that we can add a test for it)?

@robertnishihara (Collaborator)

How about something like this?

import numpy as np
import ray
ray.init(num_cpus=2, num_gpus=2, resources={"Custom": 2})

@ray.remote
def f():
    return 1

ray.get([f._remote(num_cpus=np.random.uniform()) for _ in range(100)])
ray.get([f._remote(num_gpus=np.random.uniform()) for _ in range(100)])
ray.get([f._remote(resources={"Custom": np.random.uniform()}) for _ in range(100)])

# NOTE: You'll need to try the check below in a loop because available_resources is updated asynchronously.
assert ray.global_state.available_resources() == {'CPU': 2.0, 'Custom': 2.0, 'GPU': 2.0}
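
The NOTE in the snippet above can be handled with a small polling helper. This is only a sketch: `wait_for_condition` and its parameters are hypothetical names, not something this thread shows Ray providing, and the demo predicate stands in for the real resource check so the block runs without a cluster.

```python
import time


def wait_for_condition(predicate, timeout_s=10.0, interval_s=0.1):
    """Call predicate() repeatedly until it returns True or the timeout expires.

    Returns True if the predicate succeeded, False if the deadline passed first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Stand-in for an asynchronously updated value: becomes "ready" on the third poll.
polls = {"count": 0}


def demo_predicate():
    polls["count"] += 1
    return polls["count"] >= 3


assert wait_for_condition(demo_predicate, timeout_s=1.0, interval_s=0.01)
```

With Ray, the predicate would presumably wrap the comparison from the snippet, for example `lambda: ray.global_state.available_resources() == {'CPU': 2.0, 'Custom': 2.0, 'GPU': 2.0}`.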

@williamma12 (Contributor, Author)

@robertnishihara Thanks! Is the loop check needed even though the ray.get is a blocking call?

@romilbhardwaj Does this fix work with #4555?

Collaborator

I think this should probably be a separate class (which has an `int` field, or even `uint8_t`, since it only goes from 0 to 1000 or 10000). That will prevent accidental usage errors. It will mean that you need to add some methods for doing addition and subtraction and maybe checking if it's equal to 1.

Collaborator

Actually let's use int instead of uint8_t. That way we can check that it never goes below 0 and stays in the appropriate range.

Contributor, Author

I don't think we can set the upper bound to 1*conversion_factor_, since NodeManager.local_available_resources_ uses ResourceIdSet, which uses ResourceSets and does not differentiate between whole and fractional resources. As a result, FractionalResourceQuantity ends up representing quantities greater than 1.
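
As a rough Python illustration of the design discussed in this review thread (the real change is in C++, and this is not that code), a separate quantity type can hold a signed integer, expose addition and subtraction, and reject negative values. `FixedPointQuantity` and `CONVERSION_FACTOR` are hypothetical names; following the comment above, the sketch assumes no upper bound.

```python
CONVERSION_FACTOR = 10000  # integer steps per resource unit (1/10000 granularity)


class FixedPointQuantity:
    """Hypothetical stand-in for the quantity class discussed above."""

    def __init__(self, amount=0.0):
        # Round once at construction; all later arithmetic is on integers.
        self._units = self._checked(int(round(amount * CONVERSION_FACTOR)))

    @staticmethod
    def _checked(units):
        # A signed integer makes it possible to detect a value going below zero.
        if units < 0:
            raise ValueError("resource quantity must not be negative")
        return units

    @classmethod
    def _from_units(cls, units):
        quantity = cls()
        quantity._units = cls._checked(units)
        return quantity

    def __add__(self, other):
        return self._from_units(self._units + other._units)

    def __sub__(self, other):
        return self._from_units(self._units - other._units)

    def __eq__(self, other):
        return isinstance(other, FixedPointQuantity) and self._units == other._units

    def __hash__(self):
        return hash(self._units)

    def to_float(self):
        return self._units / CONVERSION_FACTOR


total = FixedPointQuantity(0.1) + FixedPointQuantity(0.2)
print(total == FixedPointQuantity(0.3))  # True
```

Wrapping the integer in its own type means callers cannot accidentally mix raw doubles with scaled integers, which is the usage error the review comment is worried about.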

@williamma12 force-pushed the issues/4503 branch 2 times, most recently from db53ccb to 06214f2 April 14, 2019 21:10

Collaborator

Why do we need this constructor? Can we get rid of it?

Contributor, Author

We need a default (zero-argument) constructor because, when we add values to resource_capacity_ in ResourceSets, std::unordered_map calls the default constructor to create the entry.

https://github.com/ray-project/ray/blob/1592ead002086768b8ec6ea609bd9a625a1dc08d/src/ray/raylet/scheduling_resources.cc#L99-L104
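
A loose Python analogue of this constraint, sketched under the assumption that entries are created by indexing a missing key (as `std::unordered_map::operator[]` does, which requires a default-constructible mapped type). `Quantity`, `NoDefaultQuantity`, and `capacity` are hypothetical names for illustration.

```python
from collections import defaultdict


class Quantity:
    """Hypothetical value type with a zero-argument constructor."""

    def __init__(self, units=0):
        self.units = units


class NoDefaultQuantity:
    """Same idea, but the constructor requires an argument."""

    def __init__(self, units):
        self.units = units


# Indexing a missing key builds the value by calling Quantity() with no arguments.
capacity = defaultdict(Quantity)
capacity["CPU"].units += 20000

# Without a zero-argument constructor the same access fails.
broken = defaultdict(NoDefaultQuantity)
try:
    broken["CPU"]
    failed = False
except TypeError:
    failed = True

print(capacity["CPU"].units, failed)
```

In C++ the equivalent failure shows up at compile time rather than at runtime, which is why the constructor cannot simply be removed without also changing how entries are inserted.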

Collaborator

Isn't this going to be really slow since we're doing ray.get inside of the loop? One option is just to put the assert inside of the remote function so you don't need to check the return value.

@robertnishihara merged commit c99e3ca into ray-project:master Apr 23, 2019
Edilmo added a commit to BonsaiAI/ray that referenced this pull request Feb 10, 2020
This PR contains changes that help with memory issues
ray-project#4533

Development

Successfully merging this pull request may close these issues.

Test failure in test_actor.py::test_resource_assignment on Travis.
