@richardliaw
Contributor

@richardliaw richardliaw commented Nov 27, 2018

Changes include:

  • Notify Components on Requeue
  • Slight refactoring of Node Failure handling
  • Better tests

This is a subset of the changes in #3309, so it should be merged first.

TODO:

  • Add one more test for try_recover

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9623/

from ray.tune.suggest import BasicVariantGenerator


def register_test_trainable():
Contributor Author

Removed in favor of __fake

node = nodes.pop()
cluster.remove_node(node)
assert cluster.wait_for_nodes()
assert ray.global_state.cluster_resources()["CPU"] == 1
Contributor Author

This test previously didn't exercise Tune's resource tracking; I updated it so that it does.

    trial_executor.start_trial(trial)
except Exception as e:
    self.assertIn("a class", str(e))

Contributor Author

This test didn't actually work because start_trial didn't throw; I rewrote it and moved it to ray_trial_executor.py.
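
The pitfall described in this comment can be illustrated with a minimal, self-contained sketch. It is not the Tune test itself: `start_trial_stub`, `StartTrialChecks`, and `run_check` are hypothetical names, and the stub simply assumes a call that handles its own errors and never raises. A `try`/`except` assertion passes vacuously in that situation, whereas `assertRaises` reports a failure.

```python
import unittest


def start_trial_stub():
    """Hypothetical stand-in for a call that swallows errors and never raises."""
    return None


class StartTrialChecks(unittest.TestCase):
    def check_with_try_except(self):
        # The assertion sits in the except branch, so it is skipped
        # whenever the call does not raise; the check still passes.
        try:
            start_trial_stub()
        except Exception as e:
            self.assertIn("a class", str(e))

    def check_with_assert_raises(self):
        # assertRaises fails if the block finishes without raising,
        # so a silently succeeding call is detected.
        with self.assertRaises(Exception):
            start_trial_stub()


def run_check(method_name):
    """Run a single check method and return its unittest.TestResult."""
    result = unittest.TestResult()
    StartTrialChecks(method_name).run(result)
    return result
```

Calling `run_check("check_with_try_except")` yields a successful result even though nothing was verified, while `run_check("check_with_assert_raises")` records a failure.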

    self.start_trial(trial)
else:
    trial.status = Trial.PENDING

Contributor Author

Moved to trial_runner.try_recover for better handling and so that other components can be notified.
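
As a rough illustration of the idea (not the code in this PR), the sketch below shows a runner that restarts a trial if resources are still available and otherwise requeues it while notifying another component. All classes here are hypothetical stand-ins, and the specific notifications sent (`on_trial_error` followed by `on_trial_add`) are an assumption made for the example.

```python
PENDING, RUNNING = "PENDING", "RUNNING"


class Trial:
    """Hypothetical trial with only a name and a status."""

    def __init__(self, name):
        self.name = name
        self.status = RUNNING


class RecordingScheduler:
    """Hypothetical component that records the notifications it receives."""

    def __init__(self):
        self.events = []

    def on_trial_error(self, trial):
        self.events.append(("error", trial.name))

    def on_trial_add(self, trial):
        self.events.append(("add", trial.name))


class Runner:
    """Hypothetical runner owning the recover-or-requeue decision."""

    def __init__(self, scheduler, free_cpus):
        self.scheduler = scheduler
        self.free_cpus = free_cpus

    def try_recover(self, trial, cpus_needed=1):
        # Restart in place if enough resources remain after the node failure.
        if self.free_cpus >= cpus_needed:
            self.free_cpus -= cpus_needed
            trial.status = RUNNING
            return
        # Otherwise requeue, and tell the scheduler so its bookkeeping
        # stays consistent with the trial's new PENDING status.
        self.scheduler.on_trial_error(trial)
        trial.status = PENDING
        self.scheduler.on_trial_add(trial)
```

Keeping this decision in the runner rather than in the executor means the component that owns the scheduler can inform it whenever a trial is requeued.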

@richardliaw richardliaw changed the title [tune] Refactor Node FT + Tests [tune] Node FT for components + Tests Nov 27, 2018
@richardliaw richardliaw requested a review from ericl November 27, 2018 05:10
@richardliaw richardliaw changed the title [tune] Node FT for components + Tests [tune] Component notification on node failure + Tests Nov 27, 2018
@richardliaw richardliaw mentioned this pull request Nov 27, 2018
2 tasks
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9626/

ericl previously requested changes Nov 29, 2018
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9671/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9670/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9672/

@richardliaw richardliaw merged commit 9d0bd50 into ray-project:master Dec 4, 2018
3 participants