Adding runtime error related apis, metric aggregation, early bailout … #55

dmlyubim · 2021-02-08T21:40:11Z

Why are these changes needed?

Right now, at least TFGPU Multi (used in PPO) takes non-numeric metrics only from the first minibatch of the last SGD step, so the information from all other steps (and even from the other minibatches in the last SGD step) is lost.

We want to be able to analyze and report runtime errors in any minibatch/SGD step. We also want to break out of SGD training early if we detect any fatal errors there.

So we add two APIs:

- One to aggregate non-numeric metrics in ways other than just taking the one from the last SGD iteration/first minibatch.
- One to examine the fetches evaluated during SGD for runtime errors and break out of the SGD training loop if we detect fatal errors (usually loss errors that would immediately result in NaNs in the policy weights, making further training impossible).
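To make the two ideas concrete, here is a minimal, framework-free sketch. It is not the code in this PR: `aggregate_metrics`, `has_fatal_error`, `run_sgd`, and the `"loss"`/`"note"` fetch keys are all hypothetical names chosen for illustration, and the real hooks in TFGPU Multi may look quite different.

```python
import math
from collections import defaultdict
from typing import Any, Callable, Dict, List, Sequence, Tuple


def _is_number(value: Any) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def aggregate_metrics(history: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical aggregator: average numeric metrics, keep every
    non-numeric value instead of only the one from a single minibatch."""
    collected: Dict[str, List[Any]] = defaultdict(list)
    for fetches in history:
        for key, value in fetches.items():
            collected[key].append(value)
    result: Dict[str, Any] = {}
    for key, values in collected.items():
        if all(_is_number(v) for v in values):
            result[key] = sum(values) / len(values)
        else:
            result[key] = values
    return result


def has_fatal_error(fetches: Dict[str, Any]) -> bool:
    """Hypothetical check: a non-finite loss is treated as fatal, because
    applying it would put NaNs into the policy weights."""
    loss = fetches.get("loss")
    return _is_number(loss) and not math.isfinite(loss)


def run_sgd(
    minibatches: Sequence[Any],
    num_sgd_iter: int,
    train_step: Callable[[Any], Dict[str, Any]],
) -> Tuple[Dict[str, Any], bool]:
    """Run the SGD loop, recording fetches for every minibatch and
    bailing out as soon as a fatal error is detected."""
    history: List[Dict[str, Any]] = []
    for _ in range(num_sgd_iter):
        for batch in minibatches:
            fetches = train_step(batch)
            history.append(fetches)
            if has_fatal_error(fetches):
                return aggregate_metrics(history), True  # aborted early
    return aggregate_metrics(history), False


def fake_step(batch: int) -> Dict[str, Any]:
    # Stand-in for one evaluated training step; batch 2 produces a NaN loss.
    loss = float("nan") if batch == 2 else 0.5
    return {"loss": loss, "note": f"batch-{batch}"}


metrics, aborted = run_sgd([0, 1, 2, 3], num_sgd_iter=3, train_step=fake_step)
```

In this sketch the loop stops at the third minibatch of the first SGD iteration, and the `"note"` entries from all minibatches seen so far survive in the aggregated result, rather than only one of them.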

Related issue number

Checks

- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

…in TFGraphGPUMulti

Edilmo

LGTM

…in TFGraphGPUMulti (#55)

Adding runtime error related apis, metric aggregation, early bailout …

2b3a551

…in TFGraphGPUMulti

dmlyubim marked this pull request as ready for review February 8, 2021 21:44

Edilmo approved these changes Feb 8, 2021


Edilmo merged commit 171c6c4 into releases/0.8.6 Feb 9, 2021

Edilmo deleted the dmlyubim/runtime_err_apis branch February 9, 2021 01:25

Edilmo pushed a commit that referenced this pull request Feb 10, 2021

Adding runtime error related apis, metric aggregation, early bailout …

c7031b4

…in TFGraphGPUMulti (#55)
