@dmlyubim dmlyubim commented Feb 8, 2021

Why are these changes needed?

Right now, at least TFGPU Multi (used in PPO) keeps only the non-numeric metrics from the first minibatch of the last SGD step. The information from all other steps (and even from the other minibatches of the last SGD step) is lost.

We want to be able to analyze and report runtime errors in any minibatch or SGD step. We also want to break out of SGD training early if we detect any fatal errors there.

So we add two APIs:

  • one to aggregate non-numeric metrics in ways other than just taking the value from the last SGD iteration/first minibatch.
  • one to examine the evaluated SGD fetches for runtime errors and break the SGD training loop if we detect fatal errors (usually loss errors that would immediately result in NaNs in the policy weights, making further training impossible).
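
For illustration, here is a minimal sketch of how two such hooks could fit into an SGD loop. All names in it (`run_sgd`, `collect_all`, `has_fatal_error`, and the `total_loss` and `runtime_error` fetch keys) are hypothetical and are not the APIs added by this PR. It only assumes that each minibatch yields a dict of fetches and that a non-finite loss counts as fatal.

```python
import math
from typing import Any, Callable, Dict, List

Fetches = Dict[str, Any]


def collect_all(history: List[Any]) -> List[Any]:
    """Hypothetical aggregator: keep the value from every minibatch."""
    return list(history)


def has_fatal_error(fetches: Fetches) -> bool:
    """Hypothetical check: a non-finite loss is treated as fatal."""
    loss = fetches.get("total_loss")
    return isinstance(loss, float) and not math.isfinite(loss)


def run_sgd(
    minibatch_fn: Callable[[int, int], Fetches],
    num_sgd_iter: int,
    num_minibatches: int,
    aggregate: Callable[[List[Any]], Any] = collect_all,
    is_fatal: Callable[[Fetches], bool] = has_fatal_error,
) -> Dict[str, Any]:
    """Run the SGD loop, recording non-numeric fetches from every minibatch."""
    history: Dict[str, List[Any]] = {}
    stopped_early = False
    for sgd_iter in range(num_sgd_iter):
        for minibatch in range(num_minibatches):
            fetches = minibatch_fn(sgd_iter, minibatch)
            # Record every non-numeric value instead of keeping only one.
            for key, value in fetches.items():
                if not isinstance(value, (int, float)):
                    history.setdefault(key, []).append(value)
            # Stop as soon as a fatal error shows up in the fetches.
            if is_fatal(fetches):
                stopped_early = True
                break
        if stopped_early:
            break
    metrics = {key: aggregate(values) for key, values in history.items()}
    metrics["stopped_early"] = stopped_early
    return metrics


def fake_minibatch(sgd_iter: int, minibatch: int) -> Fetches:
    """Stand-in for a real minibatch step; fails on SGD iteration 1."""
    if (sgd_iter, minibatch) == (1, 0):
        return {"total_loss": float("nan"), "runtime_error": "loss is NaN"}
    return {"total_loss": 0.5, "runtime_error": "ok"}


result = run_sgd(fake_minibatch, num_sgd_iter=3, num_minibatches=2)
print(result)
```

In this sketch the aggregator and the fatal-error check are passed in as plain callables, so a caller could swap in a different aggregation (for example, keeping only distinct error messages) without touching the loop itself.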

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@dmlyubim dmlyubim marked this pull request as ready for review February 8, 2021 21:44
@Edilmo Edilmo left a comment

LGTM

@Edilmo Edilmo merged commit 171c6c4 into releases/0.8.6 Feb 9, 2021
@Edilmo Edilmo deleted the dmlyubim/runtime_err_apis branch February 9, 2021 01:25
Edilmo pushed a commit that referenced this pull request Feb 10, 2021
3 participants