Skip to content

Conversation

@barakmich
Copy link
Contributor

Related issue number

Closes #13446

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@barakmich barakmich marked this pull request as ready for review January 15, 2021 23:17
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need three different retry loops for the different stages? Why not have just a single one that tests this last condition?

Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on simplifying

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 15, 2021
@barakmich
Copy link
Contributor Author

barakmich commented Jan 15, 2021

Because it could fail at any one of these stages.

Now, one big retry loop would look like a finite state machine of

while retrying:
   if in step A:
   if in step B:
   if in step C:

But we can't "test this last condition" alone. If this is what you'd prefer, I'll go do that.

@ericl
Copy link
Contributor

ericl commented Jan 16, 2021

Why can't we test the last condition alone? C implies {A, B} are fine right? I think a loop that just tests C would be sufficient. If it's for better error messages, we could also check {A, B} prior to C in the loop.

@ericl
Copy link
Contributor

ericl commented Jan 16, 2021

Btw, it would be good to have a test here

Change-Id: I4e96d1c4dd7252b754482892597280504a4ba63c
Change-Id: I9b7fae0c7bf15d86c3f3a9bbb6b4eded438975f7
Change-Id: I54c890e7e5d6452bcb58312d54c746e5c8e556bc
Change-Id: I278dda94f4bc0c53d93908afdd32011ab1951a13
Change-Id: I23dffe65feb4c778985be724f3ab213a61eb2da8
Change-Id: I000bb467e61ca5f2f4e0345d07ca01acf143956f
Change-Id: I746c4dda014b8b5b7752c07299f9b42951fe6cb8
Change-Id: I6a0b6d7dccba3d34f6e2c7864909394d016decd3
@barakmich barakmich removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 20, 2021
Change-Id: I9102433a212846af77ac7a23322e35b48d8464b8
Change-Id: I443e911ca6e8e6c5fcc91d2ec5a4990d81215c1e
@barakmich
Copy link
Contributor Author

Sure.

Removed state machine and added test.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 22, 2021
Change-Id: I815b4a0266e4e0fd6331a6b9957309c563c58038
@barakmich barakmich removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 22, 2021
@ericl
Copy link
Contributor

ericl commented Jan 22, 2021

//python/ray/tests:test_client                                          TIMEOUT in 3 out of 3 in 75.0s
  Stats over 3 runs: max = 75.0s, min = 75.0s, avg = 75.0s, dev = 0.0s
  /home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_client/test.log
  /home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_client/test_attempts/attempt_1.log
  /home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_client/test_attempts/attempt_2.log

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 22, 2021
Change-Id: I3669ea2734de9e7aad8522a898d3248aa560892a
Change-Id: Iae12027d2340e81832305a60a297f85deb8f2919
@barakmich barakmich removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 22, 2021
@ericl
Copy link
Contributor

ericl commented Jan 22, 2021

Tests / lint still failing

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 22, 2021
Change-Id: Ia3e8df94ea689b38d87f8bbafb5a1142edd7b3e1
@barakmich
Copy link
Contributor Author

I'm starting to run out of ideas, but we'll see if this goes

@barakmich barakmich removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 23, 2021
@barakmich
Copy link
Contributor Author

I think I figured it out

Change-Id: Ie691ed9ac082639c583eec4f6af7578b0841c744
Change-Id: I1f926cd492e9076c77576f388cdbb4ec45de77d1
@barakmich
Copy link
Contributor Author

This time for sure -- I got the reproduction locally and tracked it down

@ericl ericl merged commit e675e5b into ray-project:master Jan 24, 2021
fishbone pushed a commit to fishbone/ray that referenced this pull request Feb 16, 2021
fishbone added a commit to fishbone/ray that referenced this pull request Feb 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ray client connect retries not sufficient: sometimes get 502 error

2 participants