Skip to content

Conversation

@pcmoritz
Copy link
Contributor

Why are these changes needed?

When Ray client is proxied over nginx (e.g. via the Kubernetes nginx ingress) and the configuration reloaded, the proxy sends a GOAWAY to the client, which resets the state of the stream to IDLE. Our code interprets this as a disconnect, but in reality the underlying streams seems to still be functional.

Related issue number

Fixes a problem with GOAWAYs that was introduced in #13386

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@pcmoritz pcmoritz requested a review from ckw017 August 30, 2021 06:31
@ckw017
Copy link
Member

ckw017 commented Aug 30, 2021

Can merge these changes to propagate the error correctly (get rid of _pickle.UnpicklingError: invalid load key, '%'): ckw017@09226a1

Other notes:

  • This fix will probably just propagate the internal error seen on the server side logs. I'm guessing "INTERNAL" in this context means "Flow-control protocol violation" mentioned in here: https://grpc.github.io/grpc/core/md_doc_statuscodes.html
  • This fix might be worthwhile. Should be noted that on the next RPC it will technically transition from IDLE -> CONNECTING -> READY, so CONNECTING might also be a valid "is_connected" state after the initial connection is resolved, (can add an additional flag)

@ericl
Copy link
Contributor

ericl commented Aug 30, 2021

Failing LINT and other tests (merge master?)

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 30, 2021
@ericl
Copy link
Contributor

ericl commented Sep 8, 2021

Do we still want this?

@ericl ericl closed this Sep 11, 2021
@pcmoritz
Copy link
Contributor Author

Yeah I think it is not really needed any more. It was most likely due to a protocol error in nginx and if that error happens other things are broken too so it doesn't really help...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants