ring_hash stuck in TransientFailure despite having available endpoints #7363

Open
atollena opened this issue Jun 28, 2024 · 1 comment
@atollena
Collaborator

What version of gRPC are you using?

This most likely affects all versions since #5338 added the ring hash behavior of always attempting to connect to at least one endpoint.

What version of Go are you using (go version)?

go version go1.22.4 darwin/arm64

What operating system (Linux, Windows, …) and version?

macOS 14.5

What did you do?

Use the ring_hash balancer with 2 priorities:

  • The highest priority has 3+ endpoints, and none is available
  • The lower priority has at least one available endpoint.

What did you expect to see?

If some, but not all, of the endpoints in the highest priority become available (technically, at least 2 are still unavailable), the traffic should go back to the endpoints with the highest priority.

What did you see instead?

The traffic sometimes continues to go to the lower priority indefinitely, until the endpoints change.

Additional notes

The ring hash balancer is specifically designed to avoid that situation, as outlined in A42: xDS Ring Hash LB Policy - Aggregated Connectivity States:

In addition, once the ring_hash policy reports TRANSIENT_FAILURE, it needs some way to recover from that state. The ring_hash policy normally requires pick requests to trigger subchannel connection attempts, but if it is being used as a child of the priority policy, it will not be getting any picks once it reports TRANSIENT_FAILURE. To work around this, it will make sure that it is attempting to connect (after applicable backoff period) to at least one subchannel at any given time. After a given subchannel fails a connection attempt, it will move on to the next subchannel in the ring. It will keep doing this until one of the subchannels successfully connects, at which point it will report READY and stop proactively trying to connect.

I believe that the solution implemented in Go in #5338 is not complete. Specifically, it always walks the ring from the start forward, looking for an endpoint that is not the endpoint we are currently trying to connect to. As a result, if the ring contains an endpoint twice before every endpoint has appeared once (i.e. the ring looks like [A B A C] rather than [A B C A], which is very likely), it will cycle through the endpoints at the beginning of the ring up to the duplicate (in this case [A B]), without ever trying the remaining endpoints (C).

@atollena atollena self-assigned this Jun 28, 2024
@atollena atollena added this to the 1.66 Release milestone Jun 28, 2024
@atollena atollena added the P1 label Jun 28, 2024
@atollena atollena changed the title ring_hash stuck in TransientFailure despite having available endpoints ringhash stuck in TransientFailure despite having available endpoints Jun 28, 2024
@atollena atollena changed the title ringhash stuck in TransientFailure despite having available endpoints ring_hash stuck in TransientFailure despite having available endpoints Jun 28, 2024
@easwars
Contributor

easwars commented Jun 28, 2024

Thank you very much for all the effort in porting the tests and uncovering these issues.
