What version of gRPC are you using?
This most likely affects all versions since #5338, which added the ring hash behavior of always attempting to connect to at least one endpoint.
What version of Go are you using (go version)?
go version go1.22.4 darwin/arm64
What operating system (Linux, Windows, …) and version?
macOS 14.5
What did you do?
Use the ring_hash balancer with 2 priorities:
The highest priority has 3+ endpoints, and none is available
The lower priority has at least one available endpoint.
What did you expect to see?
If some, but not all, of the endpoints in the highest priority become available (technically, while at least 2 are still unavailable), the traffic should go back to the endpoints in the highest priority.
What did you see instead?
The traffic sometimes continues to go to the lower priority indefinitely, until the endpoints change.
Additional notes
The ring hash balancer is specifically designed to avoid that situation, as outlined in A42: xDS Ring Hash LB Policy - Aggregated Connectivity States:
In addition, once the ring_hash policy reports TRANSIENT_FAILURE, it needs some way to recover from that state. The ring_hash policy normally requires pick requests to trigger subchannel connection attempts, but if it is being used as a child of the priority policy, it will not be getting any picks once it reports TRANSIENT_FAILURE. To work around this, it will make sure that it is attempting to connect (after applicable backoff period) to at least one subchannel at any given time. After a given subchannel fails a connection attempt, it will move on to the next subchannel in the ring. It will keep doing this until one of the subchannels successfully connects, at which point it will report READY and stop proactively trying to connect.
I believe that the solution implemented in Go in #5338 is not complete. Specifically, it always walks the ring from the start forward, looking for an endpoint that is not the endpoint we are currently trying to connect to. As a result, if an endpoint appears twice in the ring before every endpoint has appeared once (i.e. the ring looks like [A B A C] rather than [A B C A], which is very likely), it will cycle through the endpoints at the beginning of the ring up to the duplicate (in this case [A B]), without ever trying the remaining endpoints (C).
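To illustrate the cycling described above, here is a minimal sketch in Go. It is a simplified model of the behavior as described in this report, not the grpc-go code: endpoints are plain strings, and the names `nextFromStart`, `nextFromIndex`, `visitedFromStart` and `visitedFromIndex` are hypothetical. `nextFromIndex` shows one possible alternative (remembering the position in the ring instead of the endpoint identity); it is an assumption for comparison, not a proposed patch.

```go
package main

import "fmt"

// nextFromStart models the behavior described in this issue: scan the ring
// from index 0 and return the first endpoint that differs from current.
func nextFromStart(ring []string, current string) string {
	for _, e := range ring {
		if e != current {
			return e
		}
	}
	return current // the ring holds a single distinct endpoint
}

// nextFromIndex is a hypothetical alternative: remember the position in the
// ring and advance from there, wrapping around and skipping entries equal to
// the one at idx.
func nextFromIndex(ring []string, idx int) int {
	for i := 1; i <= len(ring); i++ {
		j := (idx + i) % len(ring)
		if ring[j] != ring[idx] {
			return j
		}
	}
	return idx // the ring holds a single distinct endpoint
}

// visitedFromStart returns the set of endpoints tried after the given number
// of failed connection attempts when using nextFromStart.
func visitedFromStart(ring []string, steps int) map[string]bool {
	seen := map[string]bool{}
	cur := ring[0]
	for i := 0; i < steps; i++ {
		seen[cur] = true
		cur = nextFromStart(ring, cur)
	}
	return seen
}

// visitedFromIndex does the same using nextFromIndex.
func visitedFromIndex(ring []string, steps int) map[string]bool {
	seen := map[string]bool{}
	idx := 0
	for i := 0; i < steps; i++ {
		seen[ring[idx]] = true
		idx = nextFromIndex(ring, idx)
	}
	return seen
}

func main() {
	ring := []string{"A", "B", "A", "C"}
	// The scan from the start only ever alternates between A and B.
	fmt.Println(len(visitedFromStart(ring, 10))) // prints 2
	// The index-based walk reaches C as well.
	fmt.Println(len(visitedFromIndex(ring, 10))) // prints 3
}
```

In this model, if C is the only endpoint in the highest priority that has become available, the scan from the start never attempts it, which matches the symptom of traffic staying on the lower priority.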
atollena changed the title from "ring_hash stuck in TransientFailure despite having available endpoints" to "ringhash stuck in TransientFailure despite having available endpoints" on Jun 28, 2024
atollena changed the title from "ringhash stuck in TransientFailure despite having available endpoints" to "ring_hash stuck in TransientFailure despite having available endpoints" on Jun 28, 2024