ring_hash stuck in TransientFailure despite having available endpoints #7363

atollena · 2024-06-28T07:49:58Z

What version of gRPC are you using?

This most likely affects all versions since the ring hash behavior of always attempting to connect to at least one endpoint was added in #5338.

What version of Go are you using (`go version`)?

go version go1.22.4 darwin/arm64

What operating system (Linux, Windows, …) and version?

MacOS 14.5

What did you do?

Use the ring_hash balancer with 2 priorities:

- The highest priority has 3+ endpoints, none of which is available.
- The lower priority has at least one available endpoint.

What did you expect to see?

If some of the endpoints in the highest priority become available, but not all of them (technically, at least 2 are still unavailable), the traffic should go back to the endpoints in the highest priority.

What did you see instead?

The traffic sometimes continues to go to the lower priority indefinitely, until the endpoints change.

Additional notes

The ring hash balancer is specifically designed to avoid that situation, as outlined in A42: xDS Ring Hash LB Policy - Aggregated Connectivity States:

> In addition, once the ring_hash policy reports TRANSIENT_FAILURE, it needs some way to recover from that state. The ring_hash policy normally requires pick requests to trigger subchannel connection attempts, but if it is being used as a child of the priority policy, it will not be getting any picks once it reports TRANSIENT_FAILURE. To work around this, it will make sure that it is attempting to connect (after applicable backoff period) to at least one subchannel at any given time. After a given subchannel fails a connection attempt, it will move on to the next subchannel in the ring. It will keep doing this until one of the subchannels successfully connects, at which point it will report READY and stop proactively trying to connect.

I believe that the solution implemented in Go in #5338 is incomplete. Specifically, it always walks the ring from the start forward, looking for an endpoint other than the one we are currently trying to connect to. As a result, if an endpoint appears in the ring a second time before every endpoint has appeared once (i.e. the ring looks like [A B A C] rather than [A B C A], which is very likely), it will cycle through the endpoints at the beginning of the ring up to the duplicate (in this case [A B]), without ever trying the remaining endpoints (C).

easwars · 2024-06-28T15:32:19Z

Thank you very much for all the effort in porting the tests and uncovering these issues.

atollena added the Type: Bug label Jun 28, 2024

atollena self-assigned this Jun 28, 2024

atollena added this to the 1.66 Release milestone Jun 28, 2024

atollena added the P1 label Jun 28, 2024

This was referenced Jun 28, 2024

ringhash: improve test coverage #6072

Open

ringhash: more e2e tests from c-core #7334

Open

ringhash: fix bug where ring hash can be stuck in transient failure despite having available endpoints #7364

Draft

atollena changed the title ~~ring_hash stuck in TransientFailure despite having available endpoints~~ ringhash stuck in TransientFailure despite having available endpoints Jun 28, 2024

atollena changed the title ~~ringhash stuck in TransientFailure despite having available endpoints~~ ring_hash stuck in TransientFailure despite having available endpoints Jun 28, 2024
