XDS Ring Hash inconsistent after client and server restarts #6927
A point to add: since this setup is on Kubernetes, when the clients or servers crash and come back up, they come up with different IP addresses/names. Just for info, in case that makes any difference.
Hi @ginayeh, any update? Let me know if more info is needed. Thanks.
Hi @k-raval, thanks for your patience. Are you only seeing this inconsistent behavior when both the client AND server are killed? Would you mind sharing your example code to reproduce the inconsistent ring hash behavior?
To answer your question: when I kill either one server or one client independently, this behaviour is not observed, but when I kill a server and a client simultaneously, the problem is observed. Attaching the codebase here. Directory contains:
Environment:
Steps:
Thanks. Let me know if you want to try something.
Hi @ginayeh, any update so far? Let me know. Thanks.
Hi @arvindbr8, let me know if you need any help simulating this. Awaiting your reply. Thanks.
@k-raval -- We appreciate your patience here. It's just taking a bit to come up to speed on xDS ring hash consistency. Will get back to you ASAP.
Hi @k-raval -- apologies. We are a bit thin on the ground with multiple teammates on vacation, hence the delayed turnaround time. I would lean on you for understanding the behavior of istiod here. Looking at your setup, it seems like you are expecting header-based ring hash load balancing. As per the spec:
And this is the code that actually generates the hash for each RPC (
Based on the reading above, the header to use for hashing needs to come through the xDS response as a RouteAction. PS: The particular scenario where the client and server restart is one of those cases where the definition of persistence differs from person to person, especially because in gRPC, in the event of a down/upscale of backends (or even a churn of backends), the ring changes. It's also important to note that a new gRPC channel could potentially create a different ring for the same backends based on the ordering of the endpoints in the response from the xdsResolver.
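For concreteness, here is a minimal sketch (not grpc-go's actual internals) of how a header-based hash_policy maps a request to a 64-bit ring key, assuming the xxHash64 function the xDS ring hash design prescribes; the header value and the random fallback are illustrative:

```go
package main

import (
	"fmt"
	"math/rand"

	"github.com/cespare/xxhash/v2"
)

// requestHash sketches how a header-based hash_policy might derive the
// ring lookup key: hash the header value with xxHash64. If the header is
// missing, fall back to a random key, which randomizes the backend pick.
func requestHash(headerVal string, present bool) uint64 {
	if !present {
		return rand.Uint64() // no header => no affinity
	}
	return xxhash.Sum64String(headerVal)
}

func main() {
	// Two independent clients hashing the same ULID value always compute
	// the same key; any inconsistency must come from ring construction.
	fmt.Println(requestHash("01HF0QK3EXAMPLE", true))
}
```

Under this assumption the request-side hash is fully deterministic across processes, which is why the discussion below focuses on how the ring itself is built.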
Thanks @arvindbr8. On ring formation, my comment is: irrespective of the order in which the xDS resolver sends the list of server endpoints, the client could order them (say, alphabetically) before constructing the ring, so that all clients follow the same behaviour; hashing a given parameter would then choose the same destination server "across clients". Let me know if this makes sense?
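To make the proposal concrete, here is a hypothetical, simplified ring builder (not the grpc-go implementation) that sorts the addresses before placing replicas on the ring; the function names and the replica count are assumptions for illustration:

```go
package main

import (
	"fmt"
	"sort"

	"github.com/cespare/xxhash/v2"
)

type ringEntry struct {
	hash uint64
	addr string
}

// buildRing places `replicas` entries per address on the ring, deriving
// each entry's position purely from the address string, then sorts the
// ring by hash so lookups can binary-search it.
func buildRing(addrs []string, replicas int) []ringEntry {
	sorted := append([]string(nil), addrs...)
	sort.Strings(sorted) // canonical order, independent of resolver order

	var ring []ringEntry
	for _, a := range sorted {
		for i := 0; i < replicas; i++ {
			ring = append(ring, ringEntry{
				hash: xxhash.Sum64String(fmt.Sprintf("%s:%d", a, i)),
				addr: a,
			})
		}
	}
	sort.Slice(ring, func(i, j int) bool { return ring[i].hash < ring[j].hash })
	return ring
}

// pick returns the first entry clockwise from the request hash.
func pick(ring []ringEntry, h uint64) string {
	i := sort.Search(len(ring), func(i int) bool { return ring[i].hash >= h })
	if i == len(ring) {
		i = 0 // wrap around the ring
	}
	return ring[i].addr
}

func main() {
	ring := buildRing([]string{"10.0.0.2:50051", "10.0.0.1:50051"}, 100)
	fmt.Println(pick(ring, xxhash.Sum64String("01HF0QK3EXAMPLE")))
}
```

Because each entry's position depends only on the address string, any two clients that see the same set of addresses build an identical ring under this scheme. Note, though, that on Kubernetes restarts the pod addresses themselves change (as mentioned earlier in this thread), so the ring changes across restarts regardless of ordering.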
Thanks for the logs. The RouteAction looks correct to me with the hash_policy providing the header to use. "hash_policy":{"header":{"header_name":"ULID"}}
Also, I should rephrase my previous comment about ring formation.
What I meant here is that it's hard to guarantee the ring will look alike across all clients. The current hashing algorithm tries to ensure that, on an address update, only a small portion of the keys in the ring are remapped within the same binary. But when the client restarts, the shape of the ring might look totally different.
The ring hash LB policy is implemented as per the xDS spec. Keeping consistency across different runs of the client doesn't seem to be in scope for this LB policy; it might be too barebones for your use case here. A55-xds-stateful-session-affinity.md might be something you're interested in, but with the same caveats around client/server shutdown. That feature is pending implementation in Go and has not been prioritized yet.
Hi @arvindbr8, thanks for the explanation. My requirement is simple: for a given parameter (e.g. the ULID string above), I want all the different clients, independently started and stateless, to hash it to the SAME destination server endpoint. If this costs me unequal load distribution, I am fine with it. If the server endpoints change (add/delete/crash-restart), the same ULID may get hashed to some other server endpoint; that is fine, but I need all the clients to follow the same behaviour independently, without exchanging any state. And I see this working with an Envoy proxy doing the load balancing with the same configuration. However, to avoid the cost of the proxy, I wanted to use gRPC proxyless mode using xDS, and hence expected the same behaviour here as well. Let me know if you see any solution, since you are much more familiar with the gRPC codebase. Thanks again, and awaiting your reply.
Let me follow up with the team to get you better guidance for this use case. Meanwhile, could you confirm that in the Envoy proxy case you are also using the same ring hash LB policy?
This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.
Removing the label.
Hi @arvindbr8, I was able to reproduce the issue using your branch, with logs enabled. Sequence of events:
Attaching logs:
Thanks.
This shows evidence of non-determinism. I'm now somewhat inclined to believe there is an interaction with istiod ADS responses that makes the behavior of gRPC non-deterministic. Anyway, I'll take a look.
I think I have found what the issue could be. I have pushed some more commits to https://github.com/arvindbr8/grpc-go/tree/dbug_6927. This contains a potential fix and additional debug messages (which would be useful in case the fix does not completely resolve the issue). Do you mind running the test again with the HEAD @ https://github.com/arvindbr8/grpc-go/tree/dbug_6927?
Hi @arvindbr8, happy to report that with your new commit, the issue seems to be fixed. Attaching the logs again for your review. Thanks a lot for putting in so much effort. I tried the same procedure twice, and each time I did two sets of restarts (each set consisting of one client and one server).
This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.
@k-raval I think this should fix the bug. Please feel free to reopen this issue if you still have concerns.
gRPC Version - 1.60.1
Go Version - 1.19.1
OS - Linux 5.4.0-81-generic #91-Ubuntu SMP aarch64 GNU/Linux
Setup:
Problem:
When one server and one client are each killed (or crash at the same time) and then restart, the hashing pattern changes. Some clients send the message to one server (say S1), while other clients send it to S2. The header field is still the same; however, different clients now hash it to different destinations.
Expectation:
No matter in what order clients or servers go down and come back up, the final hashing logic should be such that a given field value is hashed to exactly the same destination across all clients.
Other Information:
Instead of using gRPC xDS hashing, if we use an Envoy sidecar that does its own hashing, we always see the expected behaviour. However, we want to use gRPC xDS features, as this avoids one hop to the Envoy proxy and gives us better performance.
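For reference, this is roughly how a proxyless client in this setup would attach the header that the hash_policy keys on; the target URI and the stub call are assumptions based on the description above, not the actual test code:

```go
package main

import (
	"context"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/metadata"
	_ "google.golang.org/grpc/xds" // registers the xds:// resolver scheme
)

func main() {
	// Hypothetical xDS target; the real name comes from istiod's config.
	conn, err := grpc.Dial("xds:///my-service.default.svc:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	// Attach the ULID header that the RouteAction hash_policy hashes on;
	// every RPC carrying the same value should pick the same backend.
	ctx := metadata.AppendToOutgoingContext(context.Background(), "ULID", "01HF0QK3EXAMPLE")
	_ = ctx // invoke a generated stub with ctx, e.g. client.Process(ctx, req)
}
```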
@easwars - we talked about something similar in a question/answer format some time back in #5933. Appreciate your help here. Thanks in advance.