fix: use slot node id to detect node re-configuration #1913
Seems OK to me. I can give it a try in an environment where I can reproduce the issue.
9bea59d to e3a6d70
e3a6d70 to 425f2fc
Okay, I've added the missing part - please give it a try.
OK, awesome. I'll try it today and let you know!
Tested and still failing. I'll keep working on this and see if there is a non-intrusive way of fixing it.
Okay, let me know if you find out why this approach does not work - it looked promising...
Hi @vmihailenco, let me summarize why the fix proposed in that PR still has issues. But first, a recap of how AWS rolls out the replacement of the master instances.
There are at least two issues with the solution proposed:
To solve this scenario, I've created PR #1914, which follows the original idea that was commented. I've tested it and it mostly addresses the issues, but because of DNS propagation half of the requests fail for about 20 seconds. Still, it's a considerable improvement, considering that without that change the downtime is total: all requests are affected for more than 10 minutes.
Thanks for the explanation and for the new PR. Do you think we should still merge this PR, or does it make the situation worse?
I don't think this PR provides value; in fact, it surfaces the error in a different way, reporting pool closed, which is harder to understand. IMO the way to go (not a must) for adding the ID as part of the state would be to have the nodes indexed by ID instead of addr, but in any case this would be more of a "cosmetic" change that would not help with the issue presented. So, to sum up, I would close this PR.
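For readers following along, here is a rough sketch of what "nodes indexed by ID instead of addr" could look like. It is not code from this repository: `Node`, `Registry`, and `Reconcile` are hypothetical names made up for illustration, and the address is a placeholder. The only point is that when a replacement node comes up behind the same address, an ID-keyed map sees one ID disappear and another appear, whereas an addr-keyed map would see no change. As noted above, this alone would not fix the issue reported.

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a hypothetical stand-in for a cluster node entry.
type Node struct {
	ID   string // node ID as reported by the cluster
	Addr string // host:port the node is currently reachable at
}

// Registry keeps the known nodes keyed by ID rather than by address.
type Registry struct {
	byID map[string]Node
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{byID: make(map[string]Node)}
}

// Reconcile replaces the registry contents with the latest topology and
// returns the IDs that appeared and the IDs that disappeared, both sorted.
func (r *Registry) Reconcile(latest []Node) (added, removed []string) {
	next := make(map[string]Node, len(latest))
	for _, n := range latest {
		if _, seen := next[n.ID]; seen {
			continue // ignore duplicate entries for the same ID
		}
		next[n.ID] = n
		if _, known := r.byID[n.ID]; !known {
			added = append(added, n.ID)
		}
	}
	for id := range r.byID {
		if _, still := next[id]; !still {
			removed = append(removed, id)
		}
	}
	sort.Strings(added)
	sort.Strings(removed)
	r.byID = next
	return added, removed
}

func main() {
	r := NewRegistry()
	r.Reconcile([]Node{{ID: "aaa", Addr: "master.example:6379"}})

	// Same address, different ID: the node behind the name was replaced.
	added, removed := r.Reconcile([]Node{{ID: "bbb", Addr: "master.example:6379"}})
	fmt.Println("added:", added, "removed:", removed)
}
```

With an addr-keyed map, the second call would find `master.example:6379` already present and treat the topology as unchanged; keying by ID makes the replacement visible as one added and one removed entry.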
Thanks, makes sense.
Fixes #1910
@pfreixes what do you think?