-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Feature: Add HeartbeatConsistencyChecks #2656
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This is a new feature for allowing keepalive commands to be sent every heartbeat regardless of if they're needed for actual keepalives. I've gotten a ping about a network drop that went undetected for ~8 minutes because of usage only in strings. We see this use case not uncommonly, and because of that string type being the only thing used, we don't detect a protocol fault from subsequent string commands. The symptoms here are partial/unexpected string payloads but ultimately the wrong answers to the wrong query. Given we don't know at what layer this drop happens (this appears to be _extremely_ rare, 1 in quadrillions as far as we know), the best we can currently do is react to it ASAP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes seem good; should this exist as a parse / format token on the string form of the options? (apologies if I missed this)
@mgravell My thinking it that Thoughts? |
@NickCraver want me to push a release? |
@mgravell Already underway now! Just waiting on MyGet to finish building, much appreciated though <3 |
We are seeing daily wrong answers to the wrong query issues. This fix doesn't help either. Starting to lose confidence in this lib, what can we do to help track this down? |
@DTTerastar I'd recommend opening an issue with your details - it sounds like you have a different issue of e.g. caching the wrong thing which may be a usage problem - this was purely around a connection de-sync so if this isn't triggering then it's likely you have a different race issue. |
What is the recommended action for client in this particular case. Should it reconnect, or should it reinitialize the multiplexer and get a new connection (forceReconnect) ? |
if (assertIdentity && map.IsAvailable(RedisCommand.ECHO)) | ||
if (checkResponse && map.IsAvailable(RedisCommand.ECHO)) | ||
{ | ||
msg = Message.Create(-1, flags, RedisCommand.ECHO, (RedisValue)Multiplexer.UniqueId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When this msg
is sent by KeepAlive()
its result is processed by ResultProcessor.Tracer
, and I don't think ResultProcessor.Tracer
actually checks the result for Multiplexer.UniqueId
. But its buddy ResultProcessor.EstablishConnection
would check the result thanks to this check in TracerProcessor.SetResultCore()
:
(!establishConnection || result.IsEqual(connection.BridgeCouldBeNull?.Multiplexer?.UniqueId))
(in ResultProcessor.Tracer
, !establishConnection
is always true)
Not sure what the right fix would be, but maybe add a flag to force ResultProcessor.Tracer
to check the result?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We looked here - are good since we're on the PING
path instead of ECHO
meaning we'll check for PONG
on the response.
This is a new feature for allowing keepalive commands to be sent every heartbeat regardless of if they're needed for actual keepalives. I've gotten a ping about a network drop that went undetected for ~8 minutes because of usage only in strings. We see this use case not uncommonly, and because of that string type being the only thing used, we don't detect a protocol fault from subsequent string commands.
The symptoms here are partial/unexpected string payloads but ultimately the wrong answers to the wrong query. Given we don't know at what layer this drop happens (this appears to be extremely rare, 1 in quadrillions as far as we know), the best we can currently do is react to it ASAP.
There may be a better way to accomplish what this is after - discussing with Marc soon as we're both online but prepping this Pr in case it's best path.