Skip to content

Conversation

NickCraver
Copy link
Collaborator

This is a new feature for allowing keepalive commands to be sent every heartbeat regardless of if they're needed for actual keepalives. I've gotten a ping about a network drop that went undetected for ~8 minutes because of usage only in strings. We see this use case not uncommonly, and because of that string type being the only thing used, we don't detect a protocol fault from subsequent string commands.

The symptoms here are partial/unexpected string payloads but ultimately the wrong answers to the wrong query. Given we don't know at what layer this drop happens (this appears to be extremely rare, 1 in quadrillions as far as we know), the best we can currently do is react to it ASAP.

There may be a better way to accomplish what this is after - discussing with Marc soon as we're both online but prepping this Pr in case it's best path.

This is a new feature for allowing keepalive commands to be sent every heartbeat regardless of if they're needed for actual keepalives. I've gotten a ping about a network drop that went undetected for ~8 minutes because of usage only in strings. We see this use case not uncommonly, and because of that string type being the only thing used, we don't detect a protocol fault from subsequent string commands.

The symptoms here are partial/unexpected string payloads but ultimately the wrong answers to the wrong query. Given we don't know at what layer this drop happens (this appears to be _extremely_ rare, 1 in quadrillions as far as we know), the best we can currently do is react to it ASAP.
Copy link
Collaborator

@mgravell mgravell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes seem good; should this exist as a parse / format token on the string form of the options? (apologies if I missed this)

@NickCraver
Copy link
Collaborator Author

@mgravell My thinking it that HeartbeatInterval closely related to this isn't a text option there now and we could add it later if we wanted, but did mean to ask that in PR and dropped it.

Thoughts?

@NickCraver NickCraver merged commit 54a633f into main Feb 26, 2024
@NickCraver NickCraver deleted the craver/echo-heartbeat branch February 26, 2024 12:39
@mgravell
Copy link
Collaborator

@NickCraver want me to push a release?

@NickCraver
Copy link
Collaborator Author

@mgravell Already underway now! Just waiting on MyGet to finish building, much appreciated though <3

@DTTerastar
Copy link

We are seeing daily wrong answers to the wrong query issues. This fix doesn't help either. Starting to lose confidence in this lib, what can we do to help track this down?

@NickCraver
Copy link
Collaborator Author

NickCraver commented Apr 22, 2024

@DTTerastar I'd recommend opening an issue with your details - it sounds like you have a different issue of e.g. caching the wrong thing which may be a usage problem - this was purely around a connection de-sync so if this isn't triggering then it's likely you have a different race issue.

@singhramkumar
Copy link

What is the recommended action for client in this particular case. Should it reconnect, or should it reinitialize the multiplexer and get a new connection (forceReconnect) ?

if (assertIdentity && map.IsAvailable(RedisCommand.ECHO))
if (checkResponse && map.IsAvailable(RedisCommand.ECHO))
{
msg = Message.Create(-1, flags, RedisCommand.ECHO, (RedisValue)Multiplexer.UniqueId);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When this msg is sent by KeepAlive() its result is processed by ResultProcessor.Tracer, and I don't think ResultProcessor.Tracer actually checks the result for Multiplexer.UniqueId. But its buddy ResultProcessor.EstablishConnection would check the result thanks to this check in TracerProcessor.SetResultCore():
(!establishConnection || result.IsEqual(connection.BridgeCouldBeNull?.Multiplexer?.UniqueId))
(in ResultProcessor.Tracer, !establishConnection is always true)

Not sure what the right fix would be, but maybe add a flag to force ResultProcessor.Tracer to check the result?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We looked here - are good since we're on the PING path instead of ECHO meaning we'll check for PONG on the response.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants