Seeing a lot of MOVED errors when RouteByLatency is enabled in ClusterClient #3023
@vmihailenco, @monkey92t Can you please look into this issue?

I believe using nodes that are marked as failed is not a good idea. In most cases, such nodes are unusable, and attempting to use them will result in a predictable error.

In cluster mode, during a node failover, such situations may occur: the hash slots originally handled by node A are transferred to node B. However, the latest cluster state is not retrieved in real time and has some delay. While redirecting commands to the new node might bring certain issues, such as extra network connections, it still ensures the execution of commands. If we use a crashed node, it might not yield any effective results, and returning an error directly might be better.

From the perspective of the Redis server, it expects many clients to perform similar operations; otherwise, the MOVED reply would be meaningless. Redis does not guarantee a permanent relationship between nodes and hash slots. When adding or removing nodes, the mapping between hash slots and nodes may change. Redis uses the MOVED reply to inform the client that the hash slot mapping within the cluster has changed. In normal cluster mode, such situations rarely occur because, in the vast majority of cases, the nodes in a Redis cluster are stable, or replicas are used to avoid single-node failures.
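A MOVED reply names the hash slot and the address of the node that now owns it, which is how the client learns to retry elsewhere and refresh its slot map. As a rough illustration (a hypothetical helper written for this sketch, not go-redis's internal code), parsing such a reply looks like:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMoved extracts the hash slot and target address from a Redis
// MOVED error string such as "MOVED 3999 127.0.0.1:6381".
// Hypothetical helper for illustration, not go-redis internals.
func parseMoved(msg string) (slot int, addr string, ok bool) {
	parts := strings.Fields(msg)
	if len(parts) != 3 || parts[0] != "MOVED" {
		return 0, "", false
	}
	n, err := strconv.Atoi(parts[1])
	if err != nil {
		return 0, "", false
	}
	return n, parts[2], true
}

func main() {
	slot, addr, ok := parseMoved("MOVED 3999 127.0.0.1:6381")
	fmt.Println(slot, addr, ok)
}
```

A client that sees such a reply would retry the command against the returned address and schedule a cluster-state reload.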
@monkey92t thanks for replying. We were having a lot of "all nodes failing" cases when we were benchmarking; this was due to a bug in

But in the situation where we are getting a lot of MOVED errors, we were not having any shard rebalancing at all (or node failovers); the only root cause was

State reloading can be avoided if we choose a random node in the current shard itself, and

On the other hand,

Also, is it possible to add a config to ignore the failure of nodes (i.e., consider them while choosing the lowest-latency node)?
OK, I am trying to understand your point. During your testing, there were no changes to the Redis cluster structure, but the node responsible for a certain range of hash slots experienced a failure (or network issue). As a result, the node corresponding to those hash slots became inaccessible. When go-redis encounters an error while accessing that node, it marks the node as faulty and stops accessing it. However, since all nodes responsible for those hash slots have failed, the commands are randomly sent to any available node. Because the cluster structure hasn't changed, the commands are then redirected (MOVED) back to the original faulty node, and go-redis attempts to send the commands to the faulty node again.

Your solution might be effective, but it will bring about greater side effects. If we force the use of a node that has already failed, it becomes pointless. The reason we mark a node as faulty is to avoid using it until it recovers. If the Redis cluster is performing a normal failover, with a new node taking over from the failed node, it will be difficult to discover the new node information without using MOVED, which is the purpose of the Redis server's MOVED response to the client. Additionally, when the node is experiencing a network failure, accessing it again will worsen an already unhealthy network state.
@srikar-jilugu You can try setting only the ReadOnly parameter; it will select nodes only among the nodes responsible for the hash slot and will not randomly select other nodes.

```go
func (c *clusterState) slotSlaveNode(slot int) (*clusterNode, error) {
	nodes := c.slotNodes(slot)
	switch len(nodes) {
	case 0:
		return c.nodes.Random()
	case 1:
		return nodes[0], nil
	case 2:
		if slave := nodes[1]; !slave.Failing() {
			return slave, nil
		}
		return nodes[0], nil
	default:
		var slave *clusterNode
		for i := 0; i < 10; i++ {
			n := rand.Intn(len(nodes)-1) + 1
			slave = nodes[n]
			if !slave.Failing() {
				return slave, nil
			}
		}
		// All slaves are loading - use master.
		return nodes[0], nil
	}
}
```

In the above code, a random node is selected only when no node at all is found for the hash slot (regardless of whether the nodes are healthy). In all other cases, a node corresponding to the hash slot is returned.

But doing so deviates from your expected setup; it won't look for nodes that are closer to itself, it will only randomly select a node among the slot's replicas.
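To make the fallback behaviour of the quoted slotSlaveNode concrete, here is a self-contained sketch with simplified, hypothetical types (the real clusterNode and Failing() live in osscluster.go):

```go
package main

import (
	"fmt"
	"math/rand"
)

type node struct {
	addr    string
	failing bool
}

// pickSlaveNode mimics the quoted logic: try up to 10 random replicas
// (indices 1..len-1) and fall back to the master (index 0) when every
// replica is marked failing. Simplified, hypothetical types.
func pickSlaveNode(nodes []node) node {
	if len(nodes) == 1 {
		return nodes[0]
	}
	for i := 0; i < 10; i++ {
		n := nodes[rand.Intn(len(nodes)-1)+1]
		if !n.failing {
			return n
		}
	}
	return nodes[0] // all replicas failing: use master
}

func main() {
	nodes := []node{{"master", false}, {"replica-1", true}, {"replica-2", true}}
	// Both replicas are failing, so the master is returned.
	fmt.Println(pickSlaveNode(nodes).addr)
}
```

The point being made above is visible here: the selection never leaves the slot's own node list, so a MOVED redirect cannot be triggered, but it also never considers latency.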
@monkey92t we are opting in for

We weren't experiencing any node failure; we had a lot of cache misses when we benchmarked using v9.5.1, which resulted in nodes getting marked as failed.
@monkey92t I agree that using a bad node can cause an issue, but the library marks nodes as failed whenever there is a bad-connection issue, which can be caused by even a single intermittent context timeout (deadline exceeded error), leading to frequent failures. (Possibly the logic for when a node should be marked as failed needs to change?)

Can you let me know if this can be done, like a configurable option for

We weren't doing this in any other node selection functions:
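The temporary marking being discussed can be sketched as a time-windowed flag. The type and the 15-second window here are assumptions for illustration, not go-redis's clusterNode:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// failingNode marks a node as failed for a fixed window after an error
// instead of permanently. Hypothetical type; the window length is an
// assumption for illustration.
type failingNode struct {
	failedAt atomic.Int64 // unix nanos of the last failure; 0 means healthy
}

func (n *failingNode) MarkFailing() { n.failedAt.Store(time.Now().UnixNano()) }

func (n *failingNode) Failing() bool {
	t := n.failedAt.Load()
	if t == 0 {
		return false
	}
	if time.Since(time.Unix(0, t)) > 15*time.Second {
		n.failedAt.Store(0) // window elapsed: treat as healthy again
		return false
	}
	return true
}

func main() {
	var n failingNode
	n.MarkFailing()
	fmt.Println(n.Failing()) // true immediately after a failure
}
```

Under this model a single intermittent context timeout flips the flag for the whole window, which is exactly the over-eagerness being questioned above.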
I've recently been reflecting on the request flow in redis-cluster. It seems that we shouldn't randomly select another node to execute commands when all the nodes corresponding to a hash slot are down. Similar to the

@vmihailenco Do you understand why we choose a random node?
@monkey92t @vmihailenco Is there a reason why RouteByLatency and RouteRandomly include the master node during selection? Would it be possible to have node selection among the replicas themselves (I think

Use Case: We want to have RouteByLatency for our write-heavy clients too, but the read requests should only be served by slaves.
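The replica-only selection being requested could look roughly like the sketch below. This is a proposal illustration with hypothetical types, not an existing go-redis option:

```go
package main

import "fmt"

type rnode struct {
	addr    string
	master  bool
	latency int // e.g. microseconds, as maintained by a periodic PING loop
}

// closestReplica routes by latency among replicas only, falling back
// to the first node (the master) when no replica exists. Hypothetical
// sketch of the requested behaviour.
func closestReplica(nodes []rnode) rnode {
	var best *rnode
	for i := range nodes {
		n := &nodes[i]
		if n.master {
			continue
		}
		if best == nil || n.latency < best.latency {
			best = n
		}
	}
	if best == nil {
		return nodes[0] // no replicas at all: use the master
	}
	return *best
}

func main() {
	nodes := []rnode{{"master", true, 50}, {"replica-1", false, 300}, {"replica-2", false, 90}}
	fmt.Println(closestReplica(nodes).addr) // replica-2: lowest latency among replicas
}
```

The design choice is simply to filter out masters before the latency comparison, so writes can still use RouteByLatency while reads stay on slaves.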
@monkey92t Any update on this issue? If we are on board with the decision that random selection cannot be done outside the hash slot, I will raise a PR for the same.

@vmihailenco @ofekshenawa Any views?

@monkey92t @vmihailenco @ofekshenawa Are there any hindrances we are facing regarding this approach? Would love to get your feedback.
I am in favor of canceling the random selection of nodes, because there seems to be no clear evidence that a migration has occurred in the Redis cluster; it is merely that the nodes are unavailable or there is a network failure. However, I don't understand why the initial implementation required randomly selecting from all nodes. We should use a default node, such as nodes[0], when all nodes responsible for the hash slot are down.

@monkey92t choosing a default node could overwhelm the master node (nodes[0]) in the case of temporary failures; instead, could we choose the node with the lowest latency when all nodes are failing? wdyt?
@monkey92t It looks like the other authors haven't responded yet. It would be greatly appreciated if you could take ownership of this issue and review the PR that I can raise. We currently depend on this fix for new improvements in our Redis ecosystem.

OK, your PR is welcome. Although the other maintainers haven't responded, we can proceed with the changes.
If no usable nodes are available, we can randomly choose a node responsible for that hash slot. If a fixed node is selected, go-redis might attempt TCP connections, which could lead to larger failures if there are too many attempts.

A better solution might be to use goroutines to continuously ping the lost nodes and mark them as RUNNING when access is restored. If there are any access requests during this period, directly return an error such as 'ERR: no available nodes'.
I think we are failing (marking) the nodes only for a temporary duration; if the node is able to serve requests again, it will no longer be marked as failed. Correct me if I am wrong.
Yes, your idea is not wrong, but it carries a significant risk. If node A goes down and many subsequent requests still use node A, they will send many TCP handshakes or Redis requests, further worsening the situation. If we use a separate goroutine to continuously ping, there will be only one probing request instead of a large number of requests.
Makes sense. We can take aid from the updateLatency() function, which is run periodically (every 10s, I think).

Let's say a node is genuinely down; this func will constantly report a high latency, which can be used to ignore these kinds of nodes when choosing a node while all nodes are marked as failed. A node that is only temporarily down will eventually respond to a ping and update its latency info with a lower value.
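The fallback described here (prefer the lowest measured latency when every node in the shard is marked failed) can be sketched as a simple scan over periodically refreshed latency values; the types are hypothetical:

```go
package main

import "fmt"

type latNode struct {
	addr    string
	latency int // refreshed by a periodic PING; a dead node keeps a very high value
}

// lowestLatencyNode picks the node with the smallest measured latency.
// When all of a slot's nodes are marked failed, a genuinely dead node
// retains a high latency and is naturally avoided, while a node that
// recovered recently has already reported a low value.
func lowestLatencyNode(nodes []latNode) latNode {
	best := nodes[0]
	for _, n := range nodes[1:] {
		if n.latency < best.latency {
			best = n
		}
	}
	return best
}

func main() {
	nodes := []latNode{{"a", 900}, {"b", 120}, {"c", 450}}
	fmt.Println(lowestLatencyNode(nodes).addr) // b
}
```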
@monkey92t made the PR as discussed: #3043, please review.

fix #3043
When we benchmarked our ElastiCache (cluster mode enabled) with the RouteByLatency option enabled in go-redis v9.5.1, we saw an increase in the average response time of our Redis operations (GET and pipeline commands). When we tried to debug this issue and added logs in the process, we saw a lot of MOVED errors that caused retries, which in turn increased latency overall. (go-redis/osscluster.go, line 966 in d43a9fa)

In further debugging, we observed that the slotClosestNode func returns a random node across all the shards when all the nodes are marked as failed. (go-redis/osscluster.go, line 750 in d43a9fa)

In our case, this situation (all nodes failing) happens frequently, which causes frequent MOVED errors.
Expected Behavior

There shouldn't be an increase in response time when RouteByLatency is enabled; in fact, it should decrease if possible, and there shouldn't be many MOVED errors once the client's cluster state is updated.

Current Behavior

An increase in MOVED errors, and hence an increase in the throughput of GET (with the same traffic), in the engine CPU utilisation of all the nodes, and in overall latency.

Possible Solution

When all the nodes are marked as failed, choosing a random node within the shard associated with the slot (even though those nodes are marked as failed) might work for this problem; this is what is done when RouteRandomly is enabled.

Steps to Reproduce
go-redis v9.5.1 with RouteByLatency enabled, throughput around 10-20k rpm with get and pipeline.get.
Context (Environment)

Detailed Description

Possible Implementation

We made changes in the slotClosestNode func implementing the fix we thought of, which actually reduced the MOVED errors (and hence response time) when we benchmarked again. This is the fix we made in our fork.
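The fix described above can be sketched as follows. This is an illustrative reconstruction with hypothetical types, not the actual patch in the fork:

```go
package main

import (
	"fmt"
	"math/rand"
)

type shardNode struct {
	addr    string
	failing bool
}

// chooseInShard sketches the proposed fallback: when every node owning
// the slot is marked failing, pick a random node from the same shard,
// so the command still reaches a slot owner and no MOVED redirect is
// triggered, instead of a random node across the whole cluster.
func chooseInShard(slotNodes []shardNode) shardNode {
	healthy := make([]shardNode, 0, len(slotNodes))
	for _, n := range slotNodes {
		if !n.failing {
			healthy = append(healthy, n)
		}
	}
	if len(healthy) > 0 {
		return healthy[0] // normally the lowest-latency healthy node is chosen here
	}
	return slotNodes[rand.Intn(len(slotNodes))] // stay within the shard
}

func main() {
	nodes := []shardNode{{"m", true}, {"r1", true}, {"r2", true}}
	fmt.Println(chooseInShard(nodes).addr != "") // always some node from the shard
}
```

Because the fallback never leaves the slot's own node list, the server has no reason to reply with MOVED, which is the behaviour the benchmark confirmed.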