[IMPROVED] NRG: Peer activity tracking #7402
Merged
neilalexander merged 1 commit into main on Oct 8, 2025
Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
neilalexander added a commit that referenced this pull request on Oct 10, 2025
neilalexander added a commit that referenced this pull request on Nov 6, 2025
Follow-up to #7402. When shutting down a server with LDM (lame duck mode) or having the leader step down, all peer timestamps would be cleared. As a result, quorum was reported as lost for every Raft node the server was leader for, a "NO quorum, stalled" message was printed, and an advisory was sent. This PR fixes that by ensuring the leader remembers the timestamps after stepping down. Once the new leader comes online, the timestamps of the other followers can still be cleared.

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
neilalexander added a commit that referenced this pull request on Mar 2, 2026
Follow-up to #7402. Also resets the last replicated index, since we only know it if we're the leader: only the leader is in contact with all servers and knows how many messages they have persisted in their logs.

The `Lag` and `Current` values were mostly unusable, stale, or incorrect when read on a follower node. These values are returned in requests like stream and consumer info, as well as JSZ.

So, this PR makes the `n.Peers()` state more usable and consistent:

- The `Lag`/`Current` fields now accurately describe whether the current server is current (or lags) when compared with the peer in the list.
- Only the leader will report `Lag`. All followers that are part of quorum will report `Current: true, Lag: 0`. Other non-current followers not part of quorum will show a `Lag` equal to the number of entries that are committed but not yet persisted on this peer. (This was already the case.)
- A follower will always report other followers as not current with no lag. It doesn't have any contact with other peers, so this data is not useful either way. (Previously this contained stale or unused data, making it incorrect at best and deceiving at worst.)
- A follower will report in `Current` whether it has seen the leader recently.

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
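The reporting rules above can be sketched as a small decision function. This is only one reading of the list, not the actual `n.Peers()` implementation: `describePeer` and its parameters are hypothetical, and treating a peer as current on the leader exactly when its lag is zero is a simplifying assumption.

```go
package main

import "fmt"

// describePeer is a hypothetical helper returning the Current and Lag values
// that a server would report for one peer in its list.
//
//	selfIsLeader: the reporting server is the leader
//	peerIsLeader: the peer being described is the leader
//	commit:       the reporting server's commit index
//	peerIndex:    last index known to be persisted on the peer (leader only)
//	seenRecently: the reporting server heard from the peer recently
func describePeer(selfIsLeader, peerIsLeader bool, commit, peerIndex uint64, seenRecently bool) (current bool, lag uint64) {
	if selfIsLeader {
		// Only the leader knows replication progress, so only it reports Lag.
		if commit > peerIndex {
			lag = commit - peerIndex
		}
		// Assumption: a peer with no outstanding entries counts as current.
		return lag == 0, lag
	}
	if peerIsLeader {
		// A follower can only vouch for the leader it talks to.
		return seenRecently, 0
	}
	// A follower has no contact with other followers.
	return false, 0
}

func main() {
	current, lag := describePeer(true, false, 100, 100, true)
	fmt.Println("leader view, caught-up follower:", current, lag)

	current, lag = describePeer(true, false, 100, 97, true)
	fmt.Println("leader view, lagging follower:", current, lag)

	current, lag = describePeer(false, true, 100, 0, true)
	fmt.Println("follower view, leader:", current, lag)

	current, lag = describePeer(false, false, 100, 0, true)
	fmt.Println("follower view, other follower:", current, lag)
}
```

The follower branches deliberately ignore `peerIndex`, which mirrors the point that a follower has no reliable replication data for anyone else.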
The peer activity timestamp tracking was somewhat inconsistent.
The leader should always be able to reach and report on all peer activity: because it sends heartbeats, it knows best when each peer was last seen. Followers only talk to the leader, so they can only track the leader's activity.
However, during leader changes or cluster resizes, old activity timestamps would remain and would no longer be updated. This is not an issue per se, but since these timestamps are exposed as "last seen" and "active" in various APIs, they could look very misleading. A last seen timestamp could show "17 hours ago", which reads as a serious problem: "a node hasn't been active for 17 hours!?". In reality, that server happened to be a leader 17 hours ago and there has been a new leader since, so the timestamp simply stopped being updated. This PR ensures we track these timestamps more consistently and clears the old timestamps on leader changes.
Signed-off-by: Maurice van Veen <github@mauricevanveen.com>