
Decouple the sending of probes from the latency reporting in the NodeLatencyMonitor #6570

Open
antoninbas opened this issue Jul 29, 2024 · 2 comments · May be fixed by #6812
Labels
good first issue Good for newcomers

Comments

@antoninbas
Contributor

At the moment, the NodeLatencyMonitor in the Agent reports latency measurements immediately after sending ICMP probes:

case <-tickerCh:
// Try to send pingAll signal
m.pingAll(ipv4Socket, ipv6Socket)
// We do not delete IPs from nodeIPLatencyMap as part of the Node delete event handler
// to avoid consistency issues and because it would not be sufficient to avoid stale entries completely.
// This means that we have to periodically invoke DeleteStaleNodeIPs to avoid stale entries in the map.
m.latencyStore.DeleteStaleNodeIPs()
m.report()

I believe this is not ideal, because when outputting the NodeLatencyStats, the values for the lastRecvTime and lastSendTime fields can be a bit confusing / misleading:

kubectl get nodelatencystats/kind-worker -o yaml
apiVersion: stats.antrea.io/v1alpha1
kind: NodeLatencyStats
metadata:
  creationTimestamp: null
  name: kind-worker
peerNodeLatencyStats:
- nodeName: kind-control-plane
  targetIPLatencyStats:
  - lastMeasuredRTTNanoseconds: 5837000
    lastRecvTime: "2024-07-26T22:40:03Z"
    lastSendTime: "2024-07-26T22:40:33Z"
    targetIP: 10.10.0.1
- nodeName: kind-worker2
  targetIPLatencyStats:
  - lastMeasuredRTTNanoseconds: 4704000
    lastRecvTime: "2024-07-26T22:40:03Z"
    lastSendTime: "2024-07-26T22:40:33Z"
    targetIP: 10.10.2.1

We are "always" going to have lastRecvTime < lastSendTime, because we always update NodeLatencyStats right after sending a new probe, before the response has had a chance to be received. Ideally, especially with very low inter-Node latency like we have here (a few ms), we would most of the time observe timestamps that are very close to each other, or even identical. This can be achieved by giving the NodeLatencyMonitor enough time to receive and process the response before calling m.report().
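
To illustrate, here is a rough sketch of what a decoupled run loop could look like. This is only an illustration: the ticker variables, reportInterval, and stopCh are placeholder names based on the snippet above, not the actual Antrea implementation.

pingTicker := time.NewTicker(pingInterval)
defer pingTicker.Stop()
// A separate ticker drives reporting, so that by the time report() fires,
// the responses to earlier probes have normally been received and processed.
reportTicker := time.NewTicker(reportInterval)
defer reportTicker.Stop()

for {
	select {
	case <-pingTicker.C:
		m.pingAll(ipv4Socket, ipv6Socket)
		// Stale entries are still cleaned up periodically, as in the current code.
		m.latencyStore.DeleteStaleNodeIPs()
	case <-reportTicker.C:
		m.report()
	case <-stopCh:
		return
	}
}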

Another advantage of decoupling the sending of probes from the latency reporting would be the ability to enforce a minimum time interval between two consecutive reports. At the moment it is possible to set pingIntervalSeconds to 1s (the minimum supported value in the NodeLatencyMonitor CRD), which would cause m.report() to be invoked every second. That may be too frequent for a monitoring tool, especially in a large cluster. We could therefore consider enforcing a minimum reporting interval of 10s (even though that would mean that values of pingIntervalSeconds under 10s are not very useful).
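
As a purely illustrative sketch of that clamping (the 10s constant and the helper name are assumptions, not existing code):

// minReportInterval is a hypothetical floor for how often report() may run,
// regardless of the user-configured pingIntervalSeconds.
const minReportInterval = 10 * time.Second

// reportIntervalFor returns the effective reporting interval: the configured
// ping interval, clamped so it never drops below minReportInterval.
func reportIntervalFor(pingInterval time.Duration) time.Duration {
	if pingInterval < minReportInterval {
		return minReportInterval
	}
	return pingInterval
}

The reporting ticker in the earlier sketch would then be created with reportIntervalFor(pingInterval) instead of the raw ping interval.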

@github-actions

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 28, 2024
@antoninbas antoninbas removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 28, 2024
@faheem047

faheem047 commented Nov 16, 2024

@antoninbas Kindly check: I have raised a PR for this issue, in which I tried to decouple the PingTicker and ReportTicker. Your guidance on this would be highly appreciated.
