kube-router Holding on to Routes #1738

Closed
aauren opened this issue Sep 12, 2024 · 3 comments · Fixed by #1739
Assignees
Labels
bug override-stale Don't allow automatic management of stale issues / PRs

Comments

@aauren
Collaborator

aauren commented Sep 12, 2024

What happened?

Over time, kube-router appears to hit conditions in which it keeps BGP routes that have since been withdrawn. This likely happens because route_sync.go maintains its own cache of routes and, at some point, fails to receive a BGP update for one reason or another.

As a result, kube-router keeps reinstalling stale routes to next hops that no longer host the service, which effectively blackholes traffic bound for that service.

What did you expect to happen?

kube-router should have an accurate route state at all times.

How can we reproduce the behavior you experienced?

This behavior is not easily reproduced, and the exact cause is not yet known. It appears to depend on state that builds up over time.

System Information (please complete the following information)

  • Kube-Router Version (kube-router --version): 2.1.3
  • Kube-Router Parameters:
--advertise-external-ip=true --bgp-graceful-restart=true --bgp-graceful-restart-deferral-time=60s --enable-ibgp=false --enable-overlay=false --hairpin-mode=true --kubeconfig=/etc/kubernetes/kubectl-config.yaml --metrics-port=9081 --nodes-full-mesh=false --run-router=true --run-firewall=true --service-cluster-ip-range=172.28.0.0/16 --service-external-ip-range=192.168.1.0/24 --service-external-ip-range=192.168.2.0/24 --peer-router-ips=192.168.3.1,192.168.3.2,192.168.3.3 --peer-router-asns=4220000001,4220000001,4220000001 --peer-router-passwords-file=/etc/kube-router-bgp.conf --cluster-asn=4220000001
  • Kubernetes Version (kubectl version) : 1.28.10
  • Cloud Type: On Prem
  • Kubernetes Deployment Type: Custom
  • Kube-Router Deployment Type: System Service
  • Cluster Size: ~100 nodes

Logs, other output, metrics

No logs are emitted when this issue occurs.

Additional context

kube-router probably needs a consistency check that runs periodically while the routes_sync controller is running.

This would allow the controller to remain primarily event driven while also re-truing its state from time to time, so that it doesn't drift out of sync with the desired state of the BGP subsystem.
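To make the idea concrete, here is a minimal sketch of such a periodic consistency check. It is not kube-router's actual code: `Route`, `RouteLister`, `RouteCache`, and every method name are hypothetical and chosen only for illustration, and it assumes one next hop per prefix and that the authoritative list of routes can be fetched from the BGP subsystem on demand.

```go
package main

import (
	"context"
	"sync"
	"time"
)

// Route is a hypothetical, simplified view of a learned BGP route.
// It assumes a single next hop per prefix (no ECMP).
type Route struct {
	Prefix  string // destination in CIDR form
	NextHop string // gateway currently advertising the prefix
}

// RouteLister is a hypothetical source of truth: it returns the routes
// the BGP subsystem currently holds.
type RouteLister func() ([]Route, error)

// RouteCache is an event-driven cache that can also be reconciled.
type RouteCache struct {
	mu     sync.Mutex
	routes map[string]string // prefix -> next hop
}

func NewRouteCache() *RouteCache {
	return &RouteCache{routes: make(map[string]string)}
}

// Add records a route announced by a BGP update event.
func (c *RouteCache) Add(r Route) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.routes[r.Prefix] = r.NextHop
}

// Delete drops a route withdrawn by a BGP update event.
func (c *RouteCache) Delete(prefix string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.routes, prefix)
}

// Snapshot returns a copy of the cached routes.
func (c *RouteCache) Snapshot() map[string]string {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make(map[string]string, len(c.routes))
	for p, nh := range c.routes {
		out[p] = nh
	}
	return out
}

// Reconcile replaces the cache with the authoritative route list and
// returns how many cached entries were stale (missing from the list or
// pointing at a different next hop). On error the cache is left as-is.
func (c *RouteCache) Reconcile(list RouteLister) (int, error) {
	current, err := list()
	if err != nil {
		return 0, err
	}
	desired := make(map[string]string, len(current))
	for _, r := range current {
		desired[r.Prefix] = r.NextHop
	}

	c.mu.Lock()
	defer c.mu.Unlock()
	stale := 0
	for prefix, nh := range c.routes {
		if want, ok := desired[prefix]; !ok || want != nh {
			stale++
		}
	}
	c.routes = desired
	return stale, nil
}

// Run reconciles on every tick until ctx is cancelled. Event handlers
// keep calling Add and Delete in between ticks.
func (c *RouteCache) Run(ctx context.Context, interval time.Duration, list RouteLister) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// A real controller would log the error and the stale count.
			_, _ = c.Reconcile(list)
		}
	}
}
```

With this shape, a missed withdrawal only leaves a bad route in place until the next tick, at which point the cache is rebuilt from the authoritative list. The interval trades off how long a stale route can survive against how often the full route list has to be fetched.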

@aauren aauren added the bug label Sep 12, 2024
@aauren aauren self-assigned this Sep 17, 2024
@aauren aauren reopened this Oct 20, 2024
@aauren
Collaborator Author

aauren commented Oct 20, 2024

This was accidentally closed on the merge of #1739 and is not yet fully completed.

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Nov 20, 2024
@aauren aauren added override-stale Don't allow automatic management of stale issues / PRs and removed Stale labels Nov 20, 2024
@aauren
Collaborator Author

aauren commented Nov 21, 2024

Fixed via #1763

@aauren aauren closed this as completed Nov 21, 2024