Reflector should backoff with Jitter to avoid global synchronization during API server OOM #87794
Labels
kind/bug
Categorizes issue or PR as related to a bug.
sig/api-machinery
Categorizes an issue or PR as relevant to SIG API Machinery.
sig/scalability
Categorizes an issue or PR as relevant to SIG Scalability.
What happened:
We recently had a production issue in which the API server of one of our large clusters (~3k nodes) was OOMKilled. This broke every kubelet's connection to the master and caused all kubelets to reconnect at the same time. The resulting burst of API calls (18K QPS, 7x the normal rate) put the API server in an even worse condition and caused more replicas to be OOMKilled.
The current reflector implementation uses a fixed 1-second backoff before calling the next ListWatch, which appears to be the root cause of this instability.
What you expected to happen:
Reflectors should back off with jitter.
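To illustrate the expected behavior, here is a minimal sketch of exponential backoff with jitter using only the Go standard library. It is not the reflector's actual code: the names `jitteredBackoff`, `newRNG`, `baseDelay`, `maxDelay`, and `jitterMax`, as well as the 30-second cap and the doubling policy, are assumptions chosen for illustration. Only the 1-second base comes from the behavior described above.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Hypothetical parameters for illustration; not taken from client-go.
const (
	baseDelay = 1 * time.Second  // the fixed delay the reflector uses today
	maxDelay  = 30 * time.Second // assumed cap on the un-jittered delay
	jitterMax = 1.0              // delay is stretched by a random factor of up to 1+jitterMax
)

// newRNG returns a seeded random source so callers can make runs reproducible.
func newRNG(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// jitteredBackoff returns how long to wait before retry number `attempt`
// (0-based). The delay doubles on each attempt up to maxDelay, then a random
// extra of up to jitterMax times that delay is added, so that clients which
// lost their connection at the same instant do not all retry together.
func jitteredBackoff(attempt int, rng *rand.Rand) time.Duration {
	d := baseDelay
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d + time.Duration(rng.Float64()*jitterMax*float64(d))
}

func main() {
	rng := newRNG(time.Now().UnixNano())
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Printf("attempt %d: wait %v\n", attempt, jitteredBackoff(attempt, rng))
	}
}
```

With a fixed 1-second delay, roughly 3k kubelets that disconnect together also reconnect together. With a scheme like the one sketched here, their retries are spread over a window that widens on each failed attempt. For a real change, the `wait.Jitter` helper in `k8s.io/apimachinery/pkg/util/wait` may be a better starting point than a hand-rolled function.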
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
Kubernetes version (use kubectl version):
OS (e.g. cat /etc/os-release):
Kernel (e.g. uname -a):
/assign
/sig api-machinery scalability