Reflector should backoff with Jitter to avoid global synchronization during API server OOM #87794

Closed
zhan849 opened this issue Feb 3, 2020 · 1 comment · Fixed by #87795
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.

zhan849 commented Feb 3, 2020

What happened:
We recently had a production issue where one of our large clusters (~3k nodes) had its API server OOMKilled. This broke every kubelet's connection to the master and caused all kubelets to reconnect at the same time (a global synchronization). The resulting burst of API calls (18K QPS, 7x the normal rate) pushed the API server into an even worse state and caused more replicas to be OOMKilled.

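The current reflector implementation uses a fixed 1-second backoff before calling the next ListWatch, which is most likely the root cause of this instability: every client that lost its connection at the same moment retries at the same moment.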

What you expected to happen:
Reflectors should back off with jitter.

How to reproduce it (as minimally and precisely as possible):

  1. Create a cluster with a few thousand nodes.
  2. Kill the API server and restart it.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

/assign

/sig api-machinery scalability

@zhan849 zhan849 added the kind/bug Categorizes issue or PR as related to a bug. label Feb 3, 2020
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 3, 2020

jktomer commented Feb 4, 2020

/cc @lavalamp
