kube-up.sh: set inotify limits #130990
Conversation
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
That's a bit low on max_user_instances, but not all that low on max_user_watches.

/test
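For reference, a minimal sketch of how to inspect and raise these sysctls on a node. The values shown are illustrative defaults-plus-headroom, not necessarily the values this PR sets:

```bash
# Show the current per-user inotify limits (defaults vary by distro/image).
sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches

# Raise them for the running kernel (not persistent across reboots);
# illustrative values only.
sudo sysctl -w fs.inotify.max_user_instances=8192
sudo sysctl -w fs.inotify.max_user_watches=524288
```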
/test pull-kubernetes-e2e-gce-cos

COS:

/test pull-kubernetes-e2e-gce-cos
also debug inotify limits before/after setting
force-pushed from fb865a2 to a264b00
OK, that worked on Ubuntu; pushed an updated patch aimed at fixing this on COS ...
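For context, a rough sketch of the kind of node-setup step being described, assuming a plain sysctl-based helper; the actual kube-up.sh change may differ in names, values, and where it hooks in:

```bash
#!/usr/bin/env bash
# Hypothetical helper: log the inotify limits, raise them, then log again so
# the before/after values show up in CI logs on both Ubuntu and COS nodes.
set -o errexit -o nounset -o pipefail

echo "inotify limits before:"
sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches

# Illustrative values only; the PR's chosen values may differ.
sysctl -w fs.inotify.max_user_instances=8192
sysctl -w fs.inotify.max_user_watches=524288

echo "inotify limits after:"
sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches
```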
COS works now:
/test pull-kubernetes-e2e-gce

(Both passed and worked as intended, but running them again because this PR is aimed at fixing flaky test(s) hitting the limits.)
Flaked for other reasons, including a timeout, but not the target test case.

🤨 did we add something that creates a LOT of pods?
/test pull-kubernetes-e2e-gce

kubelet/containerd creates 3-4 inotify watches per pod container, so I guess we are expected to exceed the max_user_instances limit.
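As a rough way to check that on a node, inotify instances can be counted per process via /proc; this is a generic inspection one-liner, not something from the PR. With a low default like fs.inotify.max_user_instances=128, a node running 110+ pods can plausibly exhaust the budget:

```bash
# Each inotify instance shows up as an anon_inode:inotify fd in /proc.
# Count them per PID to see which processes hold the most instances.
sudo find /proc/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null \
  | cut -d/ -f3 \
  | sort | uniq -c | sort -rn | head
```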
Yeah. We shouldn't after this change, though? But I also didn't expect CI tests to be creating 110+ pods at once on a worker node (OK, a few are from the cluster system pods); that error suggests we may have started hitting the inotify limit more often due to creating more pods. I have not seen that failure mode in these jobs before.
Again, no failures that time, and still no failures of the sig-cli "last line" test on this PR.

Known issue, see #124784 (comment): there is a test that indeed creates a lot of pods and can cause issues. There is also an open issue about it: #124369.
/lgtm

This addresses CI flakiness and there is no code change, only the helper scripts that set up the infra for testing.
LGTM label has been added.
Git tree hash: 1365e9030dde742b9f61dd4777ce7c3fd497f500
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, BenTheElder

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/test pull-kubernetes-e2e-gce
Flaky SSH tests, which reminds me ... we should probably sort out making those Feature:SSH or something, and not have a manual skip in kind, etc.
The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass. This bot retests PRs for certain kubernetes repos according to the following rules:
You can:
/retest
Hey Ben, can I increase it to 40 seconds and submit a PR, or were you thinking in a different direction?
I hadn't looked further yet. We're generally avoiding depending on SSH in cluster e2e tests, though the node e2e is another story since it's primarily testing kubelet <> container runtime / OS. For cluster e2es we've migrated them to a hostexec pod with kubectl.

This particular test case seems to be for testing the SSH functionality itself; 20 seconds sounds plenty long to start an SSH connection?

If we go to go.k8s.io/triage and search for results for "SSH should SSH to all nodes and run commands" as the test, we can find more failures. In this particular case, if we look at the logs more: that doesn't seem like an issue that would be resolved by increasing the timeout. It seems more like the SSH key is taking too long to propagate to the VM, or we're using randomized keys, or some other issue with how the cluster is set up or how we set up the SSH client.
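As a rough illustration of the hostexec-style alternative mentioned above (the e2e framework has its own helpers for this; the command below is just the general shape, with a placeholder node name):

```bash
# Run a command on a node without SSH by scheduling a debug pod onto it.
NODE=worker-node-1   # placeholder node name
kubectl debug "node/${NODE}" -it --image=busybox -- chroot /host uname -a
```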
I would start by:
Thanks!
There's also the question ... should we run this in the default PR-blocking tests to begin with? It's the only test case directly mentioning SSH, which isn't a property of Kubernetes (but is possibly used by some other tests). We might attempt to discover whether any of the other tests that run in pull-kubernetes-e2e-gce use SSH, and if not, consider skipping this test. That's a fair bit of work though; I didn't dig that far yet personally.

If there are no other SSH-based tests, then we could reconsider self-skip vs. tagging as a "Feature:SSH" #131038 (comment). But even then, we probably want this to work reliably if any tests anywhere use it ... or fix the tests to not use it.
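One low-effort way to check that, assuming the kubernetes/kubernetes repo layout and the commonly used SSH framework package (names here may not be exhaustive):

```bash
# Look for e2e tests that import or call the SSH framework helpers.
grep -rln "test/e2e/framework/ssh" test/e2e/ | head
grep -rn  "e2essh\." test/e2e/ | head
```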
What type of PR is this?
/kind flake
What this PR does / why we need it:
Debugs and sets inotify limits in kube-up.sh CI clusters to avoid flakes when reading pod logs.
Which issue(s) this PR fixes:
Fixes #130983 (flaking test)
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: