Conversation

@BenTheElder
Member

@BenTheElder BenTheElder commented Mar 21, 2025

What type of PR is this?

/kind flake

What this PR does / why we need it:

Debugs and sets inotify limits in kube-up.sh CI clusters to avoid flakes reading pod logs.

Which issue(s) this PR fixes:

Fixes #130983 (flaking test)

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/flake Categorizes issue or PR as related to a flaky test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 21, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Mar 21, 2025
@k8s-ci-robot k8s-ci-robot added area/provider/gcp Issues or PRs related to gcp provider sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 21, 2025
@BenTheElder
Member Author

BenTheElder commented Mar 21, 2025

https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/130990/pull-kubernetes-e2e-gce/1903173838099714048/artifacts/bootstrap-e2e-master/kube-master-configuration.log

Mar 21 20:14:36.951430 bootstrap-e2e-master configure-helper.sh[8786]: current max_user_watches / max_user_instances
Mar 21 20:14:36.952857 bootstrap-e2e-master configure-helper.sh[8929]: fs.inotify.max_user_watches = 61948
Mar 21 20:14:36.954325 bootstrap-e2e-master configure-helper.sh[8930]: fs.inotify.max_user_instances = 128
Mar 21 20:14:36.956688 bootstrap-e2e-master configure-helper.sh[8932]: /home/kubernetes/bin/configure-helper.sh: line 78: systemd-sysctl: command not found

That's a bit low on max_user_instances, but not all that low on max_user_watches.
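For reference, the two limits from the log can be inspected (and, with root, raised) with plain `sysctl`. A minimal sketch, not the PR's literal diff — the target values shown match the post-change numbers reported later in this thread:

```shell
# Inspect the two inotify limits from the log above (no root needed).
sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances

# Raising them requires root; values match the updated numbers reported
# later in this PR, but the exact mechanism here is illustrative only.
# sudo sysctl -w fs.inotify.max_user_watches=65536
# sudo sysctl -w fs.inotify.max_user_instances=8192
```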

/test

@k8s-ci-robot

This comment was marked as resolved.

@BenTheElder
Member Author

/test pull-kubernetes-e2e-gce-cos

@BenTheElder
Member Author

COS:

Mar 21 21:20:11.247706 bootstrap-e2e-master configure-helper.sh[1423]: current max_user_watches / max_user_instances
Mar 21 21:20:11.249073 bootstrap-e2e-master configure-helper.sh[1563]: fs.inotify.max_user_watches = 63491
Mar 21 21:20:11.250580 bootstrap-e2e-master configure-helper.sh[1564]: fs.inotify.max_user_instances = 1024

https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/130990/pull-kubernetes-e2e-gce-cos/1903188801165987840/artifacts/bootstrap-e2e-master/kube-master-configuration.log

@BenTheElder
Member Author

/test pull-kubernetes-e2e-gce-cos

also debug inotify limits before/after setting
@BenTheElder BenTheElder force-pushed the inotify-watches-kube-up branch from fb865a2 to a264b00 Compare March 21, 2025 23:43
@BenTheElder
Member Author

Ok that worked on Ubuntu, pushed an updated patch aimed at fixing on COS ...
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/130990/pull-kubernetes-e2e-gce/1903213950619619328
/test pull-kubernetes-e2e-gce-cos

@BenTheElder
Member Author

COS works now:

Mar 22 00:09:26.938825 bootstrap-e2e-master configure-helper.sh[1430]: Default max_user_watches / max_user_instances:
Mar 22 00:09:26.940601 bootstrap-e2e-master configure-helper.sh[1570]: fs.inotify.max_user_watches = 63491
Mar 22 00:09:26.942320 bootstrap-e2e-master configure-helper.sh[1571]: fs.inotify.max_user_instances = 1024
Mar 22 00:09:26.955176 bootstrap-e2e-master configure-helper.sh[1430]: Updated max_user_watches / max_user_instances:
Mar 22 00:09:26.956981 bootstrap-e2e-master configure-helper.sh[1574]: fs.inotify.max_user_watches = 65536
Mar 22 00:09:26.959332 bootstrap-e2e-master configure-helper.sh[1575]: fs.inotify.max_user_instances = 8192

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/130990/pull-kubernetes-e2e-gce-cos/1903231268615622656
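For comparison, the conventional way to persist values like these across reboots is a sysctl.d drop-in fragment. The path below is illustrative and the values match the updated numbers in the log above; per that log, the PR applies them through configure-helper.sh at boot rather than exactly like this:

```
# /etc/sysctl.d/90-inotify.conf -- illustrative fragment, not the PR's diff
fs.inotify.max_user_watches = 65536
fs.inotify.max_user_instances = 8192
```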

@BenTheElder BenTheElder changed the title from "WIP: kube-up.sh: set inotify limits" to "kube-up.sh: set inotify limits" Mar 22, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 22, 2025
@BenTheElder
Member Author

/test pull-kubernetes-e2e-gce
/test pull-kubernetes-e2e-gce-cos

(both passed and worked as intended but running them again because this is aimed at fixing flaky test(s) hitting the limits)

@BenTheElder
Member Author

BenTheElder commented Mar 22, 2025

Flaked for other reasons including timeout but not the target test case.

message: 'Pod was rejected: Node didn''t have enough resource: pods, requested: 1,
used: 110, capacity: 110'

🤨 did we add something that creates a LOT of pods?

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/130990/pull-kubernetes-e2e-gce/1903251761490038784

@BenTheElder
Member Author

/test pull-kubernetes-e2e-gce
/test pull-kubernetes-e2e-gce-cos

@AnishShah
Contributor

message: 'Pod was rejected: Node didn''t have enough resource: pods, requested: 1,
used: 110, capacity: 110'

kubelet/containerd creates 3-4 inotify watches per pod container. So I guess we can be expected to exceed the max_user_instances limit.

@BenTheElder
Member Author

kubelet/containerd creates 3-4 inotify watches per pod container. So I guess we can be expected to exceed the max_user_instances limit.

Yeah. We shouldn't after this change though?

But I also didn't expect CI tests to be creating 110+ pods at once on a worker node (OK, a few of those are cluster system pods). That error suggests we may have started hitting the inotify limit more often because we're creating more pods; I have not seen that failure mode in these jobs before.
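One hypothetical way to check the per-container estimate on a node (a diagnostic sketch, not part of this PR) is to tally anonymous inotify file descriptors per process, since each one counts against the owning user's fs.inotify.max_user_instances:

```shell
# Tally inotify fds per process via /proc (Linux only).
for fd in /proc/[0-9]*/fd/*; do
  if [ "$(readlink "$fd" 2>/dev/null)" = "anon_inode:inotify" ]; then
    # Print the owning process's /proc directory once per inotify fd.
    echo "${fd%/fd/*}"
  fi
done | sort | uniq -c | sort -rn | head
```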

@BenTheElder
Member Author

Again no failures that time, and still no failures of the sig-cli last-line test on this PR.
/test pull-kubernetes-e2e-gce
/test pull-kubernetes-e2e-gce-cos

@aojea
Member

aojea commented Mar 22, 2025

🤨 did we add something that creates a LOT of pods?

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/130990/pull-kubernetes-e2e-gce/1903251761490038784

known issue, #124784 (comment)

there is a test that indeed creates a lot of pods and can cause issues,

there is also an open issue about it #124369

@aojea
Member

aojea commented Mar 22, 2025

/lgtm
/approve
/milestone v1.33

This addresses CI flakiness and there is no code change, only changes to the helper scripts that set up the test infrastructure.

@k8s-ci-robot k8s-ci-robot added this to the v1.33 milestone Mar 22, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 22, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 1365e9030dde742b9f61dd4777ce7c3fd497f500

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, BenTheElder

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@aojea
Member

aojea commented Mar 22, 2025

/test pull-kubernetes-e2e-gce

@BenTheElder
Member Author

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/130990/pull-kubernetes-e2e-gce/1903410506631221248

Flaky ssh tests

which reminds me ... we should probably sort out making those Feature:SSH or something, and not have a manual skip in kind etc.
/retest

@k8s-triage-robot

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does not have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

@k8s-ci-robot k8s-ci-robot merged commit 9a8c2a9 into kubernetes:master Mar 22, 2025
15 checks passed
@BenTheElder BenTheElder deleted the inotify-watches-kube-up branch March 24, 2025 15:45
@itayporezky

itayporezky commented Mar 26, 2025

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/130990/pull-kubernetes-e2e-gce/1903410506631221248

Flaky ssh tests

which reminds me ... we should probably sort out making those Feature:SSH or something, and not have a manual skip in kind etc.
/retest

Hey Ben,
I'm looking to contribute :)
After checking the test log and source code, it seems there is an SSH timeout of 20 seconds, defined here.

Can I increase it to 40 seconds and submit a PR, or were you thinking in a different direction?
I couldn't find the source of the flake; it seems the pod may not be ready within 20 seconds, so the timeout is reached.

@BenTheElder
Member Author

BenTheElder commented Mar 26, 2025

Can I increase it to 40 seconds and submit a PR, or were you thinking in a different direction?
I couldn't find the source of the flake; it seems the pod may not be ready within 20 seconds, so the timeout is reached.

I hadn't looked further yet.

We're generally avoiding depending on SSH in cluster e2e tests, though the node e2e is another story since it's primarily testing kubelet <> container runtime / OS.

For cluster e2es we've migrated them to a hostexec pod with kubectl.

This particular test case seems to be for testing the SSH functionality itself, 20 seconds sounds plenty long to start an SSH connection?

If we go to go.k8s.io/triage and search for results for "SSH should SSH to all nodes and run commands" as the test we can find more failures:
https://storage.googleapis.com/k8s-triage/index.html?pr=1&test=SSH%20should%20SSH%20to%20all%20nodes%20and%20run%20commands

In this particular case, if we look at the logs more:

  error dialing prow@34.67.245.27:22: 'ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain', retrying
  error dialing prow@34.67.245.27:22: 'ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain', retrying
  error dialing prow@34.67.245.27:22: 'ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain', retrying
  error dialing prow@34.67.245.27:22: 'ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain', retrying

That doesn't seem like an issue that would be resolved by increasing the timeout.

Seems more like the SSH key is taking too long to propagate to the VM, or we're using randomized keys, or some other issue with how the cluster is set up or how we set up the SSH client.
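A hypothetical first triage step (the file path is made up, the auth lines mirror the quoted log, and the timeout line is an invented contrast case) is to classify the dial errors, since auth failures and genuine timeouts call for different fixes:

```shell
# Sample log: the first two lines mirror the errors quoted above;
# the third is an invented timeout line for contrast.
cat > /tmp/ssh-dial-errors.log <<'EOF'
error dialing prow@34.67.245.27:22: 'ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain', retrying
error dialing prow@34.67.245.27:22: 'ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain', retrying
error dialing prow@34.67.245.27:22: 'dial tcp 34.67.245.27:22: i/o timeout', retrying
EOF

# Auth failures point at key propagation / client setup, not timeouts:
grep -c 'unable to authenticate' /tmp/ssh-dial-errors.log
# Only genuine timeouts would be helped by a longer deadline:
grep -c 'i/o timeout' /tmp/ssh-dial-errors.log
```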

@BenTheElder
Member Author

BenTheElder commented Mar 26, 2025

I would start by:

Thanks!

@BenTheElder
Member Author

There's also the question ... should we run this in the default PR blocking tests to begin with? It's the only test case directly mentioning SSH, which isn't a property of Kubernetes (but is used by some other tests possibly).

We might attempt to discover whether any of the other tests that run in pull-kubernetes-e2e-gce use SSH, and if not, consider skipping this test. That's a fair bit of work though; I haven't dug that far yet personally.

IF there are no other SSH-based tests, then we could reconsider self-skip vs tagging as a "Feature:SSH" #131038 (comment)

But even then, we probably want this to work reliably if any tests anywhere use it ... or fix the tests to not use it.


Development

Successfully merging this pull request may close these issues.

flake: [It] [sig-cli] Kubectl client Simple pod should contain last line of the log
