Conversation

@BenTheElder
Member

@BenTheElder BenTheElder commented Mar 21, 2025

What type of PR is this?

/kind flake

What this PR does / why we need it:

Debugs and sets inotify limits in kube-up.sh CI clusters to avoid flakes reading pod logs.

Which issue(s) this PR fixes:

Fixes #130983 (flaking test)

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/flake Categorizes issue or PR as related to a flaky test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 21, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Mar 21, 2025
@k8s-ci-robot k8s-ci-robot added area/provider/gcp Issues or PRs related to gcp provider sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 21, 2025
@BenTheElder
Member Author

BenTheElder commented Mar 21, 2025

https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/130990/pull-kubernetes-e2e-gce/1903173838099714048/artifacts/bootstrap-e2e-master/kube-master-configuration.log

Mar 21 20:14:36.951430 bootstrap-e2e-master configure-helper.sh[8786]: current max_user_watches / max_user_instances
Mar 21 20:14:36.952857 bootstrap-e2e-master configure-helper.sh[8929]: fs.inotify.max_user_watches = 61948
Mar 21 20:14:36.954325 bootstrap-e2e-master configure-helper.sh[8930]: fs.inotify.max_user_instances = 128
Mar 21 20:14:36.956688 bootstrap-e2e-master configure-helper.sh[8932]: /home/kubernetes/bin/configure-helper.sh: line 78: systemd-sysctl: command not found

That's a bit low on max_user_instances, but not all that low on max_user_watches.
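For reference, the two limits from the log can be inspected (and, with root, raised) with plain `sysctl`. A minimal sketch, not the PR's literal diff — the target values shown match the post-change numbers reported later in this thread:

```shell
# Inspect the two inotify limits from the log above (no root needed).
sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances

# Raising them requires root; values match the updated numbers reported
# later in this PR, but the exact mechanism here is illustrative only.
# sudo sysctl -w fs.inotify.max_user_watches=65536
# sudo sysctl -w fs.inotify.max_user_instances=8192
```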

/test

@k8s-ci-robot

This comment was marked as resolved.

@BenTheElder
Member Author

/test pull-kubernetes-e2e-gce-cos

@BenTheElder
Member Author

COS:

Mar 21 21:20:11.247706 bootstrap-e2e-master configure-helper.sh[1423]: current max_user_watches / max_user_instances
Mar 21 21:20:11.249073 bootstrap-e2e-master configure-helper.sh[1563]: fs.inotify.max_user_watches = 63491
Mar 21 21:20:11.250580 bootstrap-e2e-master configure-helper.sh[1564]: fs.inotify.max_user_instances = 1024

https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/130990/pull-kubernetes-e2e-gce-cos/1903188801165987840/artifacts/bootstrap-e2e-master/kube-master-configuration.log

@BenTheElder
Member Author

/test pull-kubernetes-e2e-gce-cos

also debug inotify limits before/after setting
@BenTheElder BenTheElder force-pushed the inotify-watches-kube-up branch from fb865a2 to a264b00 Compare March 21, 2025 23:43
@BenTheElder
Member Author

Ok that worked on Ubuntu, pushed an updated patch aimed at fixing on COS ...
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/130990/pull-kubernetes-e2e-gce/1903213950619619328
/test pull-kubernetes-e2e-gce-cos

@BenTheElder
Member Author

COS works now:

Mar 22 00:09:26.938825 bootstrap-e2e-master configure-helper.sh[1430]: Default max_user_watches / max_user_instances:
Mar 22 00:09:26.940601 bootstrap-e2e-master configure-helper.sh[1570]: fs.inotify.max_user_watches = 63491
Mar 22 00:09:26.942320 bootstrap-e2e-master configure-helper.sh[1571]: fs.inotify.max_user_instances = 1024
Mar 22 00:09:26.955176 bootstrap-e2e-master configure-helper.sh[1430]: Updated max_user_watches / max_user_instances:
Mar 22 00:09:26.956981 bootstrap-e2e-master configure-helper.sh[1574]: fs.inotify.max_user_watches = 65536
Mar 22 00:09:26.959332 bootstrap-e2e-master configure-helper.sh[1575]: fs.inotify.max_user_instances = 8192

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/130990/pull-kubernetes-e2e-gce-cos/1903231268615622656
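For comparison, the conventional way to persist values like these across reboots is a sysctl.d drop-in fragment. The path below is illustrative and the values match the updated numbers in the log above; per that log, the PR applies them through configure-helper.sh at boot rather than exactly like this:

```
# /etc/sysctl.d/90-inotify.conf -- illustrative fragment, not the PR's diff
fs.inotify.max_user_watches = 65536
fs.inotify.max_user_instances = 8192
```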

@BenTheElder BenTheElder changed the title from "WIP: kube-up.sh: set inotify limits" to "kube-up.sh: set inotify limits" Mar 22, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 22, 2025
@BenTheElder
Member Author

/test pull-kubernetes-e2e-gce
/test pull-kubernetes-e2e-gce-cos

(both passed and worked as intended but running them again because this is aimed at fixing flaky test(s) hitting the limits)

@BenTheElder
Member Author

BenTheElder commented Mar 22, 2025

Flaked for other reasons including timeout but not the target test case.

message: 'Pod was rejected: Node didn''t have enough resource: pods, requested: 1,
used: 110, capacity: 110'

🤨 did we add something that creates a LOT of pods?

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/130990/pull-kubernetes-e2e-gce/1903251761490038784

@BenTheElder
Member Author

/test pull-kubernetes-e2e-gce
/test pull-kubernetes-e2e-gce-cos

@AnishShah
Contributor

message: 'Pod was rejected: Node didn''t have enough resource: pods, requested: 1,
used: 110, capacity: 110'

kubelet/containerd creates 3-4 inotify watches per pod container. So I guess we can be expected to exceed the max_user_instances limit.

@BenTheElder
Member Author

kubelet/containerd creates 3-4 inotify watches per pod container. So I guess we can be expected to exceed the max_user_instances limit.

Yeah. We shouldn't after this change though?

But I also didn't expect CI tests to be creating 110+ pods at once on a worker node (OK, a few of those are cluster system pods). That error suggests we may have started hitting the inotify limit more often because we're creating more pods; I have not seen that failure mode in these jobs before.
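One hypothetical way to check the per-container estimate on a node (a diagnostic sketch, not part of this PR) is to tally anonymous inotify file descriptors per process, since each one counts against the owning user's fs.inotify.max_user_instances:

```shell
# Tally inotify fds per process via /proc (Linux only).
for fd in /proc/[0-9]*/fd/*; do
  if [ "$(readlink "$fd" 2>/dev/null)" = "anon_inode:inotify" ]; then
    # Print the owning process's /proc directory once per inotify fd.
    echo "${fd%/fd/*}"
  fi
done | sort | uniq -c | sort -rn | head
```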

@BenTheElder
Member Author

Again no failures that time, and still no failures of the sig-cli last-line test on this PR.
/test pull-kubernetes-e2e-gce
/test pull-kubernetes-e2e-gce-cos

@aojea
Member

aojea commented Mar 22, 2025

🤨 did we add something that creates a LOT of pods?

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/130990/pull-kubernetes-e2e-gce/1903251761490038784

known issue, #124784 (comment)

there is a test that indeed creates a lot of pods and can cause issues,

there is also an open issue about it #124369

@aojea
Member

aojea commented Mar 22, 2025

/lgtm
/approve
/milestone v1.33

This addresses CI flakiness and there is no code change, only changes to the helper scripts that set up the test infrastructure.

@k8s-ci-robot k8s-ci-robot added this to the v1.33 milestone Mar 22, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 22, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 1365e9030dde742b9f61dd4777ce7c3fd497f500

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, BenTheElder

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@aojea
Member

aojea commented Mar 22, 2025

/test pull-kubernetes-e2e-gce

@BenTheElder
Member Author

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/130990/pull-kubernetes-e2e-gce/1903410506631221248

Flaky ssh tests

which reminds me ... we should probably sort out making those Feature:SSH or something, and not have a manual skip in kind etc.
/retest

@k8s-triage-robot

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does not have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

@k8s-ci-robot k8s-ci-robot merged commit 9a8c2a9 into kubernetes:master Mar 22, 2025
15 checks passed
@BenTheElder BenTheElder deleted the inotify-watches-kube-up branch March 24, 2025 15:45
@itayporezky

itayporezky commented Mar 26, 2025

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/130990/pull-kubernetes-e2e-gce/1903410506631221248

Flaky ssh tests

which reminds me ... we should probably sort out making those Feature:SSH or something, and not have a manual skip in kind etc.
/retest

Hey Ben,
I'm looking to contribute :)
After checking the test log and source code, it seems there is an SSH timeout of 20 seconds, defined here.

Can I increase it to 40 seconds and submit a PR, or were you thinking in a different direction?
I couldn't find the source of the flake; it seems the pod may not be ready within 20 seconds, so the timeout is reached.

@BenTheElder
Member Author

BenTheElder commented Mar 26, 2025

Can I increase it to 40 seconds and submit a PR, or were you thinking in a different direction?
I couldn't find the source of the flake; it seems the pod may not be ready within 20 seconds, so the timeout is reached.

I hadn't looked further yet.

We're generally avoiding depending on SSH in cluster e2e tests, though the node e2e is another story since it's primarily testing kubelet <> container runtime / OS.

For cluster e2es we've migrated them to a hostexec pod with kubectl.

This particular test case seems to be for testing the SSH functionality itself, 20 seconds sounds plenty long to start an SSH connection?

If we go to go.k8s.io/triage and search for results for "SSH should SSH to all nodes and run commands" as the test we can find more failures:
https://storage.googleapis.com/k8s-triage/index.html?pr=1&test=SSH%20should%20SSH%20to%20all%20nodes%20and%20run%20commands

In this particular case, if we look at the logs more:

  error dialing prow@34.67.245.27:22: 'ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain', retrying
  error dialing prow@34.67.245.27:22: 'ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain', retrying
  error dialing prow@34.67.245.27:22: 'ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain', retrying
  error dialing prow@34.67.245.27:22: 'ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain', retrying

That doesn't seem like an issue that would be resolved by increasing the timeout.

Seems more like the SSH key is taking too long to propagate to the VM, or we're using randomized keys, or some other issue with how the cluster is set up or how we set up the SSH client.
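A hypothetical first triage step (the file path is made up, the auth lines mirror the quoted log, and the timeout line is an invented contrast case) is to classify the dial errors, since auth failures and genuine timeouts call for different fixes:

```shell
# Sample log: the first two lines mirror the errors quoted above;
# the third is an invented timeout line for contrast.
cat > /tmp/ssh-dial-errors.log <<'EOF'
error dialing prow@34.67.245.27:22: 'ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain', retrying
error dialing prow@34.67.245.27:22: 'ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain', retrying
error dialing prow@34.67.245.27:22: 'dial tcp 34.67.245.27:22: i/o timeout', retrying
EOF

# Auth failures point at key propagation / client setup, not timeouts:
grep -c 'unable to authenticate' /tmp/ssh-dial-errors.log
# Only genuine timeouts would be helped by a longer deadline:
grep -c 'i/o timeout' /tmp/ssh-dial-errors.log
```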

@BenTheElder
Member Author

BenTheElder commented Mar 26, 2025

I would start by:

Thanks!

@BenTheElder
Member Author

There's also the question ... should we run this in the default PR blocking tests to begin with? It's the only test case directly mentioning SSH, which isn't a property of Kubernetes (but is used by some other tests possibly).

We might attempt to discover whether any of the other tests that run in pull-kubernetes-e2e-gce use SSH, and if not, consider skipping this test. That's a fair bit of work though; I haven't dug that far yet personally.

IF there are no other SSH-based tests, then we could reconsider self-skip vs tagging as a "Feature:SSH" #131038 (comment)

But even then, we probably want this to work reliably if any tests anywhere use it ... or fix the tests to not use it.


Development

Successfully merging this pull request may close these issues.

flake: [It] [sig-cli] Kubectl client Simple pod should contain last line of the log
