Skip to content

Don't exit the probe on connection issues#240

Merged
k8s-ci-robot merged 1 commit intokubernetes-csi:masterfrom
jsafrane:dont-crash
Jan 5, 2024
Merged

Don't exit the probe on connection issues#240
k8s-ci-robot merged 1 commit intokubernetes-csi:masterfrom
jsafrane:dont-crash

Conversation

@jsafrane
Copy link
Copy Markdown
Contributor

@jsafrane jsafrane commented Jan 4, 2024

What type of PR is this?
/kind bug

What this PR does / why we need it:
Do not exit the liveness probe process when it cannot connect to the CSI driver. The driver could be crashlooping, and we should not crashloop the liveness probe process too.

The process should only fail all probes to /healthz endpoint. Since the HTTP server is not running when connecting to the driver for the first time, "connection refused" must be a good enough failure.

This PR is heavily inspired / copied from #237

Which issue(s) this PR fixes:

Fixes #236

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Liveness probe process does not crash when it cannot access the associated CSI driver. It only fails all kubelet probes, most probably with "connection refused".

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 4, 2024
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 4, 2024
@jsafrane
Copy link
Copy Markdown
Contributor Author

jsafrane commented Jan 4, 2024

Tested with a CSI driver that's crashlooping. The liveness probe sidecar is still running and just restarts the driver container. "connection refused" is not great, but it fails the probe just fine:

  Warning  ProbeError  89s (x5 over 2m9s)    kubelet            Liveness probe error: Get "http://10.0.2.31:10300/healthz": dial tcp 10.0.2.31:10300: connect: connection refused
body:
  Warning  Unhealthy  89s (x5 over 2m9s)   kubelet  Liveness probe failed: Get "http://10.0.2.31:10300/healthz": dial tcp 10.0.2.31:10300: connect: connection refused
  Normal   Killing    89s                  kubelet  Container csi-driver failed liveness probe, will be restarted

Copy link
Copy Markdown

@ejweber ejweber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes more sense to simply remove acquireConnection altogether as I did in #237. It seems the main purpose of using it as a wrapper is to add timeout functionality (which connlib now gives us for free). I'm not sure if there's a practical reason someone would set probeTimeout > 30, but if they did, connlib.Connect would timeout before probeTimeout, causing confusion.

That being said, this looks like it fixes the issue identified in #236, so no major complaints if you go this route.

Do not exit the liveness probe process when the liveness probe cannot
connect to the CSI driver. The driver could be crashlooping, and we should
not crashloop the liveness probe process too. The process should only fail
all probes to /healthz endpoint. Since the HTTP server is not running when
connecting to the driver for the first time, "connection refused" must
be a good enough failure.
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 5, 2024
@jsafrane
Copy link
Copy Markdown
Contributor Author

jsafrane commented Jan 5, 2024

I basically copied #237 and updated it with 0 timeout, it looks much better now.

Copy link
Copy Markdown

@ejweber ejweber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@xing-yang
Copy link
Copy Markdown
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 5, 2024
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ejweber, jsafrane, xing-yang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 4ed8c89 into kubernetes-csi:master Jan 5, 2024
datadog-compute-robot pushed a commit to DataDog/kubernetes-csi-livenessprobe that referenced this pull request Jan 11, 2024
f8c8cc4 Merge pull request kubernetes-csi#237 from msau42/prow
b36b5bf Merge pull request kubernetes-csi#240 from dannawang0221/upgrade-go-version
adfddcc Merge pull request kubernetes-csi#243 from pohly/git-subtree-pull-fix
c465088 pull-test.sh: avoid "git subtree pull" error
7b175a1 Update csi-test version to v5.2.0
987c90c Update go version to 1.21 to match k/k
2c625d4 Add script to generate patch release notes

git-subtree-dir: release-tools
git-subtree-split: f8c8cc4
rhrmo pushed a commit to rhrmo/livenessprobe that referenced this pull request Jan 13, 2026
…ersion

Update go version to 1.21 to match k/k
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

New csi-lib-utils/connection.Connect logic can cause permanent CSI plugin outage

4 participants