Csi-attacher Looses Connection to Driver Unix Socket #1875

Closed
dmitrii-didenko opened this issue Dec 15, 2023 · 16 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@dmitrii-didenko

/kind bug

What happened?
We are observing an issue where the csi-attacher sidecar container exits frequently with the following messages:

1 connection.go:142] Lost connection to unix:///var/lib/csi/sockets/pluginproxy/csi.sock.
1 connection.go:97] Lost connection to CSI driver, exiting
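
When sidecar logs have been collected as plain text, a small filter can help pull out these exit messages. The following is only a sketch: the function name `find_lost_connection_lines` is hypothetical, and the pattern is based solely on the two log lines quoted above.

```python
import re

# Pattern built from the two csi-attacher messages quoted in this issue.
LOST_CONNECTION = re.compile(r"Lost connection to (unix://\S+|CSI driver)")


def find_lost_connection_lines(log_text: str) -> list[str]:
    """Return the log lines that report a lost connection to the driver."""
    return [
        line.strip()
        for line in log_text.splitlines()
        if LOST_CONNECTION.search(line)
    ]
```

Running it over the saved logs of each sidecar container would show whether the attacher, provisioner, and resizer all exit with the same message.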

Based on the change in kubernetes-csi/external-attacher#123, I assume the ebs-plugin container really does close the socket somehow, which causes csi-attacher to restart. However, we see no suspicious logs from ebs-plugin at the time of the event:

csi_handler.go:282] Detaching "csi-115cdec4b0fe42eb07f43b0af068d2b71c0e203992aa182fed625bea1ec0742e"
cloud.go:862] "Waiting for volume state" volumeID="vol-0290170d55c6d53b8" actual="detaching" desired="detached"
1 cloud.go:862] "Waiting for volume state" volumeID="vol-0290170d55c6d53b8" actual="detaching" desired="detached"
1 controller.go:465] "ControllerUnpublishVolume: detaching" volumeID="vol-0290170d55c6d53b8" nodeID="i-041ccc637fb15b3ec"

What you expected to happen?
csi-attacher should not exit and should stay connected to the driver socket.

How to reproduce it (as minimally and precisely as possible)?
Unfortunately, we do not have exact steps to reproduce this.

Anything else we need to know?:
We noticed one thing: this happens only on clusters with frequent resizing events (caused by the cluster autoscaler or spot instance changes).

Do we have any options we can enable to collect more information on this?

Environment

  • Kubernetes version v1.28.4-eks-8cb36c9
  • Driver version: v1.25.0
  • Bottlerocket nodes: v1.16.1
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 15, 2023
@AndrewSirenko
Contributor

AndrewSirenko commented Dec 15, 2023

Update: jsafrane@ had already filed an issue for this on the external-provisioner: "The provisioner exits after 30 minutes of idle" (kubernetes-csi/external-provisioner, Issue #1099)

(and even merged a fix into csi-lib-utils ❤️)

Once the new sidecars that include the fix are released, we can include them in a release to aws-ebs-csi-driver.


TL;DR: the latest versions of csi-attacher (v4.4.2), csi-provisioner (v3.6.2), and csi-resizer (v1.9.2) will restart due to a lost connection to the driver Unix socket when the sidecar has not been in use for ~40 minutes.

If you deploy the driver via Helm and replace the version tag for each sidecar with an older version, you should not see this issue.

You can follow along on Kubernetes Slack in the csi channel in this thread: https://kubernetes.slack.com/archives/C8EJ01Z46/p1702657617011189


Hi @dmitrii-didenko, I was able to isolate this bug to v4.4.2 of the csi-attacher (as well as the latest csi-provisioner and csi-resizer versions). I will clean up and post my logs by tomorrow.

I have tried:

  • v1.25 of driver with latest sidecar images -> Restart bug
  • v1.25 of driver with previous patch version of sidecar images (v4.4.1 of csi-attacher) -> No restart bug
  • v1.23 of driver with latest sidecar images -> Restart bug
  • v1.23 of driver with older version of sidecar images bundled with it (v4.4.0 of csi-attacher) -> No restart bug

Reproduction steps:

  1. Deploy any version of the aws-ebs-csi-driver with the latest version of the csi-attacher, csi-provisioner, or csi-resizer sidecar images.
  2. Wait ~45 minutes
  3. Dynamically provision or resize a volume
  4. Note that the csi-attacher, csi-provisioner, and csi-resizer containers will restart due to a lost connection to the driver socket

@AndrewSirenko
Contributor

The latest patch version of csi-attacher, csi-provisioner, and csi-resizer only had dependency upgrades. This suggests that one of those upgrades might have caused this regression.

We are almost finished releasing v1.26.0 of aws-ebs-csi-driver for Helm, which includes dependency upgrades. I will check whether this restart bug is still occurring there before filing an issue on the external-attacher project.

Thank you for raising this issue @dmitrii-didenko!

@AndrewSirenko
Contributor

AndrewSirenko commented Dec 15, 2023

Started adding logs here: AndrewSirenko/csi-sidecar-container-restart-issue-logs

You will find that with csi-attacher <= 4.4.1, the csi-attacher no longer restarts, but the csi-provisioner sidecar still does.

Let me know if there is a more helpful way to present the information. I am trying to reproduce on driver v1.26.0 now.

Edit: Can confirm this still occurs on driver v1.26.0. See Proof of restart issue for driver v1.26 and latest sidecars

@dmitrii-didenko
Author

Thank you very much for the update! So if I understood you correctly, the only workaround is to configure the following?

  • v1.23 of driver with older version of sidecar images bundled with it (v4.4.0 of csi-attacher)

@AndrewSirenko
Contributor

@dmitrii-didenko the workarounds are to either:

  • Use aws-ebs-csi-driver v1.24.0 or lower (which does not include the latest versions of the sidecar images)
  • Use a newer version of the aws-ebs-csi-driver, but explicitly replace the version tag for each sidecar in values.yaml with a version other than csi-attacher v4.4.2, csi-provisioner v3.6.2, or csi-resizer v1.9.2.
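
To check a deployment against the affected versions named above, a small helper along these lines may be useful. It is a sketch only: the function name, the dictionary keys, and the handling of a build suffix after the version are assumptions for illustration, not part of the driver or the Helm chart.

```python
# Sidecar versions reported in this thread as affected by the restart bug.
AFFECTED = {
    "csi-attacher": "v4.4.2",
    "csi-provisioner": "v3.6.2",
    "csi-resizer": "v1.9.2",
}


def affected_sidecars(image_tags: dict[str, str]) -> list[str]:
    """Return the names of sidecars whose tag matches an affected version.

    Assumption: a tag may carry a build suffix after the version
    (for example "v4.4.2-<suffix>"), so only the part before the
    first "-" is compared.
    """
    hits = []
    for name, tag in image_tags.items():
        version = tag.split("-", 1)[0]
        if AFFECTED.get(name) == version:
            hits.append(name)
    return sorted(hits)
```

Any sidecar the helper returns is a candidate for pinning to a different tag in values.yaml.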

@AndrewSirenko
Contributor

Update: jsafrane@ had already filed an issue for this on the external-provisioner: "The provisioner exits after 30 minutes of idle" (kubernetes-csi/external-provisioner, Issue #1099)

(and even merged a fix into csi-lib-utils).

Once the new sidecars that include the fix are released, we can include them in a release to aws-ebs-csi-driver.

dstewen added a commit to dstewen/flux-cluster that referenced this issue Dec 27, 2023
@AndrewSirenko
Contributor

Sig-storage has released new versions of the sidecars. We will include them in our next patch release.

@faganihajizada

Sig-storage has released new versions of the sidecars. We will include them in our next patch release.

@AndrewSirenko Thanks! Do you have an ETA for the next patch release?

@AndrewSirenko
Contributor

AndrewSirenko commented Jan 3, 2024

Update for Jan 4 evening: EKS-D images are still not out due to an internal blocker. I will submit the release PRs as soon as they are available (merging the release PRs takes some CI time).


Update for Jan 4 morning, @faganihajizada: EKS-D images were still not out as of last night, but hopefully they will be released today and we can push our Helm release.


Hi @faganihajizada, we're waiting on EKS Distro to release the patched versions of the sidecars during their bi-weekly release. Their ETA is today. We will start the release process as soon as those sidecar images are released.

The ETA for the Helm patch release is most likely tomorrow (Jan 4). We will also start the EKS add-on release process today, but our team does not control when that will be released (typically a few business days after the Helm release).

@AndrewSirenko
Contributor

The new EKS-D csi sidecar images were released 24 minutes ago. We are starting our Helm release now.

@AndrewSirenko
Contributor

Helm release is out on our release-1.26 branch.

The add-on release will likely* come out sometime in the latter half of next week. @faganihajizada

@faganihajizada

faganihajizada commented Jan 8, 2024

Helm release is out on our release-1.26 branch.

The add-on release will likely* come out sometime in the latter half of next week. @faganihajizada

Thank you @AndrewSirenko 🤝

An engineer in the team tried it, and we got:

chart "aws-ebs-csi-driver" version "2.26.1" not found in https://kubernetes-sigs.github.io/aws-ebs-csi-driver

Do you have an ETA for updating https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/charts/aws-ebs-csi-driver/Chart.yaml in master?

@AndrewSirenko
Contributor

Done, @faganihajizada. We were waiting on the team to confirm there wouldn't be merge conflicts.

/close

@k8s-ci-robot
Contributor

@AndrewSirenko: Closing this issue.

In response to this:

Done, @faganihajizada. We were waiting on the team to confirm there wouldn't be merge conflicts.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@AndrewSirenko
Contributor

Thank you everybody!

@AndrewSirenko
Contributor

One last update: the v1.26.1 add-on is released in all regions. Thank you!

> aws eks describe-addon-versions --addon-name aws-ebs-csi-driver --region us-east-1
{
    "addons": [
        {
            "addonName": "aws-ebs-csi-driver",
            "type": "storage",
            "addonVersions": [
                {
                    "addonVersion": "v1.26.1-eksbuild.1",
...
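
To check programmatically whether a given add-on version has reached a region, the JSON printed by the command above can be parsed. This is a sketch that relies only on the fields visible in the truncated output (`addons`, `addonName`, `addonVersions`, `addonVersion`); the function name `addon_versions` is hypothetical.

```python
import json


def addon_versions(describe_output: str,
                   addon_name: str = "aws-ebs-csi-driver") -> list[str]:
    """List the addonVersion strings for one add-on from the CLI's JSON output."""
    data = json.loads(describe_output)
    versions = []
    for addon in data.get("addons", []):
        if addon.get("addonName") == addon_name:
            versions.extend(
                entry["addonVersion"]
                for entry in addon.get("addonVersions", [])
            )
    return versions
```

For example, testing `"v1.26.1-eksbuild.1" in addon_versions(output)` for each region's output would confirm the rollout.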
