Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Volume mount fail after some time (again) #1353

Closed
chroche opened this issue May 28, 2024 · 2 comments
Closed

Volume mount fail after some time (again) #1353

chroche opened this issue May 28, 2024 · 2 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@chroche
Copy link

chroche commented May 28, 2024

/kind bug

What happened?

After ~15 days of normal activity, our Airflow EFS volumes started failing to mount with the following messages:

  Type     Reason       Age                  From               Message
  ----     ------       ----                 ----               -------
  Normal   Scheduled    6m52s                default-scheduler  Successfully assigned airflow-prod/weeklypipelineindependentspark-dd11d2f655894aac854f0d63faf2ad8a to ip-10-149-33-40.ec2.internal
  Warning  FailedMount  67s (x3 over 6m49s)  kubelet            MountVolume.SetUp failed for volume "airflow-logs" : rpc error: code = Internal desc = Could not mount "fs-00e74d7abdcb5d159:/" at "/var/lib/kubelet/pods/42b8a77a-b9f9-4429-b77b-a2f8e7d6f6f9/volumes/kubernetes.io~csi/airflow-logs/mount": mount failed: exit status 1
Mounting command: mount
Mounting arguments: -t efs -o accesspoint=fsap-0381d9a2be246f275,tls fs-00e74d7abdcb5d159:/ /var/lib/kubelet/pods/42b8a77a-b9f9-4429-b77b-a2f8e7d6f6f9/volumes/kubernetes.io~csi/airflow-logs/mount
Output: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf
  Warning  FailedMount  67s (x2 over 5m12s)  kubelet  MountVolume.SetUp failed for volume "airflow-dags" : rpc error: code = Internal desc = Could not mount "fs-00e74d7abdcb5d159:/" at "/var/lib/kubelet/pods/42b8a77a-b9f9-4429-b77b-a2f8e7d6f6f9/volumes/kubernetes.io~csi/airflow-dags/mount": mount failed: exit status 1
Mounting command: mount
Mounting arguments: -t efs -o accesspoint=fsap-0df96db406cc8681f,tls fs-00e74d7abdcb5d159:/ /var/lib/kubelet/pods/42b8a77a-b9f9-4429-b77b-a2f8e7d6f6f9/volumes/kubernetes.io~csi/airflow-dags/mount
Output: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf

The kubelet logs show this:

> journalctl -u kubelet
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: E0527 12:30:01.667906    3799 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/efs.csi.aws.com^fs-00e74d7abdcb5d159::fsap-0df96db406cc8681f podName: nodeName:}" failed. No retries permitted until 2024-05-27 12:30:02.167875293 +0000 UTC m=+1548768.136185610 (durationBeforeRetry 500ms). Error: MountVolume.SetUp failed for volume "airflow-dags" (UniqueName: "kubernetes.io/csi/efs.csi.aws.com^fs-00e74d7abdcb5d159::fsap-0df96db406cc8681f") pod "dailypipelinecreatedailycluste-620341ee9bf246dfb76f476b8819b2bf" (UID: "a76697cd-af91-4a2a-a594-39807642d8c3") : rpc error: code = Internal desc = Could not mount "fs-00e74d7abdcb5d159:/" at "/var/lib/kubelet/pods/a76697cd-af91-4a2a-a594-39807642d8c3/volumes/kubernetes.io~csi/airflow-dags/mount": mount failed: exit status 1
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: Mounting command: mount
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: Mounting arguments: -t efs -o accesspoint=fsap-0df96db406cc8681f,tls fs-00e74d7abdcb5d159:/ /var/lib/kubelet/pods/a76697cd-af91-4a2a-a594-39807642d8c3/volumes/kubernetes.io~csi/airflow-dags/mount
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: Output: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: E0527 12:30:01.668265    3799 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/efs.csi.aws.com^fs-00e74d7abdcb5d159::fsap-0381d9a2be246f275 podName: nodeName:}" failed. No retries permitted until 2024-05-27 12:30:02.168242255 +0000 UTC m=+1548768.136552560 (durationBeforeRetry 500ms). Error: MountVolume.SetUp failed for volume "airflow-logs" (UniqueName: "kubernetes.io/csi/efs.csi.aws.com^fs-00e74d7abdcb5d159::fsap-0381d9a2be246f275") pod "dailypipelinecreatedailycluste-620341ee9bf246dfb76f476b8819b2bf" (UID: "a76697cd-af91-4a2a-a594-39807642d8c3") : rpc error: code = Internal desc = Could not mount "fs-00e74d7abdcb5d159:/" at "/var/lib/kubelet/pods/a76697cd-af91-4a2a-a594-39807642d8c3/volumes/kubernetes.io~csi/airflow-logs/mount": mount failed: exit status 1
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: Mounting command: mount
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: Mounting arguments: -t efs -o accesspoint=fsap-0381d9a2be246f275,tls fs-00e74d7abdcb5d159:/ /var/lib/kubelet/pods/a76697cd-af91-4a2a-a594-39807642d8c3/volumes/kubernetes.io~csi/airflow-logs/mount
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: Output: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf

Then after a short time this additional error:

May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: E0527 12:30:35.887401    3799 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/efs.csi.aws.com^fs-00e74d7abdcb5d159::fsap-0df96db406cc8681f podName: nodeName:}" failed. No retries permitted until 2024-05-27 12:31:07.887378145 +0000 UTC m=+1548833.855688455 (durationBeforeRetry 32s). Error: MountVolume.SetUp failed for volume "airflow-dags" (UniqueName: "kubernetes.io/csi/efs.csi.aws.com^fs-00e74d7abdcb5d159::fsap-0df96db406cc8681f") pod "dailypipelinecreatedailycluste-620341ee9bf246dfb76f476b8819b2bf" (UID: "a76697cd-af91-4a2a-a594-39807642d8c3") : rpc error: code = Internal desc = Could not mount "fs-00e74d7abdcb5d159:/" at "/var/lib/kubelet/pods/a76697cd-af91-4a2a-a594-39807642d8c3/volumes/kubernetes.io~csi/airflow-dags/mount": mount failed: exit status 1
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Mounting command: mount
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Mounting arguments: -t efs -o accesspoint=fsap-0df96db406cc8681f,tls fs-00e74d7abdcb5d159:/ /var/lib/kubelet/pods/a76697cd-af91-4a2a-a594-39807642d8c3/volumes/kubernetes.io~csi/airflow-dags/mount
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Output: --- Logging error ---
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Traceback (most recent call last):
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/sbin/mount.efs", line 1115, in find_tls_port_in_range_and_get_bind_sock
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: sock.bind(("localhost", tls_port))
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: OSError: [Errno 98] Address already in use
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: During handling of the above exception, another exception occurred:
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Traceback (most recent call last):
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/lib/python3.9/logging/handlers.py", line 74, in emit
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: self.doRollover()
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/lib/python3.9/logging/handlers.py", line 173, in doRollover
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: os.rename(sfn, dfn)
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: FileNotFoundError: [Errno 2] No such file or directory: '/var/log/amazon/efs/mount.log.4' -> '/var/log/amazon/efs/mount.log.5'
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Call stack:
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/sbin/mount.efs", line 3933, in <module>
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: main()
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/sbin/mount.efs", line 3911, in main
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: mount_tls(
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/sbin/mount.efs", line 3043, in mount_tls
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: with bootstrap_tls(
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/lib/python3.9/contextlib.py", line 119, in __enter__
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: return next(self.gen)
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/sbin/mount.efs", line 1656, in bootstrap_tls
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: tls_port_sock = choose_tls_port_and_get_bind_sock(config, options, state_file_dir)
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/sbin/mount.efs", line 1078, in choose_tls_port_and_get_bind_sock
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: tls_port_sock = find_tls_port_in_range_and_get_bind_sock(
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/sbin/mount.efs", line 1118, in find_tls_port_in_range_and_get_bind_sock
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: logging.warning(e)
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Message: OSError(98, 'Address already in use')
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Arguments: ()
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf

Looking at the log files in /var/log/amazon/efs, they are chock full of this kind of messages:

> cat mount-watchdog.log.2
2024-05-28 10:06:28 UTC - ERROR - Unable to parse json in /var/run/efs/fs-00e74d7abdcb5d159.var.lib.kubelet.pods.1ea49582-822f-41d4-965f-14f27b31badc.volumes.kubernetes.io~csi.airflow-dags.mount.20155
Traceback (most recent call last):
  File "/usr/bin/amazon-efs-mount-watchdog", line 1155, in check_efs_mounts
    state = json.load(f)
  File "/usr/lib64/python3.7/json/__init__.py", line 296, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/usr/lib64/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python3.7/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 1117 (char 1116)

The mentioned JSON file (and many/most of its siblings) is invalid with extra characters "0}" at the end:

> cat /var/run/efs/fs-00e74d7abdcb5d159.var.lib.kubelet.pods.1ea49582-822f-41d4-965f-14f27b31badc.volumes.kubernetes.io~csi.airflow-dags.mount.20155
{"pid": 17200, "cmd": ["/usr/bin/stunnel5", "/var/run/efs/stunnel-config.fs-00e74d7abdcb5d159.var.lib.kubelet.pods.1ea49582-822f-41d4-965f-14f27b31badc.volumes.kubernetes.io~csi.airflow-dags.mount.20155"], "files": ["/var/run/efs/stunnel-config.fs-00e74d7abdcb5d159.var.lib.kubelet.pods.1ea49582-822f-41d4-965f-14f27b31badc.volumes.kubernetes.io~csi.airflow-dags.mount.20155"], "mount_time": 1715937339.0673409, "mountpoint": "/var/lib/kubelet/pods/1ea49582-822f-41d4-965f-14f27b31badc/volumes/kubernetes.io~csi/airflow-dags/mount", "accessPoint": "fsap-0df96db406cc8681f", "mountStateDir": "fs-00e74d7abdcb5d159.var.lib.kubelet.pods.1ea49582-822f-41d4-965f-14f27b31badc.volumes.kubernetes.io~csi.airflow-dags.mount.20155+", "commonName": "ip-10-149-33-40.els.vpc.local", "region": "us-east-1", "certificateCreationTime": "240517091539Z", "certificate": "/var/run/efs/fs-00e74d7abdcb5d159.var.lib.kubelet.pods.1ea49582-822f-41d4-965f-14f27b31badc.volumes.kubernetes.io~csi.airflow-dags.mount.20155+/certificate.pem", "privateKey": "/etc/amazon/efs/privateKey.pem", "fsId": "fs-00e74d7abdcb5d159", "unmount_count": 0}0}

What you expected to happen?

How to reproduce it (as minimally and precisely as possible)?

Anything else we need to know?:

The initial error seems to be the exact same as #103, with the available TCP ports being eaten up by unclosed stunnel5 connections:

> netstat -tnlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:20215         0.0.0.0:*               LISTEN      1128623/stunnel5
tcp        0      0 127.0.0.1:20087         0.0.0.0:*               LISTEN      927740/stunnel5
tcp        0      0 127.0.0.1:20055         0.0.0.0:*               LISTEN      926241/stunnel5
tcp        0      0 127.0.0.1:20279         0.0.0.0:*               LISTEN      2789644/stunnel5
tcp        0      0 127.0.0.1:20119         0.0.0.0:*               LISTEN      866695/stunnel5
tcp        0      0 127.0.0.1:20183         0.0.0.0:*               LISTEN      797232/stunnel5
tcp        0      0 127.0.0.1:20311         0.0.0.0:*               LISTEN      616031/stunnel5

It may be that the amazon-efs-mount-watchdog is prevented from doing its job by the failure to parse invalid JSON file?

Environment

  • Kubernetes version (use kubectl version): Server Version: v1.28.8-eks-adc7111
  • Driver version: v1.7.7 from Helm Chart v2.5.7

Please also attach debug logs to help us better diagnose

  • Instructions to gather debug logs can be found here
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label May 28, 2024
@mskanth972
Copy link
Contributor

mskanth972 commented May 31, 2024

Hi @chroche, the latest EFS CSI Driver version 2.0.4, which resolves this issue, is now available on GitHub. The Add-on version will be released soon. Please upgrade and test it if you are not using the Add-on.

@alexandrvb
Copy link

@mskanth972 we are encountering the same issue with a redundant '}' at the end of the EFS state file located in /var/run/efs/.... This occurs with addon versions v2.0.7-eksbuild.1 and v2.0.8-eksbuild.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

No branches or pull requests

5 participants