You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
After ~15 days of normal activity, our Airflow EFS volumes started failing to mount with the following messages:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 6m52s default-scheduler Successfully assigned airflow-prod/weeklypipelineindependentspark-dd11d2f655894aac854f0d63faf2ad8a to ip-10-149-33-40.ec2.internal
Warning FailedMount 67s (x3 over 6m49s) kubelet MountVolume.SetUp failed for volume "airflow-logs" : rpc error: code = Internal desc = Could not mount "fs-00e74d7abdcb5d159:/" at "/var/lib/kubelet/pods/42b8a77a-b9f9-4429-b77b-a2f8e7d6f6f9/volumes/kubernetes.io~csi/airflow-logs/mount": mount failed: exit status 1
Mounting command: mount
Mounting arguments: -t efs -o accesspoint=fsap-0381d9a2be246f275,tls fs-00e74d7abdcb5d159:/ /var/lib/kubelet/pods/42b8a77a-b9f9-4429-b77b-a2f8e7d6f6f9/volumes/kubernetes.io~csi/airflow-logs/mount
Output: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf
Warning FailedMount 67s (x2 over 5m12s) kubelet MountVolume.SetUp failed for volume "airflow-dags" : rpc error: code = Internal desc = Could not mount "fs-00e74d7abdcb5d159:/" at "/var/lib/kubelet/pods/42b8a77a-b9f9-4429-b77b-a2f8e7d6f6f9/volumes/kubernetes.io~csi/airflow-dags/mount": mount failed: exit status 1
Mounting command: mount
Mounting arguments: -t efs -o accesspoint=fsap-0df96db406cc8681f,tls fs-00e74d7abdcb5d159:/ /var/lib/kubelet/pods/42b8a77a-b9f9-4429-b77b-a2f8e7d6f6f9/volumes/kubernetes.io~csi/airflow-dags/mount
Output: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf
The kubelet logs show this:
> journalctl -u kubelet
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: E0527 12:30:01.667906 3799 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/efs.csi.aws.com^fs-00e74d7abdcb5d159::fsap-0df96db406cc8681f podName: nodeName:}" failed. No retries permitted until 2024-05-27 12:30:02.167875293 +0000 UTC m=+1548768.136185610 (durationBeforeRetry 500ms). Error: MountVolume.SetUp failed for volume "airflow-dags" (UniqueName: "kubernetes.io/csi/efs.csi.aws.com^fs-00e74d7abdcb5d159::fsap-0df96db406cc8681f") pod "dailypipelinecreatedailycluste-620341ee9bf246dfb76f476b8819b2bf" (UID: "a76697cd-af91-4a2a-a594-39807642d8c3") : rpc error: code = Internal desc = Could not mount "fs-00e74d7abdcb5d159:/" at "/var/lib/kubelet/pods/a76697cd-af91-4a2a-a594-39807642d8c3/volumes/kubernetes.io~csi/airflow-dags/mount": mount failed: exit status 1
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: Mounting command: mount
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: Mounting arguments: -t efs -o accesspoint=fsap-0df96db406cc8681f,tls fs-00e74d7abdcb5d159:/ /var/lib/kubelet/pods/a76697cd-af91-4a2a-a594-39807642d8c3/volumes/kubernetes.io~csi/airflow-dags/mount
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: Output: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: E0527 12:30:01.668265 3799 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/efs.csi.aws.com^fs-00e74d7abdcb5d159::fsap-0381d9a2be246f275 podName: nodeName:}" failed. No retries permitted until 2024-05-27 12:30:02.168242255 +0000 UTC m=+1548768.136552560 (durationBeforeRetry 500ms). Error: MountVolume.SetUp failed for volume "airflow-logs" (UniqueName: "kubernetes.io/csi/efs.csi.aws.com^fs-00e74d7abdcb5d159::fsap-0381d9a2be246f275") pod "dailypipelinecreatedailycluste-620341ee9bf246dfb76f476b8819b2bf" (UID: "a76697cd-af91-4a2a-a594-39807642d8c3") : rpc error: code = Internal desc = Could not mount "fs-00e74d7abdcb5d159:/" at "/var/lib/kubelet/pods/a76697cd-af91-4a2a-a594-39807642d8c3/volumes/kubernetes.io~csi/airflow-logs/mount": mount failed: exit status 1
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: Mounting command: mount
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: Mounting arguments: -t efs -o accesspoint=fsap-0381d9a2be246f275,tls fs-00e74d7abdcb5d159:/ /var/lib/kubelet/pods/a76697cd-af91-4a2a-a594-39807642d8c3/volumes/kubernetes.io~csi/airflow-logs/mount
May 27 12:30:01 ip-10-149-33-40.els.vpc.local kubelet[3799]: Output: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf
Then after a short time this additional error:
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: E0527 12:30:35.887401 3799 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/efs.csi.aws.com^fs-00e74d7abdcb5d159::fsap-0df96db406cc8681f podName: nodeName:}" failed. No retries permitted until 2024-05-27 12:31:07.887378145 +0000 UTC m=+1548833.855688455 (durationBeforeRetry 32s). Error: MountVolume.SetUp failed for volume "airflow-dags" (UniqueName: "kubernetes.io/csi/efs.csi.aws.com^fs-00e74d7abdcb5d159::fsap-0df96db406cc8681f") pod "dailypipelinecreatedailycluste-620341ee9bf246dfb76f476b8819b2bf" (UID: "a76697cd-af91-4a2a-a594-39807642d8c3") : rpc error: code = Internal desc = Could not mount "fs-00e74d7abdcb5d159:/" at "/var/lib/kubelet/pods/a76697cd-af91-4a2a-a594-39807642d8c3/volumes/kubernetes.io~csi/airflow-dags/mount": mount failed: exit status 1
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Mounting command: mount
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Mounting arguments: -t efs -o accesspoint=fsap-0df96db406cc8681f,tls fs-00e74d7abdcb5d159:/ /var/lib/kubelet/pods/a76697cd-af91-4a2a-a594-39807642d8c3/volumes/kubernetes.io~csi/airflow-dags/mount
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Output: --- Logging error ---
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Traceback (most recent call last):
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/sbin/mount.efs", line 1115, in find_tls_port_in_range_and_get_bind_sock
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: sock.bind(("localhost", tls_port))
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: OSError: [Errno 98] Address already in use
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: During handling of the above exception, another exception occurred:
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Traceback (most recent call last):
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/lib/python3.9/logging/handlers.py", line 74, in emit
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: self.doRollover()
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/lib/python3.9/logging/handlers.py", line 173, in doRollover
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: os.rename(sfn, dfn)
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: FileNotFoundError: [Errno 2] No such file or directory: '/var/log/amazon/efs/mount.log.4' -> '/var/log/amazon/efs/mount.log.5'
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Call stack:
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/sbin/mount.efs", line 3933, in <module>
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: main()
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/sbin/mount.efs", line 3911, in main
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: mount_tls(
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/sbin/mount.efs", line 3043, in mount_tls
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: with bootstrap_tls(
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/lib/python3.9/contextlib.py", line 119, in __enter__
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: return next(self.gen)
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/sbin/mount.efs", line 1656, in bootstrap_tls
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: tls_port_sock = choose_tls_port_and_get_bind_sock(config, options, state_file_dir)
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/sbin/mount.efs", line 1078, in choose_tls_port_and_get_bind_sock
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: tls_port_sock = find_tls_port_in_range_and_get_bind_sock(
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: File "/sbin/mount.efs", line 1118, in find_tls_port_in_range_and_get_bind_sock
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: logging.warning(e)
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Message: OSError(98, 'Address already in use')
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Arguments: ()
May 27 12:30:35 ip-10-149-33-40.els.vpc.local kubelet[3799]: Failed to locate an available port in the range [20049, 20449], try specifying a different port range in /etc/amazon/efs/efs-utils.conf
Looking at the log files in /var/log/amazon/efs, they are chock full of this kind of messages:
> cat mount-watchdog.log.2
2024-05-28 10:06:28 UTC - ERROR - Unable to parse json in /var/run/efs/fs-00e74d7abdcb5d159.var.lib.kubelet.pods.1ea49582-822f-41d4-965f-14f27b31badc.volumes.kubernetes.io~csi.airflow-dags.mount.20155
Traceback (most recent call last):
File "/usr/bin/amazon-efs-mount-watchdog", line 1155, in check_efs_mounts
state = json.load(f)
File "/usr/lib64/python3.7/json/__init__.py", line 296, in load
parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
File "/usr/lib64/python3.7/json/__init__.py", line 348, in loads
return _default_decoder.decode(s)
File "/usr/lib64/python3.7/json/decoder.py", line 340, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 1117 (char 1116)
The mentioned JSON file (and many/most of its siblings) is invalid with extra characters "0}" at the end:
Hi @chroche, the latest EFS CSI Driver version 2.0.4, which resolves this issue, is now available on GitHub. The Add-on version will be released soon. Please upgrade and test it if you are not using the Add-on.
@mskanth972 we are encountering the same issue with a redundant '}' at the end of the EFS state file located in /var/run/efs/.... This occurs with addon versions v2.0.7-eksbuild.1 and v2.0.8-eksbuild.1
/kind bug
What happened?
After ~15 days of normal activity, our Airflow EFS volumes started failing to mount with the following messages:
The
kubelet
logs show this:Then after a short time this additional error:
Looking at the log files in
/var/log/amazon/efs
, they are chock full of this kind of messages:The mentioned JSON file (and many/most of its siblings) is invalid with extra characters "
0}
" at the end:What you expected to happen?
How to reproduce it (as minimally and precisely as possible)?
Anything else we need to know?:
The initial error seems to be the exact same as #103, with the available TCP ports being eaten up by unclosed
stunnel5
connections:It may be that the
amazon-efs-mount-watchdog
is prevented from doing its job by the failure to parse invalid JSON file?Environment
kubectl version
):Server Version: v1.28.8-eks-adc7111
v1.7.7
from Helm Chartv2.5.7
Please also attach debug logs to help us better diagnose
The text was updated successfully, but these errors were encountered: