
Transport endpoint disconnection errors during borg backup on GlusterFS Snapshot following upgrade to 11.1 #4377

Open
ironhalik opened this issue Jun 11, 2024 · 2 comments

Comments

@ironhalik

Description of problem:
Scenario:
~3 TB and ~10 million files on a GlusterFS volume hosted on three nodes (SSDs, gigabit Ethernet, 4 cores, 16 GB RAM each), on thinly provisioned LVM.
I create and activate a snapshot of the volume:

$ gluster snapshot create shapi shapi no-timestamp
$ gluster snapshot activate shapi

mount the snapshot on one of the nodes:

$ mount -t glusterfs "localhost:/snaps/shapi/shapi" "/mnt/snapshot/shapi"

and finally run a borg backup on the snapshot:

$ borg create --info --stats --compression lz4 --upload-ratelimit 10000 "::${datetime}" /mnt/snapshot/
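
Condensed, the reproduction amounts to the sketch below. The snapshot name, mount point, and borg invocation are the ones from above; the cleanup step and the use of --mode=script (to suppress the CLI confirmation prompts) are illustrative assumptions, not part of the actual setup.

#!/usr/bin/env bash
# Sketch of the reproduction above; names are taken from this report.
# Assumes BORG_REPO is exported for the "::archive" shorthand.
set -euo pipefail

snap="shapi"
mnt="/mnt/snapshot/${snap}"
datetime="$(date +%y%m%d-%H%M)"

# 1. Create and activate an LVM-backed snapshot of the volume
gluster snapshot create "${snap}" "${snap}" no-timestamp
gluster snapshot activate "${snap}"

# 2. Mount the activated snapshot through the FUSE client on one node
mkdir -p "${mnt}"
mount -t glusterfs "localhost:/snaps/${snap}/${snap}" "${mnt}"

# 3. Read-heavy workload: borg walks and stats ~10 million files
borg create --info --stats --compression lz4 --upload-ratelimit 10000 \
    "::${datetime}" /mnt/snapshot/

# 4. Cleanup (illustrative; deactivate/delete normally prompt for confirmation)
umount "${mnt}"
gluster --mode=script snapshot deactivate "${snap}"
gluster --mode=script snapshot delete "${snap}"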

The full output of the command that failed:

2024-06-11 12:32:17 - Sending backup via borg
Creating archive at "ssh://<host>/data/backups/cfn-storage01::240611-1232"
/mnt/snapshot/shapi/media/preflight_thumb/2022/10/16/b37e1018-64b1-4011-be31-9c162c519244/234434-202943-9-54-59-thumb.jpg: stat: [Errno 107] Transport endpoint is not connected: '234434-202943-9-54-59-thumb.jpg'
/mnt/snapshot/shapi/media/preflight_thumb/2022/10/16/cef5bbc0-915e-4fd6-8f5e-550ae7a3be35/352919-202940-9-45-57-thumb.jpg: stat: [Errno 107] Transport endpoint is not connected: '352919-202940-9-45-57-thumb.jpg'
/mnt/snapshot/shapi/media/preflight_thumb/2022/10/16/a97d603c-273e-4e5e-8e93-56e922c07f5f/998265-202936-9-46-55-thumb.jpg: stat: [Errno 107] Transport endpoint is not connected: '998265-202936-9-46-55-thumb.jpg'
[...]
/mnt/snapshot/shapi/media/media_firm: stat: [Errno 107] Transport endpoint is not connected: 'media'
/mnt/snapshot/shapi/media/tmp_addresses_upload: stat: [Errno 107] Transport endpoint is not connected: 'tmp_addresses_upload'
------------------------------------------------------------------------------
Repository: ssh://<host>/data/backups/cfn-storage01
Archive name: 240611-1232
Archive fingerprint: a6af08a14635c68591cdb1d8356f738a103875ba229c618e7a65c2f209d6de9b
Time (start): Tue, 2024-06-11 12:32:20
Time (end):   Tue, 2024-06-11 12:44:39
Duration: 12 minutes 19.69 seconds
Number of files: 34748
Utilization of max. archive size: 0%
------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:                6.00 GB              2.61 GB              1.80 MB
All archives:               55.13 TB             49.53 TB              2.71 TB

                       Unique chunks         Total chunks
Chunk index:                 5775603            121522768
------------------------------------------------------------------------------

During creation of the backup (an intense read operation), I can see the following in the logs:

# mnt-snapshot-shapi.log
[...]
[2024-06-11 10:40:25.241179 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-shapi-client-3: remote operation failed. [{path=/media/preflight_thumb/2022/10/14/964b4d7c-a207-42cb-8fcf-40cf645c0c66/366668-201310-13-33-28-thumb.jpg}, {gfid=bae03e89-0a85-4af1-987e-77d64153ba42}, {errno=2}, {error=No such file or directory}]
[2024-06-11 10:40:25.247712 +0000] W [MSGID: 114061] [client-common.c:530:client_pre_flush_v2] 0-shapi-client-3: remote_fd is -1. EBADFD [{gfid=bae03e89-0a85-4af1-987e-77d64153ba42}, {errno=77}, {error=File descriptor in bad state}]
[2024-06-11 10:40:28.847614 +0000] I [MSGID: 114018] [client.c:2242:client_rpc_notify] 0-shapi-client-1: disconnected from client, process will keep trying to connect glusterd until brick's port is available [{conn-name=shapi-client-1}]
[2024-06-11 10:40:28.847984 +0000] E [rpc-clnt.c:313:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x1db)[0x773bf525741b] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x751f)[0x773bf51f551f] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x11d)[0x773bf51fc8bd] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0xf6c0)[0x773bf51fd6c0] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x27)[0x773bf51f8767] ))))) 0-shapi-client-1: forced unwinding frame type(GlusterFS 4.x v1) op(OPENDIR(20)) called at 2024-06-11 10:40:27 +0000 (xid=0x320b8)
[2024-06-11 10:40:28.848004 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:2470:client4_0_opendir_cbk] 0-shapi-client-1: remote operation failed. [{path=/media/preflight_thumb/2022/10/14/c5c3e629-e040-4b86-9a38-f4d1f92fffc9}, {gfid=88550deb-8eb1-4681-b09c-afce7a74e406}, {errno=107}, {error=Transport endpoint is not connected}]
[2024-06-11 10:40:31.852180 +0000] E [MSGID: 114058] [client-handshake.c:946:client_query_portmap_cbk] 0-shapi-client-1: failed to get the port number for remote subvolume. Please run gluster volume status on server to see if brick process is running []
[2024-06-11 10:40:31.852220 +0000] I [MSGID: 114018] [client.c:2242:client_rpc_notify] 0-shapi-client-1: disconnected from client, process will keep trying to connect glusterd until brick's port is available [{conn-name=shapi-client-1}]
[2024-06-11 10:40:33.790369 +0000] W [MSGID: 108027] [afr-common.c:2889:afr_attempt_readsubvol_set] 0-69d767ccba4f4f9bb1dccc3282f8942b-replicate-0: no read subvols for /media/preflight_thumb/2022/10/16/b37e1018-64b1-4011-be31-9c162c519244/234434-202943-9-54-59-thumb.jpg
[2024-06-11 10:40:33.790392 +0000] W [fuse-bridge.c:1052:fuse_entry_cbk] 0-glusterfs-fuse: 417019: LOOKUP() /media/preflight_thumb/2022/10/16/b37e1018-64b1-4011-be31-9c162c519244/234434-202943-9-54-59-thumb.jpg => -1 (Transport endpoint is not connected)
[2024-06-11 10:40:33.824006 +0000] W [MSGID: 108027] [afr-common.c:2889:afr_attempt_readsubvol_set] 0-69d767ccba4f4f9bb1dccc3282f8942b-replicate-0: no read subvols for /media/preflight_thumb/2022/10/16/cef5bbc0-915e-4fd6-8f5e-550ae7a3be35/352919-202940-9-45-57-thumb.jpg
[2024-06-11 10:40:33.824022 +0000] W [fuse-bridge.c:1052:fuse_entry_cbk] 0-glusterfs-fuse: 417089: LOOKUP() /media/preflight_thumb/2022/10/16/cef5bbc0-915e-4fd6-8f5e-550ae7a3be35/352919-202940-9-45-57-thumb.jpg => -1 (Transport endpoint is not connected)
[...]
# bricks/run-gluster-snaps-<id>-brick3-gluster-internal-shapi.log
[...]
[2024-06-11 10:32:15.040076 +0000] I [MSGID: 115029] [server-handshake.c:645:server_setvolume] 0-69d767ccba4f4f9bb1dccc3282f8942b-server: accepted client from CTX_ID:d1788551-5718-4510-86c1-b23fcd70ba58-GRAPH_ID:0-PID:538971-HOST:cfn-storage01-PC_NAME:shapi-client-3-RECON_NO:-0 (version: 11.1) with subvol /run/gluster/snaps/69d767ccba4f4f9bb1dccc3282f8942b/brick3/gluster-internal/shapi
[2024-06-11 10:32:15.055272 +0000] I [rpcsvc.c:849:rpcsvc_handle_rpc_call] 0-rpc-service: spawned a request handler thread for queue 0
[2024-06-11 10:32:41.129354 +0000] I [rpcsvc.c:849:rpcsvc_handle_rpc_call] 0-rpc-service: spawned a request handler thread for queue 1

[backup happening]

[2024-06-11 10:45:22.850187 +0000] W [socket.c:751:__socket_rwv] 0-tcp.69d767ccba4f4f9bb1dccc3282f8942b-server: readv on 127.0.0.1:49138 failed (No data available)
[2024-06-11 10:45:22.850406 +0000] I [MSGID: 115036] [server.c:494:server_rpc_notify] 0-69d767ccba4f4f9bb1dccc3282f8942b-server: disconnecting connection [{client-uid=CTX_ID:d1788551-5718-4510-86c1-b23fcd70ba58-GRAPH_ID:0-PID:538971-HOST:cfn-storage01-PC_NAME:shapi-client-3-RECON_NO:-0}]
[2024-06-11 10:45:22.850660 +0000] I [MSGID: 101054] [client_t.c:374:gf_client_unref] 0-69d767ccba4f4f9bb1dccc3282f8942b-server: Shutting down connection CTX_ID:d1788551-5718-4510-86c1-b23fcd70ba58-GRAPH_ID:0-PID:538971-HOST:cfn-storage01-PC_NAME:shapi-client-3-RECON_NO:-0
[2024-06-11 10:45:24.041267 +0000] W [glusterfsd.c:1427:cleanup_and_exit] (-->/lib/x86_64-linux-gnu/libc.so.6(+0x9ca94) [0x7b28bfc9ca94] -->/usr/sbin/glusterfsd(+0x13805) [0x596b8e039805] -->/usr/sbin/glusterfsd(cleanup_and_exit+0x67) [0x596b8e031987] ) 0-: received signum (15), shutting down
# glusterd.log
[2024-06-11 12:51:05.954732 +0000] I [rpc-clnt.c:972:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2024-06-11 12:51:06.277648 +0000] I [MSGID: 106496] [glusterd-handshake.c:922:__server_getspec] 0-management: Received mount request for volume /snaps/shapi/shapi
[2024-06-11 12:51:06.303140 +0000] I [MSGID: 106499] [glusterd-handler.c:4535:__glusterd_handle_status_volume] 0-management: Received status volume req for volume shapi
[2024-06-11 12:51:07.822320 +0000] I [glusterd-utils.c:6446:glusterd_brick_start] 0-management: starting a fresh brick process for brick /run/gluster/snaps/8ef9e88c97a7482d8c2417ac578cfa89/brick3/gluster-internal/shops
[2024-06-11 12:51:07.368341 +0000] I [MSGID: 106499] [glusterd-handler.c:4535:__glusterd_handle_status_volume] 0-management: Received status volume req for volume shapi
[2024-06-11 12:51:07.829305 +0000] I [MSGID: 106496] [glusterd-handshake.c:922:__server_getspec] 0-management: Received mount request for volume /snaps/shops/8ef9e88c97a7482d8c2417ac578cfa89/8ef9e88c97a7482d8c2417ac578cfa89.cfn-storage01.run-gluster-snaps-8ef9e88c97a7482d8c2417ac578cfa89-brick3-gluster-internal-shops
[2024-06-11 12:51:07.829335 +0000] W [MSGID: 106048] [glusterd-handshake.c:114:get_snap_volname_and_volinfo] 0-management: failed to get the volinfo for the volume 8ef9e88c97a7482d8c2417ac578cfa89
[2024-06-11 12:51:07.864328 +0000] I [rpc-clnt.c:972:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2024-06-11 12:51:08.706969 +0000] I [MSGID: 106499] [glusterd-handler.c:4535:__glusterd_handle_status_volume] 0-management: Received status volume req for volume shapi
The message "I [MSGID: 106499] [glusterd-handler.c:4535:__glusterd_handle_status_volume] 0-management: Received status volume req for volume shapi" repeated 88 times between [2024-06-11 12:51:08.706969 +0000] and [2024-06-11 12:52:42.568572 +0000]
[2024-06-11 12:52:43.631935 +0000] I [MSGID: 106499] [glusterd-handler.c:4535:__glusterd_handle_status_volume] 0-management: Received status volume req for volume shapi
The message "I [MSGID: 106499] [glusterd-handler.c:4535:__glusterd_handle_status_volume] 0-management: Received status volume req for volume shapi" repeated 112 times between [2024-06-11 12:52:43.631935 +0000] and [2024-06-11 12:54:43.055195 +0000]
[2024-06-11 12:54:44.118804 +0000] I [MSGID: 106499] [glusterd-handler.c:4535:__glusterd_handle_status_volume] 0-management: Received status volume req for volume shapi
The message "I [MSGID: 106499] [glusterd-handler.c:4535:__glusterd_handle_status_volume] 0-management: Received status volume req for volume shapi" repeated 111 times between [2024-06-11 12:54:44.118804 +0000] and [2024-06-11 12:56:42.588066 +0000]
[2024-06-11 12:56:43.658896 +0000] I [MSGID: 106499] [glusterd-handler.c:4535:__glusterd_handle_status_volume] 0-management: Received status volume req for volume shapi
The message "I [MSGID: 106499] [glusterd-handler.c:4535:__glusterd_handle_status_volume] 0-management: Received status volume req for volume shapi" repeated 112 times between [2024-06-11 12:56:43.658896 +0000] and [2024-06-11 12:58:43.172517 +0000]
[2024-06-11 12:58:44.236678 +0000] I [MSGID: 106499] [glusterd-handler.c:4535:__glusterd_handle_status_volume] 0-management: Received status volume req for volume shapi
The message "I [MSGID: 106499] [glusterd-handler.c:4535:__glusterd_handle_status_volume] 0-management: Received status volume req for volume shapi" repeated 111 times between [2024-06-11 12:58:44.236678 +0000] and [2024-06-11 13:00:42.821282 +0000]
[2024-06-11 13:00:43.884134 +0000] I [MSGID: 106499] [glusterd-handler.c:4535:__glusterd_handle_status_volume] 0-management: Received status volume req for volume shapi

The issues start after ~10 minutes of read operations. Interestingly, if I modify the volume with:

gluster volume set shapi group metadata-cache
gluster volume set shapi performance.readdir-ahead on
gluster volume set shapi performance.parallel-readdir on
gluster volume set shapi client.event-threads 3
gluster volume set shapi server.event-threads 3
gluster volume set shapi cluster.lookup-optimize on
gluster volume set shapi performance.cache-size 512MB

it still fails, but only after about an hour, so the tuned configuration seems to be more resilient.
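
For reproducing that tuned variant, a minimal sketch that applies the same options in one pass and reads the effective values back with gluster volume get (option names and values are exactly the ones listed above; the loop is only for convenience):

vol="shapi"

# "group metadata-cache" expands into several cache-related options at once
gluster volume set "${vol}" group metadata-cache

# Apply the remaining options and read back what the volume actually uses
while read -r opt val; do
    gluster volume set "${vol}" "${opt}" "${val}"
    gluster volume get "${vol}" "${opt}"
done <<'EOF'
performance.readdir-ahead on
performance.parallel-readdir on
client.event-threads 3
server.event-threads 3
cluster.lookup-optimize on
performance.cache-size 512MB
EOF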

Expected results:
All read operations succeed and the backup completes.

Mandatory info:
- The output of the gluster volume info command:

root@cfn-storage01:~# gluster volume info shapi

Volume Name: shapi
Type: Replicate
Volume ID: 5e4983aa-13a6-4d29-b148-eb8d2c995255
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: cfn-storage02:/data/gluster-internal/shapi
Brick2: cfn-storage03:/data/gluster-internal/shapi
Brick3: cfn-storage01:/data/gluster-internal/shapi
Options Reconfigured:
transport.address-family: inet
cluster.shd-wait-qlength: 2048
cluster.shd-max-threads: 4
features.barrier: disable
performance.client-io-threads: off
auth.allow: 10.*
storage.fips-mode-rchecksum: on

- The output of the gluster volume status command:

root@cfn-storage01:~# gluster volume status shapi
Status of volume: shapi
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick cfn-storage02:/data/gluster-internal/
shapi                                       59557     0          Y       158993
Brick cfn-storage03:/data/gluster-internal/
shapi                                       59794     0          Y       147742
Brick cfn-storage01:/data/gluster-internal/
shapi                                       60209     0          Y       496521
Self-heal Daemon on localhost               N/A       N/A        Y       12455
Self-heal Daemon on cfn-storage03           N/A       N/A        Y       11762
Self-heal Daemon on cfn-storage02           N/A       N/A        Y       6343

Task Status of Volume shapi
------------------------------------------------------------------------------
There are no active volume tasks

- The output of the gluster volume heal command:
Note: a heal operation is currently in progress, but the issue was occurring before it started as well.

- Provide logs present on the following locations of client and server nodes:
/var/log/glusterfs/

- Is there any crash? Provide the backtrace and coredump:

Additional info:

- The operating system / glusterfs version:
Ubuntu 24.04
Gluster 11.1
Recently upgraded from 10.5, which worked fine.

@babipanghang

I've seen this "File descriptor in bad state" issue pop up several times as well; it causes files not to heal. There's another bug report somewhere over here mentioning it. A temporary workaround is to restart the volume (gluster volume stop <volname> && gluster volume start <volname>).
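
For completeness, a minimal sketch of that restart workaround (volume name taken from this issue; --mode=script is only there to suppress the interactive confirmation on stop):

# Temporary workaround: bounce the volume to clear the bad file descriptors
gluster --mode=script volume stop shapi
gluster volume start shapi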

@ironhalik
Author

Unfortunately, restarting the volume doesn't help in any way.
