
bad file descriptor when rapidly appending and writeback_cache is enabled #5315

Closed · aeblyve opened this issue Nov 21, 2024 · 5 comments
Labels: kind/bug (Something isn't working), missed (missed bug)

Comments


aeblyve commented Nov 21, 2024

What happened:

I ran R code that rapidly appended data to a .csv file, and calling warnings() afterwards showed a series of bad file descriptor errors:

> warnings()
Warning messages:
1: In close.connection(file) : Problem closing connection:  Bad file descriptor
2: In close.connection(file) : Problem closing connection:  Bad file descriptor
3: In close.connection(file) : Problem closing connection:  Bad file descriptor
4: In close.connection(file) : Problem closing connection:  Bad file descriptor
5: In close.connection(file) : Problem closing connection:  Bad file descriptor
6: In close.connection(file) : Problem closing connection:  Bad file descriptor
7: In close.connection(file) : Problem closing connection:  Bad file descriptor
8: In close.connection(file) : Problem closing connection:  Bad file descriptor
9: In close.connection(file) : Problem closing connection:  Bad file descriptor
10: In close.connection(file) : Problem closing connection:  Bad file descriptor
11: In close.connection(file) : Problem closing connection:  Bad file descriptor
12: In close.connection(file) : Problem closing connection:  Bad file descriptor

The file is not written correctly: substantial pieces of data that should be there are missing. How much is missing appears random, as with a race condition.

What you expected to happen:

The data should be appended (even if it takes some time).

How to reproduce it (as minimally and precisely as possible):

This bash loop reproduced the same (or very similar) behavior on our mount:

[root@hostname juicefs]# for i in $(seq 1 10); do echo $i >> afile; done   
-bash: echo: write error: Bad file descriptor
-bash: echo: write error: Bad file descriptor
-bash: echo: write error: Bad file descriptor
-bash: echo: write error: Bad file descriptor
-bash: echo: write error: Bad file descriptor
-bash: echo: write error: Bad file descriptor
-bash: echo: write error: Bad file descriptor
-bash: echo: write error: Bad file descriptor
-bash: echo: write error: Bad file descriptor
-bash: echo: write error: Bad file descriptor
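The loop above can be wrapped into a slightly more structured reproducer. A minimal sketch (not part of the original report), assuming the filesystem under test is passed as the first argument; it defaults to a local temp dir, where every write should succeed:

```shell
#!/bin/sh
# Append-loop reproducer sketch. Point $1 at a JuiceFS mount to test it;
# by default it runs against a local temp dir, where no failures are expected.
dir="${1:-$(mktemp -d)}"
f="$dir/afile"
: > "$f"                        # truncate/create the target file
fails=0
for i in $(seq 1 100); do
  # each redirection opens, appends one line, and closes the file,
  # mimicking the rapid open/append/close pattern of the R workload
  echo "$i" >> "$f" || fails=$((fails + 1))
done
echo "lines=$(wc -l < "$f") failures=$fails"
```

On a healthy mount this should report 100 lines and 0 failures; on the affected mount with writeback_cache, some writes fail with EBADF and lines go missing.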

Anything else we need to know?

The issue consistently disappeared when we removed writeback_cache from the mount options; writeback alone still works.

Here is a sample from the .accesslog captured while the R code was failing. Note the bad file descriptor on the read:

2024.11.21 12:26:57.506861 [uid:uid,gid:gid,pid:28644] read (57698439,4096,28672): (0) - bad file descriptor <0.000002>
2024.11.21 12:26:57.507507 [uid:uid,gid:gid,pid:28644] setattr (57698439[41684],0x460,[mtime=1732210017]): (57698439,[-rw-r--r--:0100644,1,uid,gid,1732209892,1732210017,1732210017,32225]) - OK <0.000609>
2024.11.21 12:26:57.507536 [uid:uid,gid:gid,pid:28644] flush (57698439,41684,83806B71FE2A25BE) - OK <0.000002>
2024.11.21 12:26:57.507566 [uid:0,gid:0,pid:0] release (57698439,41684) - OK <0.000006>
2024.11.21 12:26:57.511805 [uid:uid,gid:gid,pid:28644] getattr (57698438): (57698438,[-rwxr--r--:0100744,1,uid,gid,1732209874,1732209874,1732209874,4724]) - OK <0.000460>
2024.11.21 12:26:57.515278 [uid:uid,gid:gid,pid:28805] flush (57698438,39669,1D60D24A7855051D) - OK <0.000001>
2024.11.21 12:26:57.515346 [uid:uid,gid:gid,pid:28644] flush (57698438,39669,83806B71FE2A25BE) - OK <0.000001>
2024.11.21 12:26:57.515384 [uid:0,gid:0,pid:0] release (57698438,39669) - OK <0.000006>

Environment:

  • JuiceFS version (use juicefs --version) or Hadoop Java SDK version:
    juicefs version 1.2.1+2024-08-30.cd871d1
  • Cloud provider or hardware configuration running JuiceFS:
    AWS EC2 instance.
  • OS (e.g cat /etc/os-release):
NAME="AlmaLinux"
VERSION="8.10 (Cerulean Leopard)"
ID="almalinux"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.10"
PLATFORM_ID="platform:el8"
PRETTY_NAME="AlmaLinux 8.10 (Cerulean Leopard)"
ANSI_COLOR="0;34"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:almalinux:almalinux:8::baseos"
HOME_URL="https://almalinux.org/"
DOCUMENTATION_URL="https://wiki.almalinux.org/"
BUG_REPORT_URL="https://bugs.almalinux.org/"

ALMALINUX_MANTISBT_PROJECT="AlmaLinux-8"
ALMALINUX_MANTISBT_PROJECT_VERSION="8.10"
REDHAT_SUPPORT_PRODUCT="AlmaLinux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.10"
SUPPORT_END=2029-06-01
  • Kernel (e.g. uname -a):
Linux hostname 4.18.0-553.27.1.el8_10.x86_64 #1 SMP Tue Nov 5 04:50:16 EST 2024 x86_64 x86_64 x86_64 GNU/Linux
  • Object storage (cloud provider and region, or self maintained):
    Amazon S3.
  • Metadata engine info (version, cloud provider managed or self maintained):
    Redis on Amazon EC2, self maintained. Version 6.2.7
  • Network connectivity (JuiceFS to metadata engine, JuiceFS to object storage):
    Amazon EC2 IP networking, and networking to Amazon S3.
  • Others:
    Systemd mount options at the time of failure looked like this:
Options=_netdev,allow_other,writeback_cache,writeback,max-uploads=40,buffer-size=1024,backup-meta=0
aeblyve added the kind/bug label Nov 21, 2024
@jiefenghuang (Contributor) commented:

Hi @aeblyve, for your use case, we recommend using only writeback.
writeback_cache is a mechanism of FUSE, while writeback is a mechanism of JuiceFS.

The simplified data flow is: userspace -> kernel -> FUSE (default: write-through) -> JuiceFS -> object storage.
Under the default FUSE IO mode (write-through), append writes reach JuiceFS as sequential IO requests. JuiceFS then uses its writeback mechanism to write to local disk first and asynchronously upload to object storage.

If you change the FUSE IO mode to writeback_cache, the kernel will first cache the requests in the page cache and then flush pages to the underlying layer at unpredictable intervals. The IO requests received by JuiceFS might become random writes, which is why this mode is not recommended.

Additionally, we will look into your issue to see if it can be reproduced and investigate the specific problem.
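Concretely, applying this recommendation to the systemd mount options quoted in the environment section means dropping writeback_cache while keeping writeback. A sketch of the resulting unit option line, not a tested configuration:

```ini
# Recommended shape: keep JuiceFS-level writeback, drop the FUSE-level
# writeback_cache option that triggers the bad-file-descriptor errors.
Options=_netdev,allow_other,writeback,max-uploads=40,buffer-size=1024,backup-meta=0
```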

@frostwind commented:

@jiefenghuang
Adding another data point for this failure: simply enable writeback_cache and compile Redis (e.g. I am compiling redis-7.4.1), and it hits the same error. I am using Ceph RADOS 16.2.7 as backend storage and JuiceFS 1.2.1.

[root@node1 redis-7.4.1]# ps -ef|grep jfs
root     23264     1  0 09:52 ?        00:00:00 /sbin/mount.juicefs postgres://jfs_admin@xxxxx:5432/jfs /jfs -o rw,allow_other,writeback_cache,backup-meta=0,buffer-size=2000,cache-partial-only,_netdev
root     23273 23264 36 09:52 ?        00:00:32 /usr/local/bin/juicefs.ceph postgres://jfs_admin@xxxxx:5432/jfs /jfs -o rw,allow_other,writeback_cache,backup-meta=0,buffer-size=2000,cache-partial-only,_netdev

[root@node1 redis]# ll
total 3489
-rw-r--r--. 1 yliu users 3567859 Oct  2 22:31 7.4.1.tar.gz
drwxrwxr-x. 8 root root     4096 Oct  2 12:04 redis-7.4.1
[root@node1 redis]# cd redis-7.4.1/
[root@node1 redis-7.4.1]# make
cd src && make all
./mkreleasehdr.sh: line 12: echo: write error: Bad file descriptor
./mkreleasehdr.sh: line 13: echo: write error: Bad file descriptor
./mkreleasehdr.sh: line 14: echo: write error: Bad file descriptor
./mkreleasehdr.sh: line 15: echo: write error: Bad file descriptor
make[1]: Entering directory `/jfs/redis/redis-7.4.1/src'
    CC Makefile.dep

@jiefenghuang (Contributor) commented:

Hi @frostwind, we fixed this issue in the main branch, but the fix wasn't merged into the release branch earlier. It has been resolved in #5369.

aeblyve (Author) commented Dec 12, 2024

@jiefenghuang To confirm, it is now possible to use writeback_cache without the bad file descriptor error?

I understand you don't recommend it either way for this use case, but we have other use cases that could benefit from this option.

@zhijian-pro (Contributor) commented:

Versions v1.2.0 through v1.2.2 cannot use the writeback_cache parameter. The fix will be released in v1.2.3. You can test it now by building the main branch or the release-1.2 branch.
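To decide whether a given client can safely re-enable writeback_cache, the version string can be checked against v1.2.3. A hedged shell sketch; version_ok is a hypothetical helper, and the parsing assumes the "juicefs --version" output format shown earlier in this thread ("juicefs version 1.2.1+2024-08-30.cd871d1"):

```shell
#!/bin/sh
# version_ok: succeeds if the version embedded in a
# "juicefs version X.Y.Z+date.hash" string is >= 1.2.3, the first
# release carrying the writeback_cache fix per this thread.
version_ok() {
  v=${1#juicefs version }     # strip the leading "juicefs version " prefix
  v=${v%%+*}                  # drop build metadata after "+"
  # GNU sort -V orders version strings; if the smallest of {v, 1.2.3}
  # is 1.2.3, then v >= 1.2.3
  [ "$(printf '%s\n' "$v" 1.2.3 | sort -V | head -n1)" = "1.2.3" ]
}
version_ok "juicefs version 1.2.1+2024-08-30.cd871d1" && echo safe || echo affected
```

The reporter's 1.2.1 build would print "affected", while a 1.2.3 or later build would print "safe".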

@davies davies closed this as completed Dec 16, 2024
@zhoucheng361 zhoucheng361 added the missed missed bug label Jan 6, 2025
6 participants