Writeback cache synchronization issue #3143

Open
seth-hunter opened this issue Dec 31, 2022 · 11 comments · Fixed by #3157
Labels
needs-more-info This issue requires more information to address

Comments

@seth-hunter

What happened:
JuiceFS appears to have a writeback cache synchronization issue. The example below may be exacerbated by slow IO and by intolerance for misbehavior of the storage provider (STORJ), but it seems to me that the root cause is within JuiceFS.

See chunk 11008_0_4194304 in the attached log: several parallel upload attempts (resulting in server errors), continued upload attempts after the file has been deleted (at 21:32:04), and other general signs of missing synchronization. It also appears (not observable in the log below) that --upload-delay is not being fully obeyed (uploads begin within about one second of file creation). This example is very repeatable. The problem does not occur without --writeback in the mount command.

What you expected to happen:
One concurrent upload attempt per chunk, and no attempts to continue uploading significantly after the chunk has been deleted locally.

How to reproduce it (as minimally and precisely as possible):
Mount command: juicefs mount --no-usage-report --cache-size 512 --writeback -o allow_other --upload-delay 5 --backup-meta 0 --max-uploads 5 --verbose sqlite3://storj-test.db /jfs

Test scenario: create 10 4MB files in rapid succession:
for i in {1..10}; do dd if=/dev/urandom bs=1M count=4 of=./test${i}; done

Environment:

  • JuiceFS version: juicefs version 1.0.3+2022-12-27.ec6c8abd
  • OS: Linux (AlmaLinux 8.7)
  • Kernel: 4.18.0-425.3.1.el8.x86_64
  • Object storage: STORJ
  • Metadata engine info: SQLite

Log:
See juicefs-storj-writeback-issue.log. It reflects additional DEBUG statements I placed just before and after s3.go:Put()'s call to s3.PutObject() to help clarify the misbehavior. My IP is redacted as <MY_IP> and the juicefs cache path as <JFS_CACHE>.

@seth-hunter seth-hunter added the kind/bug Something isn't working label Dec 31, 2022
@SandyXSD SandyXSD self-assigned this Jan 3, 2023
@SandyXSD
Contributor

SandyXSD commented Jan 4, 2023

  1. The delayed objects will be uploaded in advance if the cache disk is too full, see: https://github.com/juicedata/juicefs/blob/v1.0.3/pkg/chunk/disk_cache.go#L123-L125
  2. Yes, we lack synchronization here, which leads to a terrible situation when the cache is full and the object storage is slow. Will fix it soon. (A rough sketch of the kind of per-block deduplication involved follows this list.)
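
For illustration only, here is a minimal sketch of per-block upload deduplication, assuming hypothetical names (uploader, put); it is not the JuiceFS implementation. The idea is that concurrent flushes of the same staging block wait on a single in-flight PUT instead of issuing parallel uploads:

    package chunk

    import "sync"

    // uploader ensures at most one in-flight upload per block key.
    // Hypothetical sketch, not the JuiceFS implementation.
    type uploader struct {
        mu       sync.Mutex
        inflight map[string]chan struct{} // block key -> done channel
    }

    func newUploader() *uploader {
        return &uploader{inflight: make(map[string]chan struct{})}
    }

    // upload runs put(key) at most once concurrently per key; callers that
    // find an upload already in flight simply wait for it to finish.
    func (u *uploader) upload(key string, put func(string) error) error {
        u.mu.Lock()
        if done, ok := u.inflight[key]; ok {
            u.mu.Unlock()
            <-done // another goroutine is already uploading this block
            return nil
        }
        done := make(chan struct{})
        u.inflight[key] = done
        u.mu.Unlock()

        err := put(key)

        u.mu.Lock()
        delete(u.inflight, key)
        u.mu.Unlock()
        close(done)
        return err
    }

golang.org/x/sync/singleflight implements the same pattern off the shelf, if pulling in a dependency is acceptable.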

@solracsf
Contributor

solracsf commented Jan 4, 2023

I can also observe that --upload-delay is not respected.
My disk cache is certainly not full (unless what you call "disk" is not the physical drive/partition's free space but the --cache-size value, which should not be the case IMO).

@SandyXSD
Contributor

SandyXSD commented Jan 5, 2023

I can also observe that --upload-delay is not respected. My disk cache is certainly not full (unless what you call "disk" is not the physical drive/partition's free space but the --cache-size value, which should not be the case IMO).

That's weird; the snippet I linked above should be the only place that uploads objects in advance. Could you describe the problem you encountered in more detail?

@seth-hunter
Author

@SandyXSD, I've tested your dedup-uploads branch in the scenario that caused the original issue reported above. This change does appear to resolve the problem in this test case, thank you! I'm not sure whether it's related to this change or to my switching from v1.0.3 to this main-based branch, but I believe I am now also seeing upload-delay being respected.

@SandyXSD
Contributor

Thanks for the feedback. Again, if the cache dir is too full (by default, when df shows Use% > 90%), blocks will be uploaded in advance to free the disk, no matter what the upload-delay is. Other than that, an upload may start a bit later than the upload-delay, but it should never start before it. If you encounter another unexpected situation, please describe more details here.
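
For readers unfamiliar with that check, below is a rough sketch of how a "cache dir too full" condition can be detected with statfs, analogous to df's Use%. The function names and threshold handling are hypothetical; the real logic lives in pkg/chunk/disk_cache.go (linked above):

    package chunk

    import "syscall" // Linux-specific: syscall.Statfs / Statfs_t

    // diskUsageRatio returns the used fraction of the filesystem holding dir,
    // similar to the Use% column of df.
    func diskUsageRatio(dir string) (float64, error) {
        var st syscall.Statfs_t
        if err := syscall.Statfs(dir, &st); err != nil {
            return 0, err
        }
        total := float64(st.Blocks) * float64(st.Bsize)
        if total == 0 {
            return 0, nil
        }
        avail := float64(st.Bavail) * float64(st.Bsize)
        return 1 - avail/total, nil
    }

    // shouldUploadEarly reports whether delayed blocks should be flushed now,
    // regardless of --upload-delay, because the cache dir is over 90% full.
    func shouldUploadEarly(cacheDir string) bool {
        used, err := diskUsageRatio(cacheDir)
        return err == nil && used > 0.9
    }

Note that this kind of check looks at the whole filesystem backing the cache dir (as df does), not at the --cache-size budget, which is why a mostly empty cache can still be flushed early on a nearly full disk.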

@SandyXSD
Contributor

Reopening this issue because there are more details that need to be confirmed.

@SandyXSD SandyXSD reopened this Jan 10, 2023
@SandyXSD
Contributor

Another thing to mention is that the issue "block is still being uploaded after the file is deleted" is not fixed, for the following reasons:

  1. It rarely happens. Usually the block kept locally will be deleted as well after the file is deleted, so it won't be uploaded. This issue happens only if the file is deleted while the block is already being uploaded but not yet finished.
  2. Even if it does happen, the wrongly uploaded block will not affect other files, and it can be deleted with the juicefs gc command.
  3. It's difficult to fix fundamentally because the file may be deleted by another client, in which case there is nothing we can do for now.

@SandyXSD SandyXSD added needs-more-info This issue requires more information to address and removed kind/bug Something isn't working labels Jan 10, 2023
@solracsf
Contributor

solracsf commented Jan 15, 2023

About the 3rd point: can't this be fixed conditionally, when only one mount is detected?
I fully understand the concern about multiple mounts/clients, but maybe a single-mount/non-distributed setup could somehow be fixed
(and the docs could warn about this potential issue in the meantime)?

@SandyXSD
Contributor

Adding a context with cancel should be workable for a single mount. Though the ongoing PUT request can't be cancelled, later retried ones can be avoided. However, as points 1 & 2 state, this is a low-priority issue.
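
As a minimal sketch of that idea (hypothetical helper, assuming a single mount): the write-back path creates a context with context.WithCancel when a block is staged, keeps the cancel function keyed by block, and the unlink path calls it; the PUT already in flight still completes, but no further retries are issued:

    package chunk

    import (
        "context"
        "time"
    )

    // uploadWithRetry retries put(key) until it succeeds or ctx is cancelled,
    // e.g. by a cancel func invoked when the file is deleted on this mount.
    // The attempt already in flight is not interrupted; only later retries
    // are skipped. Hypothetical sketch only.
    func uploadWithRetry(ctx context.Context, key string, put func(string) error) error {
        for {
            if err := put(key); err == nil {
                return nil
            }
            select {
            case <-ctx.Done():
                return ctx.Err() // file deleted locally: stop retrying
            case <-time.After(time.Second): // fixed backoff keeps the sketch short
            }
        }
    }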

@polyrabbit
Contributor

@solracsf #4748 should have fixed the 3rd point when only one client is active.

@solracsf
Contributor

solracsf commented May 4, 2024

Thanks, as I'm using JuiceFS with a single client, this should help here!
