Writeback cache synchronization issue #3143

Open
seth-hunter opened this issue Dec 31, 2022 · 11 comments · Fixed by #3157
Labels
needs-more-info This issue requires more information to address

Comments

@seth-hunter

What happened:
JuiceFS appears to have a writeback cache synchronization issue. The example below may be exacerbated by slow IO and by intolerance for misbehavior of the storage provider (STORJ), but it seems to me that the root cause is within JuiceFS.

See chunk 11008_0_4194304 in the attached log: several parallel upload attempts (resulting in server errors), continued upload attempts after the file has been deleted (at 21:32:04), and other general signs of missing synchronization. It also appears (not observable in the log below) that --upload-delay is not being fully obeyed (uploads begin within about one second of file creation). This example is very repeatable. The problem does not occur without --writeback in the mount command.

What you expected to happen:
One concurrent upload attempt per chunk, and no attempts to continue uploading significantly after the chunk has been deleted locally.

How to reproduce it (as minimally and precisely as possible):
Mount command: juicefs mount --no-usage-report --cache-size 512 --writeback -o allow_other --upload-delay 5 --backup-meta 0 --max-uploads 5 --verbose sqlite3://storj-test.db /jfs

Test scenario: create 10 4MB files in rapid succession:
for i in {1..10}; do dd if=/dev/urandom bs=1M count=4 of=./test${i}; done

Environment:

  • JuiceFS version: juicefs version 1.0.3+2022-12-27.ec6c8abd
  • OS: Linux (AlmaLinux 8.7)
  • Kernel: 4.18.0-425.3.1.el8.x86_64
  • Object storage: STORJ
  • Metadata engine info: SQLite

Log:
See juicefs-storj-writeback-issue.log. It reflects additional DEBUG statements I placed just before and after s3.go:Put()'s call to s3.PutObject() to help clarify the misbehavior. My IP is redacted as <MY_IP> and the juicefs cache path as <JFS_CACHE>.

@seth-hunter seth-hunter added the kind/bug Something isn't working label Dec 31, 2022
@SandyXSD SandyXSD self-assigned this Jan 3, 2023
@SandyXSD
Contributor

SandyXSD commented Jan 4, 2023

  1. The delayed objects will be uploaded in advance if the cache disk is too full, see: https://github.com/juicedata/juicefs/blob/v1.0.3/pkg/chunk/disk_cache.go#L123-L125
  2. Yes, we lack synchronization here, which leads to a terrible situation when the cache is full and the object storage is slow. Will fix it soon. (A rough sketch of the kind of per-block deduplication involved follows this list.)
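
For illustration only, here is a minimal sketch of per-block upload deduplication, assuming hypothetical names (uploader, put); it is not the JuiceFS implementation. The idea is that concurrent flushes of the same staging block wait on a single in-flight PUT instead of issuing parallel uploads:

    package chunk

    import "sync"

    // uploader ensures at most one in-flight upload per block key.
    // Hypothetical sketch, not the JuiceFS implementation.
    type uploader struct {
        mu       sync.Mutex
        inflight map[string]chan struct{} // block key -> done channel
    }

    func newUploader() *uploader {
        return &uploader{inflight: make(map[string]chan struct{})}
    }

    // upload runs put(key) at most once concurrently per key; callers that
    // find an upload already in flight simply wait for it to finish.
    func (u *uploader) upload(key string, put func(string) error) error {
        u.mu.Lock()
        if done, ok := u.inflight[key]; ok {
            u.mu.Unlock()
            <-done // another goroutine is already uploading this block
            return nil
        }
        done := make(chan struct{})
        u.inflight[key] = done
        u.mu.Unlock()

        err := put(key)

        u.mu.Lock()
        delete(u.inflight, key)
        u.mu.Unlock()
        close(done)
        return err
    }

golang.org/x/sync/singleflight implements the same pattern off the shelf, if pulling in a dependency is acceptable.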

@solracsf
Contributor

solracsf commented Jan 4, 2023

I can also observe that --upload-delay is not respected.
My disk cache is certainly not full (unless what you call "disk" is not the physical drive/partition's free space but the --cache-size value, which should not be the case IMO).

@SandyXSD
Contributor

SandyXSD commented Jan 5, 2023

I can also observe that --upload-delay is not respected. My disk cache is certainly not full (unless what you call "disk" is not the physical drive/partition's free space but the --cache-size value, which should not be the case IMO).

That's weird; the snippet I linked above should be the only place that uploads objects in advance. Could you describe the problem you encountered in more detail?

@seth-hunter
Author

@SandyXSD, I've tested your dedup-uploads branch in the scenario that caused the original issue reported above. This change does appear to resolve the problem in this test case, thank you! I'm not sure whether it's related to this change or to my switching from v1.0.3 to this main-based branch, but I believe I am now also seeing upload-delay being respected.

@SandyXSD
Contributor

Thanks for the feedback. Again, if the cache dir is too full (by default, when df shows Use% > 90%), blocks will be uploaded in advance to free the disk, no matter what the upload-delay is. Other than that, an upload may start a bit later than the upload-delay, but it should never start before it. If you encounter another unexpected situation, please describe more details here.
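
For readers unfamiliar with that check, below is a rough sketch of how a "cache dir too full" condition can be detected with statfs, analogous to df's Use%. The function names and threshold handling are hypothetical; the real logic lives in pkg/chunk/disk_cache.go (linked above):

    package chunk

    import "syscall" // Linux-specific: syscall.Statfs / Statfs_t

    // diskUsageRatio returns the used fraction of the filesystem holding dir,
    // similar to the Use% column of df.
    func diskUsageRatio(dir string) (float64, error) {
        var st syscall.Statfs_t
        if err := syscall.Statfs(dir, &st); err != nil {
            return 0, err
        }
        total := float64(st.Blocks) * float64(st.Bsize)
        if total == 0 {
            return 0, nil
        }
        avail := float64(st.Bavail) * float64(st.Bsize)
        return 1 - avail/total, nil
    }

    // shouldUploadEarly reports whether delayed blocks should be flushed now,
    // regardless of --upload-delay, because the cache dir is over 90% full.
    func shouldUploadEarly(cacheDir string) bool {
        used, err := diskUsageRatio(cacheDir)
        return err == nil && used > 0.9
    }

Note that this kind of check looks at the whole filesystem backing the cache dir (as df does), not at the --cache-size budget, which is why a mostly empty cache can still be flushed early on a nearly full disk.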

@SandyXSD
Contributor

Reopening this issue because there are more details that need to be confirmed.

@SandyXSD SandyXSD reopened this Jan 10, 2023
@SandyXSD
Contributor

Another thing to mention is that the issue "block is still being uploaded after the file is deleted" is not fixed, for the following reasons:

  1. It rarely happens. Usually the block kept locally will be deleted as well after the file is deleted, so it won't be uploaded. This issue happens only if the file is deleted while the block is already being uploaded but not yet finished.
  2. Even if it does happen, the wrongly uploaded block will not affect other files, and it can be deleted with the juicefs gc command.
  3. It's difficult to fix fundamentally because the file may be deleted by another client, in which case there is nothing we can do for now.

@SandyXSD SandyXSD added needs-more-info This issue requires more information to address and removed kind/bug Something isn't working labels Jan 10, 2023
@solracsf
Contributor

solracsf commented Jan 15, 2023

About the 3rd point: can't this be fixed conditionally, when only one mount is detected?
I fully understand the concern about multiple mounts/clients, but maybe a single-mount/non-distributed setup could somehow be fixed
(and the docs could warn about this potential issue in the meantime)?

@SandyXSD
Contributor

Adding a context with cancel should be workable for a single mount. Though the ongoing PUT request can't be cancelled, later retried ones can be avoided. However, as points 1 & 2 state, this is a low-priority issue.
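
As a minimal sketch of that idea (hypothetical helper, assuming a single mount): the write-back path creates a context with context.WithCancel when a block is staged, keeps the cancel function keyed by block, and the unlink path calls it; the PUT already in flight still completes, but no further retries are issued:

    package chunk

    import (
        "context"
        "time"
    )

    // uploadWithRetry retries put(key) until it succeeds or ctx is cancelled,
    // e.g. by a cancel func invoked when the file is deleted on this mount.
    // The attempt already in flight is not interrupted; only later retries
    // are skipped. Hypothetical sketch only.
    func uploadWithRetry(ctx context.Context, key string, put func(string) error) error {
        for {
            if err := put(key); err == nil {
                return nil
            }
            select {
            case <-ctx.Done():
                return ctx.Err() // file deleted locally: stop retrying
            case <-time.After(time.Second): // fixed backoff keeps the sketch short
            }
        }
    }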

@polyrabbit
Contributor

@solracsf #4748 should have fixed the 3rd point when only one client is active.

@solracsf
Contributor

solracsf commented May 4, 2024

Thanks, as I'm using JuiceFS with a single client, this should help here!
