Restored data does not always match original (Ceph bug) #136
The problematic image I have is 200 GiB; I'm trying to shrink it to a size that may allow uploading and reproducing the problem...
The following are the same steps, which again yield no errors but result in a corrupted image after it's restored. Benji configuration file:
Problematic image where I can consistently reproduce corruption when restoring images:
The SHA string for a 4 MiB block of zeros is 'K8y9LzjxXBPrfVqJ/Z2F9ZXiO8M'; it looks to me like the block should contain other data but gets restored as zeros. Calculating the SHA1 sum of 4 MiB of zeros:
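For reference, the digest quoted above has the shape of an unpadded base64-encoded SHA1 (27 characters for a 20-byte digest). A minimal sketch of computing such a digest for a 4 MiB block of zeros; the unpadded encoding is an assumption inferred from the string above, not confirmed Benji internals:

```python
import base64
import hashlib

# A 4 MiB block of zeros, as seen in the blocks that were restored sparse.
block = bytes(4 * 1024 * 1024)

# SHA1 digest, base64-encoded with the trailing '=' padding stripped
# (20 bytes encode to 28 base64 chars, the last of which is '=').
digest = base64.b64encode(hashlib.sha1(block).digest()).rstrip(b"=").decode()
print(digest, len(digest))
```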
Just to clarify the above: the symptom is that running a file system integrity check on the backed-up snapshot reports no errors, whilst the restored image reports errors.
Holy shit, that's bad! I can't really help, other than to note that Benji seems a bit abandoned, so I hope someone will pick up maintainership of it soon. There are critical PRs that have been waiting several months to be merged, or at least commented on, so that doesn't bode well for the vitality of the project either.
@bbs2web this actually sounds bad but I'm having a hard time making the time to work on Benji as already noted in another issue. But I'd still like to help as best as I can at the moment:
Apologies for my absence. I applied the changes to /usr/local/benji/lib/python3.9/site-packages/benji/benji.py and /usr/local/benji/lib/python3.9/site-packages/benji/database.py. When I then ran a deep scrub of the source image it did indeed report differences, but it doesn't actually mark the image as invalid. PS: I found a smaller image that demonstrates the problem (40 GiB).
The image is still reported as valid, though:
When I run the image-comparison deep scrub again it reports the exact same information, and the image remains marked as valid. Running a standard incremental backup predictably didn't change the outcome. Removing the b- snapshot and running the backup again performed a full backup, which again yielded the same error when running a subsequent comparison of the newly re-created snapshot (created as part of the last full backup) against the resulting backup image. There is no change either when I retrieve the files in their entirety from the scrub-sparse-blocks branch. The report of blocks 1 and 2 not matching now appears to correlate with restored images not booting (failing to load a boot loader such as GRUB2).
Many thanks for the time that you are able to provide; Benji has been a great help on numerous occasions. There is no change either when I delete the snapshot and run a full backup using the rbd module instead of rbdaio: the result is identical to my last message.
Thanks for your tests! I still have no clue what might be going wrong here, so I concentrated on improving the scrub-sparse-blocks branch, so that we can better assess the extent of the problem and maybe see a pattern in the affected versions or in the affected blocks.
Many thanks, it's now more verbose and flags the image as invalid:
After a full backup, a subsequent comparison is however still problematic:
@bbs2web Okay, I have a hypothesis that I'd like you to check. Could you please remove the following from benji/src/benji/helpers/ceph.py (Lines 70 to 73 in b22639e)? That's the whole line 72. Then force an initial backup again and check whether that version is correct. If you have your own scripts the change should be similar. My theory is that Ceph might be erroneously reporting these blocks as sparse when they are actually not...
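To illustrate why a wrong sparseness report would corrupt a backup: the diff hints are a list of extents that supposedly hold data, and anything not listed gets treated as sparse. A minimal sketch of consuming such hints; the JSON sample is fabricated for illustration, and the exact `rbd diff --format=json` schema should be checked against your Ceph version:

```python
import json

# Fabricated sample in the shape of `rbd diff --whole-object --format=json`
# output (verify the real schema against your Ceph version).
sample = """
[{"offset": 0,         "length": 4194304, "exists": "true"},
 {"offset": 272629760, "length": 4194304, "exists": "true"}]
"""

# Extents the diff claims hold data. Any block NOT covered here is
# assumed sparse and restored as zeros, which is exactly where the
# suspected Ceph bug would silently lose data.
extents = [(e["offset"], e["length"])
           for e in json.loads(sample)
           if e.get("exists") in ("true", True)]
print(extents)
```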
PS: If you're able to provide the output of (EDIT: only if the backup works without the hints, of course).
FYI: All the changes regarding
Apologies for taking a little time to get these tests done. Running a deep-scrub comparison against the snapshot marked the image status as invalid. The next backup was then automatically a full backup, but a subsequent comparison unfortunately still fails. Herewith the modification to the ceph.py file:
Herewith the steps:
I've uploaded the output of a ceph whole-object diff:
I presume Benji's messages about blocks 1 and 2 being different mean that block 0 matches, but that blocks 1 and 2 are incorrectly reported by the diff as not existing?
Generating SHA1 sums for 4 MiB blocks of the source image does confirm that blocks 0-3 are non-zero (4 MiB worth of zeros has a SHA1 sum of
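A sketch of how such per-block sums can be generated; the file path and block size in the demo are placeholders, and against an RBD image you would point this at a mapped device or an exported file:

```python
import base64
import hashlib
import tempfile

def block_digests(path, block_size=4 * 1024 * 1024):
    """Yield (block_index, unpadded-base64 SHA1) for each block of a file."""
    with open(path, "rb") as f:
        index = 0
        while block := f.read(block_size):
            digest = hashlib.sha1(block).digest()
            yield index, base64.b64encode(digest).rstrip(b"=").decode()
            index += 1

# Demo with a tiny throwaway file and block size.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"A" * 8 + b"B" * 8)

for idx, digest in block_digests(tmp.name, block_size=8):
    print(idx, digest)
```

Comparing two such listings side by side (snapshot vs. restored image) pinpoints exactly which 4 MiB blocks differ.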
The backed-up image is a snapshot of a clone.
The clone's parent shows blocks 0, 1 and 2 being defined before seeking to block 65:
Perhaps there is an error in the Ceph tool's logic when the clone doesn't define blocks itself, because they fall through to the parent image?
I've opened an issue in the Ceph tracker: https://tracker.ceph.com/issues/54970. Any chance you could perhaps work with 'rbd diff' output and then pad to the Ceph object size?
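The suggested workaround could look roughly like this. `pad_extents` is a hypothetical helper, not Benji code: it expands plain `rbd diff` extents outward to whole object boundaries, so that no object containing changed data can be treated as sparse:

```python
OBJECT_SIZE = 4 * 1024 * 1024  # default RBD object size; actually image-dependent

def pad_extents(extents, object_size=OBJECT_SIZE):
    """Expand (offset, length) extents so they cover whole RBD objects."""
    touched = set()
    for offset, length in extents:
        first = offset // object_size
        last = (offset + length - 1) // object_size
        touched.update(range(first, last + 1))
    # One full-object extent per touched object, in offset order.
    return [(obj * object_size, object_size) for obj in sorted(touched)]

# A 100-byte change inside object 0 becomes the whole first object.
print(pad_extents([(4096, 100)]))  # [(0, 4194304)]
```

An extent straddling an object boundary would expand to both objects, trading some extra read/hash work for correctness.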
I really appreciate your thoroughness and the great quality of your analyses. Benji doesn't require (see Lines 745 to 766 in 82a3e0c):
I've changed the title of this issue slightly so that users browsing the issues won't be scared off from using Benji. I've added a warning message about
I'm going to close this issue as it has been traced back to Ceph. It will hopefully be fixed in Pacific (
This was fixed in Ceph Pacific 16.2.9, many thanks again for this project! |
I have a particular Ceph RBD image that is corrupt when I restore it. Yesterday I additionally removed the snapshot, which resulted in the subsequent backup yesterday evening being a full backup. A deep scrub completes without errors, as does a deep scrub where I provide the snapshot Benji created whilst initiating the backup yesterday. Comparing SHA sums of 4 MiB blocks of the snapshot and restored images, however, reveals differences.
I presume a possible hash collision, or perhaps something going out of alignment during the restore process?
The image isn't massive; it has 99 GiB allocated out of 200 GiB. The following is a much smaller test (124 MiB) where everything works as expected, so I presume that this isn't a configuration/S3/Ceph problem:
I'm running Ceph Pacific 16.2.7 on Proxmox (Debian 11.2) with Benji 0.15.0:
The above steps appear to demonstrate that Benji does generally work; we have observed no other instances of data corrupting during a restore.
Just to clarify: a benji deep scrub on the problematic backup image yields no errors, and a deep scrub where I compare the data to the snapshot also completes without errors. Restoring that image, however, yields differences in the SHA sums when I compute sha1_base64 for 4 MiB blocks of the original and restored RBD image.