Differential Ceph backups scan all allocated space (Ceph bug) #114
Is this possibly the cause? The sample diff output shows 'exists' being true or false, whereas the manual indicates something else. I'm not aware of RBD parameters being positional (i.e. sensitive to order)... From the script:
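The following is only a sketch with placeholder pool/image/snapshot names and illustrative values, not the quoted script lines or the real sample diff; it shows the kind of rbd diff call and JSON output being discussed:

```bash
# Placeholder names; the backup script builds a diff call along these lines.
POOL=rbd
IMAGE=vm-100-disk-0
LAST_SNAP=b-2021-05-10
NEW_SNAP=b-2021-05-11

# rbd options are named flags, so their order should not matter.
rbd diff --whole-object --from-snap "$LAST_SNAP" "$POOL/$IMAGE@$NEW_SNAP" --format=json

# Illustrative JSON output: an array of extents with an 'exists' field.
# The rbd(8) man page describes the plain-text diff output, which marks each
# extent as 'data' or 'zero' instead, which presumably explains the apparent
# mismatch with the manual. Exact field rendering can vary between releases.
# [{"offset":0,"length":4194304,"exists":"true"},
#  {"offset":8388608,"length":4194304,"exists":"false"}]
```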
My tests appear to indicate a bug in Ceph Octopus 15.2.11 when the '--whole-object' switch is used. The diff between the two snapshots without '--whole-object' is 100% correct; if I however add the '--whole-object' switch, it lists every allocated object from the entire image:
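For illustration, the two invocations being compared would look roughly like this (pool, image and snapshot names are placeholders):

```bash
# Default granularity: only the byte extents written between the two snapshots.
rbd diff --from-snap b-2021-05-10 rbd/vm-100-disk-0@b-2021-05-11 --format=json

# --whole-object is only supposed to round changed extents up to the RADOS
# objects that contain them; on the affected 15.2.11 build it instead returned
# every allocated object in the image, so each "differential" backup ends up
# reading almost the whole volume.
rbd diff --from-snap b-2021-05-10 rbd/vm-100-disk-0@b-2021-05-11 \
    --whole-object --format=json
```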
Opened a bug with Ceph here: https://tracker.ceph.com/issues/50787
Pretty sure this change in behaviour is a regression due to the patch introduced in January 2020 for Ceph issue https://tracker.ceph.com/issues/42248. Have logged a bug with Ceph directly: https://tracker.ceph.com/issues/50787
@bbs2web just saw your ticket. Thanks for all the information. Hopefully this will be fixed soon.
Herewith confirmation that the fix for this has been merged into the main development branch and backported to Ceph Pacific 16.2.7, with a backport to Ceph Octopus still pending. Just tested on a test cluster that we upgraded to 16.2.7 today and it's working as expected again. Following the steps used to reproduce the problem whilst running 16.2.6 (https://tracker.ceph.com/issues/50787), it now works as expected:
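A minimal sketch of that kind of check, assuming a throwaway image in a pool named 'rbd'; the names, sizes and the use of rbd bench for the write are placeholders rather than the exact steps from the tracker ticket:

```bash
# Create a small test image, snapshot it, write one object's worth of data,
# snapshot again, then diff the two snapshots at whole-object granularity.
rbd create --size 1G rbd/wholeobject-test
rbd snap create rbd/wholeobject-test@s1
rbd bench --io-type write --io-size 4M --io-total 4M --io-threads 1 rbd/wholeobject-test
rbd snap create rbd/wholeobject-test@s2

# On 16.2.7 (with the backported fix) this lists only the touched object(s);
# on an affected build it listed every allocated object in the image.
rbd diff --from-snap s1 rbd/wholeobject-test@s2 --whole-object --format=json

# Clean up the test image.
rbd snap purge rbd/wholeobject-test && rbd rm rbd/wholeobject-test
```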
I'm a Benji backup convert, but since upgrading to 0.14.1 differential backups appear to select all allocated space with each run.
We regularly schedule trim/discard operations on guests for space reclamation reasons. Two consecutive backups without running Defragment and Optimise Drives in Windows still result in an incremental backup of 195983 blocks (at 4 MiB a block that's 765 GiB out of the 1 TiB volume).
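For reference, the kind of reclamation pass meant here; these are generic example commands, not the actual schedule in use:

```bash
# Linux guests: discard unused blocks on all mounted filesystems that support it.
fstrim -av

# Windows guests: "Defragment and Optimise Drives", or from PowerShell:
#   Optimize-Volume -DriveLetter C -ReTrim -Verbose
```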
Backups themselves aren't requiring any more space or data transfer (vlan1), but reads from the source VMs being backed up (vlan100) show increased usage since updating.
We are using https://github.com/elemental-lf/benji/blob/master/scripts/ceph.sh as the '/usr/local/bin/benji-backup.sh' function script, with "${BENJI_LOG_LEVEL:=INFO}".
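The "${BENJI_LOG_LEVEL:=INFO}" fragment is ordinary bash default-assignment, so verbosity can be raised per run without editing the script; a small sketch (the wrapper's arguments are omitted here):

```bash
# ${VAR:=default} assigns the default only if the variable is unset or empty.
: "${BENJI_LOG_LEVEL:=INFO}"
echo "$BENJI_LOG_LEVEL"   # -> INFO unless the caller exported something else

# Raise verbosity for a single run of the wrapper script:
BENJI_LOG_LEVEL=DEBUG /usr/local/bin/benji-backup.sh
```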
I've stepped through the scripts and have a sample Ceph RBD diff which contains information such as:
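The entries below are illustrative placeholders, not the actual sample file; they only show the shape of the data and one way (e.g. with jq) to total how much a differential run would read:

```bash
# Illustrative entries of the kind a sample rbd diff (JSON) contains:
cat > /tmp/sample-diff.json <<'EOF'
[{"offset":0,"length":4194304,"exists":"true"},
 {"offset":4194304,"length":4194304,"exists":"true"},
 {"offset":12582912,"length":4194304,"exists":"false"}]
EOF

# Total bytes a backup run would have to read for this diff:
jq '[.[].length] | add' /tmp/sample-diff.json    # -> 12582912
```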
I can share the entire file but presume it not to be relevant...