Sparse file support (integration) #5620

ThomasWaldmann · 2021-01-08T00:30:10Z

The chunker now yields namedtuples (data, meta), so that sparse holes can be processed differently from normal data.
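As a rough illustration of the (data, meta) idea, here is a minimal sketch. Only `CH_ALLOC` is a name taken from this PR; `CH_DATA`, `CH_HOLE`, the numeric values, the `make_chunk` helper and the meta keys are assumptions made up for the example, not necessarily what the code in `src/borg/chunker.pyx` looks like.

```python
from collections import namedtuple

# Allocation types. CH_ALLOC is mentioned in this PR; the other names and
# all numeric values are assumptions for illustration only.
CH_DATA, CH_ALLOC, CH_HOLE = 0, 1, 2

# A chunk carries its payload plus a dict of metadata about it.
Chunk = namedtuple("Chunk", "data meta")


def make_chunk(data, **meta):
    """Hypothetical helper: build a Chunk from data and keyword metadata."""
    return Chunk(data=data, meta=meta)


def needs_hashing(chunk):
    """Only real data has to be hashed; holes and all-zero blocks need not be."""
    return chunk.meta["allocation"] == CH_DATA


data_chunk = make_chunk(b"hello", allocation=CH_DATA, size=5)
hole_chunk = make_chunk(None, allocation=CH_HOLE, size=4096)
```

A consumer can then branch on `chunk.meta["allocation"]` instead of inspecting the bytes, which is what makes special treatment of holes possible.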

To avoid hashing the same all-zero blocks again and again, an LRU cache of the hashes for a few popular block sizes is maintained.
This speeds up processing of sparse holes / all-zero blocks a lot.
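The caching idea can be sketched as follows. This is not the PR's `cached_hash` code: it uses `functools.lru_cache` and SHA-256 as stand-ins, and the names `ZEROS`, `id_hash`, `zero_chunk_id` and `chunk_id` as well as the buffer size are assumptions for the example.

```python
import hashlib
from functools import lru_cache

# One shared all-zero buffer; the size is an assumption for this sketch.
ZEROS = bytes(2 * 1024 * 1024)


def id_hash(data):
    """Stand-in for the repository's id hash (SHA-256 here, as an assumption)."""
    return hashlib.sha256(data).digest()


@lru_cache(maxsize=10)
def zero_chunk_id(size):
    """Hash of `size` zero bytes, computed once per size and then cached."""
    assert size <= len(ZEROS)
    return id_hash(memoryview(ZEROS)[:size])


def chunk_id(data, size, is_zero):
    """Return the chunk id, skipping the hash computation for known-zero chunks."""
    if is_zero:
        return zero_chunk_id(size)
    return id_hash(data)
```

For a sparse file with a fixed block size, nearly every lookup hits the same cache entry, so the hash is computed only once per distinct size.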

In addition to all-zero sparse holes, the fixed-blocksize chunker can now also detect all-zero from-disk reads (allocation type CH_ALLOC).
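A simplified sketch of that detection for a fixed block size is shown below. It reads from any file-like object and does no SEEK_DATA/SEEK_HOLE handling; the block size, the numeric allocation values and the `fixed_chunks` generator are assumptions for illustration, only `CH_ALLOC` is a name from this PR.

```python
import io

# CH_ALLOC is named in this PR; CH_DATA and the values are assumptions.
CH_DATA, CH_ALLOC = 0, 1
BLOCK_SIZE = 4096
ZEROS = bytes(BLOCK_SIZE)


def fixed_chunks(fd, block_size=BLOCK_SIZE):
    """Yield (allocation, size, data) per block; data is None for all-zero blocks."""
    assert block_size <= len(ZEROS)
    while True:
        block = fd.read(block_size)
        if not block:
            break
        # Prefix comparison against the zero buffer; also works for a short last block.
        if ZEROS.startswith(block):
            yield CH_ALLOC, len(block), None
        else:
            yield CH_DATA, len(block), block


fd = io.BytesIO(bytes(4096) + b"x" * 4096 + bytes(100))
kinds = [(allocation, size) for allocation, size, _ in fixed_chunks(fd)]
```

Here the first and last blocks were really read from the "file" but contain only zeros, so they are reported as `CH_ALLOC` and do not need to carry their data along.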

Also refactored all places that need to compare against zeros, so that we only have one big all-zero bytes object.
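One way to picture the single shared object: every caller compares against it or takes a zero-copy view of it. The size constant and the helper names `is_all_zero` and `zero_data` are assumptions for this sketch, not the PR's identifiers.

```python
# Assumed upper bound for a single chunk in this sketch.
MAX_DATA_SIZE = 1024 * 1024

# The one big all-zero bytes object shared by everything that needs zeros.
zeros = bytes(MAX_DATA_SIZE)


def is_all_zero(data):
    """True if `data` consists only of zero bytes (and fits into `zeros`)."""
    return zeros.startswith(data)


def zero_data(size):
    """Return `size` zero bytes as a view on the shared object, without copying."""
    if size > len(zeros):
        raise ValueError("requested more zeros than the shared buffer holds")
    return memoryview(zeros)[:size]
```

Because `zero_data` returns a `memoryview` slice, producing replacement chunks of many different sizes does not allocate a new zero-filled buffer each time.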

Also added all-zero detection for the buzhash chunker (the stats below are the results for a huge, completely sparse file).

Some stats:

buzhash chunker, before this PR: Duration: 2 minutes 20.14 seconds
buzhash chunker, after this PR: Duration: 1 minutes 2.62 seconds
fixed chunker, before this PR: Duration: 1 minutes 25.48 seconds
fixed chunker, without --sparse, after this PR: Duration: 11.48 seconds
fixed chunker, with --sparse, after this PR: Duration: 0.42 seconds
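To put these durations into proportion, the speedup factors can be derived directly from the numbers above. The snippet is only arithmetic on the reported values; the dictionary keys are labels invented for the example.

```python
def seconds(minutes, secs):
    """Convert a 'minutes, seconds' duration to seconds."""
    return minutes * 60 + secs


# Durations as reported in the stats above.
durations = {
    "buzhash_before": seconds(2, 20.14),
    "buzhash_after": seconds(1, 2.62),
    "fixed_before": seconds(1, 25.48),
    "fixed_after_no_sparse": seconds(0, 11.48),
    "fixed_after_sparse": seconds(0, 0.42),
}


def speedup(before, after):
    """How many times faster `after` is compared to `before`."""
    return durations[before] / durations[after]


buzhash = speedup("buzhash_before", "buzhash_after")
fixed_plain = speedup("fixed_before", "fixed_after_no_sparse")
fixed_sparse = speedup("fixed_before", "fixed_after_sparse")
```

That works out to roughly 2.2x for buzhash, about 7.4x for the fixed chunker without --sparse, and about 200x with --sparse.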

src/borg/archive.py

src/borg/chunker.pyx

src/borg/testsuite/chunker.py

codecov-io · 2021-01-08T01:24:16Z

Codecov Report

Merging #5620 (2d76365) into master (37d4aee) will increase coverage by 0.09%.
The diff coverage is 91.42%.

@@            Coverage Diff             @@
##           master    #5620      +/-   ##
==========================================
+ Coverage   83.09%   83.18%   +0.09%     
==========================================
  Files          38       38              
  Lines       10119    10139      +20     
  Branches     1678     1680       +2     
==========================================
+ Hits         8408     8434      +26     
+ Misses       1211     1208       -3     
+ Partials      500      497       -3

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/borg/archive.py | `81.57% <90.32%> (+0.53%)` | ⬆️ |
| src/borg/archiver.py | `80.02% <100.00%> (+0.11%)` | ⬆️ |
| src/borg/constants.py | `100.00% <100.00%> (ø)` | |
| src/borg/repository.py | `84.02% <0.00%> (-0.19%)` | ⬇️ |
| src/borg/helpers/parseformat.py | `90.05% <0.00%> (+0.16%)` | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 37d4aee...2391d16.

timmc

Nice. One nitpick on some maybe overly dense code, but nothing important.

Is there anything preventing the content-defined chunker from detecting all-zeroes chunks? I expect with that implementation, almost all zero-chunks would be the same size as each other (the max size param).

src/borg/chunker.pyx

ThomasWaldmann · 2021-01-15T20:29:07Z

@timmc About "Is there anything preventing the content-defined chunker from detecting all-zeroes chunks? I expect with that implementation, almost all zero-chunks would be the same size as each other (the max size param)."

Now that you mention it: this can easily be done, see the latest commit. Thanks for the hint.

I just did not want to add real SEEK_DATA/SEEK_HOLE support to the C code. But detecting whether we got all-zero data is easy.
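The point from the review can be illustrated with a toy content-defined chunker. This is explicitly not buzhash: it cuts where a windowed byte sum matches a mask, and all names and parameters are made up for the sketch. On all-zero input the cut condition never fires, so chunks come out at the maximum size and can be recognized afterwards with a cheap comparison.

```python
# Shared zero buffer, at least as large as the biggest chunk we produce.
MAX_SIZE = 32
zeros = bytes(MAX_SIZE)


def toy_cdc(data, window=4, mask=0x0F, min_size=8, max_size=MAX_SIZE):
    """Toy content-defined chunker (NOT buzhash): cut where a window sum matches mask."""
    start = 0
    n = len(data)
    while start < n:
        end = min(start + max_size, n)
        cut = end  # fall back to the maximum size (or end of data)
        for pos in range(start + min_size, end):
            if sum(data[pos - window:pos]) & mask == mask:
                cut = pos
                break
        yield data[start:cut]
        start = cut


def tag_zero_chunks(chunks):
    """Post-process chunks: pair each one with a flag saying whether it is all-zero."""
    for chunk in chunks:
        yield chunk, zeros.startswith(chunk)


sparse = bytes(100)
sizes = [len(chunk) for chunk in toy_cdc(sparse)]
flags = [is_zero for _, is_zero in tag_zero_chunks(toy_cdc(sparse))]
```

With 100 zero bytes this yields three chunks of the maximum size plus a short remainder, all flagged as all-zero, which is why a small hash cache keyed by size is enough for this case too.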

Chunker: yield Chunk namedtuple instead of bytes/memoryview

8c29969

ThomasWaldmann marked this pull request as draft January 8, 2021 01:02

ThomasWaldmann added 3 commits January 8, 2021 17:33

adapt the existing chunker tests

7319f85

integrate Chunk type, avoid hashing holes

52bd55b

detect all-zero chunks, avoid hashing them

6d0f9a5

comparing zeros is quicker than hashing them. the comparison should fail quickly inside non-zero data.

ThomasWaldmann force-pushed the sparse-file-integr2 branch 3 times, most recently from 77ea8fa to bf1d60e January 8, 2021 22:37

ThomasWaldmann added 6 commits January 8, 2021 23:39

refactor new zero chunk handling to be reusable

9fd284c

reuse chunker.zeros for sparse extraction

b3659e0

refactor recreate to use chunk_to_id_data

92f2210

rename chunk_to_id_data to cached_hash

f3088a9

use cached_hash also to generate all-zero replacement chunks

ef19d93

at least for major amounts of fixed-size replacement hashes, this will be much faster. also less memory management overhead.

reuse zeros also in fixed-size chunker for all-zero chunk detection

4e3be1d

also: zeros.startswith() is faster

ThomasWaldmann force-pushed the sparse-file-integr2 branch from bf1d60e to 4e3be1d January 8, 2021 22:40

ThomasWaldmann marked this pull request as ready for review January 8, 2021 22:54

ThomasWaldmann added 4 commits January 14, 2021 19:56

remove max_chunk_size (unused)

3b9798c

move zeros to constants module

be25772

use zeros for benchmarks

e41dc6e

cached_hash is only used in archive, move it there

8162e2e

ThomasWaldmann force-pushed the sparse-file-integr2 branch from 866f392 to 8162e2e January 14, 2021 19:50

timmc approved these changes Jan 15, 2021

ThomasWaldmann added 2 commits January 15, 2021 21:10

cosmetic: directly set allocation instead going via is_zero

2d76365

add all-zero detection to buzhash chunk data processing

2391d16

fixup: improve comment about assumptions in the item metadata stream chunker

6dc3344

This was referenced Jan 15, 2021

chunker: sparsify if all-zero data was read #5566

Closed

Consider holes in sparse files when reading #1354

Closed

This was referenced Jan 15, 2021

pass around metadata #765

Closed

advanced sparse file support #14

Open

ThomasWaldmann merged commit 699256e into borgbackup:master Jan 17, 2021

ThomasWaldmann deleted the sparse-file-integr2 branch January 17, 2021 16:48
