Sparse file support (integration) #5620

Merged (17 commits) on Jan 17, 2021

@ThomasWaldmann ThomasWaldmann commented Jan 8, 2021

fixes #5566

the chunker now yields namedtuples (data, meta), so that sparse holes can be processed differently than normal data.
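
A minimal sketch of what such a chunk tuple could look like, assuming the `(data, meta)` field order from the description above. `make_chunk`, `process`, the metadata keys, and the constants `CH_DATA` and `CH_HOLE` (with their numeric values) are hypothetical names for illustration; only `CH_ALLOC` is mentioned in this PR.

```python
from collections import namedtuple

# Hypothetical allocation types. Only CH_ALLOC is named in this PR;
# the other names and all numeric values are illustrative.
CH_DATA, CH_ALLOC, CH_HOLE = 0, 1, 2

Chunk = namedtuple("Chunk", "data meta")


def make_chunk(data, size, allocation=CH_DATA):
    """Bundle payload and metadata; a hole carries no payload, only its size."""
    return Chunk(data=data, meta={"size": size, "allocation": allocation})


def process(chunks):
    """Return (bytes that carry real payload, bytes that can be skipped as holes)."""
    real = skipped = 0
    for chunk in chunks:
        if chunk.meta["allocation"] == CH_HOLE:
            skipped += chunk.meta["size"]
        else:
            real += len(chunk.data)
    return real, skipped
```

A consumer can then branch on `chunk.meta` without ever touching (or allocating) the payload of a hole.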

to avoid hashing all-zero blocks again and again, an LRU cache that maps a few popular block sizes to their hashes is maintained.
this speeds up processing of sparse holes / all-zero blocks a lot.
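
The caching idea can be sketched as follows. This is not the code from this PR: plain SHA-256 stands in for whatever id hash the repository actually uses, and `zero_chunk_id`, `chunk_id`, and the cache size of 10 are assumptions made for the example.

```python
import hashlib
from functools import lru_cache


@lru_cache(maxsize=10)  # arbitrary size: only a few block sizes are common
def zero_chunk_id(size):
    """Hash of `size` zero bytes, computed once per distinct size."""
    return hashlib.sha256(bytes(size)).digest()


def chunk_id(data, is_zeros):
    """Use the cached hash for all-zero chunks, hash everything else normally."""
    if is_zeros:
        return zero_chunk_id(len(data))
    return hashlib.sha256(data).digest()
```

Because the hash of an all-zero block depends only on its length, the cache key can be the size alone.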

in addition to sparse holes (which are all-zero by definition), the fixed-blocksize chunker can now also detect all-zero blocks that were actually read from disk (allocation type CH_ALLOC).

also refactored all places that need to compare against zeros so that we only have one big all-zero-bytes object.
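
A rough sketch of all-zero detection against one shared zeros object, assuming a simplified fixed-size chunker that yields `(allocation, size)` pairs. `fixed_chunks`, `MAX_BLOCK`, `CH_DATA`, and the numeric constant values are hypothetical; the real chunker yields richer objects.

```python
import io

CH_DATA, CH_ALLOC = 0, 1  # hypothetical values, for illustration only
MAX_BLOCK = 1 << 20       # assumed upper bound on the block size

# The single shared all-zero object; slices of the memoryview do not copy.
zeros = memoryview(bytes(MAX_BLOCK))


def fixed_chunks(fileobj, block_size):
    """Yield (allocation, size) for each fixed-size block read from fileobj."""
    if block_size > MAX_BLOCK:
        raise ValueError("block_size exceeds the shared zeros object")
    while True:
        data = fileobj.read(block_size)
        if not data:
            break
        # The comparison stops at the first non-zero byte, so it is cheap
        # for ordinary data and still cheaper than hashing for zero blocks.
        if zeros[:len(data)] == data:
            yield CH_ALLOC, len(data)
        else:
            yield CH_DATA, len(data)
```

For example, `list(fixed_chunks(io.BytesIO(bytes(8) + b"x" + bytes(7)), 8))` classifies the first block as all-zero and the second as regular data.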

also added all-zeros detection for the buzhash chunker (the stats below are for a huge, completely sparse file).

Some stats:

  • buzhash chunker, before this PR: Duration: 2 minutes 20.14 seconds
  • buzhash chunker, after this PR: Duration: 1 minutes 2.62 seconds
  • fixed chunker, before this PR: Duration: 1 minutes 25.48 seconds
  • fixed chunker, without --sparse, after this PR: Duration: 11.48 seconds
  • fixed chunker, with --sparse, after this PR: Duration: 0.42 seconds

@ThomasWaldmann ThomasWaldmann marked this pull request as draft January 8, 2021 01:02
codecov-io commented Jan 8, 2021

Codecov Report

Merging #5620 (2d76365) into master (37d4aee) will increase coverage by 0.09%.
The diff coverage is 91.42%.

@@            Coverage Diff             @@
##           master    #5620      +/-   ##
==========================================
+ Coverage   83.09%   83.18%   +0.09%     
==========================================
  Files          38       38              
  Lines       10119    10139      +20     
  Branches     1678     1680       +2     
==========================================
+ Hits         8408     8434      +26     
+ Misses       1211     1208       -3     
+ Partials      500      497       -3     
Impacted Files                     Coverage Δ
src/borg/archive.py                81.57% <90.32%> (+0.53%) ⬆️
src/borg/archiver.py               80.02% <100.00%> (+0.11%) ⬆️
src/borg/constants.py              100.00% <100.00%> (ø)
src/borg/repository.py             84.02% <0.00%> (-0.19%) ⬇️
src/borg/helpers/parseformat.py    90.05% <0.00%> (+0.16%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

comparing zeros is quicker than hashing them.
the comparison should fail quickly inside non-zero data.
@ThomasWaldmann ThomasWaldmann force-pushed the sparse-file-integr2 branch 3 times, most recently from 77ea8fa to bf1d60e Compare January 8, 2021 22:37
@ThomasWaldmann ThomasWaldmann marked this pull request as ready for review January 8, 2021 22:54
@timmc timmc left a comment

Nice. One nitpick on some maybe overly dense code, but nothing important.

Is there anything preventing the content-defined chunker from detecting all-zeroes chunks? I expect with that implementation, almost all zero-chunks would be the same size as each other (the max size param).

ThomasWaldmann commented Jan 15, 2021

@timmc About "Is there anything preventing the content-defined chunker from detecting all-zeroes chunks? I expect with that implementation, almost all zero-chunks would be the same size as each other (the max size param)."

Now that you mention it: this is easy to do, see the latest commit. Thanks for the hint.

I just did not want to add real SEEK_DATA/SEEK_HOLE support to the C code. But detecting if we got all-zero is easy.
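
For context, this is roughly what SEEK_DATA/SEEK_HOLE-based hole detection looks like when done from Python with `os.lseek`. It is a sketch under assumptions, not code from this PR: `sparse_map` and `demo` are hypothetical helpers, and `os.SEEK_DATA` / `os.SEEK_HOLE` are only available on platforms that support them (e.g. Linux).

```python
import errno
import os
import tempfile


def sparse_map(fd):
    """Yield (offset, length, is_data) ranges covering the whole file."""
    size = os.fstat(fd).st_size
    pos = 0
    while pos < size:
        try:
            data_start = os.lseek(fd, pos, os.SEEK_DATA)
        except OSError as exc:
            if exc.errno != errno.ENXIO:
                raise
            data_start = size  # ENXIO: no more data, the rest is a hole
        if data_start > pos:
            yield pos, data_start - pos, False  # hole before the next data
        if data_start < size:
            # There is an implicit hole at end of file, so this always succeeds.
            hole_start = os.lseek(fd, data_start, os.SEEK_HOLE)
            yield data_start, hole_start - data_start, True
            pos = hole_start
        else:
            pos = size


def demo(size=1 << 20):
    """Map a temp file with 4 KiB of data followed by a (possibly sparse) tail."""
    with tempfile.TemporaryFile() as f:
        f.write(b"x" * 4096)
        f.truncate(size)
        f.flush()
        return list(sparse_map(f.fileno()))
```

On filesystems without hole support the whole file is reported as data, which is exactly the case where checking the bytes that were read for all-zero content still helps.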

@ThomasWaldmann ThomasWaldmann merged commit 699256e into borgbackup:master Jan 17, 2021
@ThomasWaldmann ThomasWaldmann deleted the sparse-file-integr2 branch January 17, 2021 16:48
Successfully merging this pull request may close these issues.

chunker: sparsify if all-zero data was read