Sparse file support (integration) #5620
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #5620      +/-   ##
==========================================
+ Coverage   83.09%   83.18%    +0.09%
==========================================
  Files          38       38
  Lines       10119    10139       +20
  Branches     1678     1680        +2
==========================================
+ Hits         8408     8434       +26
+ Misses       1211     1208        -3
+ Partials      500      497        -3
Comparing against zeros is quicker than hashing them; the comparison fails quickly inside non-zero data.
force-pushed from 77ea8fa to bf1d60e
At least when there are large numbers of fixed-size replacement hashes, this will be much faster, with less memory-management overhead too.
Also: zeros.startswith() is faster.
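As a rough illustration of the commit messages above (a minimal sketch, not the PR's actual code; `ZEROS` and `is_all_zero` are hypothetical names), a plain equality or `startswith` comparison against one shared all-zero buffer bails out at the first non-zero byte, whereas hashing always walks the whole buffer:

```python
# One big all-zero bytes object, reused for all comparisons
# (illustrative; the real identifier names may differ).
ZEROS = bytes(2 ** 23)  # 8 MiB of zeros

def is_all_zero(data: bytes) -> bool:
    """Detect all-zero data by comparison instead of hashing.

    ZEROS.startswith(data) is True exactly when data consists only of
    zero bytes (given len(data) <= len(ZEROS)), and the comparison
    fails fast at the first non-zero byte.
    """
    if len(data) <= len(ZEROS):
        return ZEROS.startswith(data)
    return data == bytes(len(data))
```

For typical chunk sizes the fast path is the `startswith` branch; the fallback only triggers for buffers larger than the shared zeros object.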
force-pushed from bf1d60e to 4e3be1d
force-pushed from 866f392 to 8162e2e
Nice. One nitpick on some maybe overly dense code, but nothing important.
Is there anything preventing the content-defined chunker from detecting all-zeroes chunks? I expect with that implementation, almost all zero-chunks would be the same size as each other (the max size param).
@timmc About "Is there anything preventing the content-defined chunker from detecting all-zeroes chunks?": now that you say it, this can be done easily; see the latest commit. Thanks for the hint. I just did not want to add real SEEK_DATA/SEEK_HOLE support to the C code, but detecting whether we got all zeros is easy.
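The idea from this exchange can be sketched as a post-processing step: after the chunker cuts a chunk, compare it against a shared zeros buffer and tag it accordingly, with no SEEK_DATA/SEEK_HOLE needed in the chunker itself. This is a hypothetical sketch; `Chunk`, `classify`, and the `CH_*` values are illustrative names, not borg's actual API:

```python
from collections import namedtuple

Chunk = namedtuple("Chunk", ["data", "allocation"])
CH_DATA, CH_ALLOC = "data", "alloc"  # illustrative allocation types

def classify(chunks):
    """Tag each cut chunk as real data or all-zero.

    Comparing against one shared zeros buffer fails fast inside
    non-zero data, so the extra check is cheap for normal chunks.
    """
    zeros = bytes(2 ** 23)
    for data in chunks:
        is_zero = len(data) <= len(zeros) and zeros.startswith(data)
        yield Chunk(data, CH_ALLOC if is_zero else CH_DATA)
```

With a content-defined chunker, almost all all-zero chunks come out at the maximum chunk size (there is no cut point inside a zero run), so the zero-hash cache mentioned below gets a high hit rate.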
fixes #5566
The chunker now yields namedtuples (data, meta), so that sparse holes can be processed differently from normal data.
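A minimal sketch of what a consumer of such (data, meta) tuples might look like; the field and key names here are assumptions for illustration, not borg's actual ones:

```python
from collections import namedtuple

# Hypothetical chunk shape: payload plus metadata describing it.
Chunk = namedtuple("Chunk", ["data", "meta"])

def summarize(chunks):
    """Process sparse holes differently from normal data chunks."""
    hole_bytes = data_bytes = 0
    for chunk in chunks:
        if chunk.meta["allocation"] == "hole":
            hole_bytes += chunk.meta["size"]  # no payload to hash/store
        else:
            data_bytes += len(chunk.data)
    return hole_bytes, data_bytes
```

The point of the metadata is that a hole chunk can be handled from its size alone, without ever materializing or hashing the zero bytes.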
To avoid hashing the all-zero blocks again and again, an LRU cache of the hashes for a few popular block sizes is maintained. This speeds up processing of sparse holes / all-zero blocks a lot.
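The caching idea can be sketched like this (assuming Python's `functools.lru_cache` as a stand-in for whatever cache the implementation uses, and SHA-256 as a stand-in hash; function names are hypothetical):

```python
import hashlib
from functools import lru_cache

@lru_cache(maxsize=10)
def zero_chunk_hash(size: int) -> bytes:
    """Hash of an all-zero block, computed once per popular size."""
    return hashlib.sha256(bytes(size)).digest()

def chunk_hash(data: bytes, is_all_zero: bool) -> bytes:
    if is_all_zero:
        return zero_chunk_hash(len(data))  # cache hit after first use
    return hashlib.sha256(data).digest()
```

Since sparse holes tend to repeat a handful of block sizes (e.g. the fixed block size or the chunker's maximum size), a very small cache already absorbs almost all of the redundant hashing.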
In addition to all-zero sparse holes, the fixed-blocksize chunker can now also detect all-zero from-disk reads (allocation type CH_ALLOC).
Also refactored all places that need to compare against zeros, so that there is only one big all-zero-bytes object. Also added all-zeros detection for the buzhash chunker (below: the results for a huge, completely sparse file).
Some stats: