clean up and export crc32c function #22274
I would prefer we don't make this the default. It shouldn't be much faster, since the bottleneck is typically the hardware fetch speed. (If it is much slower, we should probably investigate our IO speed further.) However, mmap can also be much slower and can reveal surprising bugs (if you try to open something that isn't a local file). And any failure to read the file (networking error, concurrent file modification, etc.) can cause abrupt termination (usually reported as a SEGV) rather than being reported as a Julia error with a backtrace.
@vtjnash, okay. I just verified that reading the file in 16k chunks is actually slightly faster than
Fixed the documentation build.
Can we use
Sure, will do.
…optimized open(crc32c, filename), make IOBuffer checksums consistent with other streams
base/util.jl
Outdated
buf = Array{UInt8}(min(nb, 16384))
while !eof(f) && nb > 16384
n = readbytes!(f, buf)
while !eof(io) && nb > 16384
Should this be 8192 * 3? That's the LONG block size used in the sse4.2 version (and also on ARM in one of my upcoming changes).
That makes sense. I tried 16384 and 32768 and the latter wasn't any faster on my machine, but 8192 * 3 is fine too.
It'll be caught by the short version, so it won't matter too much, but in principle 8192 * 2 and 8192 * 4 are equally bad since neither of them makes full use of the LONG loop.
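The chunked read loop under discussion can be sketched roughly as follows. This is a minimal sketch, not the code merged in this PR: chunked_crc32c and its chunk keyword are hypothetical names, and it assumes a current Julia in which crc32c is provided by the CRC32c standard library (at the time of this PR it was exported from Base). The default chunk size of 8192 * 3 follows the suggestion above.

```julia
using CRC32c  # assumption: current Julia, where crc32c lives in this stdlib

# Hypothetical helper: checksum up to `nb` bytes of `io`, reading in fixed-size
# chunks and feeding each chunk into the running CRC.
function chunked_crc32c(io::IO, nb::Integer; chunk::Integer = 8192 * 3)
    buf = Vector{UInt8}(undef, min(nb, chunk))   # reusable read buffer
    crc = 0x00000000                             # initial CRC value (UInt32)
    while !eof(io) && nb > 0
        # Read at most the remaining byte count, never more than the buffer holds.
        n = readbytes!(io, buf, min(nb, length(buf)))
        # Passing the previous CRC continues the checksum over concatenated data.
        crc = crc32c(view(buf, 1:n), crc)
        nb -= n
    end
    return crc
end
```

Because the CRC is carried from one chunk to the next, the result does not depend on the chunk size; only the throughput does.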
Whoops, looks like I forgot to push my copy-free IOBuffer method; will add that tomorrow. (Done.)
Tests look good; there was a Travis failure in
Will squash/merge in a day or two if there are no further objections.
…en to while we are reading from it
* clean up and export crc32c function
* added PR to NEWS
* restore crc32 of String, add crc32c(io) to read all of a stream, add optimized open(crc32c, filename), make IOBuffer checksums consistent with other streams
* use crc32c block size of 8192*3, matching the underlying C library
* optimized IOBuffer crc32c
As discussed in #21154, this exports and documents the crc32c function for CRC-32c checksums, which we already use internally to validate .ji files.

You can do crc32c(data) on an Array{UInt8} or a contiguous subarray thereof. (Originally, I also allowed data to be a String, but then removed this: since the CRC of a string is encoding-dependent, it seemed better to require the caller to explicitly do crc32c(Vector{UInt8}(s)). String support has since been restored.) data can also be a String or an IO stream, with an optimized method for IOBuffer.

Since computing the CRC of a file is a common operation and doing it efficiently is easy to get wrong (the fastest way is to use a sequence of readbytes! calls rather than mmap), I included a crc32c(io, numbytes) method. As is noted in the documentation, you can checksum an entire file efficiently with open(crc32c, filename).
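The call forms described above might be exercised as in the sketch below. It assumes a current Julia in which crc32c is provided by the CRC32c standard library rather than exported from Base as in this PR. The variable names and the temporary file are illustrative only.

```julia
using CRC32c  # assumption: current Julia, where crc32c lives in this stdlib

data = Vector{UInt8}("123456789")       # explicit bytes of a String

c_array  = crc32c(data)                 # Array{UInt8}
c_view   = crc32c(view(data, 1:4))      # contiguous subarray
c_string = crc32c("123456789")          # String (checksums its stored bytes)
c_stream = crc32c(IOBuffer(data))       # all remaining bytes of a stream
c_prefix = crc32c(IOBuffer(data), 4)    # only the first 4 bytes of a stream

# Checksum a whole file; a temporary file stands in for a real one here.
path, io = mktemp()
write(io, data)
close(io)
c_file = open(crc32c, path)
rm(path)

# 0xe3069283 is the standard CRC-32C check value for the ASCII bytes "123456789".
println(c_array == c_string == c_stream == c_file == 0xe3069283)
println(c_view == c_prefix)
```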