-
Notifications
You must be signed in to change notification settings - Fork 62
fix: avoid scanning through all local file headers when opening an archive #281
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -1068,14 +1065,6 @@ pub(crate) fn central_header_to_zip_file<R: Read + Seek>( | |||
)); | |||
} | |||
|
|||
let data_start = find_data_start(&file, reader)?; | |||
|
|||
if data_start > central_directory.directory_start { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we still check for this eventually?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The question is what the ultimate purpose of this check is. It looks a bit like a conservative check of some specification requirement.
But is it require for correctness / ruling out weird edge cases? After all no tests fail after removing this check.
In some way, being able to do this check runs counter to the idea of this PR of not having to scan through the file for random access. I.e. even if we defer, e.g. to the below ZipFileData.data_start
or move the check into find_data_start
, the random access use case I have in mind here will never execute the check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC it does relate to edge cases that have come up in fuzzing, such as when archives are concatenated or nested or the magic bytes occur in filenames. Even if the spec is ambiguous in those cases about which one to extract, which I'm pretty sure is true for concatenation when the second one is under 64KiB, we should still consistently choose one or the other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good in principle, but I don't like the idea of totally removing the validation when we could just defer it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After this is merged I'll look at where the data-start-after-header-start check might be added back without an additional seek, to ensure we're not reading an old central directory that's been superseded (at least, not by adding or updating files).
@Pr0methean Just a suggestion: If you can't find a way to add the check back without incurring a performance penalty, maybe it could be made optional with a parameter on the |
Fixes #280
The idea is to make
ZipFileData.data_start
be calculated lazy to avoid accessing all local file headers already when opening a file. This required a change to the signature ofZipFileData.data_start()
(which seems to be non-public after all).