fix: avoid scanning through all local file headers when opening an archive #281


Merged
merged 5 commits on Apr 3, 2025

Conversation

jrudolph
Contributor

Fixes #280

The idea is to compute ZipFileData.data_start lazily, so that opening an archive no longer requires accessing every local file header. This required a change to the signature of ZipFileData.data_start() (which turns out to be non-public anyway).
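The deferral can be sketched roughly as follows. This is a simplified, hypothetical stand-in for the crate's internals (the real ZipFileData has many more fields, and the real computation parses the local file header from the reader); it only illustrates caching the offset with a OnceLock so the work happens on first access rather than at open time.

```rust
use std::sync::OnceLock;

// Hypothetical, simplified stand-in for ZipFileData.
struct FileEntry {
    header_start: u64,         // offset of the local file header, from the central directory
    name_len: u64,             // variable-length fields that must be skipped
    extra_len: u64,
    data_start: OnceLock<u64>, // computed on first access instead of at open time
}

impl FileEntry {
    // Lazily compute where the entry's compressed data begins. In the real
    // crate the first call seeks to `header_start` and parses the local file
    // header; here we model that with the fixed 30-byte header size plus the
    // variable-length name and extra fields.
    fn data_start(&self) -> u64 {
        *self.data_start.get_or_init(|| {
            self.header_start + 30 + self.name_len + self.extra_len
        })
    }
}

fn main() {
    let entry = FileEntry {
        header_start: 100,
        name_len: 8,
        extra_len: 0,
        data_start: OnceLock::new(),
    };
    // Nothing has been computed yet at "open" time.
    assert!(entry.data_start.get().is_none());
    // First access computes and caches the offset: 100 + 30 + 8 + 0 = 138.
    assert_eq!(entry.data_start(), 138);
    // Later accesses reuse the cached value without touching the reader.
    assert_eq!(entry.data_start(), 138);
}
```

With this shape, an archive holding thousands of entries pays the local-header parsing cost only for the entries that are actually read.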

@@ -1068,14 +1065,6 @@ pub(crate) fn central_header_to_zip_file<R: Read + Seek>(
         ));
     }

-    let data_start = find_data_start(&file, reader)?;
-
-    if data_start > central_directory.directory_start {
Member

@Pr0methean Pr0methean Feb 25, 2025


Shouldn't we still check for this eventually?

Contributor Author


The question is what the ultimate purpose of this check is. It looks like a conservative check of some specification requirement.

But is it required for correctness, or for ruling out weird edge cases? After all, no tests fail after removing it.

In some ways, being able to do this check at all runs counter to the idea of this PR, which is to avoid scanning through the file for random access. Even if we defer it, e.g. to ZipFileData.data_start below, or move the check into find_data_start, the random-access use case I have in mind will never execute it.

Member

@Pr0methean Pr0methean Mar 17, 2025


IIRC it does relate to edge cases that have come up in fuzzing, such as when archives are concatenated or nested, or when the magic bytes occur in filenames. Even if the spec is ambiguous in those cases about which archive to extract (which I'm pretty sure is true for concatenation when the second archive is under 64 KiB), we should still consistently choose one or the other.
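A tiny illustration of the filename ambiguity mentioned above (this is not zip-crate code, just a demonstration): the 4-byte central-directory signature `PK\x01\x02` can legitimately appear inside a stored filename, so scanning for magic bytes alone cannot reliably locate structures in a hostile archive.

```rust
// Returns true if the central-directory file-header signature appears
// anywhere in the given byte slice.
fn contains_central_dir_magic(bytes: &[u8]) -> bool {
    const MAGIC: &[u8; 4] = b"PK\x01\x02";
    bytes.windows(4).any(|w| w == MAGIC)
}

fn main() {
    // A perfectly legal (if unusual) filename that embeds the magic bytes.
    let filename: &[u8] = b"weird-PK\x01\x02-name.txt";
    assert!(contains_central_dir_magic(filename));
    // An ordinary filename does not contain them.
    assert!(!contains_central_dir_magic(b"plain.txt"));
}
```

This is the kind of input that fuzzing tends to surface, and why consistency checks against the central directory offsets are valuable.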

Member

@Pr0methean Pr0methean left a comment


Sounds good in principle, but I don't like the idea of totally removing the validation when we could just defer it.

@Pr0methean Pr0methean enabled auto-merge April 3, 2025 17:39
Member

@Pr0methean Pr0methean left a comment


After this is merged I'll look at where the data-start-after-header-start check might be added back without an additional seek, to ensure we're not reading an old central directory that's been superseded (at least, not by adding or updating files).
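One way the deferred check could look (a hypothetical sketch, not the crate's actual API): run the removed `data_start > directory_start` validation at the moment data_start is first computed, so it costs nothing for entries that are never accessed and requires no additional seek beyond the one the lazy computation already performs.

```rust
// Hypothetical helper: compute data_start and validate it against the
// central directory's start offset in one step.
fn checked_data_start(
    header_start: u64,
    name_len: u64,
    extra_len: u64,
    directory_start: u64,
) -> Result<u64, String> {
    // 30 bytes is the fixed size of a ZIP local file header.
    let data_start = header_start + 30 + name_len + extra_len;
    if data_start > directory_start {
        // An entry whose data would begin after the central directory is
        // inconsistent, e.g. a stale directory from a superseded archive.
        return Err("data_start lies beyond the central directory".to_string());
    }
    Ok(data_start)
}

fn main() {
    // Consistent entry: 100 + 30 + 8 + 0 = 138, before the directory at 1000.
    assert_eq!(checked_data_start(100, 8, 0, 1000), Ok(138));
    // Inconsistent entry: 138 would lie past a directory starting at 120.
    assert!(checked_data_start(100, 8, 0, 120).is_err());
}
```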

@Pr0methean Pr0methean added this pull request to the merge queue Apr 3, 2025
Merged via the queue into zip-rs:master with commit f4d71a4 Apr 3, 2025
39 checks passed
@Pr0methean Pr0methean mentioned this pull request Apr 3, 2025
@Lynnesbian
Copy link

@Pr0methean Just a suggestion: If you can't find a way to add the check back without incurring a performance penalty, maybe it could be made optional with a parameter on the Config struct?

Development

Successfully merging this pull request may close these issues.

zip 2.2.2 scans for large parts of the file while opening a ZipArchive
3 participants