Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zip 2.1.3 regressed large zips with >64k files #189

Closed
Swatinem opened this issue Jun 7, 2024 · 5 comments
Closed

zip 2.1.3 regressed large zips with >64k files #189

Swatinem opened this issue Jun 7, 2024 · 5 comments
Assignees
Labels
bug Something isn't working fix merged Fix is merged, but the issue will stay open until it's released.

Comments

@Swatinem
Copy link

Swatinem commented Jun 7, 2024

I have a huge zip file with well over 64k files.

Version 2.1.1 was able to correctly parse that because inside of get_metadata, there would only be one ok_results, the one with the correct number of files.

Version 2.1.3 however regressed that behavior, and it would parse both zip64 and zip32 indices, reading both indices into an IndexMap, and then picking the wrong one which has the number of files capped at 64k.

Bildschirmfoto 2024-06-07 um 11 26 36
Bildschirmfoto 2024-06-07 um 11 30 42

As the dir_start is the same for both, max_by_key picks the last one as per its documentation.

As mentioned, version 2.1.1 rejects the index with the capped number of files and it fails somewhere in central_header_to_zip_file, though I haven’t debugged it deeper.

This regression might be related to 8efd233, or 68f7f5d which are touching the relevant code, though I’m not quite sure about that.

@Swatinem
Copy link
Author

Swatinem commented Jun 7, 2024

Another interesting result here is that performance also suffers because it is collecting into two IndexMaps. A better idea might be to try using the "better" directory info first, and just return that right away when it succeeds parsing and collecting also the file entries.
If that fails for whatever reason, then fall back to a secondary directory, potentially even clearing and reusing the already allocated IndexMap.

@cosmicexplorer
Copy link
Contributor

I suspect #93 is likely the culprit. Will take a look at this.

@cosmicexplorer
Copy link
Contributor

@Swatinem could I ask you to provide a repro case as in OP of #138? Your investigation seems very likely to be correct (thanks so much!) but I would like to be sure of the result.

@Swatinem
Copy link
Author

Thanks for taking a look here.

I was able to quickly reproduce this by using an older version of zip to create the archive, and adding a custom header in front.

This is to mimic the usage within https://github.com/getsentry/symbolic/blob/master/symbolic-debuginfo/src/sourcebundle.rs as closely as possible.

I can confirm using that testcase that version 2.1.3 throws a FileNotFound error, whereas master (which I believe will be published as 2.1.4 pretty soon) runs that testcase fine:

    #[test]
    fn test_64k() {
        let mut buffer = vec![];
        buffer.write_all(b"SYSB").unwrap();
        buffer.write_all(&0u32.to_le_bytes()).unwrap();

        let cursor = Cursor::new(buffer);
        let mut zipwriter = zip064::write::ZipWriter::new(cursor);

        let opt =
            zip064::write::FileOptions::default().last_modified_time(zip064::DateTime::default());

        for i in 0..100_000 {
            let file_contents = format!("{i}.txt");
            zipwriter.start_file(&file_contents, opt).unwrap();
            zipwriter.write_all(file_contents.as_bytes()).unwrap();
        }

        let cursor = zipwriter.finish().unwrap();

        let mut zipreader = zip213::read::ZipArchive::new(cursor).unwrap();

        for i in 0..100_000 {
            let expected_contents = format!("{i}.txt");
            let mut file = zipreader.by_name(&expected_contents).unwrap();
            let mut file_contents = String::new();
            file.read_to_string(&mut file_contents).unwrap();
            assert_eq!(file_contents, expected_contents);
        }
    }

feel free to adopt this to your testsuite, though mind you that it runs very slowly on a debug build.

@Pr0methean Pr0methean added fix merged Fix is merged, but the issue will stay open until it's released. and removed fix merged Fix is merged, but the issue will stay open until it's released. labels Jun 22, 2024
@Pr0methean
Copy link
Member

Done adapting it, and I've now used it to test a fix with some refactoring that avoids enumerating the files repeatedly and unnecessarily. When writing a lot of tiny files, it's important to choose CompressionMethod::Stored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working fix merged Fix is merged, but the issue will stay open until it's released.
Projects
None yet
Development

No branches or pull requests

3 participants