
Add Zip64 write support #230

Open · wants to merge 3 commits into master
Conversation

@kajkal commented Jan 27, 2025

Resolves #229


Add support for automatic generation of the Zip64 end of central directory record and the Zip64 end of central directory locator when the fields available in the standard end of central directory record are too small.

A Zip64 EOCD will be added when at least one of the following is true:
1. the total number of files in the .zip archive is greater than 65,535
2. the size of the central directory is greater than 4,294,967,295 bytes (4 GB)
3. the central directory offset is greater than 4,294,967,295 bytes (4 GB)
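The three conditions above can be sketched as a single predicate. This is a hypothetical helper for illustration, not part of fflate's API:

```javascript
// Hypothetical helper (not fflate's actual code): decides whether the
// standard End of Central Directory record can still hold the values,
// per the three conditions above.
const needsZip64EOCD = (fileCount, cdSize, cdOffset) =>
  fileCount > 0xffff ||   // more than 65,535 entries
  cdSize > 0xffffffff ||  // central directory larger than 4,294,967,295 bytes
  cdOffset > 0xffffffff;  // central directory starts past the 4 GB mark

console.log(needsZip64EOCD(70000, 1000, 1000)); // true: too many entries
console.log(needsZip64EOCD(3, 1000, 1000));     // false: standard EOCD fits
```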


Add support for automatic generation of the Zip64 Extended Information Extra Field when the fields available in the standard local file header or central directory file header are too small.

A Zip64 Extra Field with 2 fields will be added to the local file header when:
1. the file's uncompressed/original size is greater than 4,294,967,295 bytes (4 GB), or
2. the file's compressed size is greater than 4,294,967,295 bytes (4 GB)

A Zip64 Extra Field with 1, 2, or 3 fields will be added to the central directory file header when:
1. the file's uncompressed/original size is greater than 4,294,967,295 bytes (4 GB) [1 field or more]
2. the file's compressed size is greater than 4,294,967,295 bytes (4 GB) [2 fields or more]
3. the local file header offset is greater than 4,294,967,295 bytes (4 GB) [3 fields]
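The cumulative field rule above can be sketched as follows (hypothetical names, using Node's Buffer; this is not fflate's internal code). The field count is determined by the last overflowing value, and everything before it is included as well:

```javascript
// Hypothetical sketch (not fflate's internals): builds the Zip64
// Extended Information Extra Field for a central directory header.
// Values are laid out in spec order (uncompressed size, compressed
// size, local header offset); the field count is the index of the
// last value exceeding 32 bits, giving the 1/2/3-field variants.
function buildZip64ExtraField(uncompressed, compressed, lfhOffset) {
  const MAX = 0xffffffffn;
  const values = [BigInt(uncompressed), BigInt(compressed), BigInt(lfhOffset)];
  // Find the last value that overflows; everything before it is included too.
  let count = 0;
  values.forEach((v, i) => { if (v > MAX) count = i + 1; });
  const buf = Buffer.alloc(4 + count * 8);
  buf.writeUInt16LE(0x0001, 0);    // header ID: Zip64 extended information
  buf.writeUInt16LE(count * 8, 2); // data size in bytes
  for (let i = 0; i < count; i++) buf.writeBigUInt64LE(values[i], 4 + i * 8);
  return buf;
}

// 5 GiB uncompressed, small compressed size, small offset -> 1 field, 12 bytes
console.log(buildZip64ExtraField(5n * 2n ** 30n, 1000n, 0n).length); // 12
```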


Add support for Data Descriptors in Zip64 format, activated by a new special flag, zip64.

For file streams with this flag set, an empty Zip64 Extra Field (0 fields) is added to the local file header, and a Data Descriptor in Zip64 format is then written at the end of the file stream.
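A minimal sketch of the two descriptor layouts (hypothetical helper, not fflate's API; assumes Node's Buffer). The signature is the same 0x08074b50 in both variants, which is why a reader must know whether the local file header carried a Zip64 EF:

```javascript
// Hypothetical sketch (not fflate's code): the standard data descriptor
// stores 32-bit sizes (16 bytes total with signature), while the Zip64
// variant stores 64-bit sizes (24 bytes total).
function dataDescriptor(crc32, compressed, uncompressed, zip64) {
  const buf = Buffer.alloc(zip64 ? 24 : 16);
  buf.writeUInt32LE(0x08074b50, 0);  // data descriptor signature (both variants)
  buf.writeUInt32LE(crc32 >>> 0, 4); // CRC-32 of the uncompressed data
  if (zip64) {
    buf.writeBigUInt64LE(BigInt(compressed), 8);
    buf.writeBigUInt64LE(BigInt(uncompressed), 16);
  } else {
    buf.writeUInt32LE(compressed, 8);
    buf.writeUInt32LE(uncompressed, 12);
  }
  return buf;
}

console.log(dataDescriptor(0x12345678, 100, 200, true).length);  // 24
console.log(dataDescriptor(0x12345678, 100, 200, false).length); // 16
```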


Other changes:

  1. Read values from the Zip64 Extra Field (if present) even when there is no Zip64 EOCD in the zip archive. Currently, Zip64 EFs are ignored when no Zip64 EOCD is found in the file.
  2. Take into account that the Zip64 EF can have different data lengths: 0, 8, 16, or 24+ bytes. Currently, the Zip64 EF is assumed to have at least 2 fields (length >= 16 bytes).
  3. Fix inconsistent mtime generated during a zip stream. Currently, if more than a few seconds pass between adding a file to the archive (zip.add(myFile)) and ending the stream (zip.end()), the mtime in the Local File Header differs from the mtime in the Central Directory Header (assuming the user has not set myFile.mtime explicitly).
  4. Prevent adding a Data Descriptor for files added in one go (zip.add(myFile); myFile.push(strToU8('the end'), true)). Currently, during a zip stream, a Data Descriptor is added to every file in the archive even when it is not necessary, which increases the final size of the zip file and makes the stream harder to read later.
  5. Separate the code needed only for streaming from the code executed by the zipSync and zip (async) functions, with regard to the creation of the zip file headers.
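Change 3 can be illustrated with a simplified sketch (hypothetical code, not fflate internals): the default mtime is resolved once, when the entry is added, so both headers encode the same DOS timestamp even if zip.end() runs much later:

```javascript
// Simplified illustration of change 3 (hypothetical, not fflate code).
// DOS timestamps in zip headers have 2-second resolution; if each header
// read the clock itself, a slow stream could produce two different values.
function dosTime(ms) {
  const d = new Date(ms);
  // hours:5 bits | minutes:6 bits | seconds/2:5 bits, as stored in zip headers
  return ((d.getHours() << 11) | (d.getMinutes() << 5) | (d.getSeconds() >> 1)) >>> 0;
}

const file = { mtime: undefined };
const pinned = file.mtime ?? Date.now(); // default resolved once, at add() time
const lfhTime = dosTime(pinned); // written into the local file header
const cdhTime = dosTime(pinned); // written into the central directory header, possibly much later
console.log(lfhTime === cdhTime); // true
```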

@kajkal (Author) commented Jan 27, 2025

I've read the code some more, and I don't see that fflate adds a Zip64 Extended Information Extra Field (APPNOTE section 4.5.3, header ID 0x0001) in the case where the fields in the standard file headers are too small.
I found thread #81, and while I agree that fflate should generally not be used to compress very large files, I wonder whether, along with raising the file limit by writing the Zip64 EOCD, it would make sense to optionally extend the file headers in accordance with Zip64.

@kajkal (Author) commented Feb 1, 2025

I did a little research and experimented with other zip programs, mainly WinRAR, 7z, the native Windows zip app, and the Linux zip command, and I found that:

  1. the Zip64 format is handled very loosely in practice
  2. if one file has a Zip64 Extra Field (Zip64 EF), this does not mean that other files in the archive will also have one
  3. if one file has a Zip64 EF, this does not mean that a Zip64 EOCD will be added to the archive
  4. the Zip64 EF can have various data lengths: 0, 8, 16, or 24 bytes
  5. and the thing that confuses me the most: Data Descriptors (signature 0x08074b50) with Zip64. The size of the DD (4-byte or 8-byte size fields are possible) is determined by the presence of a Zip64 EF in the Local File Header (LFH). The Zip64 EF data size in this case can be 0, i.e. just 4 bytes in total: 2 for the ID and 2 for a size of 0 (at least the native Windows zip app does this, despite the PKWARE specification stating that a Zip64 EF in the LFH MUST contain BOTH the original and compressed sizes). This solution is still not ideal, because when streaming you have to decide in advance whether the file will need the extra bytes in its DD. Why isn't there simply a separate signature for the larger Data Descriptor? :<
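Finding 4 (and change 2 from the PR description) suggests a reader along these lines. This is a hypothetical sketch rather than fflate's actual code, consuming only as many 8-byte fields as are actually present:

```javascript
// Hypothetical parser sketch (names are assumptions, not fflate's code):
// reads a Zip64 extra field whose data portion may be 0, 8, 16, or 24
// bytes, overriding, in spec order, only the 32-bit header values that
// were saturated at 0xffffffff.
function applyZip64(extraData, entry) {
  let pos = 0;
  const next = () => { const v = extraData.readBigUInt64LE(pos); pos += 8; return v; };
  if (entry.uncompressed === 0xffffffff && pos < extraData.length) entry.uncompressed = next();
  if (entry.compressed === 0xffffffff && pos < extraData.length) entry.compressed = next();
  if (entry.lfhOffset === 0xffffffff && pos < extraData.length) entry.lfhOffset = next();
  return entry;
}

// One-field case: only the uncompressed size overflowed 32 bits.
const oneField = Buffer.alloc(8);
oneField.writeBigUInt64LE(5368709120n, 0);
const parsed = applyZip64(oneField, { uncompressed: 0xffffffff, compressed: 5239150, lfhOffset: 0 });
console.log(parsed.uncompressed); // 5368709120n
```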

I have added a new special flag which, when set, results in the addition of an empty Zip64 EF and, consequently, a DD in Zip64 format at the end of the file stream.

Here is the code to test this in practice:

const { createWriteStream } = require('node:fs');
const fflate = require('fflate');

const entries = [
    [ 'big file', new Uint8Array(512 * 1024 * 1024).fill(10), 10 ], // ~5.4GB of data
    [ 'not quite as big', new Uint8Array(512 * 1024 * 1024).fill(11), 5 ], // ~2.7GB of data
    [ 'pretty average', fflate.strToU8('file content') ],
];

const writeStream = createWriteStream('./zip64-test.zip');
const zip = new fflate.Zip((err, chunk, final) => {
    if (err) throw err;
    if (final) writeStream.end(chunk);
    else writeStream.write(chunk);
});
for (const [ fileName, content, rep = 1 ] of entries) {
    const file = new fflate.ZipDeflate(fileName, { level: 8 });
    file.zip64 = content.length * rep > 0xffffffff; // must be set 'manually' as the stream length is usually unknown
    file.mtime = Date.parse('2000-01-01'); // for consistency
    zip.add(file);
    for (let i = 1; i <= rep; i++) {
        file.push(content, i === rep);
    }
}
zip.end();

On my machine it takes a little over a minute to generate this file.
The .zip file generated in this way has the desired format:

[      0] [local file header 1]             [ su=0, sc=0 ] <- zeros, data will be in DD later
[     38]   [zip64 extra field] <0>         [] <- Zip64 EF with zero fields
[5239192]   [data descriptor 1] <24>        [ su=5368709120, sc=5239150 ] <- DD in Zip64 format, 24 bytes long
[5239216] [local file header 2]             [ su=0, sc=0 ]
[7858837]   [data descriptor 2] <16>        [ su=2684354560, sc=2619575 ] <- standard DD 4+4+4+4=16 bytes long
[7858853] [local file header 3]             [ su=0, sc=0 ]
[7858911]   [data descriptor 3] <16>        [ su=12, sc=14 ] <- standard DD
[7858927] [central directory header 1]      [ su=4294967295, sc=5239150, lho=0 ] <- `su` reached uint32 max
[7858981]   [zip64 extra field] <8>         [ su=5368709120 ] <- actual `su` value is here
[7858993] [central directory header 2]      [ su=2684354560, sc=2619575, lho=5239216 ]
[7859055] [central directory header 3]      [ su=12, sc=14, lho=7858853 ]
[7859115] [end of central directory record]

@101arrowz (Owner) commented
Thank you for the PR! Overall it looks mostly good, bar some stylistic decisions, but I'll do a more thorough review when I get a chance. Unfortunately life has kept me quite busy of late, so I'm not sure when that will be :/

I think overall it would be a good idea to properly support Zip64 writes, but I'd like to verify that this doesn't cost too much in terms of bundle size. If it's a huge change in bundle size it may not be worth it as this is a relatively niche use case for this library.

@kajkal (Author) commented Feb 6, 2025

Sure, no pressure; this PR can wait here 👍
My verification of the bundle size showed an increase of 190 bytes (looking at the file ./umd/index.js).

before:  32,692 bytes  31.9 KB
 after:  32,882 bytes  32.1 KB
  diff:    +190 bytes  +0.2 KB

Which I personally find to be quite an acceptable price in exchange for increased tool reliability and the ability to create zip files of up to 9 petabytes!
As for the stylistic decisions, I have tried to maintain the already established theme: performance first, then readability. But I admit, in places I found the one-letter names a little too hard to follow and allowed myself to shift the balance more toward readability :D

Edit:
I optimized the bundle size a little more

before:  32,692 bytes  31.9 KB
 after:  32,529 bytes  31.7 KB
  diff:    -163 bytes  -0.2 KB

and now it can be said that this PR has reached net-zero bundle-size emissions. Those saved 163 bytes could even be sold to other features as bundle-size credits!

Successfully merging this pull request may close these issues.

zip does not support >65,535 files