
Fix unzipping 4GB+ zip files #77181

Merged: 8 commits merged into dotnet:main on Oct 28, 2022

Conversation

@adamsitnik (Member) commented Oct 18, 2022

fixes #77159 without reverting #68106

I've tried to keep the changes as small as possible to make it easier to review and backport.

Explanation: for a file that is over 4 GB after zipping, some of the entries were reporting extraField.Size == 8, but readUncompressedSize, readCompressedSize, and readStartDiskNumber were false while readLocalHeaderOffset was true. So the code was reading the first Int64, not using it, and exiting:

long value64 = reader.ReadInt64();
if (readUncompressedSize)
    zip64Block._uncompressedSize = value64;
if (ms.Position > extraField.Size - sizeof(long))
    return true;

The fix is to allow advancing the stream (by reading from it) only when all four fields are provided (extraField.Size >= 28, the combined size of the fields) or when they are explicitly requested (the corresponding boolean flags set to true).
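For reference, a condensed sketch of the same spot after the fix (taken from the diff discussed in the review below):

// Advance the stream only when the field was explicitly requested, or when the
// extra field is large enough to hold all four values (3 x Int64 + 1 x Int32 = 28 bytes).
bool readAllFields = extraField.Size >= sizeof(long) + sizeof(long) + sizeof(long) + sizeof(int);

long value64 = readUncompressedSize || readAllFields ? reader.ReadInt64() : -1;
if (readUncompressedSize)
    zip64Block._uncompressedSize = value64;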

@ghost commented Oct 18, 2022

Tagging subscribers to this area: @dotnet/area-system-io-compression
See info in area-owners.md if you want to be subscribed.

Issue Details

fixes #77159 without reverting #68106

Author: adamsitnik
Assignees: -
Labels: area-System.IO.Compression
Milestone: -

@ghost assigned adamsitnik on Oct 18, 2022
@adamsitnik (Member, Author):

/azp list

@azure-pipelines:

CI/CD Pipelines for this repository:

@adamsitnik (Member, Author):

/azp run runtime-libraries-coreclr outerloop

@azure-pipelines:

Azure Pipelines successfully started running 1 pipeline(s).

@danmoseley (Member):

Seems reasonable but I'd rather someone on the IO crew sign off on the change.

BTW, if you touch this PR again, maybe you could fix this line, which I believe should check the uncompressed length:
https://github.com/dotnet/runtime/pull/68106/files#diff-ea21d1af009443a658bc821a6eb47860f7887ff70155a68f11990ec64875a2dcR860

@adamsitnik (Member, Author):

Seems reasonable but I'd rather someone on the IO crew sign off on the change.

@jozkee PTAL

@jozkee (Member) commented Oct 24, 2022

Taking a look...

[OuterLoop("It requires almost 12 GB of free disk space")]
public static void UnzipOver4GBZipFile()
{
    byte[] buffer = GC.AllocateUninitializedArray<byte>(1_000_000_000); // 1 GB
Member:

Technically, this is not 1 GB, but since you are creating 6 files, the test is still valid.

I would rather create two tests: one that succeeds at the limit (say 3.999 GB) without the fix and one that fails, and then verify that both pass with your fix.

Since this is urgent for a backport, we can defer that.

Member Author:

Technically, this is not 1 GB, but since you are creating 6 files, the test is still valid.

You mean that $1000^3$ is not 1 GB but $1024^3$ is?

I would rather create two tests: one that succeeds at the limit (say 3.999 GB) without the fix and one that fails, and then verify that both pass with your fix.

My typical workflow is to add a test that fails first, then make it pass. We have plenty of tests that verify < 4 GB archives so I don't see the value in adding it (also it would require us to use even more disk space and take longer to execute).

In the long term we could add more tests, especially ones that verify that decompress(compress(x)) == x (which this test avoids, as it took 10 minutes to do so on my beefy PC).
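As a rough illustration of such a round-trip check (a sketch only, not part of this PR; names and buffer size are illustrative):

// Compress a buffer into an in-memory archive, read it back, and compare.
byte[] original = new byte[1024];
new Random(42).NextBytes(original);

using var archiveStream = new MemoryStream();
using (var archive = new ZipArchive(archiveStream, ZipArchiveMode.Create, leaveOpen: true))
{
    using Stream entryStream = archive.CreateEntry("data.bin").Open();
    entryStream.Write(original, 0, original.Length);
}

archiveStream.Position = 0;
using (var archive = new ZipArchive(archiveStream, ZipArchiveMode.Read))
{
    using Stream entryStream = archive.GetEntry("data.bin").Open();
    using var roundTripped = new MemoryStream();
    entryStream.CopyTo(roundTripped);
    Assert.Equal(original, roundTripped.ToArray()); // decompress(compress(x)) == x
}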

Member:

You mean that $1000^3$ is not 1 GB but $1024^3$ is?

Yes.

We have plenty of tests that verify < 4 GB archives so I don't see the value in adding it (also it would require us to use even more disk space and take longer to execute).

I meant to verify the edge cases of the bug. In this case that would be 4 GB - 1, 4 GB, and 4 GB + 1, rather than a 6 GB case. But yes, that would multiply test execution time, which is already quite long with this new test.

Comment on lines 202 to 207
// Advancing the stream (by reading from it) is possible only when:
// 1. There is an explicit ask to do that (valid files, corresponding boolean flag(s) set to true, #77159).
// 2. When the size indicates that all the information is available ("slightly invalid files", #49580).
bool readAllFields = extraField.Size >= sizeof(long) + sizeof(long) + sizeof(long) + sizeof(int);

long value64 = readUncompressedSize || readAllFields ? reader.ReadInt64() : -1;
Member:

Could this fail if only 2 or 3 fields are available for reading but the bool params are false?

I.e., what if extraField.Size = 16 and readUncompressedSize = false and readCompressedSize = false?

Member Author:

I don't think that this is possible. @danmoseley has added the following comment:

// The spec section 4.5.3:
// The order of the fields in the zip64 extended
// information record is fixed, but the fields MUST
// only appear if the corresponding Local or Central
// directory record field is set to 0xFFFF or 0xFFFFFFFF.
// However tools commonly write the fields anyway; the prevailing convention
// is to respect the size, but only actually use the values if their 32 bit
// values were all 0xFF.

My understanding is that we have two options:

  1. Correct archives: the fields are provided when the corresponding directory record fields are set (the right booleans are set to true, and the size equals their total size).
  2. "Slightly incorrect archives": the boolean fields are not set, but the provided size suggests that all the fields are present. Tests added in Fix reading slightly incorrect Zip files #68106 use this approach.

@adamsitnik (Member, Author):

@jozkee @stephentoub @danmoseley ping

else if (readAllFields)
{
    _ = reader.ReadInt32();
}

// original values are unsigned, so implies value is too big to fit in signed integer
if (zip64Block._uncompressedSize < 0) throw new InvalidDataException(SR.FieldTooBigUncompressedSize);
Member:

These validations only occur if extraField.Size >= 28 since you are now returning if there's less than that.

Shouldn't we validate when reader.ReadInt* is called and/or when the read of the field was requested?

Member Author:

This is how it always was; I would prefer not to change it now, as I intend to backport this to 7.0 and am trying to keep it as defensive/safe as possible.

Member:

The reason I was asking is that it wasn't always like that. It was changed to how it is now in #68106, which is the PR that introduced the regression.

[OuterLoop("It requires almost 12 GB of free disk space")]
public static void UnzipOver4GBZipFile()
{
    byte[] buffer = GC.AllocateUninitializedArray<byte>(1_000_000_000); // 1 GB
Member:

Is there a reasonable chance this will OOM? If so, consider catching that and skipping the test, rather than failing it.

Member Author:

Is there a reasonable chance this will OOM?

It should be executed only in Outerloop runs, for 64-bit "fast runtimes", and not in parallel with other tests, so I don't think it's needed (at least for now):

[ConditionalFact(typeof(PlatformDetection), nameof(PlatformDetection.IsSpeedOptimized), nameof(PlatformDetection.Is64BitProcess))] // don't run it on slower runtimes

[Collection(nameof(DisableParallelization))]

public static bool IsSpeedOptimized => !IsSizeOptimized;
public static bool IsSizeOptimized => IsBrowser || IsAndroid || IsAppleMobile;
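Put together, the gating looks roughly like this (a sketch only; the class name is hypothetical, and the [Collection] attribute sits on the test class):

[Collection(nameof(DisableParallelization))]
public class LargeZipFileTests // hypothetical class name
{
    [ConditionalFact(typeof(PlatformDetection), nameof(PlatformDetection.IsSpeedOptimized), nameof(PlatformDetection.Is64BitProcess))] // don't run it on slower runtimes
    [OuterLoop("It requires almost 12 GB of free disk space")]
    public static void UnzipOver4GBZipFile()
    {
        // ... create several ~1 GB entries, zip them, and unzip the resulting 4 GB+ archive ...
    }
}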

@adamsitnik merged commit 399c6dc into dotnet:main on Oct 28, 2022
@adamsitnik (Member, Author):

/backport to release/7.0

@github-actions (Contributor):

@adamsitnik (Member, Author):

@danmoseley @jozkee @stephentoub thank you for the reviews!

@carlossanlop (Member) left a comment:

Assuming the CI is green, this LGTM. I'd also wait for a sign-off from @jozkee.

Edit: Nevermind. This was merged while I was still adding a review. :)

}
else if (readAllFields)
{
    _ = reader.ReadInt64();
Member:

For my own education: is there a perf improvement when using _ = ReadInt64(); to discard the value, compared to just not using _? Or is the idea here to explicitly signal that we are ignoring the value on purpose?

Member:

It shouldn't have a runtime effect; it's self-documenting, both for humans and for linters/analyzers that might otherwise flag an unused method result.
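As a small illustration (not from the PR, just the general idiom):

reader.ReadInt64();     // reads and skips 8 bytes; some analyzers may flag the ignored result
_ = reader.ReadInt64(); // same effect at runtime, but the discard makes the intent explicit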

@ghost locked as resolved and limited conversation to collaborators on Nov 28, 2022
Successfully merging this pull request may close these issues:

DotNet 7 Regression when reading large Zip files

6 participants