Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Export all headers from MultiGzDecoder #348

Closed
wants to merge 1 commit into from

Conversation

ghost
Copy link

@ghost ghost commented May 16, 2023

Hi and thanks for your crate.

This patch replaces the header interface with a headers interface in MultiGzDecoder as it's otherwise very hard if not impossible to inspect all the gzip "members" that are processed.

It has no impact on GzDecoder itself except for an extra empty vector added to it.

I've kept the diff as compact as possible.

Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for contributing!

I am far from qualified to evaluate if headers() is the best way for accessing all headers encountered when decoding the gz stream, but definitely see why one would have a need to collect them.

As a summary, I think it's valuable is to retain backwards compatibility, and to make header-collection opt-in entirely to prevent regression in memory consumption.

Thanks

@@ -237,9 +237,9 @@ impl<R: Read> MultiGzDecoder<R> {
}

impl<R> MultiGzDecoder<R> {
/// Returns the current header associated with this stream, if it's valid.
pub fn header(&self) -> Option<&GzHeader> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is no need to make this a breaking change by removing the single-header access method. It can live side-by-side with the headers() method.

@@ -449,9 +453,13 @@ impl<R: BufRead> MultiGzDecoder<R> {
}

impl<R> MultiGzDecoder<R> {
/// Returns the current header associated with this stream, if it's valid
pub fn header(&self) -> Option<&GzHeader> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is no need to make this a breaking change by removing the single-header access method. It can live side-by-side with the headers() method.

pub fn header(&self) -> Option<&GzHeader> {
self.0.header()
/// Returns the headers processed so far
pub fn headers(&self) -> Vec<&GzHeader> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be allocation-free by returning impl Iterator<Item = &GzHeader> with this implementation:

        self.0.headers.iter().chain(self.0.header())

@@ -366,9 +369,10 @@ impl<R: BufRead> Read for GzDecoder<R> {
return Err(err);
}
};
headers.push(header);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if performance wouldn't be an issue, these headers contain a few allocations themselves and I can imagine that archives with millions of members would now see a problem with memory consumption where previously they would not have an issue.

In order to prevent regression in that regard, I think this feature must be opt-in, controllable from the MutliGzDecoder which already sets the multi flag.

@ghost
Copy link
Author

ghost commented May 17, 2023

Hi Byron,
the rationale for dropping header() on MultiGzDecoder is twofold:

  1. Functional: given the user has opted for the Multi decoder over the regular one, we can assume the stream being processed is very likely multi member. If that's the case then header() is very misleading. What header will it return? If called right away it will return the first header. But afterwards it will return some random header, entirely based on the amount read so far. And unlike zlib (with Z_BLOCK) this API offers no way of knowing that a new member was encountered.
  2. Technical: in GzDecoder the last header lives in GzState up until (including) GzState::End. The patch introduces a change (only in Multi) where on GzState::End the header is instead moved to the internal vector. This avoids a clone and simplifies the code in headers() which would otherwise emit the last header twice.

I agree on the change to Iterator for headers().

Thanks

@Byron
Copy link
Member

Byron commented May 17, 2023

Thanks for explaining. However, I do believe that headers() should be opt-in and offered in a backwards compatible way for the reasons provided, and there I see no technical issue in doing so.

@jongiddy
Copy link
Contributor

Do you have a use case for getting multiple headers from a multi-gzip file?

The use cases I've seen for multi-gzip files use it to represent a single stream broken into chunks for the convenience of compressing separately. In these cases, the metadata from different chunks isn't particularly useful. Of course it is needed for decompressing, but the filename metadata, for example, is usually not meaningfully different between chunks.

@Byron
Copy link
Member

Byron commented May 22, 2023

After having seen #301 and after having understood what multi-stream Gzip files are usually about, I see how one could use this format to either encode multiple streams of the same file or different files as individual streams. This is the case this PR addresses by allowing to extract the original filenames.

However, what I don't understand is how one would extract and separate the decoded data in this case to turn it into individual files. This makes me think that merely allowing access to all encountered headers is not enough to handle such a case.

@jongiddy
Copy link
Contributor

The main uses of multi-gzip files that I know are:

  • servers (e.g. Wikipedia) returning gzipped responses, where they pre-compress common snippets and concatenate them to create a single document.
  • the BGZF format which stores a single dataset as multiple gzips to allow fast indexing into the document.

In both cases, the data represents one final document.

For the Wikipedia case, there's no reason to extract the sections separately. The use of sections is solely for the convenience of the creating server.

For BGZF, you do need to examine the header and then read the data in the section. But the index gets you to the start of the correct section and then you can use the simple GzDecoder to decode the section you need.

Even if there was a case of someone using multi-gzip for multiple files, I think repeatedly using the simple GzDecoder would be the way to get the filenames and data in a synchronized manner.

@Byron
Copy link
Member

Byron commented May 23, 2023

I think there is a general sentiment here that keeping the current API is already suitable for handling the common scenarios of how multi-stream zip files are typically used, even though I think it wouldn't be super trivial to use GzDecoder repeatedly on the same input stream as it would be hard to keep track of consumed bytes. I am hopefully wrong about that.

The reason I am bringing this up is that an outcome of this PR could be that the documentation is improved or an example is added that shows how to handle the case that this PR attempts to address. Your help with this is definitely appreciated.

@jongiddy
Copy link
Contributor

An example code demonstrating how to extract multiple distinct files from a multi-gzip file using GzDecoder would be a good start. If there are any problems with this approach, then it may point to changes that would make it feasible or simpler, but also be useful for other use cases, such as a gzip file embedded in other data.

@Byron Byron self-assigned this Jul 17, 2023
@Byron
Copy link
Member

Byron commented Jul 17, 2023

Unfortunately I have to close this PR as by now it's sufficiently clear that it can't be merged in its current form. In #324 it's made clear that in order to reliably decode a GZ encoded file, one would use the MultiGzDecoder to handle multiple streams transparently.

To decode multiple streams where each stream represents a single file with additional meta-data to name the file (or set the path), I agree that the current primitives are hard to use or maybe not usable at all. For that, like @jongiddy suggests, we should start with an example to show how this would be written today, and make improvements from there. I am very much looking forward to such a contribution.

Let me state that for me this was an incredible journey of discovery - having recently joined the team of maintainers I can't claim to know too much about the intricacies of the GZ format or the implementation (all I needed this crate for is quite a narrow use-case, after all). However, by now and thanks to this PR, I have a much better idea on what to do with MultiGz streams, and feel this PR truly leads the way. Thanks everyone for their involvement and making this discovery possible :).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants