Export all headers from MultiGzDecoder #348

ghost · 2023-05-16T15:22:47Z

Hi and thanks for your crate.

This patch replaces the header interface with a headers interface in MultiGzDecoder as it's otherwise very hard if not impossible to inspect all the gzip "members" that are processed.

It has no impact on GzDecoder itself except for an extra empty vector added to it.

I've kept the diff as compact as possible.

Byron

Thanks for contributing!

I am far from qualified to evaluate if headers() is the best way for accessing all headers encountered when decoding the gz stream, but definitely see why one would have a need to collect them.

In summary, I think it's valuable to retain backwards compatibility and to make header collection entirely opt-in, to prevent a regression in memory consumption.

Thanks

Byron · 2023-05-17T06:26:26Z

src/gz/read.rs

@@ -237,9 +237,9 @@ impl<R: Read> MultiGzDecoder<R> {
 }

 impl<R> MultiGzDecoder<R> {
-    /// Returns the current header associated with this stream, if it's valid.
-    pub fn header(&self) -> Option<&GzHeader> {


I think there is no need to make this a breaking change by removing the single-header access method. It can live side-by-side with the headers() method.

Byron · 2023-05-17T06:27:59Z

src/gz/bufread.rs

@@ -449,9 +453,13 @@ impl<R: BufRead> MultiGzDecoder<R> {
 }

 impl<R> MultiGzDecoder<R> {
-    /// Returns the current header associated with this stream, if it's valid
-    pub fn header(&self) -> Option<&GzHeader> {


I think there is no need to make this a breaking change by removing the single-header access method. It can live side-by-side with the headers() method.

Byron · 2023-05-17T06:30:55Z

src/gz/bufread.rs

-    pub fn header(&self) -> Option<&GzHeader> {
-        self.0.header()
+    /// Returns the headers processed so far
+    pub fn headers(&self) -> Vec<&GzHeader> {


This could be allocation-free by returning impl Iterator<Item = &GzHeader> with this implementation:

self.0.headers.iter().chain(self.0.header())
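A minimal sketch of the allocation-free shape suggested above. `Header` and `Decoder` are hypothetical stand-ins (not flate2's `GzHeader` or its internal decoder), assumed here only to show that `Option<&T>` can be chained onto a slice iterator because `Option` implements `IntoIterator`:

```rust
// Stand-in for flate2's GzHeader; only one field is modelled here.
#[derive(Debug, PartialEq)]
pub struct Header {
    pub filename: Option<Vec<u8>>,
}

// Hypothetical decoder state: headers of completed members plus the
// header of the member currently being decoded.
pub struct Decoder {
    finished: Vec<Header>,
    current: Option<Header>,
}

impl Decoder {
    pub fn new(finished: Vec<Header>, current: Option<Header>) -> Self {
        Decoder { finished, current }
    }

    /// Header of the member currently being decoded, if any.
    pub fn header(&self) -> Option<&Header> {
        self.current.as_ref()
    }

    /// All headers seen so far, in order, without building a Vec.
    /// `Option<&Header>` yields zero or one item when chained.
    pub fn headers(&self) -> impl Iterator<Item = &Header> + '_ {
        self.finished.iter().chain(self.header())
    }
}
```

Callers that still want a `Vec<&Header>` can write `decoder.headers().collect::<Vec<_>>()`, so the allocation becomes their choice rather than the library's.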

Byron · 2023-05-17T06:33:31Z

src/gz/bufread.rs

@@ -366,9 +369,10 @@ impl<R: BufRead> Read for GzDecoder<R> {
                                    return Err(err);
                                }
                            };
+                            headers.push(header);


Even if performance weren't an issue, these headers contain a few allocations themselves, and I can imagine that archives with millions of members would now run into memory-consumption problems where previously they had none.

To prevent a regression in that regard, I think this feature must be opt-in, controllable from the MultiGzDecoder, which already sets the multi flag.
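One way the opt-in could look, as a sketch only. `MultiDecoder`, `collect_headers`, `on_member_header` and `retained` are hypothetical names invented for illustration and are not part of flate2; the point is that with the flag off, memory use stays at one header no matter how many members are decoded:

```rust
// Stand-in for flate2's GzHeader.
pub struct Header {
    pub filename: Option<Vec<u8>>,
}

// Hypothetical multi-member decoder state.
#[derive(Default)]
pub struct MultiDecoder {
    collect_headers: bool, // off by default: behaves as before
    current: Option<Header>,
    finished: Vec<Header>,
}

impl MultiDecoder {
    pub fn new() -> Self {
        Self::default()
    }

    /// Builder-style switch to opt in to keeping every header.
    pub fn collect_headers(mut self, yes: bool) -> Self {
        self.collect_headers = yes;
        self
    }

    /// Would be called by the decoding loop each time a new member's
    /// header has been parsed.
    pub fn on_member_header(&mut self, next: Header) {
        let previous = self.current.replace(next);
        if self.collect_headers {
            // Option is IntoIterator, so this pushes zero or one header.
            self.finished.extend(previous);
        }
        // Otherwise `previous` is dropped here and its memory is freed.
    }

    /// Backwards-compatible accessor for the current member's header.
    pub fn header(&self) -> Option<&Header> {
        self.current.as_ref()
    }

    /// Number of headers currently held in memory.
    pub fn retained(&self) -> usize {
        self.finished.len() + self.current.is_some() as usize
    }
}
```

With this shape the existing `header()` accessor is unchanged, and only callers who ask for collection pay for it.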

ghost · 2023-05-17T07:35:22Z

Hi Byron,
the rationale for dropping header() on MultiGzDecoder is twofold:

Functional: given the user has opted for the Multi decoder over the regular one, we can assume the stream being processed is very likely multi member. If that's the case then header() is very misleading. What header will it return? If called right away it will return the first header. But afterwards it will return some random header, entirely based on the amount read so far. And unlike zlib (with Z_BLOCK) this API offers no way of knowing that a new member was encountered.
Technical: in GzDecoder the last header lives in GzState up until (including) GzState::End. The patch introduces a change (only in Multi) where on GzState::End the header is instead moved to the internal vector. This avoids a clone and simplifies the code in headers() which would otherwise emit the last header twice.
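The "move instead of clone" idea in the technical point can be sketched roughly as follows. `State` here is a hypothetical, heavily simplified stand-in for the crate's `GzState`, and `archive_finished_header` is an invented helper name; `Header` deliberately does not implement `Clone`, so the compiler confirms that nothing is copied:

```rust
// Stand-in for flate2's GzHeader. Not Clone on purpose.
#[derive(Debug, PartialEq)]
pub struct Header {
    pub filename: Option<Vec<u8>>,
}

// Hypothetical simplified decoder state (not flate2's GzState).
#[derive(Debug, PartialEq)]
pub enum State {
    Body(Header), // member still being decoded
    End(Header),  // member finished, header still owned by the state
    Drained,      // member finished, header handed over to the vector
}

/// If the state is `End`, move its header into `headers`.
/// Any other state is put back untouched.
pub fn archive_finished_header(state: &mut State, headers: &mut Vec<Header>) {
    // mem::replace swaps in a placeholder so the header can be taken by value.
    match std::mem::replace(state, State::Drained) {
        State::End(header) => headers.push(header),
        other => *state = other,
    }
}
```

Because the header leaves the state once the member ends, an iterator over "finished headers plus current header" would not yield the last one twice.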

I agree on the change to Iterator for headers().

Thanks

Byron · 2023-05-17T08:13:42Z

Thanks for explaining. However, I do believe that headers() should be opt-in and offered in a backwards-compatible way for the reasons provided, and I see no technical issue in doing so.

jongiddy · 2023-05-21T11:47:00Z

Do you have a use case for getting multiple headers from a multi-gzip file?

The use cases I've seen for multi-gzip files use it to represent a single stream broken into chunks for the convenience of compressing separately. In these cases, the metadata from different chunks isn't particularly useful. Of course it is needed for decompressing, but the filename metadata, for example, is usually not meaningfully different between chunks.

Byron · 2023-05-22T14:40:19Z

After having seen #301 and after having understood what multi-stream Gzip files are usually about, I see how one could use this format to encode either multiple streams of the same file or different files as individual streams. The latter is the case this PR addresses, by allowing the original filenames to be extracted.

However, what I don't understand is how one would extract and separate the decoded data in this case to turn it into individual files. This makes me think that merely allowing access to all encountered headers is not enough to handle such a case.

jongiddy · 2023-05-22T20:32:18Z

The main uses of multi-gzip files that I know are:

servers (e.g. Wikipedia) returning gzipped responses, where they pre-compress common snippets and concatenate them to create a single document.
the BGZF format which stores a single dataset as multiple gzips to allow fast indexing into the document.

In both cases, the data represents one final document.

For the Wikipedia case, there's no reason to extract the sections separately. The use of sections is solely for the convenience of the creating server.

For BGZF, you do need to examine the header and then read the data in the section. But the index gets you to the start of the correct section and then you can use the simple GzDecoder to decode the section you need.
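To illustrate "examine the header" for BGZF, here is a std-only sketch that does not use flate2 at all. It assumes the block layout as I understand it from the SAM/BAM specification: each block is a gzip member whose FEXTRA field carries a `BC` subfield holding the total block size minus one. The function names are made up for this example:

```rust
/// Total size in bytes of the BGZF block starting at `block[0]`,
/// or None if the bytes do not look like a BGZF block header.
pub fn bgzf_block_size(block: &[u8]) -> Option<usize> {
    // 10-byte fixed gzip header followed by the 2-byte XLEN field.
    if block.len() < 12 || block[0] != 0x1f || block[1] != 0x8b || block[2] != 8 {
        return None;
    }
    // FLG bit 2 (FEXTRA) must be set for the extra field to exist.
    if block[3] & 0x04 == 0 {
        return None;
    }
    let xlen = u16::from_le_bytes([block[10], block[11]]) as usize;
    let extra = block.get(12..12 + xlen)?;
    let mut i = 0;
    // Walk the subfields: SI1, SI2, SLEN (little-endian u16), then data.
    while i + 4 <= extra.len() {
        let slen = u16::from_le_bytes([extra[i + 2], extra[i + 3]]) as usize;
        let data = extra.get(i + 4..i + 4 + slen)?;
        if extra[i] == b'B' && extra[i + 1] == b'C' && slen == 2 {
            // BSIZE stores the total block size minus one.
            return Some(u16::from_le_bytes([data[0], data[1]]) as usize + 1);
        }
        i += 4 + slen;
    }
    None
}

/// Start offsets of consecutive BGZF blocks in `data`, found by reading
/// headers only, without inflating anything.
pub fn bgzf_block_offsets(data: &[u8]) -> Vec<usize> {
    let mut offsets = Vec::new();
    let mut pos = 0;
    while pos < data.len() {
        match bgzf_block_size(&data[pos..]) {
            Some(size) => {
                offsets.push(pos);
                pos += size;
            }
            None => break,
        }
    }
    offsets
}
```

Given such an offset, a caller could seek to it and hand the reader to the simple GzDecoder, which is the workflow described above. This only works because BGZF records the block size in its header; ordinary gzip members carry no such length.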

Even if there was a case of someone using multi-gzip for multiple files, I think repeatedly using the simple GzDecoder would be the way to get the filenames and data in a synchronized manner.

Byron · 2023-05-23T05:24:10Z

I think the general sentiment here is that the current API is already suitable for the common ways multi-stream gzip files are typically used. That said, I don't think it would be trivial to use GzDecoder repeatedly on the same input stream, as it would be hard to keep track of consumed bytes. I am hopefully wrong about that.

The reason I am bringing this up is that an outcome of this PR could be that the documentation is improved or an example is added that shows how to handle the case that this PR attempts to address. Your help with this is definitely appreciated.

jongiddy · 2023-05-23T07:52:37Z

Example code demonstrating how to extract multiple distinct files from a multi-gzip file using GzDecoder would be a good start. If there are any problems with this approach, they may point to changes that would make it feasible or simpler, and that would also be useful for other use cases, such as a gzip file embedded in other data.

Byron · 2023-07-17T06:59:29Z

Unfortunately I have to close this PR as by now it's sufficiently clear that it can't be merged in its current form. In #324 it's made clear that in order to reliably decode a GZ encoded file, one would use the MultiGzDecoder to handle multiple streams transparently.

To decode multiple streams where each stream represents a single file with additional meta-data to name the file (or set the path), I agree that the current primitives are hard to use or maybe not usable at all. For that, like @jongiddy suggests, we should start with an example to show how this would be written today, and make improvements from there. I am very much looking forward to such a contribution.

Let me state that for me this was an incredible journey of discovery. Having recently joined the team of maintainers, I can't claim to know too much about the intricacies of the GZ format or the implementation (all I needed this crate for is quite a narrow use case, after all). However, by now and thanks to this PR, I have a much better idea of what to do with MultiGz streams, and feel this PR truly leads the way. Thanks everyone for their involvement and for making this discovery possible :).

Export all headers from MultiGzDecoder

1fcdb5b

Byron requested changes May 17, 2023

Byron self-assigned this Jul 17, 2023

Byron closed this Jul 17, 2023

Byron mentioned this pull request Jul 17, 2023

Recommend MultiGzDecoder over GzDecoder in docs #324

Merged
