GzDecoder stops decoding file toward the start. #339

Closed
nschuessler opened this issue Apr 2, 2023 · 2 comments

nschuessler commented Apr 2, 2023

While trying to decode the Common Crawl index files, GzDecoder stops after about 1.8M of input from a 690M file.
The file is too large to use .read_to_end (i.e. to read it all into memory).

If you download the file and use gzip -d cdx-00010.gz the whole file is expanded.
How do you use GzDecoder to get the same behavior as gzip -d?

The code exits early because decoder.read returns 0 bytes, whereas reading from the underlying stream (input_stream.read) continues to return data. So I assume there is some format issue in the file that GzDecoder does not handle and gzip does. It prints 'Read 0 (x).' before exiting, so I assume there are no errors.

Thanks

Example input:
https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2023-06/indexes/cdx-00010.gz

Example code:

use std::fs::File;
use std::io::prelude::*;
use flate2::read::GzDecoder;

fn main() {
    let mut file = File::open("cdx-00010.gz").expect("Could not open index file.");
    decode_to_stream(&mut file);
}

pub fn decode_to_stream(input_stream: &mut dyn Read) {
    let mut output_file = File::create("decoded").expect("Could not create output file.");
    let mut decoder = GzDecoder::new(input_stream);
    let mut buffer = [0; 65536];
    let mut total_read = 0;
    // The loop ends on the first read error or on a zero-length read.
    while let Ok(read_size) = decoder.read(&mut buffer[..]) {
        println!("Read {} ({}).", read_size, total_read);
        if read_size == 0 {
            break;
        }

        // write_all retries partial writes; a bare write() may write fewer bytes.
        output_file
            .write_all(&buffer[..read_size])
            .expect("Could not write to output file.");
        total_read += read_size;
    }
}

nschuessler changed the title from "GzDecoder stops decoding partial file" to "GzDecoder stops decoding file toward the start." on Apr 2, 2023
@nschuessler
Author

So it appears this is a multi-member gzip file, which requires MultiGzDecoder.

@Byron
Member

Byron commented Jul 23, 2023

Sorry for the late reply, and thanks for sharing!

We are currently working on improving the documentation around the usage of GzDecoder and MultiGzDecoder in the hopes that this will be less of a problem in future.

Closing, as this issue is not directly actionable.

Byron closed this as not planned on Jul 23, 2023