GzDecoder stops decoding file toward the start. #339

nschuessler · 2023-04-02T13:57:48Z

When I try to decode the Common Crawl index files, GzDecoder stops after about 1.8 MB of input from a 690 MB file.
The file is too large to use .read_to_end (i.e. read it into memory).

If you download the file and run gzip -d cdx-00010.gz, the whole file is expanded.
How do you use GzDecoder to get the same behavior as gzip -d?

The code exits early because decoder.read returns 0 bytes, whereas reading directly from the underlying stream (input_stream.read) continues to return data. So I assume there is some format issue in the file that GzDecoder does not handle and gzip does. It prints 'Read 0 (x).' before exiting, so I assume there are no errors.

Thanks

Example input:
https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2023-06/indexes/cdx-00010.gz

Example code:

use std::fs::File;
use std::io::prelude::*;
use flate2::read::GzDecoder;

fn main() {
    let mut file = File::open("cdx-00010.gz").expect("Could not open index file.");
    decode_to_stream(&mut file);
}

pub fn decode_to_stream(input_stream: &mut dyn Read) {
    let mut output_file = File::create("decoded").expect("Could not create output file.");
    let mut decoder = GzDecoder::new(input_stream);
    let mut buffer = [0; 65536];
    let mut total_read = 0;
    // Note: a read error also ends this loop, without printing anything.
    while let Ok(read_size) = decoder.read(&mut buffer[..]) {
        println!("Read {} ({}).", read_size, total_read);
        // A read of 0 bytes means the decoder reports end of stream.
        if read_size == 0 {
            break;
        }

        // write_all, unlike write, guarantees that the whole slice is written.
        output_file.write_all(&buffer[..read_size]).expect("Could not write output.");
        total_read += read_size;
    }
}


nschuessler · 2023-04-09T16:44:48Z

So it appears this is a multi-member gzip file, which requires MultiGzDecoder rather than GzDecoder.

Byron · 2023-07-23T09:25:55Z

Sorry for the late reply, and thanks for sharing!

We are currently working on improving the documentation around the usage of GzDecoder and MultiGzDecoder in the hopes that this will be less of a problem in future.

Closing, as this issue is not directly actionable.

nschuessler changed the title ~~GzDecoder stops decoding partial file~~ GzDecoder stops decoding file toward the start. Apr 2, 2023

drtconway mentioned this issue May 24, 2023: Multistream Zlib archives? #269 (Open)

workingjubilee mentioned this issue Jul 22, 2023: Recommend MultiGzDecoder over GzDecoder in docs #324 (Merged)

Byron added the wontfix label Jul 23, 2023

Byron closed this as not planned Jul 23, 2023
