I am trying to decode the Common Crawl index files, but GzDecoder stops after about 1.8M of input from a 690M file.
The file is too large to use .read_to_end (i.e. read it into memory).
If you download the file and use gzip -d cdx-00010.gz the whole file is expanded.
How do you use GzDecoder to get the same behavior as gzip -d?
The code exits early because decoder.read returns 0 bytes, whereas reading directly from the underlying stream (input_stream.read) would continue. So I assume there is some format issue in the file that GzDecoder does not handle and gzip does. It prints 'Read 0 x' before exiting, so I assume there are no errors.
use std::fs::File;
use std::io::prelude::*;
use flate2::read::GzDecoder;

pub fn decode_to_stream(input_stream: &mut dyn Read) {
    let mut output_file = File::create("decoded").expect("Could not create output file.");
    let mut decoder = GzDecoder::new(input_stream);
    let mut buffer = [0; 65536];
    let mut total_read = 0;
    // The loop ends on a read error or when the decoder reports 0 bytes.
    while let Ok(read_size) = decoder.read(&mut buffer[..]) {
        println!("Read {} ({}).", read_size, total_read);
        if read_size == 0 {
            break;
        }
        // write_all retries short writes; a bare write may write only part of the slice.
        output_file
            .write_all(&buffer[..read_size])
            .expect("Could not write to output file.");
        total_read += read_size;
    }
}

fn main() {
    let mut file = File::open("cdx-00010.gz").expect("Could not open index file.");
    decode_to_stream(&mut file);
}
nschuessler changed the title from "GzDecoder stops decoding partial file" to "GzDecoder stops decoding file toward the start." on Apr 2, 2023
We are currently working on improving the documentation around the usage of GzDecoder and MultiGzDecoder in the hopes that this will be less of a problem in future.
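For readers landing here with the same symptom: in flate2, GzDecoder decodes only the first member of a gzip stream, while MultiGzDecoder keeps going through concatenated members, which is what gzip -d does. Assuming the index file is made of several concatenated members (the comment above suggests this, but it is not confirmed in this thread), the streaming loop could be sketched as below. The helper name `copy_all` is hypothetical, and the block uses an in-memory reader so that it runs with the standard library alone. The comment in `main` shows where a MultiGzDecoder would go.

```rust
use std::io::{self, Read, Write};

/// Copies everything from `reader` to `writer` in fixed-size chunks and
/// returns the number of bytes copied. `copy_all` is a hypothetical helper
/// name, not part of flate2 or the standard library.
pub fn copy_all<R: Read + ?Sized, W: Write + ?Sized>(
    reader: &mut R,
    writer: &mut W,
) -> io::Result<u64> {
    let mut buffer = [0u8; 65536];
    let mut total: u64 = 0;
    loop {
        let n = match reader.read(&mut buffer) {
            Ok(0) => break, // the reader reports end of stream
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e), // surface errors instead of stopping silently
        };
        writer.write_all(&buffer[..n])?; // write_all retries short writes
        total += n as u64;
    }
    Ok(total)
}

fn main() -> io::Result<()> {
    // In-memory stand-in for the real input. With the flate2 crate added as a
    // dependency, the reader would instead be something like:
    //   flate2::read::MultiGzDecoder::new(std::fs::File::open("cdx-00010.gz")?)
    let mut input = io::Cursor::new(b"example bytes".to_vec());
    let mut output = Vec::new();
    let copied = copy_all(&mut input, &mut output)?;
    assert_eq!(copied as usize, output.len());
    println!("copied {} bytes", copied);
    Ok(())
}
```

Because the helper only needs something that implements Read, switching from GzDecoder to MultiGzDecoder changes a single constructor call, and memory use stays bounded by the 64 KiB buffer regardless of file size.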
Example input:
https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2023-06/indexes/cdx-00010.gz