
Conversation

@rdettai (Contributor) commented Oct 26, 2020

PR for https://issues.apache.org/jira/browse/ARROW-10387

When reading a Parquet file, you first need to read its footer and metadata. If you only have a "read from start" capability, this means you need the size of the file in order to read relative to the end. On some storage systems, getting the size metadata can be expensive (e.g. an extra HTTP call on blob storage).
The proposal is to add a "read from end" capability to the ChunkReader trait, as most storage systems offer this feature (local file systems as well as blob storage).
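In rough terms, the idea is to let the caller position a chunk relative to either end of the file. A minimal sketch of what that could look like (the variant names follow the ChunkMode enum from this PR's diff, but this exact get_read signature is an assumption, not the PR's):

```rust
use std::io::Read;

/// Where a chunk read starts. FromStart/FromEnd are the variant names
/// used in this PR; see the review thread below.
pub enum ChunkMode {
    /// Byte offset from the beginning of the file.
    FromStart(u64),
    /// Byte offset counted backwards from the end of the file.
    FromEnd(u64),
}

/// Assumed shape of the extended trait: instead of always taking an
/// absolute start offset (which requires knowing the file size before
/// the footer can even be located), the caller positions the read
/// relative to either end of the file.
pub trait ChunkReader {
    type T: Read;
    fn get_read(&self, mode: ChunkMode, length: usize) -> std::io::Result<Self::T>;
}
```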

@rdettai (Contributor, Author) commented Oct 26, 2020

@alamb has raised doubts in #8300 about the necessity of this, as the file size could be known from another source (a list operation or a metadata catalog). I still want to propose this for discussion, as I believe it is more flexible than the current interface.

There is one catch I encountered when implementing the S3 source, though: the HTTP Range header (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Range) allows you to read from the end, but not from the end with a given offset. So implementing the ChunkReader as it is specified here forces you to request more bytes than required (up until the end) and stop reading the stream once you have the data you need.
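For reference, the two Range forms HTTP actually supports can be illustrated like this (hypothetical helpers, not code from this PR):

```rust
/// Range for an explicit byte window measured from the start:
/// "bytes=start-end" (both ends inclusive).
fn range_from_start(start: u64, length: u64) -> String {
    format!("bytes={}-{}", start, start + length - 1)
}

/// Suffix range for the last `length` bytes: "bytes=-N".
/// There is no "last N bytes except the final M" form, so a read
/// positioned from the end with an inner offset has to over-fetch up
/// to the end of the object and discard the unwanted tail.
fn range_from_end(length: u64) -> String {
    format!("bytes=-{}", length)
}
```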

@alamb (Contributor) left a comment


I read the code carefully and it looks correct and like an improvement to me. 👍

let mut first_end_cursor = Cursor::new(first_len_end_buf);
let metadata_read: Box<dyn Read>;
if footer_metadata_len > file_size as usize {
if first_end_len < DEFAULT_FOOTER_READ_SIZE
Contributor:

To check my understanding -- this branch is checking whether the total file was smaller than DEFAULT_FOOTER_READ_SIZE (because it was all read in a single chunk) and the metadata size claims to be larger than this chunk?

Contributor Author:

Exactly. I should add a comment to make this explicit.
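For instance, the comment could spell the check out along these lines (a hypothetical helper based on the reading above, with the error message taken from this PR's diff; not the PR's final wording):

```rust
/// Hypothetical helper illustrating the check: when the first
/// read-from-end returned fewer bytes than DEFAULT_FOOTER_READ_SIZE,
/// the whole file was consumed in one chunk and its exact size is
/// known, so a footer claiming more metadata than the file holds is
/// necessarily corrupt.
fn check_footer_fits(footer_metadata_len: usize, file_size: u64) -> Result<(), String> {
    if footer_metadata_len > file_size as usize {
        // Error message as in this PR's diff.
        return Err("Invalid Parquet file. Metadata size exceeds file size.".to_string());
    }
    Ok(())
}
```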

use crate::column::reader::ColumnReaderImpl;

/// Parquet files must be read from end for the footer then from start for columns
pub enum ChunkMode {
Contributor:

One suggestion I have is to call this ReadFrom rather than ChunkMode, to make the name more specific.

Alternatively, given its similarity, I wonder if it would make sense to use std::io::SeekFrom here https://doc.rust-lang.org/std/io/enum.SeekFrom.html rather than a custom enum

Contributor Author:

std::io::SeekFrom is similar, but it also has the SeekFrom::Current variant, which we are not interested in.
About the name, I agree that ChunkMode is not ideal, but I wanted it to relate explicitly to ChunkReader. What about the more verbose but probably also more explicit ReadChunkFrom?

Contributor:

I think ReadChunkFrom is reasonable. Or maybe ChunkFrom?

Contributor Author:

good idea

@andygrove requested a review from sunchao October 29, 2020 17:13
}

impl ChunkMode {
/// FromStart offset can always be computed if you know the length
Member:

Please update this to be a proper doc comment for the method, e.g., what should the caller pass for len?

Contributor Author:

good idea
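As an illustration of what such a doc comment could spell out, here is a hedged sketch building on the ChunkMode sketch above (the method name start_offset is hypothetical; the PR's actual method may differ):

```rust
impl ChunkMode {
    /// Convert this position into an absolute offset from the start of
    /// the file.
    ///
    /// `len` must be the total length of the underlying file in bytes;
    /// once the length is known, a FromEnd offset can always be
    /// rewritten as a FromStart one.
    pub fn start_offset(&self, len: u64) -> u64 {
        match self {
            ChunkMode::FromStart(offset) => *offset,
            // saturating_sub guards against a FromEnd offset that is
            // larger than the file itself.
            ChunkMode::FromEnd(offset) => len.saturating_sub(*offset),
        }
    }
}
```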

.get_read(chunk_reader.len() - default_end_len as u64, default_end_len)?;
let mut default_len_end_buf = vec![0; default_end_len];
default_end_reader.read_exact(&mut default_len_end_buf)?;
let mut first_len_end_buf = vec![0; first_end_len];
Member:

nit: perhaps pick a better name; first_len_end_buf is confusing.

metadata_read = Box::new(first_end_cursor);
} else {
// the end of file read by default is not long enough, read missing bytes
let complementary_end_len = FOOTER_SIZE + metadata_len as usize - first_end_len;
Member:

nit: reuse footer_metadata_len

Contributor Author:

good idea

file_size as i64 - footer_metadata_len as i64
"Invalid Parquet file. Metadata size exceeds file size."
));
} else if footer_metadata_len < DEFAULT_FOOTER_READ_SIZE {
Member:

should this be <=?

let file_size = chunk_reader.len();
if file_size < (FOOTER_SIZE as u64) {
// read and cache up to DEFAULT_FOOTER_READ_SIZE bytes from the end and process the footer
let mut first_end_read = chunk_reader.get_read(
Member:

It's weird that we are handling the "chunked read" logic in the footer reader - should this be inside the ChunkReader implementation? E.g., the reader itself should do lazy loading on the input stream based on how the application decides to seek and read the data.

Contributor Author:

That's a good point. I copied this logic from the C++ implementation, but it's not great in terms of separation of concerns. If I understand your point:

  • here we should request a reader with get_read(ChunkMode::FromEnd(FOOTER_SIZE))
  • it should be up to the ChunkReader implementation to buffer extra bytes if it finds that FOOTER_SIZE is too small and it is cheap for it to get more bytes (for instance, the S3 ChunkReader would fetch 16kB instead) - see the sketch below
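A rough sketch of that separation of concerns, with the over-fetching hidden inside a hypothetical S3 reader (all names below are illustrative, not from this PR; the footer parser would simply call get_read_from_end(FOOTER_SIZE)):

```rust
use std::io::Cursor;

// How much an S3-backed reader might speculatively fetch per request:
// one slightly larger GET is cheaper than a second round trip when the
// footer turns out to need more than the requested 8 bytes.
const S3_MIN_FETCH: usize = 16 * 1024;

struct S3ChunkReader {/* bucket, key, client ... */}

impl S3ChunkReader {
    /// The footer parser only ever asks for the last `length` bytes;
    /// the reader decides on its own to fetch a larger suffix and
    /// serve the request from that buffer.
    fn get_read_from_end(&self, length: usize) -> std::io::Result<Cursor<Vec<u8>>> {
        let fetch_len = length.max(S3_MIN_FETCH);
        // One GET with a suffix range, "Range: bytes=-{fetch_len}".
        let buf = self.fetch_suffix(fetch_len)?;
        // Hand back only the tail the caller asked for; a real
        // implementation would keep the extra bytes around for the
        // metadata read that follows.
        let start = buf.len().saturating_sub(length);
        Ok(Cursor::new(buf[start..].to_vec()))
    }

    fn fetch_suffix(&self, _suffix_len: usize) -> std::io::Result<Vec<u8>> {
        unimplemented!("issue the ranged GET against S3 here")
    }
}
```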

Member:

Yes, exactly. I feel the footer reader should not be aware of how the input stream is processed, and the logic can vary depending on the remote storage, so DEFAULT_FOOTER_READ_SIZE may not fit all cases.

{
return Err(Error::new(ErrorKind::InvalidInput, "out of bound"));
/// Create a slice cursor backed by the same data as a current one.
/// If the slice length is larger than the remaining bytes in the source, the slice is clamped.
Member:

Can we replace "clamp" with something plainer, like "shorten"? As a non-native English speaker, I had to check a dictionary for the meaning of this word :)

Contributor Author:

I like the word but you are right 😄

@rdettai (Contributor, Author) commented Nov 1, 2020

Thanks for the review @sunchao, but before fixing these, I would like your point of view on the general value of this change. Do you think it is really worth adding this extra complexity to make it possible to read a file whose length we don't know, or should we stick with the simpler get_read(&self, start: u64, length: usize)? I think this interface is very important, so I really don't want to make it complicated for the sake of it 😄

@sunchao (Member) commented Nov 2, 2020

@rdettai I don't know enough about object storage to comment on the necessity of this change, but yes, for HDFS getting the file length is cheap. Also, it seems we are still going to read from the end of the file, so the length info is still required? How is ChunkMode going to be used without from_start?

@rdettai (Contributor, Author) commented Nov 3, 2020

@sunchao with most object storage, you select the bytes you want to read with the HTTP Range header, which can read from the end. You can use this to implement ChunkMode::FromEnd without knowing the length of the file.

Getting the length is expensive because it adds an extra GET request, and I guess that with HDFS it also implies a network round trip, which is not free. But as @alamb mentioned earlier, you will often have the length at hand beforehand, because you get it at the same time as you list your files/objects.

I would be in favor of stalling this PR until at least one other person expresses interest in the ability to read without knowing the length of the file.

Meanwhile, one of your comments on the PR made me think, and it's true that there is some buffering logic managed in the footer parser that should be left to the ChunkReader implementation. I'll see if I can find an interface that better separates these concerns.

@nevi-me added the needs-rebase label (A PR that needs to be rebased by the author) Nov 7, 2020
@rdettai (Contributor, Author) commented Dec 2, 2020

Closing because nobody showed interest in this feature

@rdettai closed this Dec 2, 2020

Labels

Component: Parquet · Component: Rust · needs-rebase (A PR that needs to be rebased by the author)