-
Notifications
You must be signed in to change notification settings - Fork 1.1k
[parquet] Add row group index virtual column #9117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[parquet] Add row group index virtual column #9117
Conversation
073e3d5 to
0c9e12d
Compare
|
@Jefffrey if you have some bandwidth, I'd be curious to get your thoughts |
|
cc @alamb |
I'll see if I can take a look at this PR soon, though I will say it's been a while since I looked at the parquet codebase 😅 |
|
This would be sweet! I like the idea of using an extension type as a marker, and letting the caller customize the column name. |
scovich
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! (a few nits)
| fn read_records(&mut self, batch_size: usize) -> Result<usize> { | ||
| let starting_len = self.buffered_indices.len(); | ||
| self.buffered_indices | ||
| .extend((&mut self.remaining_indices).take(batch_size)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I know this is how the row index reader did it, but since that code merged I learned that Iterator::by_ref is a thing.
| .extend((&mut self.remaining_indices).take(batch_size)); | |
| .extend(self.remaining_indices.by_ref().take(batch_size)); |
It's not shorter, but does seem more readable?
(more below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL. Thank you @scovich
| if metadata.is_some_and(str::is_empty) { | ||
| Ok("") | ||
| } else { | ||
| Err(ArrowError::InvalidArgumentError( | ||
| "Virtual column extension type expects an empty string as metadata".to_owned(), | ||
| )) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: is a match simpler?
| if metadata.is_some_and(str::is_empty) { | |
| Ok("") | |
| } else { | |
| Err(ArrowError::InvalidArgumentError( | |
| "Virtual column extension type expects an empty string as metadata".to_owned(), | |
| )) | |
| } | |
| match metadata { | |
| Some(&"") => Ok(""), | |
| _ => Err(ArrowError::InvalidArgumentError( | |
| "Virtual column extension type expects an empty string as metadata".to_owned(), | |
| )), | |
| } |
or even
| if metadata.is_some_and(str::is_empty) { | |
| Ok("") | |
| } else { | |
| Err(ArrowError::InvalidArgumentError( | |
| "Virtual column extension type expects an empty string as metadata".to_owned(), | |
| )) | |
| } | |
| if let Some(&"") = metadata { | |
| return Ok(""); | |
| }; | |
| Err(ArrowError::InvalidArgumentError( | |
| "Virtual column extension type expects an empty string as metadata".to_owned(), | |
| )) |
Jefffrey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable to me 👍
0c9e12d to
0c019db
Compare
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @friendlymatthew and @scovich and @Jefffrey -- this is really cool. I can't wait to get this integrated into downstream projects
|
|
||
| fn consume_batch(&mut self) -> Result<ArrayRef> { | ||
| Ok(Arc::new(Int64Array::from_iter( | ||
| self.buffered_indices.drain(..), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of the time each ArrayReader will be instantiated for pages from exactly one row group we can probably make this significantly faster by optimizing the case when reading from only a single row group.
The other thing would be to pre-calculate the arrays (where the row group values don't change) and return the same ArrayRef until the RowGroup actually changes
However, there is no reason we need to do that now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed:
- perf: optimize
RowGroupIndexReaderfor single row group reads #9180 - perf: cache
ArrayRefinRowGroupIndexReaderwhen row group doesn't change #9181
I wonder if we should have the same optimization for RowNumberReader 🤔
|
|
||
| fn skip_records(&mut self, num_records: usize) -> Result<usize> { | ||
| // TODO: Use advance_by when it stabilizes to improve performance | ||
| Ok((self.remaining_indices.by_ref()).take(num_records).count()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0c019db to
4437dc6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @friendlymatthew @Jefffrey and @scovich

Which issue does this PR close?
Rationale for this change
This PR adds support for storing row group indices as a virtual column, allowing users to determine which row group each row originated from
The usage pattern is quite simple, something like: