Skip to content

Conversation

@friendlymatthew
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

This PR adds support for storing row group indices as a virtual column, allowing users to determine which row group each row originated from

The usage pattern is quite simple, something like:

use parquet::arrow::RowGroupIndex;

let row_group_index_field = Arc::new(
    Field::new("row_group_index", DataType::Int64, false)
        .with_extension_type(RowGroupIndex)
);

let options = ArrowReaderOptions::new()
    .with_virtual_columns(vec![row_group_index_field])?;

let reader = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?
    .build()?;

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jan 8, 2026
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/virtual-row-group-index branch 4 times, most recently from 073e3d5 to 0c9e12d Compare January 8, 2026 16:08
@friendlymatthew
Copy link
Contributor Author

@Jefffrey if you have some bandwidth, I'd be curious to get your thoughts

@friendlymatthew
Copy link
Contributor Author

cc @alamb

@Jefffrey
Copy link
Contributor

Jefffrey commented Jan 9, 2026

@Jefffrey if you have some bandwidth, I'd be curious to get your thoughts

I'll see if I can take a look at this PR soon, though I will say it's been a while since I looked at the parquet codebase 😅

@adriangb
Copy link
Contributor

adriangb commented Jan 9, 2026

This would be sweet! I like the idea of using an extension type as a marker, and letting the caller customize the column name.

@alamb
Copy link
Contributor

alamb commented Jan 9, 2026

I think the extension type marker was pioneered by @jkylling and @vustef

I will also try and review this sooner rather than later

Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! (a few nits)

fn read_records(&mut self, batch_size: usize) -> Result<usize> {
let starting_len = self.buffered_indices.len();
self.buffered_indices
.extend((&mut self.remaining_indices).take(batch_size));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I know this is how the row index reader did it, but since that code merged I learned that Iterator::by_ref is a thing.

Suggested change
.extend((&mut self.remaining_indices).take(batch_size));
.extend(self.remaining_indices.by_ref().take(batch_size));

It's not shorter, but does seem more readable?

(more below)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL. Thank you @scovich

Comment on lines +51 to +57
if metadata.is_some_and(str::is_empty) {
Ok("")
} else {
Err(ArrowError::InvalidArgumentError(
"Virtual column extension type expects an empty string as metadata".to_owned(),
))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is a match simpler?

Suggested change
if metadata.is_some_and(str::is_empty) {
Ok("")
} else {
Err(ArrowError::InvalidArgumentError(
"Virtual column extension type expects an empty string as metadata".to_owned(),
))
}
match metadata {
Some(&"") => Ok(""),
_ => Err(ArrowError::InvalidArgumentError(
"Virtual column extension type expects an empty string as metadata".to_owned(),
)),
}

or even

Suggested change
if metadata.is_some_and(str::is_empty) {
Ok("")
} else {
Err(ArrowError::InvalidArgumentError(
"Virtual column extension type expects an empty string as metadata".to_owned(),
))
}
if let Some(&"") = metadata {
return Ok("");
};
Err(ArrowError::InvalidArgumentError(
"Virtual column extension type expects an empty string as metadata".to_owned(),
))

Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable to me 👍

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @friendlymatthew and @scovich and @Jefffrey -- this is really cool. I can't wait to get this integrated into downstream projects


fn consume_batch(&mut self) -> Result<ArrayRef> {
Ok(Arc::new(Int64Array::from_iter(
self.buffered_indices.drain(..),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the time each ArrayReader will be instantiated for pages from exactly one row group we can probably make this significantly faster by optimizing the case when reading from only a single row group.

The other thing would be to pre-calculate the arrays (where the row group values don't change) and return the same ArrayRef until the RowGroup actually changes

However, there is no reason we need to do that now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


fn skip_records(&mut self, num_records: usize) -> Result<usize> {
// TODO: Use advance_by when it stabilizes to improve performance
Ok((self.remaining_indices.by_ref()).take(num_records).count())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any test coverage for this method.

I ran code coverage

cargo llvm-cov --html --all-features -p parquet

And skip doesn't seem to be covered:

Image

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/virtual-row-group-index branch from 0c019db to 4437dc6 Compare January 14, 2026 23:15
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb alamb merged commit d81d6c3 into apache:main Jan 15, 2026
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Parquet] Support returning RowGroupIndex as a column

5 participants