Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistent naming for Parquet page index structures #6097

Open
etseidl opened this issue Jul 19, 2024 · 8 comments
Open

Consistent naming for Parquet page index structures #6097

etseidl opened this issue Jul 19, 2024 · 8 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@etseidl
Copy link
Contributor

etseidl commented Jul 19, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The Parquet Page Index consists of two thrift structures: a ColumnIndex containing per-page statistics useful for filtering, and an OffsetIndex used to locate individual pages in the Parquet file. Size statistics were recently added to these structures (PARQUET-2261) and these are currently being added to arrow-rs (#5022). When #5022 is completed, the Parquet page index will be represented in the parquet crate by the Index enum for the column index, and a new OffsetIndexMetaData struct for the offset index (replacing the old offset index which simply returned the page_locations member of the thrift OffsetIndex). Somewhat confusingly, the Index encapsulates a vector of PageIndex structs, which contain per-page column index information.

Further, ParquetMetaData packages the page indexes for each row group into type aliased 2-D arrays: ParquetColumnIndex = Vec<Vec<Index>> and ParquetOffsetIndex = Vec<Vec<OffsetIndexMetaData>>.

Describe the solution you'd like
It would be desirable to make the naming for all of these structures a bit more consistent, so their intended use is a bit more obvious. One possible mapping could be:

  1. Rename Index to ParquetColumnIndex (and perhaps PageIndex -> ColumnIndexForPage).
  2. Rename OffsetIndexMetaData to ParquetOffsetIndex.
  3. The type aliases could be made plural, so the current ParquetColumnIndex -> ParquetColumnIndexes and ParquetOffsetIndex -> ParquetOffsetIndexes.

Describe alternatives you've considered
Maintain the status quo, or some other better names.

Additional context
See the comments here and here for more context.

@etseidl etseidl added the enhancement Any new improvement worthy of a entry in the changelog label Jul 19, 2024
@etseidl
Copy link
Contributor Author

etseidl commented Jul 19, 2024

cc @alamb

@alamb
Copy link
Contributor

alamb commented Jul 21, 2024

Thank you for writing this ticket @etseidl. I think the naming of the various metadata structures in the parquet-rs crate to be quite confusing

I believe the core of my confusion stems from the two sets of structures:

  • The structs (from the thrift compiler) that exactly match the parquet thrift structures format::metadata
  • The structs defined in parquet-rs crate that are easier to work with file::metadata and map more directly to parquet structures

The current naming of the structures does not make the distinction clear and this I think the boundary gets muddled

I think the best thing we can do is to pick a naming convention that makes this distinction clear

@alamb
Copy link
Contributor

alamb commented Jul 21, 2024

What do you think about creating a Rust struct PageIndex that contains all page level metadata in it (along with a nicer API) rather than exactly mirroring the multiple thrift structures?

This would be more in line with how the existing RowGroupMetaDataand ColumnChunkMetaData represent the contents of thrift structures.

So that might look like

  1. Deprecate ParquetOffsetIndex and ParquetColumnIndex indvidually
  2. Add a new struct like PageIndexMetadata to access various parts of the PageIndex
  3. Add methods like PageIndexMetadata::offsets() and PageIndexMetadata::indexes() that returns the page offsets and indexes
  4. Add PageIndexMetadata::unencoded_byte_sizes() to access the byte sizes

Maybe we could do something similar to wrap the BloomFilters so ParquetMetadata also had BloomFilterMetada

🤔

cc @sunchao @adriangb @tustvold @XiangpengHao @Ted-Jiang @liukun4515 @viirya in case you have thoughts about the metadata structures

@etseidl
Copy link
Contributor Author

etseidl commented Jul 22, 2024

What do you think about creating a Rust struct PageIndex that contains all page level metadata in it (along with a nicer API) rather than exactly mirroring the multiple thrift structures?

I'm in favor. I like how the current Index takes the struct-of-arrays ColumnIndex and turns it into an array-of-structs so the info for each page is co-located. I could see doing something similar and fusing PageLocation and unencoded_byte_array_data_bytes into a single new OffsetIndex struct.

@sunchao
Copy link
Member

sunchao commented Jul 24, 2024

+1. The current state of Index and OffsetIndexMetaData does seem inconsistent. I like the idea of having a PageIndexMetadata struct which holds all information related to page indexes.

@adriangb
Copy link
Contributor

I don't have a strong opinion, I haven't spent as much time as the rest of you with these structs, but I do find them confusing and am in favor of changes to make it easier to understand 😃

Some top-level documentation of how these things interact could also be helpful.

@etseidl
Copy link
Contributor Author

etseidl commented Jul 24, 2024

Some top-level documentation of how these things interact could also be helpful.

Perhaps we can steal the excellent description here. 😉

@alamb
Copy link
Contributor

alamb commented Jul 25, 2024

I could also commission some more monodraw diagrams 😆

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

4 participants