Advanced example for building an external index for Row Groups *within* parquet files #10580

Closed
alamb opened this issue May 20, 2024 · 1 comment · Fixed by #10701

alamb commented May 20, 2024

Is your feature request related to a problem or challenge?

It is common in databases and other analytic systems to have additional external "indexes" (perhaps stored in the "metadata catalog", perhaps stored alongside the data files, perhaps embedded in the files, perhaps elsewhere).

These indexes are used to speed up queries by "pruning": specifically, evaluating a predicate on the index and then reading only the portions of files that could pass the filters in the query. In #10546 we showed how to create an index for entire files.

I would also like to create an example of how to create such an index for row groups within a file (showing how to read the file without re-reading the metadata each time).
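To make the idea concrete, here is a minimal sketch of what a row-group-level index could look like. It is not the DataFusion or parquet API: `RowGroupStats`, `RowGroupIndex`, `prune`, and `example_index` are hypothetical names, the index covers a single `i64` column, and the statistics are hard-coded rather than extracted from real parquet metadata. A real example would populate the index from parquet statistics and hand the surviving row groups to the parquet reader.

```rust
use std::collections::HashMap;

/// Min/max statistics for one column in one row group.
/// (Hypothetical layout, for illustration only.)
#[derive(Debug, Clone, PartialEq)]
pub struct RowGroupStats {
    pub row_group: usize,
    pub min: i64,
    pub max: i64,
}

/// External index: file name -> per-row-group statistics for one column.
/// It is built once and kept outside the files, so answering a query does
/// not require re-reading the parquet footer.
#[derive(Debug, Default)]
pub struct RowGroupIndex {
    files: HashMap<String, Vec<RowGroupStats>>,
}

impl RowGroupIndex {
    /// Record the statistics for every row group of `file`.
    pub fn add_file(&mut self, file: &str, stats: Vec<RowGroupStats>) {
        self.files.insert(file.to_string(), stats);
    }

    /// Return the row groups of `file` that might contain a value in
    /// `lo..=hi`. Row groups whose [min, max] range cannot overlap the
    /// query range are pruned. Returns `None` if the file is not indexed,
    /// in which case the caller has to scan every row group.
    pub fn prune(&self, file: &str, lo: i64, hi: i64) -> Option<Vec<usize>> {
        let stats = self.files.get(file)?;
        Some(
            stats
                .iter()
                .filter(|s| s.max >= lo && s.min <= hi)
                .map(|s| s.row_group)
                .collect(),
        )
    }
}

/// Build a small index with made-up statistics for one file.
pub fn example_index() -> RowGroupIndex {
    let mut index = RowGroupIndex::default();
    index.add_file(
        "a.parquet",
        vec![
            RowGroupStats { row_group: 0, min: 0, max: 99 },
            RowGroupStats { row_group: 1, min: 100, max: 199 },
            RowGroupStats { row_group: 2, min: 200, max: 299 },
        ],
    );
    index
}
```

With this index, a predicate such as `value BETWEEN 150 AND 250` on `a.parquet` keeps row groups 1 and 2 and skips row group 0 without opening the file. Pruning is conservative: a surviving row group may still contain no matching rows, so the filter must still be applied when the data is read.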

To complete this example, I think we need:

  1. The API from @NGA-TRAN in "[EPIC] Efficiently and correctly extract parquet statistics into ArrayRefs" (#10453)
  2. The API described in "API in ParquetExec to pass in RowSelections to ParquetExec (enable custom indexes, finer grained pushdown)" (#9929)

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

This is a follow-on to #10546.


alamb commented Jun 17, 2024

A PR for this example is now ready for review: #10701
