Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add example for building an external secondary index for parquet files #10549

Merged
merged 13 commits into from
May 31, 2024

Conversation

alamb
Copy link
Contributor

@alamb alamb commented May 16, 2024

Note: While this PR looks very large (715 lines) around half of the content is comments / docstrings

Which issue does this PR close?

Closes #10546

Rationale for this change

See #10546

Building and using external indexes in DataFusion is an important feature. Adding an example of how to do so will help drive the design and APIs

What changes are included in this PR?

New Example

Are these changes tested?

CI

Are there any user-facing changes?

No -- just an example

TODOs

  • Propose a nicer API for extracting the statistics
  • Connect pruning predicate into scan to avoid scanning files
  • File tickets / PRs to make creating ParquetExec easier
  • Try and make some PRs / documentation upstream in the parquet crate to make it easier to work with parquet statistics

@alamb alamb changed the title Alamb/external parquet index Add example for building an external index for parquet files May 16, 2024
@alamb alamb added the documentation Improvements or additions to documentation label May 16, 2024
@alamb alamb force-pushed the alamb/external_parquet_index branch from 55f92ed to b44bffb Compare May 16, 2024 16:12
@github-actions github-actions bot removed the documentation Improvements or additions to documentation label May 16, 2024
@alamb alamb force-pushed the alamb/external_parquet_index branch from 8be8b09 to 460c419 Compare May 22, 2024 11:17
@github-actions github-actions bot removed the core Core DataFusion crate label May 22, 2024
@alamb alamb changed the title Add example for building an external index for parquet files Add example for building an external secondary index for parquet files May 22, 2024
@alamb alamb mentioned this pull request May 22, 2024
2 tasks
@alamb alamb marked this pull request as ready for review May 22, 2024 12:03
use tempfile::TempDir;
use url::Url;

/// This example demonstrates building a secondary index over multiple Parquet
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the example speaks for itself in terms of comments, so I don't plan to add additional ones on the PR unless something is unclear

@alamb
Copy link
Contributor Author

alamb commented May 22, 2024

This PR is now ready for review

@alamb
Copy link
Contributor Author

alamb commented May 27, 2024

@crepererum and @NGA-TRAN -- here is a PR ready for your review that shows how to do file level pruning with statistics.

I will make an example of how to do row group level / page level pruning next

@NGA-TRAN
Copy link
Contributor

I start reviewing this

Copy link
Contributor

@crepererum crepererum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example is a bit long (in lines of code) but I also don't have any concrete recommendation on how to make it shorter w/o sacrificing its good readability. Maybe -- but that's personal taste -- collapse the use statements a bit, e.g. instead of

use foo::Bar;
use foo::Baz;

use

use foo::{Bar, Baz};

// specific language governing permissions and limitations
// under the License.

use arrow::array::{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I collapsed this a bit but paradoxically when I tried to collapse them even more the number of lines actually increased

For example, this goes from 4 lines

use datafusion::datasource::listing::PartitionedFile;
use datafusion::datasource::physical_plan::parquet::{
    RequestedStatistics, StatisticsConverter,
};

to 6 lines (though it is less dense)

use datafusion::datasource::{
  listing::PartitionedFile;
  physical_plan::parquet::{
    RequestedStatistics, StatisticsConverter,
  }
};

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally find the grouped ones easier to parse though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another reason I have heard for using the single line includes is that it reduces merge conflicts when people change / update the includes. I am not sure how relevant this is in this case

FWIW the includes were automatically created by my editor (rust rover).

@alamb
Copy link
Contributor Author

alamb commented May 31, 2024

Thank you very much for the review @crepererum

@alamb alamb merged commit 09dde27 into apache:main May 31, 2024
23 checks passed
@alamb alamb deleted the alamb/external_parquet_index branch May 31, 2024 11:12
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
apache#10549)

* Add example for building an external index for parquet filtes

* Use register_object_store api

* use FileScanConfig API

* Udpate to use new API

* Collapose `use` statements

* fix typo
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Example for building an external index for parquet files
3 participants