Add advanced_parquet_index.rs example of indexing into parquet files #10701

Merged: 13 commits, Jun 22, 2024
1 change: 1 addition & 0 deletions datafusion-examples/README.md
@@ -45,6 +45,7 @@ cargo run --example csv_sql
- [`advanced_udaf.rs`](examples/advanced_udaf.rs): Define and invoke a more complicated User Defined Aggregate Function (UDAF)
- [`advanced_udf.rs`](examples/advanced_udf.rs): Define and invoke a more complicated User Defined Scalar Function (UDF)
- [`advanced_udwf.rs`](examples/advanced_udwf.rs): Define and invoke a more complicated User Defined Window Function (UDWF)
- [`advanced_parquet_index.rs`](examples/advanced_parquet_index.rs): Creates a detailed secondary index that covers the contents of several parquet files
- [`avro_sql.rs`](examples/avro_sql.rs): Build and run a query plan from a SQL statement against a local AVRO file
- [`catalog.rs`](examples/catalog.rs): Register the table into a custom catalog
- [`csv_sql.rs`](examples/csv_sql.rs): Build and run a query plan from a SQL statement against a local CSV file
664 changes: 664 additions & 0 deletions datafusion-examples/examples/advanced_parquet_index.rs

Large diffs are not rendered by default.
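
Since the example file is not rendered on this page, here is a rough, self-contained sketch of what a secondary index covering several parquet files might look like. It is not the code from `advanced_parquet_index.rs`: `FileIndex`, `RowGroupStats`, `build_example_index`, and the file names are hypothetical, and it keeps only min/max statistics per row group instead of using any DataFusion API.

```rust
use std::collections::BTreeMap;

/// Hypothetical min/max statistics for one integer column in one row group.
#[derive(Debug, Clone, Copy)]
struct RowGroupStats {
    min: i64,
    max: i64,
}

/// Hypothetical secondary index: file name -> statistics for each row group.
#[derive(Debug, Default)]
struct FileIndex {
    files: BTreeMap<String, Vec<RowGroupStats>>,
}

impl FileIndex {
    /// Record the row group statistics for one file.
    fn add_file(&mut self, name: &str, row_groups: Vec<RowGroupStats>) {
        self.files.insert(name.to_string(), row_groups);
    }

    /// Return each file, with the row group indexes inside it, whose
    /// min/max range may contain `value`. Files with no match are omitted.
    fn lookup(&self, value: i64) -> Vec<(String, Vec<usize>)> {
        self.files
            .iter()
            .filter_map(|(name, groups)| {
                let hits: Vec<usize> = groups
                    .iter()
                    .enumerate()
                    .filter(|(_, s)| s.min <= value && value <= s.max)
                    .map(|(i, _)| i)
                    .collect();
                if hits.is_empty() {
                    None
                } else {
                    Some((name.clone(), hits))
                }
            })
            .collect()
    }
}

/// Build a small index over two made-up files.
fn build_example_index() -> FileIndex {
    let mut index = FileIndex::default();
    index.add_file(
        "a.parquet",
        vec![
            RowGroupStats { min: 0, max: 99 },
            RowGroupStats { min: 100, max: 199 },
        ],
    );
    index.add_file(
        "b.parquet",
        vec![
            RowGroupStats { min: 150, max: 250 },
            RowGroupStats { min: 300, max: 399 },
        ],
    );
    index
}

fn main() {
    let index = build_example_index();
    // Only the listed row groups would need to be read for `value = 150`.
    println!("{:?}", index.lookup(150));
}
```

The point of such an index is that a query can consult it before opening any file and then read only the row groups that might match.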

7 changes: 7 additions & 0 deletions datafusion/common/src/column.rs
@@ -127,6 +127,13 @@ impl Column {
})
}

/// return the column's name.
Review comment from the PR author (Contributor):

I added some small accessor APIs to make the example easier to read. I can revert them or move them to a separate PR if reviewers prefer.

///
/// Note: This ignores the relation and returns the column name only.
pub fn name(&self) -> &str {
&self.name
}

/// Serialize column into a flat name string
pub fn flat_name(&self) -> String {
match &self.relation {
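
To illustrate the difference between the new `name()` accessor and the existing `flat_name()`, here is a minimal sketch. It assumes a simplified stand-in for `Column` in which `relation` is a plain `Option<String>`; the real DataFusion type uses a table reference type there, and the `new` constructor shown is hypothetical.

```rust
/// Simplified stand-in for DataFusion's `Column` (assumption: the real
/// `relation` field is a table reference type, not a plain `String`).
#[derive(Debug, Clone)]
struct Column {
    relation: Option<String>,
    name: String,
}

impl Column {
    /// Hypothetical convenience constructor for this sketch.
    fn new(relation: Option<&str>, name: &str) -> Self {
        Self {
            relation: relation.map(|r| r.to_string()),
            name: name.to_string(),
        }
    }

    /// Return the column's name, ignoring the relation.
    fn name(&self) -> &str {
        &self.name
    }

    /// Serialize the column into a flat name string, including the
    /// relation when one is present.
    fn flat_name(&self) -> String {
        match &self.relation {
            Some(r) => format!("{}.{}", r, self.name),
            None => self.name.clone(),
        }
    }
}

fn main() {
    let qualified = Column::new(Some("t1"), "id");
    println!("name = {}, flat_name = {}", qualified.name(), qualified.flat_name());
}
```

`name()` borrows from the column and allocates nothing, which is convenient when the caller only needs to look the column up by its unqualified name.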
7 changes: 7 additions & 0 deletions datafusion/common/src/config.rs
@@ -1393,6 +1393,13 @@ pub struct TableParquetOptions {
pub key_value_metadata: HashMap<String, Option<String>>,
}

impl TableParquetOptions {
/// Return new default TableParquetOptions
pub fn new() -> Self {
Self::default()
}
}

impl ConfigField for TableParquetOptions {
fn visit<V: Visit>(&self, v: &mut V, key_prefix: &str, description: &'static str) {
self.global.visit(v, key_prefix, description);
@@ -139,6 +139,11 @@ impl ParquetAccessPlan {
self.set(idx, RowGroupAccess::Skip);
}

/// scan the i-th row group
pub fn scan(&mut self, idx: usize) {
self.set(idx, RowGroupAccess::Scan);
}

/// Return true if the i-th row group should be scanned
pub fn should_scan(&self, idx: usize) -> bool {
self.row_groups[idx].should_scan()
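
The `scan` / `skip` / `should_scan` trio shown in this hunk can be pictured with a small stand-alone model. This is a sketch under assumptions, not the DataFusion type: `AccessPlan` below is a hypothetical reduction of `ParquetAccessPlan` with only two access variants, and `new_none` and `scanned` are helpers written for this illustration.

```rust
/// What to do with one row group (simplified to two variants for this sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
enum RowGroupAccess {
    Skip,
    Scan,
}

impl RowGroupAccess {
    fn should_scan(&self) -> bool {
        matches!(self, RowGroupAccess::Scan)
    }
}

/// Hypothetical reduction of `ParquetAccessPlan`: one entry per row group.
#[derive(Debug, Clone, PartialEq)]
struct AccessPlan {
    row_groups: Vec<RowGroupAccess>,
}

impl AccessPlan {
    /// Start with every one of the `n` row groups skipped.
    fn new_none(n: usize) -> Self {
        Self {
            row_groups: vec![RowGroupAccess::Skip; n],
        }
    }

    /// Set the access for the i-th row group.
    fn set(&mut self, idx: usize, access: RowGroupAccess) {
        self.row_groups[idx] = access;
    }

    /// Skip the i-th row group.
    fn skip(&mut self, idx: usize) {
        self.set(idx, RowGroupAccess::Skip);
    }

    /// Scan the i-th row group.
    fn scan(&mut self, idx: usize) {
        self.set(idx, RowGroupAccess::Scan);
    }

    /// Return true if the i-th row group should be scanned.
    fn should_scan(&self, idx: usize) -> bool {
        self.row_groups[idx].should_scan()
    }

    /// Indexes of all row groups currently marked for scanning.
    fn scanned(&self) -> Vec<usize> {
        (0..self.row_groups.len())
            .filter(|&i| self.should_scan(i))
            .collect()
    }
}

fn main() {
    // An external index decided that only row groups 1 and 3 are relevant.
    let mut plan = AccessPlan::new_none(4);
    plan.scan(1);
    plan.scan(3);
    println!("row groups to scan: {:?}", plan.scanned());
}
```

Having both `scan` and `skip` lets a caller start from "nothing" and opt row groups in, or start from "everything" and opt row groups out, whichever matches how its index reports matches.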
4 changes: 2 additions & 2 deletions datafusion/core/src/datasource/physical_plan/parquet/mod.rs
@@ -186,9 +186,9 @@ pub use writer::plan_to_parquet;
/// let exec = ParquetExec::builder(file_scan_config).build();
/// ```
///
-/// For a complete example, see the [`parquet_index_advanced` example]).
+/// For a complete example, see the [`advanced_parquet_index` example]).
///
-/// [`parquet_index_advanced` example]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_index_advanced.rs
+/// [`parquet_index_advanced` example]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs
///
/// # Execution Overview
///
@@ -54,7 +54,7 @@ impl RowGroupAccessPlanFilter {
Self { access_plan }
}

-/// Return true if there are no row groups to scan
+/// Return true if there are no row groups
pub fn is_empty(&self) -> bool {
self.access_plan.is_empty()
}
10 changes: 8 additions & 2 deletions datafusion/core/src/physical_optimizer/pruning.rs
@@ -471,8 +471,10 @@ pub struct PruningPredicate {
/// Original physical predicate from which this predicate expr is derived
/// (required for serialization)
orig_expr: Arc<dyn PhysicalExpr>,
-/// [`LiteralGuarantee`]s that are used to try and prove a predicate can not
-/// possibly evaluate to `true`.
+/// [`LiteralGuarantee`]s used to try and prove a predicate can not possibly
+/// evaluate to `true`.
+///
+/// See [`PruningPredicate::literal_guarantees`] for more details.
literal_guarantees: Vec<LiteralGuarantee>,
}

@@ -595,6 +597,10 @@ impl PruningPredicate {
}

/// Returns a reference to the literal guarantees
///
/// Note that **All** `LiteralGuarantee`s must be satisfied for the
/// expression to possibly be `true`. If any is not satisfied, the
/// expression is guaranteed to be `null` or `false`.
pub fn literal_guarantees(&self) -> &[LiteralGuarantee] {
&self.literal_guarantees
}
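
The "all guarantees must be satisfied" rule in the doc comment above can be sketched as follows. This is an illustrative model only: the `LiteralGuarantee` and `Guarantee` types here are simplified stand-ins restricted to `i64` literals, and `may_be_satisfied`, `may_be_true`, and `set` are hypothetical helpers, not DataFusion functions. The sketch assumes the caller knows the set of distinct values each column takes in some container (for example, a file).

```rust
use std::collections::{HashMap, HashSet};

/// Whether the column must be one of the literals, or none of them.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Guarantee {
    In,
    NotIn,
}

/// Simplified stand-in for a literal guarantee over `i64` values.
#[derive(Debug, Clone)]
struct LiteralGuarantee {
    column: String,
    guarantee: Guarantee,
    literals: HashSet<i64>,
}

/// Hypothetical helper: build a set from a slice.
fn set(values: &[i64]) -> HashSet<i64> {
    values.iter().copied().collect()
}

impl LiteralGuarantee {
    fn new(column: &str, guarantee: Guarantee, literals: &[i64]) -> Self {
        Self {
            column: column.to_string(),
            guarantee,
            literals: set(literals),
        }
    }

    /// Can this guarantee hold for at least one known value of the column?
    /// Unknown columns are treated conservatively as "maybe".
    fn may_be_satisfied(&self, known: &HashMap<String, HashSet<i64>>) -> bool {
        match known.get(&self.column) {
            None => true,
            Some(values) => match self.guarantee {
                Guarantee::In => values.iter().any(|v| self.literals.contains(v)),
                Guarantee::NotIn => values.iter().any(|v| !self.literals.contains(v)),
            },
        }
    }
}

/// The predicate can only be `true` if every guarantee may be satisfied.
/// If any one fails, the container can be pruned.
fn may_be_true(
    guarantees: &[LiteralGuarantee],
    known: &HashMap<String, HashSet<i64>>,
) -> bool {
    guarantees.iter().all(|g| g.may_be_satisfied(known))
}

fn main() {
    // Guarantees for a predicate such as `a IN (1, 2) AND b != 5`.
    let guarantees = vec![
        LiteralGuarantee::new("a", Guarantee::In, &[1, 2]),
        LiteralGuarantee::new("b", Guarantee::NotIn, &[5]),
    ];
    let mut file = HashMap::new();
    file.insert("a".to_string(), set(&[2, 3]));
    file.insert("b".to_string(), set(&[5]));
    // `b` is always 5 in this file, so the second guarantee fails.
    println!("may be true: {}", may_be_true(&guarantees, &file));
}
```

Note the use of `all`: a single unsatisfiable guarantee is enough to rule the container out, which is what makes the guarantees useful for pruning.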