-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve documentation about ParquetExec
/ Parquet predicate pushdown
#11994
Conversation
@@ -144,6 +142,29 @@ pub use writer::plan_to_parquet; | |||
/// * User provided [`ParquetAccessPlan`]s to skip row groups and/or pages | |||
/// based on external information. See "Implementing External Indexes" below | |||
/// | |||
/// # Predicate Pushdown |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to consolidate the description of what predicate pushdown is done in the ParquetExec
@@ -15,6 +15,49 @@ | |||
// specific language governing permissions and limitations | |||
// under the License. | |||
|
|||
//! Utilities to push down of DataFusion filter predicates (any DataFusion |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is mostly the same content, reformatted and made more concise.
/// For a given set of `Column`s required for predicate `Expr` determine whether | ||
/// all columns are sorted. | ||
/// | ||
/// Sorted columns may be queried more efficiently in the presence of | ||
/// a PageIndex. | ||
fn columns_sorted( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is interesting that we never connected up the "columns_sorted" information -- is this on your list @thinkharderdev ?
Should I file a ticket to do this?
@@ -486,6 +486,9 @@ pub trait TreeNodeVisitor<'n>: Sized { | |||
/// A [Visitor](https://en.wikipedia.org/wiki/Visitor_pattern) for recursively | |||
/// rewriting [`TreeNode`]s via [`TreeNode::rewrite`]. | |||
/// | |||
/// For example you can implement this trait on a struct to rewrite `Expr` or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we add an example of it? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks lgtm @alamb it was a nice reading. left some minors
//! 6. Partition the predicates according to whether they are sorted (from step 4) | ||
//! 7. "Compile" each predicate `Expr` to a `DatafusionArrowPredicate`. | ||
//! 8. Build the `RowFilter` with the sorted predicates followed by | ||
//! the unsorted predicates. Within each partition, predicates are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this explanation is a gem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @thinkharderdev wrote it back in the day.
This PR just simplifies the wording slightly
/// # Return values | ||
/// | ||
/// * `Ok(Some(candidate))` if the expression can be used as an ArrowFilter | ||
/// * `Ok(None)` if the expression cannot be used as an ArrowFilter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This helps a lot with (at least my own) comprehension, I think. Thank you
if let Some(column) = expr.as_any().downcast_ref::<Column>() { | ||
if self.file_schema.field_with_name(column.name()).is_err() { | ||
// the column expr must be in the table schema | ||
// Replace the column reference with a NULL (using the type from the table schema) | ||
// e.g. `column = 'foo'` is rewritten be transformed to `NULL = 'foo'` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is obviously a much better comment than before, but I think it could be further improved with an explanation stating why we do this column rewriting, or what purpose it serves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree -- I tried to provide this information in c0b9012
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @comphead and @itsjunetime for the review
//! 6. Partition the predicates according to whether they are sorted (from step 4) | ||
//! 7. "Compile" each predicate `Expr` to a `DatafusionArrowPredicate`. | ||
//! 8. Build the `RowFilter` with the sorted predicates followed by | ||
//! the unsorted predicates. Within each partition, predicates are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @thinkharderdev wrote it back in the day.
This PR just simplifies the wording slightly
if let Some(column) = expr.as_any().downcast_ref::<Column>() { | ||
if self.file_schema.field_with_name(column.name()).is_err() { | ||
// the column expr must be in the table schema | ||
// Replace the column reference with a NULL (using the type from the table schema) | ||
// e.g. `column = 'foo'` is rewritten be transformed to `NULL = 'foo'` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree -- I tried to provide this information in c0b9012
Thanks again -- let me know if you have additional suggestions and I'll make them in a follow on PR |
Which issue does this PR close?
part of #4028
Rationale for this change
While reviewing this code with @itsjunetime, we discovered some interesting things that I would like to encode in comments.
What changes are included in this PR?
Are these changes tested?
Yes, CI
Are there any user-facing changes?
Documentation change only (no functional changes)
Note most of the docs are internal (don't appear on docs.rs)