Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve documentation about ParquetExec / Parquet predicate pushdown #11994

Merged
merged 6 commits into from
Aug 16, 2024

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Aug 14, 2024

Which issue does this PR close?

part of #4028

Rationale for this change

While reviewing this code with @itsjunetime, we discovered some interesting things that I would like to encode in comments.

What changes are included in this PR?

  1. Improve documentation in the row pushdown code

Are these changes tested?

Yes, CI

Are there any user-facing changes?

Documentation change only (no functional changes)

Note most of the docs are internal (don't appear on docs.rs)

@github-actions github-actions bot added core Core DataFusion crate common Related to common crate labels Aug 14, 2024
@@ -144,6 +142,29 @@ pub use writer::plan_to_parquet;
/// * User provided [`ParquetAccessPlan`]s to skip row groups and/or pages
/// based on external information. See "Implementing External Indexes" below
///
/// # Predicate Pushdown
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to consolidate the description of what predicate pushdown is done in the ParquetExec

@@ -15,6 +15,49 @@
// specific language governing permissions and limitations
// under the License.

//! Utilities to push down of DataFusion filter predicates (any DataFusion
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is mostly the same content, reformatted and made more concise.

/// For a given set of `Column`s required for predicate `Expr` determine whether
/// all columns are sorted.
///
/// Sorted columns may be queried more efficiently in the presence of
/// a PageIndex.
fn columns_sorted(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting that we never connected up the "columns_sorted" information -- is this on your list @thinkharderdev ?

Should I file a ticket to do this?

@alamb alamb marked this pull request as ready for review August 14, 2024 19:14
@@ -486,6 +486,9 @@ pub trait TreeNodeVisitor<'n>: Sized {
/// A [Visitor](https://en.wikipedia.org/wiki/Visitor_pattern) for recursively
/// rewriting [`TreeNode`]s via [`TreeNode::rewrite`].
///
/// For example you can implement this trait on a struct to rewrite `Expr` or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we add an example of it? 🤔

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks lgtm @alamb it was a nice reading. left some minors

//! 6. Partition the predicates according to whether they are sorted (from step 4)
//! 7. "Compile" each predicate `Expr` to a `DatafusionArrowPredicate`.
//! 8. Build the `RowFilter` with the sorted predicates followed by
//! the unsorted predicates. Within each partition, predicates are
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this explanation is a gem

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @thinkharderdev wrote it back in the day.

This PR just simplifies the wording slightly

/// # Return values
///
/// * `Ok(Some(candidate))` if the expression can be used as an ArrowFilter
/// * `Ok(None)` if the expression cannot be used as an ArrowFilter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@itsjunetime itsjunetime left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This helps a lot with (at least my own) comprehension, I think. Thank you

if let Some(column) = expr.as_any().downcast_ref::<Column>() {
if self.file_schema.field_with_name(column.name()).is_err() {
// the column expr must be in the table schema
// Replace the column reference with a NULL (using the type from the table schema)
// e.g. `column = 'foo'` is rewritten be transformed to `NULL = 'foo'`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is obviously a much better comment than before, but I think it could be further improved with an explanation stating why we do this column rewriting, or what purpose it serves.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree -- I tried to provide this information in c0b9012

Copy link
Contributor Author

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @comphead and @itsjunetime for the review

//! 6. Partition the predicates according to whether they are sorted (from step 4)
//! 7. "Compile" each predicate `Expr` to a `DatafusionArrowPredicate`.
//! 8. Build the `RowFilter` with the sorted predicates followed by
//! the unsorted predicates. Within each partition, predicates are
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @thinkharderdev wrote it back in the day.

This PR just simplifies the wording slightly

if let Some(column) = expr.as_any().downcast_ref::<Column>() {
if self.file_schema.field_with_name(column.name()).is_err() {
// the column expr must be in the table schema
// Replace the column reference with a NULL (using the type from the table schema)
// e.g. `column = 'foo'` is rewritten be transformed to `NULL = 'foo'`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree -- I tried to provide this information in c0b9012

@alamb
Copy link
Contributor Author

alamb commented Aug 16, 2024

Thanks again -- let me know if you have additional suggestions and I'll make them in a follow on PR

@alamb alamb merged commit 2a16704 into apache:main Aug 16, 2024
24 checks passed
@alamb alamb deleted the alamb/parquet_exec_dcs branch August 16, 2024 17:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants