Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor: Add additional docstrings to Window function implementations #6592

Merged
merged 6 commits into from
Jun 9, 2023

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jun 7, 2023

Which issue does this PR close?

Related to #5781

Rationale for this change

As I begin working out how to expose User Defined Window Functions, I want to

  1. Make sure the documentation is up to date
  2. Make sure I understand the code sufficiently

What changes are included in this PR?

  1. Add a bunch of doc comments to explain various window related functionality better
  2. Make PartitionEvaluator public UPDATE: This PR just adds documentation, it doesn't change any APIs

Are these changes tested?

Yes, with CI

Are there any user-facing changes?

better rustdocs, hopefully

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions labels Jun 7, 2023
@@ -69,6 +73,8 @@ pub trait BuiltInWindowFunctionExpr: Send + Sync + std::fmt::Debug {
false
}

/// If returns true, [`Self::create_evaluator`] must implement
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was a little unclear on why use_window_frame was part of BuiltInWindowFunctionExpr and not PartitionEvaluator but I haven't looked into it in more detail

@@ -32,6 +32,7 @@ mod window_frame_state;
pub use aggregate::PlainAggregateWindowExpr;
pub use built_in::BuiltInWindowExpr;
pub use built_in_window_function_expr::BuiltInWindowFunctionExpr;
pub use partition_evaluator::PartitionEvaluator;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PartitionEvaluator now appears in public docstrings (and I propose to make it part of the user defined window function API -- see #5781 (comment))

/// There are two types of `PartitionEvaluator`:
///
/// # Stateless `PartitionEvaluator`
///
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mustafasrepo / @ozankabak if you have time to help me describe more clearly what Stateful and Stateless PartitionEvaluators are , and specifically what is different between them, I would be most appreciative

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some builtin window functions use window frame information inside the window expression (those are FIRST_VALUE, LAST_VALUE, NTH_VALUE). However, for most of the window functions what is in the window frame is not important (those are ROW_NUMBER, RANK, DENSE_RANK, PERCENT_RANK, CUME_DIST, LEAD, LAG). For the ones, using window_frame PartitionEvaluator::evaluate_inside_range is called. For the ones that do not use window frame PartitionEvaluator::evaluate is called (For rank calculations, PartitionEvaluator::evaluate_with_rank is called since its API is quite different. However, it doesn't use window frame either.)

PartitionEvaluator::evaluate_stateful is used only when we produce window result with bounded memory(When window functions are called from the BoundedWindowAggExec). In this case window results are calculated in running fashion, hence we need to store previous state, to be able to calculate correct output (For instance, for ROW_NUMBER function the current batch evaluator receive may not be the first batch. Hence we cannot start row_number from 0, we need to start from last ROW_NUMBER produced for the previous batches received. Similarly, we need to store some information in the state. When we do not receive whole table as a single batch)

Currently, we have support for bounded(stateful) execution for FIRST_VALUE, LAST_VALUE, NTH_VALUE, ROW_NUMBER, RANK, DENSE_RANK, LEAD, LAG.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @mustafasrepo -- this is super helpful. I am incorporating this information into this PR

/// A window expr that takes the form of an aggregate function that
/// can be incrementally computed over sliding windows.
///
/// See comments on [`WindowExpr`] for more details.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consolidated this description into WindowExpr

@alamb alamb marked this pull request as ready for review June 7, 2023 20:55
@alamb alamb added the documentation Improvements or additions to documentation label Jun 7, 2023
@alamb alamb changed the title Add additional docstrings to Window function implementations Minor: Add additional docstrings to Window function implementations Jun 8, 2023
@alamb alamb requested a review from mustafasrepo June 8, 2023 17:52
@alamb
Copy link
Contributor Author

alamb commented Jun 8, 2023

@mustafasrepo / @ozankabak I wonder if you have time to review this documentation and check if I got it right? This PR has no code changes in it

@alamb alamb mentioned this pull request Jun 8, 2023
4 tasks
@github-actions github-actions bot removed the documentation Improvements or additions to documentation label Jun 8, 2023
/// Construct a new [`BuiltInWindowFunctionExpr`] that produces
/// the same result as this function on a window with reverse
/// order. The return value of this function is used by the
/// DataFusion optimizer to avoid resorting the data when
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can use re-sorting instead of resorting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in d56a5b2

Copy link
Contributor

@mustafasrepo mustafasrepo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alamb for this PR. This PR is LGTM!.

@alamb alamb merged commit 5032060 into apache:main Jun 9, 2023
@alamb alamb deleted the alamb/add_window_function_docs branch June 9, 2023 13:28
jayzhan211 pushed a commit to jayzhan211/arrow-datafusion that referenced this pull request Jun 12, 2023
…pache#6592)

* Add additional docstrings to Window function implementations

* Update docs

* updates

* fix doc link

* Change resorting --> re-sorting
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants