Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Window Functions for use with function builder #808

Merged

Conversation

timsaucer
Copy link
Contributor

@timsaucer timsaucer commented Aug 13, 2024

Which issue does this PR close?

This addresses part of #780 but we need a follow on PR to handle the aggregates.

Rationale for this change

In DataFusion 41 we have added the ability to create builders for the parameters for window and aggregate functions. These are more expressive when it comes to setting the window frame, ordering, partition, and null handling. This MR exposes this functionality on the python side.

What changes are included in this PR?

Added functions

  • lead
  • lag
  • row_number
  • rank
  • dense_rank
  • percent_rank
  • cume_dist
  • ntile

Also updated the window function page of the online documentation to include links to these, examples of how to use them, and a description of how to set the additional parameters.

Updated the unit tests.

Are there any user-facing changes?

There are no breaking changes, but the old functions.window has been marked as deprecated. We can change this if needed because it is currently the only way to use aggregate functions as window functions. That is to be addressed in #833

@Michael-J-Ward
Copy link
Contributor

Note for when you rebase:

I yanked deprecate Expr::display_name from #802. I must have pinned the deps a little later than what was released because the upstream change wasn't included.

@timsaucer
Copy link
Contributor Author

I think the window functions are now all good to go. I've updated the documentation. Next up is the aggregate functions, but most of them have a jump start already.

I'll rebase on main once the DF41 updates get merged

@timsaucer timsaucer force-pushed the feature/aggregate_and_window_builders branch from 35547c8 to dff27ad Compare August 24, 2024 02:03
@timsaucer timsaucer changed the title Aggregate and Window function builders Add Window Functions for use with function builder Aug 24, 2024
@timsaucer
Copy link
Contributor Author

Since this is already getting a bit long, I'd like to merge it with just the window functions and deal with the aggregate functions in a different PR.

@timsaucer timsaucer marked this pull request as ready for review August 24, 2024 10:35
Copy link
Contributor

@Michael-J-Ward Michael-J-Ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't a blocking thought but still worth considering.


The Builder pattern in rust is mostly a workaround for not having default / keyword arguments.

So I wonder if the more pythonic API to expose would be

f.rank().agg(partition_by=..., order_by=...)

class Expr:
    def agg(self, ** partition_by=None, order_by=None):
        ...

The 2 things that can cause build to fail are

  1. Not using a sort expression in order_by
  2. Not using a function that can be a Window or Aggregate

Can we prevent such misuse from our python wrappers?

https://github.com/apache/datafusion/blob/14d640433eac9314f37edc1d12ff3260a4778923/datafusion/expr/src/expr_fn.rs#L795C29-L809

.order_by(col('"Attack"').sort(ascending=True))
.build()
.alias("rank"),
).sort(col('"Type 1"').sort(), col('"Attack"').sort())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any requirement or constraint on the final sort and the partition_by / order_by args to the window function?

Or is this just for a nicer, more interpretable output?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is, but it's an annoyance I've been thinking about fixing. I've moved this to draft so I can tighten that up.

)

return Expr(
f.approx_percentile_cont(
expression.expr, percentile.expr, distinct=distinct, num_centroids=num_centroids.expr
expression.expr,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this a black formatting change?

If so, is our CI not catching these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was auto applied, so I'm not completely sure what happened.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, it looks like we are not using ci/scripts/python_lint.sh which I'm guessing we were at one point. Instead the ruff checks happen in .github/workflows/build.yml. This only runs ruff check and not ruff format but pre-commit does run ruff format. That's why these changes are happening in my commits but not failing CI.

src/functions.rs Outdated
builder = builder.null_treatment(null_treatment.map(DFNullTreatment::from));

Ok(builder.build()?.into())
pub fn last_value(expr: PyExpr) -> PyExpr {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything preventing also updating of first_value?

@Michael-J-Ward
Copy link
Contributor

IDK if you have used polars much, but I found their approach to aggregation and windowing pretty intuitive and powerful.

Maybe I should do a deeper comparison by writing their examples in datafusion-python.

@timsaucer timsaucer marked this pull request as draft August 25, 2024 00:16
@timsaucer
Copy link
Contributor Author

That's a great suggestion about using the defaults instead. I'll convert this over to that approach. It will be might tighter interface.

@timsaucer timsaucer force-pushed the feature/aggregate_and_window_builders branch from 616a748 to 070b595 Compare August 31, 2024 12:24
@timsaucer
Copy link
Contributor Author

I've rebased on main and I think the window functions are good to go. Follow on work:

#688

Address Aggregates for #780

@timsaucer timsaucer marked this pull request as ready for review August 31, 2024 22:02
@andygrove andygrove merged commit 90f5b5b into apache:main Sep 2, 2024
15 checks passed
@timsaucer timsaucer deleted the feature/aggregate_and_window_builders branch September 2, 2024 15:21
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants