[FEAT] Approximate quantile aggregation (pulled into main) #2179
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #2179 +/- ##
=======================================
Coverage ? 85.33%
=======================================
Files ? 69
Lines ? 7458
Branches ? 0
=======================================
Hits ? 6364
Misses ? 1094
Partials ? 0
Overall looking really good! A few high-level things:
- Add the expression to the docs
- Do we want to expose top-level approx_percentile expressions for this (on df and grouped df), so users can do df.approx_percentiles("values") or df.group_by('group').approx_percentiles("values")?
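To make the shape of that suggestion concrete, here is a minimal plain-Python sketch of the semantics such helpers might have: a single float yields one value, a list yields one value per requested percentile, and the grouped form returns one result per group. The names `percentiles_of` and `grouped_percentiles_of` are hypothetical, this is not Daft code, and an exact interpolated percentile stands in for the approximate sketch-based one.

```python
from collections import defaultdict


def percentiles_of(values, percentiles):
    """Exact stand-in for an approximate percentile aggregation (hypothetical helper).

    A single number returns one value; a list returns a list of values.
    """
    scalar = isinstance(percentiles, (int, float))
    qs = [percentiles] if scalar else list(percentiles)
    # Nulls are skipped here; that is an assumption of this sketch.
    ordered = sorted(v for v in values if v is not None)
    if not ordered:
        return None
    results = []
    for q in qs:
        if not 0.0 <= q <= 1.0:
            raise ValueError("percentiles must be between 0.0 and 1.0")
        # Linear interpolation between the two closest ranks.
        pos = q * (len(ordered) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(ordered) - 1)
        results.append(ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo))
    return results[0] if scalar else results


def grouped_percentiles_of(rows, key, value, percentiles):
    """Grouped variant: one percentile result per distinct key (hypothetical helper)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {g: percentiles_of(vs, percentiles) for g, vs in groups.items()}


rows = [
    {"group": "a", "values": 1.0},
    {"group": "a", "values": 3.0},
    {"group": "b", "values": 10.0},
]
median_all = percentiles_of([r["values"] for r in rows], 0.5)
median_by_group = grouped_percentiles_of(rows, "group", "values", 0.5)
```

Whether the top-level methods should mirror the expression signature exactly (float or list of floats) is the open question in the comment above; the sketch simply assumes they would.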
daft/expressions/expressions.py
@@ -434,6 +434,21 @@ def sum(self) -> Expression:
        expr = self._expr.sum()
        return Expression._from_pyexpr(expr)

    def approx_percentiles(self, percentiles: builtins.float | builtins.list[builtins.float]) -> Expression:
        """Calculates the approximate percentile(s) for a float column
Would be cool to enable support for temporal columns in the future
We should actually expose this functionality to support all ordered columns (numeric, temporal, strings...).
Then we can leverage this for our sorts. Currently our sorts use their own bespoke solution for sampling boundaries before repartitioning, but we could leverage approx_percentiles to generate those boundaries for us!
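As a rough illustration of the idea in the comment above, the sketch below derives range-partition boundaries from quantiles of a sample, routes each value to a partition, and sorts partitions independently. It is plain Python under stated assumptions, not Daft's sort implementation: `sample_boundaries`, `range_partition`, and `partitioned_sort` are hypothetical names, and `statistics.quantiles` on a random sample stands in for an approximate percentile aggregation.

```python
import bisect
import random
import statistics


def sample_boundaries(values, num_partitions, sample_size=100, seed=0):
    """Estimate num_partitions - 1 cut points from a random sample (hypothetical helper)."""
    rng = random.Random(seed)
    sample = rng.sample(values, min(sample_size, len(values)))
    # statistics.quantiles returns n - 1 sorted cut points for n equal-probability bins.
    return statistics.quantiles(sample, n=num_partitions)


def range_partition(values, boundaries):
    """Route each value to the partition whose range contains it."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for v in values:
        parts[bisect.bisect_right(boundaries, v)].append(v)
    return parts


def partitioned_sort(values, num_partitions):
    """Sort by range-partitioning first, then sorting each partition on its own."""
    boundaries = sample_boundaries(values, num_partitions)
    parts = range_partition(values, boundaries)
    # Partitions cover disjoint, increasing ranges, so concatenation is globally sorted.
    return [v for part in parts for v in sorted(part)]


data_rng = random.Random(42)
data = [data_rng.uniform(0.0, 1000.0) for _ in range(1000)]
result = partitioned_sort(data, num_partitions=4)
```

The output is correct for any choice of boundaries; the quality of the quantile estimate only affects how evenly the rows are spread across partitions, which is why an approximate aggregation would be sufficient here.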
LGTM!
FYI @maxime-petitjean!
Puts the finishing touches on #2076