[FEAT] Approximate quantile aggregation (pulled into main) #2179

jaychia · 2024-04-24T23:57:00Z

Puts the finishing touches on #2076

…able-level

codecov · 2024-04-27T00:09:55Z

Codecov Report

All modified and coverable lines are covered by tests ✅

❗ No coverage uploaded for pull request base (main@3e5da66).
Report is 1 commit behind head on main.

❗ Current head 77e5fd3 differs from pull request most recent head 7980e58. Consider uploading reports for the commit 7980e58 to get more accurate results

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #2179   +/-   ##
=======================================
  Coverage        ?   85.33%           
=======================================
  Files           ?       69           
  Lines           ?     7458           
  Branches        ?        0           
=======================================
  Hits            ?     6364           
  Misses          ?     1094           
  Partials        ?        0

Files	Coverage Δ
daft/expressions/expressions.py	`92.99% <100.00%> (ø)`

colin-ho

Overall looking really good! A few high-level things:

Add the expression to the docs.
Do we want to expose top-level approx_percentiles expressions for this (on df and grouped df), so users can do df.approx_percentiles("values") or df.group_by("group").approx_percentiles("values")?
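
To make the question concrete, here is a pure-Python stand-in (not Daft code) for what a table-level and a grouped helper could return. The helper names `approx_percentiles` and `grouped_approx_percentiles`, the `(group, value)` row layout, and the exact nearest-rank calculation are all assumptions for illustration; the aggregation in this PR works on a sketch and returns approximate values.

```python
import math
from collections import defaultdict
from typing import Union


def approx_percentiles(values, percentiles: Union[float, list]):
    """Hypothetical stand-in: exact nearest-rank percentiles instead of a sketch."""
    ordered = sorted(values)

    def one(p: float):
        if not 0.0 <= p <= 1.0:
            raise ValueError("percentiles must be within [0, 1]")
        if not ordered:
            return None  # no data, no percentile
        # Nearest-rank: smallest value with at least a fraction p of the data at or below it.
        rank = max(1, math.ceil(p * len(ordered)))
        return ordered[rank - 1]

    # A bare float yields a single value; a list yields a list of the same length.
    if isinstance(percentiles, (int, float)):
        return one(float(percentiles))
    return [one(p) for p in percentiles]


def grouped_approx_percentiles(rows, percentiles):
    """Hypothetical grouped variant: rows are (group, value) pairs."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    return {g: approx_percentiles(vs, percentiles) for g, vs in groups.items()}


rows = [("a", 1.0), ("a", 2.0), ("a", 3.0), ("b", 10.0), ("b", 20.0)]
whole_table = approx_percentiles([v for _, v in rows], [0.5, 1.0])
per_group = grouped_approx_percentiles(rows, 0.5)
```

In this sketch a bare `0.5` and a one-element `[0.5]` deliberately return different shapes (a value versus a list), which is one way the top-level methods could mirror the expression's `float | list[float]` argument.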

colin-ho · 2024-04-30T16:00:10Z

daft/expressions/expressions.py

@@ -434,6 +434,21 @@ def sum(self) -> Expression:
        expr = self._expr.sum()
        return Expression._from_pyexpr(expr)

+    def approx_percentiles(self, percentiles: builtins.float | builtins.list[builtins.float]) -> Expression:
+        """Calculates the approximate percentile(s) for a float column


Would be cool to enable support for temporal columns in the future

We should actually expose this functionality for all ordered columns (numeric, temporal, strings...).

Then we can leverage this for our sorts. Currently our sort runs its own bespoke sampling to find boundaries before repartitioning, but we can use approx_percentiles to generate those boundaries for us!
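
As a rough illustration of that idea, the sketch below picks range-partition boundaries at evenly spaced percentiles and then sorts each partition independently. It is not Daft's sort implementation: the function names are hypothetical, and exact nearest-rank percentiles over the full input stand in for the approximate values a sketch would provide.

```python
import bisect
import math
import random


def percentile_boundaries(sample, num_partitions):
    """Pick num_partitions - 1 split points at evenly spaced percentiles."""
    ordered = sorted(sample)
    boundaries = []
    for i in range(1, num_partitions):
        p = i / num_partitions
        # Nearest-rank percentile; a real system would query a sketch here.
        rank = max(1, math.ceil(p * len(ordered)))
        boundaries.append(ordered[rank - 1])
    return boundaries


def range_partition(values, boundaries):
    """Route each value to the partition whose range contains it."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for v in values:
        # Values equal to a boundary go to the lower partition.
        partitions[bisect.bisect_left(boundaries, v)].append(v)
    return partitions


def partitioned_sort(values, num_partitions):
    """Sort by range-partitioning first, then sorting each partition locally."""
    boundaries = percentile_boundaries(values, num_partitions)
    partitions = range_partition(values, boundaries)
    # Partition ranges do not overlap, so concatenating the locally
    # sorted partitions gives a globally sorted result.
    return [v for part in partitions for v in sorted(part)]


rng = random.Random(0)
data = [rng.random() for _ in range(1000)]
result = partitioned_sort(data, 4)
```

Approximate boundaries only affect how evenly the partitions are balanced, not correctness: any non-decreasing list of split points yields a globally sorted result once each partition is sorted.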

src/daft-core/src/series/ops/agg.rs

src/daft-dsl/src/expr.rs

daft/expressions/expressions.py

src/daft-dsl/src/python.rs

daft/expressions/expressions.py

colin-ho

LGTM!

jaychia · 2024-05-02T17:24:50Z

FYI @maxime-petitjean !

maxime-petitjean and others added 23 commits March 22, 2024 12:57

[WIP] add approx_quantile

4e1f9a7

[WIP] add approx_sketch

c4a833d

[WIP] add q parameter to approx_quantile

6dd3e2a

Rename approx_quantile to sketch_quantile

cdaff6a

Remove mandatory lit

a955249

Add MergeSketch

1109ea0

Add basic tests

cba7162

Handle sketch conversion errors

a16faaf

Merge branch 'main' into approx-quantile-aggregation

6997832

Add class Sketch

988d3a7

Clean up

fabaf39

Merge branch 'main' into approx-quantile-aggregation

abc776e

Remove from_value and use pure-wrapper struct

5104b84

Use bincode to serialize sketch to binary column

c31a16a

[WIP] Add approx_percentile aggregation

ad3e1e5

Merge branch 'main' into approx-quantile-aggregation

2584ed7

Fix tests

6d8ebae

Rename sketch_quantile to sketch_percentile

b504e04

[WIP] use serde_arrow

798cef2

Handle array of percentiles to compute

3adad3e

Merge branch 'main' into approx-quantile-aggregation

8d92329

Fix pyproject

01fcf73

Merge into main for ExprRef changes

8cbeef3

github-actions bot added the enhancement New feature or request label Apr 24, 2024

Jay Chia added 6 commits April 24, 2024 17:28

Cleanup error handling for arrow2_serde

296f22e

Simplify code by receiving a non-expression percentiles argument

6d8a71b

Code cleanup to use Vec<f64> in the expression

0b111e7

Cleanup argument naming q -> percentiles

c24bb90

Refactor Series methods to take &[f64] inputs

e67258d

Cleanup errors

758ef9d

Jay Chia added 5 commits April 25, 2024 18:18

Remove Series::approx_percentile in favor of only doing this on the table-level

733dd68

Switch to using FixedSizeList

aa0f063

Return float64 column if single value provided

bc46d5e

merge with main

20c08ea

Fixes and unit tests

03cd785

jaychia requested review from kevinzwang and colin-ho April 26, 2024 23:58

colin-ho reviewed Apr 30, 2024

Jay Chia added 4 commits April 30, 2024 18:05

Comments

6e3a17e

Add functionality to differentiate between user input of 0.5 vs [0.5]

77e5fd3

Add docstrings

fb22880

Docs

2497d4b

github-actions bot added the documentation Improvements or additions to documentation label May 1, 2024

jaychia requested a review from colin-ho May 1, 2024 22:42

colin-ho approved these changes May 2, 2024

Merge with main

7980e58

jaychia enabled auto-merge (squash) May 2, 2024 17:17

jaychia merged commit 99a0ac0 into main May 2, 2024
27 checks passed

jaychia deleted the jay/approx-quantile-aggregation branch May 2, 2024 17:31

jaychia mentioned this pull request May 2, 2024

[FEAT] Add approximative quantile aggregation #2076

Closed
