Skip to content

Conversation

srilman
Copy link
Contributor

@srilman srilman commented May 14, 2025

Changes Made

Adds Expr.skew. Implements it like Polars and Spark (Pandas and DuckDB implement a slightly different algorithm with normalization, Polars has a flag to choose between the options). Series.skew implements it normally and populate_aggregation_stages has a specialized version.

Related Issues

Closes #4332.

Checklist

  • Documented in API Docs (if applicable)
  • Documented in User Guide (if applicable)
  • If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)

@github-actions github-actions bot added the feat label May 14, 2025
}

fn grouped_skew(&self, groups: &GroupIndices) -> Self::Output {
let grouped_skew_iter = stats::grouped_stats(self, groups)?.map(|(stats, group)| {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially dumb question, but why do we need two variants. Seems like the grouped_ variant can always be derived from the regular.

let (m3, m2) = values.fold((0., 0.), |(m3_acc, m2_acc), v| {
(
m3_acc + (v - mean).powi(3),
(v - mean).mul_add(v - mean, m2_acc),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if there's a better way to do this. Couldn't figure out a good way to reuse / copy the iterator so just calculated both aggregations in the same iterator

);

let result = numerator.div(denom).div(n).alias(output_name);
final_exprs.push(result);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it like this using the DuckDB algorithm instead of just accumulation and calling skew because I couldn't really figure out how to implement a list_skew on ListArrays. Kept running into issues

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you happen to think that would be easier or have guidance for the next time, just let me know

@srilman srilman requested a review from colin-ho May 14, 2025 01:40
@srilman
Copy link
Contributor Author

srilman commented May 14, 2025

@colin-ho Let me know if someone else would be better to have review. Just tagging you for review initially

@srilman srilman requested a review from universalmind303 May 14, 2025 01:43
Copy link
Contributor

@universalmind303 universalmind303 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

Definitely not something we need to do here, but it might be worth thinking about how we can further simplify the agg exprs like we've been doing with scalar exprs.

Copy link

codecov bot commented May 14, 2025

Codecov Report

Attention: Patch coverage is 90.28571% with 17 lines in your changes missing coverage. Please review.

Project coverage is 78.56%. Comparing base (8c10e5b) to head (ac9aa3f).
Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
src/daft-core/src/datatypes/agg_ops.rs 33.33% 6 Missing ⚠️
src/daft-core/src/series/ops/agg.rs 71.42% 4 Missing ⚠️
daft/dataframe/dataframe.py 25.00% 3 Missing ⚠️
src/daft-logical-plan/src/ops/project.rs 0.00% 3 Missing ⚠️
src/daft-sql/src/modules/aggs.rs 0.00% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4346      +/-   ##
==========================================
+ Coverage   78.21%   78.56%   +0.34%     
==========================================
  Files         814      820       +6     
  Lines      110992   109850    -1142     
==========================================
- Hits        86811    86299     -512     
+ Misses      24181    23551     -630     
Files with missing lines Coverage Δ
daft/expressions/expressions.py 95.91% <100.00%> (+0.01%) ⬆️
src/daft-core/src/array/ops/skew.rs 100.00% <100.00%> (ø)
src/daft-core/src/datatypes/mod.rs 40.00% <ø> (ø)
src/daft-core/src/utils/stats.rs 98.43% <100.00%> (+0.43%) ⬆️
src/daft-dsl/src/expr/mod.rs 79.63% <100.00%> (+0.21%) ⬆️
src/daft-dsl/src/python.rs 90.38% <100.00%> (-1.49%) ⬇️
...ft-physical-plan/src/physical_planner/translate.rs 93.93% <100.00%> (+0.39%) ⬆️
src/daft-recordbatch/src/lib.rs 85.81% <100.00%> (+0.05%) ⬆️
src/daft-sql/src/modules/aggs.rs 68.04% <0.00%> (-0.71%) ⬇️
daft/dataframe/dataframe.py 85.67% <25.00%> (-0.23%) ⬇️
... and 3 more

... and 24 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@srilman srilman merged commit 34a7393 into main May 14, 2025
98 of 100 checks passed
@srilman srilman deleted the slade/skew branch May 14, 2025 20:30
@srilman
Copy link
Contributor Author

srilman commented May 14, 2025

Definitely not something we need to do here, but it might be worth thinking about how we can further simplify the agg exprs like we've been doing with scalar exprs.

Agree, yeah especially making it a bit more flexible for future additions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Expr.skew
2 participants