@srilman srilman commented May 7, 2025

Changes Made

Currently, most of the work for count_distinct happens in the finalization step, which runs on only one task/thread, so very little parallelism is possible. With this change, Daft performs a local distinct operation on micro-partitions during the sink phase, which can run in parallel.
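
The two-phase idea described above can be illustrated with a minimal sketch. This is not Daft's implementation: the names `local_distinct` and `count_distinct` are hypothetical, micro-partitions are modeled as plain Python lists, and a thread pool stands in for the parallel sink phase purely to show the structure.

```python
from concurrent.futures import ThreadPoolExecutor


def local_distinct(partition):
    """Sink-phase step (hypothetical): deduplicate one micro-partition on its own."""
    return set(partition)


def count_distinct(partitions, max_workers=4):
    """Count distinct values across partitions using a two-phase approach."""
    # Phase 1: each partition is deduplicated independently, so this work
    # can be spread across workers.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(local_distinct, partitions))

    # Phase 2 (finalization): a single task merges the already-deduplicated
    # partial results and counts what remains.
    merged = set()
    for partial in partials:
        merged |= partial
    return len(merged)


# Three toy "micro-partitions" with overlapping values.
partitions = [[1, 2, 2, 3], [3, 3, 4], [1, 5, 5, 5]]
result = count_distinct(partitions)
print(result)
```

The finalization step still runs on one task, but it only merges values that each partition has already deduplicated, rather than processing every input row itself.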

Performance results from running ClickBench Queries 4 & 5 (6 runs each, skipping the first) on an M4 Max MacBook Pro with 14 cores and 32 GB of memory:

| Query | Before | After | Improvement |
|-------|--------|-------|-------------|
| Q4    | 2.95s  | 1.89s | 1.56x       |
| Q5    | 2.82s  | 1.30s | 2.16x       |

Related Issues

Checklist

  • Documented in API Docs (if applicable)
  • Documented in User Guide (if applicable)
  • If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)

@github-actions github-actions bot added the perf label May 7, 2025
@srilman srilman requested a review from colin-ho May 7, 2025 22:42

codecov bot commented May 8, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.27%. Comparing base (5117945) to head (527a080).
Report is 1 commit behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #4325   +/-   ##
=======================================
  Coverage   78.27%   78.27%           
=======================================
  Files         818      818           
  Lines      107999   107999           
=======================================
+ Hits        84531    84532    +1     
+ Misses      23468    23467    -1     
| Files with missing lines | Coverage Δ |
|--------------------------|------------|
| ...ft-physical-plan/src/physical_planner/translate.rs | 93.45% <100.00%> (ø) |

... and 2 files with indirect coverage changes

@srilman srilman merged commit 971aa24 into main May 8, 2025
55 checks passed
@srilman srilman deleted the slade/count_distinct_local_agg branch May 8, 2025 00:42