-
Notifications
You must be signed in to change notification settings - Fork 295
feat: implement distributed sort in flotilla engine #4991
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Summary
This PR implements distributed sort functionality in the flotilla engine, replacing the previous broadcast-only sort approach with a scalable multi-stage distributed algorithm. The implementation follows a classic distributed sorting pattern with four main phases:
- Materialization: All input partitions are materialized to enable sampling
- Sampling: Data is sampled across partitions to determine value distribution and calculate optimal partition boundaries
- Range Repartitioning: Data is repartitioned based on the calculated boundaries so each partition contains a contiguous range of values
- Local Sorting: Each partition is sorted independently while maintaining global sort order
The changes span multiple components of the system:
- Data structures: New
RangeRepartitionConfig
andRangeClusteringConfig
structs in the logical plan layer store partition boundaries asRecordBatch
objects - Execution layer: Modified
SortNode
removes single-partition restrictions and implements the full distributed algorithm with async execution - Repartitioning: Extended
RepartitionSink
and shuffle translation to support range-based partitioning - Python integration: Added Ray-based boundary calculation functions that concatenate materialized data and compute quantiles
- Dependencies: Added necessary crate dependencies (
daft-io
,daft-micropartition
,daft-recordbatch
) and visibility changes to support the new functionality
The implementation includes an optimization for single-partition inputs that bypasses the distributed logic and directly sorts the partition. This distributed approach enables the flotilla engine to handle large-scale sorting operations efficiently across multiple workers while maintaining correctness guarantees.
Confidence score: 4/5
- This PR implements complex distributed sorting logic with proper error handling and follows established patterns
- Score reflects the complexity of distributed algorithms and potential for subtle race conditions or boundary edge cases
- Pay close attention to the async execution flow in
sort.rs
and boundary calculation logic in the Python integration
12 files reviewed, 3 comments
3336b4c
to
32320fa
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4991 +/- ##
==========================================
+ Coverage 76.11% 76.40% +0.29%
==========================================
Files 933 934 +1
Lines 128478 128756 +278
==========================================
+ Hits 97787 98376 +589
+ Misses 30691 30380 -311
🚀 New features to boost your workflow:
|
32320fa
to
2420524
Compare
a28d80c
to
52c144f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome!
Changes Made
Implemented distributed sort in the flotilla engine.
Approach:
Special cases:
Related Issues
Checklist
docs/mkdocs.yml
navigation