feat: implement into_batches operator on flotilla distributed engine #4958
Conversation
Greptile Summary
This PR implements support for DataFrame.into_batches() in the distributed (Flotilla/Ray) execution engine, achieving parity with the native execution engine. The implementation adds a new distributed pipeline node (IntoBatchesNode) that handles batching logic across distributed workers.
The changes span multiple layers of the distributed execution stack:
- Pipeline Node Infrastructure: Adds a new IntoBatchesNode in src/daft-distributed/src/pipeline_node/into_batches.rs that implements the distributed batching strategy. The node materializes downstream pipeline outputs, accumulates them until reaching the target batch size, then submits tasks that force local pipeline combination of RecordBatches within MicroPartitions.
- Translation Layer: Updates the pipeline translator in translate.rs to handle LogicalPlan::IntoBatches operations by creating IntoBatchesNode instances with appropriate configuration.
- Stage Builder: Moves IntoBatches from the list of unsupported operations to the supported operations in stage_builder.rs, enabling the distributed engine to process logical plans containing batching operations.
- API Surface: Removes the runtime restriction in dataframe.py that previously prevented into_batches() from being used with the Ray runner.
- Test Coverage: Updates tests to run on both native and distributed engines, with adjusted assertions to accommodate the "best effort" nature of distributed batching, where exact batch sizes cannot be guaranteed but minimum sizes are maintained.
The distributed implementation follows a streaming approach: it collects materialized outputs and creates a new task whenever the accumulated data reaches the batch-size threshold, ensuring efficient memory usage and proper data distribution across workers.
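To make the accumulate-then-emit strategy above concrete, here is a minimal, self-contained sketch. It is not the actual Daft code: Part is a stand-in for a materialized MicroPartition, and group_into_batches is a hypothetical helper that only models the row-count accounting (greedily grouping outputs until the target batch size is reached, with a possibly smaller trailing batch, matching the "best effort" semantics described above).

```rust
// Hypothetical stand-in for a materialized MicroPartition; only row counts
// matter for the batching decision sketched here.
struct Part {
    rows: usize,
}

/// Greedily group materialized outputs into batches of at least `batch_size`
/// rows. Best effort: the final batch may be smaller than `batch_size`.
fn group_into_batches(parts: Vec<Part>, batch_size: usize) -> Vec<Vec<Part>> {
    let mut batches = Vec::new();
    let mut current = Vec::new();
    let mut accumulated = 0;
    for part in parts {
        accumulated += part.rows;
        current.push(part);
        if accumulated >= batch_size {
            // Threshold reached: emit the accumulated group as one batch task.
            batches.push(std::mem::take(&mut current));
            accumulated = 0;
        }
    }
    if !current.is_empty() {
        // Trailing partial batch (smaller than batch_size).
        batches.push(current);
    }
    batches
}

fn main() {
    let parts = vec![
        Part { rows: 40 },
        Part { rows: 70 },
        Part { rows: 30 },
        Part { rows: 10 },
    ];
    let batches = group_into_batches(parts, 100);
    // [40, 70] reaches the threshold; [30, 10] is the trailing partial batch.
    println!("{}", batches.len()); // prints 2
}
```

In the streaming setting the real node would not hold all parts in a Vec; it would emit each batch task as soon as the threshold is crossed, which is what keeps memory usage bounded.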
Confidence score: 1/5
- This PR contains a critical compilation bug that will prevent it from working
- Score reflects a serious issue in the core IntoBatches implementation that uses a moved variable
- Pay close attention to src/daft-distributed/src/pipeline_node/into_batches.rs, line 124
6 files reviewed, 1 comment
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4958      +/-   ##
==========================================
+ Coverage   78.03%   79.57%    +1.54%
==========================================
  Files         916      917        +1
  Lines      126917   126934       +17
==========================================
+ Hits        99043   101014     +1971
+ Misses      27874    25920     -1954
Mostly looks good!
Great job on this!
Changes Made
This PR adds support for DataFrame.into_batches() in the distributed (Flotilla/Ray) engine and wires it through the pipeline.
The node logic is as follows:
for (3/4), it's important to submit the next task with an IntoBatches operator, because that forces the local pipeline to combine all the RecordBatches in the given MicroPartition, matching the logic done in the local execution op:
Daft/src/daft-local-execution/src/intermediate_ops/into_batches.rs
Lines 35 to 43 in 048fd24
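As a rough illustration of what "combining all the RecordBatches in the given MicroPartition" means, here is a toy sketch (not the actual code at the permalink above; RecordBatch and MicroPartition are simplified stand-ins where a Vec of integers replaces real columnar data): the local op concatenates a partition's internal chunks so that each emitted partition is a single contiguous batch.

```rust
// Simplified stand-in for a columnar record batch.
#[derive(Clone)]
struct RecordBatch {
    values: Vec<i64>,
}

// Simplified stand-in for a MicroPartition holding multiple internal batches.
struct MicroPartition {
    batches: Vec<RecordBatch>,
}

impl MicroPartition {
    /// Concatenate all internal RecordBatches into a single batch, so the
    /// partition is emitted as one contiguous chunk.
    fn combine(&self) -> RecordBatch {
        let values = self
            .batches
            .iter()
            .flat_map(|b| b.values.clone())
            .collect();
        RecordBatch { values }
    }
}

fn main() {
    let mp = MicroPartition {
        batches: vec![
            RecordBatch { values: vec![1, 2] },
            RecordBatch { values: vec![3] },
        ],
    };
    println!("{}", mp.combine().values.len()); // prints 3
}
```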
Related Issues
Native runner: #4935
Checklist