-
Notifications
You must be signed in to change notification settings - Fork 295
feat: Flotilla broadcast join #4867
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Summary
This PR implements broadcast joins for the Flotilla distributed execution engine, a significant performance optimization for joins where one table is much smaller than the other. The implementation adds a new BroadcastJoinNode
that materializes the smaller table and broadcasts it to all partitions of the larger table, avoiding expensive shuffling operations.
The core changes include:
-
New BroadcastJoinNode: A complete implementation in
broadcast_join.rs
that handles the broadcast join execution pattern - materializing the broadcast side first, then copying it to each receiver task with hash join instructions. -
Join Strategy Logic: The
determine_join_strategy
method intranslate.rs
implements threshold-based decision making (10MB default) and partitioning compatibility checks to choose between Hash, Broadcast, Cross, and SortMerge strategies. -
Code Refactoring: Several files were updated to support the new functionality, including extracting reusable logic for creating in-memory scans from materialized outputs and renaming functions for clarity.
-
Side Swapping Logic: Proper handling of join semantics where the broadcast/receiver sides may need to be swapped based on join type (e.g., LEFT joins require the left side to remain as receiver).
The implementation reuses logic from the existing Ray runner to ensure consistency in join strategy determination and maintains the distributed pipeline abstraction while enabling significant performance improvements for star-schema and similar query patterns.
Confidence score: 4/5
• This PR is generally safe to merge with solid implementation of broadcast joins following established patterns
• Score reflects the complexity of distributed join logic and potential for subtle edge cases in join semantics or memory management
• The broadcast_join.rs file needs the most attention due to its core role in the new functionality
7 files reviewed, 1 comment
// For broadcast joins, ensure that the left side of the join is the smaller side | ||
let (smaller_size_bytes, left_is_larger) = if right_stats.size_bytes < left_stats.size_bytes | ||
{ | ||
(right_stats.size_bytes, true) | ||
} else { | ||
(left_stats.size_bytes, false) | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: The comment says 'ensure that the left side of the join is the smaller side' but the logic actually identifies which side is larger/smaller without enforcing left to be smaller. Consider updating the comment for clarity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kind of makes a decent point, maybe update the comment to say "determine the smaller side"
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4867 +/- ##
==========================================
+ Coverage 79.26% 79.27% +0.01%
==========================================
Files 908 910 +2
Lines 125851 126141 +290
==========================================
+ Hits 99751 100003 +252
- Misses 26100 26138 +38
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should LGTM, just a couple of things
where | ||
F: FnOnce(LocalPhysicalPlanRef) -> LocalPhysicalPlanRef + Send + Sync + 'static, | ||
{ | ||
fn make_in_memory_scan_from_materialized_outputs( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still use this function anywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah its used in broadcast join
join_plan, | ||
config, | ||
psets, | ||
SchedulingStrategy::Spread, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bit of an aside, do you think having a scheduling strategy here makes sense any more? Cause for operators like this, are they able to use the scheduling strategy of the children task they use (in this case, the input task from receiver_input)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah i think it makes sense to just use the scheduling strategy of the child.
On this note, I want to refactor the logic around 'building' tasks. Right now the only api to do so is task.with_new_task
which accepts a new swordfish task, but then there's still a lot of manual labor needed to make this new swordfish task in the first place, involving things like scheduling strategy, the context, task id, psets, etc.
// For broadcast joins, ensure that the left side of the join is the smaller side | ||
let (smaller_size_bytes, left_is_larger) = if right_stats.size_bytes < left_stats.size_bytes | ||
{ | ||
(right_stats.size_bytes, true) | ||
} else { | ||
(left_stats.size_bytes, false) | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kind of makes a decent point, maybe update the comment to say "determine the smaller side"
Changes Made
Add broadcast joins to flotilla. The translation logic for determining join strategy + broadcast join threshold is copied from the old ray runner's physical plan translator.
The broadcast join works by simply materializing the partition refs for the broadcast side, then copying them to the each of the receiver sides task (aka broadcast), and then adding a hash join instruction.
Related Issues
Checklist
docs/mkdocs.yml
navigation