-
Notifications
You must be signed in to change notification settings - Fork 2.1k
feat: implement partition_statistics for HashJoinExec #16956
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -953,21 +953,56 @@ impl ExecutionPlan for HashJoinExec { | |
| } | ||
|
|
||
| fn partition_statistics(&self, partition: Option<usize>) -> Result<Statistics> { | ||
| if partition.is_some() { | ||
| return Ok(Statistics::new_unknown(&self.schema())); | ||
| match (partition, self.mode) { | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: fn partition_statistics(&self, partition: Option<usize>) -> Result<Statistics> {
let (left_stats, right_stats) = match (partition, self.mode) {
(Some(p), PartitionMode::CollectLeft) => {
(self.left.partition_statistics(None)?, self.right.partition_statistics(Some(p))?)
}
(Some(p), PartitionMode::Partitioned) => {
(self.left.partition_statistics(Some(p))?, self.right.partition_statistics(Some(p))?)
}
(None, _) | (Some(_), PartitionMode::Auto) => {
(self.left.partition_statistics(None)?, self.right.partition_statistics(None)?)
}
};
let stats = estimate_join_statistics(left_stats, right_stats, &self.on, &self.join_type, &self.join_schema)?;
Ok(stats.project(self.projection.as_ref()))
} |
||
| // For CollectLeft mode, the left side is collected into a single partition, | ||
| // so all left partitions are available to each output partition. | ||
| // For the right side, we need the specific partition statistics. | ||
| (Some(partition), PartitionMode::CollectLeft) => { | ||
| let left_stats = self.left.partition_statistics(None)?; | ||
| let right_stats = self.right.partition_statistics(Some(partition))?; | ||
|
|
||
| let stats = estimate_join_statistics( | ||
| left_stats, | ||
| right_stats, | ||
| self.on.clone(), | ||
| &self.join_type, | ||
| &self.join_schema, | ||
| )?; | ||
| Ok(stats.project(self.projection.as_ref())) | ||
| } | ||
|
|
||
| // For Partitioned mode, both sides are partitioned, so each output partition | ||
| // only has access to the corresponding partition from both sides. | ||
| (Some(partition), PartitionMode::Partitioned) => { | ||
| let left_stats = self.left.partition_statistics(Some(partition))?; | ||
| let right_stats = self.right.partition_statistics(Some(partition))?; | ||
|
|
||
| let stats = estimate_join_statistics( | ||
| left_stats, | ||
| right_stats, | ||
| self.on.clone(), | ||
| &self.join_type, | ||
| &self.join_schema, | ||
| )?; | ||
| Ok(stats.project(self.projection.as_ref())) | ||
| } | ||
|
|
||
| // For Auto mode or when no specific partition is requested, fall back to | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For auto mode, as the comment says: So if the method with partition is called after
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh, I see. However, it seems challenging to get the threshold here for conducting the same evaluation. Do you have any other ideas on how to share the same logic from the
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'll check later
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @xudong963 Do you have time to look at the question I asked above? Thank you. Also, could you please help me reopen the PR?
I searched the code base, no optimizer rules currently call partition_statistics(Some(partition))) before JoinSelection. I’m also wondering — based on this, should we assert that the auto mode will never happen in this code path? Or do we still want to determine the PartitionMode here? If so, that would mean we need to store collect_threshold_byte_size and collect_threshold_num_rows in HashJoinExec. I’m not sure if it’s a good design to expose these optimizer-related configs to the executor layer.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
The only rule before JoinSelection, called
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good for me to keep the old way! |
||
| // the current behavior of getting all partition statistics. | ||
| (None, _) | (Some(_), PartitionMode::Auto) => { | ||
| // TODO stats: it is not possible in general to know the output size of joins | ||
| // There are some special cases though, for example: | ||
| // - `A LEFT JOIN B ON A.col=B.col` with `COUNT_DISTINCT(B.col)=COUNT(B.col)` | ||
| let stats = estimate_join_statistics( | ||
| self.left.partition_statistics(None)?, | ||
| self.right.partition_statistics(None)?, | ||
| self.on.clone(), | ||
| &self.join_type, | ||
| &self.join_schema, | ||
| )?; | ||
| Ok(stats.project(self.projection.as_ref())) | ||
| } | ||
| } | ||
| // TODO stats: it is not possible in general to know the output size of joins | ||
| // There are some special cases though, for example: | ||
| // - `A LEFT JOIN B ON A.col=B.col` with `COUNT_DISTINCT(B.col)=COUNT(B.col)` | ||
| let stats = estimate_join_statistics( | ||
| self.left.partition_statistics(None)?, | ||
| self.right.partition_statistics(None)?, | ||
| self.on.clone(), | ||
| &self.join_type, | ||
| &self.join_schema, | ||
| )?; | ||
| // Project statistics if there is a projection | ||
| Ok(stats.project(self.projection.as_ref())) | ||
| } | ||
|
|
||
| /// Tries to push `projection` down through `hash_join`. If possible, performs the | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to see tests call validate_statistics_with_data() to verify real execution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM after this is done.