Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make swap_hash_join public API #10702

Merged
merged 1 commit into from
May 30, 2024
Merged

Conversation

viirya
Copy link
Member

@viirya viirya commented May 28, 2024

Which issue does this PR close?

Closes #9603.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label May 28, 2024
fn swap_hash_join(
/// This function is public so other downstream projects can use it
/// to construct `HashJoinExec` with right side as the build side.
pub fn swap_hash_join(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spark supports build right in HashJoin operator internally. In planning stage, it decides which side is build side.

However, DataFusion HashJoin planning takes alternative approach. DataFusion HashJoin operator only supports build left. In optimization stage, once DataFusion finds build right is better, it will call this function to swap inputs for HashJoin to achieve build right effect.

So, to make Comet to use DataFusion HashJoin to support Spark build right HashJoin, we only need this function to be public so Comet can directly use it.

Copy link
Contributor

@alamb alamb May 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this input takes &HashJoinExec as input, it seems like it might be easier to find / make more sense as a method on HashJoinExec itself 🤔

So maybe we could update the code so this function is HashJoinExec::swap_inputs or something

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. 👍
Let me update it accordingly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, some functions used in swap_hash_join are also used by other functions like swap_nl_join. So I cannot simply move these functions into HashJoinExec as private ones.

Only I can do is move these functions to joins/utils.rs in physical-plan crate and make them as public APIs so both join_selection.rs and HashJoinExec can use them.

It makes more APIs public actually, including swap_join_filter, swap_reverting_projection and swap_join_type.

I think it is not so worth making more public APIs just for swap_hash_join.

Seems the current approach is better?

WDYT?

fn swap_hash_join(
/// This function is public so other downstream projects can use it
/// to construct `HashJoinExec` with right side as the build side.
pub fn swap_hash_join(
Copy link
Contributor

@alamb alamb May 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this input takes &HashJoinExec as input, it seems like it might be easier to find / make more sense as a method on HashJoinExec itself 🤔

So maybe we could update the code so this function is HashJoinExec::swap_inputs or something

@@ -157,7 +157,9 @@ fn swap_join_projection(
}

/// This function swaps the inputs of the given join operator.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// This function swaps the inputs of the given join operator.
/// This function swaps the inputs of the given join operator, to construct `HashJoinExec` with right side as the build side

@@ -157,7 +157,9 @@ fn swap_join_projection(
}

/// This function swaps the inputs of the given join operator.
fn swap_hash_join(
/// This function is public so other downstream projects can use it
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// This function is public so other downstream projects can use it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without a clear comment, it might be wrongly moved out of public APIs easily.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree but the entire idea of pub modifier implies 3rd party users to use the pub method? 🤔

@@ -157,7 +157,9 @@ fn swap_join_projection(
}

/// This function swaps the inputs of the given join operator.
fn swap_hash_join(
/// This function is public so other downstream projects can use it
/// to construct `HashJoinExec` with right side as the build side.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// to construct `HashJoinExec` with right side as the build side.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a benefit in leaving the documentation that explains the rationale for why this is public

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @viirya

@andygrove andygrove merged commit ad2b1dc into apache:main May 30, 2024
23 checks passed
@viirya
Copy link
Member Author

viirya commented May 30, 2024

Thanks @andygrove @alamb @Dandandan @comphead

findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support build right with HashJoin in DataFusion
5 participants