Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MINOR]:Do not introduce unnecessary repartition when row count is 1. #7832

Merged
merged 6 commits into from
Oct 17, 2023
Merged

[MINOR]:Do not introduce unnecessary repartition when row count is 1. #7832

merged 6 commits into from
Oct 17, 2023

Conversation

mustafasrepo
Copy link
Contributor

@mustafasrepo mustafasrepo commented Oct 16, 2023

Which issue does this PR close?

Closes #.

Rationale for this change

As observed in discussion bu @alamb. Currently we add RoundRobin repartition when we know that input row number is 1(repartition is not helpful). This PR fixes this problem.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@mustafasrepo mustafasrepo changed the title Do not introduce unnecessary repartition when row count is 1. [MINOR]:Do not introduce unnecessary repartition when row count is 1. Oct 16, 2023
@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Oct 16, 2023
Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick fix, did a review and it LGTM.

// Don't need to apply when the returned row count is not greater than 1:
let stats = child.statistics();
let repartition_beneficial_stats = if stats.is_exact {
stats.num_rows.map(|num_rows| num_rows > 1).unwrap_or(true)
Copy link
Contributor

@Dandandan Dandandan Oct 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that repartitioning is only useful when having multiple batches, we can consider changing this to:
num_rows > batch_size

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I will change accordingly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Dandandan I updated check as you suggested. Some of the existing tests changes with this change. I think, changes are for the better. However, I would appreciate If you can double check them.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me -- thank you @mustafasrepo and @ozankabak

Let's wait for @Dandandan to respond prior to merging though

.map(|num_rows| num_rows <= 1)
.unwrap_or(false));
Statistics {
// the output row count is surely not larger than its input row count
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if it matters, but the output rows could be larger than the input rows for COUNT(*) queries -- specifically if there are no input rows, COUNT(*) still produces an output row 🤔

❯ create table t(x int) as values (1);
0 rows in set. Query took 0.001 seconds.

❯ select count(*) from t where x > 1000;
+----------+
| COUNT(*) |
+----------+
| 0        |
+----------+

Copy link
Contributor Author

@mustafasrepo mustafasrepo Oct 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you are right. I missed that. I think the safest way is to check num_rows == 1. Changed accordingly

@Dandandan Dandandan merged commit 9aacdee into apache:main Oct 17, 2023
22 checks passed
@Dandandan
Copy link
Contributor

Thanks @mustafasrepo and @ozankabak !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants