Minor: Extract BatchCoalescer to its own module #12047

Merged
merged 1 commit into from
Aug 21, 2024

@alamb alamb commented Aug 18, 2024

Which issue does this PR close?

Part of #7957 and #11628

Rationale for this change

I hope to improve how batch coalescing works in DataFusion by saving a copy, which should improve performance, partly by avoiding the need for CoalesceBatchesExec.

Part of that plan is to use the BatchCoalescer in more places, so let's pull it into its own module. This also makes the code boundaries a bit clearer (e.g. the functions that form the public interface must now be marked pub).

What changes are included in this PR?

  1. Move BatchCoalescer to its own module
  2. Refine the documentation

Are these changes tested?

By existing CI

Are there any user-facing changes?

No functional changes

@github-actions github-actions bot added the physical-expr Physical Expressions label Aug 18, 2024
@@ -365,511 +335,3 @@ impl RecordBatchStream for CoalesceBatchesStream {
self.coalescer.schema()
}
}

/// Concatenate multiple record batches into larger batches
Contributor Author

The point of this PR is to move this code into its own module

@@ -346,7 +316,7 @@ impl CoalesceBatchesStream {
}
CoalesceBatchesStreamState::Exhausted => {
// Handle the end of the input stream.
- return if self.coalescer.buffer.is_empty() {
+ return if self.coalescer.is_empty() {
Contributor Author

Putting the BatchCoalescer into its own module means this code must use an accessor rather than looking directly at the coalescer's state, which I think improves the modularity (a tiny bit).
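
For readers unfamiliar with Rust's module privacy, here is a minimal sketch of why the call site had to change. It is not the DataFusion code: the `coalesce` module name, the `push_batch` method, the `has_buffered_batches` helper, and the `Vec<Vec<i32>>` buffer are all assumptions made for illustration. Only the idea of a private buffer behind a `pub fn is_empty` accessor comes from the diff above.

```rust
// Hypothetical, simplified stand-in: the buffer holds plain vectors,
// not Arrow RecordBatches, and the names below are illustrative.
mod coalesce {
    #[derive(Debug, Default)]
    pub struct BatchCoalescer {
        // Private field: invisible outside the `coalesce` module.
        buffer: Vec<Vec<i32>>,
    }

    impl BatchCoalescer {
        pub fn new() -> Self {
            Self::default()
        }

        /// Buffer one more batch.
        pub fn push_batch(&mut self, batch: Vec<i32>) {
            self.buffer.push(batch);
        }

        /// Public accessor that replaces direct `buffer.is_empty()` checks.
        pub fn is_empty(&self) -> bool {
            self.buffer.is_empty()
        }
    }
}

use coalesce::BatchCoalescer;

/// Hypothetical caller living outside the module, like the stream code.
/// Writing `coalescer.buffer.is_empty()` here would fail to compile.
fn has_buffered_batches(coalescer: &BatchCoalescer) -> bool {
    !coalescer.is_empty()
}
```

Because `buffer` has no `pub` marker, any code outside the module is limited to the methods the struct chooses to expose, which is the boundary the comment above describes.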

/// combined filter/coalesce operation.
///
#[derive(Debug)]
pub struct BatchCoalescer {
Contributor Author

The only changes in this module are to make the struct and some methods pub and to update the documentation slightly.

@ozankabak
Contributor

Will review tomorrow 🚀

Contributor

@ozankabak ozankabak left a comment

LGTM, thank you 🚀

@alamb
Contributor Author

alamb commented Aug 21, 2024

Thank you for the review @ozankabak

@alamb alamb merged commit 121f330 into apache:main Aug 21, 2024
24 checks passed