
Add option to FilterExec to prevent re-using input batches #12039

Closed
wants to merge 4 commits into from

Conversation

andygrove (Member) commented Aug 16, 2024

Which issue does this PR close?

N/A

Rationale for this change

DataFusion Comet is currently maintaining a fork of FilterExec with a small modification to change the way that filtered batches are created. We have a requirement that FilterExec must not pass through input batches in the case where the predicate evaluates to true for all rows in a batch (due to some array re-use in our scan).

We would like to make the DataFusion implementation of FilterExec customizable to meet our needs.

What changes are included in this PR?

Add a new boolean parameter so that we can choose whether FilterExec is allowed to return unmodified input batches.
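As a rough illustration of the proposed option, the flag could live on a config struct with a builder-style setter. This is a hypothetical sketch (the names `FilterExecConfig`, `reuse_input_batches`, and `with_reuse_input_batches` are illustrative, not the actual DataFusion API):

```rust
/// Hypothetical sketch of the proposed option; the real FilterExec API may differ.
#[derive(Debug, Clone)]
struct FilterExecConfig {
    /// When false, FilterExec must always produce freshly allocated output
    /// batches, even if the predicate selected every row of the input.
    reuse_input_batches: bool,
}

impl Default for FilterExecConfig {
    fn default() -> Self {
        // Reusing input batches is the current (and fastest) behavior,
        // so it remains the default.
        Self { reuse_input_batches: true }
    }
}

impl FilterExecConfig {
    fn with_reuse_input_batches(mut self, reuse: bool) -> Self {
        self.reuse_input_batches = reuse;
        self
    }
}

fn main() {
    // A caller like Comet would opt out of batch re-use explicitly.
    let cfg = FilterExecConfig::default().with_reuse_input_batches(false);
    assert!(!cfg.reuse_input_batches);
}
```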

Are these changes tested?

I did not add tests yet. I wanted to get some feedback on approach first.

Are there any user-facing changes?

@github-actions github-actions bot added the physical-expr Physical Expressions label Aug 16, 2024
@andygrove andygrove marked this pull request as ready for review August 16, 2024 22:15
metegenez (Contributor) commented:

If the predicate evaluates to true for every row, the result is typically an array pointer copy. However, there are cases where you want to copy the underlying data even when the predicate is entirely true, even though that degrades the operator's performance.

Is there a use case other than Comet itself?

```rust
if reuse_input_batches {
    filter_record_batch(batch, filter_array)?
} else {
    if filter_array.true_count() == batch.num_rows() {
```
Dandandan (Contributor) commented Aug 17, 2024:

As computing the true count is not free, I am wondering if we can either:

- move this into the arrow filter compute kernel, or
- check the returned array(s) for pointer equality, and copy if equal.
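The pointer-equality idea can be sketched in plain Rust using `Arc` as a stand-in for an Arrow array's shared allocation (the real check would compare `ArrayData`/`Buffer` pointers; `ensure_owned` is a hypothetical helper, not a DataFusion function):

```rust
use std::sync::Arc;

// Model: an Arrow array's data lives behind a shared pointer. If the filter
// kernel returned the input unchanged (all-true predicate), the output shares
// the same allocation, which we can detect with pointer equality and then
// force a real copy.
fn ensure_owned(input: &Arc<Vec<i32>>, output: Arc<Vec<i32>>) -> Arc<Vec<i32>> {
    if Arc::ptr_eq(input, &output) {
        // Same allocation: materialize a deep copy so the caller may safely
        // recycle the input buffers.
        Arc::new(output.as_ref().clone())
    } else {
        output
    }
}

fn main() {
    let input = Arc::new(vec![1, 2, 3]);
    // Simulate an all-true predicate: the "filter" returns the input as-is.
    let passthrough = Arc::clone(&input);
    let owned = ensure_owned(&input, passthrough);
    // The result no longer aliases the input, but holds the same values.
    assert!(!Arc::ptr_eq(&input, &owned));
    assert_eq!(*owned, vec![1, 2, 3]);
}
```

This avoids computing `true_count()` entirely: the copy only happens when the kernel actually returned a shared pointer.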


Specifically, I think the optimize function would be a natural place to put this: https://docs.rs/arrow-select/52.2.0/src/arrow_select/filter.rs.html#181


```rust
    filter_record_batch(batch, filter_array)?
} else {
    if filter_array.true_count() == batch.num_rows() {
        // special case where we just make an exact copy
```

I think I am missing something: why go through all the effort with `MutableArrayData`?

In other words, why isn't this simply

```rust
Ok(batch.clone())
```

to literally return the input batch? I think that would be less code and faster.


That would be the same as what the filter kernel already does.

The use case, as far as I understand, is that for Comet the data needs to be a new copy, as Spark will reuse the existing data/arrays.
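The distinction being made here is shallow versus deep copying. A `RecordBatch` clone only bumps reference counts on shared buffers, so both handles still alias the same memory; Comet needs a fresh allocation. A minimal sketch of that difference, again using `Arc` to model a shared Arrow buffer:

```rust
use std::sync::Arc;

fn main() {
    // Cloning a RecordBatch behaves like cloning an Arc: it is shallow, only
    // bumping a reference count, so both handles point at the same buffer.
    let original = Arc::new(vec![10_u8, 20, 30]);
    let shallow = Arc::clone(&original);
    assert!(Arc::ptr_eq(&original, &shallow));

    // Comet needs the opposite: a deep copy with its own allocation, because
    // the Spark-side scan will recycle the original buffers.
    let deep = Arc::new(original.as_ref().clone());
    assert!(!Arc::ptr_eq(&original, &deep));
    assert_eq!(*deep, *original);
}
```

So `Ok(batch.clone())` would still hand Spark's recycled buffers back to downstream operators, which is exactly what the option is meant to prevent.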

alamb (Contributor) commented Aug 19, 2024:

I see -- I think it would help if we made a function with a name that makes it clear what is going on, something like

```rust
fn force_new_data_copy(...)
```

I also left a suggestion here: apache/datafusion-comet#835 (comment)

```diff
@@ -379,7 +430,8 @@ impl Stream for FilterExecStream {
         match ready!(self.input.poll_next_unpin(cx)) {
             Some(Ok(batch)) => {
                 let timer = self.baseline_metrics.elapsed_compute().timer();
-                let filtered_batch = batch_filter(&batch, &self.predicate)?;
+                let filtered_batch =
```

I recommend adding a test so we don't break this feature by accident.

@alamb alamb marked this pull request as draft August 20, 2024 18:59
alamb (Contributor) commented Aug 20, 2024

Marking as draft, as I think this PR is no longer waiting on feedback.


Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Oct 20, 2024
@github-actions github-actions bot closed this Oct 28, 2024
Labels: physical-expr (Physical Expressions), Stale (PR has not had any activity for some time)

4 participants