
Add LimitPushdown optimization rule and CoalesceBatchesExec fetch #27

Closed
wants to merge 51 commits

Conversation


@alihandroid alihandroid commented Jul 10, 2024

Which issue does this PR close?

Closes #.

Rationale for this change

Physical plans can be optimized further by pushing GlobalLimitExec and LocalLimitExec down through certain nodes, or by replacing child nodes with fetch-limited versions of themselves, without changing the result. This reduces unnecessary data transfer and processing, making plan execution more efficient.

CoalesceBatchesExec can also benefit from this improvement, so fetch limit support is implemented for it as well.

For example,

GlobalLimitExec: skip=0, fetch=5
  StreamingTableExec: partition_sizes=1, projection=[c1, c2, c3], infinite_source=true

can be turned into

StreamingTableExec: partition_sizes=1, projection=[c1, c2, c3], infinite_source=true, fetch=5

and

GlobalLimitExec: skip=0, fetch=5
  CoalescePartitionsExec
    FilterExec: c3@2 > 0
      RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1
        StreamingTableExec: partition_sizes=1, projection=[c1, c2, c3], infinite_source=true

can be turned into

GlobalLimitExec: skip=0, fetch=5
  CoalescePartitionsExec
    LocalLimitExec: fetch=5
      FilterExec: c3@2 > 0
        RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1
          StreamingTableExec: partition_sizes=1, projection=[c1, c2, c3], infinite_source=true

without changing the result, while using fewer resources and finishing faster.

Other examples can be found in the tests provided in limit_pushdown.rs
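For reference, here is a minimal sketch (not taken from the PR) of how the rule can be applied to an existing physical plan as a PhysicalOptimizerRule; the exact module paths (e.g. `physical_optimizer::limit_pushdown`) are assumptions and may differ from where the rule finally lands:

```rust
use std::sync::Arc;

use datafusion::config::ConfigOptions;
use datafusion::error::Result;
use datafusion::physical_optimizer::limit_pushdown::LimitPushdown; // assumed path
use datafusion::physical_optimizer::optimizer::PhysicalOptimizerRule;
use datafusion::physical_plan::{displayable, ExecutionPlan};

/// Apply the rule once and show the plan before and after.
fn push_down_limits(plan: Arc<dyn ExecutionPlan>) -> Result<Arc<dyn ExecutionPlan>> {
    println!("before:\n{}", displayable(plan.as_ref()).indent(true));
    let optimized = LimitPushdown::new().optimize(plan, &ConfigOptions::default())?;
    println!("after:\n{}", displayable(optimized.as_ref()).indent(true));
    Ok(optimized)
}
```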

What changes are included in this PR?

Implement LimitPushdown Rule:

  • Introduced new APIs in the ExecutionPlan trait (see the sketch after this list):
    • with_fetch(&self, fetch: Option<usize>) -> Option<Arc<dyn ExecutionPlan>>: Returns a fetch-limited version of the node if supported, None otherwise. The default implementation returns None
    • supports_limit_pushdown(&self) -> bool: Returns true if a node supports limit pushdown. The default implementation returns false
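A rough sketch of the shape of these trait additions, assuming default implementations on ExecutionPlan (supertraits and the existing trait items are elided, and the doc comments are paraphrased rather than copied from the PR):

```rust
use std::sync::Arc;

pub trait ExecutionPlan {
    // ...existing required and provided methods elided...

    /// Returns a fetch-limited version of this node if it supports one,
    /// or `None` otherwise.
    fn with_fetch(&self, _fetch: Option<usize>) -> Option<Arc<dyn ExecutionPlan>> {
        None
    }

    /// Returns `true` if a limit can be pushed down through this node
    /// without changing its output.
    fn supports_limit_pushdown(&self) -> bool {
        false
    }
}
```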

Add fetch support to CoalesceBatchesExec:

  • Add fetch field and with_fetch implementation
  • Add new_with_fetch constructor
  • Implement fetch limit functionality (a simplified sketch of the idea follows this list)
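As a hypothetical, simplified illustration of the idea (the helper name `apply_fetch` and its signature are invented here and are not the PR's actual code), truncating a batch once the fetch limit is reached could look like this:

```rust
use arrow::record_batch::RecordBatch;

/// Truncate `batch` so that at most `remaining_fetch` more rows are emitted.
/// Returns the (possibly sliced) batch to emit and the updated remaining count.
fn apply_fetch(batch: RecordBatch, remaining_fetch: usize) -> (Option<RecordBatch>, usize) {
    if remaining_fetch == 0 {
        // The limit is already satisfied; emit nothing further.
        (None, 0)
    } else if batch.num_rows() <= remaining_fetch {
        // The whole batch still fits under the limit.
        let left = remaining_fetch - batch.num_rows();
        (Some(batch), left)
    } else {
        // Only a prefix of the batch is needed to reach the limit.
        (Some(batch.slice(0, remaining_fetch)), 0)
    }
}
```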

Are these changes tested?

Unit tests are provided for LimitPushdown and the new fetching support for CoalesceBatchesExec.

Are there any user-facing changes?

No. The changes only affect performance.

@github-actions github-actions bot added the core label Jul 10, 2024
@alihandroid alihandroid changed the title Add LimitPushdown optimization rule Add LimitPushdown optimization rule and CoalesceBatchesExec fetch Jul 15, 2024
@alihandroid alihandroid marked this pull request as ready for review July 15, 2024 15:37
@github-actions github-actions bot added the documentation and sqllogictest labels Jul 17, 2024
@alihandroid (Author)

Rebased the fork because of conflicts

07)------------AggregateExec: mode=Partial, gby=[date_bin(IntervalMonthDayNano { months: 0, days: 0, nanoseconds: 900000000000 }, ts@0) as date_bin(Utf8("15 minutes"),unbounded_csv_with_timestamps.ts)], aggr=[], ordering_mode=Sorted
08)--------------RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1
09)----------------StreamingTableExec: partition_sizes=1, projection=[ts], infinite_source=true, output_ordering=[ts@0 DESC]
03)----LocalLimitExec: fetch=5

This LocalLimitExec seems redundant, since the limit is already pushed down below the projection.

03)----ProjectionExec: expr=[name@0 as name, date_bin(IntervalMonthDayNano { months: 0, days: 0, nanoseconds: 900000000000 }, ts@1) as time_chunks]
04)------RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1
05)--------StreamingTableExec: partition_sizes=1, projection=[name, ts], infinite_source=true, output_ordering=[name@0 DESC, ts@1 DESC]
03)----LocalLimitExec: fetch=5

Same problem as above.

Comment on lines 102 to 103
03)----LocalLimitExec: fetch=10
04)------ProjectionExec: expr=[a@0 as a2, b@1 as b]
@mustafasrepo (Collaborator) commented Jul 19, 2024

This limit is already pushed down below the projection as a fetch inside the CoalesceBatchesExec, so it is redundant here.

@alamb left a comment

This looks very cool 👓

I think it would also help if we could document somewhere why we need both this physical optimizer and the logical optimizer here: https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/push_down_limit.rs

I wonder perhaps if we should remove the logical pushdown entirely in favor of physical pushdown 🤔

// specific language governing permissions and limitations
// under the License.

//! This rule reduces the amount of data transferred by pushing down limits as much as possible.

Suggested change
//! This rule reduces the amount of data transferred by pushing down limits as much as possible.
//! [`LimitPushdown`]: Pushes limits into the plan as much as possible.

"| | TableScan: t projection=[a, b] |",
"| physical_plan | GlobalLimitExec: skip=0, fetch=10 |",
"| | SortPreservingMergeExec: [a@0 ASC NULLS LAST,b@1 ASC NULLS LAST], fetch=10 |",
"| | LocalLimitExec: fetch=10 |",

Another way to get this plan would be to support limit pushdown into MemoryExec (rather than applying the limit with a new LocalLimitExec)

@@ -319,6 +376,86 @@ mod tests {
Ok(())
}

#[tokio::test]

👍

11)------------CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/testing/data/csv/aggregate_test_100.csv]]}, projection=[c1, c3], has_header=true
03)----LocalLimitExec: fetch=5
04)------UnionExec
05)--------SortExec: expr=[c9@1 DESC], preserve_partitioning=[true]
@alamb commented Jul 23, 2024

I think SortExec can be limited -- which would likely be better to use too (it has a special implementation with limit)

https://docs.rs/datafusion-physical-plan/40.0.0/src/datafusion_physical_plan/sorts/sort.rs.html#683-684

I think you could also push the limit all the way down to the CSVExec in this plan too


I actually would have expected the logical optimizer to have already pushed the limit into the inputs 🤔

https://github.com/apache/datafusion/blob/fc8e7b90356b94af5f591240b8165bc4c8275a51/datafusion/optimizer/src/push_down_limit.rs#L103

@berkaysynnada (Collaborator)

Congrats again, @alihandroid. We only have two remaining issues to address. You can go ahead and open this PR to the upstream repo now. While you work on the remaining tasks, we might receive additional feedback from the upstream reviewers, so there's no need to wait any longer.

@alihandroid (Author)

@berkaysynnada Still working on the coalesce batches test cases but the limit merging is done. The PR at the upstream repo is apache#11652

@berkaysynnada (Collaborator) commented Jul 25, 2024

@berkaysynnada Still working on the coalesce batches test cases but the limit merging is done. The PR at the upstream repo is apache#11652

Great job. The tests are failing because there are two concat_batches functions: one that requires the number of rows and another that does not. You should use the one from arrow::compute. I took a quick look, and it seems well done. Good work 👍🏻
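For context, a minimal sketch of using the arrow::compute::concat_batches variant mentioned above, which takes only the schema and the batches (no up-front row count); the schema and data are made up for illustration:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array};
use arrow::compute::concat_batches;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    let b1 = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2])) as ArrayRef],
    )?;
    let b2 = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![3])) as ArrayRef],
    )?;

    // The arrow::compute variant needs only the schema and the batches.
    let merged = concat_batches(&schema, &[b1, b2])?;
    assert_eq!(merged.num_rows(), 3);
    Ok(())
}
```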

@github-actions github-actions bot added the documentation, physical-expr, optimizer and substrait labels Jul 26, 2024
@ozankabak (Collaborator)

Merged upstream.

@ozankabak ozankabak closed this Jul 28, 2024