Optimizer Sanity Checker #24

yfy- · 2024-06-09T17:44:03Z

Rationale for this change

We extend the existing PipelineChecker with additional checks of order and distribution requirements. In a physical plan each plan should be pipeline friendly (previously checked by PipelineChecker) and each child plan's output order and partitioning should satisfy the parent's respective requirements. We combine these checks into the SanityCheckPlan proposed with this PR.

With the SanityCheckPlan, as the optimizer steps change and grow, we ensure that the order and distribution requirements of the final phsyical plan are always satisfied so that it can yield correct results.

What changes are included in this PR?

The SanityCheckPlan step as the last step of the physical plan optimizations.
SanityCheckPlan contains the former PipelineChecker, that's why PipelineChecker is deleted.

Are these changes tested?

We use the existing test cases of the former PipelineChecker. In addition following cases are added:

Using `BoundedWindowAggExec`

2 cases: First with the child order requirement is satisfied and another one that is not satisifed.

Using `GlobalLimitExec`

2 cases: First with the child distribution requirement is satisfied and another one that is not satisifed.

Using `LocalLimitExec`

We check when there are no requirements at all our check passes.

Using `SortMergeJoinExec`

3 cases: First case with both children satisify both requirements. Second case, where the second child does not satisfy the order requirement. Finally, a case where the second child does not satisfy distribution requirements.

Only includes sort reqs, docs will be added.

mustafasrepo · 2024-06-10T05:53:00Z

datafusion/core/src/physical_optimizer/sanity_checker.rs

+        let child_sort_req = child_sort_req.as_ref().unwrap();
+        if !child
+            .equivalence_properties()
+            .ordering_satisfy_requirement(&child_sort_req)


This API is the correct API (in response to your email).

mustafasrepo · 2024-06-10T07:07:12Z

datafusion/core/src/physical_optimizer/sanity_checker.rs

+        let bw = bounded_window_exec("c9", sort_exprs, sort);
+        let sanity_checker = SanityCheckPlan::new();
+        let opts = ConfigOptions::default();
+        assert_eq!(sanity_checker.optimize(bw, &opts).is_ok(), true);


Additionally you can assert plan of the bw to make test more explicit. By this way plan will be more visible: Somethinkg like below:

let expected_plan = vec![ // Fill Plan ]; let actual_plan = get_plan_string(&bw); assert_eq(expected_plan, actual_plan)

I have added plan checks in the below commit:

aea95c6

mustafasrepo · 2024-06-10T07:07:27Z

datafusion/core/src/physical_optimizer/sanity_checker.rs

+                nulls_first: false,
+            },
+        )];
+        let bw = bounded_window_exec("c9", sort_exprs, source);


Same thing applies here

I have added plan checks in the below commit:

aea95c6

berkaysynnada · 2024-06-13T05:46:43Z

datafusion/core/src/physical_optimizer/optimizer.rs

+            // SanityChecker rule checks whether the order and the
+            // partition requirements of each execution plan is
+            // satisfied by its children or not.
+            Arc::new(SanityCheckPlan::new()),


We can discuss if this rule can be merged with PipelineChecker.

From my understanding, in addition to doing is_pipeline_friendly, PipelineChecker also checks a condition around SymmetricHashJoin operation. Since PipelineChecker is more enchanced (to check pipeline friendliness) should we remove the pipeline friendliness check from SanityChecker? Or to simplify, I can merge PipelineChecker into SanityChecker (and remove PipelineChecker).

I think, we should either

remove pipeline friendliness check from SanityChecker (as this is done in another rule already)

or merge PipelineChecker into SanityChecker .

However, merging these two rules seems more appropriate. Hence, we should follow this approach.

They are merged in the below commit:

d89b1c9

datafusion/core/src/physical_optimizer/sanity_checker.rs

berkaysynnada · 2024-06-13T05:52:55Z

datafusion/core/src/physical_optimizer/sanity_checker.rs

+    plan: Arc<dyn ExecutionPlan>,
+) -> Result<Transformed<Arc<dyn ExecutionPlan>>> {
+    if !plan.execution_mode().pipeline_friendly() {
+        return plan_err!("Plan {:?} is not pipeline friendly.", plan);


Pipeline friendly means the plan can be executed in bounded memory even if the source is unbounded, or the plan has a bounded source. If we want to apply this rule before PipelineChecker, we need to remove this check, or alternatively we can move this rule after PipelineChecker.

yfy- · 2024-06-13T17:26:05Z

@mustafasrepo, about the testing strategy: So far I have picked one scenario for order requirements (using BoundedWindowAggExec), one scenario for partition requirements (GlobalLimitExec) and a scenario when there are no requirements at all (LocalLimitExec).

Should we extend this to cover all ExecutionPlan's that have some requirement, sort and/or distribution? I can add test cases for the following as well:

AggregateExec
RecursiveQueryExec
SymmetricHashJoinExec
NestedLoopJoinExec
SortMergeJoinExec
CrossJoinExec
HashJoinExec
DataSinkExec
WindowAggExec
SortExec
PartialSortExec
SortPreservingMergeExec

mustafasrepo · 2024-06-14T06:06:19Z

@mustafasrepo, about the testing strategy: So far I have picked one scenario for order requirements (using BoundedWindowAggExec), one scenario for partition requirements (GlobalLimitExec) and a scenario when there are no requirements at all (LocalLimitExec).

Should we extend this to cover all ExecutionPlan's that have some requirement, sort and/or distribution? I can add test cases for the following as well:

AggregateExec

RecursiveQueryExec

SymmetricHashJoinExec

NestedLoopJoinExec

SortMergeJoinExec

CrossJoinExec

HashJoinExec

DataSinkExec

WindowAggExec

SortExec

PartialSortExec

SortPreservingMergeExec

I think, existing test coverage is good. We do not need to use all operators in the tests. However, all the operators in the tests are single child operators. We can also add a test case for an operator with multiple children (such as SortMergeJoinExec, it has both input distribution requirement and order requirement. It will allow us to test rule with full coverage.) Once we add these test cases. I think test coverage will be in very good shape, and will be sufficient.

mustafasrepo · 2024-06-14T06:22:16Z

Also, I ran the CI tests. It seems that we have some clippy errors and test errors (I think they are related to pipeline error raised in the Sanity Checker Rule. Since error message is changed expected error message isn't same anymore. Once you merge PipelineChecker rule and Sanity Checker rule these errors will be fixed as far as I can tell).
Make sure to run command
ulimit -S -n 2048 cargo test (for mac)
cargo test (for other systems)
in the .git main directory to see whether all tests pass. This test may take 3-5 minutes.

Also, CI tests has additional check whether code follows guidelines. For this check, we have the script bash dev/rust_lint.sh
This script checks the style of the codebase. It also guides you, how to conform to the guide. Once bash dev/rust_lint.sh, and ulimit -S -n 2048 cargo test checks complete successfully. CI tests will pass 99% of the time.

If you encounter any issues during these stages, please notify me. I can help you out.

yfy- · 2024-06-17T14:12:37Z

@mustafasrepo, about the testing strategy: So far I have picked one scenario for order requirements (using BoundedWindowAggExec), one scenario for partition requirements (GlobalLimitExec) and a scenario when there are no requirements at all (LocalLimitExec).
Should we extend this to cover all ExecutionPlan's that have some requirement, sort and/or distribution? I can add test cases for the following as well:

AggregateExec

RecursiveQueryExec

SymmetricHashJoinExec

NestedLoopJoinExec

SortMergeJoinExec

CrossJoinExec

HashJoinExec

DataSinkExec

WindowAggExec

SortExec

PartialSortExec

SortPreservingMergeExec

I think, existing test coverage is good. We do not need to use all operators in the tests. However, all the operators in the tests are single child operators. We can also add a test case for an operator with multiple children (such as SortMergeJoinExec, it has both input distribution requirement and order requirement. It will allow us to test rule with full coverage.) Once we add these test cases. I think test coverage will be in very good shape, and will be sufficient.

I have added test using SortMergeJoinExec in the below commit:

49aba4b

yfy- · 2024-06-17T14:39:26Z

Also, I ran the CI tests. It seems that we have some clippy errors and test errors (I think they are related to pipeline error raised in the Sanity Checker Rule. Since error message is changed expected error message isn't same anymore. Once you merge PipelineChecker rule and Sanity Checker rule these errors will be fixed as far as I can tell). Make sure to run command ulimit -S -n 2048 cargo test (for mac) cargo test (for other systems) in the .git main directory to see whether all tests pass. This test may take 3-5 minutes.

Also, CI tests has additional check whether code follows guidelines. For this check, we have the script bash dev/rust_lint.sh This script checks the style of the codebase. It also guides you, how to conform to the guide. Once bash dev/rust_lint.sh, and ulimit -S -n 2048 cargo test checks complete successfully. CI tests will pass 99% of the time.

If you encounter any issues during these stages, please notify me. I can help you out.

Thanks @mustafasrepo, I have fixed clippy errors and some other test cases. However, there is one test case that does not finish: fifo::unix_test::unbounded_file_with_symmetric_join. The query defined there gets stuck at physical plan optimization stage and it hits the check where we look at the distribution requirements in SanityCheckPlan. When I remove the distribution requirements check the tests execute successfully. When I examined that physical plan I observed SymmetricHashJoinExec requires its 2 children to be hash partitioned on join columns, but its 2 children have UndefinedDistribution. Somehow, SanityCheckPlan fails and it causes the query optimization to execute indefinitely (at least on my machine). I am further examining why this would happen.

mustafasrepo · 2024-06-20T07:39:58Z

Thanks @mustafasrepo, I have fixed clippy errors and some other test cases. However, there is one test case that does not finish: fifo::unix_test::unbounded_file_with_symmetric_join. The query defined there gets stuck at physical plan optimization stage and it hits the check where we look at the distribution requirements in SanityCheckPlan. When I remove the distribution requirements check the tests execute successfully. When I examined that physical plan I observed SymmetricHashJoinExec requires its 2 children to be hash partitioned on join columns, but its 2 children have UndefinedDistribution. Somehow, SanityCheckPlan fails and it causes the query optimization to execute indefinitely (at least on my machine). I am further examining why this would happen.

These problems do not seem to be related with your work. Hence, I also debugged them. It seems that we have two problems

fifo::unix_test::unbounded_file_with_symmetric_join test stucks when error is returned. (it has some tasks running. those tasks couldn't be joined when error is returned. Then test doesn't finish.)
Partitioning UnknownDistribution(1) doesn't satisfy requirement HashRequirement(vec![expr1, ..]). This requirement is actually satisfied (The root cause of the error).

I sent a commit to fix these problems. Also there are some other tests failing with this check. After examining them, I have found that rule actually help us find invalid plans. In the commit I also fixed them. However, there are still some clippy errors (If you do not have those errors. They come with latest rust updates. You can do rustup update or corresponding command in your local to bring latest stable changes.) Also as far as I remember those clippy errors are already resolved in the upstream. Hence I recommend you to merge your branch with apache_main before resolving any clippy issues. They might already be solved.

mustafasrepo · 2024-06-20T07:41:11Z

datafusion/physical-plan/src/aggregates/mod.rs

@@ -675,7 +675,7 @@ impl ExecutionPlan for AggregateExec {
                vec![Distribution::UnspecifiedDistribution]
            }
            AggregateMode::FinalPartitioned | AggregateMode::SinglePartitioned => {
-                vec![Distribution::HashPartitioned(self.output_group_expr())]
+                vec![Distribution::HashPartitioned(self.group_by.input_exprs())]


We should have required input expressions. Rule helped us to catch this bug.

mustafasrepo · 2024-06-20T07:41:55Z

datafusion/core/tests/tpcds_planning.rs

-            Arc::new(MemTable::try_new(Arc::new(table.schema.clone()), vec![])?),
+            Arc::new(MemTable::try_new(
+                Arc::new(table.schema.clone()),
+                vec![vec![]],


At the source, we were generating empty partitions (0 partition). This is invalid. Rule helped us to catch this bug.

yfy- · 2024-06-21T07:17:53Z

Thanks @mustafasrepo, I have fixed clippy errors and some other test cases. However, there is one test case that does not finish: fifo::unix_test::unbounded_file_with_symmetric_join. The query defined there gets stuck at physical plan optimization stage and it hits the check where we look at the distribution requirements in SanityCheckPlan. When I remove the distribution requirements check the tests execute successfully. When I examined that physical plan I observed SymmetricHashJoinExec requires its 2 children to be hash partitioned on join columns, but its 2 children have UndefinedDistribution. Somehow, SanityCheckPlan fails and it causes the query optimization to execute indefinitely (at least on my machine). I am further examining why this would happen.

These problems do not seem to be related with your work. Hence, I also debugged them. It seems that we have two problems

fifo::unix_test::unbounded_file_with_symmetric_join test stucks when error is returned. (it has some tasks running. those tasks couldn't be joined when error is returned. Then test doesn't finish.)

Partitioning UnknownDistribution(1) doesn't satisfy requirement HashRequirement(vec![expr1, ..]). This requirement is actually satisfied (The root cause of the error).

I sent a commit to fix these problems. Also there are some other tests failing with this check. After examining them, I have found that rule actually help us find invalid plans. In the commit I also fixed them. However, there are still some clippy errors (If you do not have those errors. They come with latest rust updates. You can do rustup update or corresponding command in your local to bring latest stable changes.) Also as far as I remember those clippy errors are already resolved in the upstream. Hence I recommend you to merge your branch with apache_main before resolving any clippy issues. They might already be solved.

I have merged apache_main to the branch and also ran rustup update, now clippy errors seem to be gone. However, there are other test cases that are failing. These come from bin/sqllogictests.rs. From what I could get there is a problem with queries like:

SELECT 1 num UNION ALL SELECT 2 num ORDER BY num;

This yields the physical plan:

SortPreservingMergeExec: [num@0 ASC NULLS LAST]
  UnionExec
    ProjectionExec: expr=[1 as num]
      PlaceholderRowExec
    ProjectionExec: expr=[2 as num]
      PlaceholderRowExec

Order requirement for the SortPreservingMergeExec is not satisifed.

mustafasrepo · 2024-06-25T06:39:17Z

I have merged apache_main to the branch and also ran rustup update, now clippy errors seem to be gone. However, there are other test cases that are failing.

@yfy- It seems that this failure is another case this check helped us catch a bug. I will fix this problem. I will let you know once this problem is solved.

mustafasrepo · 2024-06-25T15:02:54Z

@yfy- I resolved the problems mentioned above in the PR. We may either merge that PR into your PR, or continue as separate PRs. (We will decide after internal discussion.) With the changes in that PR, all test pass in my local. Hence, for now we can call this PR complete in terms of functionality. However, I will do one last review of this PR tomorrow to make sure we are not missing anything. Kind regards

mustafasrepo · 2024-06-27T15:16:02Z

@yfy- this PR is ready to go upstream. 🎉 🎉 🚀

ozankabak

Left some minor comments to address, but LGTM in general. Great work @yfy-, please coordinate with @mustafasrepo to send/merge upstream after addressing the comments.

datafusion/core/src/physical_optimizer/optimizer.rs

datafusion/physical-expr/src/equivalence/properties.rs

datafusion/physical-plan/src/union.rs

datafusion/core/src/physical_optimizer/sanity_checker.rs

Co-authored-by: Mehmet Ozan Kabak <[email protected]>

mustafasrepo · 2024-07-01T10:44:24Z

@yfy- final reviews are addressed now. You can open this PR to upstream repo 🚀 🚀 .

…calculations, limit/order/distinct (apache#11756) * Fix unparser derived table with columns include calculations, limit/order/distinct (#24) * compare format output to make sure the two level of projects match * add method to find inner projection that could be nested under limit/order/distinct * use format! for matching in unparser sort optimization too * refactor * use to_string and also put comments in * clippy * fix unparser derived table contains cast (#25) * fix unparser derived table contains cast * remove dbg

Initial optimizer sanity checker.

8f0c30f

Only includes sort reqs, docs will be added.

github-actions bot added the core label Jun 9, 2024

mustafasrepo reviewed Jun 10, 2024

View reviewed changes

yfy- added 2 commits June 11, 2024 18:36

Add distro and pipeline friendly checks

2538702

Also check the plans we create are correct.

aea95c6

berkaysynnada reviewed Jun 13, 2024

View reviewed changes

datafusion/core/src/physical_optimizer/sanity_checker.rs Show resolved Hide resolved

berkaysynnada reviewed Jun 13, 2024

View reviewed changes

Add distribution test cases using global limit exec.

9aef599

yfy- added 4 commits June 15, 2024 13:40

Add test for multiple children using SortMergeJoinExec.

49aba4b

Move PipelineChecker to SanityCheckPlan

d89b1c9

Fix some tests and add docs

defebb0

Add some test docs and fix clippy diagnostics.

b392cf6

Fix some failing tests

5bdec62

github-actions bot added the physical-expr label Jun 20, 2024

mustafasrepo reviewed Jun 20, 2024

View reviewed changes

yfy- added 2 commits June 20, 2024 18:04

Merge branch 'apache_main' into optimizer-sanity-checker

46bf900

Replace PipelineChecker with SanityChecker in .slt files.

7148c07

github-actions bot added the sqllogictest label Jun 21, 2024

mustafasrepo added 6 commits June 25, 2024 16:21

Slt tests pass

c08a92b

Resolve linter errors

6d79881

Minor changes

cd49e2e

Minor changes

338f451

Minor changes

1959488

Minor changes

1c5c0c0

mustafasrepo added 3 commits June 25, 2024 21:25

Sort PreservingMerge clear per partition

24e32b6

Minor changes

9b77608

Merge branch 'optimizer-sanity-checker' into sanity_checker_subs

655a010

mustafasrepo mentioned this pull request Jun 26, 2024

[MINOR]: Fix some minor silent bugs apache/datafusion#11127

Merged

berkaysynnada and others added 3 commits June 26, 2024 11:36

Update output_requirements.rs

e93cf6e

Address reviews

58cbf9a

Merge branch 'apache_main' into optimizer-sanity-checker

6f8db29

github-actions bot removed the physical-expr label Jun 27, 2024

Merge branch 'optimizer-sanity-checker' into sanity_checker_subs

0cb86b9

github-actions bot added the physical-expr label Jun 27, 2024

mustafasrepo mentioned this pull request Jun 28, 2024

Sanity checker subs #25

Closed

ozankabak approved these changes Jun 28, 2024

View reviewed changes

mustafasrepo and others added 3 commits July 1, 2024 13:29

Update datafusion/core/src/physical_optimizer/optimizer.rs

5a34ae0

Co-authored-by: Mehmet Ozan Kabak <[email protected]>

Update datafusion/core/src/physical_optimizer/sanity_checker.rs

f3ac546

Co-authored-by: Mehmet Ozan Kabak <[email protected]>

Address reviews

0491c13

mustafasrepo marked this pull request as ready for review July 1, 2024 10:35

mustafasrepo added 2 commits July 1, 2024 13:38

Merge branch 'apache_main' into optimizer-sanity-checker

aa075e2

Minor changes

dfd219c

mustafasrepo changed the base branch from apache_main to feature/optimizer_sanity_checker July 1, 2024 15:06

mustafasrepo merged commit 0c2eb45 into synnada-ai:feature/optimizer_sanity_checker Jul 1, 2024
24 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimizer Sanity Checker #24

Optimizer Sanity Checker #24

yfy- commented Jun 9, 2024 •

edited

Loading

mustafasrepo Jun 10, 2024

mustafasrepo Jun 10, 2024

yfy- Jun 12, 2024

mustafasrepo Jun 10, 2024

yfy- Jun 12, 2024

mustafasrepo Jun 13, 2024

berkaysynnada Jun 13, 2024

yfy- Jun 13, 2024

mustafasrepo Jun 14, 2024 •

edited

Loading

yfy- Jun 17, 2024

berkaysynnada Jun 13, 2024

yfy- commented Jun 13, 2024 •

edited

Loading

mustafasrepo commented Jun 14, 2024

mustafasrepo commented Jun 14, 2024

yfy- commented Jun 17, 2024

yfy- commented Jun 17, 2024

mustafasrepo commented Jun 20, 2024

mustafasrepo Jun 20, 2024

mustafasrepo Jun 20, 2024

yfy- commented Jun 21, 2024 •

edited

Loading

mustafasrepo commented Jun 25, 2024

mustafasrepo commented Jun 25, 2024

mustafasrepo commented Jun 27, 2024 •

edited

Loading

ozankabak left a comment

mustafasrepo commented Jul 1, 2024

Optimizer Sanity Checker #24

Optimizer Sanity Checker #24

Conversation

yfy- commented Jun 9, 2024 • edited Loading

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Using BoundedWindowAggExec

Using GlobalLimitExec

Using LocalLimitExec

Using SortMergeJoinExec

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mustafasrepo Jun 14, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yfy- commented Jun 13, 2024 • edited Loading

mustafasrepo commented Jun 14, 2024

mustafasrepo commented Jun 14, 2024

yfy- commented Jun 17, 2024

yfy- commented Jun 17, 2024

mustafasrepo commented Jun 20, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yfy- commented Jun 21, 2024 • edited Loading

mustafasrepo commented Jun 25, 2024

mustafasrepo commented Jun 25, 2024

mustafasrepo commented Jun 27, 2024 • edited Loading

ozankabak left a comment

Choose a reason for hiding this comment

mustafasrepo commented Jul 1, 2024

yfy- commented Jun 9, 2024 •

edited

Loading

Using `BoundedWindowAggExec`

Using `GlobalLimitExec`

Using `LocalLimitExec`

Using `SortMergeJoinExec`

mustafasrepo Jun 14, 2024 •

edited

Loading

yfy- commented Jun 13, 2024 •

edited

Loading

yfy- commented Jun 21, 2024 •

edited

Loading

mustafasrepo commented Jun 27, 2024 •

edited

Loading