@tanelk
Contributor

@tanelk tanelk commented Dec 29, 2020

What changes were proposed in this pull request?

Changed the cost function in CBO to match documentation.

Why are the changes needed?

The parameter spark.sql.cbo.joinReorder.card.weight is documented as:

The weight of cardinality (number of rows) for plan cost comparison in join reorder: rows * weight + size * (1 - weight).

The implementation in JoinReorderDP.betterThan does not match this documentation:

def betterThan(other: JoinPlan, conf: SQLConf): Boolean = {
  if (other.planCost.card == 0 || other.planCost.size == 0) {
    false
  } else {
    val relativeRows = BigDecimal(this.planCost.card) / BigDecimal(other.planCost.card)
    val relativeSize = BigDecimal(this.planCost.size) / BigDecimal(other.planCost.size)
    relativeRows * conf.joinReorderCardWeight +
      relativeSize * (1 - conf.joinReorderCardWeight) < 1
  }
}

This differing implementation has an unfortunate consequence:
given two plans A and B, A betterThan B and B betterThan A might both return the same result. This happens when one plan has many rows with a small size and the other has few rows with a large size.

Example values that exhibit this phenomenon with the default weight value (0.7):
A.card = 500, B.card = 300
A.size = 30, B.size = 80
Both A betterThan B and B betterThan A would have a score above 1 and would return false.
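
To make the example concrete, here is a minimal, self-contained sketch of the ratio-based comparison evaluated on those values. `AdditiveRatioSketch` and `PlanCost` are simplified stand-ins introduced only for illustration (they are not Spark's `JoinPlan` or `SQLConf`), and the weight is passed in directly instead of being read from the config.

```scala
object AdditiveRatioSketch {
  // Hypothetical stand-in for the cost part of a join plan.
  final case class PlanCost(card: BigInt, size: BigInt)

  // Weighted sum of the two ratios, as in the betterThan snippet above.
  def score(plan: PlanCost, other: PlanCost, weight: Double): BigDecimal = {
    val relativeRows = BigDecimal(plan.card) / BigDecimal(other.card)
    val relativeSize = BigDecimal(plan.size) / BigDecimal(other.size)
    relativeRows * weight + relativeSize * (1 - weight)
  }

  // Same guard and comparison as the quoted implementation.
  def betterThan(plan: PlanCost, other: PlanCost, weight: Double): Boolean =
    if (other.card == 0 || other.size == 0) false
    else score(plan, other, weight) < 1

  // The example values from the description.
  val a: PlanCost = PlanCost(500, 30)
  val b: PlanCost = PlanCost(300, 80)
}
```

With the weight at 0.7 the two scores come out at roughly 1.28 (A against B) and 1.22 (B against A), so neither direction returns true.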

This happens with several of the TPCDS queries.

The new implementation does not have this behavior.

Does this PR introduce any user-facing change?

No

How was this patch tested?

New and existing UTs

@github-actions github-actions bot added the SQL label Dec 29, 2020
@tanelk
Contributor Author

tanelk commented Dec 29, 2020

A bit more information:

The instability of the CBO has been noted before (#29638), and I think this is its main cause. There is also #29871, which tackles another cause of the instability. That one should only have an impact when the costs are equal, whereas this one can affect more plans (see the example in the description).

An important thing to note is that this could change the behavior of the spark.sql.cbo.joinReorder.card.weight config value, but luckily it seems to do so only minimally.
I generated random values for the plan row counts and sizes, and found that the new cost function agrees most with the old cost function at the same weight value. This holds for all weight values, not only the default (0.7).
[Screenshot: 2020-12-29-195948_1920x1080_scrot]
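
An experiment of this kind could be sketched roughly as follows. This is not the script used for the screenshot: the uniform sampling range, the sample count, the seed and all names (`AgreementSketch`, `oldBetter`, `newBetter`, `agreement`) are assumptions made for illustration. The "new" function here is the documented formula, rows * weight + size * (1 - weight), compared between the two plans.

```scala
import scala.util.Random

object AgreementSketch {
  // Hypothetical stand-in for a plan cost; Doubles keep the sketch short.
  final case class Cost(card: Double, size: Double)

  // Old comparison: weighted sum of the ratios must be below 1.
  def oldBetter(a: Cost, b: Cost, w: Double): Boolean =
    w * (a.card / b.card) + (1 - w) * (a.size / b.size) < 1

  // New comparison: weighted absolute costs compared directly.
  def newBetter(a: Cost, b: Cost, w: Double): Boolean =
    w * a.card + (1 - w) * a.size < w * b.card + (1 - w) * b.size

  // Fraction of random plan pairs on which both functions give the same answer.
  def agreement(oldW: Double, newW: Double, samples: Int, seed: Long): Double = {
    val rnd = new Random(seed)
    // Assumed distribution: row counts and sizes uniform in 1..1000.
    def cost(): Cost = Cost(1 + rnd.nextInt(1000), 1 + rnd.nextInt(1000))
    val hits = (1 to samples).count { _ =>
      val a = cost()
      val b = cost()
      oldBetter(a, b, oldW) == newBetter(a, b, newW)
    }
    hits.toDouble / samples
  }
}
```

Calling `AgreementSketch.agreement(0.7, w, 100000, 42L)` for a range of `w` values would give one row of such a comparison; the outcome depends on the assumed distribution.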

@tanelk
Contributor Author

tanelk commented Dec 29, 2020

@LuciferYang and @maropu
Could you take a look at this, or perhaps point it to someone who has worked more with the CBO?

@SparkQA

SparkQA commented Dec 29, 2020

Test build #133502 has finished for PR 30965 at commit 2308a8a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 29, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38091/

@SparkQA

SparkQA commented Dec 29, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38091/

@maropu
Member

maropu commented Dec 30, 2020

cc: @wzhfy

@maropu
Member

maropu commented Dec 30, 2020

I've checked the update and it seems fine. Could you add some tests based on the example in the PR description?

@tanelk
Contributor Author

tanelk commented Dec 30, 2020

I've checked the update and it seems fine. Could you add some tests based on the example in the PR description?

Ahh, good point. While adding this I also realised that I had the inequality check in the wrong order - fixing that greatly reduced the number of plan changes.

@tanelk tanelk changed the title [WIP][SPARK-33935][SQL] Fix CBO cost function [SPARK-33935][SQL] Fix CBO cost function Dec 30, 2020
@SparkQA

SparkQA commented Dec 30, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38129/

@SparkQA

SparkQA commented Dec 30, 2020

Test build #133539 has finished for PR 30965 at commit 4b05711.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 30, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38129/

@SparkQA

SparkQA commented Dec 30, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38130/

@SparkQA

SparkQA commented Dec 30, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38130/

@SparkQA

SparkQA commented Dec 30, 2020

Test build #133534 has finished for PR 30965 at commit 6dbb9fe.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 30, 2020

Test build #133536 has finished for PR 30965 at commit 24f65b5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment on lines +378 to +379
assert(plan1.betterThan(plan2, conf))
assert(!plan2.betterThan(plan1, conf))
Member

Thanks for adding this. This fix looks fine to me. cc: @cloud-fan @HyukjinKwon

@cloud-fan
Contributor

retest this please

@LuciferYang
Contributor

@tanelk Can you help to run UTs in sql/catalyst and sql/core in Scala 2.13 with this pr manually ? thx ~

@SparkQA

SparkQA commented Jan 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38210/

@SparkQA

SparkQA commented Jan 5, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38210/

@SparkQA

SparkQA commented Jan 5, 2021

Test build #133621 has finished for PR 30965 at commit 4b05711.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu closed this in f252a93 Jan 5, 2021
maropu pushed a commit that referenced this pull request Jan 5, 2021
Closes #30965 from tanelk/SPARK-33935_cbo_cost_function.

Authored-by: [email protected] <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
(cherry picked from commit f252a93)
Signed-off-by: Takeshi Yamamuro <[email protected]>
@maropu
Member

maropu commented Jan 5, 2021

Thanks! Merged to master/3.1.

@maropu
Member

maropu commented Jan 5, 2021

@tanelk Can you help to run UTs in sql/catalyst and sql/core in Scala 2.13 with this pr manually ? thx ~

Could you check this? Has this fix resolved the previous issue, too?

@maropu
Member

maropu commented Jan 5, 2021

@tanelk Could you open a PR to fix it for branch-3.0/2.4?

@tanelk
Contributor Author

tanelk commented Jan 5, 2021

I created the pulls for 3.0 and 2.4: #31042 and #31043

@tanelk
Contributor Author

tanelk commented Jan 5, 2021

@tanelk Can you help to run UTs in sql/catalyst and sql/core in Scala 2.13 with this pr manually ? thx ~

The sql/catalyst passed.
With sql/core I had some issues:
random tests would fail with the following exception:

*** RUN ABORTED ***
  java.net.BindException: Cannot assign requested address: Service 'sparkDriver' failed after 100 retries (on a random free port)!

I assume you want the results of *PlanStability*Suite - these all passed.

@maropu
Member

maropu commented Jan 5, 2021

The sql/catalyst passed.
I assume you want the results of PlanStabilitySuite - these all passed.

Awesome!

@wzhfy
Contributor

wzhfy commented Feb 5, 2021

@tanelk Hi, sorry to see this so late.

IIRC the reason to use a relative value for rowCount and size, is to normalize them to a similar scale while comparing cost. Otherwise, one (size) may overwhelm the other (rowCount), then the weight and cost function become meaningless.

To resolve the stability issue, can we have a betterPlan() function instead of the current betterThan()? Inside that function, we can fix the comparison order based on rowCount or size. And on the caller side, we can use !existingPlan.get.eq(betterPlan(existingPlan.get, newJoinPlan, conf)) to decide whether to update the best plan so far.

    def betterPlan(existing: JoinPlan, newPlan: JoinPlan, conf: SQLConf): JoinPlan = {
      // To fix the comparison order, set the one with smaller cardinality as the baseline.
      val (baseline, toCompare) = if (existing.planCost.card <= newPlan.planCost.card) {
        (existing, newPlan)
      } else {
        (newPlan, existing)
      }

      if (toCompare.planCost.card == 0 || toCompare.planCost.size == 0) {
        return existing
      }

      val relativeRows = BigDecimal(baseline.planCost.card) / BigDecimal(toCompare.planCost.card)
      val relativeSize = BigDecimal(baseline.planCost.size) / BigDecimal(toCompare.planCost.size)
      val relativeCost = relativeRows * conf.joinReorderCardWeight +
        relativeSize * (1 - conf.joinReorderCardWeight)
      if (relativeCost == 1) {
        // If they have same cost, we don't update the best plan and return the existing one.
        existing
      } else if (relativeCost < 1) {
        baseline
      } else {
        toCompare
      }
    }

What do you think?
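
A runnable approximation of this proposal, for trying it on concrete numbers, might look like the sketch below. `BetterPlanSketch`, `Plan` and `shouldUpdate` are hypothetical simplifications (Spark's `JoinPlan` and `SQLConf` are replaced by a plain class and a `weight` parameter); only the comparison logic follows the snippet above.

```scala
object BetterPlanSketch {
  // Hypothetical stand-in for JoinPlan: only the cost fields are modelled.
  final class Plan(val name: String, val card: BigInt, val size: BigInt)

  def betterPlan(existing: Plan, newPlan: Plan, weight: Double): Plan = {
    // Fix the comparison order: the plan with the smaller cardinality is the baseline.
    val (baseline, toCompare) =
      if (existing.card <= newPlan.card) (existing, newPlan) else (newPlan, existing)

    if (toCompare.card == 0 || toCompare.size == 0) {
      existing
    } else {
      val relativeRows = BigDecimal(baseline.card) / BigDecimal(toCompare.card)
      val relativeSize = BigDecimal(baseline.size) / BigDecimal(toCompare.size)
      val relativeCost = relativeRows * weight + relativeSize * (1 - weight)
      if (relativeCost == BigDecimal(1)) existing // equal cost: keep the existing plan
      else if (relativeCost < 1) baseline
      else toCompare
    }
  }

  // Caller side: update the best plan so far only if betterPlan did not pick it.
  def shouldUpdate(existing: Option[Plan], newPlan: Plan, weight: Double): Boolean =
    existing.forall(e => !e.eq(betterPlan(e, newPlan, weight)))
}
```

For the values from the PR description (A: card 500, size 30; B: card 300, size 80) and weight 0.7, B is the baseline in both argument orders, so the function returns the same plan regardless of which one is passed as `existing`.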

@cloud-fan
Contributor

I think what @wzhfy said makes sense, @tanelk do you have time to try this idea?

@tanelk
Contributor Author

tanelk commented Mar 31, 2021

@wzhfy and @cloud-fan

I'm not a fan of adding up the relative costs.

A simple example, where the weight is 0.5:
If this plan's size (bytes) is 2x larger, then no matter how many times more rows the other plan has, the other plan will always be considered better - 0.5*2 + 0.5*0.00000000000001 > 1.
This is basically the same situation, where one cost overwhelms the other.

Perhaps this would be the best of both worlds:
(this.card / other.card) ^ cardWeight * (this.size / other.size) ^ (1 - cardWeight) < 1.
In short - multiply the relative costs instead of adding them.
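
A small sketch of the multiplicative idea, next to the additive one for contrast. It is only an illustration under simplifying assumptions: costs are plain `Double`s, and `MultiplicativeSketch`, `PlanCost`, `multiplicativeBetterThan` and `additiveBetterThan` are made-up names, not the code that was eventually merged.

```scala
object MultiplicativeSketch {
  // Hypothetical stand-in for a plan cost.
  final case class PlanCost(card: Double, size: Double)

  // (this.card / other.card) ^ w * (this.size / other.size) ^ (1 - w) < 1
  def multiplicativeBetterThan(plan: PlanCost, other: PlanCost, w: Double): Boolean =
    if (other.card == 0 || other.size == 0) false
    else math.pow(plan.card / other.card, w) * math.pow(plan.size / other.size, 1 - w) < 1

  // w * (this.card / other.card) + (1 - w) * (this.size / other.size) < 1
  def additiveBetterThan(plan: PlanCost, other: PlanCost, w: Double): Boolean =
    if (other.card == 0 || other.size == 0) false
    else w * (plan.card / other.card) + (1 - w) * (plan.size / other.size) < 1

  // Weight 0.5 example: this plan is 2x larger in bytes but has 1e14 times fewer rows.
  val smallRows: PlanCost = PlanCost(card = 1, size = 2)
  val hugeRows: PlanCost = PlanCost(card = 1e14, size = 1)
}
```

With weight 0.5, the additive form never prefers `smallRows` over `hugeRows` because the size term alone already reaches 1, while the multiplicative form does prefer it. For the A/B values from the PR description and weight 0.7, the multiplicative form prefers B in one direction and rejects A in the other.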

ColumnarToRow
InputAdapter
Scan parquet default.store_sales [ss_sold_date_sk,ss_item_sk,ss_customer_sk,ss_store_sk,ss_ext_sales_price]
BroadcastHashJoin [ss_item_sk,i_item_sk]
Contributor

@cloud-fan cloud-fan Mar 31, 2021

I think q19 exposes a problem. Previously this BroadcastHashJoin was run before the SortMergeJoin, which reduced the input data of the shuffle, because this BroadcastHashJoin has a filter on the right side that likely makes the join very selective.

@tanelk , if the idea from @wzhfy doesn't look good to you, can you try with some other ideas and see if we can fix this issue?

Contributor Author

I'll experiment with it a bit, but it might take a while.

Contributor

Thanks for looking into it!

Member

@HyukjinKwon HyukjinKwon Mar 31, 2021

This actually caused a significant regression in q19 of the TPC-DS benchmark (run internally). Can we just revert it for now, and redo it with actual performance numbers? I think that's safer and easier for everybody here. It seems that at least the plan change here was overlooked, and the performance impact should have been clarified.

Member

branch-2.4 has the same regression? This PR was back-ported into branch-2.4, too. cc: @viirya

Member

@viirya viirya Mar 31, 2021

+1 for @HyukjinKwon's suggestion. Especially for branch-2.4, I think it is better to avoid unexpected performance change.

Member

I discussed with @cloud-fan offline. Shall we merge #32014 into master, branch-3.1 and branch-3.0 to fix the regression, and revert this one from branch-2.4? The Spark 2.4 release is very soon, so it might be best to stay safe, and technically this was more of an improvement.

Member

Agreed. I prefer to keep branch-2.4 stable.

Contributor

yea let's revert it from 2.4 to be safe

Member

@viirya viirya Apr 1, 2021

Revert at #32020

HyukjinKwon pushed a commit that referenced this pull request Apr 1, 2021
This reverts commit 3e6a6b7 per the discussion at #30965 (comment).

Closes #32020 from viirya/revert-SPARK-33935.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
maropu pushed a commit that referenced this pull request Apr 7, 2021
### What changes were proposed in this pull request?

Changed the cost comparison function of the CBO to use the ratios of row counts and sizes in bytes.

### Why are the changes needed?

In #30965 we changed the CBO cost comparison function so it would be "symmetric": `A.betterThan(B)` now implies that `!B.betterThan(A)`.
With that we caused performance regressions in some queries - TPCDS q19 for example.

The original cost comparison function used the ratios `relativeRows = A.rowCount / B.rowCount` and `relativeSize = A.size / B.size`. The changed function compared "absolute" cost values `costA = w*A.rowCount + (1-w)*A.size` and `costB = w*B.rowCount + (1-w)*B.size`.

Given the input from wzhfy we decided to go back to the relative values, because otherwise one (size) may overwhelm the other (rowCount). But this time we avoid adding up the ratios.
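
As an illustration of how the "absolute" comparison lets size overwhelm row count, here is a minimal sketch. The names (`AbsoluteCostSketch`, `PlanCost`, `cost`, `betterThan`) and the two example plans are made up for this purpose; real plans will have different numbers.

```
object AbsoluteCostSketch {
  // Hypothetical stand-in for a plan cost.
  final case class PlanCost(card: BigInt, size: BigInt)

  // cost = w * rowCount + (1 - w) * size
  def cost(p: PlanCost, w: Double): BigDecimal =
    BigDecimal(p.card) * w + BigDecimal(p.size) * (1 - w)

  // A plan is better when its absolute cost is strictly lower.
  def betterThan(plan: PlanCost, other: PlanCost, w: Double): Boolean =
    cost(plan, w) < cost(other, w)

  // Sizes in bytes are typically orders of magnitude larger than row counts.
  val fewRows: PlanCost = PlanCost(card = 1000, size = 1000000000)
  val manyRows: PlanCost = PlanCost(card = 1000000, size = 900000000)
}
```

With `w = 0.7`, `manyRows` is still considered better than `fewRows`: it has 1000x more rows, but its 10% smaller size dominates the sum. The comparison is consistent in both directions, since it is a strict inequality between two numbers.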

Originally `A.betterThan(B) => w*relativeRows + (1-w)*relativeSize < 1` was used. Besides being "non-symmetric", this can also let one term overwhelm the other.
For `w=0.5`, if the size (bytes) of `A` is at least 2x larger than that of `B`, then no matter how many times more rows the `B` plan has, `B` will always be considered better - `0.5*2 + 0.5*0.00000000000001 > 1`.

When working with ratios, it is better to multiply them.
The proposed cost comparison function is: `A.betterThan(B) => relativeRows^w  * relativeSize^(1-w) < 1`.
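
A short derivation of why the multiplicative form cannot rate both plans as worse than each other, assuming all row counts and sizes are strictly positive. The symbol $f$ is introduced here only for the derivation:

```latex
\[
f(A,B) = \left(\frac{A.\mathrm{rowCount}}{B.\mathrm{rowCount}}\right)^{w}
         \left(\frac{A.\mathrm{size}}{B.\mathrm{size}}\right)^{1-w},
\qquad
f(B,A) = \left(\frac{B.\mathrm{rowCount}}{A.\mathrm{rowCount}}\right)^{w}
         \left(\frac{B.\mathrm{size}}{A.\mathrm{size}}\right)^{1-w}
       = \frac{1}{f(A,B)}.
\]
\[
\text{Hence } f(A,B) < 1 \iff f(B,A) > 1,
\quad\text{and } f(A,B) = 1 \iff f(B,A) = 1 .
\]
```

So `A.betterThan(B)` and `B.betterThan(A)` can never both be true, and both are false only when the relative cost is exactly 1. The additive form has no such reciprocal relationship.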

### Does this PR introduce _any_ user-facing change?

Comparison of the changed TPCDS v1.4 query execution times at sf=10:

query | absolute | multiplicative | change | additive | change
-- | -- | -- | -- | -- | --
q12 | 145 | 137 | -5.52% | 141 | -2.76%
q13 | 264 | 271 | 2.65% | 271 | 2.65%
q17 | 4521 | 4243 | -6.15% | 4348 | -3.83%
q18 | 758 | 466 | -38.52% | 480 | -36.68%
q19 | 38503 | 2167 | -94.37% | 2176 | -94.35%
q20 | 119 | 120 | 0.84% | 126 | 5.88%
q24a | 16429 | 16838 | 2.49% | 17103 | 4.10%
q24b | 16592 | 16999 | 2.45% | 17268 | 4.07%
q25 | 3558 | 3556 | -0.06% | 3675 | 3.29%
q33 | 362 | 361 | -0.28% | 380 | 4.97%
q52 | 1020 | 1032 | 1.18% | 1052 | 3.14%
q55 | 927 | 938 | 1.19% | 961 | 3.67%
q72 | 24169 | 13377 | -44.65% | 24306 | 0.57%
q81 | 1285 | 1185 | -7.78% | 1168 | -9.11%
q91 | 324 | 336 | 3.70% | 337 | 4.01%
q98 | 126 | 129 | 2.38% | 131 | 3.97%

All times are in ms, the change is compared to the situation in the master branch (absolute).
The proposed cost function (multiplicative) significantly improves the performance on q18, q19 and q72. The original cost function (additive) has similar improvements at q18 and q19. All other changes are within the error bars and I would ignore them - perhaps q81 has also improved.

### How was this patch tested?

PlanStabilitySuite

Closes #32014 from tanelk/SPARK-34922_cbo_better_cost_function.

Lead-authored-by: Tanel Kiis <[email protected]>
Co-authored-by: [email protected] <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
maropu pushed a commit that referenced this pull request Apr 8, 2021
…e CBO

Closes #32075 from tanelk/SPARK-34922_cbo_better_cost_function_3.1.

Lead-authored-by: Tanel Kiis <[email protected]>
Co-authored-by: [email protected] <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
maropu pushed a commit that referenced this pull request Apr 8, 2021
…e CBO

Closes #32076 from tanelk/SPARK-34922_cbo_better_cost_function_3.0.

Lead-authored-by: Tanel Kiis <[email protected]>
Co-authored-by: [email protected] <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
…e CBO

Closes apache#32075 from tanelk/SPARK-34922_cbo_better_cost_function_3.1.

Lead-authored-by: Tanel Kiis <[email protected]>
Co-authored-by: [email protected] <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
fishcus pushed a commit to fishcus/spark that referenced this pull request Jan 12, 2022
…e CBO

Closes apache#32075 from tanelk/SPARK-34922_cbo_better_cost_function_3.1.

Lead-authored-by: Tanel Kiis <[email protected]>
Co-authored-by: [email protected] <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>