[SPARK-34922][SQL] Use a relative cost comparison function in the CBO #32014

tanelk · 2021-03-31T12:30:35Z

What changes were proposed in this pull request?

Changed the cost comparison function of the CBO to use the ratios of row counts and sizes in bytes.

Why are the changes needed?

In #30965 we changed the CBO cost comparison function so it would be "symmetric": A.betterThan(B) now implies that !B.betterThan(A).
With that we caused a performance regression in some queries, TPCDS q19 for example.

The original cost comparison function used the ratios relativeRows = A.rowCount / B.rowCount and relativeSize = A.size / B.size. The changed function compared "absolute" cost values costA = w*A.rowCount + (1-w)*A.size and costB = w*B.rowCount + (1-w)*B.size.

Given the input from @wzhfy we decided to go back to the relative values, because otherwise one (size) may overwhelm the other (rowCount). But this time we avoid adding up the ratios.
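To illustrate how size can overwhelm rowCount in the "absolute" comparison, here is a minimal sketch. `PlanCost`, `AbsoluteCostDemo`, the weight `w = 0.7` and all the numbers are assumptions chosen for illustration; this is not the Spark implementation.

```scala
object AbsoluteCostDemo {
  // Hypothetical stand-in for a plan's estimated cost (not Spark's class).
  final case class PlanCost(rowCount: Double, size: Double)

  // Weighted sum of row count and size, as in the "absolute" comparison.
  def cost(p: PlanCost, w: Double): Double =
    w * p.rowCount + (1 - w) * p.size

  def betterThan(a: PlanCost, b: PlanCost, w: Double): Boolean =
    cost(a, w) < cost(b, w)

  // Illustrative numbers: A has 1000x more rows but is 1% smaller in bytes.
  val a: PlanCost = PlanCost(rowCount = 1e6, size = 1.00e9)
  val b: PlanCost = PlanCost(rowCount = 1e3, size = 1.01e9)

  def main(args: Array[String]): Unit = {
    val w = 0.7 // assumed weight for this example
    println(s"cost(A) = ${cost(a, w)}, cost(B) = ${cost(b, w)}")
    println(s"A.betterThan(B) = ${betterThan(a, b, w)}")
  }
}
```

With these inputs the size term is hundreds of times larger than the row-count term, so A wins on a 1% size difference despite having 1000x more rows.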

Originally A.betterThan(B) => w*relativeRows + (1-w)*relativeSize < 1 was used. Besides being "non-symmetric", this can also exhibit one ratio overwhelming the other.
For w=0.5, if A's size (bytes) is at least 2x larger than B's, then no matter how many times more rows the B plan has, B will always be considered better: 0.5*2 + 0.5*0.00000000000001 > 1.
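The arithmetic above can be checked with a small sketch of the additive formula. The object and method names are hypothetical, and the ratios are passed in directly rather than derived from plan statistics.

```scala
object AdditiveRatioDemo {
  // Original additive comparison: w*relativeRows + (1-w)*relativeSize < 1
  def betterThan(relativeRows: Double, relativeSize: Double, w: Double): Boolean =
    w * relativeRows + (1 - w) * relativeSize < 1

  def main(args: Array[String]): Unit = {
    val w = 0.5
    val relativeSize = 2.0 // A is 2x larger than B in bytes
    // Shrinking relativeRows means B has ever more rows than A.
    for (relativeRows <- Seq(1.0, 1e-3, 1e-14)) {
      val result = betterThan(relativeRows, relativeSize, w)
      println(s"relativeRows=$relativeRows -> A.betterThan(B) = $result")
    }
  }
}
```

Because (1-w)*relativeSize already equals 1 here, the sum cannot drop below 1 for any non-negative relativeRows, so A.betterThan(B) is false in every case.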

When working with ratios, it is better to multiply them.
The proposed cost comparison function is: A.betterThan(B) => relativeRows^w * relativeSize^(1-w) < 1.
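A minimal sketch of the proposed multiplicative comparison, loosely following the diff under review in this PR. The object name and the plain `BigInt` parameters are assumptions made to keep the example self-contained; the real code reads these values from the plan cost and the weight from the SQL conf.

```scala
object MultiplicativeRatioDemo {
  // relativeRows^w * relativeSize^(1-w) < 1, with a guard against dividing by zero.
  def betterThan(aRows: BigInt, aSize: BigInt,
                 bRows: BigInt, bSize: BigInt, w: Double): Boolean =
    if (bRows == 0 || bSize == 0) {
      false
    } else {
      val relativeRows = BigDecimal(aRows) / BigDecimal(bRows)
      val relativeSize = BigDecimal(aSize) / BigDecimal(bSize)
      math.pow(relativeRows.toDouble, w) * math.pow(relativeSize.toDouble, 1 - w) < 1
    }

  def main(args: Array[String]): Unit = {
    val manyRows = BigInt(10).pow(14)
    // A is 2x larger in bytes, but B has 10^14 times more rows.
    println(betterThan(aRows = 1, aSize = 2, bRows = manyRows, bSize = 1, w = 0.5))
    println(betterThan(aRows = manyRows, aSize = 1, bRows = 1, bSize = 2, w = 0.5))
  }
}
```

With the same ratios as the additive example (relativeSize = 2, relativeRows = 10^-14, w = 0.5), the product is far below 1, so A is now considered better and the reverse comparison is false.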

Does this PR introduce any user-facing change?

Comparison of the changed TPCDS v1.4 query execution times at sf=10:

query	absolute	multiplicative	change	additive	change
q12	145	137	-5.52%	141	-2.76%
q13	264	271	2.65%	271	2.65%
q17	4521	4243	-6.15%	4348	-3.83%
q18	758	466	-38.52%	480	-36.68%
q19	38503	2167	-94.37%	2176	-94.35%
q20	119	120	0.84%	126	5.88%
q24a	16429	16838	2.49%	17103	4.10%
q24b	16592	16999	2.45%	17268	4.07%
q25	3558	3556	-0.06%	3675	3.29%
q33	362	361	-0.28%	380	4.97%
q52	1020	1032	1.18%	1052	3.14%
q55	927	938	1.19%	961	3.67%
q72	24169	13377	-44.65%	24306	0.57%
q81	1285	1185	-7.78%	1168	-9.11%
q91	324	336	3.70%	337	4.01%
q98	126	129	2.38%	131	3.97%

All times are in ms, the change is compared to the situation in the master branch (absolute).
The proposed cost function (multiplicative) significantly improves performance on q18, q19 and q72. The original cost function (additive) shows similar improvements on q18 and q19. All other changes are within the error bars and I would ignore them; perhaps q81 has also improved.

How was this patch tested?

PlanStabilitySuite

tanelk · 2021-03-31T12:31:09Z

@wzhfy @maropu @HyukjinKwon and @cloud-fan

cloud-fan · 2021-03-31T13:06:02Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala

+      } else {
+        val relativeRows = BigDecimal(this.planCost.card) / BigDecimal(other.planCost.card)
+        val relativeSize = BigDecimal(this.planCost.size) / BigDecimal(other.planCost.size)
+        Math.pow(relativeRows.doubleValue(), conf.joinReorderCardWeight) *


shall we update the config doc?

cloud-fan · 2021-03-31T13:10:50Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala

+        val relativeRows = BigDecimal(this.planCost.card) / BigDecimal(other.planCost.card)
+        val relativeSize = BigDecimal(this.planCost.size) / BigDecimal(other.planCost.size)
+        Math.pow(relativeRows.doubleValue(), conf.joinReorderCardWeight) *
+          Math.pow(relativeSize.doubleValue(), 1 - conf.joinReorderCardWeight) < 1


is this symmetric?

Ah, it kind of normalizes the row count and byte size with Math.pow(_, m) and Math.pow(_, 1 - m), then calculates the relative ratios and compares them.

If I'm not mistaken, when the left side of the comparison is x for A.betterThan(B), it will be 1/x for B.betterThan(A). One of them will be greater than 1 and the other smaller.

yea, after normalization the formula is still row_count1 * size1 / (row_count2 * size2), so if one is x the other must be 1/x.
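The symmetry argument in this thread can be written out with the exponents kept explicit. This is a sketch: the symbols r (row count), s (size in bytes), w (weight) and f are notation introduced here, and all four statistics are assumed to be positive.

```latex
\[
f(A,B) = \left(\frac{r_A}{r_B}\right)^{w}\left(\frac{s_A}{s_B}\right)^{1-w},
\qquad
f(B,A) = \left(\frac{r_B}{r_A}\right)^{w}\left(\frac{s_B}{s_A}\right)^{1-w}
       = \frac{1}{f(A,B)}
\]
\[
A.\mathrm{betterThan}(B) \iff f(A,B) < 1 \iff f(B,A) > 1
\implies \neg\, B.\mathrm{betterThan}(A)
\]
```

When f(A,B) = 1 exactly, neither plan is considered better than the other.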

SparkQA · 2021-03-31T13:54:11Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41351/

SparkQA · 2021-03-31T14:03:17Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41351/

SparkQA · 2021-03-31T17:06:45Z

Test build #136768 has finished for PR 32014 at commit 811c1c9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-03-31T20:24:14Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41363/

SparkQA · 2021-03-31T20:31:33Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41363/

SparkQA · 2021-04-01T00:04:39Z

Test build #136780 has finished for PR 32014 at commit 6c39602.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2021-04-01T04:23:35Z

The change LGTM. Can we re-generate the golden files to fix conflicts?

SparkQA · 2021-04-01T06:48:43Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41381/

SparkQA · 2021-04-01T06:56:47Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41381/

SparkQA · 2021-04-01T11:39:12Z

Test build #136799 has finished for PR 32014 at commit 41b46a8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2021-04-01T12:45:42Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala

-      thisCost < otherCost
+      if (other.planCost.card == 0 || other.planCost.size == 0) {
+        false
+      } else {


How about leaving some comments about why we need to use relative values here?

Added some comments to this method.

maropu · 2021-04-01T12:47:45Z

cc: @wzhfy

SparkQA · 2021-04-02T10:05:40Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41429/

SparkQA · 2021-04-02T10:36:24Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41429/

SparkQA · 2021-04-02T13:53:57Z

Test build #136851 has finished for PR 32014 at commit dc87250.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class SubtractTimestamps(
public class OrcArrayColumnVector extends OrcColumnVector
public class OrcAtomicColumnVector extends OrcColumnVector
public abstract class OrcColumnVector extends org.apache.spark.sql.vectorized.ColumnVector
class OrcColumnVectorUtils
public class OrcMapColumnVector extends OrcColumnVector
public class OrcStructColumnVector extends OrcColumnVector

cloud-fan · 2021-04-06T15:14:55Z

Unfortunately there are conflicts again...

SparkQA · 2021-04-06T16:15:15Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41530/

SparkQA · 2021-04-06T16:15:16Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41530/

SparkQA · 2021-04-06T17:13:02Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41535/

SparkQA · 2021-04-06T17:13:03Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41535/

SparkQA · 2021-04-06T17:58:40Z

Test build #136954 has finished for PR 32014 at commit a7406e7.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2021-04-06T20:45:14Z

Test build #136958 has finished for PR 32014 at commit cdf7f08.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class KoalasFrameMethods(object):
class KoalasSeriesMethods(object):
class IndexOpsMixin(object, metaclass=ABCMeta):
class CategoricalAccessor(object):
however, expected types are [(<class 'float'>, <class 'int'>)].
class OptionError(AttributeError, KeyError):
class DatetimeMethods(object):
class DataError(Exception):
class SparkPandasIndexingError(Exception):
class SparkPandasNotImplementedError(NotImplementedError):
class PandasNotImplementedError(NotImplementedError):
new_class = type(\"NameType\", (NameTypeHolder,),
new_class = type(\"NameType\", (NameTypeHolder,),
class DataFrame(Frame, Generic[T]):
[defaultdict(<class 'list'>,
defaultdict(<class 'list'>,
class CachedDataFrame(DataFrame):
class Frame(object, metaclass=ABCMeta):
class GroupBy(object, metaclass=ABCMeta):
class DataFrameGroupBy(GroupBy):
class SeriesGroupBy(GroupBy):
class Index(IndexOpsMixin):
class CategoricalIndex(Index):
class DatetimeIndex(Index):
class MultiIndex(Index):
a single :class:Index (or subclass thereof).
class NumericIndex(Index):
class IntegerIndex(NumericIndex):
class Int64Index(IntegerIndex):
class Float64Index(NumericIndex):
class IndexerLike(object):
class AtIndexer(IndexerLike):
class iAtIndexer(IndexerLike):
class LocIndexerLike(IndexerLike, metaclass=ABCMeta):
class LocIndexer(LocIndexerLike):
class iLocIndexer(LocIndexerLike):
class InternalFrame(object):
class _MissingPandasLikeDataFrame(object):
class MissingPandasLikeDataFrameGroupBy(object):
class MissingPandasLikeSeriesGroupBy(object):
class MissingPandasLikeIndex(object):
class MissingPandasLikeDatetimeIndex(MissingPandasLikeIndex):
class MissingPandasLikeCategoricalIndex(MissingPandasLikeIndex):
class MissingPandasLikeMultiIndex(object):
class MissingPandasLikeSeries(object):
class MissingPandasLikeExpanding(object):
class MissingPandasLikeRolling(object):
class MissingPandasLikeExpandingGroupby(object):
class MissingPandasLikeRollingGroupby(object):
class PythonModelWrapper(object):
class KoalasPlotAccessor(PandasObject):
class KoalasBarPlot(PandasBarPlot, TopNPlotBase):
class KoalasBoxPlot(PandasBoxPlot, BoxPlotBase):
class KoalasHistPlot(PandasHistPlot, HistogramPlotBase):
class KoalasPiePlot(PandasPiePlot, TopNPlotBase):
class KoalasAreaPlot(PandasAreaPlot, SampledPlotBase):
class KoalasLinePlot(PandasLinePlot, SampledPlotBase):
class KoalasBarhPlot(PandasBarhPlot, TopNPlotBase):
class KoalasScatterPlot(PandasScatterPlot, TopNPlotBase):
class KoalasKdePlot(PandasKdePlot, KdePlotBase):
new_class = type(\"NameType\", (NameTypeHolder,),
new_class = param.type if isinstance(param, np.dtype) else param
class Series(Frame, IndexOpsMixin, Generic[T]):
dictionary is a ``dict`` subclass that defines ``__missing__`` (i.e.
defaultdict(<class 'list'>,
class SparkIndexOpsMethods(object, metaclass=ABCMeta):
class SparkSeriesMethods(SparkIndexOpsMethods):
class SparkIndexMethods(SparkIndexOpsMethods):
class SparkFrameMethods(object):
class CachedSparkFrameMethods(SparkFrameMethods):
class SQLProcessor(object):
class StringMethods(object):
class SeriesType(Generic[T]):
class DataFrameType(object):
class ScalarType(object):
class UnknownType(object):
class NameTypeHolder(object):
The returned type class indicates both dtypes (a pandas only dtype object
class KoalasUsageLogger(object):
class RollingAndExpanding(object):
class Rolling(RollingAndExpanding):
class RollingGroupby(Rolling):
class Expanding(RollingAndExpanding):
class ExpandingGroupby(Expanding):

maropu · 2021-04-07T02:31:48Z

The GA test failures are not related to this PR, so merged to master.

maropu · 2021-04-07T02:32:26Z

Could you open PRs to backport this for branch-3.1/3.0, too?

Relative cost function

811c1c9

cloud-fan reviewed Mar 31, 2021

github-actions bot added the SQL label Mar 31, 2021

Fix test

6c39602

tanelk added 2 commits April 1, 2021 08:47

Rerun plan stability

9d69ede

Update doc

41b46a8

HyukjinKwon mentioned this pull request Apr 1, 2021

[SPARK-33935][SQL] Fix CBO cost function #30965

Closed

cloud-fan approved these changes Apr 1, 2021

maropu reviewed Apr 1, 2021

maropu approved these changes Apr 1, 2021

Rerun plan stability suite

dc87250

tanelk added 3 commits April 6, 2021 17:36

Comment

2e2d5fb

Comment

3ce2db9

Comment

a7406e7

Rerun plan stability suite

cdf7f08

maropu closed this in 7c8dc5e Apr 7, 2021
