Skip to content

Conversation

@tanelk
Copy link
Contributor

@tanelk tanelk commented Mar 31, 2021

What changes were proposed in this pull request?

Changed the cost comparison function of the CBO to use the ratios of row counts and sizes in bytes.

Why are the changes needed?

In #30965 we changed to CBO cost comparison function so it would be "symetric": A.betterThan(B) now implies, that !B.betterThan(A).
With that we caused a performance regressions in some queries - TPCDS q19 for example.

The original cost comparison function used the ratios relativeRows = A.rowCount / B.rowCount and relativeSize = A.size / B.size. The changed function compared "absolute" cost values costA = w*A.rowCount + (1-w)*A.size and costB = w*B.rowCount + (1-w)*B.size.

Given the input from @wzhfy we decided to go back to the relative values, because otherwise one (size) may overwhelm the other (rowCount). But this time we avoid adding up the ratios.

Originally A.betterThan(B) => w*relativeRows + (1-w)*relativeSize < 1 was used. Besides being "non-symteric", this also can exhibit one overwhelming other.
For w=0.5 If A size (bytes) is at least 2x larger than B, then no matter how many times more rows does the B plan have, B will allways be considered to be better - 0.5*2 + 0.5*0.00000000000001 > 1.

When working with ratios, then it would be better to multiply them.
The proposed cost comparison function is: A.betterThan(B) => relativeRows^w * relativeSize^(1-w) < 1.

Does this PR introduce any user-facing change?

Comparison of the changed TPCDS v1.4 query execution times at sf=10:

  absolute multiplicative   additive  
q12 145 137 -5.52% 141 -2.76%
q13 264 271 2.65% 271 2.65%
q17 4521 4243 -6.15% 4348 -3.83%
q18 758 466 -38.52% 480 -36.68%
q19 38503 2167 -94.37% 2176 -94.35%
q20 119 120 0.84% 126 5.88%
q24a 16429 16838 2.49% 17103 4.10%
q24b 16592 16999 2.45% 17268 4.07%
q25 3558 3556 -0.06% 3675 3.29%
q33 362 361 -0.28% 380 4.97%
q52 1020 1032 1.18% 1052 3.14%
q55 927 938 1.19% 961 3.67%
q72 24169 13377 -44.65% 24306 0.57%
q81 1285 1185 -7.78% 1168 -9.11%
q91 324 336 3.70% 337 4.01%
q98 126 129 2.38% 131 3.97%

All times are in ms, the change is compared to the situation in the master branch (absolute).
The proposed cost function (multiplicative) significantlly improves the performance on q18, q19 and q72. The original cost function (additive) has similar improvements at q18 and q19. All other chagnes are within the error bars and I would ignore them - perhaps q81 has also improved.

How was this patch tested?

PlanStabilitySuite

@tanelk
Copy link
Contributor Author

tanelk commented Mar 31, 2021

} else {
val relativeRows = BigDecimal(this.planCost.card) / BigDecimal(other.planCost.card)
val relativeSize = BigDecimal(this.planCost.size) / BigDecimal(other.planCost.size)
Math.pow(relativeRows.doubleValue(), conf.joinReorderCardWeight) *
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we update the config doc?

val relativeRows = BigDecimal(this.planCost.card) / BigDecimal(other.planCost.card)
val relativeSize = BigDecimal(this.planCost.size) / BigDecimal(other.planCost.size)
Math.pow(relativeRows.doubleValue(), conf.joinReorderCardWeight) *
Math.pow(relativeSize.doubleValue(), 1 - conf.joinReorderCardWeight) < 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this symmetric?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, it's kind of normalize the row count and bytes size with Math.pow(_, m) and Math.pow(_, 1- m), then calculate the relative ratios and compare.

Copy link
Contributor Author

@tanelk tanelk Mar 31, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm not mistaken, then, when the left side of the comparison is x for A.betterThan(B), then for B.betterThan(A) it will be 1/x. One of them will be greater than 1 and other smaller.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, after normalization the formula is still row_count1 * size1 / (row_count2 * size2), so if one is x the other must be 1/x.

@SparkQA
Copy link

SparkQA commented Mar 31, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41351/

@SparkQA
Copy link

SparkQA commented Mar 31, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41351/

@SparkQA
Copy link

SparkQA commented Mar 31, 2021

Test build #136768 has finished for PR 32014 at commit 811c1c9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions github-actions bot added the SQL label Mar 31, 2021
@SparkQA
Copy link

SparkQA commented Mar 31, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41363/

@SparkQA
Copy link

SparkQA commented Mar 31, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41363/

@SparkQA
Copy link

SparkQA commented Apr 1, 2021

Test build #136780 has finished for PR 32014 at commit 6c39602.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

The change LGTM. Can we re-generate the golden files to fix conflicts?

@SparkQA
Copy link

SparkQA commented Apr 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41381/

@SparkQA
Copy link

SparkQA commented Apr 1, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41381/

@SparkQA
Copy link

SparkQA commented Apr 1, 2021

Test build #136799 has finished for PR 32014 at commit 41b46a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

thisCost < otherCost
if (other.planCost.card == 0 || other.planCost.size == 0) {
false
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about leaving some comments about why we need to use relative values here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments to this method.

@maropu
Copy link
Member

maropu commented Apr 1, 2021

cc: @wzhfy

@SparkQA
Copy link

SparkQA commented Apr 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41429/

@SparkQA
Copy link

SparkQA commented Apr 2, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41429/

@SparkQA
Copy link

SparkQA commented Apr 2, 2021

Test build #136851 has finished for PR 32014 at commit dc87250.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class SubtractTimestamps(
  • public class OrcArrayColumnVector extends OrcColumnVector
  • public class OrcAtomicColumnVector extends OrcColumnVector
  • public abstract class OrcColumnVector extends org.apache.spark.sql.vectorized.ColumnVector
  • class OrcColumnVectorUtils
  • public class OrcMapColumnVector extends OrcColumnVector
  • public class OrcStructColumnVector extends OrcColumnVector

@cloud-fan
Copy link
Contributor

Unfortunately there are conflicts again...

@SparkQA
Copy link

SparkQA commented Apr 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41530/

@SparkQA
Copy link

SparkQA commented Apr 6, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41530/

@SparkQA
Copy link

SparkQA commented Apr 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41535/

@SparkQA
Copy link

SparkQA commented Apr 6, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41535/

@SparkQA
Copy link

SparkQA commented Apr 6, 2021

Test build #136954 has finished for PR 32014 at commit a7406e7.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 6, 2021

Test build #136958 has finished for PR 32014 at commit cdf7f08.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class KoalasFrameMethods(object):
  • class KoalasSeriesMethods(object):
  • class IndexOpsMixin(object, metaclass=ABCMeta):
  • class CategoricalAccessor(object):
  • however, expected types are [(<class 'float'>, <class 'int'>)].
  • class OptionError(AttributeError, KeyError):
  • class DatetimeMethods(object):
  • class DataError(Exception):
  • class SparkPandasIndexingError(Exception):
  • class SparkPandasNotImplementedError(NotImplementedError):
  • class PandasNotImplementedError(NotImplementedError):
  • new_class = type(\"NameType\", (NameTypeHolder,),
  • new_class = type(\"NameType\", (NameTypeHolder,),
  • class DataFrame(Frame, Generic[T]):
  • [defaultdict(<class 'list'>,
  • defaultdict(<class 'list'>,
  • class CachedDataFrame(DataFrame):
  • class Frame(object, metaclass=ABCMeta):
  • class GroupBy(object, metaclass=ABCMeta):
  • class DataFrameGroupBy(GroupBy):
  • class SeriesGroupBy(GroupBy):
  • class Index(IndexOpsMixin):
  • class CategoricalIndex(Index):
  • class DatetimeIndex(Index):
  • class MultiIndex(Index):
  • a single :class:Index (or subclass thereof).
  • class NumericIndex(Index):
  • class IntegerIndex(NumericIndex):
  • class Int64Index(IntegerIndex):
  • class Float64Index(NumericIndex):
  • class IndexerLike(object):
  • class AtIndexer(IndexerLike):
  • class iAtIndexer(IndexerLike):
  • class LocIndexerLike(IndexerLike, metaclass=ABCMeta):
  • class LocIndexer(LocIndexerLike):
  • class iLocIndexer(LocIndexerLike):
  • class InternalFrame(object):
  • class _MissingPandasLikeDataFrame(object):
  • class MissingPandasLikeDataFrameGroupBy(object):
  • class MissingPandasLikeSeriesGroupBy(object):
  • class MissingPandasLikeIndex(object):
  • class MissingPandasLikeDatetimeIndex(MissingPandasLikeIndex):
  • class MissingPandasLikeCategoricalIndex(MissingPandasLikeIndex):
  • class MissingPandasLikeMultiIndex(object):
  • class MissingPandasLikeSeries(object):
  • class MissingPandasLikeExpanding(object):
  • class MissingPandasLikeRolling(object):
  • class MissingPandasLikeExpandingGroupby(object):
  • class MissingPandasLikeRollingGroupby(object):
  • class PythonModelWrapper(object):
  • class KoalasPlotAccessor(PandasObject):
  • class KoalasBarPlot(PandasBarPlot, TopNPlotBase):
  • class KoalasBoxPlot(PandasBoxPlot, BoxPlotBase):
  • class KoalasHistPlot(PandasHistPlot, HistogramPlotBase):
  • class KoalasPiePlot(PandasPiePlot, TopNPlotBase):
  • class KoalasAreaPlot(PandasAreaPlot, SampledPlotBase):
  • class KoalasLinePlot(PandasLinePlot, SampledPlotBase):
  • class KoalasBarhPlot(PandasBarhPlot, TopNPlotBase):
  • class KoalasScatterPlot(PandasScatterPlot, TopNPlotBase):
  • class KoalasKdePlot(PandasKdePlot, KdePlotBase):
  • new_class = type(\"NameType\", (NameTypeHolder,),
  • new_class = param.type if isinstance(param, np.dtype) else param
  • class Series(Frame, IndexOpsMixin, Generic[T]):
  • dictionary is a ``dict`` subclass that defines ``__missing__`` (i.e.
  • defaultdict(<class 'list'>,
  • class SparkIndexOpsMethods(object, metaclass=ABCMeta):
  • class SparkSeriesMethods(SparkIndexOpsMethods):
  • class SparkIndexMethods(SparkIndexOpsMethods):
  • class SparkFrameMethods(object):
  • class CachedSparkFrameMethods(SparkFrameMethods):
  • class SQLProcessor(object):
  • class StringMethods(object):
  • class SeriesType(Generic[T]):
  • class DataFrameType(object):
  • class ScalarType(object):
  • class UnknownType(object):
  • class NameTypeHolder(object):
  • The returned type class indicates both dtypes (a pandas only dtype object
  • class KoalasUsageLogger(object):
  • class RollingAndExpanding(object):
  • class Rolling(RollingAndExpanding):
  • class RollingGroupby(Rolling):
  • class Expanding(RollingAndExpanding):
  • class ExpandingGroupby(Expanding):

@maropu maropu closed this in 7c8dc5e Apr 7, 2021
@maropu
Copy link
Member

maropu commented Apr 7, 2021

The GA test failures is not related to this PR, so merged to master.

@maropu
Copy link
Member

maropu commented Apr 7, 2021

Could you open PRs to backport this for branch-3.1/3.0, too?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants