-
Notifications
You must be signed in to change notification settings - Fork 29.3k
[SPARK-32859][SQL] Introduce physical rule to decide bucketing dynamically #29804
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 4 commits
a7193cf
7ffcdbc
6a8c7a5
a15f864
b45ee8b
9aa0266
26c4338
336337a
1012334
057c021
e0c76c9
fbe0c06
4027175
f2ceacd
b29f688
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -951,6 +951,14 @@ object SQLConf { | |
| .checkValue(_ > 0, "the value of spark.sql.sources.bucketing.maxBuckets must be greater than 0") | ||
| .createWithDefault(100000) | ||
|
|
||
| val AUTO_BUCKETED_SCAN_ENABLED = | ||
| buildConf("spark.sql.sources.bucketing.autoBucketedScan.enabled") | ||
| .doc("When true, decide whether to do bucketed scan on input tables based on query plan " + | ||
| "automatically.") | ||
| .version("3.1.0") | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. btw, we need to make this config external? If we just add this config for keeping the current behaviour, is it okay to add it as internal one?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @maropu - sure, just for my own education, what does it indicate to make a config internal/external?
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. IIUC we don't have any strict rule for that. But, I think this new rule works well in most queries, so adding this as external looks less meaning because I think most users don't turn this feature off.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @maropu - sure, updated. |
||
| .booleanConf | ||
| .createWithDefault(true) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @viirya - this config will not take any effect in that case, updated. |
||
|
|
||
| val CROSS_JOINS_ENABLED = buildConf("spark.sql.crossJoin.enabled") | ||
| .internal() | ||
| .doc("When false, we will throw an error if a query contains a cartesian product without " + | ||
|
|
@@ -3164,6 +3172,8 @@ class SQLConf extends Serializable with Logging { | |
|
|
||
| def bucketingMaxBuckets: Int = getConf(SQLConf.BUCKETING_MAX_BUCKETS) | ||
|
|
||
| def autoBucketedScanEnabled: Boolean = getConf(SQLConf.AUTO_BUCKETED_SCAN_ENABLED) | ||
|
|
||
| def dataFrameSelfJoinAutoResolveAmbiguity: Boolean = | ||
| getConf(DATAFRAME_SELF_JOIN_AUTO_RESOLVE_AMBIGUITY) | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -156,7 +156,9 @@ case class RowDataSourceScanExec( | |
| * @param optionalBucketSet Bucket ids for bucket pruning. | ||
| * @param optionalNumCoalescedBuckets Number of coalesced buckets. | ||
| * @param dataFilters Filters on non-partition columns. | ||
| * @param tableIdentifier identifier for the table in the metastore. | ||
| * @param tableIdentifier Identifier for the table in the metastore. | ||
| * @param disableBucketedScan Disable bucketed scan based on physical query plan, see rule | ||
| * [[DisableUnnecessaryBucketedScan]] for details. | ||
| */ | ||
| case class FileSourceScanExec( | ||
| @transient relation: HadoopFsRelation, | ||
|
|
@@ -166,7 +168,8 @@ case class FileSourceScanExec( | |
| optionalBucketSet: Option[BitSet], | ||
| optionalNumCoalescedBuckets: Option[Int], | ||
| dataFilters: Seq[Expression], | ||
| tableIdentifier: Option[TableIdentifier]) | ||
| tableIdentifier: Option[TableIdentifier], | ||
| disableBucketedScan: Boolean = false) | ||
| extends DataSourceScanExec { | ||
|
|
||
| // Note that some vals referring the file-based relation are lazy intentionally | ||
|
|
@@ -257,7 +260,8 @@ case class FileSourceScanExec( | |
|
|
||
| // exposed for testing | ||
| lazy val bucketedScan: Boolean = { | ||
| if (relation.sparkSession.sessionState.conf.bucketingEnabled && relation.bucketSpec.isDefined) { | ||
| if (relation.sparkSession.sessionState.conf.bucketingEnabled && relation.bucketSpec.isDefined | ||
| && !disableBucketedScan) { | ||
| val spec = relation.bucketSpec.get | ||
| val bucketColumns = spec.bucketColumnNames.flatMap(n => toAttribute(n)) | ||
| bucketColumns.size == spec.bucketColumnNames.size | ||
|
|
@@ -339,7 +343,7 @@ case class FileSourceScanExec( | |
| location.getClass.getSimpleName + | ||
| Utils.buildLocationMetadata(location.rootPaths, maxMetadataValueLength) | ||
| val metadata = | ||
| Map( | ||
| HashMap( | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit:
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @maropu - updated. Was just following other code in the same file. |
||
| "Format" -> relation.fileFormat.toString, | ||
| "ReadSchema" -> requiredSchema.catalogString, | ||
| "Batched" -> supportsColumnar.toString, | ||
|
|
@@ -348,20 +352,22 @@ case class FileSourceScanExec( | |
| "DataFilters" -> seqToString(dataFilters), | ||
| "Location" -> locationDesc) | ||
|
|
||
| val withSelectedBucketsCount = relation.bucketSpec.map { spec => | ||
| val numSelectedBuckets = optionalBucketSet.map { b => | ||
| b.cardinality() | ||
| } getOrElse { | ||
| spec.numBuckets | ||
| if (bucketedScan) { | ||
| relation.bucketSpec.map { spec => | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @maropu - just for my own education, why does it matter? Updated anyway.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yea, I remember the previous discussion: https://issues.apache.org/jira/browse/SPARK-16694
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yea, please only use
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @cloud-fan , @maropu - I changed the code during iterations. The current change is just adding a |
||
| val numSelectedBuckets = optionalBucketSet.map { b => | ||
| b.cardinality() | ||
| } getOrElse { | ||
| spec.numBuckets | ||
| } | ||
| metadata += ("SelectedBucketsCount" -> | ||
| (s"$numSelectedBuckets out of ${spec.numBuckets}" + | ||
| optionalNumCoalescedBuckets.map { b => s" (Coalesced to $b)"}.getOrElse(""))) | ||
| } | ||
| metadata + ("SelectedBucketsCount" -> | ||
| (s"$numSelectedBuckets out of ${spec.numBuckets}" + | ||
| optionalNumCoalescedBuckets.map { b => s" (Coalesced to $b)"}.getOrElse(""))) | ||
| } getOrElse { | ||
| metadata | ||
| } else if (disableBucketedScan) { | ||
| metadata += ("DisableBucketedScan" -> "true") | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we need it? It's kind of the reason why there is no bucket scan in this node. The reason can be: 1. the table is not bucketed. 2. the bucket column is not read. 3. the planner decides to disable it as it has no benefits. If we do need the reason, we should make it completed. Let's not just put the
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. At least we don't need to do it in this PR.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @cloud-fan - how about I update the explain with a
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
The intention in my comment meant that users need to be able to see why bucket scans are disabled as Wenchen pointed it out above. Anyway, the followup looks okay.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added a TODO comment for followup jira - https://issues.apache.org/jira/browse/SPARK-32986 . |
||
| } | ||
|
|
||
| withSelectedBucketsCount | ||
| metadata.toMap | ||
| } | ||
|
|
||
| override def verboseStringWithOperatorId(): String = { | ||
|
|
@@ -624,6 +630,7 @@ case class FileSourceScanExec( | |
| optionalBucketSet, | ||
| optionalNumCoalescedBuckets, | ||
| QueryPlan.normalizePredicates(dataFilters, output), | ||
| None) | ||
| None, | ||
| disableBucketedScan) | ||
| } | ||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,153 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one or more | ||
| * contributor license agreements. See the NOTICE file distributed with | ||
| * this work for additional information regarding copyright ownership. | ||
| * The ASF licenses this file to You under the Apache License, Version 2.0 | ||
| * (the "License"); you may not use this file except in compliance with | ||
| * the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
|
|
||
| package org.apache.spark.sql.execution.bucketing | ||
|
|
||
| import org.apache.spark.sql.catalyst.expressions.aggregate.{Partial, PartialMerge} | ||
| import org.apache.spark.sql.catalyst.plans.physical.{ClusteredDistribution, HashClusteredDistribution} | ||
| import org.apache.spark.sql.catalyst.rules.Rule | ||
| import org.apache.spark.sql.execution.{FileSourceScanExec, FilterExec, ProjectExec, SortExec, SparkPlan} | ||
| import org.apache.spark.sql.execution.aggregate.BaseAggregateExec | ||
| import org.apache.spark.sql.execution.exchange.Exchange | ||
| import org.apache.spark.sql.internal.SQLConf | ||
|
|
||
| /** | ||
| * Disable unnecessary bucketed table scan based on actual physical query plan. | ||
| * NOTE: this rule is designed to be applied right after [[EnsureRequirements]], | ||
| * where all [[ShuffleExchangeExec]] and [[SortExec]] have been added to plan properly. | ||
| * | ||
| * When BUCKETING_ENABLED and AUTO_BUCKETED_SCAN_ENABLED are set to true, go through | ||
| * query plan to check where bucketed table scan is unnecessary, and disable bucketed table | ||
| * scan if needed. | ||
| * | ||
| * For all operators which [[hasInterestingPartition]] (i.e., require [[ClusteredDistribution]] | ||
| * or [[HashClusteredDistribution]]), check if the sub-plan for operator has [[Exchange]] and | ||
| * bucketed table scan. If yes, disable the bucketed table scan in the sub-plan. | ||
| * Only allow certain operators in sub-plan, which guarantees each sub-plan is single lineage | ||
| * (i.e., each operator has only one child). See details in | ||
| * [[disableBucketWithInterestingPartition]]). | ||
| * | ||
| * Examples: | ||
| * (1).join: | ||
| * SortMergeJoin(t1.i = t2.j) | ||
| * / \ | ||
| * Sort(i) Sort(j) | ||
| * / \ | ||
| * Shuffle(i) Scan(t2: i, j) | ||
| * / (bucketed on column j, enable bucketed scan) | ||
| * Scan(t1: i, j) | ||
| * (bucketed on column j, DISABLE bucketed scan) | ||
| * | ||
| * (2).aggregate: | ||
| * HashAggregate(i, ..., Final) | ||
| * | | ||
| * Shuffle(i) | ||
| * | | ||
| * HashAggregate(i, ..., Partial) | ||
| * | | ||
| * Filter | ||
| * | | ||
| * Scan(t1: i, j) | ||
| * (bucketed on column j, DISABLE bucketed scan) | ||
| * | ||
| * The idea of [[hasInterestingPartition]] is inspired from "interesting order" in | ||
| * the paper "Access Path Selection in a Relational Database Management System" | ||
| * (http://www.inf.ed.ac.uk/teaching/courses/adbs/AccessPath.pdf). | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: could we use a link to the ACM page instead? https://dl.acm.org/doi/10.1145/582095.582099
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @maropu - sure, updated. Was just following the reference in CBO join reorder. Updated the link there as well. |
||
| */ | ||
| case class DisableUnnecessaryBucketedScan(conf: SQLConf) extends Rule[SparkPlan] { | ||
|
|
||
| /** | ||
| * Disable bucketed table scan with pre-order traversal of plan. | ||
| * | ||
| * @param withInterestingPartition The traversed plan has operator with interesting partition. | ||
| * @param withExchange The traversed plan has [[Exchange]] operator. | ||
| */ | ||
| private def disableBucketWithInterestingPartition( | ||
| plan: SparkPlan, | ||
| withInterestingPartition: Boolean, | ||
| withExchange: Boolean): SparkPlan = { | ||
| plan match { | ||
| case p if hasInterestingPartition(p) => | ||
| // Operators with interesting partition, propagates `withInterestingPartition` as true | ||
| // to its children. | ||
| p.mapChildren(disableBucketWithInterestingPartition(_, true, false)) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just a question; I read the PR description and I thought first that this rule is to find exchange plan nodes (inserted by But, the current code looks more general. Any reason to propagate required distributions in a top-down manner?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @maropu - this is good question. I think currently only matching One interesting extension I can think of - if bucketed scan parallelism is too low (to few # of buckets), we may decide to not do a bucketed scan for join to trade-off for query run-time vs extra shuffle cost (in this case, there's no |
||
| case exchange: Exchange if withInterestingPartition => | ||
| // Exchange operator propagates `withExchange` as true to its child | ||
| // if the plan has interesting partition. | ||
| exchange.mapChildren(disableBucketWithInterestingPartition( | ||
| _, withInterestingPartition, true)) | ||
| case scan: FileSourceScanExec | ||
| if withInterestingPartition && withExchange && isBucketedScanWithoutFilter(scan) => | ||
| // Disable bucketed table scan if the plan has interesting partition, | ||
| // and [[Exchange]] in the plan. | ||
| scan.copy(disableBucketedScan = true) | ||
| case o => | ||
| if (isAllowedUnaryExecNode(o)) { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. let's make the rule clear. I think there are 2 cases when doing traverse here:
I'd expect to see code like
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @cloud-fan - as discussed offline, we still need to traverse sub-plan for node with interesting partition, even though this node cannot disable bucketed scan (e.g. multiple join cases). Updated the code to disable bucketed table scan if:
|
||
| // Propagates `withInterestingPartition` and `withExchange` from parent | ||
| // for only allowed single-child nodes. | ||
| o.mapChildren(disableBucketWithInterestingPartition( | ||
| _, withInterestingPartition, withExchange)) | ||
| } else { | ||
| o.mapChildren(disableBucketWithInterestingPartition(_, false, false)) | ||
| } | ||
| } | ||
| } | ||
|
|
||
| private def hasInterestingPartition(plan: SparkPlan): Boolean = { | ||
| plan.requiredChildDistribution.exists { | ||
| case _: ClusteredDistribution | _: HashClusteredDistribution => true | ||
| case _ => false | ||
| } | ||
| } | ||
|
|
||
| private def isAllowedUnaryExecNode(plan: SparkPlan): Boolean = { | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you add description why this is needed? HasInterestingPartition and at lease one Exchange sounds obvious condition, but this allowed unary exec node is not. Why we can disable bucketed scan only if those exec nodes?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @viirya - this is good question. I agree we can be more bold and we probably don't need a whitelist operators here, e.g. SMJ - shuffle - BHJ - scan, SMJ - Shuffle - union - Scan (and another scan) should also work, but my feeling is to start with more confidence change first and improve later. With a whitelist operators here, we have a high confidence that this feature should work without introducing regression, but much less confidence if we allow arbitrary operators in the middle (at least for me). For now, to be honest, I cannot find a case why arbitrary operators cannot work. But I want to play safer in the beginning and any future improvement for this is much welcomed. cc @cloud-fan and @maropu for thoughts. Added a comment for now.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In my opnion, it is okay for this PR to focus on a basic (minimal) support for the auto bucket scan. In followup activities, I think we can optimize it step-by-step by adding test cases and checking performance improvements... (Anyway, it would be better to leave some comment there about it as @viirya suggested above)
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We should make the code clear for developer and maintainer, so leaving some comment is nicer if we want to constrain the scope of this rule for now.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I already added a comment in last iteration. Please suggest concretely for alternative comment if it's not looking good. Thanks. |
||
| plan match { | ||
| case _: SortExec | _: Exchange | _: ProjectExec | _: FilterExec | | ||
| _: FileSourceScanExec => true | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Is the case only allowed if
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @maropu - |
||
| case partialAgg: BaseAggregateExec => | ||
| val modes = partialAgg.aggregateExpressions.map(_.mode) | ||
| modes.nonEmpty && modes.forall(mode => mode == Partial || mode == PartialMerge) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. instead of checking the mode, shall we just check
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @cloud-fan - sure, updated. Also thanks for catch, I think |
||
| case _ => false | ||
| } | ||
| } | ||
|
|
||
| private def isBucketedScanWithoutFilter(scan: FileSourceScanExec): Boolean = { | ||
| // Do not disable bucketed table scan if it has filter pruning, | ||
| // because bucketed table scan is still useful here to save CPU/IO cost with | ||
| // only reading selected bucket files. | ||
| scan.bucketedScan && scan.optionalBucketSet.isEmpty | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What if a scan operator reads most buckets? e.g., 999 of 1000 buckets. We select bucket scans even in this case?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @maropu - this is a good question, and I think it is kind of out of scope for this PR and needs more thoughts later. We don't have a cost model to decide whether to do (bucketed filter + bucketed scan) vs (normal filter + non-bucketed scan). It can depend on number of buckets, size of filtered buckets, CPU cost for filter, etc.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm fine with it for now. Technically I think filter by bucket ID and bucketed scan don't need to be coupled. We can always filter files by bucket id, and then do bucketed scan or not according to this rule.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yea, okay. Could you file jira later, @c21 ?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @maropu - sure, filed https://issues.apache.org/jira/browse/SPARK-32985 . |
||
| } | ||
|
|
||
| private def disableAllBucketedScan(plan: SparkPlan): SparkPlan = { | ||
| plan.transformUp { | ||
| case scan: FileSourceScanExec if isBucketedScanWithoutFilter(scan) => | ||
| scan.copy(disableBucketedScan = true) | ||
| } | ||
| } | ||
|
|
||
| def apply(plan: SparkPlan): SparkPlan = { | ||
| if (!conf.bucketingEnabled || !conf.autoBucketedScanEnabled) { | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If a plan doesn't have any bucket spec, we do nothing in this rule? Could we filter out the case at the beginning?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @maropu - sure, updated. |
||
| plan | ||
| } else if (plan.find(hasInterestingPartition).isDefined) { | ||
| disableBucketWithInterestingPartition(plan, false, false) | ||
| } else { | ||
| // Disable all bucketed scans if there's no operator with interesting partition | ||
| // found in query plan. | ||
| disableAllBucketedScan(plan) | ||
| } | ||
| } | ||
| } | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since our user documents are generated based on this statement, could you describe a bit more about how to decide whether to do bucketed scans or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu - sure, wondering what do you think of below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks okay, but I have the same suggestion with #29804 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu - sure, updated.