[SPARK-15453] [SQL] FileSourceScanExec to extract outputOrdering information
#14864
Diff (JoinSuite):

@@ -21,6 +21,7 @@ import scala.language.existentials

 import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
 import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.execution.SortExec
 import org.apache.spark.sql.execution.joins._
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.test.SharedSQLContext

@@ -61,6 +62,51 @@ class JoinSuite extends QueryTest with SharedSQLContext {
     }
   }

+  test("SPARK-15453 : Sort Merge join on bucketed + sorted tables should not add `sort` step " +
+    "if the join predicates are subset of the sorted columns of the tables") {
+    withTable("SPARK_15453_table_a", "SPARK_15453_table_b") {
+      withSQLConf("spark.sql.autoBroadcastJoinThreshold" -> "0") {
+        val df =
+          (0 until 8)
+            .map(i => (i, i * 2, i.toString))
+            .toDF("i", "j", "k")
+            .coalesce(1)
+        df.write.bucketBy(4, "j", "k").sortBy("j", "k").saveAsTable("SPARK_15453_table_a")
+
+        df.write.bucketBy(4, "j", "k").sortBy("j", "k").saveAsTable("SPARK_15453_table_b")
+
+        val query = """
+          |SELECT *
+          |FROM
+          |  SPARK_15453_table_a a
+          |JOIN
+          |  SPARK_15453_table_b b
+          |ON a.j=b.j AND
+          |   a.k=b.k
+        """.stripMargin
+        val joinDF = sql(query)
+
+        val executedPlan = joinDF.queryExecution.executedPlan
+        val operators = executedPlan.collect {
+          case j: SortMergeJoinExec => j
+          case j: SortExec => j
+        }
+        assert(operators.size === 1)
+        assert(operators.head.getClass == classOf[SortMergeJoinExec])
+
+        checkAnswer(joinDF,
+          Row(0, 0, "0", 0, 0, "0") ::
+          Row(1, 2, "1", 1, 2, "1") ::
+          Row(2, 4, "2", 2, 4, "2") ::
+          Row(3, 6, "3", 3, 6, "3") ::
+          Row(4, 8, "4", 4, 8, "4") ::
+          Row(5, 10, "5", 5, 10, "5") ::
+          Row(6, 12, "6", 6, 12, "6") ::
+          Row(7, 14, "7", 7, 14, "7") :: Nil)
+      }
+    }
+  }
+
   test("join operator selection") {
     spark.sharedState.cacheManager.clearCache()
Listing files and grouping them by bucket id can be expensive if there are a lot of files. What's worse, we will do it again in `createBucketedReadRDD`. Instead of doing this, I'd like to fix the sorting problem for bucketed tables first; then we don't need to scan file names to get the `outputOrdering`.
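To make the alternative concrete, here is a minimal sketch of the idea in plain Scala (the types are made up for illustration; this is not Spark's internal `FileSourceScanExec` API): if the reader can assume at most one file per bucket, the output ordering falls out of the bucket spec's sort columns with no file listing at all, whereas with multiple files per bucket no ordering can be claimed without a merge.

// Hypothetical, simplified model (names made up, not Spark internals):
// derive a scan's output ordering from the bucket spec instead of listing files.
case class BucketSpecLike(numBuckets: Int, bucketCols: Seq[String], sortCols: Seq[String])
case class SortOrderLike(column: String, ascending: Boolean = true)

object OutputOrderingSketch {
  // If every bucket holds at most one file, rows within a bucket are fully
  // sorted by the bucket spec's sort columns, so the scan can expose that
  // ordering. With several files per bucket, each file is sorted on its own
  // and no per-bucket ordering can be claimed without a merge.
  def outputOrdering(
      spec: Option[BucketSpecLike],
      singleFilePerBucket: Boolean): Seq[SortOrderLike] =
    spec match {
      case Some(s) if singleFilePerBucket => s.sortCols.map(SortOrderLike(_))
      case _ => Nil
    }

  def main(args: Array[String]): Unit = {
    val spec = BucketSpecLike(4, Seq("j", "k"), Seq("j", "k"))
    println(outputOrdering(Some(spec), singleFilePerBucket = true))  // ordering on j, k
    println(outputOrdering(Some(spec), singleFilePerBucket = false)) // no ordering claimed
  }
}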
For the sorting problem, one way to fix it would be to do what Hive does: create a single file per bucket. With any other approach there would be multiple files per bucket, so one would have to globally sort them while reading. That would be sub-optimal, because tables tend to be "write once, read many", and spending more CPU once on the write path to generate a single file would be better.
When I came across this, I wondered why it was designed this way. I even posted about it to the dev group earlier today: http://apache-spark-developers-list.1001551.n3.nabble.com/Questions-about-bucketing-in-Spark-td18814.html
To give you some context, I am trying to drive adoption of Spark within Facebook. We have a lot of tables that would benefit from full bucketing support, so my high-level goal is to get Spark's bucketing on par with Hive's in terms of features and compatibility.
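As an aside (my illustration, not something this PR changes): the test above relies on `.coalesce(1)` to force a single writing task and therefore a single file per bucket. A sketch of two write-side ways to end up with one file per bucket, assuming the repartition hash lines up with Spark's bucket-id assignment:

import org.apache.spark.sql.SparkSession

object SingleFilePerBucketSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("bucket-sketch").getOrCreate()
    import spark.implicits._

    val df = (0 until 8).map(i => (i, i * 2, i.toString)).toDF("i", "j", "k")

    // Option 1 (what the test above does): collapse to a single task, so each
    // bucket is written by exactly one task and gets exactly one file.
    df.coalesce(1)
      .write.bucketBy(4, "j", "k").sortBy("j", "k")
      .saveAsTable("bucketed_single_task")

    // Option 2 (assumption: the shuffle hash matches the bucket-id assignment):
    // shuffle by the bucket columns into numBuckets partitions, so each task
    // holds only the rows of the bucket it writes.
    df.repartition(4, $"j", $"k")
      .write.bucketBy(4, "j", "k").sortBy("j", "k")
      .saveAsTable("bucketed_single_file_per_bucket")

    spark.stop()
  }
}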
Yea, that's a good question. A single file per bucket looks more reasonable; it's more important to read a bucketed table fast than to write it fast. But what about data insertion? Does Hive support inserting into bucketed tables?
@cloud-fan: Open source Hive allows INSERTing data into a bucketed table, but doing so breaks the guarantee of one file per bucket. We could do better in two ways:
I think the latter is a better model for the longer term, but we could start with the first one and work on it iteratively.
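A purely illustrative sketch (plain Scala, modeling neither Hive nor Spark code verbatim) of why append-style INSERTs break the single-file-per-bucket guarantee: each insert writes its own sorted file into every bucket it touches, so a reader can no longer treat a bucket as one globally sorted run without merging.

object InsertBreaksSingleFileSketch {
  // bucket id -> files; each file is one independently sorted run of keys
  type BucketLayout = Map[Int, Seq[Seq[Int]]]

  // Each insert writes a brand-new sorted file into every bucket it touches,
  // mirroring an append-style INSERT into a bucketed table.
  def insert(layout: BucketLayout, rows: Seq[Int], numBuckets: Int): BucketLayout = {
    val newFiles = rows.groupBy(_ % numBuckets).map { case (b, ks) => b -> ks.sorted }
    (layout.keySet ++ newFiles.keySet).map { b =>
      b -> (layout.getOrElse(b, Nil) ++ newFiles.get(b).toSeq)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val afterFirst  = insert(Map.empty, Seq(5, 2, 9, 4), numBuckets = 2)
    val afterSecond = insert(afterFirst, Seq(1, 8, 3), numBuckets = 2)
    // Bucket 1 now holds two files, List(5, 9) and List(1, 3); their
    // concatenation is not sorted, so the "sorted bucket" guarantee is gone.
    println(afterSecond)
  }
}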