Picking changes from the fix for SPARK-32709 #39
Conversation
Applied patch from PR: apache#33432
```diff
  committer.newTaskTempFileAbsPath(taskAttemptContext, customPath.get, fileNameSpec)
} else {
-  committer.newTaskTempFile(taskAttemptContext, partDir, ext)
+  committer.newTaskTempFile(taskAttemptContext, partDir, fileNameSpec)
```
Sorry to chime in here, but this PR depends on apache#33012. Shall we merge apache#33012 first? Thanks.
Apologies for the late reply, but this change is already included as part of this PR.
Yeah, my comment was published before your second commit, so it looks good to me now.
👀 @catalinii @nanassito #compute-data-platform
### What changes were proposed in this pull request?
Currently, Spark DS V2 aggregate push-down doesn't support projects with aliases. Refer to https://github.com/apache/spark/blob/c91c2e9afec0d5d5bbbd2e155057fe409c5bb928/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala#L96

This PR makes it work with aliases.

**The first example:** the original plan is shown below:
```
Aggregate [DEPT#0], [DEPT#0, sum(mySalary#8) AS total#14]
+- Project [DEPT#0, SALARY#2 AS mySalary#8]
   +- ScanBuilderHolder [DEPT#0, NAME#1, SALARY#2, BONUS#3], RelationV2[DEPT#0, NAME#1, SALARY#2, BONUS#3] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession77978658,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions5f8da82)
```
If the aggregate can be completely pushed down, the plan will be:
```
Project [DEPT#0, SUM(SALARY)#18 AS sum(SALARY#2)#13 AS total#14]
+- RelationV2[DEPT#0, SUM(SALARY)#18] test.employee
```
If the aggregate can only be partially pushed down, the plan will be:
```
Aggregate [DEPT#0], [DEPT#0, sum(cast(SUM(SALARY)#18 as decimal(20,2))) AS total#14]
+- RelationV2[DEPT#0, SUM(SALARY)#18] test.employee
```
**The second example:** the original plan is shown below:
```
Aggregate [myDept#33], [myDept#33, sum(mySalary#34) AS total#40]
+- Project [DEPT#25 AS myDept#33, SALARY#27 AS mySalary#34]
   +- ScanBuilderHolder [DEPT#25, NAME#26, SALARY#27, BONUS#28], RelationV2[DEPT#25, NAME#26, SALARY#27, BONUS#28] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession25c4f621,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions345d641e)
```
If the aggregate can be completely pushed down, the plan will be:
```
Project [DEPT#25 AS myDept#33, SUM(SALARY)#44 AS sum(SALARY#27)#39 AS total#40]
+- RelationV2[DEPT#25, SUM(SALARY)#44] test.employee
```
If the aggregate can only be partially pushed down, the plan will be:
```
Aggregate [myDept#33], [DEPT#25 AS myDept#33, sum(cast(SUM(SALARY)#56 as decimal(20,2))) AS total#52]
+- RelationV2[DEPT#25, SUM(SALARY)#56] test.employee
```

### Why are the changes needed?
Aliases are commonly used in projections, so aggregate push-down should support them.

### Does this PR introduce _any_ user-facing change?
'Yes'. DS V2 aggregate push-down now supports projects with aliases.

### How was this patch tested?
New tests.

Closes apache#35932 from beliefer/SPARK-38533_new.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit f327dad)
Signed-off-by: Wenchen Fan <[email protected]>
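For reference, a minimal sketch of the kind of query behind the first example's plan. This is not code from the PR; the `jdbcUrl` connection string, the surrounding `spark` session, and the contents of `test.employee` are assumptions for illustration.

```scala
import org.apache.spark.sql.functions.{col, sum}

// Read the JDBC table that the plans above scan (connection details assumed).
val employee = spark.read
  .format("jdbc")
  .option("url", jdbcUrl)              // assumed to be defined elsewhere
  .option("dbtable", "test.employee")
  .load()

// Project with an alias (SALARY AS mySalary), then aggregate over the alias.
val total = employee
  .select(col("DEPT"), col("SALARY").as("mySalary"))
  .groupBy(col("DEPT"))
  .agg(sum(col("mySalary")).as("total"))

// With this change, the aggregate should be pushed into the RelationV2 scan
// (completely or partially) even though sum() is applied to an aliased column.
total.explain()
```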
Applied patch from PR: apache#33432
Query in Hive: https://sql.lyft.net/databases/mozart/queries/92938921
Query in Spark: https://sql.lyft.net/databases/mozart/queries/92939005
Comparison: https://sql.lyft.net/databases/mozart/queries/92948392
However, the number of files created by Spark is very large, and the solution probably would not work without custom configuration. Raised a request about this in open source: https://issues.apache.org/jira/browse/SPARK-32709?focusedCommentId=17396330&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17396330
Context
Spark previously did not support bucketed tables from Hive because Hive's hashing function differs from Spark's. This was addressed by introducing Hive hash in Spark and using it when writing data, so data written by Spark can now be read efficiently by Hive and PrestoSQL/Trino.
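As a rough illustration of the write path this enables, here is a sketch of creating and populating a Hive-style bucketed table from Spark. The table name, columns, and bucket count are made up for the example, and it assumes the Hive-hash write support described above is available in the build.

```scala
import org.apache.spark.sql.SparkSession

// A Hive-enabled session is required for Hive-format bucketed tables.
val spark = SparkSession.builder()
  .appName("hive-bucketed-write-sketch")
  .enableHiveSupport()
  .getOrCreate()

// A Hive-style bucketed table (CLUSTERED BY), so Hive and PrestoSQL/Trino can
// rely on the bucket layout when reading. Names here are illustrative.
spark.sql(
  """CREATE TABLE IF NOT EXISTS events_bucketed (
    |  user_id BIGINT,
    |  event_ts TIMESTAMP
    |)
    |CLUSTERED BY (user_id) INTO 64 BUCKETS
    |STORED AS PARQUET""".stripMargin)

// Writing through INSERT lets Spark apply the Hive hash on the bucketing
// column, producing files that Hive and Trino treat as properly bucketed.
spark.sql(
  """INSERT OVERWRITE TABLE events_bucketed
    |SELECT user_id, event_ts FROM events_raw""".stripMargin)
```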
TODO
Although reducers use Hive hash when writing output, the data is not partitioned across reducers using Hive hash, which leads to many output files and could cause S3 throttling failures. Custom configuration, such as enabling adaptive query execution, may help in some cases; see the sketch below.
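The adaptive query execution settings referred to above can be turned on as shown here. These are standard Spark 3.x configs rather than something introduced by this change, and whether they actually reduce the file count depends on the job and the final shuffle.

```scala
// Let AQE coalesce small shuffle partitions before the write, which can reduce
// (but does not bound) the number of output files each insert produces.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```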