Picking changes from the fix for SPARK-32709 #39
Conversation
Applied patch from PR: apache#33432
```diff
  committer.newTaskTempFileAbsPath(taskAttemptContext, customPath.get, fileNameSpec)
} else {
-  committer.newTaskTempFile(taskAttemptContext, partDir, ext)
+  committer.newTaskTempFile(taskAttemptContext, partDir, fileNameSpec)
```
Sorry to chime in here, but this PR depends on apache#33012. Shall we merge apache#33012 first? Thanks.
Apologies for the late reply, but this change is already included as part of this PR.
Yeah, my comment was published before your second commit, so it looks good to me now.
👀 @catalinii @nanassito #compute-data-platform
### What changes were proposed in this pull request?
Currently, Spark DS V2 aggregate push-down doesn't support projects with aliases. Refer to https://github.com/apache/spark/blob/c91c2e9afec0d5d5bbbd2e155057fe409c5bb928/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala#L96

This PR makes it work with aliases.

**The first example:** the original plan is shown below:
```
Aggregate [DEPT#0], [DEPT#0, sum(mySalary#8) AS total#14]
+- Project [DEPT#0, SALARY#2 AS mySalary#8]
   +- ScanBuilderHolder [DEPT#0, NAME#1, SALARY#2, BONUS#3], RelationV2[DEPT#0, NAME#1, SALARY#2, BONUS#3] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession77978658,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions5f8da82)
```
If the aggregate can be completely pushed down, the plan will be:
```
Project [DEPT#0, SUM(SALARY)#18 AS sum(SALARY#2)#13 AS total#14]
+- RelationV2[DEPT#0, SUM(SALARY)#18] test.employee
```
If the aggregate can only be partially pushed down, the plan will be:
```
Aggregate [DEPT#0], [DEPT#0, sum(cast(SUM(SALARY)#18 as decimal(20,2))) AS total#14]
+- RelationV2[DEPT#0, SUM(SALARY)#18] test.employee
```
**The second example:** the original plan is shown below:
```
Aggregate [myDept#33], [myDept#33, sum(mySalary#34) AS total#40]
+- Project [DEPT#25 AS myDept#33, SALARY#27 AS mySalary#34]
   +- ScanBuilderHolder [DEPT#25, NAME#26, SALARY#27, BONUS#28], RelationV2[DEPT#25, NAME#26, SALARY#27, BONUS#28] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession25c4f621,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions345d641e)
```
If the aggregate can be completely pushed down, the plan will be:
```
Project [DEPT#25 AS myDept#33, SUM(SALARY)#44 AS sum(SALARY#27)#39 AS total#40]
+- RelationV2[DEPT#25, SUM(SALARY)#44] test.employee
```
If the aggregate can only be partially pushed down, the plan will be:
```
Aggregate [myDept#33], [DEPT#25 AS myDept#33, sum(cast(SUM(SALARY)#56 as decimal(20,2))) AS total#52]
+- RelationV2[DEPT#25, SUM(SALARY)#56] test.employee
```

### Why are the changes needed?
Aliases are commonly used in projections, so aggregate push-down should support them.

### Does this PR introduce _any_ user-facing change?
'Yes'. DS V2 aggregate push-down now supports projects with aliases.

### How was this patch tested?
New tests.

Closes apache#35932 from beliefer/SPARK-38533_new.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit f327dad)
Signed-off-by: Wenchen Fan <[email protected]>
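For reference, a minimal sketch of the kind of query behind the first example's plan. This is not code from the PR; the `jdbcUrl` connection string, the surrounding `spark` session, and the contents of `test.employee` are assumptions for illustration.

```scala
import org.apache.spark.sql.functions.{col, sum}

// Read the JDBC table that the plans above scan (connection details assumed).
val employee = spark.read
  .format("jdbc")
  .option("url", jdbcUrl)              // assumed to be defined elsewhere
  .option("dbtable", "test.employee")
  .load()

// Project with an alias (SALARY AS mySalary), then aggregate over the alias.
val total = employee
  .select(col("DEPT"), col("SALARY").as("mySalary"))
  .groupBy(col("DEPT"))
  .agg(sum(col("mySalary")).as("total"))

// With this change, the aggregate should be pushed into the RelationV2 scan
// (completely or partially) even though sum() is applied to an aliased column.
total.explain()
```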
Applied patch from PR: apache#33432
Query in Hive: https://sql.lyft.net/databases/mozart/queries/92938921
Query in Spark: https://sql.lyft.net/databases/mozart/queries/92939005
Comparison: https://sql.lyft.net/databases/mozart/queries/92948392
However, the number of files created by Spark is very large, and the solution probably would not work without custom configuration. Raised a request about this in open source: https://issues.apache.org/jira/browse/SPARK-32709?focusedCommentId=17396330&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17396330
Context
Spark previously did not support bucketed tables from Hive because Hive's hashing function differs from Spark's. This was addressed by introducing Hive hash in Spark and using it when writing data, so data written by Spark can now be read efficiently by Hive and PrestoSQL/Trino.
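As a rough illustration of the write path this enables, here is a sketch of creating and populating a Hive-style bucketed table from Spark. The table name, columns, and bucket count are made up for the example, and it assumes the Hive-hash write support described above is available in the build.

```scala
import org.apache.spark.sql.SparkSession

// A Hive-enabled session is required for Hive-format bucketed tables.
val spark = SparkSession.builder()
  .appName("hive-bucketed-write-sketch")
  .enableHiveSupport()
  .getOrCreate()

// A Hive-style bucketed table (CLUSTERED BY), so Hive and PrestoSQL/Trino can
// rely on the bucket layout when reading. Names here are illustrative.
spark.sql(
  """CREATE TABLE IF NOT EXISTS events_bucketed (
    |  user_id BIGINT,
    |  event_ts TIMESTAMP
    |)
    |CLUSTERED BY (user_id) INTO 64 BUCKETS
    |STORED AS PARQUET""".stripMargin)

// Writing through INSERT lets Spark apply the Hive hash on the bucketing
// column, producing files that Hive and Trino treat as properly bucketed.
spark.sql(
  """INSERT OVERWRITE TABLE events_bucketed
    |SELECT user_id, event_ts FROM events_raw""".stripMargin)
```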
TODO
Although reducers use Hive hash when writing output, the data is not partitioned across reducers using Hive hash, which leads to many output files and could cause S3 throttling failures. Custom configuration, such as enabling adaptive query execution, may help in some cases; see the sketch below.
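The adaptive query execution settings referred to above can be turned on as shown here. These are standard Spark 3.x configs rather than something introduced by this change, and whether they actually reduce the file count depends on the job and the final shuffle.

```scala
// Let AQE coalesce small shuffle partitions before the write, which can reduce
// (but does not bound) the number of output files each insert produces.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```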