
Conversation


@s-pedamallu s-pedamallu commented Jul 22, 2021

Applied patch from PR: apache#33432

Query in Hive: https://sql.lyft.net/databases/mozart/queries/92938921
Query in Spark: https://sql.lyft.net/databases/mozart/queries/92939005
Comparison: https://sql.lyft.net/databases/mozart/queries/92948392

However, the number of files created by Spark is very large, and the solution probably won't work without custom configuration. Raised a request about this in open source: https://issues.apache.org/jira/browse/SPARK-32709?focusedCommentId=17396330&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17396330

Context
Spark previously did not support Hive bucketed tables because Hive's hashing function differs from Spark's. This was addressed by introducing Hive hash in Spark and using it when writing data, so data written by Spark can now be read efficiently by Hive and PrestoSQL/Trino.
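
As a rough illustration (the table, columns, and bucket count below are hypothetical, and this assumes a Spark build carrying the patch, with Hive support enabled), a bucketed write might look like:

```scala
// Minimal sketch, assuming a patched build with Hive-hash bucketed writes
// (SPARK-32709). Table and column names are hypothetical.
spark.sql("""
  CREATE TABLE IF NOT EXISTS events (id BIGINT, payload STRING)
  CLUSTERED BY (id) INTO 256 BUCKETS
  STORED AS PARQUET
""")

// With the patch, each row is routed to its bucket file using Hive's hash
// function, so Hive and PrestoSQL/Trino can rely on the bucket layout on read.
spark.table("staging_events")
  .write
  .mode("append")
  .insertInto("events")
```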

TODO
Although reducers use Hive hash when writing output, the reducers themselves are not partitioned using Hive hash, which leads to many output files and could cause S3 throttling failures. Custom configuration, such as enabling adaptive query execution, may help in some cases (see the sketch below).
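
A hedged sketch of the kind of session configuration meant here (these are standard Spark 3.x settings, but whether they reduce the file count enough in this case is untested):

```scala
// Illustrative only: enable adaptive query execution so Spark can coalesce
// small shuffle partitions before the write, reducing output file count.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```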

Comment on lines +241 to +243
```diff
  committer.newTaskTempFileAbsPath(taskAttemptContext, customPath.get, fileNameSpec)
} else {
-  committer.newTaskTempFile(taskAttemptContext, partDir, ext)
+  committer.newTaskTempFile(taskAttemptContext, partDir, fileNameSpec)
```
Sorry to chime in here, but this PR depends on apache#33012. Shall we merge apache#33012 first? Thanks.

Author
Apologies for the late reply, but this change is already included as part of this PR.


Yeah, my comment was published before your second commit, so it looks good to me now.

@s-pedamallu (Author)

👀 @catalinii @nanassito #compute-data-platform

@s-pedamallu s-pedamallu merged commit cefa493 into master Aug 16, 2021
@s-pedamallu s-pedamallu deleted the bucketing-spark-32709 branch August 16, 2021 21:19
laguilarlyft pushed a commit that referenced this pull request Nov 9, 2023
### What changes were proposed in this pull request?
Currently, Spark DS V2 aggregate push-down doesn't support projects with aliases.

Refer to https://github.com/apache/spark/blob/c91c2e9afec0d5d5bbbd2e155057fe409c5bb928/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala#L96

This PR makes it work with aliases.
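
As a hedged sketch of the kind of query that produces the plans in the examples below (column and table names are taken from those plans), the aggregate operates on an aliased projection:

```scala
import org.apache.spark.sql.functions.{col, sum}

// SALARY is projected under an alias before aggregation; prior to this PR,
// the alias blocked aggregate push-down.
val df = spark.table("test.employee")
  .select(col("DEPT"), col("SALARY").as("mySalary"))
  .groupBy("DEPT")
  .agg(sum(col("mySalary")).as("total"))
```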

**The first example:**
The original plan is shown below:
```
Aggregate [DEPT#0], [DEPT#0, sum(mySalary#8) AS total#14]
+- Project [DEPT#0, SALARY#2 AS mySalary#8]
   +- ScanBuilderHolder [DEPT#0, NAME#1, SALARY#2, BONUS#3], RelationV2[DEPT#0, NAME#1, SALARY#2, BONUS#3] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession77978658,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions5f8da82)
```
If we can completely push down the aggregate, the plan will be:
```
Project [DEPT#0, SUM(SALARY)#18 AS sum(SALARY#2)#13 AS total#14]
+- RelationV2[DEPT#0, SUM(SALARY)#18] test.employee
```
If we can partially push down the aggregate, the plan will be:
```
Aggregate [DEPT#0], [DEPT#0, sum(cast(SUM(SALARY)#18 as decimal(20,2))) AS total#14]
+- RelationV2[DEPT#0, SUM(SALARY)#18] test.employee
```

**The second example:**
The original plan is shown below:
```
Aggregate [myDept#33], [myDept#33, sum(mySalary#34) AS total#40]
+- Project [DEPT#25 AS myDept#33, SALARY#27 AS mySalary#34]
   +- ScanBuilderHolder [DEPT#25, NAME#26, SALARY#27, BONUS#28], RelationV2[DEPT#25, NAME#26, SALARY#27, BONUS#28] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession25c4f621,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions345d641e)
```
If we can completely push down the aggregate, the plan will be:
```
Project [DEPT#25 AS myDept#33, SUM(SALARY)#44 AS sum(SALARY#27)#39 AS total#40]
+- RelationV2[DEPT#25, SUM(SALARY)#44] test.employee
```
If we can partially push down the aggregate, the plan will be:
```
Aggregate [myDept#33], [DEPT#25 AS myDept#33, sum(cast(SUM(SALARY)#56 as decimal(20,2))) AS total#52]
+- RelationV2[DEPT#25, SUM(SALARY)#56] test.employee
```

### Why are the changes needed?
Aliases are common in queries, so supporting them makes DS V2 aggregate push-down applicable to more workloads.

### Does this PR introduce _any_ user-facing change?
Yes.
Users can see that DS V2 aggregate push-down supports projects with aliases.

### How was this patch tested?
New tests.

Closes apache#35932 from beliefer/SPARK-38533_new.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit f327dad)
Signed-off-by: Wenchen Fan <[email protected]>