[GLUTEN-11346][CORE][VL] Add Spark 4.1 Shim Layer #11347
Conversation
```yaml
**/gluten-ut/**/hs_err_*.log
**/gluten-ut/**/core.*
spark-test-spark41:
```
should also add the tpc tests
https://github.com/apache/incubator-gluten/blob/main/.github/workflows/velox_backend_x86.yml#L104
Thanks @zhouyuan
I understand that we need to add this here. Spark 4.1 has a new option `spark.sql.unionOutputPartitioning`, introduced in apache/spark#51623; currently it needs to be set to false for successful execution. I plan to submit a separate PR later to address this, which will make the review process more convenient.
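For reference, the workaround described above could be applied as a submit-time setting. This is a sketch based only on the option name cited from apache/spark#51623, not a tested configuration:

```shell
# Workaround sketch: disable the new union output partitioning behavior
# (option introduced in Spark 4.1 by apache/spark#51623) until Gluten supports it.
spark-sql --conf spark.sql.unionOutputPartitioning=false
```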
should also add the tpc tests https://github.com/apache/incubator-gluten/blob/main/.github/workflows/velox_backend_x86.yml#L104
fix in #11353
```dockerfile
dnf module -y install python39 && \
    alternatives --set python3 /usr/bin/python3.9 && \
    pip3 install setuptools==77.0.3 && \
    pip3 install pyspark==3.5.5 cython && \
```
The pyspark version should be 4.1.0
Interesting, it was copied from Spark 4.0, cc @zhouyuan
However, starting with Spark 4.1 (apache/spark#51259), the minimum supported Python version is 3.10. I'm not familiar with how to configure the Python environment, so I've excluded these two unit tests for now; see 2ef147c.
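If the CI image is later upgraded, the quoted Dockerfile lines might change along these lines. This is an untested sketch; the `python310` module name and its availability in the base image are assumptions:

```shell
# Assumed sketch: swap the python39 lines quoted above for Python 3.10,
# the minimum for Spark 4.1 (apache/spark#51259), and bump pyspark to 4.1.0.
# The dnf module/package names are unverified for this base image.
dnf module -y install python310 && \
    alternatives --set python3 /usr/bin/python3.10 && \
    pip3 install setuptools==77.0.3 && \
    pip3 install pyspark==4.1.0 cython
```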
```scala
object ParquetFooterReaderShim {

  /** @since Spark 4.1 */
  def readFooter(
```
Since most of the versions have the same implementation, could we use something like https://github.com/apache/incubator-gluten/blob/main/shims/common/src/main/scala/org/apache/gluten/sql/shims/SparkShims.scala#L355?
This was intentionally designed this way.
The SparkShims interface has been growing steadily larger, with many unrelated methods lumped together.
For example, the shims added in this change (QueryExecutionShim, DataSourceV2RelationShim, and ParquetFooterReaderShim) are conceptually very different and do not belong to the same abstraction.
Since most of the versions have the same implementation, could we use something like https://github.com/apache/incubator-gluten/blob/main/shims/common/src/main/scala/org/apache/gluten/sql/shims/SparkShims.scala#L355?

I'm still keeping ParquetFooterReaderShim. Those two readFooter methods are just static utility methods in Spark's ParquetFooterReader, so moving them into SparkShims feels odd.
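The per-concern shim style being defended here can be sketched as follows. Everything in this block is illustrative: the types are placeholders and the delegation target is described in a comment, not the actual Spark 4.1 `ParquetFooterReader` signatures.

```scala
// Sketch only: a narrowly-scoped shim object per Spark API surface, instead of
// widening the shared SparkShims trait. Types below are placeholders.
object ParquetFooterReaderShim {
  type HadoopConf = AnyRef // stands in for org.apache.hadoop.conf.Configuration
  type Footer = AnyRef     // stands in for parquet's ParquetMetadata

  /** Each shims/sparkNN module ships its own copy of this object, delegating
   *  to whichever ParquetFooterReader overload that Spark version exposes. */
  def readFooter(conf: HadoopConf, filePath: String): Footer =
    sys.error("delegate to this Spark version's ParquetFooterReader here")
}
```

Common code then calls `ParquetFooterReaderShim.readFooter` and stays version-agnostic, without the shared SparkShims trait growing a new method per concern.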
```scala
def createSparkPlan(
    sparkSession: SparkSession,
    planner: SparkPlanner,
    plan: LogicalPlan): SparkPlan = QueryExecution.createSparkPlan(sparkSession, planner, plan)
```
ditto
Moved to SparkShims.
```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression, SortOrder}
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReaderFactory, Scan}

class Spark35Scan extends DataSourceV2ScanExecBase {
```
Is it used? I assume it is not.
Removed them.
```scala
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.types.StructType

object InternalRowUtl {
```
The file name and object name have a typo.
Good catch—let’s fix it in another PR. Since it’s been in the repo for a long time, I’d rather keep this PR focused.
```scala
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
```
Please remove this file
Good catch—let’s fix it in another PR. Since it’s been in the repo for a long time, I’d rather keep this PR focused.
| Cause | Type | Category | Description | Affected Files |
|-------|------|----------|-------------|----------------|
| - | Feat | Feature | Introduce Spark41Shims and update build configuration to support Spark 4.1. | pom.xml<br>shims/pom.xml<br>shims/spark41/pom.xml<br>shims/spark41/.../META-INF/services/org.apache.gluten.sql.shims.SparkShimProvider<br>shims/spark41/.../spark41/Spark41Shims.scala<br>shims/spark41/.../spark41/SparkShimProvider.scala |
| [#51477](apache/spark#51477) | Fix | Compatibility | Use class name instead of class object for streaming call detection to ensure Spark 4.1 compatibility. | gluten-core/.../caller/CallerInfo.scala |
| [#50852](apache/spark#50852) | Fix | Compatibility | Add printOutputColumns parameter to generateTreeString methods. | shims/spark41/.../GenerateTreeStringShim.scala |
| [#51775](apache/spark#51775) | Fix | Compatibility | Remove unused MDC import in FileSourceScanExecShim.scala. | shims/spark41/.../FileSourceScanExecShim.scala |
| [#51979](apache/spark#51979) | Fix | Compatibility | Add missing StoragePartitionJoinParams import in BatchScanExecShim and AbstractBatchScanExec. | shims/spark41/.../v2/AbstractBatchScanExec.scala<br>shims/spark41/.../v2/BatchScanExecShim.scala |
| [#51302](apache/spark#51302) | Fix | Compatibility | Remove TimeAdd from ExpressionConverter and ExpressionMappings for test. | gluten-substrait/.../ExpressionConverter.scala<br>gluten-substrait/.../ExpressionMappings.scala |
| [#50598](apache/spark#50598) | Fix | Compatibility | Adapt to the QueryExecution.createSparkPlan interface change. | gluten-substrait/.../GlutenImplicits.scala<br>shims/spark\*/.../shims/spark\*/Spark\*Shims.scala |
| [#52599](apache/spark#52599) | Fix | Compatibility | Adapt to the DataSourceV2Relation interface change. | backends-velox/.../ArrowConvertorRule.scala |
| [#52384](apache/spark#52384) | Fix | Compatibility | Use the new ParquetFooterReader interface. | backends-velox/.../ParquetMetadataUtils.scala<br>gluten-ut/spark40/.../parquet/GlutenParquetRowIndexSuite.scala<br>shims/spark\*/.../parquet/ParquetFooterReaderShim.scala |
| [#52509](apache/spark#52509) | Fix | Build | Update the Scala version to 2.13.17 in pom.xml to fix `java.lang.NoSuchMethodError: 'java.lang.String scala.util.hashing.MurmurHash3$.caseClassHash$default$2()'`. | pom.xml |
| - | Fix | Test | Refactor Spark version checks in VeloxHashJoinSuite to improve readability and maintainability. | backends-velox/.../VeloxHashJoinSuite.scala |
| [#50849](apache/spark#50849) | Fix | Test | Fix MiscOperatorSuite to support the OneRowRelationExec plan in Spark 4.1. | backends-velox/.../MiscOperatorSuite.scala |
| [#52723](apache/spark#52723) | Fix | Compatibility | Add GeographyVal and GeometryVal support in ColumnarArrayShim. | shims/spark41/.../vectorized/ColumnarArrayShim.java |
| [#48470](apache/spark#48470) | 4.1.0 | Exclude | Exclude the split test in VeloxStringFunctionsSuite. | backends-velox/.../VeloxStringFunctionsSuite.scala |
| [#51259](apache/spark#51259) | 4.1.0 | Exclude | Run ArrowEvalPythonExecSuite tests only up to Spark 4.0; the CI Python needs updating to 3.10. | backends-velox/.../python/ArrowEvalPythonExecSuite.scala |
Thanks, @jinchengchenghh |
What changes are proposed in this pull request?
Add Spark 4.1 shim layer to support Spark 4.1.x in Gluten Velox backend.
Fixes #11346
How was this patch tested?
Pass GHA