[GLUTEN-11346][CORE][VL] Add Spark 4.1 Shim Layer #11347
Conversation
```yaml
**/gluten-ut/**/hs_err_*.log
**/gluten-ut/**/core.*
spark-test-spark41:
```
should also add the tpc tests
https://github.com/apache/incubator-gluten/blob/main/.github/workflows/velox_backend_x86.yml#L104
Thanks @zhouyuan
I understand that we need to add this here. Spark 4.1 has a new option `spark.sql.unionOutputPartitioning`, introduced in apache/spark#51623; currently it needs to be set to false for successful execution. I plan to submit a separate PR later to address this, which will make the review process more convenient.
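For reference, the workaround described above could be applied as a submit-time setting. This is a sketch based only on the option name cited from apache/spark#51623, not a tested configuration:

```shell
# Workaround sketch: disable the new union output partitioning behavior
# (option introduced in Spark 4.1 by apache/spark#51623) until Gluten supports it.
spark-sql --conf spark.sql.unionOutputPartitioning=false
```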
should also add the tpc tests https://github.com/apache/incubator-gluten/blob/main/.github/workflows/velox_backend_x86.yml#L104
fix in #11353
```dockerfile
dnf module -y install python39 && \
    alternatives --set python3 /usr/bin/python3.9 && \
    pip3 install setuptools==77.0.3 && \
    pip3 install pyspark==3.5.5 cython && \
```
The pyspark version should be 4.1.0
Interesting, it was copied from Spark 4.0, cc @zhouyuan
However, starting with Spark 4.1 (apache/spark#51259), the minimum supported Python version is 3.10. I'm not familiar with how to configure the Python environment, so I've excluded these two unit tests for now; see 2ef147c.
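If the CI image is later upgraded, the quoted Dockerfile lines might change along these lines. This is an untested sketch; the `python310` module name and its availability in the base image are assumptions:

```shell
# Assumed sketch: swap the python39 lines quoted above for Python 3.10,
# the minimum for Spark 4.1 (apache/spark#51259), and bump pyspark to 4.1.0.
# The dnf module/package names are unverified for this base image.
dnf module -y install python310 && \
    alternatives --set python3 /usr/bin/python3.10 && \
    pip3 install setuptools==77.0.3 && \
    pip3 install pyspark==4.1.0 cython
```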
```scala
object ParquetFooterReaderShim {

  /** @since Spark 4.1 */
  def readFooter(
```
Since most of the versions have the same implementation, could we use something like https://github.com/apache/incubator-gluten/blob/main/shims/common/src/main/scala/org/apache/gluten/sql/shims/SparkShims.scala#L355?
This was intentionally designed this way.
The SparkShims interface has been growing steadily larger, with many unrelated methods lumped together.
For example, the shims added in this change (QueryExecutionShim, DataSourceV2RelationShim, and ParquetFooterReaderShim) are conceptually very different and do not belong to the same abstraction.
Since most of the versions have the same implementation, could we use something like https://github.com/apache/incubator-gluten/blob/main/shims/common/src/main/scala/org/apache/gluten/sql/shims/SparkShims.scala#L355?

I'm still keeping ParquetFooterReaderShim. Those two readFooter methods are just static utility methods in Spark's ParquetFooterReader, so moving them into SparkShims feels odd.
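The per-concern shim style being defended here can be sketched as follows. Everything in this block is illustrative: the types are placeholders and the delegation target is described in a comment, not the actual Spark 4.1 `ParquetFooterReader` signatures.

```scala
// Sketch only: a narrowly-scoped shim object per Spark API surface, instead of
// widening the shared SparkShims trait. Types below are placeholders.
object ParquetFooterReaderShim {
  type HadoopConf = AnyRef // stands in for org.apache.hadoop.conf.Configuration
  type Footer = AnyRef     // stands in for parquet's ParquetMetadata

  /** Each shims/sparkNN module ships its own copy of this object, delegating
   *  to whichever ParquetFooterReader overload that Spark version exposes. */
  def readFooter(conf: HadoopConf, filePath: String): Footer =
    sys.error("delegate to this Spark version's ParquetFooterReader here")
}
```

Common code then calls `ParquetFooterReaderShim.readFooter` and stays version-agnostic, without the shared SparkShims trait growing a new method per concern.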
```scala
def createSparkPlan(
    sparkSession: SparkSession,
    planner: SparkPlanner,
    plan: LogicalPlan): SparkPlan = QueryExecution.createSparkPlan(sparkSession, planner, plan)
```
ditto
Moved to SparkShims.
```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression, SortOrder}
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReaderFactory, Scan}

class Spark35Scan extends DataSourceV2ScanExecBase {
```
Is it used? I assume it is not.
Removed them.
```scala
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.types.StructType

object InternalRowUtl {
```
The file name and object name have a typo.
Good catch—let’s fix it in another PR. Since it’s been in the repo for a long time, I’d rather keep this PR focused.
```scala
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
```
Please remove this file
Good catch—let’s fix it in another PR. Since it’s been in the repo for a long time, I’d rather keep this PR focused.
| Cause | Type | Category | Description | Affected Files |
|-------|------|----------|-------------|----------------|
| - | Feat | Feature | Introduce Spark41Shims and update build configuration to support Spark 4.1. | pom.xml<br>shims/pom.xml<br>shims/spark41/pom.xml<br>shims/spark41/.../META-INF/services/org.apache.gluten.sql.shims.SparkShimProvider<br>shims/spark41/.../spark41/Spark41Shims.scala<br>shims/spark41/.../spark41/SparkShimProvider.scala |
| [#51477](apache/spark#51477) | Fix | Compatibility | Use class name instead of class object for streaming call detection to ensure Spark 4.1 compatibility. | gluten-core/.../caller/CallerInfo.scala |
| [#50852](apache/spark#50852) | Fix | Compatibility | Add printOutputColumns parameter to generateTreeString methods. | shims/spark41/.../GenerateTreeStringShim.scala |
| [#51775](apache/spark#51775) | Fix | Compatibility | Remove unused MDC import in FileSourceScanExecShim.scala. | shims/spark41/.../FileSourceScanExecShim.scala |
| [#51979](apache/spark#51979) | Fix | Compatibility | Add missing StoragePartitionJoinParams import in BatchScanExecShim and AbstractBatchScanExec. | shims/spark41/.../v2/AbstractBatchScanExec.scala<br>shims/spark41/.../v2/BatchScanExecShim.scala |
| [#51302](apache/spark#51302) | Fix | Compatibility | Remove TimeAdd from ExpressionConverter and ExpressionMappings for test. | gluten-substrait/.../ExpressionConverter.scala<br>gluten-substrait/.../ExpressionMappings.scala |
| [#50598](apache/spark#50598) | Fix | Compatibility | Adapt to the QueryExecution.createSparkPlan interface change. | gluten-substrait/.../GlutenImplicits.scala<br>shims/spark\*/.../shims/spark\*/Spark\*Shims.scala |
| [#52599](apache/spark#52599) | Fix | Compatibility | Adapt to the DataSourceV2Relation interface change. | backends-velox/.../ArrowConvertorRule.scala |
| [#52384](apache/spark#52384) | Fix | Compatibility | Use the new ParquetFooterReader interface. | backends-velox/.../ParquetMetadataUtils.scala<br>gluten-ut/spark40/.../parquet/GlutenParquetRowIndexSuite.scala<br>shims/spark\*/.../parquet/ParquetFooterReaderShim.scala |
| [#52509](apache/spark#52509) | Fix | Build | Update the Scala version to 2.13.17 in pom.xml to fix `java.lang.NoSuchMethodError: 'java.lang.String scala.util.hashing.MurmurHash3$.caseClassHash$default$2()'`. | pom.xml |
| - | Fix | Test | Refactor Spark version checks in VeloxHashJoinSuite to improve readability and maintainability. | backends-velox/.../VeloxHashJoinSuite.scala |
| [#50849](apache/spark#50849) | Fix | Test | Fix MiscOperatorSuite to support the OneRowRelationExec plan in Spark 4.1. | backends-velox/.../MiscOperatorSuite.scala |
| [#52723](apache/spark#52723) | Fix | Compatibility | Add GeographyVal and GeometryVal support in ColumnarArrayShim. | shims/spark41/.../vectorized/ColumnarArrayShim.java |
| [#48470](apache/spark#48470) | 4.1.0 | Exclude | Exclude the split test in VeloxStringFunctionsSuite. | backends-velox/.../VeloxStringFunctionsSuite.scala |
| [#51259](apache/spark#51259) | 4.1.0 | Exclude | Run ArrowEvalPythonExecSuite tests only up to Spark 4.0; the CI Python needs updating to 3.10. | backends-velox/.../python/ArrowEvalPythonExecSuite.scala |
Thanks, @jinchengchenghh |
What changes are proposed in this pull request?
Add Spark 4.1 shim layer to support Spark 4.1.x in Gluten Velox backend.
Fixes #11346
How was this patch tested?
Pass GHA