
Conversation

@baibaichen (Contributor) commented Dec 31, 2025

What changes are proposed in this pull request?

Add Spark 4.1 shim layer to support Spark 4.1.x in Gluten Velox backend.

| Cause | Type | Category | Description | Affected Files |
|-------|------|----------|-------------|----------------|
| - | Feat | Feature | Introduce Spark41Shims and update the build configuration to support Spark 4.1 (a wiring sketch follows below). | pom.xml<br>shims/pom.xml<br>shims/spark41/pom.xml<br>shims/spark41/.../META-INF/services/org.apache.gluten.sql.shims.SparkShimProvider<br>shims/spark41/.../spark41/Spark41Shims.scala<br>shims/spark41/.../spark41/SparkShimProvider.scala |
| [#51477](apache/spark#51477) | Fix | Compatibility | Use the class name instead of the class object for streaming call detection to ensure Spark 4.1 compatibility. | gluten-core/.../caller/CallerInfo.scala |
| [#50852](apache/spark#50852) | Fix | Compatibility | Add the printOutputColumns parameter to the generateTreeString methods. | shims/spark41/.../GenerateTreeStringShim.scala |
| [#51775](apache/spark#51775) | Fix | Compatibility | Remove the unused MDC import in FileSourceScanExecShim.scala. | shims/spark41/.../FileSourceScanExecShim.scala |
| [#51979](apache/spark#51979) | Fix | Compatibility | Add the missing StoragePartitionJoinParams import in BatchScanExecShim and AbstractBatchScanExec. | shims/spark41/.../v2/AbstractBatchScanExec.scala<br>shims/spark41/.../v2/BatchScanExecShim.scala |
| [#51302](apache/spark#51302) | Fix | Compatibility | Remove TimeAdd from ExpressionConverter and ExpressionMappings for tests. | gluten-substrait/.../ExpressionConverter.scala<br>gluten-substrait/.../ExpressionMappings.scala |
| [#50598](apache/spark#50598) | Fix | Compatibility | Adapt to the QueryExecution.createSparkPlan interface change. | gluten-substrait/.../GlutenImplicits.scala<br>shims/spark\*/.../shims/spark\*/Spark\*Shims.scala |
| [#52599](apache/spark#52599) | Fix | Compatibility | Adapt to the DataSourceV2Relation interface change. | backends-velox/.../ArrowConvertorRule.scala |
| [#52384](apache/spark#52384) | Fix | Compatibility | Use the new ParquetFooterReader interface. | backends-velox/.../ParquetMetadataUtils.scala<br>gluten-ut/spark40/.../parquet/GlutenParquetRowIndexSuite.scala<br>shims/spark\*/.../parquet/ParquetFooterReaderShim.scala |
| [#52509](apache/spark#52509) | Fix | Build | Update the Scala version to 2.13.17 in pom.xml to fix `java.lang.NoSuchMethodError: 'java.lang.String scala.util.hashing.MurmurHash3$.caseClassHash$default$2()'`. | pom.xml |
| - | Fix | Test | Refactor the Spark version checks in VeloxHashJoinSuite to improve readability and maintainability. | backends-velox/.../VeloxHashJoinSuite.scala |
| [#50849](apache/spark#50849) | Fix | Test | Fix MiscOperatorSuite to support the OneRowRelationExec plan in Spark 4.1. | backends-velox/.../MiscOperatorSuite.scala |
| [#52723](apache/spark#52723) | Fix | Compatibility | Add GeographyVal and GeometryVal support in ColumnarArrayShim. | shims/spark41/.../vectorized/ColumnarArrayShim.java |
| [#48470](apache/spark#48470) | 4.1.0 | Exclude | Exclude the split test in VeloxStringFunctionsSuite. | backends-velox/.../VeloxStringFunctionsSuite.scala |
| [#51259](apache/spark#51259) | 4.1.0 | Exclude | Only run ArrowEvalPythonExecSuite tests up to Spark 4.0; CI Python needs to be updated to 3.10. | backends-velox/.../python/ArrowEvalPythonExecSuite.scala |

Fixes #11346
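
For readers unfamiliar with Gluten's shim layout, the sketch below illustrates the ServiceLoader-style wiring implied by the new files in the table (Spark41Shims, SparkShimProvider, and the META-INF/services entry). The trait and method names are simplified stand-ins, not Gluten's exact SparkShims / SparkShimProvider API.

```scala
// Minimal, self-contained sketch of the version-shim wiring (hypothetical names).
// In the real module, shims/spark41 registers its provider via
// META-INF/services/org.apache.gluten.sql.shims.SparkShimProvider so it can be
// discovered with java.util.ServiceLoader at runtime.

// Stand-in for the version-agnostic interface that lives in shims/common.
trait SparkShims {
  def shimVersion: String
}

trait SparkShimProvider {
  def matches(sparkVersion: String): Boolean
  def createShim: SparkShims
}

// What a spark41 module would contribute: the implementation plus its provider.
class Spark41Shims extends SparkShims {
  override def shimVersion: String = "4.1"
}

class Spark41ShimProvider extends SparkShimProvider {
  override def matches(sparkVersion: String): Boolean = sparkVersion.startsWith("4.1")
  override def createShim: SparkShims = new Spark41Shims()
}
```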

How was this patch tested?

Passes GHA.

github-actions bot added the CORE (works for Gluten Core), VELOX, and INFRA labels on Dec 31, 2025
@github-actions bot: Run Gluten Clickhouse CI on x86

@github-actions bot: Run Gluten Clickhouse CI on x86

@github-actions bot: Run Gluten Clickhouse CI on x86

**/gluten-ut/**/hs_err_*.log
**/gluten-ut/**/core.*
spark-test-spark41:

@baibaichen (Contributor Author):

Thanks @zhouyuan
I understand that we need to add this here. Spark 4.1 has a new option, spark.sql.unionOutputPartitioning, introduced in apache/spark#51623. Currently, it needs to be set to false for successful execution. I plan to submit a separate PR later to address this, which will make the review process more convenient.
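
For anyone trying this out, here is a hedged example of the workaround described above; the config key comes from the comment (apache/spark#51623), while the session setup itself is just ordinary Spark code.

```scala
import org.apache.spark.sql.SparkSession

object UnionPartitioningWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("gluten-spark41-union-workaround")
      // Disable the Spark 4.1 union output partitioning propagation for now,
      // as discussed above; a temporary workaround, not a permanent setting.
      .config("spark.sql.unionOutputPartitioning", "false")
      .getOrCreate()

    // ... run Gluten-backed queries here ...

    spark.stop()
  }
}
```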


dnf module -y install python39 && \
alternatives --set python3 /usr/bin/python3.9 && \
pip3 install setuptools==77.0.3 && \
pip3 install pyspark==3.5.5 cython && \
Contributor:

The pyspark version should be 4.1.0

@baibaichen (Contributor Author):

@jackylee-ch

Interesting, it was copied from Spark 4.0, cc @zhouyuan

However, starting with Spark 4.1 (apache/spark#51259), the minimum supported Python version is 3.10. I'm not familiar with how to configure the Python environment, so I've excluded these two unit tests for now (see 2ef147c).

@github-actions bot commented Jan 4, 2026: Run Gluten Clickhouse CI on x86

@github-actions bot commented Jan 4, 2026: Run Gluten Clickhouse CI on x86

baibaichen marked this pull request as ready for review on January 4, 2026 at 14:08
@github-actions bot commented Jan 5, 2026: Run Gluten Clickhouse CI on x86

@github-actions bot commented Jan 5, 2026: Run Gluten Clickhouse CI on x86

object ParquetFooterReaderShim {

/** @since Spark 4.1 */
def readFooter(

@baibaichen (Contributor Author):

This was intentionally designed this way.
The SparkShims interface has been growing larger, with many unrelated interfaces deliberately combined into it.
For example, the shims added in this change (QueryExecutionShim, DataSourceV2RelationShim, and ParquetFooterReaderShim) are conceptually very different and do not belong to the same abstraction.

@baibaichen (Contributor Author):

> Since most of the versions have the same implementation, could we use something like https://github.com/apache/incubator-gluten/blob/main/shims/common/src/main/scala/org/apache/gluten/sql/shims/SparkShims.scala#L355?

I’m still keeping ParquetFooterReaderShim. Those two readFooter methods are just static utility methods in Spark’s ParquetFooterReader, so moving them into SparkShim feels odd.
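
To make the trade-off concrete, here is a minimal, self-contained sketch of the standalone-shim-object pattern the comment above argues for. The types and signatures are hypothetical stand-ins; Spark's real ParquetFooterReader API (and the way it changed in #52384) is not reproduced here.

```scala
// Hypothetical sketch of a per-version static-utility shim (not Spark's real API).
// Each shims/sparkXY module compiles an object with the same name and the same
// Gluten-facing signature, but its body calls the upstream helper that exists in
// that Spark version, so callers never depend on the version-specific signature.

// Stand-in for the upstream class whose static readFooter methods changed.
object UpstreamFooterReader {
  def readFooter(path: String, skipRowGroups: Boolean): String =
    s"footer($path, skipRowGroups=$skipRowGroups)"
}

// Stable entry point used by Gluten code; only this file differs per Spark version.
object ParquetFooterReaderShimSketch {
  def readFooter(path: String): String =
    UpstreamFooterReader.readFooter(path, skipRowGroups = true)
}
```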

def createSparkPlan(
sparkSession: SparkSession,
planner: SparkPlanner,
plan: LogicalPlan): SparkPlan = QueryExecution.createSparkPlan(sparkSession, planner, plan)
Contributor:

ditto

@baibaichen (Contributor Author):

move to SparkShim
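
For contrast with the standalone shim object above, here is a hedged sketch of pushing such a version-sensitive call behind the per-version Spark*Shims implementations instead. The trait and method names are illustrative stand-ins, and the exact way the createSparkPlan signature changed in apache/spark#50598 is not reproduced here.

```scala
// Illustrative sketch only: the version-agnostic trait exposes one stable method,
// and each per-version implementation adapts to whatever signature its Spark
// release provides (represented here by plain strings instead of real Spark types).
trait PlannerShims {
  def createSparkPlan(planner: String, plan: String): String
}

class Spark40StylePlannerShims extends PlannerShims {
  // Older style: the upstream helper also wants a session argument.
  override def createSparkPlan(planner: String, plan: String): String =
    s"QueryExecution.createSparkPlan(session, $planner, $plan)"
}

class Spark41StylePlannerShims extends PlannerShims {
  // Newer style: adapt to the changed upstream signature (details omitted here).
  override def createSparkPlan(planner: String, plan: String): String =
    s"QueryExecution.createSparkPlan(/* 4.1-specific args */ $planner, $plan)"
}
```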

import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression, SortOrder}
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReaderFactory, Scan}

class Spark35Scan extends DataSourceV2ScanExecBase {
Contributor:

Is it used? I assume it is not.

@baibaichen (Contributor Author):

remove them

@github-actions bot commented Jan 6, 2026: Run Gluten Clickhouse CI on x86

@github-actions bot commented Jan 6, 2026: Run Gluten Clickhouse CI on x86

import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.types.StructType

object InternalRowUtl {
Contributor:

The file name and object name have a typo.

@baibaichen (Contributor Author):

Good catch—let’s fix it in another PR. Since it’s been in the repo for a long time, I’d rather keep this PR focused.

* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
Contributor:

Please remove this file

@baibaichen (Contributor Author):

Good catch—let’s fix it in another PR. Since it’s been in the repo for a long time, I’d rather keep this PR focused.

@github-actions bot commented Jan 6, 2026: Run Gluten Clickhouse CI on x86

@github-actions bot commented Jan 7, 2026: Run Gluten Clickhouse CI on x86

baibaichen merged commit 9026029 into apache:main on Jan 7, 2026
112 of 113 checks passed
@baibaichen (Contributor Author):

Thanks, @jinchengchenghh

baibaichen deleted the feature/41 branch on January 7, 2026 at 08:00

Labels: BUILD, CORE (works for Gluten Core), INFRA, VELOX