Spark 4.0 integration #12494
Conversation
build.gradle
Outdated
mavenCentral()
mavenLocal()
maven {
  url "https://repository.apache.org/content/repositories/snapshots/"
Need this for now because this PR uses 4.0.1-SNAPSHOT.
Will remove this in the future.
aliyun-sdk-oss = "3.10.2"
analyticsaccelerator = "1.0.0"
antlr = "4.9.3"
antlr413 = "4.13.1" # For Spark 4.0 support
To be consistent with the ANTLR version used by Spark 4.0.
build.gradle
Outdated
testImplementation project(path: ':iceberg-api', configuration: 'testArtifacts')
testImplementation project(path: ':iceberg-core', configuration: 'testArtifacts')
testImplementation libs.awaitility
testImplementation 'com.zaxxer:HikariCP:5.1.0'
Spark 4.0 removed the dependency on com.jolbox:bonecp. In TestHiveMetastore.java, I explicitly set the connection pool with conf.set("datanucleus.connectionPoolingType", "HikariCP"); otherwise it defaults to com.jolbox:bonecp and we get:
Exception in thread "TThreadPoolServer WorkerProcess-5" java.lang.NoClassDefFoundError: com/jolbox/bonecp/BoneCPConfig
at org.apache.hadoop.hive.metastore.txn.TxnHandler.setupJdbcConnectionPool(TxnHandler.java:3156)
at org.apache.hadoop.hive.metastore.txn.TxnHandler.setConf(TxnHandler.java:260)
Since the connection pool is explicitly set to HikariCP, we need to add the test dependency.
DBCP (already pulled in by Spark, see SPARK-48538) can be used to avoid adding the HikariCP dependency.
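A minimal sketch of that alternative, assuming Hive's metastore accepts DBCP as the DataNucleus pooling type (the exact value spelling may differ):

import org.apache.hadoop.hive.conf.HiveConf;

// Hypothetical alternative to the HikariCP setting: rely on the DBCP pool
// that Spark already pulls in (SPARK-48538), so no extra test dependency is needed.
HiveConf conf = new HiveConf();
conf.set("datanucleus.connectionPoolingType", "DBCP");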
Somehow after I changed to DBCP, I started to get
> Task :iceberg-spark:iceberg-spark-runtime-4.0_2.13:integrationTest FAILED
without any additional information.
If I run ./gradlew :iceberg-spark:iceberg-spark-runtime-4.0_2.13:integrationTest locally, it always passes.
Even after I switched back to HikariCP, Task :iceberg-spark:iceberg-spark-runtime-4.0_2.13:integrationTest still fails.
That should be irrelevant. Could you try uploading the reports for the failing test?
.github/workflows/spark-ci.yml
- uses: actions/upload-artifact@v4
if: failure()
with:
name: test logs
path: |
**/build/testlogs
+ **/build/reports
what's the root cause?
Caused by: org.junit.platform.commons.JUnitException: OutputDirectoryProvider not available; probably due to unaligned versions of the junit-platform-engine and junit-platform-launcher jars on the classpath/module path.
I guess junit-platform-engine was upgraded upstream. I rebased and the problem went away.
public static final String SPEC_ID_COLUMN_DOC = "Spec ID used to track the file containing a row";
public static final NestedField SPEC_ID =
    NestedField.required(
    NestedField.optional(
In Spark 4.0, the metadata columns are nullable, so this field needs to change to optional.
apache/spark#50246 (comment)
conf.set(
    HiveConf.ConfVars.HIVE_IN_TEST.varname, HiveConf.ConfVars.HIVE_IN_TEST.getDefaultValue());
conf.set("datanucleus.connectionPoolingType", "HikariCP");
Spark 4.0 removed the dependency on com.jolbox:bonecp. If the connection pooling type is not set, it defaults to com.jolbox:bonecp and we get:
Exception in thread "TThreadPoolServer WorkerProcess-5" java.lang.NoClassDefFoundError: com/jolbox/bonecp/BoneCPConfig
at org.apache.hadoop.hive.metastore.txn.TxnHandler.setupJdbcConnectionPool(TxnHandler.java:3156)
at org.apache.hadoop.hive.metastore.txn.TxnHandler.setConf(TxnHandler.java:260)
val newCyclePath = cyclePath :+ currentViewIdent
if (currentViewIdent == viewIdent) {
  throw new AnalysisException(String.format("Recursive cycle in view detected: %s (cycle: %s)",
  throw new IcebergAnalysisException(String.format("Recursive cycle in view detected: %s (cycle: %s)",
Spark 4.0 no longer allows constructing an AnalysisException from an error message alone; an error class must be provided. To get around this, I added IcebergAnalysisException, which only needs an error message.
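A minimal sketch of what such a message-only exception could look like; the actual class added in this PR may differ (for example, it may still extend Spark's AnalysisException through an error-class-aware constructor), so the shape below is an assumption:

// Shape assumed for illustration; not necessarily the PR's implementation.
public class IcebergAnalysisException extends RuntimeException {

  public IcebergAnalysisException(String message) {
    super(message);
  }

  public IcebergAnalysisException(String message, Throwable cause) {
    super(message, cause);
  }
}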
import org.apache.spark.sql.catalyst.util.truncatedString
import org.apache.spark.sql.connector.iceberg.catalog.Procedure
case class IcebergCall(procedure: Procedure, args: Seq[Expression]) extends LeafCommand {
Spark 4.0 has its own implementation of Call. In this Spark 4.0/Iceberg integration PR, I simply rename Call to IcebergCall to avoid the name collision. In the future, we will need to migrate to Spark's native Call framework.
TestBase.hiveConf = metastore.hiveConf();
TestBase.spark.close();
TestBase.spark.stop();
We don't have sql.SparkSession.close() any more, so replace it with stop().
});
public void testDelegateUnsupportedProcedure() throws ParseException {
  LogicalPlan plan = parser.parsePlan("CALL cat.d.t()");
  assertThat(plan.toString().contains("CALL cat.d.t()"));
Since Spark 4.0 has stored procedure support, this no longer throws a ParseException.
Shouldn't we just remove this test then? I'm not sure what we are checking here any more?
Column[] sortSpec = sortSpec(df, repartitionSpec, false);
StructType schema = df.schema();
String[] identifierFields =
    Arrays.stream(repartitionSpec).map(Column::toString).toArray(String[]::new);
I can't use Column::toString any more because it now returns a fully qualified column name, which causes fieldIndex(name: String) to fail later.
We already have a String array of identifierFields in computeUpdateImages (we only need to append MetadataColumns.CHANGE_ORDINAL.name() at the end). So instead of calling
override def toString: String = node.sql
I think it's more efficient to pass in the String array of identifierFields.
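A rough sketch of that idea, assuming identifierFields is the String[] already available in computeUpdateImages (variable names are illustrative):

import java.util.Arrays;
import java.util.stream.Stream;
import org.apache.iceberg.MetadataColumns;

// Reuse the identifier field names we already have and append the change
// ordinal column, instead of deriving names from Column.toString(), which is
// fully qualified in Spark 4.0 and breaks the later fieldIndex(name) lookup.
String[] repartitionColumnNames =
    Stream.concat(
            Arrays.stream(identifierFields),
            Stream.of(MetadataColumns.CHANGE_ORDINAL.name()))
        .toArray(String[]::new);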
SparkTable sparkTable = new SparkTable(table, snapshotId, false);
DataSourceV2Relation relation = createRelation(sparkTable, ImmutableMap.of());
return Dataset.ofRows(spark, relation);
if (!(spark instanceof org.apache.spark.sql.classic.SparkSession)) {
Spark 4.0 has an abstract SparkSession class and classic.SparkSession. We always try to use the general one, and only cast to classic.SparkSession when the private APIs are needed.
The same rule applies to Dataset.
I think I noted this in other places too, but let's change all of these checks to a Precondition if we can.
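Something like this hedged sketch, using Iceberg's relocated Guava Preconditions (message wording is illustrative only):

import org.apache.iceberg.relocated.com.google.common.base.Preconditions;

// Fail fast with a clear message instead of a bare instanceof branch.
Preconditions.checkArgument(
    spark instanceof org.apache.spark.sql.classic.SparkSession,
    "Expected a classic SparkSession but got: %s", spark.getClass().getName());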
final Configuration conf = spark.sessionState().newHadoopConf();
spark
    .sqlContext()
The generic abstract SparkSession class doesn't have the sqlContext() API any more.
cc @RussellSpitzer @szehon-ho @amogh-jahagirdar @nastra
}
if (sparkVersions.contains("4.0")) {
  include ":iceberg-spark:spark-4.0_2.13"
Does this theoretically create Spark 4.0 without error if Scala 2.12 is set? Feels like we should still use scalaVersion here but assert > 2.12 or something.
The problem is that Flink doesn't work with Scala 2.13. I use 2.13 to build the Spark module only. I think we can't assert > 2.12 because we still need 2.12 to build Flink.
Doesn't that mean when Scala is set to 2.12 we just should not allow building Spark 4.0?
spark/v4.0/build.gradle
Outdated
testImplementation libs.parquet.hadoop
testImplementation libs.awaitility
testImplementation "org.apache.datafusion:comet-spark-spark${sparkMajorVersion}_${scalaVersion}:0.5.0"
testImplementation "org.apache.datafusion:comet-spark-spark3.5_2.13:0.5.0"
Why are we using the old version of comet here? Just because the new one isn't available?
Right, we don't have a Spark 4.0 Comet yet.
extensions.injectResolutionRule { spark => ResolveViews(spark) }
extensions.injectResolutionRule { _ => ProcedureArgumentCoercion }
extensions.injectCheckRule(_ => CheckViews)
extensions.injectResolutionRule { _ => RewriteUpdateTableForRowLineage }
Discussed offline with @huaxingao: since row lineage in 4.0 will have a very different implementation (no need for custom analyzer rules given the changes done in apache/spark#49493), we will remove the changes from https://github.com/apache/iceberg/pull/12736/files that are included in these changes.
)" This reverts commit 829ae7a.
…pache#12736)"" This reverts commit 5170f8c.
@amogh-jahagirdar I have removed the changes from https://github.com/apache/iceberg/pull/12736/files for Spark 4.0. I have checked all the recent commits since my rebase yesterday; the only one with Spark changes is my Comet version bump PR. I will manually port those changes into this PR. After all the checks pass, we can merge.
Ok, thank you @huaxingao for your diligence in keeping up with this PR. I'll go ahead and merge, and we can iterate on any follow-ups. When the official Spark 4.0 release is out, we will do the dependency changes then. Thank you @RussellSpitzer @pan3793 @szehon-ho for reviewing.
Thank you all! @amogh-jahagirdar @RussellSpitzer @pan3793 @szehon-ho @aihuaxu
Congrats on merging to trunk, but alright, the commit history is lost eventually.
Sorry, I forgot to squash the commits. I will fix this.
I will revert the PR for now.
This reverts commit ad7f5c4.
This reverts commit a5bcacd.