
Conversation

@huaxingao (Contributor) commented May 10, 2025

This PR is the same as #12494, except:

  • The old PR uses Spark 4.0 RC4, while this PR uses Spark 4.0 RC5.
  • The old PR includes a Comet version bump, whereas this PR does not, since the Comet version bump is already included. (That’s why the old PR has 584 files changed, while this one has 580 files changed.)

@huaxingao (Contributor, Author)

@tomtongue Sorry, your changes in https://github.com/apache/iceberg/pull/13007/files were removed when I reverted my Spark 4.0 integration PR. I didn't include your changes in this PR. After this PR is merged, could you please submit a new PR to redo the Spark 4.0 changes so that you will receive the commit credit? Thanks!

@tomtongue (Contributor) commented May 10, 2025

> @tomtongue Sorry, your changes in https://github.com/apache/iceberg/pull/13007/files were removed when I reverted my Spark 4.0 integration PR. I didn't include your changes in this PR. After this PR is merged, could you please submit a new PR to redo the Spark 4.0 changes so that you will receive the commit credit? Thanks!

@huaxingao Sure, no problem! Thanks for letting me know. I'm working on refactoring more tests for Spark 3.5 and Spark 4.0 in addition to those changes, so I will create a new PR with those changes after this one is merged.

@amogh-jahagirdar (Contributor) commented May 10, 2025

Thanks for understanding and helping out here @tomtongue!

Comment on lines 31 to 32
description: 'The spark project version to use, such as iceberg-spark-4.0'
default: 'iceberg-spark-4.0'
Contributor:

I think it would be better if we hold off on updating this to 4.0 until the release is official. We can update it alongside the default Spark version when Spark 4.0 is actually released. Until then, I think it makes sense for the JMH benchmarks to just run against 3.5.

Contributor (Author):

Changed

"IcebergSourceParquetEqDeleteBenchmark", "IcebergSourceParquetMultiDeleteFileBenchmark",
"IcebergSourceParquetPosDeleteBenchmark", "IcebergSourceParquetWithUnrelatedDeleteBenchmark"]
- spark_version: ['iceberg-spark-3.5']
+ spark_version: ['iceberg-spark-4.0']
Contributor:

Same as above, I think we should leave this at 3.5 until 4.0 is official (and upgrade it alongside the default Spark version).

Contributor (Author):

Changed. Thanks!

mavenLocal()
maven {
  url "https://repository.apache.org/content/repositories/orgapachespark-1481/"
}
Member (@ajantha-bhat):

Are we planning to merge this PR with RC5? (Because I saw that we recently merged and reverted #13006.)

I think it is not good practice to depend on an RC on the main branch. Why don't we continue development in a separate branch until the official release is available?

@amogh-jahagirdar (Contributor) commented May 12, 2025

> I think it is not good practice to depend on an RC on the main branch. Why don't we continue development in a separate branch until the official release is available?

@ajantha-bhat It's true that we'd have a dependency on an RC on main, but there's a benefit to being able to develop on top of it for any new integrations (e.g., for V3) while the RC process is still ongoing, because we don't expect those dependent features to fundamentally change between the RC and the release. Keep in mind the defaultSparkVersion and any infra like benchmarking will still default to 3.5 until the official release.

The main challenge is that continuing development in a separate branch means that one individual needs to keep rebasing and evaluating whether any intermediate changes to 3.4/3.5 need to be kept in sync with 4.0. Merging the initial integration means that every subsequent change to Iceberg-Spark puts it on the author of that change to keep it in sync, which is a narrower task, and that author will have a lot more context.

Combining that with the previous point about new integrations means we can safely and reasonably iterate on the 4.0 integration until the official release, rather than having an individual wait for all of that while rebasing and keeping things in sync. That feels worthwhile to me compared to the awkwardness of having an RC dependency in main in the short term.

Member (@ajantha-bhat):

I see. Thanks for the explanation.

I am neutral about this. If the community agrees to depend on an RC, it works for me. Just make sure it has more visibility.
Tagging @RussellSpitzer, @szehon-ho, @rdblue, @danielcweeks for more visibility/approvals.

Contributor:

Sounds good, I went ahead and added a few folks for reviews. Thanks!

Member:

Yes, as RC5 is getting closer to the Spark 4.0 release (only blockers are getting in now), I think it makes sense to work on Iceberg in parallel, as we don't anticipate any significant changes when bumping to the final RC.

@amogh-jahagirdar (Contributor) left a comment

Everything looks good to me, thanks @huaxingao! I'll hold for @ajantha-bhat and any other reviewers for a bit

strategy:
  matrix:
-   jvm: [11, 17, 21]
+   jvm: [17, 21]
Contributor:

What's the reason for removing JDK 11 from this? We're releasing JDK 11 jars, so I think we should keep testing with JDK 11.

@pan3793 (Member) left a comment

Good job! Just a nit: better to rename the commit messages from

  • [Move 3.5 as 4.0]
  • [Copy back 4.0 as 3.5]
  • [initial support for Spark 4.0]

to

  • Spark: Move 3.5 as 4.0
  • Spark: Copy back 4.0 as 3.5
  • Spark: Initial support for Spark 4.0

@huaxingao huaxingao force-pushed the spark4.0_rc5 branch 2 times, most recently from 5630c27 to 7eee3dd on May 13, 2025 at 23:26
@huaxingao huaxingao closed this May 14, 2025
@huaxingao huaxingao reopened this May 14, 2025
@wypoon (Contributor) commented May 14, 2025

Hi @huaxingao, I'm looking at the 3 commits in this PR, in particular the 3rd one, "Spark: initial support for Spark 4.0". I'm puzzled as to why there are changes in spark/v3.5 in this commit (they seem to be test changes). Shouldn't the changes (aside from workflow and build-related ones) be in spark/v4.0 only? From what I can see, you already had all the latest changes in spark/v3.5 before the 1st commit, so moving that and copying it back should result in the current state of spark/v3.5, which should then need no further changes.

@wypoon (Contributor) commented May 14, 2025

Compared to #12494, it seems the main changes in spark/v4.0 (excluding test changes that appear to be due to test changes in spark/v3.5 since the previous PR) are:

diff -r ../../apache3/iceberg/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
149,150c149
<     Configuration conf =
<         SparkUtil.hadoopConfCatalogOverrides(SparkSession.getActiveSession().get(), name);
---
>     Configuration conf = SparkUtil.hadoopConfCatalogOverrides(SparkSession.active(), name);
153,157c152,153
<     optionsMap.put(
<         CatalogProperties.APP_ID,
<         SparkSession.getActiveSession().get().sparkContext().applicationId());
<     optionsMap.put(
<         CatalogProperties.USER, SparkSession.getActiveSession().get().sparkContext().sparkUser());
---
>     optionsMap.put(CatalogProperties.APP_ID, SparkSession.active().sparkContext().applicationId());
>     optionsMap.put(CatalogProperties.USER, SparkSession.active().sparkContext().sparkUser());

Making the SPEC_ID metadata column nullable in Spark instead of in core (MetadataColumns):

diff -r ../../apache3/iceberg/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkChangelogTable.java spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkChangelogTable.java
113c113
<       new SparkMetadataColumn(MetadataColumns.SPEC_ID.name(), DataTypes.IntegerType, false),
---
>       new SparkMetadataColumn(MetadataColumns.SPEC_ID.name(), DataTypes.IntegerType, true),
diff -r ../../apache3/iceberg/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkPositionDeltaWriteBuilder.java spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkPositionDeltaWriteBuilder.java
20a21,23
> import static org.apache.iceberg.MetadataColumns.SPEC_ID_COLUMN_DOC;
> import static org.apache.iceberg.MetadataColumns.SPEC_ID_COLUMN_ID;
> 
29a33
> import org.apache.iceberg.types.Types;
105c109,110
<             MetadataColumns.SPEC_ID,
---
>             Types.NestedField.optional(
>                 SPEC_ID_COLUMN_ID, "_spec_id", Types.IntegerType.get(), SPEC_ID_COLUMN_DOC),
diff -r ../../apache3/iceberg/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java
263c263
<         new SparkMetadataColumn(MetadataColumns.SPEC_ID.name(), DataTypes.IntegerType, false),
---
>         new SparkMetadataColumn(MetadataColumns.SPEC_ID.name(), DataTypes.IntegerType, true),

and

diff -r ../../apache3/iceberg/spark/v4.0/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/parser/extensions/IcebergSparkSqlExtensionsParser.scala spark/v4.0/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/parser/extensions/IcebergSparkSqlExtensionsParser.scala
31d30
< import org.apache.iceberg.spark.Spark3Util
33d31
< import org.apache.iceberg.spark.source.SparkTable
38d35
< import org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases
40d36
< import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
47,48d42
< import org.apache.spark.sql.connector.catalog.Table
< import org.apache.spark.sql.connector.catalog.TableCatalog
54d47
< import scala.util.Try
117a111,112
> 
>   override def parseRoutineParam(sqlText: String): StructType = throw new UnsupportedOperationException()

I'm not sure why parseRoutineParam is overridden now and wasn't before.
It looks fine to me, except I'm not sure about the last one.

@huaxingao huaxingao closed this May 14, 2025
@huaxingao huaxingao reopened this May 14, 2025
@huaxingao (Contributor, Author)

@wypoon Thanks for checking. There shouldn't be any Spark 3.5 changes in the 3rd commit. I will take a look.

@huaxingao (Contributor, Author)

@wypoon Thanks a lot for verifying this!
If you look at the overall changes, there are 577 files modified, all in Spark v4.0. There are no changes to Spark v3.5.

However, if you look at the third commit specifically, there are some changes to Spark v3.5 tests. The reason is that after I submitted my PR, there were updates to the Spark v3.5 tests (e.g., ba75a11, 5d2230e, etc.), so I needed to sync Spark v3.5 to the latest. That’s why there are some changes to Spark v3.5 in the third commit.

I think as long as the overall changes contain no changes to Spark 3.5, we should be fine.

@huaxingao (Contributor, Author)

@wypoon I checked the three changes you pointed out in Spark 4.0; all of them are intended changes:

  1. The change in SparkCatalog.java is intended. I previously changed SparkSession.active() to SparkSession.getActiveSession().get(); this was unnecessary, so I changed it back.
  2. Making the SPEC_ID metadata column nullable in Spark instead of in core.
  3. def parseRoutineParam(sqlText: String): StructType was added to Spark's ParserInterface.scala in Spark 4.0 RC5; it was not in RC4, which is why it was not overridden in the previous PR.
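
For illustration, here is a minimal Scala sketch of what point 3 amounts to for a parser that delegates to Spark's ParserInterface. The class and constructor parameter names below are hypothetical stand-ins; only the override signature is taken from the diff above.

import org.apache.spark.sql.catalyst.parser.ParserInterface
import org.apache.spark.sql.types.StructType

// Hypothetical stand-in for IcebergSparkSqlExtensionsParser; the delegate would back
// the other parse methods, omitted here. Spark 4.0 RC5 added parseRoutineParam to
// ParserInterface, so any parser compiled against RC5 must now provide it; Iceberg's
// SQL extensions don't use routine parameters, so simply throwing is sufficient.
abstract class DelegatingSqlExtensionsParser(delegate: ParserInterface) extends ParserInterface {

  // New in Spark 4.0 RC5 (not present in RC4), hence the new override in this PR.
  override def parseRoutineParam(sqlText: String): StructType =
    throw new UnsupportedOperationException()
}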

@huaxingao (Contributor, Author)

cc @amogh-jahagirdar Spark RC6 works OK. All checks have passed.

@pan3793 (Member) commented May 14, 2025

@wypoon's comment is correct. @huaxingao, though you made 3 commits, the changes in each commit do not match its message.

@huaxingao (Contributor, Author)

@amogh-jahagirdar @pan3793 @wypoon

> ... though you made 3 commits, the changes in each commit do not match its message.

Then I probably need to create a new PR. While I work on the new PR, we should avoid committing any other changes to Spark 3.5.

@amogh-jahagirdar (Contributor) commented May 14, 2025

@pan3793 @wypoon I'm not very concerned about including the 3.5 changes in the third commit as part of the most recent rebase. Those are just syncing the changes made on main; when these commits are rebased and merged, and someone looks at the diff-tree of that commit, the 3.5 changes won't even be visible since they already exist in main. For example, after a local rebase of these changes onto my up-to-date master, you can see the diff-tree of the 4.0 commit doesn't even include the 3.5 changes. The most important aspect is the outcome when the commits are rebased and merged on main, and making sure the history for the older Spark versions is preserved. The intermediate changes to 3.5 were also just a handful of test changes.

[Screenshot 2025-05-14 at 8:22:36 AM: diff-tree of the 4.0 commit after a local rebase]

@huaxingao (Contributor, Author)

I need to rebase again to pick up the changes in d44cba5.

@pan3793 (Member) commented May 14, 2025

@amogh-jahagirdar The benefit is that we are going to delete the old Spark folders gradually, but we want to keep the whole commit history of each file.

Take spark/v3.5/foo.scala as an example: after several years, Iceberg will drop support for Spark 3.5, and foo.scala may then live at spark/v4.3/foo.scala. We still want to know who committed each line of foo.scala and for what reason; this is important for developers to understand the code and the trade-offs and debate behind each change.

@amogh-jahagirdar (Contributor)

@pan3793 I understand that, but in this case the intermediate changes were a few JUnit test class upgrades, so losing that history for those specific files didn't seem like a big deal to me.

Anyway, I discussed with @huaxingao, who said she'd put up another PR with cleaner history (it seems like it'd be difficult to cleanly arrange the first copy commit at this point). In the meantime, we'll hold off on merging any Spark 3.5 changes to make this integration easier.

@pan3793 (Member) commented May 14, 2025

> ... in this case the intermediate changes were a few JUnit test class upgrades, so losing that history for those specific files didn't seem like a big deal to me.

Okay, that's also fine with me.

@huaxingao (Contributor, Author)

I created a new PR so we can have a cleaner history. Closing this PR for now.

@huaxingao huaxingao closed this May 14, 2025
@huaxingao huaxingao deleted the spark4.0_rc5 branch May 14, 2025 19:00
@wypoon (Contributor) commented May 14, 2025

> @wypoon I checked the three changes you pointed out in Spark 4.0; all of them are intended changes:
>
>   1. The change in SparkCatalog.java is intended. I previously changed SparkSession.active() to SparkSession.getActiveSession().get(); this was unnecessary, so I changed it back.
>   2. Making the SPEC_ID metadata column nullable in Spark instead of in core.
>   3. def parseRoutineParam(sqlText: String): StructType was added to Spark's ParserInterface.scala in Spark 4.0 RC5; it was not in RC4, which is why it was not overridden in the previous PR.

@huaxingao Thanks for the explanation. I was only unclear about item 3.
