
Conversation

@huaxingao (Contributor) commented May 10, 2025

This PR is the same as #12494, except:

  • The old PR uses Spark 4.0 RC4, while this PR uses Spark 4.0 RC5.
  • The old PR includes a Comet version bump, whereas this PR does not, since the Comet version bump is already included. (That’s why the old PR has 584 files changed, while this one has 580 files changed.)

@huaxingao (Contributor, Author)

@tomtongue Sorry, your changes in https://github.com/apache/iceberg/pull/13007/files were removed when I reverted my Spark 4.0 integration PR. I didn't include your changes in this PR. After this PR is merged, could you please submit a new PR to redo the Spark 4.0 changes so that you will receive the commit credit? Thanks!

@tomtongue (Contributor) commented May 10, 2025

> @tomtongue Sorry, your changes in https://github.com/apache/iceberg/pull/13007/files were removed when I reverted my Spark 4.0 integration PR. I didn't include your changes in this PR. After this PR is merged, could you please submit a new PR to redo the Spark 4.0 changes so that you will receive the commit credit? Thanks!

@huaxingao Sure, no problem! Thanks for letting me know. I'm working on refactoring more tests for Spark 3.5 and Spark 4.0 in addition to those changes, so I will create a new PR with those changes after this one is merged.

@amogh-jahagirdar (Contributor) commented May 10, 2025

Thanks for understanding and helping out here @tomtongue!

Comment on lines 31 to 32
description: 'The spark project version to use, such as iceberg-spark-4.0'
default: 'iceberg-spark-4.0'
Contributor:

I think it would be better if we hold off on updating this to 4.0 until the release is official. We can update it alongside the default Spark version when Spark 4.0 is actually released. Until then, I think it makes sense for the JMH benchmarks to just run against 3.5.

Contributor (Author):

Changed

"IcebergSourceParquetEqDeleteBenchmark", "IcebergSourceParquetMultiDeleteFileBenchmark",
"IcebergSourceParquetPosDeleteBenchmark", "IcebergSourceParquetWithUnrelatedDeleteBenchmark"]
- spark_version: ['iceberg-spark-3.5']
+ spark_version: ['iceberg-spark-4.0']
Contributor:

Same as above, I think we should leave this at 3.5 until 4.0 is official (and upgrade it alongside the default Spark version).

Contributor (Author):

Changed. Thanks!

mavenLocal()
maven {
  url "https://repository.apache.org/content/repositories/orgapachespark-1481/"
}
Member (@ajantha-bhat):

Are we planning to merge this PR with RC5? (Because I saw that we recently merged and reverted #13006.)

I think it is not good practice to depend on an RC on the main branch. Why don't we continue development in a separate branch until the official release is available?

@amogh-jahagirdar (Contributor) commented May 12, 2025

> I think it is not good practice to depend on an RC on the main branch. Why don't we continue development in a separate branch until the official release is available?

@ajantha-bhat It's true that we'd have a dependency on an RC on main, but there's a benefit to being able to develop on top of it for any new integrations (e.g., for V3) while the RC process is still ongoing, because we don't expect those dependent features to fundamentally change between the RC and the release. Keep in mind the defaultSparkVersion and any infra like benchmarking will still default to 3.5 until the official release.

The main challenge is that continuing development in a separate branch means that one individual needs to keep rebasing and evaluating whether any intermediate changes to 3.4/3.5 need to be kept in sync with 4.0. Merging the initial integration means that every subsequent change to Iceberg-Spark puts it on the author of that change to keep it in sync, which is a narrower task, and that author will have a lot more context.

Combining that with the previous point about new integrations means we can safely and reasonably iterate on the 4.0 integration until the official release, rather than having an individual wait for all of that while rebasing and keeping things in sync. That feels worthwhile to me compared to the awkwardness of having an RC dependency in main in the short term.

Member (@ajantha-bhat):

I see. Thanks for the explanation.

I am neutral about this. If the community agrees to depend on an RC, it works for me. Just make sure it has more visibility.
Tagging @RussellSpitzer, @szehon-ho, @rdblue, @danielcweeks for more visibility/approvals.

Contributor:

Sounds good, I went ahead and added a few folks for reviews. Thanks!

Member:

Yes, as RC5 is getting closer to the Spark 4.0 release (only blockers are getting in now), I think it makes sense to work on Iceberg in parallel, as we don't anticipate any significant changes when bumping to the final RC.

@amogh-jahagirdar (Contributor) left a comment

Everything looks good to me, thanks @huaxingao! I'll hold for @ajantha-bhat and any other reviewers for a bit

strategy:
  matrix:
-   jvm: [11, 17, 21]
+   jvm: [17, 21]
Contributor:

What's the reason for removing JDK 11 from this? We're releasing JDK 11 jars, so I think we should keep testing with JDK 11.

@pan3793 (Member) left a comment

Good job! Just a nit: better to rename the commit messages from

  • [Move 3.5 as 4.0]
  • [Copy back 4.0 as 3.5]
  • [initial support for Spark 4.0]

to

  • Spark: Move 3.5 as 4.0
  • Spark: Copy back 4.0 as 3.5
  • Spark: Initial support for Spark 4.0

@huaxingao huaxingao force-pushed the spark4.0_rc5 branch 2 times, most recently from 5630c27 to 7eee3dd on May 13, 2025 at 23:26
@huaxingao huaxingao closed this May 14, 2025
@huaxingao huaxingao reopened this May 14, 2025
@wypoon (Contributor) commented May 14, 2025

Hi @huaxingao, I'm looking at the 3 commits in this PR, in particular the 3rd one, "Spark: initial support for Spark 4.0". I'm puzzled as to why there are changes in spark/v3.5 in this commit (they seem to be test changes). Shouldn't the changes (aside from workflow and build-related ones) be in spark/v4.0 only? From what I can see, you already had all the latest changes in spark/v3.5 before the 1st commit, so moving that and copying it back should result in the current state of spark/v3.5, which should then need no further changes.

@wypoon (Contributor) commented May 14, 2025

Compared to #12494, it seems the main changes in spark/v4.0 (excluding test changes that appear to be due to test changes in spark/v3.5 since the previous PR) are:

diff -r ../../apache3/iceberg/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
149,150c149
<     Configuration conf =
<         SparkUtil.hadoopConfCatalogOverrides(SparkSession.getActiveSession().get(), name);
---
>     Configuration conf = SparkUtil.hadoopConfCatalogOverrides(SparkSession.active(), name);
153,157c152,153
<     optionsMap.put(
<         CatalogProperties.APP_ID,
<         SparkSession.getActiveSession().get().sparkContext().applicationId());
<     optionsMap.put(
<         CatalogProperties.USER, SparkSession.getActiveSession().get().sparkContext().sparkUser());
---
>     optionsMap.put(CatalogProperties.APP_ID, SparkSession.active().sparkContext().applicationId());
>     optionsMap.put(CatalogProperties.USER, SparkSession.active().sparkContext().sparkUser());

Making the SPEC_ID metadata column nullable in Spark instead of in core (MetadataColumns):

diff -r ../../apache3/iceberg/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkChangelogTable.java spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkChangelogTable.java
113c113
<       new SparkMetadataColumn(MetadataColumns.SPEC_ID.name(), DataTypes.IntegerType, false),
---
>       new SparkMetadataColumn(MetadataColumns.SPEC_ID.name(), DataTypes.IntegerType, true),
diff -r ../../apache3/iceberg/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkPositionDeltaWriteBuilder.java spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkPositionDeltaWriteBuilder.java
20a21,23
> import static org.apache.iceberg.MetadataColumns.SPEC_ID_COLUMN_DOC;
> import static org.apache.iceberg.MetadataColumns.SPEC_ID_COLUMN_ID;
> 
29a33
> import org.apache.iceberg.types.Types;
105c109,110
<             MetadataColumns.SPEC_ID,
---
>             Types.NestedField.optional(
>                 SPEC_ID_COLUMN_ID, "_spec_id", Types.IntegerType.get(), SPEC_ID_COLUMN_DOC),
diff -r ../../apache3/iceberg/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java
263c263
<         new SparkMetadataColumn(MetadataColumns.SPEC_ID.name(), DataTypes.IntegerType, false),
---
>         new SparkMetadataColumn(MetadataColumns.SPEC_ID.name(), DataTypes.IntegerType, true),

and

diff -r ../../apache3/iceberg/spark/v4.0/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/parser/extensions/IcebergSparkSqlExtensionsParser.scala spark/v4.0/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/parser/extensions/IcebergSparkSqlExtensionsParser.scala
31d30
< import org.apache.iceberg.spark.Spark3Util
33d31
< import org.apache.iceberg.spark.source.SparkTable
38d35
< import org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases
40d36
< import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
47,48d42
< import org.apache.spark.sql.connector.catalog.Table
< import org.apache.spark.sql.connector.catalog.TableCatalog
54d47
< import scala.util.Try
117a111,112
> 
>   override def parseRoutineParam(sqlText: String): StructType = throw new UnsupportedOperationException()

I'm not sure why parseRoutineParam is overridden now and wasn't before.
It looks fine to me, except I'm not sure about the last one.

@huaxingao huaxingao closed this May 14, 2025
@huaxingao huaxingao reopened this May 14, 2025
@huaxingao (Contributor, Author)

@wypoon Thanks for checking. There shouldn't be any Spark 3.5 changes in the 3rd commit. I will take a look.

@huaxingao (Contributor, Author)

@wypoon Thanks a lot for verifying this!
If you look at the overall changes, there are 577 files modified, all in Spark v4.0. There are no changes to Spark v3.5.

However, if you look at the third commit specifically, there are some changes to Spark v3.5 tests. The reason is that after I submitted my PR, there were updates to the Spark v3.5 tests (e.g., ba75a11, 5d2230e, etc.), so I needed to sync Spark v3.5 to the latest. That’s why there are some changes to Spark v3.5 in the third commit.

I think as long as the overall changes contain no changes to Spark 3.5, we should be fine.

@huaxingao (Contributor, Author)

@wypoon I checked the three changes you pointed out in Spark 4.0; all of them are intended changes:

  1. The change in SparkCatalog.java is intended. I previously changed SparkSession.active() to SparkSession.getActiveSession().get(); this was unnecessary, so I changed it back.
  2. Making the SPEC_ID metadata column nullable in Spark instead of in core.
  3. def parseRoutineParam(sqlText: String): StructType was added to Spark's ParserInterface.scala in Spark 4.0 RC5; it was not in RC4, which is why it was not overridden in the previous PR.
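
For illustration, here is a minimal Scala sketch of what point 3 amounts to for a parser that delegates to Spark's ParserInterface. The class and constructor parameter names below are hypothetical stand-ins; only the override signature is taken from the diff above.

import org.apache.spark.sql.catalyst.parser.ParserInterface
import org.apache.spark.sql.types.StructType

// Hypothetical stand-in for IcebergSparkSqlExtensionsParser; the delegate would back
// the other parse methods, omitted here. Spark 4.0 RC5 added parseRoutineParam to
// ParserInterface, so any parser compiled against RC5 must now provide it; Iceberg's
// SQL extensions don't use routine parameters, so simply throwing is sufficient.
abstract class DelegatingSqlExtensionsParser(delegate: ParserInterface) extends ParserInterface {

  // New in Spark 4.0 RC5 (not present in RC4), hence the new override in this PR.
  override def parseRoutineParam(sqlText: String): StructType =
    throw new UnsupportedOperationException()
}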

@huaxingao (Contributor, Author)

cc @amogh-jahagirdar Spark RC6 works OK. All checks have passed.

@pan3793 (Member) commented May 14, 2025

@wypoon's comment is correct. @huaxingao, though you made 3 commits, the changes in each commit do not match its message.

@huaxingao (Contributor, Author)

@amogh-jahagirdar @pan3793 @wypoon

> ... though you made 3 commits, the changes in each commit do not match its message.

Then I probably need to create a new PR. While I work on the new PR, we should avoid committing any other changes to Spark 3.5.

@amogh-jahagirdar (Contributor) commented May 14, 2025

@pan3793 @wypoon I'm not very concerned about including the 3.5 changes in the third commit as part of the most recent rebase. Those are just syncing the changes made on main; when these commits are rebased and merged, and someone looks at the diff-tree of that commit, the 3.5 changes won't even be visible since they already exist in main. For example, after a local rebase of these changes onto my up-to-date master, you can see the diff-tree of the 4.0 commit doesn't even include the 3.5 changes. The most important aspect is the outcome when the commits are rebased and merged on main, and making sure the history for the older Spark versions is preserved. The intermediate changes to 3.5 were also just a handful of test changes.

[Screenshot 2025-05-14 at 8:22:36 AM: diff-tree of the 4.0 commit after a local rebase]

@huaxingao (Contributor, Author)

I need to rebase again to pick up the changes in d44cba5.

@pan3793 (Member) commented May 14, 2025

@amogh-jahagirdar The benefit is that we are going to delete the old Spark folders gradually, but we want to keep the whole commit history of each file.

Take spark/v3.5/foo.scala as an example: after several years, Iceberg will drop support for Spark 3.5, and foo.scala may then live at spark/v4.3/foo.scala. We still want to know who committed each line of foo.scala and for what reason; this is important for developers to understand the code and the trade-offs and debate behind each change.

@amogh-jahagirdar (Contributor)

@pan3793 I understand that, but in this case the intermediate changes were a few JUnit test class upgrades, so losing that history for those specific files didn't seem like a big deal to me.

Anyway, I discussed with @huaxingao, who said she'd put up another PR with cleaner history (it seems like it'd be difficult to cleanly arrange the first copy commit at this point). In the meantime, we'll hold off on merging any Spark 3.5 changes to make this integration easier.

@pan3793 (Member) commented May 14, 2025

> ... in this case the intermediate changes were a few JUnit test class upgrades, so losing that history for those specific files didn't seem like a big deal to me.

Okay, that's also fine with me.

@huaxingao (Contributor, Author)

I created a new PR so we can have a cleaner history. Closing this PR for now.

@huaxingao huaxingao closed this May 14, 2025
@huaxingao huaxingao deleted the spark4.0_rc5 branch May 14, 2025 19:00
@wypoon (Contributor) commented May 14, 2025

> @wypoon I checked the three changes you pointed out in Spark 4.0; all of them are intended changes:
>
>   1. The change in SparkCatalog.java is intended. I previously changed SparkSession.active() to SparkSession.getActiveSession().get(); this was unnecessary, so I changed it back.
>   2. Making the SPEC_ID metadata column nullable in Spark instead of in core.
>   3. def parseRoutineParam(sqlText: String): StructType was added to Spark's ParserInterface.scala in Spark 4.0 RC5; it was not in RC4, which is why it was not overridden in the previous PR.

@huaxingao Thanks for the explanation. I was only unclear about item 3.
