Spark4.0 rc5 integration with Iceberg #13022
Conversation
237bc04 to 4f6fc5d
@tomtongue Sorry, your changes in https://github.com/apache/iceberg/pull/13007/files were removed when I reverted my Spark 4.0 integration PR. I didn't include your changes in this PR. After this PR is merged, could you please submit a new PR to redo the Spark 4.0 changes so that you receive the commit credit? Thanks!
@huaxingao Sure, no problem! Thanks for letting me know. I'm working on refactoring more tests for Spark 3.5 and Spark 4.0 in addition to those changes, so I will create a new PR including the changes after this one is merged.
Thanks for understanding and helping out here @tomtongue!
.github/workflows/jmh-benchmarks.yml (outdated)
description: 'The spark project version to use, such as iceberg-spark-4.0'
default: 'iceberg-spark-4.0'
I think it would be better if we hold off on updating this to 4.0 until the release is official. We can update it alongside the default Spark version update when Spark 4.0 is actually released. Until then I think it makes sense for the JMH benchmarks to just run against 3.5.
Changed
  "IcebergSourceParquetEqDeleteBenchmark", "IcebergSourceParquetMultiDeleteFileBenchmark",
  "IcebergSourceParquetPosDeleteBenchmark", "IcebergSourceParquetWithUnrelatedDeleteBenchmark"]
- spark_version: ['iceberg-spark-3.5']
+ spark_version: ['iceberg-spark-4.0']
Same as above, I think we should leave this at 3.5 until 4.0 is official (and upgrade it alongside the default Spark version).
Changed. Thanks!
mavenLocal()
maven {
  url "https://repository.apache.org/content/repositories/orgapachespark-1481/"
}
Are we planning to merge this PR with RC5? (I ask because I saw that we recently merged and reverted #13006.)
I don't think it is good practice to depend on an RC on the main branch. Why don't we continue the development in a separate branch until the official release is available?
> I don't think it is good practice to depend on an RC on the main branch. Why don't we continue the development in a separate branch until the official release is available?
@ajantha-bhat It's true that we'd have a dependency on an RC on main, but there's a benefit to being able to develop new integrations (e.g. for V3) on top of it while the RC is still in progress, because we don't expect those dependent features to fundamentally change between RC and release. Keep in mind that defaultSparkVersion and any infra like benchmarking will still default to 3.5 until the official release.
The main challenge is that continuing development in a separate branch means an individual needs to keep rebasing and evaluating whether each intermediate change to 3.4/3.5 needs to be kept in sync with 4.0. Merging the initial integration means every subsequent change to Iceberg-Spark puts it on the author of that change to keep it in sync, which is narrower in scope, and that author has much more context.
Combined with the previous point about new integrations, this means we can safely and reasonably iterate on the 4.0 integration until the official release, rather than having one individual wait for all of that while rebasing and keeping things in sync. That feels worthwhile to me compared to the awkwardness of having an RC dependency on main in the short term.
I see. Thanks for the explanation.
I am neutral on this. If the community agrees to depend on an RC, it works for me. Just make sure it has more visibility.
Tagging @RussellSpitzer, @szehon-ho, @rdblue, @danielcweeks for more visibility/approvals.
Sounds good, I went ahead and added a few folks for reviews. Thanks!
Yes, as RC5 is getting closer to the Spark 4.0 release (only blockers are going in now), I think it makes sense to work on Iceberg in parallel, as we don't anticipate any significant changes bumping to the final RC.
amogh-jahagirdar left a comment:
Everything looks good to me, thanks @huaxingao! I'll hold for @ajantha-bhat and any other reviewers for a bit
  strategy:
    matrix:
-     jvm: [11, 17, 21]
+     jvm: [17, 21]
What's the reason for removing JDK 11 from this? We're releasing JDK 11 jars, so I think we should keep testing with JDK 11.
pan3793 left a comment:
Good job! Just a nit: it would be better to rename the commit messages from
- [Move 3.5 as 4.0]
- [Copy back 4.0 as 3.5]
- [initial support for Spark 4.0]
to
- Spark: Move 3.5 as 4.0
- Spark: Copy back 4.0 as 3.5
- Spark: Initial support for Spark 4.0
5630c27 to 7eee3dd
Hi @huaxingao, I'm looking at the 3 commits in this PR, specifically the 3rd one, Spark: initial support for Spark 4.0. I'm puzzled as to why there are changes in spark/v3.5 in this commit (they seem to be test changes). Shouldn't the changes (aside from workflow and build-related ones) be in spark/v4.0 only? From what I can see, you already had all the latest changes in spark/v3.5 before the 1st commit, so moving that and copying it back should result in the current state of spark/v3.5, which should then need no further changes.
Compared to #12494, it seems the main changes in spark/v4.0 (excluding test changes that appear to be due to test changes in spark/v3.5 since the previous PR) are making the SPEC_ID metadata column nullable in Spark instead of in core ( and I'm not sure why
@wypoon Thanks for checking. There shouldn't be any Spark 3.5 changes in the 3rd commit. I will take a look.
@wypoon Thanks a lot for verifying this! However, if you look at the third commit specifically, there are some changes to Spark v3.5 tests. The reason is that after I submitted my PR, there were updates to the Spark v3.5 tests (e.g., ba75a11, 5d2230e, etc.), so I needed to sync Spark v3.5 to the latest. That's why there are some changes to Spark v3.5 in the third commit. I think as long as the overall diff contains no changes to Spark 3.5, we should be fine.
@wypoon I checked the three changes you pointed out in Spark 4.0, all of these are intended changes:
cc @amogh-jahagirdar Spark RC6 works OK. All checks have passed. |
@wypoon's comment is correct. @huaxingao, though you made 3 commits, the changes in each commit do not match its message.
@amogh-jahagirdar @pan3793 @wypoon
Then I probably need to create a new PR. While I work on it, we should avoid committing any other changes to Spark 3.5.
@pan3793 @wypoon I'm not very concerned about including the 3.5 changes in the third commit as part of the most recent rebase. Those just sync the changes already made in main; when these commits are rebased and merged, the diff-tree of that commit won't even show the 3.5 changes, since they already exist in main. For example, after a local rebase of these changes onto my up-to-date local master, the diff-tree of the 4.0 commit doesn't include the 3.5 changes. The most important aspect is the outcome once the commits are rebased and merged on main, and making sure the history for the older Spark versions is preserved. The intermediate changes to 3.5 were also just a handful of test changes.
I need to rebase again to pick up the changes in d44cba5
@amogh-jahagirdar the benefit is: we are going to delete the old Spark folder gradually, but we want to keep the whole commit history of each file. take
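Per-file history across a directory move can be checked with `git log --follow`, which traces a file back through renames. A sketch with illustrative paths and identity (rename detection succeeds here because the file content is unchanged across the move):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
mkdir -p spark/v3.5
echo 'common spark build logic' > spark/v3.5/build.gradle
git add -A
git -c user.email=ci@example.com -c user.name=ci \
    commit -q -m "Spark 3.5 build"
# The "Move 3.5 as 4.0" step: a directory rename in one commit.
git mv spark/v3.5 spark/v4.0
git -c user.email=ci@example.com -c user.name=ci \
    commit -q -m "Spark: Move 3.5 as 4.0"
# --follow walks through the rename, keeping the pre-move history:
git log --follow --format=%s -- spark/v4.0/build.gradle
```

The log prints both subjects, newest first, showing the pre-move commit is still reachable from the moved path; without `--follow`, history would stop at the rename commit.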
@pan3793 I understand that, but in this case the intermediate changes were a few JUnit test class upgrades, so losing the history for those specific files didn't seem like a big deal to me. Anyway, I discussed with @huaxingao, who said she'd put up another PR with cleaner history (it seems it would be difficult to cleanly arrange the first copy commit at this point). In the meantime, we'll hold off on merging any Spark 3.5 changes to make this integration easier.
Okay, that's also fine with me.
I created a new PR so we can have a cleaner history. Closing this PR for now.
@huaxingao thanks for the explanation. I was only unclear about 3. |

This PR is the same as #12494, except: