Spark - Accept an output-spec-id that allows writing to a desired partition spec
#7120
Conversation
Should this be in Spark 2.4 as well?

@edgarRd I looked into it, but I'm not sure whether it is worthwhile to add it to Spark 2.4: for our use case, adding it only to Spark 3 should be enough.
szehon-ho
left a comment
I think @RussellSpitzer was interested in this as well (for rewrite to different spec id).
szehon-ho
left a comment
This seems reasonable to me; I left a few minor comments.
I would also ask @aokolnychyi to check whether I am missing any potential problems with this approach.
(Force-pushed from 62592b2 to 684f1cf)
szehon-ho
left a comment
Looks good to me, just a style nit
gustavoatt
left a comment
Thank you for the review @szehon-ho!
szehon-ho
left a comment
Thanks for the code changes, those look good now. Missed some questions/comments on the test.
Thanks @gustavoatt for all the changes. Let me wait a little to see if there are any other concerns; I will merge if not.
```java
this.extraSnapshotMetadata = writeConf.extraSnapshotMetadata();
this.partitionedFanoutEnabled = writeConf.fanoutWriterEnabled();

if (writeConf.outputSpecId() == null) {
```
Why not have this inside SparkWriteConf and make outputSpecId() return int?
Done, I moved this logic to SparkWriteConf. The main reason I initially did not do it there was that I did not want to store the specs and the current spec there, but I think that should be OK.
We could keep a reference to Table, just like we do in SparkReadConf.
Follow up PR #7348
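The fallback being discussed can be sketched as a standalone model. This is a hypothetical sketch, not the actual Iceberg code: the class name, method signature, and error message are assumptions. The idea is that when no output-spec-id option was supplied the writer falls back to the table's current spec id, and an explicitly requested id is validated against the table's known specs.

```java
import java.util.Map;

// Hypothetical sketch of resolving an optional output-spec-id write option
// against a table's known partition specs (not the actual Iceberg code).
class OutputSpecResolver {
  // requestedSpecId: the output-spec-id option value, or null if the option was not set.
  // specsById: the table's partition specs keyed by spec id.
  // currentSpecId: the table's current (default) spec id.
  static int resolve(Integer requestedSpecId, Map<Integer, ?> specsById, int currentSpecId) {
    if (requestedSpecId == null) {
      // No option given: write with the table's current spec.
      return currentSpecId;
    }
    if (!specsById.containsKey(requestedSpecId)) {
      throw new IllegalArgumentException(
          "Output spec id " + requestedSpecId + " is not a valid spec id for the table");
    }
    return requestedSpecId;
  }
}
```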
```java
  return sessionConf.get("spark.wap.id", null);
}

public Integer outputSpecId() {
```
Initially we were not sure whether to make spec IDs public but I also don't see a good alternative.
I am OK with the overall idea of exposing this.
The change looks good to me, I had a few minor comments.

Thanks for the review @aokolnychyi and @szehon-ho. There is a check failure on this PR, but it looks unrelated to my changes:

Yeah, I think it has appeared a few times; it is tracked by #7321. Let's just retrigger (close and re-open, for example).
```java
}

public int outputSpecId() {
  final int outputSpecId =
```
Nit: would you mind removing 'final' here? I know from @aokolnychyi (when he reviewed my change) that he prefers not to have extra finals except on class fields, as modern compilers usually add it anyway. Ex: #4293 (comment)
Done. I usually keep the final as a way of having something like const, to avoid accidentally modifying something I did not intend to. But I think it is not necessary in this case, and I would prefer to remove it to keep consistency in the repo.
Merged, thanks @gustavoatt, and also @aokolnychyi and @edgarRd for the additional review.

Thanks for the review and for merging, @szehon-ho @aokolnychyi! I appreciate the effort spent reviewing!
Summary
Allow passing `output-spec-id` to the Spark writer, so that we can choose which partition spec to write to. This is useful when a table has more than one active spec (e.g. a daily and an hourly partition spec).

Fixes #6932
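For context, an illustrative fragment of how the option might be passed from Spark. This is a sketch, not verbatim from the PR: it assumes a live Spark session with an Iceberg catalog configured, an existing DataFrame `df`, and a made-up table name.

```java
// Illustrative only: append to an Iceberg table using partition spec id 1
// instead of the table's current default spec.
df.writeTo("db.events")
    .option("output-spec-id", "1")
    .append();
```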
Testing