Spark 3.4: IcebergSource extends SessionConfigSupport #7732
Conversation
dramaticlly left a comment:
LGTM, excited to use SQL syntax to influence Spark write options. I am curious whether something similar can be applied to DELETE and MERGE INTO, which today can only be achieved via SQL in Iceberg.
spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceTablesBase.java (two outdated review threads, resolved)
@dramaticlly thanks, I refined the test code.

Kindly ping @RussellSpitzer @aokolnychyi @rdblue

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
szehon-ho left a comment:
This looks good to me. Small suggestion for the test, let me know what you think.
```java
withSQLConf(
    // set write option through session configuration
    ImmutableMap.of("spark.datasource.iceberg.overwrite-mode", "dynamic"),
```
Suggestion: WDYT about testing with snapshot-property and asserting that it's set explicitly? It may make the test a bit clearer, without needing to understand what overwrite-mode is.
@szehon-ho Hmm, sorry, I don't get your point.
Let me explain my idea briefly; the test case should cover both the read and write paths:
- create a table, write some data into it, and record the snapshot as s1
- overwrite the table with dynamic overwrite mode (tests setting write options through session conf) and check the current snapshot of the table
- read the table from snapshot s1 (tests setting read options through session conf) and check the data
Yea sorry, I was not clear. It's just a suggestion.
I think, because the test exercises only the SessionConfigSupport functionality, it may be clearer for the reader if the first check (on the write part) is something like:
'spark.datasource.iceberg.snapshot-property.foo=bar'
and then checks whether foo is set on the latest snapshot summary.
Because I think the reader of the test needs to know what 'dynamic overwrite' mode is to understand the assert (and it's not related to the feature), whereas the above is a bit more self-explanatory IMO.
I think the read part is decently understandable without additional context.
Thanks for the detailed description. I learned something new: I didn't know that Iceberg can add custom snapshot properties through options. I updated the test case to follow the suggestions.
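For context, a minimal sketch of what the updated write-path check could look like (a sketch only: `df`, `tableLocation`, `table`, and the `withSQLConf` helper are assumed to come from the surrounding test class, and the property key is illustrative):

```java
// Set a custom snapshot property through the Spark session configuration and
// verify that it shows up in the summary of the snapshot produced by the write.
withSQLConf(
    ImmutableMap.of("spark.datasource.iceberg.snapshot-property.foo", "bar"),
    () -> df.write().format("iceberg").mode(SaveMode.Append).save(tableLocation));

table.refresh();
assertThat(table.currentSnapshot().summary()).containsEntry("foo", "bar");
```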
Force-pushed from 75d88fa to b21335a
rebased on the latest main branch
```diff
  * over any namespace resolution.
  */
-public class IcebergSource implements DataSourceRegister, SupportsCatalogOptions {
+public class IcebergSource
```
Another comment: there are now multiple ways to configure properties (including #4011), which may be confusing to users. It's worth adding documentation about it, listing the precedence, i.e., when using the DataFrame API (to be double-checked):
- explicit dataframe option
- dataframe session default
- if table exists, explicit table option
- if table exists, table default
I updated the docs and hope it's clear now.
docs/docs/spark-configuration.md (Outdated)
```diff
 ### Write options
 
-Spark write options are passed when configuring the DataFrameWriter, like this:
+Spark write options are passed when configuring the DataFrameWriterV2, like this:
```
I replaced the example with DataFrameWriterV2 because iceberg/docs/docs/spark-writes.md (line 264 in 319f29e) says: "The v1 DataFrame `write` API is still supported, but is not recommended."
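For reference, a hedged sketch of the v2 writer call style the updated doc points to (the table names and the option key are illustrative, and an active SparkSession `spark` is assumed):

```java
// DataFrameWriterV2: write options are set on the writer returned by writeTo().
Dataset<Row> data = spark.table("db.source_table");
data.writeTo("catalog.db.target_table")
    .option("write-format", "parquet")
    .append();
```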
szehon-ho left a comment:
Hi @pan3793, I think it looks good, but it looks like it may take a few iterations on the doc part. Should we leave that for another PR to make this one smaller? (We can get the functionality in first.)
docs/docs/spark-configuration.md (Outdated)
```diff
 | vectorization-enabled | As per table property | Overrides this table's read.parquet.vectorization.enabled |
 | batch-size | As per table property | Overrides this table's read.parquet.vectorization.batch-size |
 | stream-from-timestamp | (none) | A timestamp in milliseconds to stream from; if before the oldest known ancestor snapshot, the oldest will be used |
+Iceberg 1.8.0 and later support setting read options by Spark session configuration `spark.datasource.iceberg.<key>=<value>`
```
I think this is good, but I was also thinking of adding a section for priority, as mentioned.
This can be in its own section, like "session level configuration"?
docs/docs/spark-configuration.md (Outdated)
```diff
+when using DataFrame to read Iceberg tables, for example: `spark.datasource.iceberg.split-size=512m`, it has lower priority
+than options explicitly passed to DataFrameReader.
 
 | Spark option | Default | Description |
```
I think we can revert the change to this table?
it was auto-formatted by IDEA, reverted
Force-pushed from 362ccac to ea7a515
@szehon-ho WDYT of the current state? I kept the docs change minimal in this patch.
Thanks @pan3793 for minimizing the changes. I think the doc can still be improved; I left comments. But I think it'd be faster, if you want to get the code changes in first, to split the doc work into another PR.
docs/docs/spark-configuration.md (Outdated)
````diff
     .append()
 ```
 
+Iceberg 1.8.0 and later support setting write options by Spark session configuration `spark.datasource.iceberg.<key>=<value>`
````
If we extract this to its own section, there's no need to repeat it?
I wrote it here because this is the "Write options" section. Actually, Spark has several distinct concepts that let a format/extension control behavior, i.e. table properties, session configurations, and options.
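To illustrate the distinction, a hedged sketch of the three knobs (keys, values, and table names are examples, not taken from this patch; an active SparkSession `spark` and a DataFrame `df` are assumed):

```java
// 1. Table property: stored on the table itself and applied to all writers.
spark.sql("ALTER TABLE catalog.db.tbl SET TBLPROPERTIES ('write.format.default'='parquet')");

// 2. Session configuration: applies to every Iceberg read/write in this Spark session.
spark.conf().set("spark.datasource.iceberg.write-format", "parquet");

// 3. Per-operation option: scoped to a single writer and typically the highest precedence.
df.writeTo("catalog.db.tbl").option("write-format", "parquet").append();
```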
docs/docs/spark-configuration.md (Outdated)
````diff
     .table("catalog.db.table")
 ```
 
+Iceberg 1.8.0 and later support setting read options by Spark session configuration `spark.datasource.iceberg.<key>=<value>`
````
I still think we need a new section like "Configuration Priority" where we can explain the order of precedence:

DataFrame writes:
- explicit DataFrameWriter option
- dataframe session default
- if table exists, explicit table option
- if table exists, table default

DataFrame reads:
- explicit DataFrameReader option
- dataframe session default
- if table exists, explicit table option
- if table exists, table default

(please double check)
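A hedged illustration of that ordering on the read side (values and table name are arbitrary; the expectation is that the explicit reader option wins over the session-level default):

```java
// Session-level default split size, picked up through SessionConfigSupport.
spark.conf().set("spark.datasource.iceberg.split-size", "536870912");

// The explicit option on the reader takes precedence over the session default.
Dataset<Row> df = spark.read()
    .format("iceberg")
    .option("split-size", "134217728")
    .load("catalog.db.tbl");
```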
I hesitate to write such a section because the situation is more complex: some configurations can also be set through a dedicated session configuration, for example:

```java
public boolean localityEnabled() {
  boolean defaultValue = Util.mayHaveBlockLocations(table.io(), table.location());
  return confParser
      .booleanConf()
      .option(SparkReadOptions.LOCALITY)          // explicit read option is checked first
      .sessionConf(SparkSQLProperties.LOCALITY)   // then the dedicated session configuration
      .defaultValue(defaultValue)                 // finally the computed default
      .parse();
}
```
This reverts commit ea7a515.
This reverts commit d3b9f9d.
@szehon-ho I made a minor change to the assertion statement after your approval, and also created two backport PRs for Spark 3.3 and 3.5. Thanks for your detailed review.

Sure, thanks, it's a good catch; assertThat is better.

Merged, thanks @pan3793
This PR aims to make IcebergSource extend SessionConfigSupport to improve Spark DataSource v2 API coverage. It allows read/write options to be set through Spark session configuration when using the DataFrame API to read or write tables. For example:
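(A hedged illustration of the behavior this enables; table names and values are illustrative, not taken from the PR itself.)

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IcebergSessionConfigExample {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder().getOrCreate();

    // Read options: a session-level default is picked up by every Iceberg scan
    // in this session unless an explicit DataFrameReader option overrides it.
    spark.conf().set("spark.datasource.iceberg.split-size", "134217728");
    Dataset<Row> df = spark.read().format("iceberg").load("db.source_table");

    // Write options: for example, attach a custom snapshot property to the commit.
    spark.conf().set("spark.datasource.iceberg.snapshot-property.trigger", "nightly-job");
    df.writeTo("db.target_table").append();
  }
}
```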