[SPARK-50286][SQL] Correctly propagate SQL options to WriteBuilder #48822
Conversation
can we add an assert that only one of them can be non-empty?
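The suggested invariant could look roughly like this in plain Scala (the map names here are assumed for illustration, not the ones in the patch):

```scala
// Sketch (assumed names): the review suggestion is to assert that the
// SQL-provided relation options and the DataFrame write options are never
// both non-empty at the same time.
def assertAtMostOneNonEmpty(
    relationOptions: Map[String, String],
    writeOptions: Map[String, String]): Unit = {
  assert(relationOptions.isEmpty || writeOptions.isEmpty,
    "only one of relation options and write options should be non-empty")
}
```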
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2OptionSuite.scala
cloud-fan
left a comment
This is a good catch!
dongjoon-hyun
left a comment
+1, LGTM. Thank you for the fix, @pan3793 .
szehon-ho
left a comment
Thanks, did not realize this. It looks from @cloud-fan's comment that only one can be set?
This is my fault, but we could optionally change the first JIRA IDs in these tests to SPARK-49098, as it's the one that added the support for the inserts?
@szehon-ho yes, as mentioned in the description, DataFrame's API …

Wait, I forgot the … In fact, I submitted a PR to Iceberg to support this feature, but unfortunately that patch doesn't seem to be getting attention. @szehon-ho, do you think we can re-open this PR and get it in? If so, the assumption would not hold.
And we should define the priority; I think it should be: …

Currently, if there are duplicated options, 2 overrides 3; see spark/sql/core/src/main/scala/org/apache/spark/sql/internal/DataFrameWriterImpl.scala lines 141 to 142 at commit c1968a1.
@cloud-fan, do you think the proposed priority makes sense? Or any new ideas?
Yea, 3 should have lower priority.
I think the assert @cloud-fan suggests makes sense if there is no known use case of 1+2 together (we can see if anything fails). Yea sure, I re-opened apache/iceberg#7732 and will take a look. Yea, it makes sense for these confs to have lower priority vs the SQL/DF ones, as there's only one per entire source.
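The precedence just agreed on can be sketched in plain Scala (illustrative names, not Spark internals): with `Map`'s `++`, the right-hand operand wins on duplicate keys, so listing the session-conf-derived options first gives them the lowest priority.

```scala
// Sketch (illustrative names): session-conf options lose to options set
// explicitly via SQL or the DataFrame API when keys collide, because the
// right-hand side of ++ overrides the left.
def effectiveOptions(
    sessionConfOptions: Map[String, String], // per-source session confs, lowest priority
    explicitOptions: Map[String, String]     // options set via SQL or the DataFrame API
): Map[String, String] = {
  sessionConfOptions ++ explicitOptions
}
```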
I spent a few hours modifying the test cases, but they do not work as expected; it seems …

Making …
After touching the code, I realized the … @szehon-ho @cloud-fan please take a look when you have time, thanks in advance. (The CI failures look irrelevant.)
We will take care of `SessionConfigSupport` in follow-up PRs?
It already takes care of `SessionConfigSupport` for the DataFrame API (both v1 and v2), but not SQL. I will investigate the SQL cases later, but that may need some refactoring.
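For context, `SessionConfigSupport` lets a source declare a key prefix so that session confs of the form `spark.datasource.<prefix>.<key>` are turned into per-source options. A plain-Scala approximation of that extraction (assumed conf layout, not Spark's actual implementation):

```scala
// Sketch (assumed conf layout): pick out session confs under the source's
// prefix and strip the prefix so they become plain option keys.
def extractSessionConfigs(
    keyPrefix: String,
    sessionConf: Map[String, String]): Map[String, String] = {
  val prefix = s"spark.datasource.$keyPrefix."
  sessionConf.collect {
    case (k, v) if k.startsWith(prefix) => k.stripPrefix(prefix) -> v
  }
}
```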
Nice, I modified it to capture the QueryExecution and reused it.
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2OptionSuite.scala
shall we also test the v1 DataFrameWriter API?
Sure, added two cases:
- SPARK-50286: Propagate options for DataFrameWriter Append
- SPARK-50286: Propagate options for DataFrameWriter Overwrite
thanks, merging to master!
What changes were proposed in this pull request?
SPARK-49098 introduced a SQL syntax to allow users to set table options in DSv2 write cases, but unfortunately, the options set by SQL are not propagated correctly to the underlying DSv2 `WriteBuilder`.

From the user's perspective, the two write forms are equivalent, but the internal implementations differ slightly. Both of them construct the same logical write command over a `DataSourceV2Relation`, but the SQL `options` are carried by `r.options`, while the `DataFrame` API `options` are carried by `writeOptions`. Currently, only the latter is propagated to the `WriteBuilder`, and the former is silently dropped. This PR fixes the above issue by merging those two `options`.

Currently, the `options` propagation is inconsistent across `DataFrame`, `DataFrameV2`, and SQL: in some paths the `options` are carried by both `writeOptions` and the `DataSourceV2Relation` write options, while in others the `options` are only carried by the `DataSourceV2Relation`.
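The concrete SQL and DataFrame statements being compared were elided from this page; as a hedged sketch using the `WITH (...)` options clause added by SPARK-49098 (the table, source, and option names here are illustrative):

```sql
-- SQL path: options in the WITH clause land in DataSourceV2Relation.options
INSERT INTO testcat.t WITH (`write.split-size` = 10) SELECT * FROM src
```

The DataFrame counterpart would be along the lines of `df.writeTo("testcat.t").option("write.split-size", "10").append()`, whose options travel via `writeOptions` instead.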
`SessionConfigSupport` only takes effect for the `DataFrame` and `DataFrameV2` APIs; it is not considered at all in the SQL read/write path in the current codebase.

Why are the changes needed?
Correctly propagate SQL options to `WriteBuilder`, to complete the feature added in SPARK-49098, so that DSv2 implementations like Iceberg can benefit.

Does this PR introduce any user-facing change?
No, it's an unreleased feature.
How was this patch tested?
UTs added by SPARK-36680 and SPARK-49098 are also updated to check that SQL `options` are correctly propagated to the physical plan.

Was this patch authored or co-authored using generative AI tooling?
No.