[SPARK-50286][SQL] Correctly propagate SQL options to WriteBuilder #48822

pan3793 · 2024-11-12T06:22:33Z

What changes were proposed in this pull request?

SPARK-49098 introduced a SQL syntax that lets users set table options for DSv2 writes, but the options set via SQL are not propagated correctly to the underlying DSv2 WriteBuilder:

INSERT INTO $t1 WITH (`write.split-size` = 10) SELECT ...

df.writeTo(t1).option("write.split-size", "10").append()

From the user's perspective, the two statements above are equivalent, but their internal implementations differ slightly. Both construct an

AppendData(r: DataSourceV2Relation, ..., writeOptions, ...)

but the SQL options are carried by r.options, while the DataFrame API options are carried by writeOptions. Currently, only the latter are propagated to the WriteBuilder, and the former are silently dropped. This PR fixes the issue by merging the two sets of options.
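
A minimal sketch of the difference, using plain Scala maps rather than Spark's classes. `Relation`, `AppendData`, `builderOptionsBefore`, and `builderOptionsAfter` are simplified, hypothetical stand-ins for illustration, not the actual code in V2Writes.scala:

```scala
object WriteOptionsSketch {
  // Hypothetical, simplified stand-ins for DataSourceV2Relation and AppendData.
  final case class Relation(options: Map[String, String])
  final case class AppendData(relation: Relation, writeOptions: Map[String, String])

  // Behavior described as the bug: only writeOptions reach the WriteBuilder,
  // so options carried by the relation (the SQL path) are dropped.
  def builderOptionsBefore(plan: AppendData): Map[String, String] =
    plan.writeOptions

  // Behavior described as the fix: both sources of options are merged.
  def builderOptionsAfter(plan: AppendData): Map[String, String] =
    plan.relation.options ++ plan.writeOptions
}
```

With a plan built from the SQL statement above, `builderOptionsBefore` returns an empty map, while `builderOptionsAfter` contains `write.split-size`.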

Currently, options propagation is inconsistent across the DataFrame, DataFrameV2, and SQL paths:

DataFrame API: the same options are carried by both writeOptions and DataSourceV2Relation
DataFrameV2 API: options are carried only by writeOptions
SQL: options are carried only by DataSourceV2Relation

Note that SessionConfigSupport only takes effect for the DataFrame and DataFrameV2 APIs; the current codebase does not consider it at all in the SQL read/write path.

Why are the changes needed?

Correctly propagate SQL options to WriteBuilder, to complete the feature added in SPARK-49098, so that DSv2 implementations like Iceberg can benefit.

Does this PR introduce any user-facing change?

No, it's an unreleased feature.

How was this patch tested?

The UTs added by SPARK-36680 and SPARK-49098 are updated to also check that SQL options are correctly propagated to the physical plan.

Was this patch authored or co-authored using generative AI tooling?

No.

pan3793 · 2024-11-13T07:31:14Z

cc @szehon-ho @cloud-fan @dongjoon-hyun

cloud-fan · 2024-11-13T13:25:47Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2Writes.scala

can we add an assert that only one of them can be non empty?

sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2OptionSuite.scala

cloud-fan

This is a good catch!

dongjoon-hyun

+1, LGTM. Thank you for the fix, @pan3793 .

szehon-ho

Thanks, I did not realize this. From @cloud-fan's comment, it looks like only one of them can be set?

szehon-ho · 2024-11-13T17:17:18Z

sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2OptionSuite.scala

This is my fault, but can we optionally change the first JIRA IDs in these tests to SPARK-49098, as it's the one that added option support to the inserts?

pan3793 · 2024-11-14T12:19:44Z

... that only one can be set

@szehon-ho yes, as mentioned in the description, DataFrame API options go to writeOptions while SQL options go to r.options. I think they won't be set together in normal cases, but it would be great if someone could double-check that.

pan3793 · 2024-11-14T13:49:16Z

Wait, I forgot about SessionConfigSupport.

In fact, I submitted a PR to Iceberg to support this feature, but unfortunately that patch doesn't seem to be getting attention. @szehon-ho, do you think we can re-open that PR and get it in? If so, the assumption would not hold.

... that only one can be set

and we should define the priority, I think it should be

1. options from SQL
2. options from DataFrame API
3. options from session configuration

Currently, if there are duplicated options, 2 overrides 3, see

spark/sql/core/src/main/scala/org/apache/spark/sql/internal/DataFrameWriterImpl.scala

Lines 141 to 142 in c1968a1

    val finalOptions = sessionOptions.filter { case (k, _) => !optionsWithPath.contains(k) } ++
      optionsWithPath.originalMap

@cloud-fan, do you think the proposed priority makes sense? or any new ideas?
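
The proposed priority could be sketched with plain Scala maps as follows. This is only an illustration of the ordering under discussion, not code from this PR; `OptionPriority` and `resolve` are hypothetical names:

```scala
object OptionPriority {
  // With `++`, entries from the right-hand map win on duplicate keys,
  // so the lowest-priority source goes first and the highest-priority last:
  // session configuration < DataFrame API < SQL.
  def resolve(
      sessionOptions: Map[String, String],
      dataFrameOptions: Map[String, String],
      sqlOptions: Map[String, String]): Map[String, String] =
    sessionOptions ++ dataFrameOptions ++ sqlOptions
}
```

Keys that appear in only one source pass through unchanged; a key set in several sources takes the value from the highest-priority source.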

cloud-fan · 2024-11-14T13:52:53Z

yea 3 should have lower priority.

szehon-ho · 2024-11-17T18:44:49Z

I think an assert as @cloud-fan suggests makes sense if there is no known use case of 1+2 together? (We can see if anything fails.)

Yea sure, I re-opened apache/iceberg#7732 and will take a look. It makes sense for these confs to have lower priority than the SQL/DF ones, as there's one per entire source.

pan3793 · 2024-11-18T08:56:38Z

I spent a few hours modifying the test cases, but they do not work as expected. It seems SessionConfigSupport is not considered at all in the SQL read/write path in the current codebase, though it should be according to SessionConfigSupport's docs:

Data sources can implement this interface to propagate session configs with the specified key-prefix to all data source operations in this session.

Making SessionConfigSupport work on the SQL read/write path may be out of scope for this PR. Let me go back to the previous direction and focus on fixing the SQL options propagation issue.

I think assert as @cloud-fan suggest makes sense

pan3793 · 2024-11-18T17:42:16Z

After touching the code, I realized that options propagation is inconsistent across DataFrame, DataFrameV2, and SQL, and that SessionConfigSupport also does not take effect in SQL cases (I think this is a bigger issue that should be fixed independently). I updated the code with assertions and added some test cases for the DataFrame API.

@szehon-ho @cloud-fan please take a look when you have time, thanks in advance.

(the CI failures look irrelevant)
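
For illustration, the invariant behind such an assertion could be sketched as below, assuming the three cases from the PR description (same options on both sides, only write options, or only relation options). `OptionInvariant` and `mergeChecked` are hypothetical names over plain Scala maps, not the code added to V2Writes.scala:

```scala
object OptionInvariant {
  // commandOptions: options carried by the write command (writeOptions).
  // relationOptions: options carried by the relation (r.options).
  def mergeChecked(
      commandOptions: Map[String, String],
      relationOptions: Map[String, String]): Map[String, String] = {
    // DataFrame API: both sides carry the same options.
    // DataFrameV2 API: only commandOptions is non-empty.
    // SQL: only relationOptions is non-empty.
    require(
      commandOptions.isEmpty || relationOptions.isEmpty || commandOptions == relationOptions,
      "write options and relation options are both set but differ")
    commandOptions ++ relationOptions
  }
}
```

Under this assumption, conflicting non-empty maps fail fast instead of one side silently overriding the other.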

cloud-fan · 2024-11-20T08:34:50Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2Writes.scala

we will take care of SessionConfigSupport in followup PRs?

It already takes care of SessionConfigSupport for the DataFrame API (both v1 and v2), but not for SQL. I will investigate the SQL cases later, but that may need some refactoring.

cloud-fan · 2024-11-20T08:42:45Z

sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2OptionSuite.scala

we can use this new util function: https://github.com/apache/spark/pull/48867/files#diff-33334296c47a10ab23e8f3e62fc095aa04eb24227e06da99abb39769974bd41dR452

nice, I modified it to capture QueryExecution and reused it

cloud-fan · 2024-11-20T08:44:40Z

sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2OptionSuite.scala

shall we also test the v1 DataFrameWriter API?

sure, added two cases

SPARK-50286: Propagate options for DataFrameWriter Append

SPARK-50286: Propagate options for DataFrameWriter Overwrite

cloud-fan · 2024-11-25T13:51:31Z

thanks, merging to master!

github-actions bot added the SQL label Nov 12, 2024

pan3793 marked this pull request as ready for review November 13, 2024 07:30

cloud-fan reviewed Nov 13, 2024

cloud-fan approved these changes Nov 13, 2024

dongjoon-hyun approved these changes Nov 13, 2024

szehon-ho reviewed Nov 13, 2024

pan3793 marked this pull request as draft November 15, 2024 12:48

pan3793 force-pushed the SPARK-50286 branch from c83c9fd to 078061e Compare November 18, 2024 14:57

pan3793 marked this pull request as ready for review November 18, 2024 17:32

cloud-fan reviewed Nov 20, 2024

cloud-fan reviewed Nov 20, 2024

pan3793 added 6 commits November 20, 2024 16:59

[SPARK-50286][SQL] Correctly propogate SQL options to WriteBuilder

f020cd5

DataSourceV2OptionSuite

32f5027

nit

9c3b103

assertion and more tests

a0d881e

nit

2705a04

address comments

ccf4cc9

pan3793 force-pushed the SPARK-50286 branch from ff18547 to ccf4cc9 Compare November 21, 2024 09:06

pan3793 added 2 commits November 21, 2024 17:28

insert into

8b2dd6d

nit

fdb4390

pan3793 requested a review from cloud-fan November 25, 2024 05:45

cloud-fan approved these changes Nov 25, 2024

cloud-fan closed this in 976f887 Nov 25, 2024
