Skip to content

Conversation

@pan3793
Copy link
Member

@pan3793 pan3793 commented Nov 12, 2024

What changes were proposed in this pull request?

SPARK-49098 introduced a SQL syntax to allow users to set table options on DSv2 write cases, but unfortunately, the options set by SQL are not propagated correctly to the underlying DSv2 WriteBuilder

INSERT INTO $t1 WITH (`write.split-size` = 10) SELECT ...
df.writeTo(t1).option("write.split-size", "10").append()

From the user's perspective, the above two are equivalent, but internal implementations differ slightly. Both of them are going to construct an

AppendData(r: DataSourceV2Relation, ..., writeOptions, ...)

but the SQL options are carried by r.options, and the DataFrame API options are carried by writeOptions. Currently, only the latter is propagated to the WriteBuilder, and the former is silently dropped. This PR fixes the above issue by merging those two options.

Currently, the options propagation is inconsistent in DataFrame, DataFrameV2, and SQL:

  • DataFrame API, the same options are carried by both writeOptions and DataSourceV2Relation
  • DataFrameV2 API cases, options are only carried by write options
  • SQL, options are only carried by DataSourceV2Relation

BTW, SessionConfigSupport only takes effect on DataFrame and DataFrameV2 API, it is not considered in the SQL read/write path entirely in the current codebase.

Why are the changes needed?

Correctly propagate SQL options to WriteBuilder, to complete the feature added in SPARK-49098, so that DSv2 implementations like Iceberg can benefit.

Does this PR introduce any user-facing change?

No, it's an unreleased feature.

How was this patch tested?

UTs added by SPARK-36680 and SPARK-49098 are updated also to check SQL options are correctly propagated to the physical plan

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Nov 12, 2024
@pan3793 pan3793 marked this pull request as ready for review November 13, 2024 07:30
@pan3793
Copy link
Member Author

pan3793 commented Nov 13, 2024

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add an assert that only one of them can be non empty?

Copy link
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good catch!

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you for the fix, @pan3793 .

Copy link
Member

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, did not realize this. Looks from @cloud-fan comment that only one can be set?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my fault, but we can optionally change the first JIRA's in these tests to SPARK-49098 as its the one that added the support to the inserts?

@pan3793
Copy link
Member Author

pan3793 commented Nov 14, 2024

... that only one can be set

@szehon-ho yes, as mentioned in the description, DataFrame's API options go writeOptions while SQL options go r.options, I think they won't be set together in normal cases, but it would be great if someone could double check that.

@pan3793
Copy link
Member Author

pan3793 commented Nov 14, 2024

Wait, I forget the SessionConfigSupport

In fact, I submitted a PR to Iceberg to support this feature, but unfortunately, this patch doesn't seem to be getting attention, @szehon-ho do you think we can re-open this PR and get it in? If so, the assumption would not hold.

... that only one can be set

and we should define the priority, I think it should be

  1. options from SQL
  2. options from DataFrame API
  3. options from session configuration

Currently, if there are duplicated options, 2 overrides 3, see

val finalOptions = sessionOptions.filter { case (k, _) => !optionsWithPath.contains(k) } ++
optionsWithPath.originalMap

@cloud-fan, do you think the proposed priority makes sense? or any new ideas?

@cloud-fan
Copy link
Contributor

yea 3 should have lower priority.

@pan3793 pan3793 marked this pull request as draft November 15, 2024 12:48
@szehon-ho
Copy link
Member

I think assert as @cloud-fan suggest makes sense if no known usecase of 1+2 together? (can see if anything fails).

Yea sure I re-opened the apache/iceberg#7732 and will take a look. Yea makes sense for these confs to have lower priority vs the SQL/DF ones, as there's one per entire source.

@pan3793
Copy link
Member Author

pan3793 commented Nov 18, 2024

I spent a few hours to modify the test cases but they do not work as expected, it seems SessionConfigSupport was not considered in the SQL read/write path entirely in the current codebase, though it should be, according to SessionConfigSupport's docs

Data sources can implement this interface to propagate session configs with the specified key-prefix to all data source operations in this session.

Making SessionConfigSupport work on the SQL read/write path may be out of the scope of this PR, let me try the previous direction and focus on fixing the SQL options propagation issue.

I think assert as @cloud-fan suggest makes sense

@pan3793 pan3793 marked this pull request as ready for review November 18, 2024 17:32
@pan3793
Copy link
Member Author

pan3793 commented Nov 18, 2024

After touching the code, I realized the options propagation is inconsistent in DataFrame, DataFrameV2, and SQL, and SessionConfigSupport also does not take effect on SQL cases (I think this is a bigger issue that should be fixed independently), I updated the code with assertions and add some test cases for DataFrame API.

@szehon-ho @cloud-fan please take a look when you have time, thanks in advance.

(the CI failures look irrelevant)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we will take care of SessionConfigSupport in followup PRs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it already takes care of SessionConfigSupport for DataFrame API (both v1 and v2), but not SQL. I will investigate the SQL cases later, but that may need some refactorings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, I modified it to capture QueryExecution and reused

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we also test the v1 DataFrameWriter API?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, add two cases

  • SPARK-50286: Propagate options for DataFrameWriter Append
  • SPARK-50286: Propagate options for DataFrameWriter Overwrite

@pan3793 pan3793 requested a review from cloud-fan November 25, 2024 05:45
@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 976f887 Nov 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants