[SPARK-46670][PYTHON][SQL] Make DataSourceManager self clone-able by separating static and runtime Python Data Sources #44681
Conversation
cc @cloud-fan and @allisonwang-db
…rces/DataSourceManager.scala
allisonwang-db left a comment
Agree we should separate the runtime registration from static registration.
builders.putAll(DataSourceManager.initialDataSourceBuilders.asJava)
builders

private lazy val staticDataSourceBuilders = initDataSourceBuilders.getOrElse {
  initialDataSourceBuilders
}
I think the `DataSourceManager` is session-level and should not be the one to initialize static data sources. When we initialize the `DataSourceManager` for each Spark session, we can pass in the static ones.
So it might make more sense to have an API in `SparkContext` for static data sources?
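A rough sketch of that shape, assuming hypothetical names (`StaticDataSources`, `SessionDataSourceManager`, and the `DataSourceBuilder` stand-in are illustrative, not Spark's actual API):

```scala
import java.util.concurrent.ConcurrentHashMap

trait DataSourceBuilder // stand-in for the real builder type

// Hypothetical application-level registry for static Python Data Sources;
// not an existing Spark API.
object StaticDataSources {
  @volatile private var builders: Map[String, DataSourceBuilder] = Map.empty
  def registerAll(all: Map[String, DataSourceBuilder]): Unit = { builders = all }
  def all: Map[String, DataSourceBuilder] = builders
}

// Each session-level manager is constructed with the static builders and
// only ever mutates its own runtime map.
final class SessionDataSourceManager(static: Map[String, DataSourceBuilder]) {
  private val runtime = new ConcurrentHashMap[String, DataSourceBuilder]()

  def register(name: String, builder: DataSourceBuilder): Unit =
    runtime.put(name, builder)

  // Runtime registrations shadow static ones on lookup.
  def lookup(name: String): Option[DataSourceBuilder] =
    Option(runtime.get(name)).orElse(static.get(name))
}
```

A new session would then just do `new SessionDataSourceManager(StaticDataSources.all)` without any lookup of its own.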
Yeah, I agree, but the problem is that `UserDefinedPythonDataSourceLookupRunner.runInPython` requires `SQLConf.get`, which requires `SparkSession` initialization.
So this initialization of static Data Sources must happen at least when a session is created. For now, I put the static initialization logic into the first call of `DataSourceManager` in any session.
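For reference, a minimal sketch of that "first call in any session" approach (the object and stub names are stand-ins, not the actual code):

```scala
// Sketch of the "first call in any session" approach; names are stand-ins.
object StaticInitSketch {
  // Computed lazily, so the Python lookup runs on the first use of any
  // session's manager, by which point SparkSession (and SQLConf.get) exists.
  lazy val initialDataSourceBuilders: Map[String, AnyRef] =
    lookupStaticPythonDataSources()

  // Stub standing in for the lookup via
  // UserDefinedPythonDataSourceLookupRunner.runInPython.
  private def lookupStaticPythonDataSources(): Map[String, AnyRef] = Map.empty
}
```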
Ah, because of these two configs:
val simplifiedTraceback: Boolean = SQLConf.get.pysparkSimplifiedTraceback
val workerMemoryMb = SQLConf.get.pythonPlannerExecMemory
I think instead of accessing the `SQLConf` here, we should pass these values as parameters to `runInPython` to avoid this initialization issue. Maybe we can add a TODO for a follow-up PR?
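Something like this, as a sketch (the result type and surrounding object are stand-ins; only the two parameters mirror the configs above):

```scala
// Sketch: runInPython receives the two conf values as parameters instead of
// reading SQLConf.get itself. LookupResult and the object are stand-ins.
final case class LookupResult(pickledDataSource: Array[Byte])

object LookupRunnerSketch {
  def runInPython(
      command: Array[Byte],
      simplifiedTraceback: Boolean, // caller passes SQLConf.get.pysparkSimplifiedTraceback
      workerMemoryMb: Long          // caller passes SQLConf.get.pythonPlannerExecMemory
  ): LookupResult = {
    // ... would launch the Python planner worker with these settings; no
    // SQLConf access here, so no SparkSession initialization is forced.
    LookupResult(command)
  }
}
```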
Merged to master.
…Sources around when cloning DataSourceManager

### What changes were proposed in this pull request?
This PR is a followup of #44681 that proposes to remove the logic of passing static Python Data Sources around when cloning `DataSourceManager`. They are static Data Sources, so we don't actually have to pass them around.

### Why are the changes needed?
For better readability.

### Does this PR introduce _any_ user-facing change?
No, dev-only.

### How was this patch tested?
Existing test cases should cover this.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44743 from HyukjinKwon/SPARK-46670-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
What changes were proposed in this pull request?
This PR proposes to make `DataSourceManager` isolated and self clone-able, without triggering an actual lookup, by separating static and runtime Python Data Sources.
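Conceptually, under assumed names (this is a sketch, not the exact code in this PR), cloning copies only the runtime registrations while the static builders are shared:

```scala
import java.util.concurrent.ConcurrentHashMap

// Sketch with assumed names: clone copies only the runtime registrations,
// so cloning a session's manager never triggers a Python lookup.
final class CloneableManagerSketch(staticBuilders: () => Map[String, AnyRef]) {
  private val runtimeBuilders = new ConcurrentHashMap[String, AnyRef]()

  def register(name: String, builder: AnyRef): Unit =
    runtimeBuilders.put(name, builder)

  override def clone(): CloneableManagerSketch = {
    val copied = new CloneableManagerSketch(staticBuilders)
    copied.runtimeBuilders.putAll(runtimeBuilders) // static side is untouched
    copied
  }
}
```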
Why are the changes needed?
For better maintenance of the code. Currently, we trigger Python execution that actually initializes `SparkSession` via `SQLConf`, which has too many side effects. Also, we should separate static and runtime Python Data Sources in any event.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
A unit test was added.
Was this patch authored or co-authored using generative AI tooling?
No.