[SPARK-46670][PYTHON][SQL] Make DataSourceManager self clone-able by separating static and runtime Python Data Sources #44681
Conversation
cc @cloud-fan and @allisonwang-db
…rces/DataSourceManager.scala
allisonwang-db left a comment
Agree we should separate the runtime registration from static registration.
builders.putAll(DataSourceManager.initialDataSourceBuilders.asJava)
builders

private lazy val staticDataSourceBuilders = initDataSourceBuilders.getOrElse {
  initialDataSourceBuilders
}
I think the `DataSourceManager` is session-level and should not be the one to initialize static data sources. When we initialize the `DataSourceManager` for each Spark session, we can pass in the static ones.
So it might make more sense to have an API in `SparkContext` for static data sources?
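A rough sketch of that shape, assuming hypothetical names (`StaticDataSources`, `SessionDataSourceManager`, and the `DataSourceBuilder` stand-in are illustrative, not Spark's actual API):

```scala
import java.util.concurrent.ConcurrentHashMap

trait DataSourceBuilder // stand-in for the real builder type

// Hypothetical application-level registry for static Python Data Sources;
// not an existing Spark API.
object StaticDataSources {
  @volatile private var builders: Map[String, DataSourceBuilder] = Map.empty
  def registerAll(all: Map[String, DataSourceBuilder]): Unit = { builders = all }
  def all: Map[String, DataSourceBuilder] = builders
}

// Each session-level manager is constructed with the static builders and
// only ever mutates its own runtime map.
final class SessionDataSourceManager(static: Map[String, DataSourceBuilder]) {
  private val runtime = new ConcurrentHashMap[String, DataSourceBuilder]()

  def register(name: String, builder: DataSourceBuilder): Unit =
    runtime.put(name, builder)

  // Runtime registrations shadow static ones on lookup.
  def lookup(name: String): Option[DataSourceBuilder] =
    Option(runtime.get(name)).orElse(static.get(name))
}
```

A new session would then just do `new SessionDataSourceManager(StaticDataSources.all)` without any lookup of its own.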
Yeah, I agree, but the problem is that `UserDefinedPythonDataSourceLookupRunner.runInPython` requires `SQLConf.get`, which requires `SparkSession` initialization.
So this initialization of static Data Sources must happen at least when a session is created. For now, I put the static initialization logic into the first call of `DataSourceManager` in any session.
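For reference, a minimal sketch of that "first call in any session" approach (the object and stub names are stand-ins, not the actual code):

```scala
// Sketch of the "first call in any session" approach; names are stand-ins.
object StaticInitSketch {
  // Computed lazily, so the Python lookup runs on the first use of any
  // session's manager, by which point SparkSession (and SQLConf.get) exists.
  lazy val initialDataSourceBuilders: Map[String, AnyRef] =
    lookupStaticPythonDataSources()

  // Stub standing in for the lookup via
  // UserDefinedPythonDataSourceLookupRunner.runInPython.
  private def lookupStaticPythonDataSources(): Map[String, AnyRef] = Map.empty
}
```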
Ah, because of these two configs:
val simplifiedTraceback: Boolean = SQLConf.get.pysparkSimplifiedTraceback
val workerMemoryMb = SQLConf.get.pythonPlannerExecMemory
I think instead of accessing the `SQLConf` here, we should pass these values as parameters to `runInPython` to avoid this initialization issue. Maybe we can add a TODO for a follow-up PR?
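Something like this, as a sketch (the result type and surrounding object are stand-ins; only the two parameters mirror the configs above):

```scala
// Sketch: runInPython receives the two conf values as parameters instead of
// reading SQLConf.get itself. LookupResult and the object are stand-ins.
final case class LookupResult(pickledDataSource: Array[Byte])

object LookupRunnerSketch {
  def runInPython(
      command: Array[Byte],
      simplifiedTraceback: Boolean, // caller passes SQLConf.get.pysparkSimplifiedTraceback
      workerMemoryMb: Long          // caller passes SQLConf.get.pythonPlannerExecMemory
  ): LookupResult = {
    // ... would launch the Python planner worker with these settings; no
    // SQLConf access here, so no SparkSession initialization is forced.
    LookupResult(command)
  }
}
```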
Merged to master.
…Sources around when cloning DataSourceManager

### What changes were proposed in this pull request?
This PR is a followup of #44681 that proposes to remove the logic of passing static Python Data Sources around when cloning `DataSourceManager`. They are static Data Sources, so we don't actually have to pass them around.

### Why are the changes needed?
For better readability.

### Does this PR introduce _any_ user-facing change?
No, dev-only.

### How was this patch tested?
Existing test cases should cover this.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44743 from HyukjinKwon/SPARK-46670-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
What changes were proposed in this pull request?
This PR proposes to make `DataSourceManager` isolated and self clone-able, without triggering an actual lookup, by separating static and runtime Python Data Sources.
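Conceptually, under assumed names (this is a sketch, not the exact code in this PR), cloning copies only the runtime registrations while the static builders are shared:

```scala
import java.util.concurrent.ConcurrentHashMap

// Sketch with assumed names: clone copies only the runtime registrations,
// so cloning a session's manager never triggers a Python lookup.
final class CloneableManagerSketch(staticBuilders: () => Map[String, AnyRef]) {
  private val runtimeBuilders = new ConcurrentHashMap[String, AnyRef]()

  def register(name: String, builder: AnyRef): Unit =
    runtimeBuilders.put(name, builder)

  override def clone(): CloneableManagerSketch = {
    val copied = new CloneableManagerSketch(staticBuilders)
    copied.runtimeBuilders.putAll(runtimeBuilders) // static side is untouched
    copied
  }
}
```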
Why are the changes needed?
For better maintenance of the code. Currently, we trigger Python execution that actually initializes `SparkSession` via `SQLConf`, which has too many side effects. Also, we should separate static and runtime Python Data Sources in any event.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
A unit test was added.
Was this patch authored or co-authored using generative AI tooling?
No.