Conversation

@HyukjinKwon HyukjinKwon commented Jan 11, 2024

What changes were proposed in this pull request?

This PR proposes to make DataSourceManager isolated and self clone-able, without an actual lookup, by separating static and runtime Python Data Sources.

Why are the changes needed?

For better maintainability of the code. Currently, we trigger Python execution that actually initializes a SparkSession via SQLConf, which has too many side effects.

Also, we should separate static and runtime Python Data Sources in any event.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

A unit test was added.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Jan 11, 2024
@HyukjinKwon
Member Author

cc @cloud-fan and @allisonwang-db

@HyukjinKwon HyukjinKwon changed the title [SPARK-46670][PYTHON][SQL] Make DataSourceManager isolated and self clone-able [SPARK-46670][PYTHON][SQL] Make DataSourceManager self clone-able by separating static and runtime Python Data Sources Jan 11, 2024
@allisonwang-db allisonwang-db (Contributor) left a comment

Agree we should separate the runtime registration from static registration.

Quoted diff context:

```scala
builders.putAll(DataSourceManager.initialDataSourceBuilders.asJava)
builders
private lazy val staticDataSourceBuilders = initDataSourceBuilders.getOrElse {
  initialDataSourceBuilders
```
Contributor

I think the DataSourceManager is session-level and should not be the one to initialize static data sources. When we initialize the DataSourceManager for each spark session, we can pass in the static ones.

So it might make more sense to have an API in SparkContext for static data sources?

Member Author

Yeah, I agree, but the problem is that UserDefinedPythonDataSourceLookupRunner.runInPython requires SQLConf.get, which requires SparkSession initialization.

So this initialization of static data sources can only happen once a session has been created. For now, I have put the static initialization logic into the first call to DataSourceManager in any session.
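The workaround described above can be modeled outside Spark. Below is a minimal Python sketch (not Spark's actual Scala implementation; all names are illustrative) of a session-level manager whose static sources are looked up lazily on first use, once a session can exist, and shared across clones:

```python
import threading

class DataSourceManager:
    """Illustrative session-level manager: static builders are
    resolved lazily (the first call triggers the lookup) and shared
    process-wide; runtime builders are per-instance."""
    _static_lock = threading.Lock()
    _static_builders = None  # populated on first access only

    def __init__(self):
        self._runtime_builders = {}  # per-session registrations

    @classmethod
    def _static(cls):
        # Deferred until first use, i.e. after a session is created,
        # so session-dependent lookup is safe to run here.
        with cls._static_lock:
            if cls._static_builders is None:
                cls._static_builders = cls._lookup_static_sources()
        return cls._static_builders

    @staticmethod
    def _lookup_static_sources():
        # Stand-in for the expensive Python lookup the PR defers.
        return {"csv_ext": lambda: "static builder"}

    def register(self, name, builder):
        self._runtime_builders[name] = builder

    def lookup(self, name):
        # Runtime registrations shadow static ones.
        return self._runtime_builders.get(name) or self._static().get(name)

    def clone(self):
        # Self clone-able without re-running the static lookup.
        new = DataSourceManager()
        new._runtime_builders = dict(self._runtime_builders)
        return new
```

Note that cloning never triggers the Python lookup, which is the point of separating static from runtime registration.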

Contributor

Ah because of these two configs:

```scala
val simplifiedTraceback: Boolean = SQLConf.get.pysparkSimplifiedTraceback
val workerMemoryMb = SQLConf.get.pythonPlannerExecMemory
```

I think instead of accessing the SQLConf here, we should pass these values as parameters to runInPython to avoid this initialization issue. Maybe we can add a TODO for a follow-up PR?
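The suggested refactor, passing the two config values as explicit parameters, can be sketched as follows (a Python model, not Spark's Scala code; the names are hypothetical):

```python
def run_in_python(func, *, simplified_traceback, worker_memory_mb):
    """Hypothetical stand-in for runInPython: all session config it
    needs arrives as explicit parameters, so nothing in here reaches
    for a global SQLConf (and nothing initializes a session as a
    side effect)."""
    settings = {
        "simplified_traceback": simplified_traceback,
        "worker_memory_mb": worker_memory_mb,
    }
    return func(settings)

# The caller snapshots the config once, where a session already exists:
def lookup_data_sources(conf):
    return run_in_python(
        lambda s: ("ok", s),
        simplified_traceback=conf["pyspark.simplifiedTraceback"],
        worker_memory_mb=conf["python.plannerExecMemory"],
    )
```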

@HyukjinKwon
Member Author

Merged to master.

@HyukjinKwon HyukjinKwon deleted the SPARK-46670 branch January 15, 2024 00:46
HyukjinKwon added a commit that referenced this pull request Jan 16, 2024
…Sources around when cloning DataSourceManager

### What changes were proposed in this pull request?

This PR is a follow-up of #44681 that proposes to remove the logic of passing static Python Data Sources around when cloning `DataSourceManager`. They are static Data Sources, so we don't actually have to pass them around.
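The simplification in this follow-up can be sketched like so (illustrative Python, not the actual Scala change): once static builders live in one shared place, cloning no longer needs to carry them.

```python
# Shared, process-wide static registrations (illustrative stand-in).
SHARED_STATIC_BUILDERS = {"csv_ext": "static builder"}

class Manager:
    """After the follow-up: clone copies only the per-session runtime
    registrations; static ones are reachable without being passed."""
    def __init__(self):
        self.runtime_builders = {}

    def clone(self):
        new = Manager()
        new.runtime_builders = dict(self.runtime_builders)
        return new

    def lookup(self, name):
        if name in self.runtime_builders:
            return self.runtime_builders[name]
        return SHARED_STATIC_BUILDERS.get(name)
```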

### Why are the changes needed?

For better readability.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Existing test cases should cover this.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44743 from HyukjinKwon/SPARK-46670-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
