
@vicennial (Contributor) commented Nov 9, 2023

What changes were proposed in this pull request?

The significant changes in this PR include:

  • SparkConnectArtifactManager is renamed to ArtifactManager and moved out of Spark Connect and into sql/core (available in SparkSession through SessionState) along with all corresponding tests and confs.
  • While ArtifactManager is now part of SparkSession, we keep the legacy behaviour for non-Connect Spark while utilising the ArtifactManager in Connect pathways:
    • This is done by exposing a new method withResources in the artifact manager that sets the context class loader (for driver-side operations) and propagates the JobArtifactState such that the resources reach the executor.
    • Spark Connect pathways utilise this method through the SessionHolder#withActive
    • When withResources is not used, neither the custom context classloader nor the JobArtifactState is propagated; hence, non-Spark-Connect pathways retain the legacy behaviour.
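
The flow described above can be sketched as follows. The `withResources` name comes from this PR; the accessor path `sessionState.artifactManager` and the query are illustrative assumptions, not a confirmed public API:

```scala
import org.apache.spark.sql.SparkSession

val session = SparkSession.builder().getOrCreate()

// Illustrative accessor: the PR exposes the ArtifactManager via SessionState.
val artifactManager = session.sessionState.artifactManager

// Inside withResources, the context class loader is swapped to the
// session-scoped one (for driver-side operations) and the JobArtifactState
// is propagated so executors resolve the same session-scoped jars/classfiles.
artifactManager.withResources {
  session.sql("SELECT 1").collect()
}
// Outside the block, the default (legacy) classloader behaviour applies.
```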

Why are the changes needed?

The ArtifactManager that currently lives in the connect package can be moved into the wider sql/core package (e.g. SparkSession) to expand its scope. This is possible because the ArtifactManager is tied solely to SparkSession#sessionUUID and can therefore be cleanly detached from Spark Connect and made generally available.

Does this PR introduce any user-facing change?

No. Existing behaviour is kept intact for both non-Connect and Connect Spark.

How was this patch tested?

Existing tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@vicennial vicennial marked this pull request as ready for review November 10, 2023 13:42
@github-actions github-actions bot added the BUILD label Nov 10, 2023
@vicennial (Contributor Author)

PTAL @hvanhovell @HyukjinKwon

@hvanhovell (Contributor) left a comment

LGTM

@dongjoon-hyun (Member) left a comment

According to the JIRA, this is only for Apache Spark 4.0.0, right?

@dongjoon-hyun (Member) left a comment

We still need to support this configuration in some form because we haven't deprecated it yet in the Apache Spark community:

spark.connect.copyFromLocalToFs.allowDestLocal

* Returns an `ArtifactManager` that supports adding, managing and using session-scoped artifacts
* (jars, classfiles, etc).
*
* @since 3.5.1

This should be 4.0.0 because this PR is for Apache Spark 4.0.0, @vicennial .

@dongjoon-hyun (Member)

Could you re-trigger the failed pipelines?

@vicennial (Contributor Author)

@dongjoon-hyun Thank you for the review!

I've updated the version. For the deprecated conf spark.connect.copyFromLocalToFs.allowDestLocal, I've added it to the deprecatedSQLConfigs list so that it generates a warning when used. Currently, both configs work, but if the new config is set (it is currently optional), it overrides the deprecated config. Eventually, we should remove the deprecated conf and give the new config a default value instead of being optional.
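
The precedence described above (the new config, when set, overrides the deprecated one) amounts to roughly the following resolution logic. The deprecated key is from this thread; the new key name and the helper are illustrative assumptions:

```scala
// Hedged sketch only: the new config key name below is an assumption.
def resolveAllowDestLocal(get: String => Option[String]): Boolean = {
  val newKey = "spark.sql.artifact.copyFromLocalToFs.allowDestLocal" // assumed new key
  val oldKey = "spark.connect.copyFromLocalToFs.allowDestLocal"      // deprecated key
  // If the (optional) new config is set it wins; otherwise fall back to
  // the deprecated config, whose use triggers a deprecation warning.
  get(newKey).orElse(get(oldKey)).exists(_.toBoolean)
}
```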

@vicennial (Contributor Author)

Hmm, JavaDocGeneration is failing in 2 tests and, from the logs, it's not clear why...
I will merge master and see if that helps.

@vicennial (Contributor Author)

@dongjoon-hyun The CI is green now :)

@HyukjinKwon (Member)

Merged to master.

@fhalde commented Nov 22, 2023

Hi, we are super interested in having the isolated-classloader-per-SparkSession ability for our use case. I believe this is only achievable today if jobs are run from a Connect client, and we want to avoid using the Connect client.

But with this PR merged, it should be possible to have isolated classloaders per Spark session on the executors, right?

Our use case involves starting a Spark driver, dynamically loading/adding jars, and running transformations contained within those jars. Without isolated classloaders per session, we would risk classpath conflicts on the executor side.

@HyukjinKwon if your PR addresses my concern, I can backport it to 3.5.0 in my fork.

@vicennial (Contributor Author)

@fhalde Yes, with this PR it is possible to have isolated classloaders per Spark session on the executors without going through Spark Connect.

The withResources method should be used to wrap all executions (as Spark Connect does here). Note that all artifacts (i.e. jars, classfiles) need to be added through the ArtifactManager (adding them directly to SparkContext will not work); refer to Spark Connect's AddArtifactsHandler to see how the API is used.
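
Putting this together, a minimal sketch of the non-Connect usage. The accessor path and the addArtifact call are assumptions based on the PR description, not a confirmed public API, and the jar path and SQL function are hypothetical:

```scala
import java.nio.file.Paths
import org.apache.spark.sql.SparkSession

val session = SparkSession.builder().getOrCreate()
val artifactManager = session.sessionState.artifactManager // illustrative accessor

// Jars must go through the ArtifactManager rather than SparkContext.addJar,
// so that they end up in the session-scoped classloader.
artifactManager.addArtifact(Paths.get("/path/to/transformations.jar").toUri)

// Wrap execution so both the session classloader and the JobArtifactState
// reach driver-side code and the executors.
artifactManager.withResources {
  session.sql("SELECT transform_from_jar(col) FROM input").show()
}
```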

ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.streaming.StreamingQueryException"),

// SPARK-45856: Move ArtifactManager from Spark Connect into SparkSession (sql/core)
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.storage.CacheId.apply"),
@LuciferYang (Contributor) commented Dec 13, 2023

@vicennial I would like to reconfirm: will the ProblemFilters added by SPARK-45856 never need to undergo a MiMa check in versions after Spark 4.0, or are these just the ProblemFilters added for the MiMa check between Spark 4.0 and Spark 3.5? I found that they have been placed in defaultExcludes.

grundprinzip pushed a commit that referenced this pull request Jul 12, 2024
### What changes were proposed in this pull request?

This jar was added in #42069 but moved in #43735.

### Why are the changes needed?

To clean up a jar not used.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests should check

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47315 from HyukjinKwon/minor-cleanup-jar-2.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Martin Grund <[email protected]>
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024