[SPARK-44146][CONNECT] Isolate Spark Connect Session jars and classfiles #41701
Conversation
The PR is ready to review (the failing tests are flaky; I've submitted a rerun request).

@HyukjinKwon @hvanhovell Could you have a look? Thanks!
Looks fine from a cursory look, but I think we should have @hvanhovell sign off.
```scala
private lazy val pythonExec =
  sys.env.getOrElse("PYSPARK_PYTHON", sys.env.getOrElse("PYSPARK_DRIVER_PYTHON", "python3"))
```
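The fallback chain in the snippet above can be mirrored in Python; a minimal sketch (the environment-variable names are the real PySpark ones, `resolve_python_exec` is an illustrative helper, not a Spark API):

```python
import os

def resolve_python_exec(env=os.environ):
    # Prefer PYSPARK_PYTHON, then PYSPARK_DRIVER_PYTHON, defaulting to
    # "python3" -- the same fallback chain as the Scala snippet.
    return env.get("PYSPARK_PYTHON", env.get("PYSPARK_DRIVER_PYTHON", "python3"))
```

With neither variable set, the helper falls back to `python3`.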
```scala
// SparkConnectPlanner is used per request.
```
We could put this in the session holder, right?
```scala
}
val oldArtifactUri = currentArtifactRootUri
currentArtifactRootUri = SparkEnv.get.rpcEnv.fileServer
  .addDirectoryIfAbsent(ARTIFACT_DIRECTORY_PREFIX, artifactRootPath.toFile)
```
Can we use addDirectory instead? The if-absent bit is pretty well protected by this object.
```scala
 * @param f
 * @tparam T
 */
def withContext[T](f: => T): T = {
```
This name is a bit too vague for my liking. How about withContextClassLoader?
```scala
 */
def withSessionBasedPythonPaths[T](f: => T): T = {
  try {
    session.conf.set(
```
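The `try` above is the first half of a set-then-restore (loan) pattern. A minimal Python sketch of that pattern, with a hypothetical `Conf` stand-in for `session.conf` (not the real Spark API):

```python
class Conf:
    """Hypothetical stand-in for session.conf; not the real Spark API."""
    def __init__(self):
        self._m = {}
    def set(self, key, value):
        self._m[key] = value
    def get(self, key, default=None):
        return self._m.get(key, default)
    def unset(self, key):
        self._m.pop(key, None)

def with_session_conf(conf, key, value, f):
    # Set a session-scoped value for the duration of f, then restore the
    # previous state in a finally block so the setting never leaks out.
    old = conf.get(key)
    conf.set(key, value)
    try:
        return f()
    finally:
        if old is None:
            conf.unset(key)
        else:
            conf.set(key, old)
```

The `finally` is what guarantees the conf is restored even if `f` throws, which is the point of the review discussion below about unsetting the value.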
qq, we don't really need to unset this right? Or is exposing it to the client a bad idea? cc @HyukjinKwon
yeah I planned to remove this and @vicennial removed it in #41789 (comment). This was just a hack I added to avoid additional refactoring.
```scala
addHelloClass(holder1)

val classLoader1 = holder1.classloader
val instance1 = classLoader1
```
You could add another session here where you can also load Hello. In that case, the classes for the different sessions should not be equal.
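The suggested assertion can be illustrated with a small Python analogy (not the actual Scala test): loading the same class definition through two independent "loaders" (here, fresh namespaces) yields distinct class objects, which is what per-session classloader isolation should guarantee.

```python
SOURCE = "class Hello:\n    def msg(self):\n        return 'hi'\n"

def load_isolated(src):
    # Each call uses a fresh namespace, analogous to a fresh per-session
    # classloader: the same source produces a distinct class object.
    namespace = {}
    exec(src, namespace)
    return namespace["Hello"]

HelloA = load_isolated(SOURCE)
HelloB = load_isolated(SOURCE)
assert HelloA is not HelloB                      # distinct classes per "session"
assert HelloA().msg() == HelloB().msg() == "hi"  # but identical behaviour
```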
hvanhovell left a comment:
LGTM
I am merging this to unblock a couple of follow-ups. Please address my comments in a small follow-up.
…nnect Jar/Classfile Isolation

### What changes were proposed in this pull request?

This PR is a follow-up of #41701 and addresses the comments mentioned [here](#41701 (comment)). The summary is:

- `pythonIncludes` are fetched directly from the `ArtifactManager` via `SessionHolder` instead of being propagated through the Spark conf.
- `SessionHolder#withContext` is renamed to `SessionHolder#withContextClassLoader` to decrease ambiguity.
- Generally increased test coverage for isolated classloading (a new unit test in `ArtifactManagerSuite` and a new suite, `ClassLoaderIsolationSuite`).

### Why are the changes needed?

General follow-ups from [here.](#41701 (comment))

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New test suite and unit tests.

Closes #41789 from vicennial/SPARK-44246.

Authored-by: vicennial <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### Previous behaviour

Previously, we kept `JobArtifactSet` and leveraged a thread local for each client.

1. The execution block is wrapped with `SessionHolder.withSession` [here](https://github.com/apache/spark/blob/master/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala#L53).
2. `SessionHolder.withContextClassLoader` is then called [here](https://github.com/apache/spark/blob/master/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SessionHolder.scala#L130), which in turn calls `JobArtifactSet.withActive` [here](https://github.com/apache/spark/blob/master/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SessionHolder.scala#L118) and sets the active set to `SessionHolder.connectJobArtifactSet`.
3. The actual `JobArtifactSet` that is used is built up [here](https://github.com/apache/spark/blob/master/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/artifact/SparkConnectArtifactManager.scala#L157) in `SparkConnectArtifactManager.jobArtifactSet`.

Since each client has their own `JobArtifactSet` made `active` when executing an operation, the `TaskDescription` would have artifacts specific to that client and, subsequently, so would `IsolatedSessionState` in the Executor. Therefore, we were able to keep the Spark Connect specific logic separated within the Spark Connect module.

### Problem

Mostly this worked well; however, the problem is that we don't call `SparkContext.addFile` or `SparkContext.addJar` — we pass the artifacts directly to the scheduler (to `TaskDescription`). This is fine in general, but not calling `SparkContext.addFile` directly exposes several problems:

- `SparkContext.postEnvironmentUpdate` is not invoked as it would be in `SparkContext.addFile`, which matters for, e.g., recording events for the History Server.
- Specifically for archives, `Utils.unpack(source, dest)` is not invoked as it would be in `SparkContext.addFile`, so archives are not untarred properly on the Driver.

Therefore, we would have to duplicate that logic on the Spark Connect server side, which is not ideal. In addition, we already added the isolation logic into the Executor, and Driver and Executor are the symmetric pair (not Spark Connect Server <> Executor). So this matters for code readability and for the expectations about their roles.

### Solution in this PR

This PR proposes to support session-based files and archives in Spark Connect. It leverages the groundwork from #41701 and #41625 (for jars in the Spark Connect Scala client). The changed logic is as follows:

- Keep the session UUID and Spark Connect Server specific information, such as the REPL class path, within a thread local.
- Add the session ID when we add files or archives. `SparkContext` keeps them in a map `Map(session -> Map(file and timestamp))` in order to reuse the existing logic and address the problem mentioned above.

After that, on the executor side:

- Executors create an additional directory, named by the session UUID, on top of the default directory (that is, the current working directory; see `SparkFiles.getRootDirectory`).
- When we execute Python workers, the current working directory is set to the one created above.
- End users access these files via the current working directory, e.g., `./blahblah.txt` in their Python UDF, and the code is therefore compatible with and without Spark Connect.

Note that:

- Python workers are created per session because we set the session UUID as an environment variable, and we create new Python workers if environment variables differ; see also `SparkEnv.createPythonWorker`.
- The daemon and Python workers are already killed if they are not used for a while.

### TODOs and limitations

The executor also maintains the file list, but with a cache so it can evict entries. However, this has a problem. It works as follows:

- A new `IsolatedSessionState` is created.
- A task is executed once, and `IsolatedSessionState` holds the file list.
- Later, `IsolatedSessionState` is evicted at https://github.com/apache/spark/pull/41625/files#diff-d7a989c491f3cb77cca02c701496a9e2a3443f70af73b0d1ab0899239f3a789dR187
- The executor will create a new `IsolatedSessionState` with empty file lists.
- The executor will attempt to redownload and overwrite the files (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L1058-L1064).
- `spark.files.overwrite` is `false` by default, so the task will suddenly fail at this point.

Possible solutions are:

- For 1., we should maintain a cache with a TTL and remove entries accordingly.
- For 2., we should have a dedicated directory (which this PR does) and remove the directory when the cache is evicted, so the overwrite does not happen.

### Why are the changes needed?

In order to allow session-based artifact control and multi-tenancy.

### Does this PR introduce _any_ user-facing change?

Yes, this PR now allows multiple sessions to have their own space. For example, session A and session B can each add a file with the same name. Previously this was not possible.

### How was this patch tested?

Unit tests were added.

Closes #41495 from HyukjinKwon/session-base-exec-dir.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
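The session-scoped bookkeeping described above can be sketched in Python (names like `add_session_file` and `session_root` are illustrative helpers, not Spark APIs): the driver-side map keys files by session, and the executor resolves relative paths inside a per-session directory.

```python
import posixpath

# Driver side: session -> {file name -> timestamp}, mirroring the
# Map(session -> Map(file and timestamp)) kept in SparkContext.
session_files = {}

def add_session_file(session_uuid, name, timestamp):
    session_files.setdefault(session_uuid, {})[name] = timestamp

def session_root(root_dir, session_uuid):
    # Executor side: a per-session directory under the default root
    # directory; Python workers set their working directory here, so user
    # code keeps using relative paths like "./blahblah.txt".
    return posixpath.join(root_dir, session_uuid)

# Two sessions can add a file with the same name without clashing.
add_session_file("sess-a", "data.txt", 100)
add_session_file("sess-b", "data.txt", 200)
```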
What changes were proposed in this pull request?
This PR follows up on #41625 to utilize the classloader/resource isolation in Spark to support multi-user Spark Connect sessions which are isolated from each other (currently, classfiles and jars) and thus, enables multi-user REPLs and UDFs.
Rather than a single `SparkArtifactManager` handling all the artifact movement, each instance is now responsible for a single `sessionHolder` (i.e. a Spark Connect session), which it requires in its constructor.

This removes `sparkConnectArtifactDirectory`, which was initialised in `SparkContext`. Moving forward, all artifacts are instead separated based on the underlying `SparkSession` (using its `sessionUUID`) they belong to, in the format `ROOT_ARTIFACT_DIR/<sessionUUID>/jars/...`.

`SparkConnectArtifactManager` also builds a `JobArtifactSet`, which is eventually propagated to the executors, where the classloader isolation mechanism uses the `uuid` parameter.

Why are the changes needed?
To enable support for multi-user sessions coexisting on a singular Spark cluster. For example, multi-user Scala REPLs/UDFs will be supported with this PR.
Does this PR introduce any user-facing change?
Yes, multiple Spark Connect REPLs may use a single Spark cluster at once and execute their own UDFs without interfering with each other.
How was this patch tested?
New unit tests in `ArtifactManagerSuite` + existing tests.
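The per-session layout `ROOT_ARTIFACT_DIR/<sessionUUID>/jars/...` described above can be sketched with a small path helper (a hypothetical illustration, not the actual `SparkConnectArtifactManager` API):

```python
import posixpath

def artifact_path(root_artifact_dir, session_uuid, kind, name):
    # Artifacts are grouped first by the owning SparkSession's sessionUUID,
    # then by artifact kind (e.g. "jars", "classes"), so two sessions can
    # hold artifacts with identical names without colliding.
    return posixpath.join(root_artifact_dir, session_uuid, kind, name)
```

For instance, `artifact_path("/artifacts", "abc-123", "jars", "udf.jar")` yields `/artifacts/abc-123/jars/udf.jar`.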