# [SPARK-44290][CONNECT] Session-based files and archives in Spark Connect #41495
Conversation
qq: If we make existing

Yeah, let me try to decouple this.
### What changes were proposed in this pull request?
This passes `SessionHolder`, rather than just `SparkSession`, to `SparkConnectPlanner`. This allows access to session-specific state at the Connect server level. Note that this is Spark Connect-specific session state and is not stored with `SparkSession`. E.g.:
* Mapping from _dataframe reference id_ to the actual dataframe in #41580
* Files and archives stored with the session in #41495

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing unit tests.

Closes #41618 from rangadi/session-holder.

Authored-by: Raghu Angadi <[email protected]>
Signed-off-by: Herman van Hovell <[email protected]>
Actually, I already hit this limitation when a test takes longer than 5 minutes. I increased the timeout as a workaround for now; we should fix this before the release.
Yeah, I think with the helper instance, it makes much more sense now.
Both tests were flaky. I manually checked them locally. Merged to master.
… session UUID together

### What changes were proposed in this pull request?
This PR is a followup of #41495 that skips a couple of flaky tests. In addition, it also fixes a typo.

### Why are the changes needed?
To keep the tests green. Re-enabling the tests requires other fixes that might refactor the whole test suite, which will take a while; I will follow up and fix them in SPARK-44348.

### Does this PR introduce _any_ user-facing change?
No, the feature is not released to end users yet.

### How was this patch tested?
The tests are skipped for now.

Closes #41913 from HyukjinKwon/SPARK-44290-followup.

Lead-authored-by: Hyukjin Kwon <[email protected]>
Co-authored-by: Kent Yao <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…vant changes

### What changes were proposed in this pull request?
This PR is a sort of followup of #41495. It contains several changes to make the tests work:
- Always use `JobArtifactSet.getCurrentJobArtifactState` to get the current UUID in the current thread.
- Specify the current state (from `JobArtifactSet.getCurrentJobArtifactState`) when adding the artifacts, so we can get the state in `SparkContext`.
- Create a dedicated directory on the Driver side too. We provide Spark Connect Server as a service: it creates a session-dedicated directory and puts the added files there on the server.
  - Note that we do not support `SparkFiles.getRootDirectory` in Spark Connect, so this should be fine.
  - This dedicated directory will also be used to execute Python processes on the Driver side (for dependency management, e.g., foreachBatch in Structured Streaming with Spark Connect).
- Get the current UUID on the Driver side for Python UDF execution. Previously, we tried to get it from the executor side, which results in `None`.
- Rename `sessionUUID` (or similar) to `jobArtifactUUID`; in Core code context, it is a job artifact state.
- Fix the Spark Connect Python client local debug mode (e.g., `local` or `local-cluster`) to send jars when `local-cluster` mode is specified. Otherwise, it throws an exception that `SparkConnectPlugin` cannot be found.
- Refactor and fix the tests to verify archives, files and pyfiles in both `local` and `local-cluster` modes.

### Why are the changes needed?
To make session-based artifact management work.

### Does this PR introduce _any_ user-facing change?
No, this feature has not been released yet.

### How was this patch tested?
Unit tests were added.

Closes #41942 from HyukjinKwon/SPARK-44348.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
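The thread-local "current job artifact state" pattern described above can be sketched in plain Python. This is a hypothetical, simplified analogue, not Spark's actual API: `get_current_state` and `with_active` are illustrative names standing in for `JobArtifactSet.getCurrentJobArtifactState` and `JobArtifactSet.withActive`.

```python
import threading

# Illustrative sketch only: each thread sees the state that was bound
# for the client operation it is currently executing.
_active = threading.local()

def get_current_state():
    # Return the state bound to this thread, or None outside any scope.
    return getattr(_active, "state", None)

def with_active(state, body):
    # Bind `state` for the current thread while running `body`, then restore.
    prev = get_current_state()
    _active.state = state
    try:
        return body()
    finally:
        _active.state = prev

print(get_current_state())                         # None outside any scope
print(with_active("uuid-123", get_current_state))  # uuid-123 inside the scope
```

Because the binding lives in a `threading.local`, two clients handled on different threads never observe each other's state.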
### Previous behaviour
Previously, we kept a `JobArtifactSet` and leveraged a thread local for each client:
- `SessionHolder.withSession` is entered here.
- `SessionHolder.withContextClassLoader` is then called here, which in turn calls `JobArtifactSet.withActive` here and sets the active set to `SessionHolder.connectJobArtifactSet`.
- The `JobArtifactSet` that is used is built up here in `SparkConnectArtifactManager.jobArtifactSet`.

Since each client has their own `JobArtifactSet` made active when executing an operation, the `TaskDescription` would have artifacts specific to that client and, subsequently, so would `IsolatedSessionState` in the Executor. Therefore, we were able to keep the Spark Connect-specific logic in the Spark Connect module.

### Problem
Mainly it was all good; however, the problem is that we don't call `SparkContext.addFile` or `SparkContext.addJar`, but instead pass the artifacts directly to the scheduler (via `TaskDescription`). This is fine in general but exposes several problems by not directly calling `SparkContext.addFile`:
- `SparkContext.postEnvironmentUpdate` is not invoked, which matters for, e.g., recording the events for the History Server.
- `Utils.unpack(source, dest)` is not invoked, so archives are not untarred properly on the Driver.

Therefore, we would have to duplicate those logics on the Spark Connect server side, which is not ideal.
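To illustrate one of the side effects that gets skipped: `Utils.unpack` untars an added archive on the Driver. A rough Python analogue (hypothetical; the real implementation is Scala and handles more formats) of what would otherwise have to be re-implemented on the Connect server:

```python
import os
import tarfile
import tempfile

def unpack(source: str, dest: str) -> None:
    # Rough analogue of Utils.unpack(source, dest): extract an archive
    # into dest. SparkContext.addFile triggers this on the Driver;
    # bypassing addFile means this step silently never happens.
    os.makedirs(dest, exist_ok=True)
    with tarfile.open(source) as tar:
        tar.extractall(dest)

# Build a tiny archive and unpack it, mimicking an added .tar artifact.
tmp = tempfile.mkdtemp()
payload = os.path.join(tmp, "data.txt")
with open(payload, "w") as f:
    f.write("hello")

archive = os.path.join(tmp, "data.tar")
with tarfile.open(archive, "w") as tar:
    tar.add(payload, arcname="data.txt")

out = os.path.join(tmp, "out")
unpack(archive, out)
with open(os.path.join(out, "data.txt")) as f:
    print(f.read())  # hello
```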
In addition, we already added the isolation logic into the Executor. The Driver and Executor are symmetric (not Spark Connect Server <> Executor), so this matters for code readability and for the expected roles of each component.
### Solution in this PR
This PR proposes to support session-based files and archives in Spark Connect. It leverages the groundwork from #41701 and #41625 (for jars in the Spark Connect Scala client).
The changed logic is as follows:
- `SparkContext` keeps them with a map, `Map(session -> Map(file and timestamp))`, in order to reuse the existing logic and address the problems mentioned above.

After that, on the executor side:
- Files for each session live under a session-specific directory (`SparkFiles.getRootDirectory`).
- Users can reference added files with relative paths such as `../blahblah.txt` in their Python UDF. Therefore, it is compatible with and without Spark Connect.

Note that:
- `SparkEnv.createPythonWorker`

### TODOs and limitations
The Executor also maintains the file list, but with a cache so entries can be evicted. However, this has a problem. It works as follows:
- When the first task for a session arrives, an `IsolatedSessionState` is created; `IsolatedSessionState` holds the file list.
- The `IsolatedSessionState` can be evicted (see https://github.com/apache/spark/pull/41625/files#diff-d7a989c491f3cb77cca02c701496a9e2a3443f70af73b0d1ab0899239f3a789dR187).
- When a later task for the same session arrives, it recreates `IsolatedSessionState` with an empty file list.
- `spark.files.overwrite` is `false` by default, so the task will suddenly fail at this point.

There are a few possible solutions.
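The failure mode described above can be reproduced with a minimal sketch. This is a hypothetical plain-Python stand-in (a dict-backed cache), not the real Scala `IsolatedSessionState`; it only demonstrates how eviction silently drops the per-session file list.

```python
class IsolatedStateCache:
    # Simplified stand-in for the executor's per-session state cache.
    def __init__(self, max_size: int):
        self.max_size = max_size
        self.states = {}  # session -> set of file names

    def get_or_create(self, session: str) -> set:
        if session not in self.states:
            if len(self.states) >= self.max_size:
                # Evict the oldest entry, dropping its file list with it.
                self.states.pop(next(iter(self.states)))
            self.states[session] = set()  # recreated with an EMPTY file list
        return self.states[session]

cache = IsolatedStateCache(max_size=1)
cache.get_or_create("session-a").add("data.txt")
cache.get_or_create("session-b")  # evicts session-a's state
# A later task for session-a sees an empty file list:
print("data.txt" in cache.get_or_create("session-a"))  # False
```

With `spark.files.overwrite=false`, re-adding the "missing" file then fails instead of repairing the state.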
### Why are the changes needed?
In order to allow session-based artifact control and multi-tenancy.
### Does this PR introduce _any_ user-facing change?
Yes. This PR now gives each session its own space; for example, session A and session B can each add a file with the same name, which previously was not possible.
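The per-session namespace can be illustrated with a hypothetical plain-Python sketch (illustrative names only, not Spark's API) of the `session -> (file -> timestamp)` bookkeeping:

```python
import time
from collections import defaultdict

class SessionFileRegistry:
    # Illustrative sketch: files are keyed by session first, so the same
    # file name in two different sessions never clashes.
    def __init__(self):
        self._files = defaultdict(dict)  # session -> {file name -> timestamp}

    def add_file(self, session: str, name: str) -> None:
        self._files[session][name] = time.time()

    def files(self, session: str):
        return sorted(self._files[session])

registry = SessionFileRegistry()
registry.add_file("session-a", "data.txt")
registry.add_file("session-b", "data.txt")  # same name, no conflict
print(registry.files("session-a"))  # ['data.txt']
print(registry.files("session-b"))  # ['data.txt']
```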
### How was this patch tested?
Unit tests were added.