[SPARK-44348][CORE][CONNECT][PYTHON] Reenable test_artifact with relevant changes #41942
Conversation
Apologies that this PR happened to touch a lot of the codebase. I would appreciate it if you guys could find some time to take a look, cc @hvanhovell @ueshin @vicennial @zhengruifeng
```scala
// If the session ID was specified from SparkSession, it's from a Spark Connect client.
// Specify a dedicated directory for Spark Connect client.
// We're running Spark Connect as a service so regular PySpark path
// is not affected.
lazy val root = if (jobArtifactUUID != "default") {
  val newDest = new File(SparkFiles.getRootDirectory(), jobArtifactUUID)
  // ... (diff excerpt continues)
```
Is this needed because the session-specific handling is more generic now?
For the JARs from Spark Connect, we previously just registered the root artifact directory in the file server and built URIs that let the executor fetch the files directly, without needing to copy them over to the generic Spark Files directory.
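A toy sketch of that earlier flow, for illustration only (the server class, host, port, and paths here are made up; the real file server is Spark-internal):

```scala
import java.io.File

// Toy stand-in for the driver's file server, only to illustrate the idea:
// register a directory once, then build per-file URIs under it so executors
// can fetch each artifact directly, with no copy into the Spark Files dir.
class ToyFileServer(host: String, port: Int) {
  private var dirs = Map.empty[String, File]

  def addDirectory(baseUri: String, dir: File): String = {
    dirs += (baseUri -> dir)
    s"spark://$host:$port$baseUri"
  }
}

val server = new ToyFileServer("driver-host", 7078)                      // hypothetical endpoint
val base = server.addDirectory("/artifacts", new File("/tmp/artifacts")) // hypothetical root
val jarUri = s"$base/my-lib.jar" // URI an executor would fetch directly
```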
Yeah, it now needs to reuse PythonWorkerFactory, which assumes that there is a UUID-named directory under SparkFiles.getRootDirectory() on both the Driver and the Executor. We could try to reuse the local artifact directory, but I would prefer to keep a separate local copy for now, for better maintainability and reusability.
Otherwise, it uploads to the Spark file server twice (as we discussed offline). I pushed new changes to avoid this. So, after this change, we no longer upload twice:
- Directly pass the `spark://` URI to `addFile` and `addJar`.
- `addFile` and `addJar` will not attempt to upload the files, but pass the original URI through.
> reuse PythonWorkerFactory, which assumes that there is a UUID-named directory under SparkFiles.getRootDirectory() on both the Driver and the Executor

Ahh gotcha, I'm not very familiar with the Python side, good to know 👍

> So, after this change, we no longer upload twice:
> - Directly pass the `spark://` URI to `addFile` and `addJar`.
> - `addFile` and `addJar` will not attempt to upload the files, but pass the original URI through.

Awesome!
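To make the bypass concrete, here is a minimal sketch (the helper name and the `upload` callback are hypothetical; the real logic lives in `SparkContext.addFile`/`addJar`):

```scala
import java.net.URI

// Hypothetical helper mirroring the behavior described above: a spark:// URI
// is already served by the driver's file server, so it is passed through
// unchanged; anything else goes through the normal upload path.
def resolveArtifactUri(path: String, upload: String => String): String =
  new URI(path).getScheme match {
    case "spark" => path         // already on the file server: no second upload
    case _       => upload(path) // local file, HDFS, etc.: upload as before
  }
```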
Merged to master.
…cal-cluster tests

### What changes were proposed in this pull request?
This PR is a followup of #41942 that reduces the memory used in tests.

### Why are the changes needed?
To reduce the memory used in GitHub Actions test. This is consistent with: https://github.com/apache/spark/blob/master/python/pyspark/ml/tests/connect/test_parity_torch_distributor.py#L67C55-L67C58

See also #40874

### Does this PR introduce _any_ user-facing change?
No, test-only.

### How was this patch tested?
Manually ran the tests in my local to verify the change.

Closes #41977 from HyukjinKwon/SPARK-44348-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
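For reference, a hedged sketch of capping memory in `local-cluster` mode; the master URL format is `local-cluster[workers,coresPerWorker,memoryPerWorkerMB]`, and the session setup below is illustrative rather than the test's actual code:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: a small local-cluster with 2 workers at 1 core and
// 512 MB each, with executor memory capped to fit inside the workers.
val spark = SparkSession.builder()
  .master("local-cluster[2,1,512]")
  .config("spark.executor.memory", "512m")
  .appName("memory-capped-test")
  .getOrCreate()
```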
### What changes were proposed in this pull request?
This PR is a followup of #41942 that does `substring(1)` to remove the leading slash, so that it makes relative parts from the URI. Otherwise, it can end up with double slashes in the middle.

### Why are the changes needed?
To avoid having unnecessary double slashes ... and save one byte :-)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually tested. It's really trivial.

Closes #42051 from HyukjinKwon/minor-change.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
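A tiny sketch of the double-slash issue this followup fixes (the base path and JAR name are made up):

```scala
import java.net.URI

val base = "/tmp/artifacts/session-uuid"       // hypothetical session root
val part = new URI("/jars/my-lib.jar").getPath // begins with a leading "/"

val bad  = s"$base/$part"                // .../session-uuid//jars/my-lib.jar
val good = s"$base/${part.substring(1)}" // .../session-uuid/jars/my-lib.jar
```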
### What changes were proposed in this pull request?
This PR is a sort of a followup of #41495. This PR contains several changes to make the tests work:
- Added `JobArtifactSet.getCurrentJobArtifactState` to get the current UUID in the current thread (a minimal sketch of this thread-local pattern follows this list).
- Use the state (`JobArtifactSet.getCurrentJobArtifactState`) when adding the artifacts (so we can get the state in `SparkContext`).
- Use a dedicated UUID-named directory under `SparkFiles.getRootDirectory` in Spark Connect, so this should be fine. This dedicated directory will also be used to execute the Python process on the Driver side (for dependency management, e.g., `foreachBatch` in Structured Streaming with Spark Connect).
- `None`.
- Renamed `sessionUUID` (or similar) to `jobArtifactUUID`. In the Core code context, it's a job artifact state.
- Fixed local modes (`local` or `local-cluster`) to send jars when `local-cluster` mode is specified. If not, it throws an exception that `SparkConnectPlugin` cannot be found.
- Reenabled the tests in both `local` and `local-cluster` modes.
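As referenced in the first bullet, a minimal sketch of the thread-local state pattern, assuming simplified names; the actual `JobArtifactSet`/`JobArtifactState` in Spark carry more detail:

```scala
// Simplified sketch of a thread-local job artifact state; field names are
// illustrative and the real implementation differs.
case class JobArtifactState(uuid: String, replClassDirUri: Option[String])

object JobArtifactSet {
  private val current = new InheritableThreadLocal[Option[JobArtifactState]] {
    override def initialValue(): Option[JobArtifactState] = None
  }

  // The state bound to the current thread, or None if none was set.
  def getCurrentJobArtifactState: Option[JobArtifactState] = current.get()

  // Bind a state while running `body`, restoring the previous binding after.
  def withActiveJobArtifactState[T](state: JobArtifactState)(body: => T): T = {
    val previous = current.get()
    current.set(Some(state))
    try body finally current.set(previous)
  }
}
```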
### Why are the changes needed?
To make session-based artifact management work.
### Does this PR introduce _any_ user-facing change?
No, this feature has not been released yet.
### How was this patch tested?
Unittests added.