
Conversation

@HyukjinKwon HyukjinKwon commented Jul 11, 2023

What changes were proposed in this pull request?

This PR is a followup of sorts to #41495. It contains several changes to make the tests work:

  • Always use JobArtifactSet.getCurrentJobArtifactState to get the current UUID in the current thread (a minimal sketch of this thread-local pattern follows this list).
  • Specify the current state (from JobArtifactSet.getCurrentJobArtifactState) when adding the artifacts, so we can get the state back in SparkContext.
  • Create a dedicated directory on the Driver side too. We provide Spark Connect Server as a service: it creates a session-dedicated directory and puts the added files there on the server.
    • Note that we do not support SparkFiles.getRootDirectory in Spark Connect, so this should be fine. This dedicated directory will also be used to execute Python processes on the Driver side (for dependency management, e.g., foreachBatch in Structured Streaming with Spark Connect).
  • Get the current UUID on the Driver side for Python UDF execution. Previously, we tried to get it from the executor side, which results in None.
  • Rename sessionUUID (or similar) to jobArtifactUUID; in the Core code context, it is a job artifact state.
  • Fix the Spark Connect Python client's local debug mode (e.g., local or local-cluster) to send JARs when local-cluster mode is specified. If not, it throws an exception that SparkConnectPlugin cannot be found.
  • Refactor and fix the tests to verify archives, files and pyfiles in both local and local-cluster modes.
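A minimal, self-contained sketch of the thread-local state pattern described in the first two bullets. It is illustrative only: the names mirror Spark's JobArtifactSet API, but the bodies are assumptions, not Spark's actual implementation.

```scala
// Sketch only: a job artifact state carried in a thread-local, so that code
// running later on the same thread (e.g. SparkContext.addFile) can resolve
// the session UUID without it being passed explicitly.
case class JobArtifactState(uuid: String)

object JobArtifactSetSketch {
  private val current = new ThreadLocal[Option[JobArtifactState]] {
    override def initialValue(): Option[JobArtifactState] = None
  }

  // Analogous to JobArtifactSet.getCurrentJobArtifactState in Spark.
  def getCurrentJobArtifactState: Option[JobArtifactState] = current.get()

  // Run `body` with `state` active on this thread, e.g. while adding artifacts.
  def withActiveState[T](state: JobArtifactState)(body: => T): T = {
    val previous = current.get()
    current.set(Some(state))
    try body finally current.set(previous)
  }
}
```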

Why are the changes needed?

To make session-based artifact management work.

Does this PR introduce any user-facing change?

No, this feature has not been released yet.

How was this patch tested?

Unit tests added.

@HyukjinKwon HyukjinKwon marked this pull request as draft July 11, 2023 12:24
@HyukjinKwon HyukjinKwon changed the title from "[WIP][SPARK-44348][CORE][CONNECT][TESTS] Reenable test_artifact" to "[SPARK-44348][CORE][CONNECT][TESTS] Reenable test_artifact with relevant changes" Jul 11, 2023
@HyukjinKwon HyukjinKwon marked this pull request as ready for review July 11, 2023 13:05
@HyukjinKwon (Member, Author):

Apologies that this PR happened to touch a lot of the codebase. I would appreciate it if you could find some time to take a look, cc @hvanhovell @ueshin @vicennial @zhengruifeng

@HyukjinKwon HyukjinKwon changed the title from "[SPARK-44348][CORE][CONNECT][TESTS] Reenable test_artifact with relevant changes" to "[SPARK-44348][CORE][CONNECT][PYTHON] Reenable test_artifact with relevant changes" Jul 11, 2023
Comment on lines +1776 to +1781
```scala
// If the session ID was specified from SparkSession, it's from a Spark Connect client.
// Specify a dedicated directory for the Spark Connect client.
// We're running Spark Connect as a service, so the regular PySpark path
// is not affected.
lazy val root = if (jobArtifactUUID != "default") {
  val newDest = new File(SparkFiles.getRootDirectory(), jobArtifactUUID)
```
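The excerpt above is cut off by the review range. A plausible completion, for readability only: the mkdir call and the else branch are assumptions inferred from the surrounding discussion, not necessarily the exact code, and imports of java.io.File and org.apache.spark.SparkFiles are assumed as in the excerpt.

```scala
lazy val root = if (jobArtifactUUID != "default") {
  val newDest = new File(SparkFiles.getRootDirectory(), jobArtifactUUID)
  newDest.mkdir()  // assumed: create the session-dedicated directory on first use
  newDest
} else {
  new File(SparkFiles.getRootDirectory())
}
```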
Contributor:

Is this needed because the session-specific handling is more generic now?
Previously, for the JARs from Spark Connect, we just registered the root artifact directory in the file server and built URIs that let the executor fetch the file directly, without needing to copy it over to the generic Spark files directory.

@HyukjinKwon (Member, Author):

Yeah, it now needs to reuse PythonWorkerFactory, which assumes that there is a UUID-named directory under SparkFiles.getRootDirectory() on both the Driver and the Executors. We could try to reuse the local artifact directory, but I would prefer to keep another local copy for now, for better maintainability and reusability.

Otherwise, it uploads to the Spark file server twice (as we discussed offline). I pushed new changes to avoid this. So, after this change, we do not upload twice anymore, because we:

  1. Directly pass the spark:// URI to addFile and addJar.
  2. addFile and addJar will not attempt to upload the files, but pass the original URI through as-is.

(See the sketch below for the shape of this bypass.)
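A hedged sketch of that bypass. The function name and the upload callback are illustrative, not Spark's actual internals; only the decision itself, skipping re-upload for URIs already hosted by the Spark file server, comes from the discussion above.

```scala
import java.net.URI

// Illustrative only: the decision described in the two steps above.
// URIs already served by the Spark file server (scheme "spark") are passed
// through unchanged; anything else is uploaded exactly once.
def resolveArtifactUri(uri: URI, uploadToFileServer: URI => URI): URI = {
  if (uri.getScheme == "spark") {
    uri                       // already on the file server; no second upload
  } else {
    uploadToFileServer(uri)   // first and only upload; returns the served URI
  }
}
```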

Contributor:

> reuse PythonWorkerFactory, which assumes that there is a UUID-named directory under SparkFiles.getRootDirectory() on both the Driver and the Executors

Ahh gotcha, I am not very familiar with the Python side, good to know 👍

> So, after this change, we do not upload twice anymore, because we:
> 1. Directly pass the spark:// URI to addFile and addJar.
> 2. addFile and addJar will not attempt to upload the files, but pass the original URI through as-is.

Awesome!

@HyukjinKwon HyukjinKwon marked this pull request as draft July 12, 2023 00:19
@HyukjinKwon HyukjinKwon marked this pull request as ready for review July 12, 2023 10:30
@HyukjinKwon HyukjinKwon force-pushed the SPARK-44348 branch 2 times, most recently from d9459b2 to 324d9e7, on July 12, 2023 10:34
@HyukjinKwon (Member, Author):

Merged to master.

HyukjinKwon added a commit that referenced this pull request Jul 13, 2023
…cal-cluster tests

### What changes were proposed in this pull request?

This PR is a followup of #41942 that reduces the memory used in tests.

### Why are the changes needed?

To reduce the memory used in GitHub Actions test. This is consistent with:

https://github.com/apache/spark/blob/master/python/pyspark/ml/tests/connect/test_parity_torch_distributor.py#L67C55-L67C58

See also #40874

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Manually ran the tests locally to verify the change.

Closes #41977 from HyukjinKwon/SPARK-44348-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
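For context, reducing memory in local-cluster tests typically means lowering the per-worker memory in the local-cluster master string. A hedged sketch follows: the app name and figures are made up, while the master-string format [numWorkers, coresPerWorker, memoryPerWorkerMiB] is Spark's.

```scala
import org.apache.spark.sql.SparkSession

// 2 workers, 1 core each, 512 MiB each: a smaller footprint for CI runs.
val spark = SparkSession.builder()
  .master("local-cluster[2, 1, 512]")
  .appName("artifact-tests")   // hypothetical name
  .getOrCreate()
```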
HyukjinKwon added a commit that referenced this pull request Jul 18, 2023
### What changes were proposed in this pull request?

This PR is a followup of #41942 that does `substring(1)` to remove the leading slash, so the path becomes a relative part of the URI. Otherwise, it can end up with double slashes in the middle.

### Why are the changes needed?

To avoid unnecessary double slashes ... and save one byte :-)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually tested. It's really trivial.

Closes #42051 from HyukjinKwon/minor-change.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
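A small illustration of the double-slash issue (the URI and base path here are made up):

```scala
import java.net.URI

val uri = new URI("spark://host:15002/jars/my.jar")
val base = "/tmp/spark-artifacts"

// uri.getPath is absolute ("/jars/my.jar"), so naive joining yields a
// double slash in the middle:
val naive = s"$base/${uri.getPath}"               // /tmp/spark-artifacts//jars/my.jar
// Dropping the leading slash first keeps the joined path clean:
val fixed = s"$base/${uri.getPath.substring(1)}"  // /tmp/spark-artifacts/jars/my.jar
```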
@HyukjinKwon HyukjinKwon deleted the SPARK-44348 branch January 15, 2024 00:52