
Conversation

@HyukjinKwon HyukjinKwon commented Jul 11, 2023

What changes were proposed in this pull request?

This PR is a followup of sorts to #41495. It contains several changes to make the tests work:

  • Always use JobArtifactSet.getCurrentJobArtifactState to get the current UUID in the current thread (a minimal sketch of this thread-local pattern follows this list).
  • Specify the current state (from JobArtifactSet.getCurrentJobArtifactState) when adding the artifacts, so we can get the state back in SparkContext.
  • Create a dedicated directory on the Driver side too. We provide Spark Connect Server as a service: it creates a session-dedicated directory and puts the added files there on the server.
    • Note that we do not support SparkFiles.getRootDirectory in Spark Connect, so this should be fine. This dedicated directory will also be used to execute Python processes on the Driver side (for dependency management, e.g., foreachBatch in Structured Streaming with Spark Connect).
  • Get the current UUID on the Driver side for Python UDF execution. Previously, we tried to get it from the executor side, which results in None.
  • Rename sessionUUID (or similar) to jobArtifactUUID; in the Core code context, it is a job artifact state.
  • Fix the Spark Connect Python client's local debug mode (e.g., local or local-cluster) to send JARs when local-cluster mode is specified. If not, it throws an exception that SparkConnectPlugin cannot be found.
  • Refactor and fix the tests to verify archives, files and pyfiles in both local and local-cluster modes.
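A minimal, self-contained sketch of the thread-local state pattern described in the first two bullets. It is illustrative only: the names mirror Spark's JobArtifactSet API, but the bodies are assumptions, not Spark's actual implementation.

```scala
// Sketch only: a job artifact state carried in a thread-local, so that code
// running later on the same thread (e.g. SparkContext.addFile) can resolve
// the session UUID without it being passed explicitly.
case class JobArtifactState(uuid: String)

object JobArtifactSetSketch {
  private val current = new ThreadLocal[Option[JobArtifactState]] {
    override def initialValue(): Option[JobArtifactState] = None
  }

  // Analogous to JobArtifactSet.getCurrentJobArtifactState in Spark.
  def getCurrentJobArtifactState: Option[JobArtifactState] = current.get()

  // Run `body` with `state` active on this thread, e.g. while adding artifacts.
  def withActiveState[T](state: JobArtifactState)(body: => T): T = {
    val previous = current.get()
    current.set(Some(state))
    try body finally current.set(previous)
  }
}
```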

Why are the changes needed?

To make session-based artifact management work.

Does this PR introduce any user-facing change?

No, this feature has not been released yet.

How was this patch tested?

Unit tests added.

@HyukjinKwon HyukjinKwon marked this pull request as draft July 11, 2023 12:24
@HyukjinKwon HyukjinKwon changed the title from "[WIP][SPARK-44348][CORE][CONNECT][TESTS] Reenable test_artifact" to "[SPARK-44348][CORE][CONNECT][TESTS] Reenable test_artifact with relevant changes" Jul 11, 2023
@HyukjinKwon HyukjinKwon marked this pull request as ready for review July 11, 2023 13:05
@HyukjinKwon (Member, Author):

Apologies that this PR happened to touch a lot of the codebase. I would appreciate it if you could find some time to take a look, cc @hvanhovell @ueshin @vicennial @zhengruifeng

@HyukjinKwon HyukjinKwon changed the title from "[SPARK-44348][CORE][CONNECT][TESTS] Reenable test_artifact with relevant changes" to "[SPARK-44348][CORE][CONNECT][PYTHON] Reenable test_artifact with relevant changes" Jul 11, 2023
Comment on lines +1776 to +1781
```scala
// If the session ID was specified from SparkSession, it's from a Spark Connect client.
// Specify a dedicated directory for the Spark Connect client.
// We're running Spark Connect as a service, so the regular PySpark path
// is not affected.
lazy val root = if (jobArtifactUUID != "default") {
  val newDest = new File(SparkFiles.getRootDirectory(), jobArtifactUUID)
```
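The excerpt above is cut off by the review range. A plausible completion, for readability only: the mkdir call and the else branch are assumptions inferred from the surrounding discussion, not necessarily the exact code, and imports of java.io.File and org.apache.spark.SparkFiles are assumed as in the excerpt.

```scala
lazy val root = if (jobArtifactUUID != "default") {
  val newDest = new File(SparkFiles.getRootDirectory(), jobArtifactUUID)
  newDest.mkdir()  // assumed: create the session-dedicated directory on first use
  newDest
} else {
  new File(SparkFiles.getRootDirectory())
}
```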
Contributor:

Is this needed because the session-specific handling is more generic now?
Previously, for the JARs from Spark Connect, we just registered the root artifact directory in the file server and built URIs that let the executor fetch the file directly, without needing to copy it over to the generic Spark files directory.

@HyukjinKwon (Member, Author):

Yeah, it now needs to reuse PythonWorkerFactory, which assumes that there is a UUID-named directory under SparkFiles.getRootDirectory() on both the Driver and the Executors. We could try to reuse the local artifact directory, but I would prefer to keep another local copy for now, for better maintainability and reusability.

Otherwise, it uploads to the Spark file server twice (as we discussed offline). I pushed new changes to avoid this. So, after this change, we do not upload twice anymore, because we:

  1. Directly pass the spark:// URI to addFile and addJar.
  2. addFile and addJar will not attempt to upload the files, but pass the original URI through as-is.

(See the sketch below for the shape of this bypass.)
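A hedged sketch of that bypass. The function name and the upload callback are illustrative, not Spark's actual internals; only the decision itself, skipping re-upload for URIs already hosted by the Spark file server, comes from the discussion above.

```scala
import java.net.URI

// Illustrative only: the decision described in the two steps above.
// URIs already served by the Spark file server (scheme "spark") are passed
// through unchanged; anything else is uploaded exactly once.
def resolveArtifactUri(uri: URI, uploadToFileServer: URI => URI): URI = {
  if (uri.getScheme == "spark") {
    uri                       // already on the file server; no second upload
  } else {
    uploadToFileServer(uri)   // first and only upload; returns the served URI
  }
}
```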

Contributor:

> reuse PythonWorkerFactory, which assumes that there is a UUID-named directory under SparkFiles.getRootDirectory() on both the Driver and the Executors

Ahh gotcha, I am not very familiar with the Python side, good to know 👍

> So, after this change, we do not upload twice anymore, because we:
> 1. Directly pass the spark:// URI to addFile and addJar.
> 2. addFile and addJar will not attempt to upload the files, but pass the original URI through as-is.

Awesome!

@HyukjinKwon HyukjinKwon marked this pull request as draft July 12, 2023 00:19
@HyukjinKwon HyukjinKwon marked this pull request as ready for review July 12, 2023 10:30
@HyukjinKwon HyukjinKwon force-pushed the SPARK-44348 branch 2 times, most recently from d9459b2 to 324d9e7, on July 12, 2023 10:34
@HyukjinKwon (Member, Author):

Merged to master.

HyukjinKwon added a commit that referenced this pull request Jul 13, 2023
…cal-cluster tests

### What changes were proposed in this pull request?

This PR is a followup of #41942 that reduces the memory used in tests.

### Why are the changes needed?

To reduce the memory used in GitHub Actions test. This is consistent with:

https://github.com/apache/spark/blob/master/python/pyspark/ml/tests/connect/test_parity_torch_distributor.py#L67C55-L67C58

See also #40874

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Manually ran the tests locally to verify the change.

Closes #41977 from HyukjinKwon/SPARK-44348-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
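For context, reducing memory in local-cluster tests typically means lowering the per-worker memory in the local-cluster master string. A hedged sketch follows: the app name and figures are made up, while the master-string format [numWorkers, coresPerWorker, memoryPerWorkerMiB] is Spark's.

```scala
import org.apache.spark.sql.SparkSession

// 2 workers, 1 core each, 512 MiB each: a smaller footprint for CI runs.
val spark = SparkSession.builder()
  .master("local-cluster[2, 1, 512]")
  .appName("artifact-tests")   // hypothetical name
  .getOrCreate()
```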
HyukjinKwon added a commit that referenced this pull request Jul 18, 2023
### What changes were proposed in this pull request?

This PR is a followup of #41942 that does `substring(1)` to remove the leading slash, so the path becomes a relative part of the URI. Otherwise, it can end up with double slashes in the middle.

### Why are the changes needed?

To avoid unnecessary double slashes ... and save one byte :-)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually tested. It's really trivial.

Closes #42051 from HyukjinKwon/minor-change.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
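A small illustration of the double-slash issue (the URI and base path here are made up):

```scala
import java.net.URI

val uri = new URI("spark://host:15002/jars/my.jar")
val base = "/tmp/spark-artifacts"

// uri.getPath is absolute ("/jars/my.jar"), so naive joining yields a
// double slash in the middle:
val naive = s"$base/${uri.getPath}"               // /tmp/spark-artifacts//jars/my.jar
// Dropping the leading slash first keeps the joined path clean:
val fixed = s"$base/${uri.getPath.substring(1)}"  // /tmp/spark-artifacts/jars/my.jar
```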
@HyukjinKwon HyukjinKwon deleted the SPARK-44348 branch January 15, 2024 00:52