[SPARK-30845] Do not upload local pyspark archives for spark-submit on Yarn #27598
shanyu wants to merge 4 commits into apache:master
Conversation
merge from apache master
ok to test
Test build #123434 has finished for PR 27598 at commit
test this please
Test build #123435 has finished for PR 27598 at commit
@shanyu I'm not sure why the build is failing. Would you be able to up-merge this?
Can we please test this?
Test build #123581 has finished for PR 27598 at commit
tgravescs left a comment:
changes look good pending Jenkins, thanks for updating.
thanks @shanyu, merged to master
    }
-   pySparkArchives.foreach { f => distribute(f) }
+   pySparkArchives.foreach { f =>
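The hunk above is truncated after the new `foreach` header. A plausible reconstruction of its shape, based on the description below, is to wrap the existing `distribute(f)` call in a scheme check so that `local:` archives are skipped. This is only a sketch, assuming the surrounding `Client.prepareLocalResources` context (where `pySparkArchives` and `distribute` are in scope and `org.apache.spark.util.Utils` is imported); it is not necessarily the exact merged code:

```scala
// Sketch only: skip distribution for archives on the local: scheme, which are
// expected to already exist on every node; everything else is distributed as before.
pySparkArchives.foreach { f =>
  val uri = Utils.resolveURI(f)
  if (uri.getScheme != "local") {
    distribute(f)
  }
}
```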
Does it work when Spark is not installed on the other nodes? IIRC, we can run the application in a cluster where Spark is not installed because the jars are shipped together in YARN cluster mode.
Likewise, PySpark was able to run. From my very cursory look, this is going to break that case because it will not distribute the local pyspark archive anymore. Can you confirm this, @shanyu and @tgravescs?
This is the case where someone explicitly put local: on the URL, so it's expected to be on every machine. YARN distributes everything that is file: or downloads it if it's hdfs:.
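To make that scheme handling concrete, here is a small self-contained illustration of the behavior described in the comment above. It is not Spark's internal code, and the helper name `yarnHandling` is made up for this example:

```scala
import java.net.URI

// Illustration only: map an archive URI's scheme to the handling described above.
def yarnHandling(path: String): String =
  Option(new URI(path).getScheme) match {
    case Some("local") => "expected to already exist on every node; not distributed"
    case Some("hdfs")  => "fetched/localized from HDFS by YARN"
    case _             => "uploaded from the client to the YARN staging dir / distributed cache"
  }

// yarnHandling("local:/opt/spark/python/lib/pyspark.zip")
//   -> "expected to already exist on every node; not distributed"
```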
Okay, thanks for the clarification. LGTM
What changes were proposed in this pull request?
Use spark-submit to submit a PySpark app on YARN, with the following set in spark-env.sh:
export PYSPARK_ARCHIVES_PATH=local:/opt/spark/python/lib/pyspark.zip,local:/opt/spark/python/lib/py4j-0.10.7-src.zip
You can see that these local archives are still uploaded to the YARN distributed cache:
yarn.Client: Uploading resource file:/opt/spark/python/lib/pyspark.zip -> hdfs://myhdfs/user/test1/.sparkStaging/application_1581024490249_0001/pyspark.zip
This PR fixes the issue by checking the files specified in PYSPARK_ARCHIVES_PATH: if they are local: archives, they are not distributed to the YARN distributed cache.
Why are the changes needed?
To let PySpark apps support local pyspark archives set in PYSPARK_ARCHIVES_PATH.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing tests and manual tests.