[SPARK-30845] Do not upload local pyspark archives for spark-submit on Yarn #27598

Closed
shanyu wants to merge 4 commits into apache:master from shanyu:shanyu-30845

Conversation

@shanyu (Contributor) commented Feb 16, 2020

What changes were proposed in this pull request?

Use spark-submit to submit a PySpark app on Yarn, with the following set in spark-env.sh:
export PYSPARK_ARCHIVES_PATH=local:/opt/spark/python/lib/pyspark.zip,local:/opt/spark/python/lib/py4j-0.10.7-src.zip

You can see that these local archives are still uploaded to the Yarn distributed cache:
yarn.Client: Uploading resource file:/opt/spark/python/lib/pyspark.zip -> hdfs://myhdfs/user/test1/.sparkStaging/application_1581024490249_0001/pyspark.zip

This PR fixes the issue by checking the files specified in PYSPARK_ARCHIVES_PATH: if an entry is a local: archive, it is not distributed to the Yarn dist cache (see the sketch below).
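As a rough illustration of the decision (not the PR's verbatim code; the object and helper names here are hypothetical), the check reduces to inspecting the URI scheme of each archive:

```scala
import java.net.URI

// Minimal standalone sketch of the idea in this PR (hypothetical helper,
// not the actual Client.scala code): an archive is uploaded to the YARN
// staging directory only if its scheme is not local:.
object SkipLocalArchives {
  // A bare path has no scheme; treat it like file: and distribute it.
  def shouldDistribute(path: String): Boolean =
    Option(new URI(path).getScheme).forall(_ != "local")

  def main(args: Array[String]): Unit = {
    Seq(
      "local:/opt/spark/python/lib/pyspark.zip",       // pre-installed on every node -> skip
      "file:/opt/spark/python/lib/py4j-0.10.7-src.zip" // client-side file -> upload
    ).foreach { f =>
      println(s"$f -> ${if (shouldDistribute(f)) "distribute" else "skip (local)"}")
    }
  }
}
```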

Why are the changes needed?

So that PySpark apps can use local pyspark archives specified in PYSPARK_ARCHIVES_PATH without uploading them on every submission.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests and manual tests.

@gatorsmile (Member) commented:

cc @HyukjinKwon @vanzin @tgravescs

@tgravescs (Contributor) commented:

ok to test

@SparkQA commented Jun 2, 2020

Test build #123434 has finished for PR 27598 at commit f212848.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor) commented:

test this please

@SparkQA commented Jun 2, 2020

Test build #123435 has finished for PR 27598 at commit f212848.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Copy link
Contributor

@shanyu I'm not sure why the build is failing; would you be able to up-merge this?

Signed-off-by: Shanyu Zhao <shzhao@microsoft.com>
@shanyu (Contributor, Author) commented Jun 5, 2020

Can we please test this?

@SparkQA commented Jun 5, 2020

Test build #123581 has finished for PR 27598 at commit 20a7a9c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor) left a review comment:

changes look good pending Jenkins, thanks for updating.

@tgravescs (Contributor) commented:

thanks @shanyu, merged to master

asfgit closed this in 37b7d32 on Jun 8, 2020
The inline review comments below are anchored on this hunk in the Yarn Client (the body of the new closure is cut off in this capture):

    -  pySparkArchives.foreach { f => distribute(f) }
    +  pySparkArchives.foreach { f =>
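A plausible reconstruction of the full hunk follows. This is a sketch, not the verbatim diff: Utils.resolveURI is Spark's URI helper, and comparing against the literal "local" stands in for whatever constant the real code uses for the local: scheme.

```scala
// Before: every pyspark archive was uploaded to the YARN dist cache.
pySparkArchives.foreach { f => distribute(f) }

// After (reconstructed sketch): skip archives whose scheme is local:,
// since those are expected to exist at the same path on every node.
pySparkArchives.foreach { f =>
  val uri = Utils.resolveURI(f)
  if (uri.getScheme != "local") {
    distribute(f)
  }
}
```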
A Member reviewer commented:

Does it work when Spark is not installed on the other nodes? IIRC, we can run an application on a cluster where Spark is not installed because the jars are shipped together in Yarn cluster mode.

Likewise, PySpark was able to run. From my very cursory look, this is going to break that case because it will no longer distribute the local pyspark archive. Can you confirm this @shanyu and @tgravescs?

A Contributor replied:

This is the case where someone explicitly put local: in the URL, so it's expected to be on every machine. YARN distributes everything that is file:, or downloads it if it's hdfs:.
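To make the three cases concrete, a small illustrative snippet (plain Scala with java.net.URI rather than Spark's helpers; the hdfs:// path is made up, and the behaviors are summarized from this thread):

```scala
import java.net.URI

// How the client treats each scheme, per the discussion above.
Seq(
  "local:/opt/spark/python/lib/pyspark.zip" -> "assumed on every node; never uploaded",
  "file:/opt/spark/python/lib/pyspark.zip"  -> "uploaded to the staging directory",
  "hdfs://myhdfs/libs/pyspark.zip"          -> "already remote; localized by YARN"
).foreach { case (uri, behavior) =>
  println(s"${new URI(uri).getScheme}: $behavior")
}
```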

The Member replied:

Okay, thanks for the clarification. LGTM.
