Skip to content

Conversation

@peter-toth
Copy link
Contributor

What changes were proposed in this pull request?

This reverts commit 866df69.

Why are the changes needed?

After the change environment variables were not substituted in user classpath entries. Please find an example on SPARK-35672.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

…utors using config instead of command line"

This reverts commit 866df69.
@peter-toth
Copy link
Contributor Author

cc @xkrogen, @tgravescs

Copy link
Contributor

@tgravescs tgravescs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like a clean revert. Would be great to get @xkrogen to comment

@tgravescs
Copy link
Contributor

Note we will need to revert from the spark 3.1 branch as well.

@SparkQA
Copy link

SparkQA commented Sep 23, 2021

Test build #143562 has finished for PR 34082 at commit 5ae2d9a.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 23, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48071/

@xkrogen
Copy link
Contributor

xkrogen commented Sep 23, 2021

Thanks for reporting this @peter-toth !
From a technical perspective the revert LGTM.
I'm interested in fixing-forward -- either hiding this behind a feature flag, or adding logic to handle environment variables when extracting the values from the conf. I don't think either approach would require much time (could be completed by, say, tomorrow), but I understand if there is a desire to revert given that we are already in RC phase.

cc @mridulm as well.

@peter-toth
Copy link
Contributor Author

I'm interested in fixing-forward -- either hiding this behind a feature flag, or adding logic to handle environment variables when extracting the values from the conf. I don't think either approach would require much time (could be completed by, say, tomorrow), but I understand if there is a desire to revert given that we are already in RC phase.

I'm open to other kind of fixes, just wanted to unblock RCs quickly with this revert...

@xkrogen
Copy link
Contributor

xkrogen commented Sep 23, 2021

Ack, let me see if I can get at least a POC fix-forward up by today, and if we think it looks good we can continue with that approach. If I'm not able to get something up by today, we revert to unblock the RC. Sound okay?

@SparkQA
Copy link

SparkQA commented Sep 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48072/

@SparkQA
Copy link

SparkQA commented Sep 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48072/

@SparkQA
Copy link

SparkQA commented Sep 23, 2021

Test build #143563 has finished for PR 34082 at commit 8f557d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xkrogen
Copy link
Contributor

xkrogen commented Sep 23, 2021

Put up #34084, PTAL folks.

@HyukjinKwon
Copy link
Member

oh

@HyukjinKwon
Copy link
Member

yeah, let's revert it first.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Sep 24, 2021

cc @gengliangwang . I think it has to go through up to branch-3.1. EDIT: just saw #34082 (comment) :-).

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Sep 24, 2021

Merged to master.

@HyukjinKwon
Copy link
Member

@peter-toth, it has a conflict on branch-3.2. It seems trivial but I think we should better make a PR since we will release right away. mind creating a backporting Pr please?

@gengliangwang
Copy link
Member

Yeah let's revert it now and fix it later.
Thank you @peter-toth

@gengliangwang
Copy link
Member

I created a backport PR in #34088

@HyukjinKwon
Copy link
Member

Thx!

attilapiros pushed a commit that referenced this pull request Nov 16, 2021
…ing config instead of command line

### What changes were proposed in this pull request?
Refactor the logic for constructing the user classpath from `yarn.ApplicationMaster` into `yarn.Client` so that it can be leveraged on the executor side as well, instead of having the driver construct it and pass it to the executor via command-line arguments. A new method, `getUserClassPath`, is added to `CoarseGrainedExecutorBackend` which defaults to `Nil` (consistent with the existing behavior where non-YARN resource managers do not configure the user classpath). `YarnCoarseGrainedExecutorBackend` overrides this to construct the user classpath from the existing `APP_JAR` and `SECONDARY_JARS` configs. Within `yarn.Client`, environment variables in the configured paths are resolved before constructing the classpath.

Please note that this is a re-submission of #32810, which was reverted in #34082 due to the issues described in [this comment](https://issues.apache.org/jira/browse/SPARK-35672?focusedCommentId=17419285&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17419285). This PR additionally includes the changes described in #34084 to resolve the issue, though this PR has been enhanced to properly handle escape strings, unlike #34084.

### Why are the changes needed?
User-provided JARs are made available to executors using a custom classloader, so they do not appear on the standard Java classpath. Instead, they are passed as a list to the executor which then creates a classloader out of the URLs. Currently in the case of YARN, this list of JARs is crafted by the Driver (in `ExecutorRunnable`), which then passes the information to the executors (`CoarseGrainedExecutorBackend`) by specifying each JAR on the executor command line as `--user-class-path /path/to/myjar.jar`. This can cause extremely long argument lists when there are many JARs, which can cause the OS argument length to be exceeded, typically manifesting as the error message:

> /bin/bash: Argument list too long

A [Google search](https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22&oq=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22) indicates that this is not a theoretical problem and afflicts real users, including ours. Passing this list using the configurations instead resolves this issue.

### Does this PR introduce _any_ user-facing change?
There is one small behavioral change which is a bug fix. Previously the `spark.yarn.config.gatewayPath` and `spark.yarn.config.replacementPath` options were only applied to executors, meaning they would not work for the driver when running in cluster mode. This appears to be a bug; the [documentation for this functionality](https://spark.apache.org/docs/latest/running-on-yarn.html) does not mention any limitations that this is only for executors. This PR fixes that issue.

Additionally, this fixes the main bash argument length issue, allowing for larger JAR lists to be passed successfully. Configuration of JARs is identical to before, and substitution of environment variables in `spark.jars` or `spark.yarn.config.replacementPath` works as expected.

### How was this patch tested?
New unit tests were added in `YarnClusterSuite`. Also, we have been running a similar fix internally for 4 months with great success.

Closes #34120 from xkrogen/xkrogen-SPARK-35672-yarn-classpath-list-take2.

Authored-by: Erik Krogen <[email protected]>
Signed-off-by: attilapiros <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants