Skip to content

Conversation

@xkrogen
Copy link
Contributor

@xkrogen xkrogen commented Jun 25, 2021

What changes were proposed in this pull request?

Refactor the logic for constructing the user classpath from yarn.ApplicationMaster into yarn.Client so that it can be leveraged on the executor side as well, instead of having the driver construct it and pass it to the executor via command-line arguments. A new method, getUserClassPath, is added to CoarseGrainedExecutorBackend which defaults to Nil (consistent with the existing behavior where non-YARN resource managers do not configure the user classpath). YarnCoarseGrainedExecutorBackend overrides this to construct the user classpath from the existing APP_JAR and SECONDARY_JARS configs.

Why are the changes needed?

User-provided JARs are made available to executors using a custom classloader, so they do not appear on the standard Java classpath. Instead, they are passed as a list to the executor which then creates a classloader out of the URLs. Currently in the case of YARN, this list of JARs is crafted by the Driver (in ExecutorRunnable), which then passes the information to the executors (CoarseGrainedExecutorBackend) by specifying each JAR on the executor command line as --user-class-path /path/to/myjar.jar. This can cause extremely long argument lists when there are many JARs, which can cause the OS argument length to be exceeded, typically manifesting as the error message:

/bin/bash: Argument list too long

A Google search indicates that this is not a theoretical problem and afflicts real users, including ours. Passing this list using the configurations instead resolves this issue.

Does this PR introduce any user-facing change?

No, except for fixing the bug, allowing for larger JAR lists to be passed successfully. Configuration of JARs is identical to before.

How was this patch tested?

New unit tests were added in YarnClusterSuite. Also, we have been running a similar fix internally for 4 months with great success.

Note that this is a backport of #32810 with minor conflicts around imports.

…ing config instead of command line

Refactor the logic for constructing the user classpath from `yarn.ApplicationMaster` into `yarn.Client` so that it can be leveraged on the executor side as well, instead of having the driver construct it and pass it to the executor via command-line arguments. A new method, `getUserClassPath`, is added to `CoarseGrainedExecutorBackend` which defaults to `Nil` (consistent with the existing behavior where non-YARN resource managers do not configure the user classpath). `YarnCoarseGrainedExecutorBackend` overrides this to construct the user classpath from the existing `APP_JAR` and `SECONDARY_JARS` configs.

User-provided JARs are made available to executors using a custom classloader, so they do not appear on the standard Java classpath. Instead, they are passed as a list to the executor which then creates a classloader out of the URLs. Currently in the case of YARN, this list of JARs is crafted by the Driver (in `ExecutorRunnable`), which then passes the information to the executors (`CoarseGrainedExecutorBackend`) by specifying each JAR on the executor command line as `--user-class-path /path/to/myjar.jar`. This can cause extremely long argument lists when there are many JARs, which can cause the OS argument length to be exceeded, typically manifesting as the error message:

> /bin/bash: Argument list too long

A [Google search](https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22&oq=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22) indicates that this is not a theoretical problem and afflicts real users, including ours. Passing this list using the configurations instead resolves this issue.

No, except for fixing the bug, allowing for larger JAR lists to be passed successfully. Configuration of JARs is identical to before.

New unit tests were added in `YarnClusterSuite`. Also, we have been running a similar fix internally for 4 months with great success.

Closes apache#32810 from xkrogen/xkrogen-SPARK-35672-classpath-scalable.

Authored-by: Erik Krogen <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>

(cherry picked from commit 866df69)
Copy link
Contributor

@tgravescs tgravescs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, pending Jenkins

@SparkQA
Copy link

SparkQA commented Jun 25, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44861/

@SparkQA
Copy link

SparkQA commented Jun 25, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44861/

@SparkQA
Copy link

SparkQA commented Jun 25, 2021

Test build #140329 has finished for PR 33090 at commit 5963a21.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @xkrogen and @tgravescs . Just a quick question: Removal of --user-class-path looks like a breaking change. Is it safe for backporting? Is it purely internal?

@xkrogen
Copy link
Contributor Author

xkrogen commented Jun 25, 2021

Thanks for looking @dongjoon-hyun ! It's not a user-facing parameter, it's only used by the driver when constructing the arguments to use when launching the executor. Purely internal, as you mentioned.

@tgravescs
Copy link
Contributor

All of those classes are private[spark] so should be safe to back port. Do you know of anyone doing customization there?
it is kind of a core change to how that is passed though too so really I'm fine either way.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Jun 25, 2021

Thank you for the confirmation, @xkrogen and @tgravescs . I don't know anyone either. Just the following printUsageAndExit hit my head.

  private def printUsageAndExit(classNameForEntry: String): Unit = {
    // scalastyle:off println
    System.err.println(
      s"""
      |Usage: $classNameForEntry [options]
      |
      | Options are:
      |   --driver-url <driverUrl>
      |   --executor-id <executorId>
      |   --bind-address <bindAddress>
      |   --hostname <hostname>
      |   --cores <cores>
       |   --resourcesFile <fileWithJSONResourceInformation>
       |   --app-id <appid>
       |   --worker-url <workerUrl>
       |   --user-class-path <url>
       |   --resourceProfileId <id>
       |""".stripMargin)
     // scalastyle:on println

@mridulm
Copy link
Contributor

mridulm commented Jun 25, 2021

I always thought that was for our internal debugging @dongjoon-hyun, though it is fairly nicely done :-)

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me 2

@mridulm
Copy link
Contributor

mridulm commented Jun 28, 2021

Looks like something messed up for me locally w.r.t branch-3.1 ... can someone else merge it please ? Thanks.

@HyukjinKwon
Copy link
Member

Sure, Merged to branch-3.1.

HyukjinKwon pushed a commit that referenced this pull request Jun 28, 2021
…rs using config instead of command line

### What changes were proposed in this pull request?
Refactor the logic for constructing the user classpath from `yarn.ApplicationMaster` into `yarn.Client` so that it can be leveraged on the executor side as well, instead of having the driver construct it and pass it to the executor via command-line arguments. A new method, `getUserClassPath`, is added to `CoarseGrainedExecutorBackend` which defaults to `Nil` (consistent with the existing behavior where non-YARN resource managers do not configure the user classpath). `YarnCoarseGrainedExecutorBackend` overrides this to construct the user classpath from the existing `APP_JAR` and `SECONDARY_JARS` configs.

### Why are the changes needed?
User-provided JARs are made available to executors using a custom classloader, so they do not appear on the standard Java classpath. Instead, they are passed as a list to the executor which then creates a classloader out of the URLs. Currently in the case of YARN, this list of JARs is crafted by the Driver (in `ExecutorRunnable`), which then passes the information to the executors (`CoarseGrainedExecutorBackend`) by specifying each JAR on the executor command line as `--user-class-path /path/to/myjar.jar`. This can cause extremely long argument lists when there are many JARs, which can cause the OS argument length to be exceeded, typically manifesting as the error message:

> /bin/bash: Argument list too long

A [Google search](https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22&oq=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22) indicates that this is not a theoretical problem and afflicts real users, including ours. Passing this list using the configurations instead resolves this issue.

### Does this PR introduce _any_ user-facing change?
No, except for fixing the bug, allowing for larger JAR lists to be passed successfully. Configuration of JARs is identical to before.

### How was this patch tested?
New unit tests were added in `YarnClusterSuite`. Also, we have been running a similar fix internally for 4 months with great success.

Note that this is a backport of #32810 with minor conflicts around imports.

Closes #33090 from xkrogen/xkrogen-SPARK-35672-classpath-scalable-branch-3.1.

Authored-by: Erik Krogen <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
@xkrogen
Copy link
Contributor Author

xkrogen commented Jun 28, 2021

Thanks @HyukjinKwon ! And for the extra sets of eyes @dongjoon-hyun @tgravescs @mridulm

@xkrogen xkrogen deleted the xkrogen-SPARK-35672-classpath-scalable-branch-3.1 branch June 28, 2021 15:56
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
…rs using config instead of command line

### What changes were proposed in this pull request?
Refactor the logic for constructing the user classpath from `yarn.ApplicationMaster` into `yarn.Client` so that it can be leveraged on the executor side as well, instead of having the driver construct it and pass it to the executor via command-line arguments. A new method, `getUserClassPath`, is added to `CoarseGrainedExecutorBackend` which defaults to `Nil` (consistent with the existing behavior where non-YARN resource managers do not configure the user classpath). `YarnCoarseGrainedExecutorBackend` overrides this to construct the user classpath from the existing `APP_JAR` and `SECONDARY_JARS` configs.

### Why are the changes needed?
User-provided JARs are made available to executors using a custom classloader, so they do not appear on the standard Java classpath. Instead, they are passed as a list to the executor which then creates a classloader out of the URLs. Currently in the case of YARN, this list of JARs is crafted by the Driver (in `ExecutorRunnable`), which then passes the information to the executors (`CoarseGrainedExecutorBackend`) by specifying each JAR on the executor command line as `--user-class-path /path/to/myjar.jar`. This can cause extremely long argument lists when there are many JARs, which can cause the OS argument length to be exceeded, typically manifesting as the error message:

> /bin/bash: Argument list too long

A [Google search](https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22&oq=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22) indicates that this is not a theoretical problem and afflicts real users, including ours. Passing this list using the configurations instead resolves this issue.

### Does this PR introduce _any_ user-facing change?
No, except for fixing the bug, allowing for larger JAR lists to be passed successfully. Configuration of JARs is identical to before.

### How was this patch tested?
New unit tests were added in `YarnClusterSuite`. Also, we have been running a similar fix internally for 4 months with great success.

Note that this is a backport of apache#32810 with minor conflicts around imports.

Closes apache#33090 from xkrogen/xkrogen-SPARK-35672-classpath-scalable-branch-3.1.

Authored-by: Erik Krogen <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit b4916d4)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants