
Conversation

@nonsleepr

What changes were proposed in this pull request?

This PR allows users to configure the name of the Spark shuffle service, so that vanilla Spark can run on HDP Hadoop clusters.

Why are the changes needed?

As of now, Spark uses the hardcoded value spark_shuffle as the name of the Shuffle Service.

The HDP distribution of Spark, on the other hand, uses spark2_shuffle. This is done so that both Spark 1.6 and Spark 2.x can run on the same Hadoop cluster.

Running vanilla Spark on an HDP cluster with only the Spark 2.x shuffle service (the HDP flavor) running becomes impossible due to the shuffle service name mismatch.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

The test cases were changed to use the new option.
A patched version of the spark_yarn library was used to run jobs on an HDP cluster.

Alexander Bessonov added 2 commits October 2, 2019 09:22

public YarnShuffleService() {
super("spark_shuffle");
this("spark_shuffle");

Member

Is this still hardcoded? Should we use configured SHUFFLE_SERVICE_NAME?

Author

It is still hardcoded. I haven't found a way to access Spark configuration from that constructor and org.apache.hadoop.yarn.server.api.AuxiliaryService requires the name. Do you have a suggestion of how that could be done?

Member @viirya, Oct 2, 2019

As I commented below #26000 (comment), if this is just for yarn, put it in YarnShuffleService, like "spark.yarn.shuffle.stopOnFailure"?

Member

It is hardcoded here. Once the shuffle service name is configurable, won't they mismatch? Will it cause a problem?

Author

It is hardcoded here, but HDP hardcodes another value (spark2_shuffle). Vanilla Spark would keep working as is, using the name spark_shuffle, while the new configuration option would let users point Spark at a non-vanilla shuffle service.
The changes to that class are there only to test that changing the service name here and in the configuration play nicely together.

Author

It seems impossible to register the service with the name passed in the configuration because the configuration is passed after the class is instantiated.
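The timing issue described above can be sketched with a small self-contained toy. AuxService here is a hypothetical stand-in for Hadoop's org.apache.hadoop.yarn.server.api.AuxiliaryService (and LifecycleSketch for YarnShuffleService); the Map-based config is a simplification, not the real Hadoop Configuration API:

```java
// Sketch (assumptions, not the real Hadoop classes): the service name is
// fixed in the constructor, while configuration only arrives later via
// serviceInit(), so a configured name can never reach registration.
import java.util.HashMap;
import java.util.Map;

abstract class AuxService {
    private final String name;                      // fixed at construction time

    AuxService(String name) { this.name = name; }

    String getName() { return name; }

    // The NodeManager calls this *after* instantiation, so nothing read
    // here can influence the name passed to the constructor above.
    abstract void serviceInit(Map<String, String> conf);
}

public class LifecycleSketch extends AuxService {

    public LifecycleSketch() {
        super("spark_shuffle");                     // name is hardcoded here
    }

    @Override
    void serviceInit(Map<String, String> conf) {
        // Too late: the service is already registered as "spark_shuffle",
        // whatever spark.shuffle.service.name says.
        String configured = conf.getOrDefault("spark.shuffle.service.name", getName());
        System.out.println("configured: " + configured + ", registered: " + getName());
    }

    public static void main(String[] args) {
        LifecycleSketch svc = new LifecycleSketch();
        Map<String, String> conf = new HashMap<>();
        conf.put("spark.shuffle.service.name", "spark2_shuffle");
        svc.serviceInit(conf);
        // prints: configured: spark2_shuffle, registered: spark_shuffle
    }
}
```

This is why, in the discussion below, the new option can only steer which service a client connects to; it cannot change the name the service itself registers under.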

Member

I see. So this config can only be used to let Spark choose which service to connect. It cannot change the name of Shuffle Service.

Author

Yes. I guess I could implement a workaround that reads the setting from the default Configuration, but that, at least theoretically, wouldn't guarantee that the exact same configuration would be passed during service initialization.

ConfigBuilder("spark.shuffle.service.port").intConf.createWithDefault(7337)

private[spark] val SHUFFLE_SERVICE_NAME =
ConfigBuilder("spark.shuffle.service.name").stringConf.createWithDefault("spark_shuffle")

Member

This is just for the YARN external shuffle service, right? If so, just put it in YarnShuffleService and name it spark.yarn.shuffle.service.name?

Author @nonsleepr, Oct 2, 2019

There's spark.shuffle.service.port right above it, which specifies the port of said service. If the ports match but the name doesn't, the service won't be located.

EDIT: I would think being consistent with the names is more important than properly namespacing the option.

Member

Not sure if I understand you correctly. Only YARN needs this config; can't it be in YarnShuffleService, like "spark.yarn.shuffle.stopOnFailure"?

Member

You only need to make sure ExecutorRunnable and YarnShuffleService use the same config, and that it is configured correctly in yarn-site.xml, isn't it? Or am I missing anything here?

Author

Oh. Okay. I'm convinced.
Will move the option and the docs to YARN namespace.

</td>
</tr>
<tr>
<td><code>spark.shuffle.service.name</code></td>

Member

For yarn, docs/running-on-yarn.md is more suitable for documentation?

Author

You're right. Will move it there.

this("spark_shuffle");
}

public YarnShuffleService(String serviceName) {

Member @viirya, Oct 2, 2019

Is it necessary to add a new constructor? Couldn't you just read the configured service name inside the original constructor and do the initialization there?

Author

This is mostly to be able to test it, since the yarn.nodemanager.aux-services.spark_shuffle.class option assumes a no-args constructor. Maybe make it protected?
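For context, the NodeManager wiring this comment refers to lives in yarn-site.xml (property names as documented in running-on-yarn.md; the listed class is instantiated reflectively through its no-args constructor), roughly:

```xml
<!-- Registers the shuffle service under the name "spark_shuffle";
     the NodeManager reflectively calls the class's no-args constructor. -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
```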

</td>
</tr>
<tr>
<td><code>spark.yarn.shuffle.service.name</code></td>

Author

I added this option under the Spark Properties section and not under the Configuring the External Shuffle Service section because it's a "client" setting, not a setting of the external shuffle service itself.

@tgravescs
Contributor

ok to test

@SparkQA

SparkQA commented Oct 4, 2019

Test build #111780 has finished for PR 26000 at commit 27e5c87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

<td><code>spark.yarn.shuffle.service.name</code></td>
<td><code>spark_shuffle</code></td>
<td>
Name of the external shuffle service.

Contributor

I think we need more description here. This isn't setting what the service runs as; you have to configure that via YARN. This is the external shuffle service name that executors use when launching the container.

Author

That's what the word "external" is for, isn't it?
Should I reword it as follows?

Name of the external shuffle service used by executors.
The external service is configured and started by YARN (see Configuring the External Shuffle Service for details). The Apache Spark distribution uses the name spark_shuffle, but other distributions (e.g. HDP) might use other names.

Contributor

Many newbies aren't familiar with what the external shuffle service is, or even YARN, so it's best to be clear. How about:

The name of the external shuffle service.
The external shuffle service itself is configured and started by YARN (see Configuring the External Shuffle Service for details). The name specified here must match the name YARN used.
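Put together, the client-side usage under discussion would look roughly like this in spark-defaults.conf. Here spark.yarn.shuffle.service.name is the option proposed by this PR, and spark2_shuffle is the HDP name, used purely as an example:

```
# Enable the external shuffle service and point executors at the
# service that YARN registered under a non-default name (e.g. on HDP).
spark.shuffle.service.enabled    true
spark.shuffle.service.port       7337
spark.yarn.shuffle.service.name  spark2_shuffle
```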


protected YarnShuffleService(String serviceName) {
super(serviceName);
logger.info("Initializing YARN shuffle service for Spark");

Contributor

Let's change the log statement to include the service name.

this("spark_shuffle");
}

protected YarnShuffleService(String serviceName) {

Contributor

So the name by itself isn't going to be enough. If you really want it configurable, we are going to have to make the port configurable too. For instance, the config name for the port, spark.shuffle.service.port, needs to be able to be something like spark.shuffle.service.{serviceName}.port; otherwise all the Spark shuffle servers will try to bind the same port and fail. The only other option would be to use 0 for an ephemeral port.

Author

The name specified here is actually only useful in tests. YARN's service instantiation logic wouldn't even pass the name of the service used in the config to the instantiated service. I guess that's the main reason the names and ports are hardcoded or bound to non-namespaced configuration keys.
The way HDP overcomes that is by providing different classpaths with different implementations for different versions of the service (spark_shuffle for Spark 1.6.x and spark2_shuffle for Spark 2+). The only way I see to pass different parameters to the same implementation of the service is by providing different configs on the classpath.

I will add a comment here stating that the name is actually only used for tests, and otherwise is always hardcoded to spark_shuffle.

Contributor

I think there are a few things getting muddled together here -- one is how you'd support running two shuffle services, and the other is how a client could choose which shuffle service it talks to.

The client can already set the port for the shuffle server with spark.shuffle.service.port, it just can't set the name used in the ExecutorRunnable.

The other thing to add about how the names of the shuffle servers matter in yarn is that the name goes into yarn-site.xml as described in the "Configuring the External Shuffle Service" in running-on-yarn.md.

@nonsleepr
Author

@tgravescs, @viirya Any other concerns?

@tgravescs
Contributor

Yes, I'm not sure how much this makes sense, since you still have to make code modifications and rebuild the shuffle service. You have less to modify, but I don't want to confuse users either.
I was going to investigate whether there is another way to make it truly configurable on the YARN shuffle side, but haven't had time.

@nonsleepr
Author

Maybe rename the option from spark.yarn.shuffle.service.name to spark.yarn.shuffle.thirdPartyService.name, or something that would point to the fact that it's not for the default Spark shuffle service?

Contributor @squito left a comment

I'm not really sure how I feel about this change. First and foremost, I don't think Apache Spark needs to be making changes to accommodate talking to an HDP cluster, for what are entirely internal implementation details.

Also, are we opening the door here for Apache Spark supporting running multiple external services? It may not be hard, but do we want to test it and document it? I'm worried about the confusion this will cause.

That said, this is a relatively small change to allow the clients to choose which service they want to talk to, if multiple are running. Maybe that'll become more common with Spark3 coming out, where users will want a spark 2.4 and spark 3 shuffle service running.

<td><code>spark_shuffle</code></td>
<td>
The name of the external shuffle service.
The external shuffle service itself is configured and started by YARN (see [Configuring the External Shuffle Service](#configuring-the-external-shuffle-service) for details). The name specified here must match the name used in YARN service implementation.

Contributor

I think it would help to mention that this must match the name given to the shuffle service in yarn-site.xml under yarn.nodemanager.aux-services.

@squito
Contributor

squito commented Dec 6, 2019

cc @attilapiros

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Mar 16, 2020
@github-actions github-actions bot closed this Mar 17, 2020
@koertkuipers
Contributor

Thanks for this!
We merged this into our in-house Spark build. It allows us to run on our clients' HDP and HDInsight platforms using the provided spark2_shuffle shuffle service.
