[SPARK-17321][YARN] Avoid writing shuffle metadata to disk if NM recovery is disabled #19032
Conversation
Change-Id: Id062d71589f46052706058c151c706dae38b1e6e
CC @LiShuMing, please take a look at another approach to fix the bad disk issue. Also pinging @tgravescs to review the PR. Thanks a lot.
Test build #81067 has finished for PR 19032 at commit
vanzin left a comment:
Can you add @Override to setRecoveryPath? It took some head scratching to figure out where that method was being called from without it. While you're at it, you can update the javadoc for that method.
You also don't need the "recoveryEnabled" check; you can use the recovery path for that, since it's only set by YARN when recovery is enabled.
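To make that suggestion concrete, here is a minimal sketch of the shape being described (field and method names such as _recoveryPath are illustrative, not necessarily the PR's final code); since YARN only calls setRecoveryPath when NM recovery is enabled, a null path can stand in for the boolean flag:

// Sketch only: gate recovery behavior on the recovery path itself rather than a
// separate recoveryEnabled boolean. Uses org.apache.hadoop.fs.Path.
private Path _recoveryPath = null;

@Override
public void setRecoveryPath(Path recoveryPath) {
  _recoveryPath = recoveryPath;
}

private boolean isRecoveryEnabled() {
  // Non-null only when YARN has enabled NM recovery and handed us a path.
  return _recoveryPath != null;
}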
public class YarnShuffleService extends AuxiliaryService {

  private static final Logger logger = LoggerFactory.getLogger(YarnShuffleService.class);

  private static final boolean DEFAULT_NM_RECOVERY_ENABLED = false;
Isn't this in YarnConfiguration?
Let me check the yarn code.
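For reference, on Hadoop 2.5+ the flag and its default already live on YarnConfiguration, so a sketch of reading it without redefining the default locally (assuming that Hadoop version is available at build time) would be:

// conf is the Hadoop Configuration passed to the auxiliary service.
boolean recoveryEnabled = conf.getBoolean(
    YarnConfiguration.NM_RECOVERY_ENABLED,           // "yarn.nodemanager.recovery.enabled"
    YarnConfiguration.DEFAULT_NM_RECOVERY_ENABLED);  // false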
  boolean authEnabled = conf.getBoolean(SPARK_AUTHENTICATE_KEY, DEFAULT_SPARK_AUTHENTICATE);
  if (authEnabled) {
-   createSecretManager();
+   createSecretManager(recoveryEnabled);
I think at this point it would be cleaner to do:
secretManager = new ShuffleSecretManager();
if (recoveryEnabled) {
loadSecretsFromDb();
}
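In context, the suggested call-site shape would look roughly like the sketch below; loadSecretsFromDb() is the helper named in the snippet above and stands in for whatever DB-backed initialization the service does when recovery is on:

boolean authEnabled = conf.getBoolean(SPARK_AUTHENTICATE_KEY, DEFAULT_SPARK_AUTHENTICATE);
if (authEnabled) {
  // Always create the secret manager when authentication is on...
  secretManager = new ShuffleSecretManager();
  // ...but only touch the on-disk recovery state when NM recovery is enabled.
  if (recoveryEnabled) {
    loadSecretsFromDb();
  }
}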
Change-Id: I06866813d24af5cd6ae64f45df1c7a4ebaf2b12d
  /**
   * Figure out the recovery path and handle moving the DB if YARN NM recovery gets enabled
   * when it previously was not. If YARN NM recovery is enabled it uses that path, otherwise
   * it will uses a YARN local dir.
Need to update the comment; probably just remove the last sentence.
Change-Id: I90ae354f840534fad0b448392cab5713eb7c7171
lgtm
Test build #81180 has finished for PR 19032 at commit
Jenkins, retest this please.
@vanzin @tgravescs do you have any further comments?
Test build #81236 has finished for PR 19032 at commit
+1 go ahead and commit
  /**
   * Set the recovery path for shuffle service recovery when NM is restarted. The method will be
   * overrode and called when Hadoop version is 2.5+ and NM recovery is enabled, otherwise we
Could you update this comment since it's out of date now?
Sure I will.
Change-Id: I85129842468e74c3a91232991595916d50b206fc
Test build #81271 has finished for PR 19032 at commit
Merged to master branch.
…ervice.db.backend` in `running-on-yarn.md`

### What changes were proposed in this pull request?
From the context of [pr](#19032) for [SPARK-17321](https://issues.apache.org/jira/browse/SPARK-17321), `YarnShuffleService` will persist data into LevelDB/RocksDB when YARN NM recovery is enabled. So this PR adds a precondition description, tied to YARN NM recovery being enabled, for `spark.shuffle.service.db.backend` in `running-on-yarn.md`.

### Why are the changes needed?
Add a precondition description for `spark.shuffle.service.db.backend` in `running-on-yarn.md`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GitHub Actions.

Closes #37853 from LuciferYang/SPARK-40404.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
In the current code, if NM recovery is not enabled then `YarnShuffleService` will write shuffle metadata to NM local dir-1; if this local dir-1 is on a bad disk, `YarnShuffleService` will fail to start. To solve this issue, on the Spark side, if NM recovery is not enabled then Spark will not persist data into leveldb. In that case the YARN shuffle service can still serve requests, but loses the ability to recover, which is fine because a failure of the NM kills the containers as well as the applications.

How was this patch tested?
Tested in a local cluster with NM recovery off and on to see whether the folder is created or not. A MiniCluster UT isn't added because in MiniCluster the NM will always set the port to 0, but NM recovery requires a non-ephemeral port.
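As a rough illustration of the approach described above (a sketch, not the literal patch; _recoveryPath, initRecoveryDb, and the DB file name are stand-in names), the recovery DB is only created when YARN has provided a recovery path, i.e. when NM recovery is enabled; otherwise nothing is written to the NM local dirs and the service runs without restart recovery:

@Override
protected void serviceInit(Configuration conf) throws Exception {
  File registeredExecutorFile = null;      // stays null when NM recovery is off
  if (_recoveryPath != null) {             // set by YARN only when NM recovery is enabled
    // Persist executor registrations so they survive an NM restart.
    registeredExecutorFile = initRecoveryDb("registeredExecutors.ldb");
  }
  // With a null file the shuffle service still serves blocks; it just cannot
  // recover state after an NM restart, which is acceptable because an NM failure
  // kills the containers and their applications anyway.
  blockHandler = new ExternalShuffleBlockHandler(transportConf, registeredExecutorFile);
}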