Skip to content

Conversation

@jerryshao
Copy link
Contributor

What changes were proposed in this pull request?

In the current code, if NM recovery is not enabled then YarnShuffleService will write shuffle metadata to NM local dir-1, if this local dir-1 is on bad disk, then YarnShuffleService will be failed to start. So to solve this issue, in Spark side if NM recovery is not enabled, then Spark will not persist data into leveldb, in that case yarn shuffle service can still be served but lose the ability for recovery, (it is fine because the failure of NM will kill the containers as well as applications).

How was this patch tested?

Tested in the local cluster with NM recovery off and on to see if folder is created or not. MiniCluster UT isn't added because in MiniCluster NM will always set port to 0, but NM recovery requires non-ephemeral port.

Change-Id: Id062d71589f46052706058c151c706dae38b1e6e
@jerryshao
Copy link
Contributor Author

CC @LiShuMing please take a look at another approach to fix the bad disk issue.

Also ping @tgravescs to view the PR.

Thanks a lot.

@SparkQA
Copy link

SparkQA commented Aug 24, 2017

Test build #81067 has finished for PR 19032 at commit 5abbe75.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@vanzin vanzin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add @Override to setRecoveryPath? It took some head scratching to figure out where that method was being called from without it. While at it you can update the javadoc for that method.

You also don't need the "recoveryEnabled" check, you can use the recovery path for that since it's only set by YARN when recovery is enabled.

public class YarnShuffleService extends AuxiliaryService {
private static final Logger logger = LoggerFactory.getLogger(YarnShuffleService.class);

private static final boolean DEFAULT_NM_RECOVERY_ENABLED = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this in YarnConfiguration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me check the yarn code.

boolean authEnabled = conf.getBoolean(SPARK_AUTHENTICATE_KEY, DEFAULT_SPARK_AUTHENTICATE);
if (authEnabled) {
createSecretManager();
createSecretManager(recoveryEnabled);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think at this point it would be cleaner to do:

secretManager = new ShuffleSecretManager();
if (recoveryEnabled) {
  loadSecretsFromDb();
}

Change-Id: I06866813d24af5cd6ae64f45df1c7a4ebaf2b12d
/**
* Figure out the recovery path and handle moving the DB if YARN NM recovery gets enabled
* when it previously was not. If YARN NM recovery is enabled it uses that path, otherwise
* it will uses a YARN local dir.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to update comment, probably just remove the last sentence.

Change-Id: I90ae354f840534fad0b448392cab5713eb7c7171
@tgravescs
Copy link
Contributor

lgtm

@SparkQA
Copy link

SparkQA commented Aug 28, 2017

Test build #81180 has finished for PR 19032 at commit 9fe7e99.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao
Copy link
Contributor Author

Jenkins, retest this please.

@jerryshao
Copy link
Contributor Author

@vanzin @tgravescs do you have any further comment?

@SparkQA
Copy link

SparkQA commented Aug 30, 2017

Test build #81236 has finished for PR 19032 at commit 9fe7e99.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Copy link
Contributor

+1 go ahead and commit


/**
* Set the recovery path for shuffle service recovery when NM is restarted. The method will be
* overrode and called when Hadoop version is 2.5+ and NM recovery is enabled, otherwise we
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you update this comment since it's out of date now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure I will.

Change-Id: I85129842468e74c3a91232991595916d50b206fc
@SparkQA
Copy link

SparkQA commented Aug 31, 2017

Test build #81271 has finished for PR 19032 at commit ebe0a24.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao
Copy link
Contributor Author

Merge to master branch.

@asfgit asfgit closed this in 4482ff2 Aug 31, 2017
dongjoon-hyun pushed a commit that referenced this pull request Sep 19, 2022
…ervice.db.backend` in `running-on-yarn.md`

### What changes were proposed in this pull request?
From the context from [pr](#19032) of [SPARK-17321](https://issues.apache.org/jira/browse/SPARK-17321), `YarnShuffleService` will persist data into `Level/RocksDB` when Yarn NM recovery is enabled. So this pr adds the precondition description related to `Yarn NM recovery is enabled` for `spark.shuffle.service.db.backend`. in `running-on-yarn.md`

### Why are the changes needed?
Add precondition description for `spark.shuffle.service.db.backend` in `running-on-yarn.md`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

Closes #37853 from LuciferYang/SPARK-40404.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
LuciferYang added a commit to LuciferYang/spark that referenced this pull request Sep 20, 2022
…ervice.db.backend` in `running-on-yarn.md`

### What changes were proposed in this pull request?
From the context from [pr](apache#19032) of [SPARK-17321](https://issues.apache.org/jira/browse/SPARK-17321), `YarnShuffleService` will persist data into `Level/RocksDB` when Yarn NM recovery is enabled. So this pr adds the precondition description related to `Yarn NM recovery is enabled` for `spark.shuffle.service.db.backend`. in `running-on-yarn.md`

### Why are the changes needed?
Add precondition description for `spark.shuffle.service.db.backend` in `running-on-yarn.md`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

Closes apache#37853 from LuciferYang/SPARK-40404.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants