
[SPARK-18119][SPARK-CORE] Namenode safemode check is only performed on one namenode which can stuck the startup of SparkHistory server#15648

Closed
ashangit wants to merge 1 commit into apache:master from criteo-forks:SPARK-18119

Conversation

@ashangit
Contributor

What changes were proposed in this pull request?

Instead of using the setSafeMode method that checks the first namenode, use the overload that permits checking only the active NNs.

How was this patch tested?

manual tests

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.

This commit is contributed by Criteo SA under the Apache v2 licence.
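A minimal sketch of the behavioural difference, using hypothetical stand-in classes (the real call being changed is Hadoop's `DistributedFileSystem.setSafeMode(SafeModeAction, isChecked)`; `SafeModeCheckSketch` and `NameNode` here are invented for illustration):

```java
import java.util.List;

public class SafeModeCheckSketch {
    // A minimal model of a NameNode: whether it is active, and its safe mode flag.
    record NameNode(boolean active, boolean inSafeMode) {}

    // Mimics setSafeMode(SAFEMODE_GET, false): asks the first configured NN,
    // whatever HA state it is in -- a standby stuck in safe mode blocks startup.
    static boolean safeModeUnchecked(List<NameNode> nns) {
        return nns.get(0).inSafeMode();
    }

    // Mimics setSafeMode(SAFEMODE_GET, true): only an active NN's answer counts,
    // so the client effectively fails over past standbys.
    static boolean safeModeChecked(List<NameNode> nns) {
        return nns.stream()
                  .filter(NameNode::active)
                  .findFirst()
                  .map(NameNode::inSafeMode)
                  .orElseThrow(() -> new IllegalStateException("no active NN"));
    }

    public static void main(String[] args) {
        // First NN is a standby reporting safe mode; second is active and healthy.
        List<NameNode> nns = List.of(new NameNode(false, true),
                                     new NameNode(true, false));
        System.out.println(safeModeUnchecked(nns)); // true: history server would hang
        System.out.println(safeModeChecked(nns));   // false: startup proceeds
    }
}
```

With the first configured NN on standby and still in safe mode, the unchecked variant reports safe mode forever, which is the SparkHistory startup hang this PR addresses.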

Contributor

@holdenk left a comment


This looks reasonable from skimming the API; ideally it would be nice to have a test, but that might be overkill for such a simple change. Maybe @srowen or someone can take a look?

Contributor


minor: I think it isn't just for testing anymore; maybe we should remove this comment since we are updating this function anyway?

Contributor


Should mention what the true is for (e.g. with a named param, or just something like /* isChecked - run on Active NN */).
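The suggestion above can be sketched as follows (a toy example; `FlagCommentSketch` and `checkSafeMode` are invented names, not the real Spark code):

```java
public class FlagCommentSketch {
    // A function taking a bare boolean flag, like setSafeMode's isChecked param.
    static String checkSafeMode(boolean isChecked) {
        return isChecked ? "active NN only" : "first NN";
    }

    public static void main(String[] args) {
        // Bare literal: unclear at the call site what `true` means.
        System.out.println(checkSafeMode(true));
        // With the suggested inline comment, the intent is self-documenting:
        System.out.println(checkSafeMode(true /* isChecked: run on active NN */));
    }
}
```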

…n one namenode which can stuck the startup of SparkHistory server

This commit is contributed by Criteo SA under the Apache v2 licence.
@ashangit
Contributor Author

ashangit commented Nov 7, 2016

Comments taken into account.
As for the test: it is indeed a really simple change, but let me know if we need to add one.

@ashangit
Contributor Author

@holdenk and @srowen any news?
Thanks

Member

@srowen left a comment


I think it's OK, given the description in the JIRA. Eh, @squito do you know enough about HDFS to have an opinion? I checked, and this method is indeed available from Hadoop 2.2 onwards so we're OK there.

@squito
Contributor

squito commented Nov 21, 2016

@srowen I'm actually not 100% sure what should happen with hdfs HA -- lemme ask around. This looks right but worth checking that we're not covering up some other cluster issue.

@srowen
Member

srowen commented Nov 22, 2016

HDFS folks internally indicated this is likely the right change. Let's leave it for a bit before committing.

@srowen
Member

srowen commented Nov 24, 2016

CC @steveloughran who may also have good insight into whether this is the right change for HDFS HA.

@steveloughran
Contributor

steveloughran commented Nov 24, 2016

LGTM; as the javadocs say, If true check only for Active NNs status, else check first NN's status. But I don't know enough about HDFS HA to be sure.

It'll check the first NN; if that is on standby and stale reads are not allowed, it'll log at error (HDFS-3477 proposes downgrading that) and throw an exception pointing at the url https://s.apache.org/sbnn-error. If someone sets dfs.ha.allow.stale.reads then they get the old safe mode state; there's nothing that can be done there.

Where my knowledge of HDFS HA fails is what happens then: does the RPC client try another NN, or does it just fail? Maybe @liuml07 could assist there.

The method went in with Hadoop 2.0.3-alpha in HDFS-3507, so it is available across the whole Hadoop 2.x line. The enum used did change in 2015, with HDFS-4015 adding SAFEMODE_FORCE_EXIT; that action must be avoided. Luckily, the history server isn't trying to exit safe mode.
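The enum shape Steve describes looks roughly like this (values mirror Hadoop's `HdfsConstants.SafeModeAction`; the surrounding class is a stand-in for illustration):

```java
public class SafeModeActionSketch {
    // Mirrors HdfsConstants.SafeModeAction; SAFEMODE_FORCE_EXIT arrived with HDFS-4015.
    enum SafeModeAction { SAFEMODE_LEAVE, SAFEMODE_ENTER, SAFEMODE_GET, SAFEMODE_FORCE_EXIT }

    // The history server only ever queries state; it never enters, leaves,
    // or force-exits safe mode.
    static boolean isReadOnly(SafeModeAction a) {
        return a == SafeModeAction.SAFEMODE_GET;
    }

    public static void main(String[] args) {
        System.out.println(isReadOnly(SafeModeAction.SAFEMODE_GET));        // true
        System.out.println(isReadOnly(SafeModeAction.SAFEMODE_FORCE_EXIT)); // false
    }
}
```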

@SparkQA

SparkQA commented Nov 24, 2016

Test build #3437 has finished for PR 15648 at commit eb76612.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Nov 25, 2016

Merged to master/2.1

@asfgit asfgit closed this in f42db0c Nov 25, 2016
asfgit pushed a commit that referenced this pull request Nov 25, 2016
…n one namenode which can stuck the startup of SparkHistory server

## What changes were proposed in this pull request?

Instead of using the setSafeMode method that check the first namenode used the one which permitts to check only for active NNs
## How was this patch tested?

manual tests

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.

This commit is contributed by Criteo SA under the Apache v2 licence.

Author: n.fraison <n.fraison@criteo.com>

Closes #15648 from ashangit/SPARK-18119.

(cherry picked from commit f42db0c)
Signed-off-by: Sean Owen <sowen@cloudera.com>
@ashangit ashangit deleted the SPARK-18119 branch November 25, 2016 10:00
@liuml07
Member

liuml07 commented Nov 28, 2016

Sorry for coming late. The change is very reasonable. Glad it's merged.

Steve:

Where my knowledge of HDFS HA fails is what happens then: does the RPC client try another NN, or does it just fail? Maybe @liuml07 could assist there.

So in general, setSafeMode(true) is a READ operation, while setSafeMode(false) is an unchecked operation.

  1. The standby NN will throw an exception for READ operations unless it is configured to allow them via the config key dfs.ha.allow.stale.reads. If the StandbyException is caught on the client side, the DFSClient's internal proxy will automatically retry multiple times (say, 10) against the other NN (hopefully the active one). So the setSafeMode(true) request will eventually reach the active NN.
  2. In contrast, setSafeMode(false) will go to a NameNode (possibly a standby), which will return its own safe mode status instead of throwing a StandbyException. This return value may fool a client that was expecting the active NN's safe mode state.

By the way, the above comments make sense only if we're using the logical HDFS service name.
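The two cases above can be sketched as a toy model (class and method names are invented; the real failover logic lives in the HDFS client's retry proxy):

```java
import java.util.List;

public class HaRetrySketch {
    static class StandbyException extends RuntimeException {}

    static class NameNode {
        final boolean active;
        final boolean inSafeMode;
        NameNode(boolean active, boolean inSafeMode) {
            this.active = active;
            this.inSafeMode = inSafeMode;
        }
        // Case 1: READ operations are rejected by a standby with StandbyException.
        boolean readSafeMode() {
            if (!active) throw new StandbyException();
            return inSafeMode;
        }
        // Case 2: the unchecked variant answers regardless of HA state.
        boolean uncheckedSafeMode() { return inSafeMode; }
    }

    // Mimics the DFSClient proxy: on StandbyException, retry against the next NN.
    static boolean readWithFailover(List<NameNode> nns) {
        for (NameNode nn : nns) {
            try {
                return nn.readSafeMode();
            } catch (StandbyException e) {
                // standby rejected the read; try the next NN
            }
        }
        throw new IllegalStateException("no active NN reachable");
    }

    public static void main(String[] args) {
        List<NameNode> nns = List.of(new NameNode(false, true),   // standby, in safe mode
                                     new NameNode(true, false));  // active, out of safe mode
        // READ: the standby's exception triggers failover to the active NN.
        System.out.println(readWithFailover(nns));          // false
        // Unchecked: the first (standby) NN answers and can fool the caller.
        System.out.println(nns.get(0).uncheckedSafeMode()); // true
    }
}
```

This is why the checked variant eventually reflects the active NN's state, while the unchecked one can report a stale standby answer.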

ashangit pushed a commit to criteo-forks/spark that referenced this pull request Nov 30, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 2, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
7 participants