Conversation

@LiShuMing
Contributor

What changes were proposed in this pull request?

See SPARK-21660. This PR adds a simple strategy that validates the chosen disk is writable, to avoid choosing a read-only disk.

How was this patch tested?

How do we mock a corrupted disk?

Make the recovery path read-only:
sudo chmod -R 400 /var/log/hadoop-yarn/nodemanager/recovery-state/nm-aux-services/spark_shuffle

Before this PR, starting the NodeManager produced the exception below:

2017-08-10 16:30:08,112 INFO yarn.YarnShuffleService (YarnShuffleService.java:<init>(136)) - Initializing YARN shuffle service for Spark
2017-08-10 16:30:08,112 INFO containermanager.AuxServices (AuxServices.java:addService(72)) - Adding auxiliary service spark_shuffle, "spark_shuffle"
2017-08-10 16:30:08,218 ERROR util.LevelDBProvider (LevelDBProvider.java:initLevelDB(61)) - error opening leveldb file /var/log/hadoop-yarn/nodemanager/recovery-state/nm-aux-services/spark_shuffle/registeredExecutors.ldb. Creating new file, will not be able to recover state for existing applications
org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: /var/log/hadoop-yarn/nodemanager/recovery-state/nm-aux-services/spark_shuffle/registeredExecutors.ldb/LOCK: Permission denied
at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
at org.apache.spark.network.util.LevelDBProvider.initLevelDB(LevelDBProvider.java:48)
at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:116)
at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:94)
at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.<init>(ExternalShuffleBlockHandler.java:66)
at org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:167)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:245)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:261)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:495)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:543)
2017-08-10 16:30:08,220 WARN util.LevelDBProvider (LevelDBProvider.java:initLevelDB(71)) - error deleting /var/log/hadoop-yarn/nodemanager/recovery-state/nm-aux-services/spark_shuffle/registeredExecutors.ldb
2017-08-10 16:30:08,220 INFO service.AbstractService (AbstractService.java:noteFailure(272)) - Service spark_shuffle failed in state INITED; cause: java.io.IOException: Unable to create state store
java.io.IOException: Unable to create state store
at org.apache.spark.network.util.LevelDBProvider.initLevelDB(LevelDBProvider.java:77)
at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:116)
at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:94)
at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.<init>(ExternalShuffleBlockHandler.java:66)
at org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:167)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:245)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:261)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:495)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:543)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: /var/log/hadoop-yarn/nodemanager/recovery-state/nm-aux-services/spark_shuffle/registeredExecutors.ldb/LOCK: Permission denied
at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
at org.apache.spark.network.util.LevelDBProvider.initLevelDB(LevelDBProvider.java:75)
... 15 more

After this PR:

2017-08-10 16:36:49,101 INFO yarn.YarnShuffleService (YarnShuffleService.java:<init>(136)) - Initializing YARN shuffle service for Spark
2017-08-10 16:36:49,101 INFO containermanager.AuxServices (AuxServices.java:addService(72)) - Adding auxiliary service spark_shuffle, "spark_shuffle"
2017-08-10 16:36:49,102 INFO yarn.YarnShuffleService (YarnShuffleService.java:initRecoveryDb(359)) - Recovery path /var/log/hadoop-yarn/nodemanager/recovery-state/nm-aux-services/spark_shuffle ldb available: false.
2017-08-10 16:36:49,102 WARN yarn.YarnShuffleService (YarnShuffleService.java:initRecoveryDb(367)) - Recovery path /var/log/hadoop-yarn/nodemanager/recovery-state/nm-aux-services/spark_shuffle unavailable: set it to null
2017-08-10 16:36:49,180 INFO util.LevelDBProvider (LevelDBProvider.java:initLevelDB(51)) - Creating state database at /mnt/dfs/0/hadoop/yarn/local/registeredExecutors.ldb
2017-08-10 16:36:49,317 INFO util.LevelDBProvider$LevelDBLogger (LevelDBProvider.java:log(93)) - Delete type=3 #1
2017-08-10 16:36:49,548 INFO yarn.YarnShuffleService (YarnShuffleService.java:serviceInit(186)) - Started YARN shuffle service for Spark on port 7337. Authentication is not enabled. Registered executor file is /mnt/dfs/0/hadoop/yarn/local/registeredExecutors.ldb

@AmplabJenkins

Can one of the admins verify this patch?

@LiShuMing changed the title [SPARK-21660] [YARN] [Shuffle] Yarn ShuffleService failed to start when the chosen dir… [SPARK-21660][YARN][Shuffle] Yarn ShuffleService failed to start when the chosen dir… Aug 14, 2017
Contributor

@jerryshao left a comment

My thinking is that if work preserving is enabled (the recovery path is not null), then the user should guarantee the availability of this directory. I am not sure it is good to change to other directories (does YARN internally rely on it?).

Also, would you please add a unit test to verify your logic.


/**
 * Check whether the chosen DB file is available.
 */
Contributor

I'm not sure this is a thorough way to check disk health. In our internal case, we found that a disk was not mounted (due to a failure), and trying to write to the unmounted disk threw a permission-denied exception.

An unwritable disk is just one case of an unhealthy disk; maybe we should look at YARN's disk health check mechanism.
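
For illustration, a minimal write-probe sketch (not this PR's code; DiskProbe and isWritable are hypothetical names): since File.canWrite() can report stale results on a failed or unmounted disk, actually attempting to create a file is a more reliable signal, and it surfaces exactly the permission-denied case described above.

import java.io.File;
import java.io.IOException;

public class DiskProbe {
  // Treat a directory as usable only if we can actually create a file in it.
  // Permission-denied and I/O errors (e.g. from an unmounted disk) both
  // surface here as exceptions.
  static boolean isWritable(File dir) {
    try {
      File probe = File.createTempFile("probe", ".tmp", dir);
      probe.delete(); // clean up; successful creation is the signal we need
      return true;
    } catch (IOException | SecurityException e) {
      return false; // read-only, unmounted, or otherwise failed disk
    }
  }
}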

/**
 * Check whether the chosen DB file is available.
 */
protected Boolean checkFileAvailable(File file) {
Contributor

Use two-space indentation for the Java code.

}
}

// If the recovery path is unavailable, do not use it any more.
Contributor

I think the recovery path is set by the user or falls back to the YARN default; the user should make sure this directory is available, and YARN relies on it internally. It doesn't make sense to change to another disk if the recovery path is unavailable.

}
}
}

Contributor

If _recoveryPath is still null, I think we should throw an exception here, since none of the disks is good.
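
A minimal sketch of that suggestion (the exception type and message text are illustrative, not the PR's actual code):

// All candidate directories failed the availability check: fail fast rather
// than starting the shuffle service without a usable state store.
if (_recoveryPath == null) {
  throw new IOException("No usable disk found for the shuffle service recovery path");
}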

@LiShuMing
Contributor Author

@jerryshao Thanks for your replies! I will do the following:

  1. "it is good to change to other directories (is yarn internally relying on it)?"
    I think the recovery path(local variable) is only used in YarnShuffleService, principally not affects yarn environment. This PR cares the scene that we can find a better way to choose a useful disk for the recovery path when there are many disks that can choose.

  2. Check HDFS/YARN's disk health check mechanism to better define checkFileAvailable();

  3. Fix code format.

  4. Finally, throw an exception when _recoveryPath is still null.

@jerryshao
Contributor

@LiShuMing any update on this?

@LiShuMing
Contributor Author

Sorry, I have been busy recently; I will update it today...

@LiShuMing
Contributor Author

ping @jerryshao

I found a method in Hadoop to check a disk: https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/DiskChecker.java#L111
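
For reference, a minimal sketch of how that Hadoop utility could be applied here (RecoveryDirCheck and isUsable are hypothetical names):

import java.io.File;
import org.apache.hadoop.util.DiskChecker;
import org.apache.hadoop.util.DiskChecker.DiskErrorException;

public class RecoveryDirCheck {
  static boolean isUsable(File dir) {
    try {
      // checkDir creates the directory if missing and verifies it is a
      // directory that can be read, written, and listed.
      DiskChecker.checkDir(dir);
      return true;
    } catch (DiskErrorException e) {
      return false; // fall back to another candidate directory
    }
  }
}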

I added a unit test. Can you help me review my code?

@jerryshao
Contributor

jerryshao commented Aug 22, 2017

I have two questions about the fix:

  1. Is it a good idea to change the recovery path to another directory? The recovery path is configured by the user or figured out by YARN, so YARN may have assumptions about this path; if we change it to another one, will that introduce issues? Also, if the recovery path is not null, shouldn't the user guarantee its availability?
  2. What if the previously bad disk comes back to normal with orphan data? For example, dir1 fails with state V1, and based on this logic we choose another dir2 and the state changes to V2. Then, after a while, if dir1 comes back to normal, which dir are we choosing based on your current code?

CC @tgravescs to review.

@tgravescs
Contributor

The recovery path returned by YARN is supposed to be reliable, and if it isn't working then the NM itself shouldn't run. So in general you should just use it if you want Spark to be able to recover. If you don't have YARN recovery enabled, then there is no need for us to write the DBs at all, and I think we should change the code to not do that.

I think this JIRA is a duplicate of https://issues.apache.org/jira/browse/SPARK-17321

See my comments there.

@LiShuMing
Contributor Author

See #19032 for another approach to solving this problem; I will close this PR.

Thanks @jerryshao @tgravescs.

@LiShuMing closed this Aug 24, 2017