
Conversation

@wangyum (Member) commented Jul 9, 2018

What changes were proposed in this pull request?

Our HDFS cluster is configured with 5 nameservices: nameservices1, nameservices2, nameservices3, nameservices-dev1 and nameservices4, but nameservices-dev1 is unstable. So sometimes an error occurs and causes the entire job to fail since SPARK-24149:

![image](https://user-images.githubusercontent.com/5399861/42434779-f10c48fc-8386-11e8-98b0-4d9786014744.png)

I think it's best to add a switch here.

How was this patch tested?

manual tests

@SparkQA commented Jul 9, 2018

Test build #92736 has finished for PR 21734 at commit 8885fff.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 9, 2018

Test build #92737 has finished for PR 21734 at commit 50ef3c9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 9, 2018

Test build #92742 has finished for PR 21734 at commit da1e389.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum (Member, Author) commented Jul 9, 2018

cc @mgaido91 @vanzin

@mgaido91 (Contributor) commented Jul 9, 2018

@wangyum I am not sure about this. It seems like an environment issue to me. Anyway, if you want to access that namespace, you have to add it to the list of file systems to access, so the problem is the same. If you don't access it, then it can just be removed from your config. What do you think?

@wangyum (Member, Author) commented Jul 9, 2018

@mgaido91 Yes, it's an environment issue. My concern is mainly compatibility with previous Spark versions: if a job fails because of SPARK-24149, all we can do is change hdfs-site.xml, which is a fairly big risk.

@jerryshao (Contributor)

Shall we fix this issue in HadoopFSDelegationTokenProvider? Maybe we should try-catch the delegation token fetching process.

@wangyum (Member, Author) commented Jul 10, 2018

Fetching tokens can take a lot of time. I added some debug logging in HadoopFSDelegationTokenProvider:

filesystems.foreach { fs =>
  try {
    logInfo("getting token for: " + fs)
    logWarning("begin")  // temporary marker to time the token fetch
    fs.addDelegationTokens(renewer, creds)
    logWarning("end")    // temporary marker to time the token fetch
  } catch {
    case _: ConnectTimeoutException =>
      logError("Failed to fetch delegation tokens: connection timed out.")
    case e: Throwable =>
      logError("Error while fetching delegation tokens.", e)
  }
}

This is the log file.

@wangyum (Member, Author) commented Jul 10, 2018

There is a conflict here. I configured spark.yarn.access.namenodes=hdfs://nameservices1,hdfs://nameservices2, but it still fetched tokens for all nameservices.
This time I changed the default of spark.yarn.access.namenodes to * to fetch all tokens.

@SparkQA commented Jul 10, 2018

Test build #92785 has finished for PR 21734 at commit 43ffeaa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao (Contributor)

a) spark.yarn.access.namenodes is not meant for this purpose, so I don't think it is meaningful to change this configuration.
b) spark.yarn.access.namenodes is already deprecated.

@SparkQA commented Jul 10, 2018

Test build #92788 has finished for PR 21734 at commit 61b1339.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Only fetch tokens for the file systems the user explicitly configured;
// an empty list means requesting delegation tokens for all discovered file systems.
val filesystemsToAccess = sparkConf.get(FILESYSTEMS_TO_ACCESS)
  .map(new Path(_).getFileSystem(hadoopConf))
  .toSet
val isRequestAllDelegationTokens = filesystemsToAccess.isEmpty
Review comment (Contributor):

This would mean that if your running application accesses different namespaces and you want to add a new namespace to connect to, adding only the namespace you need can break the application, since we would no longer get tokens for the other namespaces.
I'd rather follow @jerryshao's suggestion of not crashing when the renewal fails; that seems to fix your problem without hurting other use cases.

Review comment (Member, Author):

Fetching delegation tokens is inherently heavy. It took 22 seconds to get 5 tokens:
(screenshot of the log showing the token fetch timings)

Review comment (Member, Author):

Now the spark.yarn.access.hadoopFileSystems configuration is effectively useless: we always get all tokens.

Review comment (Member, Author):

Users should at least be able to configure the file systems they need, for better performance.

Review comment (Contributor):

spark.yarn.access.hadoopFileSystems is not useless; it is just meant for accessing external clusters, which is what it was created for. Moreover, if you use viewfs, the same operations are performed under the hood by Hadoop code. So this seems to be a more general performance/scalability issue with the number of namespaces we support.

Review comment (Contributor):

spark.yarn.access.hadoopFileSystems is not used the way you think, and I don't think changing its semantics is the correct approach.

Basically your problem is that not all the nameservices are accessible in your federated HDFS; currently the Hadoop token provider throws an exception and skips the remaining file systems. I think it would be better to try-catch and ignore the bad cluster; that would be more meaningful than this fix.

If you don't want to get tokens for all the nameservices, I think you should change the HDFS configuration used by Spark, since Spark assumes that all the nameservices are accessible. Also, token acquisition happens at application submission, so it is not a big problem whether the fetch is slow or not.
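For illustration only (assuming, as SPARK-24149 describes, that namespace discovery is driven by dfs.nameservices), "change the HDFS configuration used by Spark" would mean listing only the reachable nameservices in the hdfs-site.xml that Spark picks up. A minimal sketch in code form, using the nameservice names from this thread and a hypothetical variable name:

import org.apache.hadoop.conf.Configuration

// Sketch: a Hadoop configuration for Spark that omits the unstable
// nameservices-dev1, so automatic discovery never tries to reach it.
val hadoopConfForSpark = new Configuration()
hadoopConfForSpark.set("dfs.nameservices",
  "nameservices1,nameservices2,nameservices3,nameservices4")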

Review comment (Member, Author):

Then spark.yarn.access.hadoopFileSystems could only be used with file systems that are not HA,
because an HA file system must have its 2 namenodes configured in hdfs-site.xml.

Review comment (Member, Author):

cc @LantaoJin @suxingfate You should be familiar with this.

@LantaoJin (Contributor) commented Jul 19, 2018:

@wangyum spark.yarn.access.hadoopFileSystems can be set with HA nameservices.
For example:

--conf spark.yarn.access.hadoopFileSystems=hdfs://cluster1-ha,hdfs://cluster2-ha
In hdfs-site.xml:

<property>
  <name>dfs.nameservices</name>
  <value>cluster1-ha,cluster2-ha</value>
</property>
<property>
  <name>dfs.ha.namenodes.cluster1-ha</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.ha.namenodes.cluster2-ha</name>
  <value>nn1,nn2</value>
</property>

@vanzin (Contributor) commented Aug 14, 2018

I generally dislike adding new configs for things that can be solved without one. Saisai's suggestion seems good enough: catch the error and print a warning about the specific file system that won't be available. The user app will fail if it tries to talk to that fs in any case. Maybe throw an exception if you weren't able to get any delegation tokens.
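A minimal sketch of that catch-and-warn idea (hypothetical, not the change merged in this PR; it assumes the surrounding class mixes in Spark's Logging trait like the provider snippet quoted earlier, and the fetchTokensBestEffort name is made up):

import scala.util.control.NonFatal
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.security.Credentials

// Sketch: skip file systems whose token fetch fails, warn about each one,
// and only fail if no file system yielded a token at all.
def fetchTokensBestEffort(
    filesystems: Set[FileSystem],
    renewer: String,
    creds: Credentials): Unit = {
  val failed = filesystems.filter { fs =>
    try {
      fs.addDelegationTokens(renewer, creds)
      false
    } catch {
      case NonFatal(e) =>
        logWarning(s"Failed to get delegation token for $fs; it will not be accessible.", e)
        true
    }
  }
  if (filesystems.nonEmpty && failed.size == filesystems.size) {
    throw new IllegalStateException("Could not fetch delegation tokens for any file system.")
  }
}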

@wangyum wangyum changed the title [SPARK-24149][YARN][FOLLOW-UP] Add a config to control automatic namespaces discovery [SPARK-24149][YARN][FOLLOW-UP] Only get the delegation tokens of the filesystem explicitly specified by the user Aug 15, 2018
@wangyum (Member, Author) commented Aug 15, 2018

Thanks @vanzin. If there is a problem with a file system, the retries while getting its delegation token take a long time.

The new approach is:

  • Get all the delegation tokens by default.
  • Only get the delegation tokens of the file systems explicitly specified in spark.yarn.access.hadoopFileSystems, for better performance (see the sketch after this list).
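For illustration, a minimal sketch of how a job could restrict token fetching to the file systems it actually needs (using the nameservice URIs mentioned earlier in this thread; not code from this PR):

import org.apache.spark.SparkConf

// Sketch: list the file systems whose delegation tokens are needed.
// Leaving spark.yarn.access.hadoopFileSystems unset keeps the default
// behavior of fetching tokens for every discovered nameservice.
val conf = new SparkConf()
  .set("spark.yarn.access.hadoopFileSystems",
    "hdfs://nameservices1,hdfs://nameservices2")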

@vanzin (Contributor) commented Aug 24, 2018

Ok, since SPARK-24149 hasn't shipped in any release, it sounds ok to change the behavior of the config. I don't love it, but it seems consistent with the previous behavior.

I'd rename isRequestAllDelegationTokens to requestAllDelegationTokens which sounds better.

@SparkQA commented Aug 25, 2018

Test build #95241 has finished for PR 21734 at commit f3aa2a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor) commented Aug 27, 2018

Merging to master.

@asfgit asfgit closed this in c3f285c Aug 27, 2018
bogdanrdc pushed a commit to bogdanrdc/spark that referenced this pull request Aug 28, 2018
…filesystem explicitly specified by the user


Closes apache#21734 from wangyum/SPARK-24149.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
