
Conversation

@wangyum (Member) commented Jul 9, 2018

What changes were proposed in this pull request?

Our HDFS cluster is configured with 5 nameservices: nameservices1, nameservices2, nameservices3, nameservices-dev1 and nameservices4, but nameservices-dev1 is unstable. So sometimes an error occurs and causes the entire job to fail since SPARK-24149:

![image](https://user-images.githubusercontent.com/5399861/42434779-f10c48fc-8386-11e8-98b0-4d9786014744.png)

I think it's best to add a switch here.

How was this patch tested?

manual tests

@SparkQA commented Jul 9, 2018

Test build #92736 has finished for PR 21734 at commit 8885fff.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 9, 2018

Test build #92737 has finished for PR 21734 at commit 50ef3c9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 9, 2018

Test build #92742 has finished for PR 21734 at commit da1e389.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum (Member, Author) commented Jul 9, 2018

cc @mgaido91 @vanzin

@mgaido91 (Contributor) commented Jul 9, 2018

@wangyum I am not sure about this. It seems like an environment issue to me. Anyway, if you want to access that namespace, you have to add it to the list of file systems to access, so the problem is the same. If you don't access it, then it can just be removed from your config. What do you think?

@wangyum (Member, Author) commented Jul 9, 2018

@mgaido91 Yes, it's an environment issue. My concern is mainly compatibility with previous Spark versions: if a job fails because of SPARK-24149, all we can do is change hdfs-site.xml, which is a fairly big risk.

@jerryshao (Contributor)

Shall we fix this issue in HadoopFSDelegationTokenProvider? Maybe we should try-catch the delegation token fetching process.

@wangyum (Member, Author) commented Jul 10, 2018

Fetching tokens can take a lot of time. I added some debug logging in HadoopFSDelegationTokenProvider:

filesystems.foreach { fs =>
  try {
    logInfo("getting token for: " + fs)
    logWarning("begin")  // temporary marker to time the token fetch
    fs.addDelegationTokens(renewer, creds)
    logWarning("end")    // temporary marker to time the token fetch
  } catch {
    case _: ConnectTimeoutException =>
      logError("Failed to fetch delegation tokens: connection timed out.")
    case e: Throwable =>
      logError("Error while fetching delegation tokens.", e)
  }
}

This is the log file.

@wangyum (Member, Author) commented Jul 10, 2018

There is a conflict here. I configured spark.yarn.access.namenodes=hdfs://nameservices1,hdfs://nameservices2, but it still fetched tokens for all nameservices.
This time I changed the default of spark.yarn.access.namenodes to * to fetch all tokens.

@SparkQA commented Jul 10, 2018

Test build #92785 has finished for PR 21734 at commit 43ffeaa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao (Contributor)

a) spark.yarn.access.namenodes is not meant for this purpose, so I don't think it is meaningful to change this configuration.
b) spark.yarn.access.namenodes is already deprecated.

@SparkQA commented Jul 10, 2018

Test build #92788 has finished for PR 21734 at commit 61b1339.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Only fetch tokens for the file systems the user explicitly configured;
// an empty list means requesting delegation tokens for all discovered file systems.
val filesystemsToAccess = sparkConf.get(FILESYSTEMS_TO_ACCESS)
  .map(new Path(_).getFileSystem(hadoopConf))
  .toSet
val isRequestAllDelegationTokens = filesystemsToAccess.isEmpty
Review comment (Contributor):

This would mean that if your running application accesses different namespaces and you want to add a new namespace to connect to, adding only the namespace you need can break the application, since we would no longer get tokens for the other namespaces.
I'd rather follow @jerryshao's suggestion of not crashing when the renewal fails; that seems to fix your problem without hurting other use cases.

Review comment (Member, Author):

Fetching delegation tokens is inherently heavy. It took 22 seconds to get 5 tokens:
(screenshot of the log showing the token fetch timings)

Review comment (Member, Author):

Now the spark.yarn.access.hadoopFileSystems configuration is effectively useless: we always get all tokens.

Review comment (Member, Author):

Users should at least be able to configure the file systems they need, for better performance.

Review comment (Contributor):

spark.yarn.access.hadoopFileSystems is not useless; it is just meant for accessing external clusters, which is what it was created for. Moreover, if you use viewfs, the same operations are performed under the hood by Hadoop code. So this seems to be a more general performance/scalability issue with the number of namespaces we support.

Review comment (Contributor):

spark.yarn.access.hadoopFileSystems is not used the way you think, and I don't think changing its semantics is the correct approach.

Basically your problem is that not all the nameservices are accessible in your federated HDFS; currently the Hadoop token provider throws an exception and skips the remaining file systems. I think it would be better to try-catch and ignore the bad cluster; that would be more meaningful than this fix.

If you don't want to get tokens for all the nameservices, I think you should change the HDFS configuration used by Spark, since Spark assumes that all the nameservices are accessible. Also, token acquisition happens at application submission, so it is not a big problem whether the fetch is slow or not.
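For illustration only (assuming, as SPARK-24149 describes, that namespace discovery is driven by dfs.nameservices), "change the HDFS configuration used by Spark" would mean listing only the reachable nameservices in the hdfs-site.xml that Spark picks up. A minimal sketch in code form, using the nameservice names from this thread and a hypothetical variable name:

import org.apache.hadoop.conf.Configuration

// Sketch: a Hadoop configuration for Spark that omits the unstable
// nameservices-dev1, so automatic discovery never tries to reach it.
val hadoopConfForSpark = new Configuration()
hadoopConfForSpark.set("dfs.nameservices",
  "nameservices1,nameservices2,nameservices3,nameservices4")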

Review comment (Member, Author):

Then spark.yarn.access.hadoopFileSystems could only be used with file systems that are not HA,
because an HA file system must have its 2 namenodes configured in hdfs-site.xml.

Review comment (Member, Author):

cc @LantaoJin @suxingfate You should be familiar with this.

@LantaoJin (Contributor) commented Jul 19, 2018:

@wangyum spark.yarn.access.hadoopFileSystems can be set with HA nameservices.
For example:

--conf spark.yarn.access.hadoopFileSystems=hdfs://cluster1-ha,hdfs://cluster2-ha
In hdfs-site.xml:

<property>
  <name>dfs.nameservices</name>
  <value>cluster1-ha,cluster2-ha</value>
</property>
<property>
  <name>dfs.ha.namenodes.cluster1-ha</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.ha.namenodes.cluster2-ha</name>
  <value>nn1,nn2</value>
</property>

@vanzin (Contributor) commented Aug 14, 2018

I generally dislike adding new configs for things that can be solved without one. Saisai's suggestion seems good enough: catch the error and print a warning about the specific file system that won't be available. The user app will fail if it tries to talk to that fs in any case. Maybe throw an exception if you weren't able to get any delegation tokens.
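A minimal sketch of that catch-and-warn idea (hypothetical, not the change merged in this PR; it assumes the surrounding class mixes in Spark's Logging trait like the provider snippet quoted earlier, and the fetchTokensBestEffort name is made up):

import scala.util.control.NonFatal
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.security.Credentials

// Sketch: skip file systems whose token fetch fails, warn about each one,
// and only fail if no file system yielded a token at all.
def fetchTokensBestEffort(
    filesystems: Set[FileSystem],
    renewer: String,
    creds: Credentials): Unit = {
  val failed = filesystems.filter { fs =>
    try {
      fs.addDelegationTokens(renewer, creds)
      false
    } catch {
      case NonFatal(e) =>
        logWarning(s"Failed to get delegation token for $fs; it will not be accessible.", e)
        true
    }
  }
  if (filesystems.nonEmpty && failed.size == filesystems.size) {
    throw new IllegalStateException("Could not fetch delegation tokens for any file system.")
  }
}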

@wangyum wangyum changed the title [SPARK-24149][YARN][FOLLOW-UP] Add a config to control automatic namespaces discovery [SPARK-24149][YARN][FOLLOW-UP] Only get the delegation tokens of the filesystem explicitly specified by the user Aug 15, 2018
@wangyum (Member, Author) commented Aug 15, 2018

Thanks @vanzin. If there is a problem with a file system, the retries while getting its delegation token take a long time.

The new approach is:

  • Get all the delegation tokens by default.
  • Only get the delegation tokens of the file systems explicitly specified in spark.yarn.access.hadoopFileSystems, for better performance (see the sketch after this list).
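For illustration, a minimal sketch of how a job could restrict token fetching to the file systems it actually needs (using the nameservice URIs mentioned earlier in this thread; not code from this PR):

import org.apache.spark.SparkConf

// Sketch: list the file systems whose delegation tokens are needed.
// Leaving spark.yarn.access.hadoopFileSystems unset keeps the default
// behavior of fetching tokens for every discovered nameservice.
val conf = new SparkConf()
  .set("spark.yarn.access.hadoopFileSystems",
    "hdfs://nameservices1,hdfs://nameservices2")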

@vanzin (Contributor) commented Aug 24, 2018

Ok, since SPARK-24149 hasn't shipped in any release, it sounds ok to change the behavior of the config. I don't love it, but it seems consistent with the previous behavior.

I'd rename isRequestAllDelegationTokens to requestAllDelegationTokens which sounds better.

@SparkQA commented Aug 25, 2018

Test build #95241 has finished for PR 21734 at commit f3aa2a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor) commented Aug 27, 2018

Merging to master.

@asfgit asfgit closed this in c3f285c Aug 27, 2018
bogdanrdc pushed a commit to bogdanrdc/spark that referenced this pull request Aug 28, 2018
…filesystem explicitly specified by the user


Closes apache#21734 from wangyum/SPARK-24149.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
