Conversation

@zhangshuyan0

When tokenRenewalInterval is not set, HadoopFSDelegationTokenProvider#getTokenRenewalInterval fetches some tokens and renews them to obtain an interval value.

```scala
// Get the token renewal interval if it is not set. It will only be called once.
if (tokenRenewalInterval == null) {
  tokenRenewalInterval = getTokenRenewalInterval(hadoopConf, fileSystems)
}
```

cancel() is never called on these tokens, so a large number of them linger on HDFS instead of being cleared promptly, which puts additional pressure on the HDFS server.

```scala
val interval = newExpiration - getIssueDate(tokenKind, identifier)
logInfo(log"Renewal interval is ${MDC(TOTAL_TIME, interval)} for" +
  log" token ${MDC(TOKEN_KIND, tokenKind)}")
token.cancel(hadoopConf)
```
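The leak-and-fix pattern can be shown with a minimal self-contained sketch. Note this uses a toy Token stand-in, not the actual Hadoop/Spark API; all names and numbers here are illustrative only:

```scala
// Minimal sketch of the leak-and-fix pattern. Token is a toy stand-in for
// Hadoop's delegation token API; names and numbers are illustrative.
object RenewalIntervalSketch {
  final class Token(val kind: String, val issueDate: Long) {
    var cancelled = false
    // Pretend renewing pushes expiration out by 24 hours from the issue date.
    def renew(): Long = issueDate + 24L * 60 * 60 * 1000
    def cancel(): Unit = cancelled = true
  }

  // Fetch throwaway tokens only to measure the renewal interval, and cancel
  // each one afterwards so they do not accumulate on the server.
  def getTokenRenewalInterval(tokens: Seq[Token]): Option[Long] = {
    val intervals = tokens.map { token =>
      try {
        token.renew() - token.issueDate
      } finally {
        token.cancel() // without this call, the throwaway token leaks
      }
    }
    intervals.reduceOption(_ min _)
  }

  def main(args: Array[String]): Unit = {
    val tokens = Seq(new Token("HDFS_DELEGATION_TOKEN", 0L))
    println(getTokenRenewalInterval(tokens)) // Some(86400000)
    println(tokens.forall(_.cancelled))      // true
  }
}
```

The key point is that cancel() runs in a finally block, so the measurement-only tokens are released even if renew() throws.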
Member


have you tested it? it requires some time to propagate the new token from driver to executors, I think such a change may cause auth failures on the executor side

Author

@zhangshuyan0 zhangshuyan0 Aug 19, 2024


Thanks for your review! These tokens are never actually used. The tokens that are actually distributed to the executors are obtained at line 57:

```scala
val fetchCreds = fetchDelegationTokens(getTokenRenewer(sparkConf, hadoopConf), fileSystems,
  creds, fsToExclude)
// Get the token renewal interval if it is not set. It will only be called once.
if (tokenRenewalInterval == null) {
  tokenRenewalInterval = getTokenRenewalInterval(hadoopConf, fileSystems)
}
```

Member


ha, that makes sense

@github-actions github-actions bot added the CORE label Aug 19, 2024
@Hexiaoqiao

Not sure why there is a method named getTokenRenewalInterval; it seems it only wants to determine the renewal interval of the tokens, right? If so, the tokens have already been obtained by the fetchDelegationTokens call, so there is no need to call fetchDelegationTokens again. IMO it is enough to fetch delegation tokens at L57-L58.
https://github.com/apache/spark/pull/47800/files#diff-f7ab3d8ae8c3483d5207277bd592b7193e139edae25e111fce88a902b6b77a05L57-L58
Another option is to pass the existing credentials to getTokenRenewalInterval, so it will not request again if a token already exists for a nameservice: tokenRenewalInterval = getTokenRenewalInterval(hadoopConf, sparkConf, fileSystems, creds).
Please correct me if I missed something. Thanks.

@pan3793
Member

pan3793 commented Aug 19, 2024

it has been got when invoke fetchDelegationTokens, and don't need to fetchDelegationTokens again. So IMO, it is enough to fetch delegation tokens at L57-L58.

@Hexiaoqiao I believe it gets explained in L135-L137

```scala
// We cannot use the tokens generated with renewer yarn. Trying to renew
// those will fail with an access control issue. So create new tokens with the logged in
// user as renewer.
```

Well, for non-YARN modes (e.g. K8s), we do have a chance to save one fetchDelegationTokens call.

```scala
private def getTokenRenewer(sparkConf: SparkConf, hadoopConf: Configuration): String = {
  val master = sparkConf.get("spark.master", null)
  val tokenRenewer = if (master != null && master.contains("yarn")) {
    Master.getMasterPrincipal(hadoopConf)
  } else {
    UserGroupInformation.getCurrentUser().getUserName()
  }
```

@Hexiaoqiao

Well, got it. In that case I think @zhangshuyan0's PR is a direct solution. Thanks again.

@pan3793
Member

pan3793 commented Aug 19, 2024

cc @viirya @sunchao

@viirya
Member

viirya commented Aug 19, 2024

Looks reasonable.

@Hexiaoqiao

Hi All, any progress here? Thanks.

Member

@sunchao sunchao left a comment


LGTM with one nit

```scala
val interval = newExpiration - getIssueDate(tokenKind, identifier)
logInfo(log"Renewal interval is ${MDC(TOTAL_TIME, interval)} for" +
  log" token ${MDC(TOKEN_KIND, tokenKind)}")
token.cancel(hadoopConf)
```
Member


nit: could you add some comments here just to make it easier to understand in future?

Author


Thanks for your suggestion! I have added it.

@mridulm
Contributor

mridulm commented Aug 21, 2024

+CC @attilapiros, @tgravescs

@sunchao sunchao closed this in ae05852 Aug 21, 2024
@sunchao sunchao changed the title [SPARK-49300][CORE]Fix Hadoop delegation token leak when tokenRenewalInterval is not set. [SPARK-49300][CORE] Fix Hadoop delegation token leak when tokenRenewalInterval is not set. Aug 21, 2024
@sunchao
Member

sunchao commented Aug 21, 2024

Thanks! Merged to master

@sunchao
Member

sunchao commented Aug 21, 2024

@zhangshuyan0 do you think we should backport this to branch-3.5 too? if so, feel free to create another PR for the backport. (I tried to just cherry-pick this commit but there are some conflicts)

@zhangshuyan0
Author

@zhangshuyan0 do you think we should backport this to branch-3.5 too? if so, feel free to create another PR for the backport. (I tried to just cherry-pick this commit but there are some conflicts)

I think so. I'll create another PR for it.

@mridulm
Contributor

mridulm commented Aug 21, 2024

@sunchao, if there are additional folks tagged for review, it would be good to give them a chance to get to the PR, given that Apache Spark has a distributed community.

@sunchao
Member

sunchao commented Aug 21, 2024

@mridulm I'm sorry. I didn't know you tagged them for review. Feel free to leave any comments on the PR. We can always address them in follow-ups.
