[SPARK-14743][YARN] Add a configurable credential manager for Spark running on YARN #14065
Conversation
Test build #61830 has finished for PR 14065 at commit
Test build #61896 has finished for PR 14065 at commit
hadoopUtil.obtainTokenForHiveMetastore(sparkConf, freshHadoopConf, tempCreds)
hadoopUtil.obtainTokenForHBase(sparkConf, freshHadoopConf, tempCreds)
hdfsTokenProvider(sparkConf).setNameNodesToAccess(sparkConf, Set(dst))
hdfsTokenProvider(sparkConf).setTokenRenewer(null)
why are we setting this to null here?
I took a quick look through. It might be nice to think about how we could handle other kinds of credentials. For instance, Apache Kafka currently doesn't have tokens, so you need a keytab or TGT and a JAAS conf file. Yes, they are adding tokens, but how does that work in the meantime? Are there other services in a similar situation? Can we handle things other than tokens? It does appear that I could implement my own ServiceTokenProvider that goes off to really any service, and I can put things into the Credentials object as a Token or as a Secret, so perhaps we are covered here. But perhaps that means we should rename things to obtainCredentials rather than obtainTokens. Are there specific services you were thinking about here? We could at least use those as examples to make sure the interface fits them.
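(Illustration only, not code from this PR: a rough sketch of the kind of custom provider described above, assuming a hypothetical provider class with an obtainCredentials-style hook. The `Credentials.addToken` and `Credentials.addSecretKey` calls are the standard Hadoop API; the class and alias names are made up for the example.)

```scala
import java.nio.charset.StandardCharsets

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.Text
import org.apache.hadoop.security.Credentials
import org.apache.spark.SparkConf

// Hypothetical shape of a user-supplied provider: the only real requirement is
// that it can add whatever it needs (tokens or secrets) to the passed-in Credentials.
class MyServiceCredentialProvider {

  def obtainCredentials(
      sparkConf: SparkConf,
      hadoopConf: Configuration,
      creds: Credentials): Unit = {
    // A real provider would talk to its service here and fetch a delegation
    // token; addToken stores it under an alias for later distribution:
    // creds.addToken(new Text("my-service"), fetchedToken)

    // Services without delegation tokens can still ship opaque secrets
    // (e.g. material read from a keytab or JAAS config) via addSecretKey.
    creds.addSecretKey(
      new Text("my-service.secret"),
      "example-secret".getBytes(StandardCharsets.UTF_8))
  }
}
```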
Test build #61912 has finished for PR 14065 at commit
private[spark] override def startExecutorDelegationTokenRenewer(sparkConf: SparkConf): Unit = {
  tokenRenewer = Some(new ExecutorDelegationTokenUpdater(sparkConf, conf))
  tokenRenewer.get.updateCredentialsIfRequired()
  configurableTokenManager(sparkConf).delegationTokenUpdater(conf)
I find this syntax a little confusing. You're calling configurableTokenManager(sparkConf) in a bunch of different places. To me that looks like either:
- each call is creating a new token manager
- there's some cache of token managers somewhere keyed by the spark configuration passed here
Neither sounds good to me. And the actual implementation is neither: there's a single token manager singleton that is instantiated in the first call to configurableTokenManager.
Why doesn't Client instantiate a token manager in its constructor instead? Another option is to have an explicit method in ConfigurableTokenManager to initialize the singleton, although I'm not a fan of singletons in general.
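(Illustrative sketch only, not code from the PR: one way to read the suggestion above, assuming a simplified ConfigurableTokenManager. The explicit initialize/get split makes it obvious there is exactly one instance, instead of hiding the instantiation inside a getter that takes a SparkConf.)

```scala
import org.apache.spark.SparkConf

// Simplified stand-in for the manager discussed above.
class ConfigurableTokenManager(sparkConf: SparkConf)

object ConfigurableTokenManager {
  @volatile private var instance: ConfigurableTokenManager = _

  // Explicit, one-time initialization (e.g. called from Client's constructor).
  def initialize(sparkConf: SparkConf): Unit = synchronized {
    require(instance == null, "ConfigurableTokenManager already initialized")
    instance = new ConfigurableTokenManager(sparkConf)
  }

  // Later call sites just fetch the already-created instance; no SparkConf is
  // needed, so it no longer looks like each call might build a new manager.
  def get: ConfigurableTokenManager = {
    require(instance != null, "ConfigurableTokenManager not initialized")
    instance
  }
}
```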
Thanks a lot @tgravescs and @vanzin for your suggestions. I will change the code accordingly; your comments are greatly appreciated.
Test build #62148 has finished for PR 14065 at commit
@tgravescs and @vanzin, over the last few days I did some refactoring work on this patch. Here are the changes compared to the previous code:
Please help review. Thanks a lot for your time, and your comments are greatly appreciated.
Also, thinking about an example to land this feature with: I think Kafka might be one candidate, since they also have a delegation-token-based proposal, KIP-48.
Test build #62149 has finished for PR 14065 at commit
Test build #62219 has finished for PR 14065 at commit
dev/.rat-excludes
Outdated
Instead of adding these files like this, isn't it possible to just say META-INF/services/*?
I followed the convention of other services like DataSource, which also need to be excluded from the RAT check.
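(Context only, not code from this PR: META-INF/services files are the registration mechanism for java.util.ServiceLoader, which is why they show up in dev/.rat-excludes. A minimal sketch of how providers registered this way are typically discovered; the ServiceTokenProvider name is taken from the discussion above, and the rest is assumption.)

```scala
import java.util.ServiceLoader
import scala.collection.JavaConverters._

// Assumed provider interface name from this PR's discussion; illustrative only.
trait ServiceTokenProvider {
  def serviceName: String
}

object ProviderDiscovery {
  // ServiceLoader reads META-INF/services/<fully.qualified.InterfaceName> from the
  // classpath and instantiates every implementation class listed in that file.
  def loadProviders(): Seq[ServiceTokenProvider] =
    ServiceLoader.load(classOf[ServiceTokenProvider]).asScala.toSeq
}
```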
Test build #63405 has finished for PR 14065 at commit
docs/running-on-yarn.md
Outdated
credential provider.
```
spark.yarn.access.namenodes hdfs://ireland.example.org:8020/,hdfs://frankfurt.example.org:8020/
```
Hmm, it seems like this example should be before the paragraph you're adding.
Looks fine. There are some possible enhancements (e.g. what looks like some code repetition in the HDFS provider, and neither Hive nor HBase returning a token renewal time), but those can be done separately. @tgravescs, did you have any remaining comments?
All my original comments were addressed, and I won't have time to do another review until next week, so I'm good with it if you are.
Test build #63516 has finished for PR 14065 at commit
Merging to master.
…e time

## What changes were proposed in this pull request?
In #14065, we introduced a configurable credential manager for Spark running on YARN. Two configs, `spark.yarn.credentials.renewalTime` and `spark.yarn.credentials.updateTime`, were also added: one for the credential renewer and the other for the updater. But we currently query `spark.yarn.credentials.renewalTime` by mistake during credentials updating, where it should actually be `spark.yarn.credentials.updateTime`. This PR fixes that mistake.

## How was this patch tested?
Existing tests.

cc jerryshao vanzin

Author: Kent Yao <[email protected]>

Closes #16955 from yaooqinn/cred_update.

(cherry picked from commit 7363dde)
Signed-off-by: Marcelo Vanzin <[email protected]>
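(Illustrative only, not the actual fix: the point of the commit above is simply which key gets read where. A minimal sketch, assuming the intervals are read as plain long values from SparkConf; the object and method names are made up.)

```scala
import org.apache.spark.SparkConf

object CredentialIntervals {
  // Read by the credential renewer: how often new credentials are obtained.
  def renewalInterval(conf: SparkConf): Long =
    conf.getLong("spark.yarn.credentials.renewalTime", Long.MaxValue)

  // Read by the credential updater: how often refreshed credentials are picked up.
  // The bug fixed above was reading renewalTime here instead of updateTime.
  def updateInterval(conf: SparkConf): Long =
    conf.getLong("spark.yarn.credentials.updateTime", Long.MaxValue)
}
```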
What changes were proposed in this pull request?
Add a configurable token manager for Spark running on YARN.
Current Problems
Changes In This Proposal
To address the problems mentioned above and make the current code cleaner and easier to understand, this proposal mainly has 3 changes:
- A `ServiceTokenProvider` as well as a `ServiceTokenRenewable` interface for token providers. Each service that wants to communicate with Spark via tokens needs to implement this interface.
- A `ConfigurableTokenManager` to manage all the registered token providers, as well as the token renewer and updater. This class also offers the API for other modules to obtain tokens, get the renewal interval, and so on.
- Built-in `HDFSTokenProvider`, `HiveTokenProvider` and `HBaseTokenProvider` to keep the same semantics as supported today. Whether to load these built-in token providers is controlled by the configuration "spark.yarn.security.tokens.${service}.enabled"; by default all the built-in token providers are loaded.

Behavior Changes
For the end user there's no behavior change: we still use the same configuration `spark.yarn.security.tokens.${service}.enabled` to decide which token provider is enabled (HBase or Hive).

A user-implemented token provider (assume its name is "test") needs two configurations:
- set `spark.yarn.security.tokens.test.enabled` to `true`
- set `spark.yarn.security.tokens.test.class` to the fully qualified class name

So we still keep the same semantics as the current code while adding one new configuration (see the sketch below).
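(Illustrative sketch, not from the PR itself: how the two configurations above might be set for a hypothetical provider named "test"; the implementation class name is a made-up placeholder.)

```scala
import org.apache.spark.SparkConf

object CustomProviderConf {
  val conf: SparkConf = new SparkConf()
    // Enable the user-supplied provider named "test".
    .set("spark.yarn.security.tokens.test.enabled", "true")
    // Point Spark at the implementation class; this class name is hypothetical.
    .set("spark.yarn.security.tokens.test.class", "com.example.security.TestTokenProvider")
}
```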
Current Status
How was this patch tested?
Unit tests and integration tests.
Please suggest and review; any comment is greatly appreciated.