@sarvekshayr sarvekshayr commented Jul 9, 2025

What changes were proposed in this pull request?

The method HAUtils.getCAListWithRetry() currently uses RetryPolicies.retryForeverWithFixedSleep(), which causes the ozone admin container create command to retry indefinitely on any failure. When authentication is not set up (i.e., kinit has not been run), the command retries forever instead of handling the resulting AccessControlException.
This change makes the retry policy detect AccessControlException and fail fast.

Note: This is only seen on an SCM HA cluster.
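
The fail-fast idea described above can be illustrated with a minimal, standard-library-only sketch. It is not the code from this patch: `FailFastRetry`, `AuthFailureException` (a stand-in for Hadoop's AccessControlException), `isAuthFailure`, and `callWithRetry` are hypothetical names, and the bounded attempt count is an assumption made so the example terminates. The point it shows is that the retry loop inspects the exception's cause chain and rethrows immediately on an authentication failure, while other failures are still retried after a fixed sleep.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

final class FailFastRetry {

  /** Hypothetical stand-in for Hadoop's AccessControlException. */
  public static class AuthFailureException extends Exception {
    public AuthFailureException(String message) {
      super(message);
    }
  }

  /** Returns true if t or any of its causes is an AuthFailureException. */
  static boolean isAuthFailure(Throwable t) {
    // Bound the walk so a malformed (cyclic) cause chain cannot loop forever.
    for (int depth = 0; t != null && depth < 16; depth++) {
      if (t instanceof AuthFailureException) {
        return true;
      }
      t = t.getCause();
    }
    return false;
  }

  /** Retries task with a fixed sleep, but rethrows authentication failures at once. */
  static <T> T callWithRetry(Callable<T> task, int maxAttempts, long sleepMillis)
      throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return task.call();
      } catch (Exception e) {
        if (isAuthFailure(e)) {
          throw e; // retrying cannot fix missing credentials, so fail fast
        }
        last = e;
        if (attempt < maxAttempts) {
          Thread.sleep(sleepMillis);
        }
      }
    }
    throw last;
  }

  public static void main(String[] args) {
    AtomicInteger attempts = new AtomicInteger();
    try {
      callWithRetry(() -> {
        attempts.incrementAndGet();
        // Simulates an RPC layer wrapping the authentication error.
        throw new IOException(new AuthFailureException("Client cannot authenticate"));
      }, 5, 10L);
    } catch (Exception e) {
      System.out.println("Gave up after " + attempts.get() + " attempt(s): " + e.getCause());
    }
  }
}
```

Checking the whole cause chain, rather than only the top-level exception, matters in this sketch because RPC layers commonly wrap the original error before it reaches the retry loop.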

What is the link to the Apache JIRA?

HDDS-13405

How was this patch tested?

Before the fix

bash-5.1$ OZONE_LOGLEVEL=INFO ozone admin container create
2025-07-08 10:44:48,181 [main] INFO proxy.SCMContainerLocationFailoverProxyProvider: Created fail-over proxy for protocol StorageContainerLocationProtocolPB with 3 nodes: [nodeId=scm2,nodeAddress=scm2.org/172.25.0.117:9860, nodeId=scm1,nodeAddress=scm1.org/172.25.0.116:9860, nodeId=scm3,nodeAddress=scm3.org/172.25.0.118:9860]
2025-07-08 10:44:48,229 [main] INFO proxy.SecretKeyProtocolFailoverProxyProvider: Created fail-over proxy for protocol SecretKeyProtocolScmPB with 3 nodes: [nodeId=scm2,nodeAddress=scm2.org/172.25.0.117:9961, nodeId=scm1,nodeAddress=scm1.org/172.25.0.116:9961, nodeId=scm3,nodeAddress=scm3.org/172.25.0.118:9961]
2025-07-08 10:44:48,402 [main] INFO proxy.SCMSecurityProtocolFailoverProxyProvider: Created fail-over proxy for protocol SCMSecurityProtocolPB with 3 nodes: [nodeId=scm2,nodeAddress=scm2.org/172.25.0.117:9961, nodeId=scm1,nodeAddress=scm1.org/172.25.0.116:9961, nodeId=scm3,nodeAddress=scm3.org/172.25.0.118:9961]
2025-07-08 10:44:48,470 [main] WARN ipc.Client: Exception encountered while connecting to the server scm1.org/172.25.0.116:9961
org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[KERBEROS]
        at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:179)
        at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:399)
        at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:578)
        at org.apache.hadoop.ipc.Client$Connection.access$2100(Client.java:364)
        at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:799)
        at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:795)
        at java.base/java.security.AccessController.doPrivileged(AccessController.java:714)
        at java.base/javax.security.auth.Subject.doAs(Subject.java:525)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1953)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:795)
        at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:364)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1649)
        at org.apache.hadoop.ipc.Client.call(Client.java:1473)
        at org.apache.hadoop.ipc.Client.call(Client.java:1426)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:250)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:132)
        at jdk.proxy2/jdk.proxy2.$Proxy22.submitRequest(Unknown Source)
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
        at java.base/java.lang.reflect.Method.invoke(Method.java:580)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:437)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:170)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:162)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:100)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:366)
        at jdk.proxy2/jdk.proxy2.$Proxy22.submitRequest(Unknown Source)
        at org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.submitRequest(SCMSecurityProtocolClientSideTranslatorPB.java:93)
        at org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.listCACertificate(SCMSecurityProtocolClientSideTranslatorPB.java:363)
        at org.apache.hadoop.hdds.utils.HAUtils.waitForCACerts(HAUtils.java:374)
        at org.apache.hadoop.hdds.utils.HAUtils.lambda$buildCAX509List$3(HAUtils.java:401)
        at org.apache.hadoop.hdds.utils.RetriableTask.call(RetriableTask.java:55)
        at org.apache.hadoop.hdds.utils.HAUtils.getCAListWithRetry(HAUtils.java:360)
        at org.apache.hadoop.hdds.utils.HAUtils.buildCAX509List(HAUtils.java:401)
        at org.apache.hadoop.hdds.scm.cli.ContainerOperationClient.lambda$newXCeiverClientManager$0(ContainerOperationClient.java:123)
        at org.apache.hadoop.hdds.scm.client.ClientTrustManager.loadCerts(ClientTrustManager.java:148)
        at org.apache.hadoop.hdds.scm.client.ClientTrustManager.<init>(ClientTrustManager.java:110)
        at org.apache.hadoop.hdds.scm.cli.ContainerOperationClient.newXCeiverClientManager(ContainerOperationClient.java:125)
        at org.apache.hadoop.hdds.scm.cli.ContainerOperationClient.getXceiverClientManager(ContainerOperationClient.java:91)
        at org.apache.hadoop.hdds.scm.cli.ContainerOperationClient.createContainer(ContainerOperationClient.java:212)
        at org.apache.hadoop.hdds.scm.cli.container.CreateSubcommand.execute(CreateSubcommand.java:59)
        at org.apache.hadoop.hdds.scm.cli.ScmSubcommand.call(ScmSubcommand.java:39)
        at org.apache.hadoop.hdds.scm.cli.ScmSubcommand.call(ScmSubcommand.java:29)
        at picocli.CommandLine.executeUserObject(CommandLine.java:2031)
        at picocli.CommandLine.access$1500(CommandLine.java:148)
        at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2469)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2461)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2423)
        at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
        at picocli.CommandLine$RunLast.execute(CommandLine.java:2425)
        at org.apache.hadoop.ozone.shell.Shell.lambda$execute$0(Shell.java:95)
        at org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:167)
        at org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:157)
        at org.apache.hadoop.ozone.shell.Shell.execute(Shell.java:95)
        at picocli.CommandLine.execute(CommandLine.java:2174)
        at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:89)
        at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:80)
        at org.apache.hadoop.ozone.admin.OzoneAdmin.main(OzoneAdmin.java:36)
2025-07-08 10:44:48,478 [main] INFO utils.RetriableTask: Execution of task getCAList failed, will be retried in 10000 ms
(retries forever)

After the fix

bash-5.1$ ozone admin container create
java.security.cert.CertificateException: org.apache.hadoop.security.AccessControlException: Permission denied.

Robot test

==============================================================================
Container-Create :: Test ozone admin container create command without kinit...
==============================================================================
Create container without kinit                                        | PASS |
------------------------------------------------------------------------------
Container-Create :: Test ozone admin container create command with... | PASS |
1 test, 1 passed, 0 failed
==============================================================================

HDDS-13405. ozone admin container create runs forever without kinit

New changes
@kerneltime

Can you add a robot test for this change?

Co-authored-by: Aryan Gupta <[email protected]>

@aryangupta1998 aryangupta1998 left a comment

Thanks for the patch @sarvekshayr, LGTM!

@aryangupta1998

Thanks for the contribution, @sarvekshayr, and for the reviews @jojochuang and @kerneltime!

@aryangupta1998 aryangupta1998 merged commit f0dd236 into apache:master Jul 14, 2025
83 of 84 checks passed
errose28 added a commit to errose28/ozone that referenced this pull request Jul 22, 2025
* master: (90 commits)
  HDDS-13308. OM should expose Ratis config for increasing pending write limits (apache#8668)
  HDDS-8903. Add validation for ozone.om.snapshot.db.max.open.files. (apache#8787)
  HDDS-13429. Custom metadata headers with uppercase characters are not supported (apache#8805)
  HDDS-13448. DeleteBlocksCommandHandler thread stop for normal exception (apache#8816)
  HDDS-13346. Intermittent failure in TestCloseContainer#testContainerChecksumForClosedContainer (apache#8771)
  HDDS-13125. Add metrics for monitoring the SST file pruning threads. (apache#8764)
  HDDS-13367. [Docs] User doc for container balancer. (apache#8726)
  HDDS-13200. OM RocksDB Grafana Dashbroad shows no data on all panels (apache#8577)
  HDDS-13428. Recon - Retrigger of build whole NSSummary tree task submission inconsistency. (apache#8793)
  HDDS-13378. [Docs] Add a Production page under Getting Started (apache#8734)
  HDDS-13403. [Docs] Make feature proposal process more visible. (apache#8758)
  HDDS-11797. Remove cyclic dependency between SCMSafeModeManager and SafeModeRules (apache#8782)
  HDDS-13213. KeyDeletingService should limit task size by both key count and serialized size. (apache#8757)
  HDDS-13387. OMSnapshotCreateRequest logs invalid warning about DefaultReplicationConfig (apache#8760)
  HDDS-13405. ozone admin container create runs forever without kinit (apache#8765)
  HDDS-11514. Set optimal default values for delete configurations based on live cluster testing. (apache#8766)
  HDDS-13376. Add server-side limit note to ozone sh snapshot diff --page-size option (apache#8791)
  HDDS-11679. Support multiple S3Gs in MiniOzoneCluster (apache#8733)
  HDDS-13424. Use lsof instead of fuser to find if file is used in AbstractTestChunkManager (apache#8790)
  HDDS-13427. Bump awssdk to 2.31.78 (apache#8792)
  ...