@devmadhuu
Contributor

What changes were proposed in this pull request?

SCM's IncrementalContainerReportHandler and Recon's IncrementalContainerReportHandler differ: Recon always connects to SCM and verifies each container before adding a new container to its own containerStateManager cache. This can become a bottleneck when SCM responds slowly, because frequent ICR requests may pile up in the queue. This PR makes the following improvements:

  • Recon verifies containers with SCM in batches when it receives ICR requests from DNs.
    
  • Reduce the SCM client configs that Recon uses when connecting to SCM:
       - `hdds.scmclient.rpc.timeout` - 1 min (the default value is 15 mins)
       - `hdds.scmclient.failover.max.retry` - 3 (the default is computed dynamically; with default settings the computed value is 15)
        For Recon's connection to SCM, these two SCM client configs are set to the new values listed above. They are also exposed through new Recon configs, so they can be adjusted independently in Recon.
    
      **New configs in Recon:**
           -  `ozone.recon.scmclient.rpc.timeout`
           -  `ozone.recon.scmclient.failover.max.retry`
    
  • Merge each incoming incremental container report (ICR) into the existing list of ICR reports.
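
To illustrate the batching idea from the first bullet, here is a minimal sketch and not the code from this PR. The class name `IcrBatchVerifySketch`, the methods `partition` and `verifyInBatches`, and the `scmLookup` function are all hypothetical stand-ins; in particular, `scmLookup` only simulates a remote call to SCM that takes a batch of container IDs and returns the ones SCM knows about.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch: names and signatures are illustrative, not Ozone APIs.
class IcrBatchVerifySketch {

  /** Splits ids into consecutive chunks of at most batchSize elements. */
  static List<List<Long>> partition(List<Long> ids, int batchSize) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive");
    }
    List<List<Long>> batches = new ArrayList<>();
    for (int i = 0; i < ids.size(); i += batchSize) {
      int end = Math.min(i + batchSize, ids.size());
      batches.add(new ArrayList<>(ids.subList(i, end)));
    }
    return batches;
  }

  /**
   * Verifies unknown container IDs with one lookup call per batch instead of
   * one call per container. scmLookup is an assumed stand-in for the remote
   * SCM call and returns the subset of the batch that SCM recognizes.
   */
  static Set<Long> verifyInBatches(List<Long> unknownIds, int batchSize,
      Function<List<Long>, List<Long>> scmLookup) {
    Set<Long> confirmed = new HashSet<>();
    for (List<Long> batch : partition(unknownIds, batchSize)) {
      confirmed.addAll(scmLookup.apply(batch));
    }
    return confirmed;
  }

  public static void main(String[] args) {
    List<Long> unknownIds = List.of(101L, 102L, 103L, 104L, 105L);
    // Fake lookup: pretend SCM only knows about odd container IDs.
    Function<List<Long>, List<Long>> fakeScm = batch -> {
      List<Long> known = new ArrayList<>();
      for (Long id : batch) {
        if (id % 2 == 1) {
          known.add(id);
        }
      }
      return known;
    };
    // Five IDs with a batch size of 2 result in three lookup calls.
    System.out.println(verifyInBatches(unknownIds, 2, fakeScm));
  }
}
```

With a slow SCM, the number of round trips drops from one per container to one per batch, which is the effect the first bullet aims for. The batch size used here is arbitrary.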

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-9883

How was this patch tested?

Tested using existing JUnit tests after updating the existing test TestReconIncrementalContainerReportHandler.

@devmadhuu
Contributor Author

@sumitagrawl Pls review.

@devmadhuu devmadhuu marked this pull request as ready for review December 15, 2023 06:07
@adoroszlai adoroszlai changed the title HDDS-9883. Recon - Improve the performance of processing of IncrementalContainerReport requests from DN. HDDS-9883. Recon - Improve the performance of processing IncrementalContainerReport from DN Dec 15, 2023
@kerneltime
Contributor

cc @tanvipenumudy

Contributor

@sumitagrawl sumitagrawl left a comment

@devmadhuu Thanks for working on this. I have a few comments.

Contributor

@sumitagrawl sumitagrawl left a comment

@devmadhuu LGTM

@sumitagrawl sumitagrawl merged commit 6021582 into apache:master Jan 4, 2024
@jojochuang
Contributor

I'm getting this exception, which looks related to this PR:

2024-02-22 22:33:39,568 WARN [IPC Server handler 22 on 9891]-org.apache.hadoop.ipc.Server: IPC Server handler 22 on 9891, call Call#36 Retry#0 org.apache.hadoop.ozone.protocol.ReconDatanodeProtocol.submitRequest from 10.140.112.130:34368
java.lang.UnsupportedOperationException
at java.util.Collections$UnmodifiableCollection.addAll(Collections.java:1067)
at org.apache.hadoop.hdds.scm.server.SCMDatanodeHeartbeatDispatcher$IncrementalContainerReportFromDatanode.mergeReport(SCMDatanodeHeartbeatDispatcher.java:384)
at org.apache.hadoop.ozone.recon.scm.ReconContainerReportQueue.mergeIcr(ReconContainerReportQueue.java:41)
at org.apache.hadoop.hdds.scm.server.ContainerReportQueue.addIncrementalReport(ContainerReportQueue.java:115)
at org.apache.hadoop.hdds.scm.server.ContainerReportQueue.addValue(ContainerReportQueue.java:173)
at org.apache.hadoop.hdds.scm.server.ContainerReportQueue.add(ContainerReportQueue.java:187)
at org.apache.hadoop.hdds.scm.server.ContainerReportQueue.add(ContainerReportQueue.java:42)
at org.apache.hadoop.hdds.server.events.FixedThreadPoolWithAffinityExecutor.onMessage(FixedThreadPoolWithAffinityExecutor.java:178)
at org.apache.hadoop.hdds.server.events.EventQueue.fireEvent(EventQueue.java:220)
at org.apache.hadoop.hdds.scm.server.SCMDatanodeHeartbeatDispatcher.dispatch(SCMDatanodeHeartbeatDispatcher.java:159)
at org.apache.hadoop.hdds.scm.server.SCMDatanodeProtocolServer.sendHeartbeat(SCMDatanodeProtocolServer.java:282)
at org.apache.hadoop.ozone.protocolPB.StorageContainerDatanodeProtocolServerSideTranslatorPB.processMessage(StorageContainerDatanodeProtocolServerSideTranslatorPB.java:113)
at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
at org.apache.hadoop.ozone.protocolPB.StorageContainerDatanodeProtocolServerSideTranslatorPB.submitRequest(StorageContainerDatanodeProtocolServerSideTranslatorPB.java:92)
at org.apache.hadoop.hdds.protocol.proto.StorageContainerDatanodeProtocolProtos$StorageContainerDatanodeProtocolService$2.callBlockingMethod(StorageContainerDatanodeProtocolProtos.java:45283)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)

@devmadhuu
Contributor Author

> I'm getting this exception which looks related to this PR

This issue is being tracked at HDDS-10413
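
For readers unfamiliar with this failure mode: the top frames of the trace show `addAll` being called on a `Collections$UnmodifiableCollection` from `mergeReport`. The sketch below reproduces that general pattern with plain JDK collections and shows one common way to avoid it, merging into a fresh mutable list. It is only an illustration. The class `MergeReportSketch` and its methods are hypothetical, and the actual fix is whatever HDDS-10413 ends up doing, which this thread does not describe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: not the Ozone mergeReport implementation.
class MergeReportSketch {

  /** Appends incoming to existing in place; throws if existing is unmodifiable. */
  static void mergeInPlace(List<String> existing, List<String> incoming) {
    existing.addAll(incoming);
  }

  /** Merges both lists into a new mutable list, leaving the inputs untouched. */
  static List<String> mergeIntoCopy(List<String> existing, List<String> incoming) {
    List<String> merged = new ArrayList<>(existing.size() + incoming.size());
    merged.addAll(existing);
    merged.addAll(incoming);
    return merged;
  }

  public static void main(String[] args) {
    // An unmodifiable view, standing in for the queued report's container list.
    List<String> queued =
        Collections.unmodifiableList(Arrays.asList("container-1"));
    List<String> incoming = Arrays.asList("container-2");

    try {
      mergeInPlace(queued, incoming);
    } catch (UnsupportedOperationException e) {
      // Same exception type as in the trace above.
      System.out.println("in-place merge rejected: "
          + e.getClass().getSimpleName());
    }

    // Copy-based merge succeeds and returns a list that can be merged again.
    System.out.println(mergeIntoCopy(queued, incoming));
  }
}
```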

swamirishi pushed a commit to swamirishi/ozone that referenced this pull request Jun 10, 2024
…IncrementalContainerReport from DN (apache#5793)

(cherry picked from commit 6021582)
Change-Id: Ic9fa1c235a2c347cf9375f6a1da9a47f5dcba213
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 17, 2024
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 17, 2024
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 17, 2024