@ChenSammi
Contributor

What changes were proposed in this pull request?

Currently, when RatisServer is down (mainly due to a long GC pause that exceeds the Ratis close threshold), the Datanode keeps running and remains in the HEALTHY and IN_SERVICE state, which is confusing.

This task shuts down the Datanode after RatisServer goes down.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-10749

How was this patch tested?

Manual test

@ChenSammi
Contributor Author

ChenSammi commented Apr 25, 2024

A normal DN shutdown log. First XceiverServerRatis is stopped ("Stopping XceiverServerRatis 01effdc6-dad1-4bf3-916a-749d9aa7e5e5"), then ContainerStateMachine is stopped ("Stopping ContainerStateMachine for group-5EA60976374E").

2024-04-24 17:53:21,589 ERROR ozone.HddsDatanodeService (SignalLogger.java:handle(60)) - RECEIVED SIGNAL 2: SIGINT
2024-04-24 17:53:21,590 INFO  ozone.HddsDatanodeService (StringUtils.java:lambda$startupShutdownMessage$0(144)) - SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down HddsDatanodeService at SAMMICHEN-MB0/0.0.0.0
************************************************************/
2024-04-24 17:53:21,595 INFO  ozoneimpl.OzoneContainer (OzoneContainer.java:stop(482)) - Attempting to stop container services.
2024-04-24 17:53:21,595 WARN  ozoneimpl.AbstractBackgroundContainerScanner (AbstractBackgroundContainerScanner.java:handleRemainingSleep(134)) - Background container scan was interrupted.
2024-04-24 17:53:21,595 INFO  ozoneimpl.AbstractBackgroundContainerScanner (AbstractBackgroundContainerScanner.java:run(61)) - Thread[ContainerMetadataScanner,5,main] exiting.
2024-04-24 17:53:21,595 INFO  ozoneimpl.BackgroundContainerDataScanner (BackgroundContainerDataScanner.java:shutdown(141)) - ContainerDataScanner(/tmp/datanode1/storage/hdds) is shutting down. 
2024-04-24 17:53:21,595 WARN  ozoneimpl.AbstractBackgroundContainerScanner (AbstractBackgroundContainerScanner.java:handleRemainingSleep(134)) - Background container scan was interrupted.
2024-04-24 17:53:21,596 INFO  ozoneimpl.AbstractBackgroundContainerScanner (AbstractBackgroundContainerScanner.java:run(61)) - ContainerDataScanner(/tmp/datanode1/storage/hdds, DS-af727dc0-66f9-4db9-8f1f-8ce487a40766) exiting.
2024-04-24 17:53:21,596 INFO  ozoneimpl.OnDemandContainerDataScanner (OnDemandContainerDataScanner.java:shutdownScanner(206)) - On-demand container scanner is shutting down.
2024-04-24 17:53:21,606 INFO  ratis.XceiverServerRatis (XceiverServerRatis.java:stop(604)) - Stopping XceiverServerRatis 01effdc6-dad1-4bf3-916a-749d9aa7e5e5
2024-04-24 17:53:21,606 INFO  server.RaftServer (RaftServerProxy.java:lambda$close$9(416)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: close
2024-04-24 17:53:21,607 INFO  server.RaftServer$Division (RaftServerImpl.java:lambda$close$3(526)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5@group-5EA60976374E: shutdown
2024-04-24 17:53:21,607 INFO  server.GrpcService (GrpcService.java:closeImpl(311)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: shutdown server org.apache.ratis.grpc.server.GrpcClientProtocolService now
2024-04-24 17:53:21,607 INFO  util.JmxRegister (JmxRegister.java:unregister(73)) - Successfully un-registered JMX Bean with object name Ratis:service=RaftServer,group=group-5EA60976374E,id=01effdc6-dad1-4bf3-916a-749d9aa7e5e5
2024-04-24 17:53:21,607 INFO  impl.RoleInfo (RoleInfo.java:shutdownLeaderState(94)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: shutdown 01effdc6-dad1-4bf3-916a-749d9aa7e5e5@group-5EA60976374E-LeaderStateImpl
2024-04-24 17:53:21,610 INFO  server.GrpcService (GrpcService.java:closeImpl(320)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: shutdown server org.apache.ratis.grpc.server.GrpcClientProtocolService successfully
2024-04-24 17:53:21,610 INFO  server.GrpcService (GrpcService.java:closeImpl(311)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: shutdown server GrpcServerProtocolService now
2024-04-24 17:53:21,611 INFO  server.GrpcService (GrpcService.java:closeImpl(320)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: shutdown server GrpcServerProtocolService successfully
2024-04-24 17:53:21,611 INFO  server.GrpcService (GrpcService.java:closeImpl(311)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: shutdown server org.apache.ratis.grpc.server.GrpcAdminProtocolService now
2024-04-24 17:53:21,614 INFO  server.GrpcService (GrpcService.java:closeImpl(320)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: shutdown server org.apache.ratis.grpc.server.GrpcAdminProtocolService successfully
2024-04-24 17:53:21,614 INFO  impl.PendingRequests (PendingRequests.java:sendNotLeaderResponses(289)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5@group-5EA60976374E-PendingRequests: sendNotLeaderResponses
2024-04-24 17:53:21,620 INFO  impl.StateMachineUpdater (StateMachineUpdater.java:stopAndJoin(157)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5@group-5EA60976374E-StateMachineUpdater: set stopIndex = 2
2024-04-24 17:53:21,620 INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:takeSnapshot(359)) - group-5EA60976374E: Taking a snapshot at:(t:2, i:2) file /tmp/datanode1/ratis/e9e7ba3c-7686-4b3a-96fd-5ea60976374e/sm/snapshot.2_2
2024-04-24 17:53:21,621 INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:takeSnapshot(370)) - group-5EA60976374E: Finished taking a snapshot at:(t:2, i:2) file:/tmp/datanode1/ratis/e9e7ba3c-7686-4b3a-96fd-5ea60976374e/sm/snapshot.2_2 took: 1 ms
2024-04-24 17:53:21,622 INFO  impl.StateMachineUpdater (StateMachineUpdater.java:takeSnapshot(295)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5@group-5EA60976374E-StateMachineUpdater: Took a snapshot at index 2
2024-04-24 17:53:21,622 INFO  impl.StateMachineUpdater (StateMachineUpdater.java:lambda$new$0(98)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5@group-5EA60976374E-StateMachineUpdater: snapshotIndex: updateIncreasingly 0 -> 2
2024-04-24 17:53:21,623 INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:close(1150)) - Stopping ContainerStateMachine for group-5EA60976374E.
2024-04-24 17:53:21,623 INFO  server.RaftServer$Division (ServerState.java:close(427)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5@group-5EA60976374E: applyIndex: 2
2024-04-24 17:53:21,623 INFO  util.AwaitToRun (AwaitToRun.java:run(49)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5@group-5EA60976374E-cacheEviction-AwaitToRun-AwaitForSignal is interrupted
2024-04-24 17:53:21,695 INFO  segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:close(245)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5@group-5EA60976374E-SegmentedRaftLogWorker close()
2024-04-24 17:53:21,697 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(152)) - JvmPauseMonitor-01effdc6-dad1-4bf3-916a-749d9aa7e5e5: Stopped
2024-04-24 17:53:23,783 INFO  volume.HddsVolume (HddsVolume.java:closeDbStore(470)) - SchemaV3 db is stopped at /tmp/datanode1/storage/hdds/CID-9ba4109c-68b1-4311-9623-42f82149fb80/DS-af727dc0-66f9-4db9-8f1f-8ce487a40766/container.db for volume DS-af727dc0-66f9-4db9-8f1f-8ce487a40766
2024-04-24 17:53:23,783 INFO  utils.BackgroundService (BackgroundService.java:shutdown(160)) - Shutting down service BlockDeletingService
2024-04-24 17:53:23,784 INFO  utils.BackgroundService (BackgroundService.java:shutdown(160)) - Shutting down service StaleRecoveringContainerScrubbingService
2024-04-24 17:53:23,785 INFO  statemachine.DatanodeStateMachine (DatanodeStateMachine.java:stopDaemon(640)) - Ozone container server stopped.
2024-04-24 17:53:23,790 INFO  handler.ContextHandler (ContextHandler.java:doStop(1159)) - Stopped o.e.j.w.WebAppContext@3baf6936{hddsDatanode,/,null,STOPPED}{file:/Users/sammi/workspace/hadoop-ozone/hadoop-hdds/container-service/target/classes/webapps/hddsDatanode}
2024-04-24 17:53:23,794 INFO  server.AbstractConnector (AbstractConnector.java:doStop(383)) - Stopped ServerConnector@4f453e63{HTTP/1.1, (http/1.1)}{SAMMICHEN-MB0:9882}
2024-04-24 17:53:23,794 INFO  server.session (HouseKeeper.java:stopScavenging(149)) - node0 Stopped scavenging
2024-04-24 17:53:23,794 INFO  handler.ContextHandler (ContextHandler.java:doStop(1159)) - Stopped o.e.j.s.ServletContextHandler@1816e24a{static,/static,file:///Users/sammi/workspace/hadoop-ozone/hadoop-hdds/container-service/target/classes/webapps/static,STOPPED}
2024-04-24 17:53:23,795 INFO  ozone.HddsDatanodeClientProtocolServer (HddsDatanodeClientProtocolServer.java:stop(83)) - Stopping the RPC server for Client Protocol
2024-04-24 17:53:23,795 INFO  ipc.Server (Server.java:stop(3523)) - Stopping server on 19864
2024-04-24 17:53:23,796 INFO  ipc.Server (Server.java:run(1434)) - Stopping IPC Server listener on 19864
2024-04-24 17:53:23,796 INFO  ipc.Server (Server.java:run(1567)) - Stopping IPC Server Responder

@ChenSammi
Contributor Author

A DN shutdown caused by the Ratis server shutting down. First ContainerStateMachine is closed ("Container statemachine is closed by ratis, terminating HddsDatanodeService"), then XceiverServerRatis is stopped ("Stopping XceiverServerRatis 01effdc6-dad1-4bf3-916a-749d9aa7e5e5").

2024-04-24 18:06:16,666 WARN  util.JvmPauseMonitor (JvmPauseMonitor.java:detectPause(168)) - JvmPauseMonitor-01effdc6-dad1-4bf3-916a-749d9aa7e5e5: Detected pause in JVM or host machine approximately 93.265s without any GCs.
2024-04-24 18:06:16,666 ERROR server.RaftServer (RaftServerProxy.java:handleJvmPause(237)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: JVM pause detected 93.265s longer than the close-threshold 60s, shutting down ...
2024-04-24 18:06:16,678 INFO  server.RaftServer (RaftServerProxy.java:lambda$close$9(416)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: close
2024-04-24 18:06:16,684 INFO  server.RaftServer$Division (RaftServerImpl.java:lambda$close$3(526)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5@group-5EA60976374E: shutdown
2024-04-24 18:06:16,685 INFO  server.GrpcService (GrpcService.java:closeImpl(311)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: shutdown server org.apache.ratis.grpc.server.GrpcClientProtocolService now
2024-04-24 18:06:16,690 INFO  util.JmxRegister (JmxRegister.java:unregister(73)) - Successfully un-registered JMX Bean with object name Ratis:service=RaftServer,group=group-5EA60976374E,id=01effdc6-dad1-4bf3-916a-749d9aa7e5e5
2024-04-24 18:06:16,691 INFO  impl.RoleInfo (RoleInfo.java:shutdownLeaderState(94)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: shutdown 01effdc6-dad1-4bf3-916a-749d9aa7e5e5@group-5EA60976374E-LeaderStateImpl
2024-04-24 18:06:16,724 INFO  impl.PendingRequests (PendingRequests.java:sendNotLeaderResponses(289)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5@group-5EA60976374E-PendingRequests: sendNotLeaderResponses
2024-04-24 18:06:16,727 INFO  server.GrpcService (GrpcService.java:closeImpl(320)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: shutdown server org.apache.ratis.grpc.server.GrpcClientProtocolService successfully
2024-04-24 18:06:16,727 INFO  server.GrpcService (GrpcService.java:closeImpl(311)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: shutdown server GrpcServerProtocolService now
2024-04-24 18:06:16,728 INFO  impl.StateMachineUpdater (StateMachineUpdater.java:stopAndJoin(157)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5@group-5EA60976374E-StateMachineUpdater: set stopIndex = 4
2024-04-24 18:06:16,729 INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:takeSnapshot(359)) - group-5EA60976374E: Taking a snapshot at:(t:3, i:4) file /tmp/datanode1/ratis/e9e7ba3c-7686-4b3a-96fd-5ea60976374e/sm/snapshot.3_4
2024-04-24 18:06:16,729 INFO  server.GrpcService (GrpcService.java:closeImpl(320)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: shutdown server GrpcServerProtocolService successfully
2024-04-24 18:06:16,729 INFO  server.GrpcService (GrpcService.java:closeImpl(311)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: shutdown server org.apache.ratis.grpc.server.GrpcAdminProtocolService now
2024-04-24 18:06:16,732 INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:takeSnapshot(370)) - group-5EA60976374E: Finished taking a snapshot at:(t:3, i:4) file:/tmp/datanode1/ratis/e9e7ba3c-7686-4b3a-96fd-5ea60976374e/sm/snapshot.3_4 took: 4 ms
2024-04-24 18:06:16,733 INFO  server.GrpcService (GrpcService.java:closeImpl(320)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5: shutdown server org.apache.ratis.grpc.server.GrpcAdminProtocolService successfully
2024-04-24 18:06:16,734 INFO  impl.StateMachineUpdater (StateMachineUpdater.java:takeSnapshot(295)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5@group-5EA60976374E-StateMachineUpdater: Took a snapshot at index 4
2024-04-24 18:06:16,734 INFO  impl.StateMachineUpdater (StateMachineUpdater.java:lambda$new$0(98)) - 01effdc6-dad1-4bf3-916a-749d9aa7e5e5@group-5EA60976374E-StateMachineUpdater: snapshotIndex: updateIncreasingly 2 -> 4
2024-04-24 18:06:16,740 ERROR ratis.ContainerStateMachine (ContainerStateMachine.java:close(1142)) - Container statemachine is closed by ratis, terminating HddsDatanodeService
2024-04-24 18:06:26,754 INFO  ozoneimpl.OzoneContainer (OzoneContainer.java:stop(482)) - Attempting to stop container services.
2024-04-24 18:06:26,754 WARN  ozoneimpl.AbstractBackgroundContainerScanner (AbstractBackgroundContainerScanner.java:handleRemainingSleep(134)) - Background container scan was interrupted.
2024-04-24 18:06:26,754 INFO  ozoneimpl.AbstractBackgroundContainerScanner (AbstractBackgroundContainerScanner.java:run(61)) - Thread[ContainerMetadataScanner,5,main] exiting.
2024-04-24 18:06:26,755 INFO  ozoneimpl.BackgroundContainerDataScanner (BackgroundContainerDataScanner.java:shutdown(141)) - ContainerDataScanner(/tmp/datanode1/storage/hdds) is shutting down. 
2024-04-24 18:06:26,755 WARN  ozoneimpl.AbstractBackgroundContainerScanner (AbstractBackgroundContainerScanner.java:handleRemainingSleep(134)) - Background container scan was interrupted.
2024-04-24 18:06:26,755 INFO  ozoneimpl.AbstractBackgroundContainerScanner (AbstractBackgroundContainerScanner.java:run(61)) - ContainerDataScanner(/tmp/datanode1/storage/hdds, DS-af727dc0-66f9-4db9-8f1f-8ce487a40766) exiting.
2024-04-24 18:06:26,755 INFO  ozoneimpl.OnDemandContainerDataScanner (OnDemandContainerDataScanner.java:shutdownScanner(206)) - On-demand container scanner is shutting down.
2024-04-24 18:06:26,756 INFO  ratis.XceiverServerRatis (XceiverServerRatis.java:stop(604)) - Stopping XceiverServerRatis 01effdc6-dad1-4bf3-916a-749d9aa7e5e5
2024-04-24 18:06:26,757 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(152)) - JvmPauseMonitor-01effdc6-dad1-4bf3-916a-749d9aa7e5e5: Stopped
2024-04-24 18:06:28,892 INFO  volume.HddsVolume (HddsVolume.java:closeDbStore(470)) - SchemaV3 db is stopped at /tmp/datanode1/storage/hdds/CID-9ba4109c-68b1-4311-9623-42f82149fb80/DS-af727dc0-66f9-4db9-8f1f-8ce487a40766/container.db for volume DS-af727dc0-66f9-4db9-8f1f-8ce487a40766
2024-04-24 18:06:28,893 INFO  utils.BackgroundService (BackgroundService.java:shutdown(160)) - Shutting down service BlockDeletingService
2024-04-24 18:06:28,893 INFO  utils.BackgroundService (BackgroundService.java:shutdown(160)) - Shutting down service StaleRecoveringContainerScrubbingService
2024-04-24 18:06:28,894 INFO  statemachine.DatanodeStateMachine (DatanodeStateMachine.java:stopDaemon(640)) - Ozone container server stopped.
2024-04-24 18:06:28,899 INFO  handler.ContextHandler (ContextHandler.java:doStop(1159)) - Stopped o.e.j.w.WebAppContext@5fbdc49b{hddsDatanode,/,null,STOPPED}{file:/Users/sammi/workspace/hadoop-ozone/hadoop-hdds/container-service/target/classes/webapps/hddsDatanode}
2024-04-24 18:06:28,903 INFO  server.AbstractConnector (AbstractConnector.java:doStop(383)) - Stopped ServerConnector@7fc7c4a{HTTP/1.1, (http/1.1)}{SAMMICHEN-MB0:9882}
2024-04-24 18:06:28,903 INFO  server.session (HouseKeeper.java:stopScavenging(149)) - node0 Stopped scavenging
2024-04-24 18:06:28,903 INFO  handler.ContextHandler (ContextHandler.java:doStop(1159)) - Stopped o.e.j.s.ServletContextHandler@76c387f9{static,/static,file:///Users/sammi/workspace/hadoop-ozone/hadoop-hdds/container-service/target/classes/webapps/static,STOPPED}
2024-04-24 18:06:28,904 INFO  ozone.HddsDatanodeClientProtocolServer (HddsDatanodeClientProtocolServer.java:stop(83)) - Stopping the RPC server for Client Protocol
2024-04-24 18:06:28,905 INFO  ipc.Server (Server.java:stop(3523)) - Stopping server on 19864
2024-04-24 18:06:28,905 INFO  ipc.Server (Server.java:run(1434)) - Stopping IPC Server listener on 19864
2024-04-24 18:06:28,905 INFO  ipc.Server (Server.java:run(1567)) - Stopping IPC Server Responder
2024-04-24 18:06:28,908 INFO  util.ExitUtil (ExitUtil.java:terminate(241)) - Exiting with status 1: ExitException
2024-04-24 18:06:28,909 INFO  ozone.HddsDatanodeService (StringUtils.java:lambda$startupShutdownMessage$0(144)) - SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down HddsDatanodeService at SAMMICHEN-MB0/0.0.0.0
************************************************************/

Process finished with exit code 1
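
For context on the "JVM pause detected 93.265s longer than the close-threshold 60s" line in the log above, the check can be sketched roughly as follows. This is only an illustration of the general pause-monitor idea (sleep for a fixed interval, treat any extra elapsed time as a pause, compare it with a threshold); it is not the Ratis JvmPauseMonitor code. The class name PauseCheck and the 500 ms sleep interval are assumptions made for the example; only the 60s threshold and the 93.265s pause come from the log.

```java
import java.time.Duration;

/** Hypothetical sketch of a pause check; not the Ratis implementation. */
final class PauseCheck {
  private final Duration closeThreshold;

  PauseCheck(Duration closeThreshold) {
    this.closeThreshold = closeThreshold;
  }

  /** Time elapsed beyond the requested sleep is treated as the pause. */
  static Duration extraSleep(Duration requested, Duration actuallyElapsed) {
    Duration extra = actuallyElapsed.minus(requested);
    return extra.isNegative() ? Duration.ZERO : extra;
  }

  /** True when the pause is strictly longer than the close threshold. */
  boolean shouldClose(Duration pause) {
    return pause.compareTo(closeThreshold) > 0;
  }

  public static void main(String[] args) {
    // 60s threshold as in the log; the 500 ms sleep interval is an assumption.
    PauseCheck check = new PauseCheck(Duration.ofSeconds(60));
    Duration pause = extraSleep(Duration.ofMillis(500), Duration.ofMillis(93_765));
    System.out.println("pause=" + pause.toMillis() + "ms, close=" + check.shouldClose(pause));
  }
}
```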

Comment on lines 1143 to 1145
// wait a while for other pipeline's ContainerStateMachine.close() called.
try {
Thread.sleep(10000);
Contributor

Is there a more reliable way to wait for the other pipelines to close than sleeping here?

And what happens if some pipelines are still not closed after the 10-second wait?

Contributor Author

@ChenSammi ChenSammi Apr 29, 2024

ContainerStateMachine.java is invoked per pipeline and has no knowledge of the other pipelines. The 10s wait is meant to give the other pipelines time to close. The ContainerStateMachine.close() call involves only in-memory operations, so missing a call to it is not a big issue. Shutting down immediately or waiting 5s or 10s therefore makes little difference; I just think waiting a while is better, as is done when an executor pool is shut down.
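
As a point of comparison for the question about a fixed sleep, a bounded wait on a latch is one possible alternative. The sketch below assumes some shared component knows the total number of Raft groups, which, as noted above, ContainerStateMachine itself does not. GroupCloseWaiter and its methods are hypothetical names for illustration and are not part of Ozone or Ratis.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper; the names are illustrative, not from the Ozone code base. */
final class GroupCloseWaiter {
  private final CountDownLatch remaining;

  GroupCloseWaiter(int totalGroups) {
    this.remaining = new CountDownLatch(totalGroups);
  }

  /** Called once by each state machine after it has closed. */
  void groupClosed() {
    remaining.countDown();
  }

  /** Returns true if every group closed before the timeout, false otherwise. */
  boolean awaitAllClosed(long timeout, TimeUnit unit) {
    try {
      return remaining.await(timeout, unit);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for the caller
      return false;
    }
  }

  /** Number of groups that have not reported a close yet. */
  long stillOpen() {
    return remaining.getCount();
  }

  public static void main(String[] args) {
    GroupCloseWaiter waiter = new GroupCloseWaiter(3);
    for (int i = 0; i < 3; i++) {
      new Thread(waiter::groupClosed).start();
    }
    // Returns as soon as all groups report in, or after 10s at the latest.
    boolean allClosed = waiter.awaitAllClosed(10, TimeUnit.SECONDS);
    System.out.println("allClosed=" + allClosed + ", stillOpen=" + waiter.stillOpen());
  }
}
```

Compared with Thread.sleep(10000), this returns early when all groups have closed, and on timeout the caller can log stillOpen() before deciding whether to proceed with the shutdown anyway.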

Contributor

@adoroszlai adoroszlai left a comment

Thanks @ChenSammi for working on this.

  • Each Raft group, which corresponds to a pipeline membership in Datanode, has its own ContainerStateMachine.
  • When SCM detects a dead datanode, it closes all pipelines associated with it, which triggers close of the state machine in the other two nodes.

So with this patch, stopping a datanode or closing a pipeline kills other datanodes.

Note that with multi-Raft the effect can cascade, since a datanode may be grouped with a different set of nodes for each Raft group.

Repro:

cd hadoop-ozone/dist/target/ozone-1.5.0-SNAPSHOT/compose/ozone
OZONE_DATANODES=6 ./run.sh -d
docker-compose exec scm ozone admin safemode wait -t 60
docker-compose ps
docker-compose up -d --no-recreate --scale datanode=5
docker-compose ps
sleep 120
docker-compose ps

Datanodes at the last step:

      Name                    Command               State                                             Ports                                          
-----------------------------------------------------------------------------------------------------------------------------------------------------
ozone_datanode_1   /usr/local/bin/dumb-init - ...   Up       0.0.0.0:33008->19864/tcp,:::33008->19864/tcp, 0.0.0.0:33011->9882/tcp,:::33011->9882/tcp
ozone_datanode_2   /usr/local/bin/dumb-init - ...   Exit 1                                                                                           
ozone_datanode_3   /usr/local/bin/dumb-init - ...   Up       0.0.0.0:33014->19864/tcp,:::33014->19864/tcp, 0.0.0.0:33015->9882/tcp,:::33015->9882/tcp
ozone_datanode_4   /usr/local/bin/dumb-init - ...   Up       0.0.0.0:33006->19864/tcp,:::33006->19864/tcp, 0.0.0.0:33007->9882/tcp,:::33007->9882/tcp
ozone_datanode_5   /usr/local/bin/dumb-init - ...   Exit 1                                                                                           

@ChenSammi
Contributor Author

@adoroszlai, I noticed the impact on the integration tests too. It looks like terminating the DN from ContainerStateMachine is not a good idea. Let me think about whether there are other solutions.

@adoroszlai adoroszlai marked this pull request as draft May 3, 2024 09:46
@ChenSammi
Contributor Author

ChenSammi commented May 16, 2024

Waiting for a Ratis release that includes https://issues.apache.org/jira/browse/RATIS-2066.

@jojochuang
Contributor

I was made aware that for OM if Ratis server experiences a long pause, Ratis state machine crashes itself and that shuts down OM: https://issues.apache.org/jira/browse/HDDS-6141

@ChenSammi
Contributor Author

I was made aware that for OM if Ratis server experiences a long pause, Ratis state machine crashes itself and that shuts down OM: https://issues.apache.org/jira/browse/HDDS-6141

Both OM and SCM shut themselves down after a long pause.

@ChenSammi ChenSammi marked this pull request as ready for review July 1, 2024 10:02
@ChenSammi
Contributor Author

Manually stopping the datanode; the related datanode log:

2024-07-01 17:34:54,237 INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:notifyServerShutdown(936)) - group-2D6AB2E224A3 is closed by HddsDatanodeService
2024-07-01 17:34:54,774 INFO  segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:close(248)) - 9c367fb6-68b0-487d-bb10-3e8c0da9b148@group-6CC213E8C815-SegmentedRaftLogWorker close()
2024-07-01 17:34:54,775 INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:notifyServerShutdown(936)) - group-6CC213E8C815 is closed by HddsDatanodeService
2024-07-01 17:34:54,805 INFO  segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:close(248)) - 9c367fb6-68b0-487d-bb10-3e8c0da9b148@group-86A881EBB3A5-SegmentedRaftLogWorker close()
2024-07-01 17:34:54,812 INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:notifyServerShutdown(936)) - group-86A881EBB3A5 is closed by HddsDatanodeService

Manually pausing the DN process and then resuming it:

2024-07-01 17:56:07,572 INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:notifyServerShutdown(896)) - group-60029B7F6B87 is closed by ratis
2024-07-01 17:56:07,585 INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:notifyServerShutdown(896)) - group-86A881EBB3A5 is closed by ratis
2024-07-01 17:56:07,586 INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:notifyServerShutdown(896)) - group-6CC213E8C815 is closed by ratis
2024-07-01 17:56:12,580 ERROR ratis.ContainerStateMachine (ContainerStateMachine.java:lambda$notifyServerShutdown$9(916)) - Container statemachine is closed by ratis, terminating HddsDatanodeService. closed(3)/total(3)

@ChenSammi ChenSammi dismissed adoroszlai’s stale review July 1, 2024 10:28

Comment addressed

@ChenSammi
Contributor Author

All three failed misc acceptance runs are due to

failed to solve: process "/bin/sh -c sudo yum install -y openssh-clients openssh-server" did not complete successfully: exit code: 1

The current logs do not show why it failed. @adoroszlai, do you have any idea about this issue?

@ChenSammi
Contributor Author

Looks like the problem is

 > [om  2/15] RUN sudo yum install -y openssh-clients openssh-server:                                                                                                                                                                                      
#0 0.519 Loaded plugins: fastestmirror, ovl                                                                                                                                                                                                                
#0 0.783 Determining fastest mirrors                                                                                                                                                                                                                       
#0 1.328 Could not retrieve mirrorlist http://mirrorlist.centos.org/?release=7&arch=aarch64&repo=os&infra=container error was
#0 1.328 14: curl#6 - "Could not resolve host: mirrorlist.centos.org; Unknown error"
#0 1.338 
#0 1.338 
#0 1.338  One of the configured repositories failed (Unknown),
#0 1.338  and yum doesn't have enough cached data to continue. At this point the only
#0 1.338  safe thing yum can do is fail. There are a few ways to work "fix" this:
#0 1.338 
#0 1.338      1. Contact the upstream for the repository and get them to fix the problem.
#0 1.338 
#0 1.338      2. Reconfigure the baseurl/etc. for the repository, to point to a working
#0 1.338         upstream. This is most often useful if you are using a newer
#0 1.338         distribution release than is supported by the repository (and the
#0 1.338         packages for the previous distribution release still work).
#0 1.338 
#0 1.338      3. Run the command with the repository temporarily disabled
#0 1.338             yum --disablerepo=<repoid> ...
#0 1.338 
#0 1.338      4. Disable the repository permanently, so yum won't use it by default. Yum
#0 1.338         will then just ignore the repository until you permanently enable it
#0 1.338         again or use --enablerepo for temporary usage:
#0 1.338 
#0 1.338             yum-config-manager --disable <repoid>
#0 1.338         or
#0 1.338             subscription-manager repos --disable=<repoid>
#0 1.338 
#0 1.338      5. Configure the failing repository to be skipped, if it is unavailable.
#0 1.338         Note that yum will try to contact the repo. when it runs most commands,
#0 1.338         so will have to try and fail each time (and thus. yum will be be much
#0 1.338         slower). If it is a very temporary problem though, this is often a nice
#0 1.338         compromise:
#0 1.338 
#0 1.338             yum-config-manager --save --setopt=<repoid>.skip_if_unavailable=true
#0 1.338 
#0 1.338 Cannot find a valid baseurl for repo: base/7/aarch64
------
failed to solve: process "/bin/sh -c sudo yum install -y openssh-clients openssh-server" did not complete successfully: exit code: 1

@adoroszlai
Contributor

failed to solve: process "/bin/sh -c sudo yum install -y openssh-clients openssh-server" did not complete successfully: exit code: 1

@ChenSammi please see #6893. This should be OK after merging from master.

@jojochuang
Contributor

Looks like all comments are addressed. HDDS-11092 is merged into HDDS-7593 so the previous error is no longer seen.

Contributor

@smengcl smengcl left a comment

lgtm. Thanks @ChenSammi

@ChenSammi ChenSammi merged commit 8e701bf into apache:master Aug 13, 2024
@ChenSammi
Contributor Author

Thanks @smengcl @jojochuang @adoroszlai for the review.

ivandika3 pushed a commit to ivandika3/ozone that referenced this pull request Oct 12, 2025