Skip to content

Conversation

@ChenSammi
Copy link
Contributor

Copy link
Contributor

@nandakumar131 nandakumar131 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, the changes look good.

@adoroszlai
Copy link
Contributor

Thanks @ChenSammi for working on this.

Can you please check TestOzoneManagerPrepare? Seems to be failing consistently (also locally).

@ChenSammi
Copy link
Contributor Author

Thanks @ChenSammi for working on this.

Can you please check TestOzoneManagerPrepare? Seems to be failing consistently (also locally).

TestOzoneManagerPrepare is passed locally on my laptop. I just pushed a new patch. Let's see if TestOzoneManagerPrepare still fails in CI.

@ChenSammi
Copy link
Contributor Author

Thanks @nandakumar131 and @adoroszlai for the code review.

@ChenSammi ChenSammi merged commit 28170f2 into apache:master May 4, 2022
@kaijchen
Copy link
Member

kaijchen commented May 5, 2022

Hi @ChenSammi, seems this PR breaks TestOzoneManagerPrepare#testPrepareDownedOM, could you please take a look?

@adoroszlai
Copy link
Contributor

@ChenSammi TestOzoneManagerPrepare was flaky before (HDDS-5990), but it is definitely much worse now than previously.

Seems like stop/start logic is not entirely fail-safe, or test needs additional/adjusted waitFor checks?

2022-05-05 01:41:52,188 [pool-2338-thread-1] ERROR om.OzoneManager (ExitUtils.java:terminate(133)) - Terminating with exit status 1: Failed to start RPC Server.
java.net.BindException: Problem binding to [localhost:41487] java.net.BindException: Address already in use
  ...
  at org.apache.hadoop.ozone.om.OzoneManager.startRpcServer(OzoneManager.java:1084)
  at org.apache.hadoop.ozone.om.OzoneManager.getRpcServer(OzoneManager.java:1054)
  at org.apache.hadoop.ozone.om.OzoneManager.installCheckpoint(OzoneManager.java:3366)
  at org.apache.hadoop.ozone.om.OzoneManager.installCheckpoint(OzoneManager.java:3254)
  at org.apache.hadoop.ozone.om.OzoneManager.installSnapshotFromLeader(OzoneManager.java:3231)

https://github.com/adoroszlai/ozone-build-results/tree/master/2022/05/05/14826/it-flaky/hadoop-ozone/integration-test

@adoroszlai
Copy link
Contributor

Idea for fixing the bind exception: adoroszlai@65e11fa

Running 10 iterations:
https://github.com/adoroszlai/hadoop-ozone/runs/6302199651

Let's see if it's enough.

@adoroszlai
Copy link
Contributor

adoroszlai commented May 5, 2022

Yes, it seems bind exception is fixed. But testPrepareDownedOM still times out at assertClusterNotPrepared, because the restarted OM does not get the cancel prepare request. It is stuck after installing snapshot, thinking it is still in progress:

2022-05-05 10:01:07,994 [Listener at localhost/49178] INFO  om.OzoneManager (OzoneManager.java:installCheckpoint(3396)) - Install Checkpoint is finished with Term: 2 and Index: 96. Spend 528 ms.
2022-05-05 10:01:07,994 [Listener at localhost/49178] INFO  server.RaftServer$Division (RaftServerImpl.java:lambda$notifyStateMachineToInstallSnapshot$30(1595)) - omNode-2@group-523986131536: StateMachine successfully installed snapshot index 96. Reloading the StateMachine.
...
2022-05-05 10:01:08,682 [grpc-default-executor-2] INFO  server.RaftServer$Division (RaftServerImpl.java:checkInconsistentAppendEntries(1296)) - omNode-2@group-523986131536: Failed appendEntries as snapshot (97) installation is in progress

@adoroszlai
Copy link
Contributor

Upgrade to current Ratis 2.3.0-SNAPSHOT (current master) with Ratis Thirdparty 1.0.0 (pending release) seems to resolve the problem.

@kaijchen
Copy link
Member

kaijchen commented May 5, 2022

I have just run 100x testPrepareDownedOM with this PR reverted, and surprisingly got a clean run: https://github.com/kaijchen/ozone/actions/runs/2275278052

I guess there might be some issues around restarting the omRpcServer.

@adoroszlai
Copy link
Contributor

HDDS-10177 seems to be caused by this change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants