@adoroszlai
Contributor

What changes were proposed in this pull request?

TestDecommissionAndMaintenance uses MiniOzoneClusterProvider to provision clusters in the background. Tests intermittently fail due to port conflicts.

Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 348.139 s <<< FAILURE! - in org.apache.hadoop.ozone.scm.node.TestDecommissionAndMaintenance
org.apache.hadoop.ozone.scm.node.TestDecommissionAndMaintenance.testNodeWithOpenPipelineCanBeDecommissionedAndRecommissioned  Time elapsed: 159.55 s  <<< ERROR!
java.util.concurrent.TimeoutException: 
...
  at org.apache.hadoop.ozone.MiniOzoneClusterImpl.waitForClusterToBeReady(MiniOzoneClusterImpl.java:218)
  at org.apache.hadoop.ozone.MiniOzoneClusterImpl.restartHddsDatanode(MiniOzoneClusterImpl.java:431)
  at org.apache.hadoop.ozone.scm.node.TestDecommissionAndMaintenance.testNodeWithOpenPipelineCanBeDecommissionedAndRecommissioned(TestDecommissionAndMaintenance.java:234)

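The problem is that, while the datanode is stopped, its ports may be reused by some component in a new cluster being provisioned in the background. The original owner of the port then fails to start, and the cluster never becomes ready again. In the log below, port 44925 is released by datanode 3193002e at 07:26:37, taken by datanode 0c852ae0 at 07:26:45, and the restart of 3193002e fails to bind it at 07:26:46.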

2023-05-10 07:26:13,629 [EndpointStateMachine task thread for /0.0.0.0:45947 - 0 ] INFO  server.GrpcService (GrpcService.java:startImpl(302)) - 3193002e-fc2b-4cc9-9970-da2531c45e46: GrpcService started, listening on 44925
...
2023-05-10 07:26:37,941 [main] INFO  server.GrpcService (GrpcService.java:closeImpl(320)) - 3193002e-fc2b-4cc9-9970-da2531c45e46: shutdown server GrpcServerProtocolService successfully
...
2023-05-10 07:26:45,485 [EndpointStateMachine task thread for /0.0.0.0:34213 - 0 ] INFO  server.GrpcService (GrpcService.java:startImpl(302)) - 0c852ae0-3c0b-4f2d-b68a-19e305d37000: GrpcService started, listening on 44925
...
2023-05-10 07:26:46,652 [EndpointStateMachine task thread for /0.0.0.0:45947 - 0 ] INFO  ratis.XceiverServerRatis (XceiverServerRatis.java:start(517)) - Starting XceiverServerRatis 3193002e-fc2b-4cc9-9970-da2531c45e46
2023-05-10 07:26:46,658 [EndpointStateMachine task thread for /0.0.0.0:45947 - 0 ] ERROR server.GrpcService (ExitUtils.java:terminate(133)) - Terminating with exit status 1: Failed to start Grpc server
java.io.IOException: Failed to bind to address 0.0.0.0/0.0.0.0:44925
...
Caused by: org.apache.ratis.thirdparty.io.netty.channel.unix.Errors$NativeIoException: bind(..) failed: Address already in use

This PR replaces random ports with a simple incremental allocation starting at 15000. It applies to all MiniOzoneCluster-based tests.
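As a rough illustration of the idea (not the code merged in this PR), incremental allocation could be sketched as below. The class name `IncrementalPortAllocator` and its method are hypothetical; only the starting value 15000 comes from the description above.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch; the class added by the PR may look different. */
final class IncrementalPortAllocator {
  // Starting port named in the PR description.
  static final int FIRST_PORT = 15000;

  // Shared counter, so clusters provisioned on background threads
  // never receive the same port from this allocator.
  private final AtomicInteger next = new AtomicInteger(FIRST_PORT);

  /** Returns a port this allocator has not handed out before. */
  int nextPort() {
    return next.getAndIncrement();
  }
}
```

Because each port is handed out only once per JVM, a stopped datanode's port cannot be given to a component of another cluster in the same test run, which is the conflict shown in the log above.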

https://issues.apache.org/jira/browse/HDDS-8581

How was this patch tested?

CI:
https://github.com/adoroszlai/hadoop-ozone/actions/runs/4945442087

100x run of TestDecommissionAndMaintenance:
https://github.com/adoroszlai/hadoop-ozone/actions/runs/4944968792

@adoroszlai adoroszlai self-assigned this May 11, 2023
Contributor

@sodonnel sodonnel left a comment

This change LGTM. I have one concern, but we can monitor and see if this works well.

As we increment through the ports, there is a chance that something else on the host is already using the port we try to use, e.g. some client bound to localhost writing into the cluster. That would then cause the service to fail. I don't think we can completely prevent that, as something could slip in after you have allocated, but it might be possible to check that the port is free as we allocate it, and if it is not, increment the port number and try again. I think I saw some code in the past that did something like this; perhaps it tries to bind the port and then releases it before returning the port number.

I am happy to commit this change and then we can see if something like this occurs before adding such a check.
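The bind-then-release check suggested in this comment might look roughly like the sketch below. It is not part of this PR; `FreePortProbe`, `isFree` and `firstFreePort` are hypothetical names, and only `java.net.ServerSocket` is a real API. As the comment notes, the check is best effort: another process can still take the port after the probe socket is closed.

```java
import java.io.IOException;
import java.net.ServerSocket;

/** Hypothetical sketch of a best-effort free-port check. */
final class FreePortProbe {

  /** Returns true if a server socket could be bound to the port just now. */
  static boolean isFree(int port) {
    // try-with-resources closes the probe socket, releasing the port again.
    try (ServerSocket probe = new ServerSocket(port)) {
      return true;
    } catch (IOException e) {
      // Typically a BindException: the port is already in use.
      return false;
    }
  }

  /** Returns the first port in [start, endExclusive) that passes the probe. */
  static int firstFreePort(int start, int endExclusive) {
    for (int port = start; port < endExclusive; port++) {
      if (isFree(port)) {
        return port;
      }
    }
    throw new IllegalStateException(
        "No free port in [" + start + ", " + endExclusive + ")");
  }
}
```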

@adoroszlai
Contributor Author

Thanks @sodonnel for the review. I've updated the port range to 15000-32000 based on Wikipedia's lists of ephemeral port ranges in some OSs (32768 seems to be a common lower bound for those).
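A minimal sketch of allocation restricted to such a range follows. The comment above does not say what happens when the upper bound is reached, so wrapping back to the start is an assumption made here for illustration, as is treating the upper bound as exclusive. `BoundedPortAllocator` is a hypothetical name.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch; wrap-around behaviour is an assumption. */
final class BoundedPortAllocator {
  private final int min;
  private final int maxExclusive;
  private final AtomicInteger next;

  BoundedPortAllocator(int min, int maxExclusive) {
    if (min < 1 || maxExclusive > 65536 || min >= maxExclusive) {
      throw new IllegalArgumentException(
          "Invalid port range: [" + min + ", " + maxExclusive + ")");
    }
    this.min = min;
    this.maxExclusive = maxExclusive;
    this.next = new AtomicInteger(min);
  }

  /** Returns the next port, wrapping to min after the last port in range. */
  int nextPort() {
    // getAndUpdate returns the current value and atomically stores its successor.
    return next.getAndUpdate(p -> p + 1 < maxExclusive ? p + 1 : min);
  }
}
```

With the range from the comment, `new BoundedPortAllocator(15000, 32000)` would hand out 17000 distinct ports before repeating any.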

@sodonnel
Contributor

Limiting the port range to under 32k should work ok. We would need a lot of clusters to go from 15k to the max limit!

Change LGTM, so please commit when CI is green.

@adoroszlai adoroszlai merged commit 4d4d31f into apache:master May 12, 2023
@adoroszlai adoroszlai deleted the HDDS-8581 branch May 12, 2023 09:39
errose28 added a commit to errose28/ozone that referenced this pull request May 17, 2023
* master: (78 commits)
  HDDS-8575. Intermittent failure in TestCloseContainerEventHandler.testCloseContainerWithDelayByLeaseManager (apache#4688)
  HDDS-7241. EC: Reconstruction could fail with orphan blocks. (apache#4718)
  HDDS-8577. [Snapshot] Disable compaction log when loading metadata for snapshot (apache#4697)
  HDDS-7080. EC: Offline reconstruction needs better logging (apache#4719)
  HDDS-8626. Config thread pool in ReplicationServer (apache#4715)
  HDDS-8616. Underreplication not fixed if all replicas start decommissioning (apache#4711)
  HDDS-8254. Close containers when volume reaches utilisation threshold (apache#4583)
  HDDS-8615. Explicitly show EC block type in 'ozone debug chunkinfo' command output (apache#4706)
  HDDS-8623. Delete duplicate getBucketInfo in OMKeyCommitRequest (apache#4712)
  HDDS-8339. Recon Show the number of keys marked for Deletion in Recon UI. (apache#4519)
  HDDS-8572. Support CodecBuffer for protobuf v3 codecs. (apache#4693)
  HDDS-8010. Improve DN warning message when getBlock does not find the block. (apache#4698)
  HDDS-8621. IOException is never thrown in SCMRatisServer.getRatisRoles(). (apache#4710)
  HDDS-8463. S3 key uniqueness in deletedTable (apache#4660)
  HDDS-8584. Hadoop client write slowly when stream enabled (apache#4703)
  HDDS-7732. EC: Verify block deletion from missing EC containers (apache#4705)
  HDDS-8581. Avoid random ports in integration tests (apache#4699)
  HDDS-8504. ReplicationManager: Pass used and excluded node separately for Under and Mis-Replication (apache#4694)
  HDDS-8576. Close RocksDB instance in RDBStore if RDBStore's initialization fails after RocksDB instance creation (apache#4692)
  ...