Skip to content

Conversation

@sodonnel
Copy link
Contributor

@sodonnel sodonnel commented Jan 6, 2023

What changes were proposed in this pull request?

Replication Manager writes an info log each time it sends a command to a datanode, but aside from the command name, it doesn't contain any further information about the command. This change is to log more details about the command being sent which may aid debugging issues in the future.

Example new messages after this change:

2023-01-06 17:23:06,080 [main] INFO  replication.ReplicationManager (ReplicationManager.java:sendDatanodeCommand(426)) - Sending command [replicateContainerCommand: containerId: 1, replicaIndex: 1, sourceNodes: [localhost-143.110.182.62/143.110.182.62, localhost-243.150.55.19/243.150.55.19]] for container ContainerInfo{id=#1, state=CLOSED, pipelineID=PipelineID=bdd96da4-f447-45d4-9331-62480e7bf224, stateEnterTime=1970-01-01T00:00:00Z, owner=Ozone} to f8bca9d9-cf71-4e8a-aaed-f48ddc148a73{ip: 37.218.105.163, host: localhost-37.218.105.163, ports: [STANDALONE=0, RATIS=0, REST=0, REPLICATION=0, RATIS_ADMIN=0, RATIS_SERVER=0, RATIS_DATASTREAM=0], networkLocation: /default-rack, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}

2023-01-06 17:23:05,893 [main] INFO  replication.ReplicationManager (ReplicationManager.java:sendDatanodeCommand(426)) - Sending command [closeContainerCommand: containerID: 1, pipelineID: PipelineID=6dad5a63-dfa1-4621-b8c4-7de81e024cdc, force: true] for container ContainerInfo{id=#1, state=CLOSED, pipelineID=PipelineID=6dad5a63-dfa1-4621-b8c4-7de81e024cdc, stateEnterTime=1970-01-01T00:00:00Z, owner=Ozone} to bb47be9c-0886-4852-8ca7-46799ac36883{ip: 140.122.156.42, host: localhost-140.122.156.42, ports: [STANDALONE=0, RATIS=0, REST=0, REPLICATION=0, RATIS_ADMIN=0, RATIS_SERVER=0, RATIS_DATASTREAM=0], networkLocation: /default-rack, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}


2023-01-06 17:23:05,938 [main] INFO  replication.ReplicationManager (ReplicationManager.java:sendDatanodeCommand(426)) - Sending command [reconstructECContainersCommand: containerID: 1, replicationConfig: EC/ECReplicationConfig{data=3, parity=2, ecChunkSize=1048576, codec=rs}, sources: [localhost-208.151.31.91/208.151.31.91 replicaIndex: 1, localhost-27.100.52.223/27.100.52.223 replicaIndex: 2, localhost-247.11.160.69/247.11.160.69 replicaIndex: 3], targets: [localhost-220.147.5.156/220.147.5.156, localhost-193.210.182.11/193.210.182.11], missingIndexes: [4, 5]] for container ContainerInfo{id=#1, state=CLOSED, pipelineID=PipelineID=874dd4ae-c222-4f91-b587-81551ef78a20, stateEnterTime=1970-01-01T00:00:00Z, owner=Ozone} to ecd0da3f-68b0-4bf8-85da-b7d3a41100a2{ip: 220.147.5.156, host: localhost-220.147.5.156, ports: [STANDALONE=0, RATIS=0, REST=0, REPLICATION=0, RATIS_ADMIN=0, RATIS_SERVER=0, RATIS_DATASTREAM=0], networkLocation: /default-rack, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}

2023-01-06 17:23:05,987 [main] INFO  replication.ReplicationManager (ReplicationManager.java:sendDatanodeCommand(426)) - Sending command [deleteContainerCommand: containerID: 1, replicaIndex: 1, force: false] for container ContainerInfo{id=#1, state=CLOSED, pipelineID=PipelineID=4c4b1a01-ebbf-466a-a1f3-45b306533709, stateEnterTime=1970-01-01T00:00:00Z, owner=Ozone} to 7c838ac2-d7bd-4719-ac38-f1ff5adb7869{ip: 122.2.174.81, host: localhost-122.2.174.81, ports: [STANDALONE=0, RATIS=0, REST=0, REPLICATION=0, RATIS_ADMIN=0, RATIS_SERVER=0, RATIS_DATASTREAM=0], networkLocation: /default-rack, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-7739

How was this patch tested?

Logging only change. I ran some tests which caused the logs to be emitted and checked their format (posted above).

* Eg: datanode001.corp/192.168.0.123
*/
public String getHostNameAndIP() {
return getHostName() + "/" + getIpAddress();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would love to make this (or UUID + this) the default toString, and renaming the existing more verbose toString to something like toDebugString. This way we could avoid having to convert manually in all places, and use the verbose one explicitly if needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with adding UUID into the string too. It would solve the problem with mini-cluster where all the nodes are on the same host / IP. I wonder if anything would break if we changed the default datanode details to string? Almost everywhere it is logged, we don't call about all the verbose info it currently includes. We really just want to the host / IP / ID.

@adoroszlai adoroszlai merged commit 07b6759 into apache:master Jan 9, 2023
errose28 added a commit to errose28/ozone that referenced this pull request Jan 9, 2023
* master: (176 commits)
  HDDS-7726. EC: Enhance datanode reconstruction log message (apache#4155)
  HDDS-7739. EC: Increase the information in the RM sending command log message (apache#4153)
  HDDS-7652. Volume Quota not enforced during write when bucket quota is not set (apache#4124)
  HDDS-7628. Intermittent failure in TestOzoneContainerWithTLS (apache#4142)
  HDDS-7695. EC metrics related to replication commands don't add up (apache#4152)
  HDDS-7729. EC: ECContainerReplicaCount should handle pending delete of unhealthy replicas (apache#4146)
  HDDS-7738. SCM terminates when adding container to a closed pipeline (apache#4154)
  HDDS-7243. Remove RequestFeatureValidator from echoRPC method which supports only ValidationCondition.OLDER_CLIENT_REQUESTS (apache#4051)
  HDDS-7708. No check for certificate duration config scenarios. (apache#4149)
  HDDS-7727. EC: SCM unregistered event handler for DatanodeCommandCountUpdated (apache#4147)
  HDDS-7606. Add SCM HA support in intellij run (apache#4058)
  HDDS-7666. EC: Unrecoverable EC containers with some remaining replicas may block decommissioning (apache#4118)
  HDDS-7339. Implement Certificate renewal task for services (apache#3982)
  HDDS-7696. MisReplicationHandler does not consider QUASI_CLOSED replicas as sources (apache#4144)
  HDDS-7714. Docker cluster ozone-om-ha fails during docker-compose up (apache#4137)
  HDDS-7716. Log read requests rejected with permission denied in OM audit (apache#4136)
  HDDS-7588. Intermittent failure in TestObjectStoreWithLegacyFS#testFlatKeyStructureWithOBS (apache#4040)
  HDDS-7633. Compile error with Java 11: package com.sun.jmx.mbeanserver is not visible (apache#4077)
  HDDS-7648. Add a servername tag in UGI metrics. (apache#4094)
  HDDS-7564. Update Ozone version after 1.3.0 release (apache#4115)
  ...
jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Feb 6, 2023
… message (apache#4153)

(cherry picked from commit 07b6759)
Change-Id: I613c726be1770aa2d56f01bbb6c8245ee0667f21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants