Skip to content

Conversation

@adoroszlai
Copy link
Contributor

What changes were proposed in this pull request?

Include more information in messages from ECReconstructionCoordinatorTask. Source/target datanodes are identified by UUID.

To facilitate the change, move some logic from ECReconstructionCoordinatorTask to ECReconstructionCommandInfo.

https://issues.apache.org/jira/browse/HDDS-7726

How was this patch tested?

$ mvn -Dtest=TestECContainerRecovery clean test
...

$ grep ECReconstructionCommand hadoop-ozone/integration-test/target/surefire-reports/org.apache.hadoop.ozone.container.TestECContainerRecovery-output.txt
... INFO  reconstruction.ECReconstructionCoordinatorTask (ECReconstructionCoordinatorTask.java:run(97)) - Completed ECReconstructionCommand{containerID=1, replication=rs-3-2-1024, missingIndexes=[1], sources={2:dcaef49d-a60e-42a7-91d8-740a95ae2403,3:fa4feb1a-bb51-4c8b-9a79-ba0dae67a626,4:0af75ffe-7d73-4d6f-859d-14da589f77ba,5:770b81eb-e1a4-478e-83ea-2cb8ce8876f1}, targets={1:ca9c6427-8e00-426f-a28b-138bf9583c8e}} in 241 ms
... WARN  reconstruction.ECReconstructionCoordinatorTask (ECReconstructionCoordinatorTask.java:run(99)) - Failed ECReconstructionCommand{containerID=2, replication=rs-3-2-1024, missingIndexes=[1], sources={2:a9005b92-4153-4535-b3c2-6c1d774e586a,3:b29d4ce4-0416-4ff7-afb9-b8968068d7ab,4:edacb8d4-c8b9-424e-8b1c-f2c9be0858a3,5:4cb977e9-d7ed-4833-92b5-c1181ce1db2e}, targets={1:0af75ffe-7d73-4d6f-859d-14da589f77ba}}

Regular CI:
https://github.com/adoroszlai/hadoop-ozone/actions/runs/3861802806

@adoroszlai adoroszlai self-assigned this Jan 7, 2023
@adoroszlai adoroszlai added the EC label Jan 7, 2023
LOG.warn(
"Failed to complete the reconstruction task for the container: "
+ reconstructionCommandInfo.getContainerID(), e);
LOG.warn("Failed {}", reconstructionCommandInfo, e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth adding the elapsed time to the fail message too, incase it is failing after some long delay or timeout?

@sodonnel
Copy link
Contributor

sodonnel commented Jan 9, 2023

This change looks basically good. I just had one suggestion about adding the elapsed time to the fail message too.

I also wonder about logging the DN UUID. Logging DatanodeDetails.toString() is really too verbose and make the logs quite long, so its a good idea to avoid it. In #4153 I went with logging "DN host / IP" to make it less verbose. If someone is debugging an issue, it might be easier to work with that rather than a UUID, as it saves another lookup to find the host you may want to look at, but it could be argued either way I think. I am fine to leave it as it is I think.

@adoroszlai
Copy link
Contributor Author

I went with logging "DN host / IP" to make it less verbose. If someone is debugging an issue, it might be easier to work with that rather than a UUID, as it saves another lookup to find the host you may want to look at, but it could be argued either way I think. I am fine to leave it as it is I think.

In normal usage hostname is better. UUID might be better for integration tests with mini cluster, where all nodes are on the same host. I guess I'll go with your solution, and maybe we can add a test-specific check later.

Copy link
Contributor

@sodonnel sodonnel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sodonnel sodonnel merged commit a80c6b1 into apache:master Jan 9, 2023
@adoroszlai
Copy link
Contributor Author

Thanks @sodonnel for reviewing and merging this.

@adoroszlai adoroszlai deleted the HDDS-7726 branch January 9, 2023 14:57
errose28 added a commit to errose28/ozone that referenced this pull request Jan 9, 2023
* master: (176 commits)
  HDDS-7726. EC: Enhance datanode reconstruction log message (apache#4155)
  HDDS-7739. EC: Increase the information in the RM sending command log message (apache#4153)
  HDDS-7652. Volume Quota not enforced during write when bucket quota is not set (apache#4124)
  HDDS-7628. Intermittent failure in TestOzoneContainerWithTLS (apache#4142)
  HDDS-7695. EC metrics related to replication commands don't add up (apache#4152)
  HDDS-7729. EC: ECContainerReplicaCount should handle pending delete of unhealthy replicas (apache#4146)
  HDDS-7738. SCM terminates when adding container to a closed pipeline (apache#4154)
  HDDS-7243. Remove RequestFeatureValidator from echoRPC method which supports only ValidationCondition.OLDER_CLIENT_REQUESTS (apache#4051)
  HDDS-7708. No check for certificate duration config scenarios. (apache#4149)
  HDDS-7727. EC: SCM unregistered event handler for DatanodeCommandCountUpdated (apache#4147)
  HDDS-7606. Add SCM HA support in intellij run (apache#4058)
  HDDS-7666. EC: Unrecoverable EC containers with some remaining replicas may block decommissioning (apache#4118)
  HDDS-7339. Implement Certificate renewal task for services (apache#3982)
  HDDS-7696. MisReplicationHandler does not consider QUASI_CLOSED replicas as sources (apache#4144)
  HDDS-7714. Docker cluster ozone-om-ha fails during docker-compose up (apache#4137)
  HDDS-7716. Log read requests rejected with permission denied in OM audit (apache#4136)
  HDDS-7588. Intermittent failure in TestObjectStoreWithLegacyFS#testFlatKeyStructureWithOBS (apache#4040)
  HDDS-7633. Compile error with Java 11: package com.sun.jmx.mbeanserver is not visible (apache#4077)
  HDDS-7648. Add a servername tag in UGI metrics. (apache#4094)
  HDDS-7564. Update Ozone version after 1.3.0 release (apache#4115)
  ...
jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Feb 6, 2023
(cherry picked from commit a80c6b1)
Change-Id: I7aeca84a2ebb6c11b3856a7baecd8fe84e520895
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants