@sodonnel sodonnel commented Jan 6, 2023

What changes were proposed in this pull request?

    "EcReplicationCmdsSentTotal" : 0,
    "EcDeletionCmdsSentTotal" : 259,
    "EcReplicationCmdsCompletedTotal" : 51,
    "EcDeletionCmdsCompletedTotal" : 51,
    "EcReconstructionCmdsSentTotal" : 571,
    "EcReplicationCmdsTimeoutTotal" : 765,
    "EcDeletionCmdsTimeoutTotal" : 204

The total number of replication commands sent is 0, while 765 have timed out.

I think the code is working as intended, but it is confusing.

We have metrics for "EcReplicationCmdsSentTotal" and "EcReconstructionCmdsSentTotal". However, on completion or timeout we only have EcReplicationCmdsCompletedTotal and EcReplicationCmdsTimeoutTotal; there is no reconstruction completed / timeout metric. This is because we track completion in ContainerReplicaPendingOps, and all it sees is a replica that has been scheduled to be created. It doesn't know whether a simple copy or a reconstruction is going to create it.

That can explain why "EcReplicationCmdsSentTotal=0" and "EcReplicationCmdsTimeoutTotal=765": most likely all of these scheduled commands were actually reconstructions, as we have 571 of those sent.

Why then do we have more EcReplication commands completed and timed out than scheduled? An EC reconstruction can create multiple new replicas in a single command. It is counted as a single command when sent, but when it completes in pending ops, it is counted once per replica. So we can schedule a reconstruction to create 2 new replicas, and we will end up with 1 command sent and 2 in EcReplicationCmdsCompletedTotal.
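The mismatch between per-command and per-replica counting can be illustrated with a minimal sketch. Everything here is hypothetical: `EcMetricsSketch`, `Metrics`, `PendingOps` and `sendReconstruction` are names made up for illustration and are not the Ozone classes or their APIs. The sketch only assumes what is described above: a counter bumped once per command sent, and another bumped once per replica when the pending op completes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the actual Ozone implementation.
final class EcMetricsSketch {

  /** Counters loosely modelled on the metrics quoted above. */
  static final class Metrics {
    long reconstructionCmdsSent;    // incremented once per command sent
    long replicationCmdsCompleted;  // incremented once per completed replica
  }

  /** Stand-in for pending-op tracking: one entry per replica expected. */
  static final class PendingOps {
    private final List<Integer> pendingReplicaIndexes = new ArrayList<>();
    private final Metrics metrics;

    PendingOps(Metrics metrics) {
      this.metrics = metrics;
    }

    // Only the replica to be created is recorded; whether a copy or a
    // reconstruction will create it is not known at this level.
    void scheduleAdd(int replicaIndex) {
      pendingReplicaIndexes.add(replicaIndex);
    }

    // Each pending replica that completes is counted individually.
    void completeAll() {
      metrics.replicationCmdsCompleted += pendingReplicaIndexes.size();
      pendingReplicaIndexes.clear();
    }
  }

  /** One reconstruction command may create several replicas. */
  static void sendReconstruction(Metrics metrics, PendingOps ops,
      int... missingIndexes) {
    metrics.reconstructionCmdsSent++;  // counted once for the whole command
    for (int index : missingIndexes) {
      ops.scheduleAdd(index);          // tracked once per replica
    }
  }

  public static void main(String[] args) {
    Metrics metrics = new Metrics();
    PendingOps ops = new PendingOps(metrics);

    // A single reconstruction that recreates replica indexes 2 and 4.
    sendReconstruction(metrics, ops, 2, 4);
    ops.completeAll();

    // Prints: commands sent = 1, replicas completed = 2
    System.out.println("commands sent = " + metrics.reconstructionCmdsSent
        + ", replicas completed = " + metrics.replicationCmdsCompleted);
  }
}
```

With counters structured like this, the "completed" value is really a replica count, which is why it can exceed the number of commands sent.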

To make this less confusing, I have renamed the "complete" metrics in this PR to be replicas created / deleted / timed out, rather than commands.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-7695

How was this patch tested?

Existing tests should cover this, as it's just a rename of variables / methods.

@adoroszlai adoroszlai merged commit 4abe983 into apache:master Jan 8, 2023