HDDS-4131. Container report should update container key count and bytes used if they differ in SCM #1339

sodonnel · 2020-08-19T16:16:02Z

What changes were proposed in this pull request?

In HDDS-4037 it was noted that when blocks are deleted from closed containers, the bytesUsed and Key Count metrics on the SCM container are not updated correctly.

These stats should be updated via the container reports issued by the DNs to SCM periodically. However, in AbstractContainerReportHandler#updateContainerStats, the code assumes the values are always increasing and it will not update them if they are decreasing:

  private void updateContainerStats(final ContainerID containerId,
                                    final ContainerReplicaProto replicaProto)
      throws ContainerNotFoundException {
    if (isHealthy(replicaProto::getState)) {
      final ContainerInfo containerInfo = containerManager
          .getContainer(containerId);

      if (containerInfo.getSequenceId() <
          replicaProto.getBlockCommitSequenceId()) {
        containerInfo.updateSequenceId(
            replicaProto.getBlockCommitSequenceId());
      }
      if (containerInfo.getUsedBytes() < replicaProto.getUsed()) {
        containerInfo.setUsedBytes(replicaProto.getUsed());
      }
      if (containerInfo.getNumberOfKeys() < replicaProto.getKeyCount()) {
        containerInfo.setNumberOfKeys(replicaProto.getKeyCount());
      }
    }
  }
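The effect of the one-directional guard can be reproduced in a small standalone sketch. The `Stats` class and the two update methods below are hypothetical stand-ins, not Ozone classes; the sketch only contrasts a `<` guard with a `!=` guard when a report carries smaller values after blocks have been deleted.

```java
/** Standalone sketch; Stats and both update methods are hypothetical, not Ozone APIs. */
final class ContainerStatsSketch {

  /** Stand-in for the usedBytes / keyCount values SCM tracks per container. */
  static final class Stats {
    long usedBytes;
    long keyCount;

    Stats(long usedBytes, long keyCount) {
      this.usedBytes = usedBytes;
      this.keyCount = keyCount;
    }
  }

  /** Mirrors the guard quoted above: a value is only ever raised, never lowered. */
  static void updateIfGreater(Stats scm, long reportedBytes, long reportedKeys) {
    if (scm.usedBytes < reportedBytes) {
      scm.usedBytes = reportedBytes;
    }
    if (scm.keyCount < reportedKeys) {
      scm.keyCount = reportedKeys;
    }
  }

  /** Alternative guard: take the reported value whenever it differs from SCM's. */
  static void updateIfDifferent(Stats scm, long reportedBytes, long reportedKeys) {
    if (scm.usedBytes != reportedBytes) {
      scm.usedBytes = reportedBytes;
    }
    if (scm.keyCount != reportedKeys) {
      scm.keyCount = reportedKeys;
    }
  }

  public static void main(String[] args) {
    // SCM believes the container holds 1000 bytes in 10 keys; after block
    // deletion the datanode reports 400 bytes in 4 keys.
    Stats stale = new Stats(1000L, 10L);
    updateIfGreater(stale, 400L, 4L);
    System.out.println(stale.usedBytes + " " + stale.keyCount); // prints "1000 10"

    Stats fresh = new Stats(1000L, 10L);
    updateIfDifferent(fresh, 400L, 4L);
    System.out.println(fresh.usedBytes + " " + fresh.keyCount); // prints "400 4"
  }
}
```

With the `<` guard the SCM-side values stay at their old, larger numbers, which is the symptom described in HDDS-4037.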

In HDDS-4037 a change was made to the Replication Manager, so it updates the stats. However I don't believe that is the correct place to perform this check, and the issue is caused by the logic shared above.

In this Jira, I have removed the changes made to Replication Manager in HDDS-4037 (but retained the other changes from that Jira), so that the problem statistics are updated only via the container reports, and only when the value in SCM differs from what is reported.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4131

How was this patch tested?

Small change to an existing unit test. Used it to reproduce the problem before making the change.

ChenSammi · 2020-08-21T08:36:15Z

Hi @sodonnel, since a typical container has 3 replicas and container reports arrive asynchronously, we need a consensus on what the container size is in SCM. Basically AbstractContainerReportHandler is not the perfect place to handle this because it doesn't have a global view while Replication Manager has.
For an OPEN container, I think Math.min(replica1, replica2, replica3) is a safe choice, and for a CLOSED container, Math.max(replica1, replica2, replica3) is safer.
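The min/max rule proposed here can be sketched as a small standalone helper. The `State` enum and the `aggregate` method are hypothetical names chosen for illustration and are not taken from the Ozone code base or from the code merged in this PR; only the two states named in the comment are modelled.

```java
import java.util.Arrays;
import java.util.List;

/** Standalone sketch of the proposed rule; all names here are hypothetical. */
final class ReplicaConsensusSketch {

  /** Only the two lifecycle states mentioned in the comment are modelled. */
  enum State { OPEN, CLOSED }

  /**
   * Combines one statistic (used bytes or key count) reported by every replica:
   * the smallest value for an OPEN container, the largest for a CLOSED one.
   */
  static long aggregate(State state, List<Long> replicaValues) {
    if (replicaValues.isEmpty()) {
      throw new IllegalArgumentException("at least one replica value is required");
    }
    long result = replicaValues.get(0);
    for (long value : replicaValues) {
      result = (state == State.OPEN)
          ? Math.min(result, value)
          : Math.max(result, value);
    }
    return result;
  }

  public static void main(String[] args) {
    List<Long> usedBytes = Arrays.asList(900L, 1000L, 950L);
    System.out.println(aggregate(State.OPEN, usedBytes));   // prints 900
    System.out.println(aggregate(State.CLOSED, usedBytes)); // prints 1000
  }
}
```

Because the result is recomputed from all replica values each time, it can move down as well as up, for example once every replica of a CLOSED container reports fewer bytes after block deletion.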

sodonnel · 2020-08-21T14:14:56Z

> Basically AbstractContainerReportHandler is not the perfect place to handle this because it doesn't have a global view while Replication Manager has.

ContainerReportHandler has the same view of all replicas as replication manager, as it has access to the ContainerManager object. I will push a new commit that adjusts the values based on all 3 replicas. I still need to add a test or two for this, but this demonstrates the point hopefully.

Ideally, we should be updating these values in a single place. The containerReportHandler is supposed to do it, but it is not doing it correctly, so we need to fix that, rather than adding new logic elsewhere to work around the bug.

adoroszlai

Thanks @sodonnel for improving code responsibilities / fixing the original bug at the root cause.

...r-scm/src/main/java/org/apache/hadoop/hdds/scm/container/AbstractContainerReportHandler.java

...erver-scm/src/test/java/org/apache/hadoop/hdds/scm/container/TestContainerReportHandler.java

elek · 2020-08-25T08:24:53Z

> In HDDS-4037 a change was made to the Replication Manager, so it updates the stats. However I don't believe that is the correct place to perform this check, and the issue is caused by the logic shared above.

Agree. Based on my understanding, it's better to do this before the replication manager:

- Replication manager can be simpler and easier to manage.
- If we do the fix in the replication manager, it's possible to have an inconsistent view between the container report and replication manager execution.

sodonnel · 2020-08-25T21:20:17Z

@adoroszlai @ChenSammi Are you happy with this change at this stage? Can we commit it?

ChenSammi

Just one minor inline comment. The rest looks good.

ChenSammi · 2020-08-27T06:54:56Z

...-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/SCMContainerManager.java

              SCMException.ResultCodes.FAILED_TO_FIND_CONTAINER);
        }
        containerInfo.updateDeleteTransactionId(entry.getValue());
-        containerInfo.setNumberOfKeys(containerInfoInMem.getNumberOfKeys());


Prefer to keep the persist action for KeyCount and UsedBytes here.

sodonnel · 2020-08-28

After looking at this area a bit more, I understand why that is needed now. I have added those two lines back in.

sodonnel · 2020-09-01T07:13:38Z

I think this change is good to commit now? @adoroszlai gave a thumbs up a few days back and I have addressed the only concern @ChenSammi raised.

I will commit tomorrow unless anyone objects before then.

ChenSammi · 2020-09-02T06:10:01Z

+1. Thanks @sodonnel for the contribution.

Changes to fix the issue

62fc28e

sodonnel mentioned this pull request Aug 19, 2020

HDDS-4023. Delete closed container after all blocks have been deleted. #1338

Merged

sodonnel requested review from ChenSammi and adoroszlai August 19, 2020 16:17

S O'Donnell added 2 commits August 21, 2020 15:34

Consider all replicas when adjusting keyCount and UsedBytes

77ebff1

Added tests

2870dc6

elek self-assigned this Aug 24, 2020

adoroszlai reviewed Aug 24, 2020

Fixed swapped values in test

67e1cb2

ChenSammi reviewed Aug 27, 2020

S O'Donnell added 2 commits August 28, 2020 10:31

Replace numberOfKeys and getUsedBytes in updateDeleteTransactionId

0f4136c

Trigger build

ad0bebe

ChenSammi merged commit 199512b into apache:master Sep 2, 2020

rakeshadr pushed a commit to rakeshadr/hadoop-ozone that referenced this pull request Sep 3, 2020

HDDS-4131. Container report should update container key count and bytes used if they differ in SCM (apache#1339)

24aa0df
