
Conversation


@sodonnel (Contributor) commented Aug 19, 2020

Thanks for working on this @ChenSammi.

I wonder if it would be simpler to remove empty containers as part of Container Report processing? In AbstractContainerReportHandler#updateContainerState, we could check the size and number of keys of the reported containers in the CLOSED branch of the switch statement, and then take action to delete an empty container there? I have a feeling it might be simpler, but I am not sure. The disadvantage of doing it in Container Report processing is that we are dealing with only a single replica at that stage. However if the container is CLOSED in SCM, and a report says it is empty, then we should be good to simply remove the container from SCM and issue the delete container command when processing the container report.
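For illustration, a rough sketch of how such a check could look when handling a CLOSED container's report (the sendDeleteCommand helper is hypothetical, and the surrounding class context is assumed):

  // Sketch only: the container is CLOSED in SCM and this replica reports
  // itself empty, so move the container towards deletion and tell the
  // datanode to drop the replica.
  private void deleteIfEmpty(ContainerID containerId,
      ContainerReplicaProto replicaProto) throws IOException {
    if (replicaProto.getKeyCount() == 0 && replicaProto.getUsed() == 0) {
      containerManager.updateContainerState(containerId,
          HddsProtos.LifeCycleEvent.DELETE);
      sendDeleteCommand(containerId); // hypothetical helper
    }
  }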

In looking at how this all works, I noticed there is a bug in the existing method AbstractContainerReportHandler#updateContainerStats, which might stop your change from working:

  private void updateContainerStats(final ContainerID containerId,
                                    final ContainerReplicaProto replicaProto)
      throws ContainerNotFoundException {
    if (isHealthy(replicaProto::getState)) {
      final ContainerInfo containerInfo = containerManager
          .getContainer(containerId);

      if (containerInfo.getSequenceId() <
          replicaProto.getBlockCommitSequenceId()) {
        containerInfo.updateSequenceId(
            replicaProto.getBlockCommitSequenceId());
      }
      if (containerInfo.getUsedBytes() < replicaProto.getUsed()) {
        containerInfo.setUsedBytes(replicaProto.getUsed());
      }
      if (containerInfo.getNumberOfKeys() < replicaProto.getKeyCount()) {
        containerInfo.setNumberOfKeys(replicaProto.getKeyCount());
      }
    }
  }

This logic assumes the bytes used and keyCount are always increasing on the container. However in this case, where blocks are deleted, the keyCount and bytesUsed can decrease. Therefore I don't think the stats will ever get updated in SCM via the container reports when blocks are removed.
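One possible shape for a fix (just a sketch - the real discussion belongs in HDDS-4131 below) would be to accept a CLOSED replica's reported values as-is, since they are final, rather than only ratcheting them upward:

  // Sketch of a possible fix: a CLOSED replica's stats are final, so accept
  // decreases too; other replicas keep the existing monotonic behaviour.
  if (replicaProto.getState() == ContainerReplicaProto.State.CLOSED) {
    containerInfo.setUsedBytes(replicaProto.getUsed());
    containerInfo.setNumberOfKeys(replicaProto.getKeyCount());
  } else {
    if (containerInfo.getUsedBytes() < replicaProto.getUsed()) {
      containerInfo.setUsedBytes(replicaProto.getUsed());
    }
    if (containerInfo.getNumberOfKeys() < replicaProto.getKeyCount()) {
      containerInfo.setNumberOfKeys(replicaProto.getKeyCount());
    }
  }

Whether to trust a single replica's report or aggregate across all replicas is exactly the kind of question that would need settling there.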

I then noticed you recently made a change in ReplicationManager to update these counts here:

https://github.com/apache/hadoop-ozone/pull/1295/files#diff-218132dfe72e6e39b9eaaf59737adcf2R780

I think the ReplicationManager part of that change should be reverted and this handled via the container reports, by fixing the bug I mentioned above. The only reason we need the ReplicationManager change is to work around that bug; if it was fixed, we wouldn't need that logic in RM.

If we decide to keep your changes inside ReplicationManager (for this change), then we would need a few more tests added to TestReplicationManager to cover the new scenarios.

I also have a question - when the replicas are all deleted, and we call:

containerManager.updateContainerState(container.containerID(),
          HddsProtos.LifeCycleEvent.CLEANUP);

Does this somehow cause the container to be removed from SCM memory and the persistent store?

@sodonnel (Contributor)

I raised HDDS-4131 / #1339 to discuss and address the bytesUsed and keyCount metric problem I mentioned above.

@ChenSammi (Contributor, Author) commented Aug 21, 2020

Hi @sodonnel, thanks for reviewing the code and initiating the discussion.

> I wonder if it would be simpler to remove empty containers as part of Container Report processing? In AbstractContainerReportHandler#updateContainerState, we could check the size and number of keys of the reported containers in the CLOSED branch of the switch statement, and then take action to delete an empty container there? I have a feeling it might be simpler, but I am not sure. The disadvantage of doing it in Container Report processing is that we are dealing with only a single replica at that stage. However if the container is CLOSED in SCM, and a report says it is empty, then we should be good to simply remove the container from SCM and issue the delete container command when processing the container report.

> In looking at how this all works, I noticed there is a bug in the existing method AbstractContainerReportHandler#updateContainerStats, which might stop your change from working:
>
> (updateContainerStats code quoted above)
>
> This logic assumes the bytes used and keyCount are always increasing on the container. However in this case, where blocks are deleted, the keyCount and bytesUsed can decrease. Therefore I don't think the stats will ever get updated in SCM via the container reports when blocks are removed.

This logic works for OPEN containers, but not for CLOSED containers.

> I then noticed you recently made a change in ReplicationManager to update these counts here:
>
> https://github.com/apache/hadoop-ozone/pull/1295/files#diff-218132dfe72e6e39b9eaaf59737adcf2R780
>
> I think the ReplicationManager part of that change should be reverted and this handled via the container reports, by fixing the bug I mentioned above. The only reason we need the ReplicationManager change is to work around that bug; if it was fixed, we wouldn't need that logic in RM.
You can find my comments in HDDS-4131.

> If we decide to keep your changes inside ReplicationManager (for this change), then we would need a few more tests added to TestReplicationManager to cover the new scenarios.

TestReplicationManager uses Mockito, so it is not a real end-to-end test. I added an end-to-end test in TestBlockDeletion.

> I also have a question - when the replicas are all deleted, and we call:
>
>     containerManager.updateContainerState(container.containerID(),
>         HddsProtos.LifeCycleEvent.CLEANUP);
>
> Does this somehow cause the container to be removed from SCM memory and the persistent store?

A container in DELETED state currently remains in both SCM memory and the persistent store. Of course we could just delete the container from memory and the DB when its state changes to DELETED, but I feel the current way is safer, like a container trash bin. We can add a new "purge DELETED containers" action next, so users can manually delete these containers once they are sure the information is no longer needed. Overall, I'm open on this question and fine with either way.

@ChenSammi (Contributor, Author)

The failed UT is unrelated; it is fixed by #1379.

@ChenSammi (Contributor, Author) commented Sep 4, 2020

Hi @sodonnel, do you have time for another review?

handleContainerUnderDelete(container, replicas);
return;
}

Review comment (Contributor) on the diff above:

Should we add another check here:

if (state == LifeCycleState.DELETED) {
  return;
}

This would avoid doing any further processing on a container which is expected to have zero replicas.

@sodonnel (Contributor) commented Sep 8, 2020

As I said before, I have a feeling that it might be better to have a separate service for deleting empty containers, and it would perhaps naturally be part of a "merge small containers" service, which we will need to create eventually. However, the changes to the replication manager here are not too big, and you could argue the current solution fits here OK. Therefore I am OK with keeping this logic in the replication manager. However when it comes to merging small containers in the future, I believe we should create a separate service and not clutter RM with that new feature too.

The patch looks mostly good to me, there are just a couple of areas we need to consider.

  1. What do we do with the containers in DELETED state, and with no replicas? I don't think we should keep them in SCM forever, but we are also worried about removing them. However that said, if the replicas have all been deleted, what use is there in keeping the container in SCM?

  2. Consider if a datanode was dead for some time, and is then restarted. It will start reporting a replica for the DELETED container and with the current logic I don't think we will clean it up, as isContainerEmpty(...) will always return false due to it checking for only CLOSED containers. I feel that this should be handled in the ContainerReportHandler code. If a replica is reported for a container which is DELETED in SCM, then we should instruct the datanode to delete the replica (see the sketch after this list). What do you think? The parameter hdds.scm.unknown-container.action currently controls what to do if a DN reports a container which SCM does not know about. The default is to log a warning, but for a DELETED container, I think we should just delete the replica.

  3. There is another corner case, where the dead node may have a container which is out of sync with the others, as it was dead when the blocks should have been removed. Therefore we had 3 empty container replicas, plus one non-empty dead one. The 3 empty replicas are removed, and the container moved to "DELETED". Then the dead node starts reporting a non-empty replica for the DELETED container. This is actually a wider problem, in that I believe there are scenarios where container replicas can get out of sync with one another and there is nothing to reconcile them and fix it.

  4. In ContainerSafeModeRule, I think we need to consider DELETING and DELETED. If there are a lot of DELETED containers in the cluster, then they will never get replicas and the safemode rule will never be met. I think this could be as simple as excluding DELETING and DELETED when loading the containerMap (right now OPEN and CLOSING are excluded) so we don't wait for replicas on these containers.

  5. In ReplicationManager.isContainerUnderReplicated(...), should we immediately return true if the containerState is DELETING or DELETED?

  6. Are there any other places outside of ReplicationManager and ContainerSafeModeRule where we monitor under- or over-replicated containers that I may have missed?
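To sketch the idea in point 2 (the wiring here is illustrative; only the command and event types are real code):

  // Sketch for point 2: a report arrives for a container that SCM already
  // marks DELETED, so instruct the datanode to remove its stale replica.
  private void handleReplicaOfDeletedContainer(ContainerInfo containerInfo,
      DatanodeDetails datanodeDetails, EventPublisher publisher) {
    if (containerInfo.getState() == HddsProtos.LifeCycleState.DELETED) {
      final DeleteContainerCommand command = new DeleteContainerCommand(
          containerInfo.getContainerID(), true); // force delete
      publisher.fireEvent(SCMEvents.DATANODE_COMMAND,
          new CommandForDatanode<>(datanodeDetails.getUuid(), command));
    }
  }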

@sodonnel (Contributor) commented Sep 8, 2020

Thinking some more about what to do with containers in the DELETED state in SCM. If we are not confident to delete them immediately and automatically, maybe we should provide a command "purge empty containers" which can be run by an admin, or via Recon. We should show in Recon how many empty containers are present in the cluster and then provide an option to remove them.

Alternatively we could have some threshold in SCM, so that when the number of DELETED containers crosses that threshold, they are all removed.

We could do either of these in a separate jira if this idea makes sense.

@ChenSammi (Contributor, Author) commented Sep 16, 2020

Thanks @sodonnel for the very thoughtful suggestions.

> 2. Consider if a datanode was dead for some time, and is then restarted. It will start reporting a replica for the DELETED container and with the current logic I don't think we will clean it up, as `isContainerEmpty(...)` will always return false due to it checking for only CLOSED containers. I feel that this should be handled in the ContainerReportHandler code. If a replica is reported for a container which is DELETED in SCM, then we should instruct the datanode to delete the replica. What do you think? The parameter hdds.scm.unknown-container.action currently controls what to do if a DN reports a container which SCM does not know about. The default is to log a warning, but for a DELETED container, I think we should just delete the replica.

Agreed, we should just delete the replica in that case.

> 3. There is another corner case, where the dead node may have a container which is out of sync with the others, as it was dead when the blocks should have been removed. Therefore we had 3 empty container replicas, plus one non-empty dead one. The 3 empty replicas are removed, and the container moved to "DELETED". Then the dead node starts reporting a non-empty replica for the DELETED container. This is actually a wider problem, in that I believe there are scenarios where container replicas can get out of sync with one another and there is nothing to reconcile them and fix it.

The extra stale replica is a real problem, not just in the empty container deletion case, but also for the non-empty CLOSED container case. After a delete block command is confirmed by 3 datanodes, the command is finished and removed from the block deletion log. So if a fourth replica shows up after that, SCM will not send a new delete block command to this replica. But this fourth replica will have all remaining delete block commands executed if its datanode is still on. Maybe we can rely on the over-replicated container replica deletion logic to remove this extra fourth replica, but it's hard to choose which one to delete if their deleteTransactionId and blockCommitSequenceId are different. Currently handleOverReplicatedContainer doesn't check these two fields of each replica.
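As an illustration (not current behaviour), such a check might prefer deleting the replica that lags furthest behind:

  // Illustration only: pick the excess replica with the smallest sequence
  // id, i.e. the most out-of-date one, as the deletion candidate.
  Optional<ContainerReplica> staleCandidate = replicas.stream()
      .min(Comparator.comparingLong(ContainerReplica::getSequenceId));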

> 4. In ContainerSafeModeRule, I think we need to consider DELETING and DELETED. If there are a lot of DELETED containers in the cluster, then they will never get replicas and the safemode rule will never be met. I think this could be as simple as excluding DELETING and DELETED when loading the containerMap (right now OPEN and CLOSING are excluded) so we don't wait for replicas on these containers.

Good catch!

> 5. In `ReplicationManager.isContainerUnderReplicated(...)`, should we immediately return true if the containerState is DELETING or DELETED?

Right.

I also agree we should purge these deleted containers after a while. We can have a timer task that periodically removes the DELETED containers from the SCM DB. What do you think?
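A minimal sketch of such a timer task, assuming the existing getContainers(state) and deleteContainer(id) manager APIs and an arbitrary one-hour interval:

  // Hypothetical periodic purge of DELETED containers from the SCM DB.
  ScheduledExecutorService purger =
      Executors.newSingleThreadScheduledExecutor();
  purger.scheduleAtFixedRate(() -> {
    for (ContainerInfo container : containerManager.getContainers(
        HddsProtos.LifeCycleState.DELETED)) {
      try {
        containerManager.deleteContainer(container.containerID());
      } catch (IOException e) {
        LOG.warn("Failed to purge container {}", container.containerID(), e);
      }
    }
  }, 1, 1, TimeUnit.HOURS);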

@ChenSammi (Contributor, Author)

After a second thought, deleting the container record from the SCM DB immediately while keeping it in memory may be a better and cleaner choice. If there is a stale container replica, it can then be deleted based on the in-memory information. And the next time SCM starts, it doesn't need to handle DELETED containers anymore.

@ChenSammi (Contributor, Author)

@sodonnel @elek could you help take another look?

if (isHealthy(replicaProto::getState)) {
  final ContainerInfo containerInfo = containerManager
      .getContainer(containerId);
  if (containerInfo.getState() == HddsProtos.LifeCycleState.DELETED) {
Review comment from @sodonnel (Contributor) on the diff above:

It doesn't seem correct to put the logic to delete the replica for the DELETED container inside updateContainerStats.

In ContainerReportHandler#processContainerReplicas(..) there is logic to delete an unknown container in the exception handler.

Could we extract this into a new method, which is called from the exception handler? Then in AbstractContainerReportHandler#updateContainerState(...) handle the containers which should be deleted in the "case DELETED" branch of the switch statement. It could call that same extracted method - that way the logic to form the DeleteContainer command will be the same for both. It also seems more logical to put the delete inside updateContainerState rather than updateContainerStats.
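Roughly what the extracted helper might look like (a sketch; the method name is illustrative):

  // Sketch of the shared helper: fire a delete-container command to a
  // datanode, callable from both the unknown-container exception handler
  // and the DELETED branch of updateContainerState.
  private void deleteReplica(ContainerID containerID, DatanodeDetails dn,
      EventPublisher publisher) {
    final DeleteContainerCommand deleteCommand =
        new DeleteContainerCommand(containerID.getId(), true);
    publisher.fireEvent(SCMEvents.DATANODE_COMMAND,
        new CommandForDatanode<>(dn.getUuid(), deleteCommand));
  }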

@sodonnel (Contributor)

Sorry for the slow reply on this. I have been caught up on some other things.

> After a second thought, deleting the container record from the SCM DB immediately while keeping it in memory may be a better and cleaner choice. If there is a stale container replica, it can then be deleted based on the in-memory information.

I think this is a good enough idea for now. If SCM is up for a very long time, perhaps in the future we will want to add a thread to clear all the in-memory DELETED containers. One small concern: if a container goes DELETED and SCM is restarted soon after, then a DN that restarts and reports a stale replica will just be seen as reporting an unknown container. The default position there is to log a warning; the config hdds.scm.unknown-container.action controls this. This is all an edge case - most of the time, all DNs should be up anyway.

I left just one comment on a suggested refactor in the container report handler, when dealing with replicas from a DELETED container.

Could you also add a test in TestContainerReportHandler to check the logic around deleting a replica from a DELETED container?

@ChenSammi (Contributor, Author)

> Sorry for the slow reply on this. I have been caught up on some other things.
>
> > After a second thought, deleting the container record from the SCM DB immediately while keeping it in memory may be a better and cleaner choice. If there is a stale container replica, it can then be deleted based on the in-memory information.
>
> I think this is a good enough idea for now. If SCM is up for a very long time, perhaps in the future we will want to add a thread to clear all the in-memory DELETED containers. One small concern: if a container goes DELETED and SCM is restarted soon after, then a DN that restarts and reports a stale replica will just be seen as reporting an unknown container. The default position there is to log a warning; the config hdds.scm.unknown-container.action controls this. This is all an edge case - most of the time, all DNs should be up anyway.
>
> I left just one comment on a suggested refactor in the container report handler, when dealing with replicas from a DELETED container.
>
> Could you also add a test in TestContainerReportHandler to check the logic around deleting a replica from a DELETED container?

Thanks @sodonnel, a new commit to address the concerns.

@sodonnel (Contributor)

This change looks almost good now. I wonder about two final things:

  1. In updateContainerStats(...) do you think we should return if the container is DELETING or DELETED without making any updates? If this is a stale replica, it may not be empty and hence could update the container stats incorrectly. If the container is already in DELETING or DELETED state, then we can ignore any changes to it, as we know the stale replica will get removed anyway (a sketch follows this list).

  2. I am wondering if we could receive a stale replica when the container is DELETING. Then a replica would get added in updateContainerReplica(...). Then later the container will go to DELETED and the stale replica will get reported again - at that point we will send a delete command, but I think the replica will never get removed from memory now. Would it make sense to send the delete command for any replicas received when the container is DELETING or DELETED?
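For point 1, the early return could be as simple as this sketch at the top of updateContainerStats:

  // Sketch for point 1: skip stat updates once a container is being deleted.
  final HddsProtos.LifeCycleState state = containerInfo.getState();
  if (state == HddsProtos.LifeCycleState.DELETING
      || state == HddsProtos.LifeCycleState.DELETED) {
    // Any replica reported now is stale and will be removed anyway; its
    // stats must not overwrite the container's final usedBytes/keyCount.
    return;
  }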

@ChenSammi (Contributor, Author)

> This change looks almost good now. I wonder about two final things:
>
> 1. In `updateContainerStats(...)` do you think we should return if the container is DELETING or DELETED without making any updates? If this is a stale replica, it may not be empty and hence could update the container stats incorrectly. If the container is already in DELETING or DELETED state, then we can ignore any changes to it, as we know the stale replica will get removed anyway.
>
> 2. I am wondering if we could receive a stale replica when the container is DELETING. Then a replica would get added in `updateContainerReplica(...)`. Then later the container will go to DELETED and the stale replica will get reported again - at that point we will send a delete command, but I think the replica will never get removed from memory now. Would it make sense to send the delete command for any replicas received when the container is DELETING or DELETED?

There is the following logic in ReplicationManager, which will handle the replicas reported while the container state is DELETING. So we only need to send a delete replica command for DELETED containers during container report processing, and let the ReplicationManager handle the DELETING container replicas: when a container is in DELETING state, it is certain that SCM sent out some replica deletion commands, but we cannot tell from the container report whether a replica is stale or not.

if (state == LifeCycleState.DELETING) {
  handleContainerUnderDelete(container, replicas);
  return;
}


 */
if (isContainerEmpty(container, replicas)) {
  deleteContainerReplicas(container, replicas);
}
Review comment from @linyiqun (Contributor) on the diff above:

@ChenSammi, is there any specific reason that we let ReplicationManager help clean up empty containers? With this change, ReplicationManager will additionally do an empty-container check for all healthy containers. I'm not sure putting the logic here is efficient.

> I wonder if it would be simpler to remove empty containers as part of Container Report processing? In AbstractContainerReportHandler#updateContainerState, we could check the size and number of keys of the reported containers in the CLOSED branch of the switch statement, and then take action to delete an empty container there? I have a feeling it might be simpler, but I am not sure. The disadvantage of doing it in Container Report processing is that we are dealing with only a single replica at that stage. However if the container is CLOSED in SCM, and a report says it is empty, then we should be good to simply remove the container from SCM and issue the delete container command when processing the container report.

Actually I prefer the approach @sodonnel mentioned:

> but I am not sure. The disadvantage of doing it in Container Report processing is that we are dealing with only a single replica at that stage

We could also get all the replica info and check the state in ContainerReportHandler, then send the delete container command.

I'm okay with the current way, but just sharing my thoughts on this.

@sodonnel (Contributor)

> There is the following logic in ReplicationManager, which will handle the replicas reported while the container state is DELETING.

Sorry I missed that. You are correct. I am +1 on this change as it is now, so feel free to commit it.

@linyiqun I do agree that this could be handled more cleanly and efficiently in the container report handler. However, it's probably not much of an overhead for the replication manager. I am happy for us to commit the change as it is, and we can see how it performs in practice. Worst case, we have to refactor the change out of RM into the report handler. What do you think?

@linyiqun (Contributor) commented Sep 29, 2020

> @linyiqun I do agree that this could be handled more cleanly and efficiently in the container report handler. However, it's probably not much of an overhead for the replication manager. I am happy for us to commit the change as it is, and we can see how it performs in practice. Worst case, we have to refactor the change out of RM into the report handler. What do you think?

+1 for this, @sodonnel.

@ChenSammi, can you add a TODO comment like the one below while committing this PR? That will be helpful for us to revisit this in the future.

// TODO: handle empty containers in the container report handler.
if (isContainerEmpty(container, replicas)) {
  deleteContainerReplicas(container, replicas);
}

+1 from me.

@ChenSammi (Contributor, Author)

Thanks @sodonnel and @linyiqun for the review.

Basically I think the report handler is not a good place to handle the whole empty-container deletion process. It can tell which container is empty, but it lacks the facilities of ReplicationManager, such as inflightDeletion tracking, sending delete commands to extra replicas of DELETING containers, or resending commands. In the future, when container compaction is considered, we can move this container deletion logic alongside it.
