
Conversation


@sodonnel (Contributor) commented Aug 19, 2020

Thanks for working on this @ChenSammi.

I wonder if it would be simpler to remove empty containers as part of Container Report processing? In AbstractContainerReportHandler#updateContainerState, we could check the size and number of keys of the reported containers in the CLOSED branch of the switch statement, and then take action to delete an empty container there? I have a feeling it might be simpler, but I am not sure. The disadvantage of doing it in Container Report processing is that we are dealing with only a single replica at that stage. However if the container is CLOSED in SCM, and a report says it is empty, then we should be good to simply remove the container from SCM and issue the delete container command when processing the container report.
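For illustration, a rough sketch of how such a check could look when handling a CLOSED container's report (the sendDeleteCommand helper is hypothetical, and the surrounding class context is assumed):

  // Sketch only: the container is CLOSED in SCM and this replica reports
  // itself empty, so move the container towards deletion and tell the
  // datanode to drop the replica.
  private void deleteIfEmpty(ContainerID containerId,
      ContainerReplicaProto replicaProto) throws IOException {
    if (replicaProto.getKeyCount() == 0 && replicaProto.getUsed() == 0) {
      containerManager.updateContainerState(containerId,
          HddsProtos.LifeCycleEvent.DELETE);
      sendDeleteCommand(containerId); // hypothetical helper
    }
  }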

In looking at how this all works, I noticed there is a bug in the existing method AbstractContainerReportHandler#updateContainerStats, which might stop your change from working:

  private void updateContainerStats(final ContainerID containerId,
                                    final ContainerReplicaProto replicaProto)
      throws ContainerNotFoundException {
    if (isHealthy(replicaProto::getState)) {
      final ContainerInfo containerInfo = containerManager
          .getContainer(containerId);

      if (containerInfo.getSequenceId() <
          replicaProto.getBlockCommitSequenceId()) {
        containerInfo.updateSequenceId(
            replicaProto.getBlockCommitSequenceId());
      }
      if (containerInfo.getUsedBytes() < replicaProto.getUsed()) {
        containerInfo.setUsedBytes(replicaProto.getUsed());
      }
      if (containerInfo.getNumberOfKeys() < replicaProto.getKeyCount()) {
        containerInfo.setNumberOfKeys(replicaProto.getKeyCount());
      }
    }
  }

This logic assumes the bytes used and keyCount are always increasing on the container. However in this case, where blocks are deleted, the keyCount and bytesUsed can decrease. Therefore I don't think the stats will ever get updated in SCM via the container reports when blocks are removed.
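One possible shape for a fix (just a sketch - the real discussion belongs in HDDS-4131 below) would be to accept a CLOSED replica's reported values as-is, since they are final, rather than only ratcheting them upward:

  // Sketch of a possible fix: a CLOSED replica's stats are final, so accept
  // decreases too; other replicas keep the existing monotonic behaviour.
  if (replicaProto.getState() == ContainerReplicaProto.State.CLOSED) {
    containerInfo.setUsedBytes(replicaProto.getUsed());
    containerInfo.setNumberOfKeys(replicaProto.getKeyCount());
  } else {
    if (containerInfo.getUsedBytes() < replicaProto.getUsed()) {
      containerInfo.setUsedBytes(replicaProto.getUsed());
    }
    if (containerInfo.getNumberOfKeys() < replicaProto.getKeyCount()) {
      containerInfo.setNumberOfKeys(replicaProto.getKeyCount());
    }
  }

Whether to trust a single replica's report or aggregate across all replicas is exactly the kind of question that would need settling there.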

I then noticed you recently made a change in ReplicationManager to update these counts here:

https://github.com/apache/hadoop-ozone/pull/1295/files#diff-218132dfe72e6e39b9eaaf59737adcf2R780

I think the ReplicationManager part of that change should be reverted and this handled via the container reports, by fixing the bug I mentioned above. The only reason we need the ReplicationManager change is to work around that bug; if it was fixed, we wouldn't need that logic in RM.

If we decide to keep your changes inside ReplicationManager (for this change), then we would need a few more tests added to TestReplicationManager to cover the new scenarios.

I also have a question - when the replicas are all deleted, and we call:

containerManager.updateContainerState(container.containerID(),
          HddsProtos.LifeCycleEvent.CLEANUP);

Does this somehow cause the container to be removed from SCM memory and the persistent store?

@sodonnel (Contributor)

I raised HDDS-4131 / #1339 to discuss and address the bytesUsed and keyCount metric problem I mentioned above.

@ChenSammi (Contributor, Author) commented Aug 21, 2020

Hi @sodonnel, thanks for reviewing the code and initiating the discussion.

> I wonder if it would be simpler to remove empty containers as part of Container Report processing? In AbstractContainerReportHandler#updateContainerState, we could check the size and number of keys of the reported containers in the CLOSED branch of the switch statement, and then take action to delete an empty container there? I have a feeling it might be simpler, but I am not sure. The disadvantage of doing it in Container Report processing is that we are dealing with only a single replica at that stage. However if the container is CLOSED in SCM, and a report says it is empty, then we should be good to simply remove the container from SCM and issue the delete container command when processing the container report.

> In looking at how this all works, I noticed there is a bug in the existing method AbstractContainerReportHandler#updateContainerStats, which might stop your change from working:
>
> (updateContainerStats code quoted above)
>
> This logic assumes the bytes used and keyCount are always increasing on the container. However in this case, where blocks are deleted, the keyCount and bytesUsed can decrease. Therefore I don't think the stats will ever get updated in SCM via the container reports when blocks are removed.

This logic works for OPEN containers, but not for CLOSED containers.

> I then noticed you recently made a change in ReplicationManager to update these counts here:
>
> https://github.com/apache/hadoop-ozone/pull/1295/files#diff-218132dfe72e6e39b9eaaf59737adcf2R780
>
> I think the ReplicationManager part of that change should be reverted and this handled via the container reports, by fixing the bug I mentioned above. The only reason we need the ReplicationManager change is to work around that bug; if it was fixed, we wouldn't need that logic in RM.
You can find my comments in HDDS-4131.

> If we decide to keep your changes inside ReplicationManager (for this change), then we would need a few more tests added to TestReplicationManager to cover the new scenarios.

TestReplicationManager uses Mockito, so it is not a real end-to-end test. I added an end-to-end test in TestBlockDeletion.

> I also have a question - when the replicas are all deleted, and we call:
>
>     containerManager.updateContainerState(container.containerID(),
>         HddsProtos.LifeCycleEvent.CLEANUP);
>
> Does this somehow cause the container to be removed from SCM memory and the persistent store?

A container in DELETED state currently remains in both SCM memory and the persistent store. Of course we could just delete the container from memory and the DB when its state changes to DELETED, but I feel the current way is safer, like a container trash bin. We can add a new "purge DELETED containers" action next, so users can manually delete these containers once they are sure the information is no longer needed. Overall, I'm open on this question and fine with either way.

@ChenSammi (Contributor, Author)

The failed UT is unrelated; it is fixed by #1379.

@ChenSammi (Contributor, Author) commented Sep 4, 2020

Hi @sodonnel, do you have time for another review?

handleContainerUnderDelete(container, replicas);
return;
}

Review comment (Contributor) on the diff above:

Should we add another check here:

if (state == LifeCycleState.DELETED) {
  return;
}

This would avoid doing any further processing on a container which is expected to have zero replicas.

@sodonnel (Contributor) commented Sep 8, 2020

As I said before, I have a feeling that it might be better to have a separate service for deleting empty containers, and it would perhaps naturally be part of a "merge small containers" service, which we will need to create eventually. However, the changes to the replication manager here are not too big, and you could argue the current solution fits here OK. Therefore I am OK with keeping this logic in the replication manager. However when it comes to merging small containers in the future, I believe we should create a separate service and not clutter RM with that new feature too.

The patch looks mostly good to me, there are just a couple of areas we need to consider.

  1. What do we do with the containers in DELETED state, and with no replicas? I don't think we should keep them in SCM forever, but we are also worried about removing them. However that said, if the replicas have all been deleted, what use is there in keeping the container in SCM?

  2. Consider if a datanode was dead for some time, and is then restarted. It will start reporting a replica for the DELETED container and with the current logic I don't think we will clean it up, as isContainerEmpty(...) will always return false due to it checking for only CLOSED containers. I feel that this should be handled in the ContainerReportHandler code. If a replica is reported for a container which is DELETED in SCM, then we should instruct the datanode to delete the replica (see the sketch after this list). What do you think? The parameter hdds.scm.unknown-container.action currently controls what to do if a DN reports a container which SCM does not know about. The default is to log a warning, but for a DELETED container, I think we should just delete the replica.

  3. There is another corner case, where the dead node may have a container which is out of sync with the others, as it was dead when the blocks should have been removed. Therefore we had 3 empty container replicas, plus one non-empty dead one. The 3 empty replicas are removed, and the container moved to "DELETED". Then the dead node starts reporting a non-empty replica for the DELETED container. This is actually a wider problem, in that I believe there are scenarios where container replicas can get out of sync with one another and there is nothing to reconcile them and fix it.

  4. In ContainerSafeModeRule, I think we need to consider DELETING and DELETED. If there are a lot of DELETED containers in the cluster, then they will never get replicas and the safemode rule will never be met. I think this could be as simple as excluding DELETING and DELETED when loading the containerMap (right now OPEN and CLOSING are excluded) so we don't wait for replicas on these containers.

  5. In ReplicationManager.isContainerUnderReplicated(...), should we immediately return true if the containerState is DELETING or DELETED?

  6. Are there any other places outside of ReplicationManager and ContainerSafeModeRule where we monitor under- or over-replicated containers that I may have missed?
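To sketch the idea in point 2 (the wiring here is illustrative; only the command and event types are real code):

  // Sketch for point 2: a report arrives for a container that SCM already
  // marks DELETED, so instruct the datanode to remove its stale replica.
  private void handleReplicaOfDeletedContainer(ContainerInfo containerInfo,
      DatanodeDetails datanodeDetails, EventPublisher publisher) {
    if (containerInfo.getState() == HddsProtos.LifeCycleState.DELETED) {
      final DeleteContainerCommand command = new DeleteContainerCommand(
          containerInfo.getContainerID(), true); // force delete
      publisher.fireEvent(SCMEvents.DATANODE_COMMAND,
          new CommandForDatanode<>(datanodeDetails.getUuid(), command));
    }
  }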

@sodonnel (Contributor) commented Sep 8, 2020

Thinking some more about what to do with containers in the DELETED state in SCM. If we are not confident to delete them immediately and automatically, maybe we should provide a command "purge empty containers" which can be run by an admin, or via Recon. We should show in Recon how many empty containers are present in the cluster and then provide an option to remove them.

Alternatively we could have some threshold in SCM, so that when the number of DELETED containers crosses that threshold, they are all removed.

We could do either of these in a separate jira if this idea makes sense.

@ChenSammi (Contributor, Author) commented Sep 16, 2020

Thanks @sodonnel for the very thoughtful suggestions.

> 2. Consider if a datanode was dead for some time, and is then restarted. It will start reporting a replica for the DELETED container and with the current logic I don't think we will clean it up, as `isContainerEmpty(...)` will always return false due to it checking for only CLOSED containers. I feel that this should be handled in the ContainerReportHandler code. If a replica is reported for a container which is DELETED in SCM, then we should instruct the datanode to delete the replica. What do you think? The parameter hdds.scm.unknown-container.action currently controls what to do if a DN reports a container which SCM does not know about. The default is to log a warning, but for a DELETED container, I think we should just delete the replica.

Agreed, we should just delete the replica in that case.

> 3. There is another corner case, where the dead node may have a container which is out of sync with the others, as it was dead when the blocks should have been removed. Therefore we had 3 empty container replicas, plus one non-empty dead one. The 3 empty replicas are removed, and the container moved to "DELETED". Then the dead node starts reporting a non-empty replica for the DELETED container. This is actually a wider problem, in that I believe there are scenarios where container replicas can get out of sync with one another and there is nothing to reconcile them and fix it.

The extra stale replica is a real problem, not just in the empty container deletion case, but also for the non-empty CLOSED container case. After a delete block command is confirmed by 3 datanodes, the command is finished and removed from the block deletion log. So if a fourth replica shows up after that, SCM will not send a new delete block command to this replica. But this fourth replica will have all remaining delete block commands executed if its datanode is still on. Maybe we can rely on the over-replicated container replica deletion logic to remove this extra fourth replica, but it's hard to choose which one to delete if their deleteTransactionId and blockCommitSequenceId are different. Currently handleOverReplicatedContainer doesn't check these two fields of each replica.
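As an illustration (not current behaviour), such a check might prefer deleting the replica that lags furthest behind:

  // Illustration only: pick the excess replica with the smallest sequence
  // id, i.e. the most out-of-date one, as the deletion candidate.
  Optional<ContainerReplica> staleCandidate = replicas.stream()
      .min(Comparator.comparingLong(ContainerReplica::getSequenceId));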

> 4. In ContainerSafeModeRule, I think we need to consider DELETING and DELETED. If there are a lot of DELETED containers in the cluster, then they will never get replicas and the safemode rule will never be met. I think this could be as simple as excluding DELETING and DELETED when loading the containerMap (right now OPEN and CLOSING are excluded) so we don't wait for replicas on these containers.

Good catch!

> 5. In `ReplicationManager.isContainerUnderReplicated(...)`, should we immediately return true if the containerState is DELETING or DELETED?

Right.

I also agree we should purge these deleted containers after a while. We can have a timer task that periodically removes the DELETED containers from the SCM DB. What do you think?
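A minimal sketch of such a timer task, assuming the existing getContainers(state) and deleteContainer(id) manager APIs and an arbitrary one-hour interval:

  // Hypothetical periodic purge of DELETED containers from the SCM DB.
  ScheduledExecutorService purger =
      Executors.newSingleThreadScheduledExecutor();
  purger.scheduleAtFixedRate(() -> {
    for (ContainerInfo container : containerManager.getContainers(
        HddsProtos.LifeCycleState.DELETED)) {
      try {
        containerManager.deleteContainer(container.containerID());
      } catch (IOException e) {
        LOG.warn("Failed to purge container {}", container.containerID(), e);
      }
    }
  }, 1, 1, TimeUnit.HOURS);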

@ChenSammi (Contributor, Author)

After a second thought, deleting the container record from the SCM DB immediately while keeping it in memory may be a better and cleaner choice. If there is a stale container replica, it can then be deleted based on the in-memory information. And the next time SCM starts, it doesn't need to handle DELETED containers anymore.

@ChenSammi (Contributor, Author)

@sodonnel @elek could you help take another look?

if (isHealthy(replicaProto::getState)) {
  final ContainerInfo containerInfo = containerManager
      .getContainer(containerId);
  if (containerInfo.getState() == HddsProtos.LifeCycleState.DELETED) {
Review comment from @sodonnel (Contributor) on the diff above:

It doesn't seem correct to put the logic to delete the replica for the DELETED container inside updateContainerStats.

In ContainerReportHandler#processContainerReplicas(..) there is logic to delete an unknown container in the exception handler.

Could we extract this into a new method, which is called from the exception handler? Then in AbstractContainerReportHandler#updateContainerState(...) handle the containers which should be deleted in the "case DELETED" branch of the switch statement. It could call that same extracted method - that way the logic to form the DeleteContainer command will be the same for both. It also seems more logical to put the delete inside updateContainerState rather than updateContainerStats.
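Roughly what the extracted helper might look like (a sketch; the method name is illustrative):

  // Sketch of the shared helper: fire a delete-container command to a
  // datanode, callable from both the unknown-container exception handler
  // and the DELETED branch of updateContainerState.
  private void deleteReplica(ContainerID containerID, DatanodeDetails dn,
      EventPublisher publisher) {
    final DeleteContainerCommand deleteCommand =
        new DeleteContainerCommand(containerID.getId(), true);
    publisher.fireEvent(SCMEvents.DATANODE_COMMAND,
        new CommandForDatanode<>(dn.getUuid(), deleteCommand));
  }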

@sodonnel (Contributor)

Sorry for the slow reply on this. I have been caught up on some other things.

> After a second thought, deleting the container record from the SCM DB immediately while keeping it in memory may be a better and cleaner choice. If there is a stale container replica, it can then be deleted based on the in-memory information.

I think this is a good enough idea for now. If SCM is up for a very long time, perhaps in the future we will want to add a thread to clear all the in-memory DELETED containers. One small concern: if a container goes DELETED and SCM is restarted soon after, then a DN that restarts and reports a stale replica will just be seen as reporting an unknown container. The default position there is to log a warning; the config hdds.scm.unknown-container.action controls this. This is all an edge case - most of the time, all DNs should be up anyway.

I left just one comment on a suggested refactor in the container report handler, when dealing with replicas from a DELETED container.

Could you also add a test in TestContainerReportHandler to check the logic around deleting a replica from a DELETED container?

@ChenSammi (Contributor, Author)

> Sorry for the slow reply on this. I have been caught up on some other things.
>
> > After a second thought, deleting the container record from the SCM DB immediately while keeping it in memory may be a better and cleaner choice. If there is a stale container replica, it can then be deleted based on the in-memory information.
>
> I think this is a good enough idea for now. If SCM is up for a very long time, perhaps in the future we will want to add a thread to clear all the in-memory DELETED containers. One small concern: if a container goes DELETED and SCM is restarted soon after, then a DN that restarts and reports a stale replica will just be seen as reporting an unknown container. The default position there is to log a warning; the config hdds.scm.unknown-container.action controls this. This is all an edge case - most of the time, all DNs should be up anyway.
>
> I left just one comment on a suggested refactor in the container report handler, when dealing with replicas from a DELETED container.
>
> Could you also add a test in TestContainerReportHandler to check the logic around deleting a replica from a DELETED container?

Thanks @sodonnel, a new commit to address the concerns.

@sodonnel (Contributor)

This change looks almost good now. I wonder about two final things:

  1. In updateContainerStats(...) do you think we should return if the container is DELETING or DELETED without making any updates? If this is a stale replica, it may not be empty and hence could update the container stats incorrectly. If the container is already in DELETING or DELETED state, then we can ignore any changes to it, as we know the stale replica will get removed anyway (a sketch follows this list).

  2. I am wondering if we could receive a stale replica when the container is DELETING. Then a replica would get added in updateContainerReplica(...). Then later the container will go to DELETED and the stale replica will get reported again - at that point we will send a delete command, but I think the replica will never get removed from memory now. Would it make sense to send the delete command for any replicas received when the container is DELETING or DELETED?
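For point 1, the early return could be as simple as this sketch at the top of updateContainerStats:

  // Sketch for point 1: skip stat updates once a container is being deleted.
  final HddsProtos.LifeCycleState state = containerInfo.getState();
  if (state == HddsProtos.LifeCycleState.DELETING
      || state == HddsProtos.LifeCycleState.DELETED) {
    // Any replica reported now is stale and will be removed anyway; its
    // stats must not overwrite the container's final usedBytes/keyCount.
    return;
  }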

@ChenSammi (Contributor, Author)

> This change looks almost good now. I wonder about two final things:
>
> 1. In `updateContainerStats(...)` do you think we should return if the container is DELETING or DELETED without making any updates? If this is a stale replica, it may not be empty and hence could update the container stats incorrectly. If the container is already in DELETING or DELETED state, then we can ignore any changes to it, as we know the stale replica will get removed anyway.
>
> 2. I am wondering if we could receive a stale replica when the container is DELETING. Then a replica would get added in `updateContainerReplica(...)`. Then later the container will go to DELETED and the stale replica will get reported again - at that point we will send a delete command, but I think the replica will never get removed from memory now. Would it make sense to send the delete command for any replicas received when the container is DELETING or DELETED?

There is the following logic in ReplicationManager, which will handle the replicas reported while the container state is DELETING. So we only need to send a delete replica command for DELETED containers during container report processing, and let the ReplicationManager handle the DELETING container replicas: when a container is in DELETING state, it is certain that SCM sent out some replica deletion commands, but we cannot tell from the container report whether a replica is stale or not.

if (state == LifeCycleState.DELETING) {
  handleContainerUnderDelete(container, replicas);
  return;
}


 */
if (isContainerEmpty(container, replicas)) {
  deleteContainerReplicas(container, replicas);
}
Review comment from @linyiqun (Contributor) on the diff above:

@ChenSammi, is there any specific reason that we let ReplicationManager help clean up empty containers? With this change, ReplicationManager will additionally do an empty-container check for all healthy containers. I'm not sure putting the logic here is efficient.

> I wonder if it would be simpler to remove empty containers as part of Container Report processing? In AbstractContainerReportHandler#updateContainerState, we could check the size and number of keys of the reported containers in the CLOSED branch of the switch statement, and then take action to delete an empty container there? I have a feeling it might be simpler, but I am not sure. The disadvantage of doing it in Container Report processing is that we are dealing with only a single replica at that stage. However if the container is CLOSED in SCM, and a report says it is empty, then we should be good to simply remove the container from SCM and issue the delete container command when processing the container report.

Actually I prefer the approach @sodonnel mentioned:

> but I am not sure. The disadvantage of doing it in Container Report processing is that we are dealing with only a single replica at that stage

We could also get all the replica info and check the state in ContainerReportHandler, then send the delete container command.

I'm okay with the current way, but just sharing my thoughts on this.

@sodonnel (Contributor)

> There is the following logic in ReplicationManager, which will handle the replicas reported while the container state is DELETING.

Sorry I missed that. You are correct. I am +1 on this change as it is now, so feel free to commit it.

@linyiqun I do agree that this could be handled more cleanly and efficiently in the container report handler. However, it's probably not much of an overhead for the replication manager. I am happy for us to commit the change as it is, and we can see how it performs in practice. Worst case, we have to refactor the change out of RM into the report handler. What do you think?

@linyiqun (Contributor) commented Sep 29, 2020

> @linyiqun I do agree that this could be handled more cleanly and efficiently in the container report handler. However, it's probably not much of an overhead for the replication manager. I am happy for us to commit the change as it is, and we can see how it performs in practice. Worst case, we have to refactor the change out of RM into the report handler. What do you think?

+1 for this, @sodonnel.

@ChenSammi, can you add a TODO comment like the one below while committing this PR? That will be helpful for us to revisit this in the future.

// TODO: handle empty containers in the container report handler.
if (isContainerEmpty(container, replicas)) {
  deleteContainerReplicas(container, replicas);
}

+1 from me.

@ChenSammi (Contributor, Author)

Thanks @sodonnel and @linyiqun for the review.

Basically I think the report handler is not a good place to handle the whole empty-container deletion process. It can tell which container is empty, but it lacks the facilities of ReplicationManager, such as inflightDeletion tracking, sending delete commands to extra replicas of DELETING containers, or resending commands. In the future, when container compaction is considered, we can move this container deletion logic alongside it.
