
Conversation

@siddhantsangwan
Contributor

What changes were proposed in this pull request?

Please check this document for the proposed solutions: https://docs.google.com/document/d/1RjXsy88RgzweAi51v8sVBinbGfsmojvJqfUQCeqtrJc/edit?usp=sharing.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11136

How was this patch tested?

Added unit tests.

@errose28 errose28 left a comment

Thanks for working on this @siddhantsangwan. Let's make the doc a PDF and link it to the Jira so it is always visible for everyone. Also, let's add some description of corner cases and how they are handled somewhere in the PR or doc. Here are a few that would be good to cover:

  • Old replica that missed block deletes comes back and begins reporting after SCM already cleared out the blocks deleted from the current replicas.
  • How a container that was incorrectly in the DELETING state and moved back to CLOSED will eventually get cleared from the system.

Comment on lines 135 to 136
* Bypasses the container state machine to change a container's state from DELETING to CLOSED. This API was
* specially introduced to fix a bug (HDDS-11136), and should NOT be used in any other context.

Maybe we can make this description a little more general? We can reference the initial bug, but it's also a decent defensive mechanism against accidental deletion if we end up in this situation another way.
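For example, the wording could be generalized along these lines (just a sketch; exact phrasing to be decided by the author):

 * Bypasses the container state machine to move a container from DELETING back to CLOSED. This is a defensive
 * recovery path for containers that were incorrectly marked for deletion (HDDS-11136 is one example) and
 * should not be used outside of such recovery scenarios.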

Comment on lines 380 to 381
throw new InvalidContainerStateException("Container " + oldInfo + "must be in DELETING state to " +
"transition to CLOSED.");

I think the ContainerInfo object is a bit more verbose than we need here. Also I think we can touch up the message to indicate we are moving it "back" to closed. Otherwise it looks like a container isn't supposed to move to closed from any state other than deleting.

Suggested change
throw new InvalidContainerStateException("Container " + oldInfo + "must be in DELETING state to " +
"transition to CLOSED.");
throw new InvalidContainerStateException("Cannot transition container " + id + " from " + oldState + " back to CLOSED. The container must be in the DELETING state.");

HddsProtos.ContainerInfoProto container = builder.build();
HddsProtos.ContainerID cid = HddsProtos.ContainerID.newBuilder().setId(container.getContainerID()).build();
containerStateManager.addContainer(container);
assertThrows(IOException.class, () -> containerStateManager.transitionDeletingToClosedState(cid));

We can test for the more specific InvalidContainerStateException here.
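For example, a sketch of the same assertion with the narrower expected type (assuming InvalidContainerStateException is the concrete exception thrown by the transition, as in the SCM-side code quoted above):

assertThrows(InvalidContainerStateException.class, () -> containerStateManager.transitionDeletingToClosedState(cid));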

}

@Test
void testTransitionDeletingToClosedState() throws IOException, InvalidStateTransitionException {

nit. Can we parameterize this by container type instead of duplicating each line for each container?
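Something along these lines, for example (a sketch using JUnit 5, assuming "container type" here means the replication type; createDeletingContainer is a hypothetical helper and the final assertion depends on the state manager's lookup API):

@ParameterizedTest
@EnumSource(value = HddsProtos.ReplicationType.class, names = {"RATIS", "EC"})
void testTransitionDeletingToClosedState(HddsProtos.ReplicationType type) throws IOException, InvalidStateTransitionException {
  // hypothetical helper that builds a DELETING container proto with the given replication type
  HddsProtos.ContainerInfoProto container = createDeletingContainer(type);
  HddsProtos.ContainerID cid = HddsProtos.ContainerID.newBuilder().setId(container.getContainerID()).build();
  containerStateManager.addContainer(container);
  containerStateManager.transitionDeletingToClosedState(cid);
  // assert the container is now CLOSED via the state manager's container lookup
}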

@siddhantsangwan
Contributor Author

How a container that was incorrectly in the DELETING state and moved back to CLOSED will eventually get cleared from the system.

@errose28 can you elaborate on this please?

@siddhantsangwan
Contributor Author

@errose28 and @nandakumar131 thanks for reviewing.

The latest commit addresses review comments. I've added two corner cases to the linked design doc and it should be visible to everyone. I'll attach it to the jira once it's finalised. @errose28 I'll also think about the corner case you mentioned about containers getting cleared once I understand it better.

@siddhantsangwan
Contributor Author

The latest commit also adds logic for transitioning a container from DELETING to CLOSED if a non-empty quasi-closed replica is reported. This covers the case when only quasi-closed replicas remain and the others have been lost - the first corner case mentioned in the design document.

Note that we don't have a check comparing the sequence ID of the container as known to SCM with the sequence ID of the quasi-closed replica. I don't think we should have that check because:

  1. If quasi-closed replicas are the only ones left, we need to replicate them even if the sequence ID isn't the latest.
  2. Non-open containers only decrease in size, so if the quasi-closed replica is non-empty, we should close the container.

@siddhantsangwan
Contributor Author

Added an integration test.

@siddhantsangwan
Contributor Author

Regarding the test failures for the two integration tests I added in TestContainerReportHandling: each test passes if I comment the other one out, so the tests themselves are fine. I think the MiniOzoneCluster is keeping some sort of state which is causing one test to fail when both are run. Looking into it.
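One common way to rule out state leaking between tests is to tear the cluster down after each test (a sketch, assuming a JUnit 5 per-test cluster field; whether this is the actual cause here is still being investigated):

@AfterEach
void shutdown() {
  if (cluster != null) {
    // MiniOzoneCluster exposes shutdown() to stop and clean up the cluster
    cluster.shutdown();
  }
}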

@errose28 errose28 left a comment

Thanks for the updates @siddhantsangwan. I'm not sure about this new case of using quasi-closed containers to transition out of deleting state as well.

In order for a container to have moved to the DELETING state in the first place, it must have first been in the CLOSED state. This means the majority of its replicas were also CLOSED. If we do not have any more CLOSED replicas then either of these two cases happened:

  • If all CLOSED replicas were lost, then that is a data loss scenario caused by too many faults in the system.
  • If all of the CLOSED replicas were deleted incorrectly, I would say they have already been impacted by HDDS-8129, and we should not try to save them from remaining partial/lagging replicas in states OPEN, CLOSING, QUASI-CLOSED.

Even though this patch is only trying to handle QUASI-CLOSED replicas right now, I'm grouping QUASI-CLOSED, OPEN, and CLOSING replicas together because:

  • If the SCM state is DELETING then we know these replicas' states are lagging behind CLOSED replicas that may or may not still exist.
  • They do not process block deletes. Therefore the block deleting service will leave their isEmpty flag as false.
    • This flag may be set to true on restart regardless of container state, but I think that is a separate issue we should fix, such that only CLOSED containers can have isEmpty = true.

There is extra complexity in trying to save these stale replicas. For example, currently the patch does not correctly handle the case where we see CLOSED replicas that are empty and non-CLOSED replicas which are not empty. The empty CLOSED replica confirms the non-empty non-closed one is stale and the container can be deleted. Whether or not the patch does this depends on the order of the container reports received and how those interleave with replication manager runs. IMO if the container reached the CLOSED state and then lost all its CLOSED replicas then it is beyond saving. HDDS-8129 was a data loss bug and this is a best-effort recovery of lingering effects. In some cases the damage may have already been done.

Comment on lines 355 to 356
if (replica.getState() == State.CLOSED || replica.getState() == State.QUASI_CLOSED && replica.hasIsEmpty() &&
!replica.getIsEmpty()) {

Suggested change
if (replica.getState() == State.CLOSED || replica.getState() == State.QUASI_CLOSED && replica.hasIsEmpty() &&
!replica.getIsEmpty()) {
if ((replica.getState() == State.CLOSED || replica.getState() == State.QUASI_CLOSED) && replica.hasIsEmpty() &&
    !replica.getIsEmpty()) {

Operator precedence would have caused a bug here. Adding parentheses fixes it, but I would usually split something like this out for readability:

boolean replicaStateAllowed = replica.getState() == State.CLOSED || replica.getState() == State.QUASI_CLOSED;
boolean replicaNotEmpty = replica.hasIsEmpty() && !replica.getIsEmpty();
if (replicaStateAllowed && replicaNotEmpty) {

Looks like we need tests verifying that empty replicas in valid states do not trigger the transition.

@siddhantsangwan
Contributor Author

Oh right, thanks for catching this!

@siddhantsangwan
Contributor Author

siddhantsangwan commented Jul 25, 2024

My intention in including quasi-closed replicas when deciding the container's state is basically to try to prevent complete data loss, even if there has been some data loss already. This is the direction we had decided on when refactoring the replication manager last year.

In order for a container to have moved to the DELETING state in the first place, it must have first been in the CLOSED state.

Agreed.

If all CLOSED replicas were all lost, then that is a data loss scenario caused by too many faults in the system.

There's data loss, right, but shouldn't we try and salvage any data that is remaining in the quasi-closed replicas?

and we should not try to save them from remaining partial/lagging replicas in states OPEN, CLOSING, QUASI-CLOSED.

Not sure I understand why we shouldn't save lagging quasi-closed replicas here - are you expecting the data race in HDDS-8129 to have completely corrupted these replicas?

Even though this patch is only trying to handle QUASI-CLOSED replicas right now, I'm grouping QUASI-CLOSED, OPEN, and CLOSING replicas together because:

If the SCM state is DELETING then we know these replicas' states are lagging behind CLOSED replicas that may or may not still exist.
They do not process block deletes. Therefore the block deleting service will leave their isEmpty flag as false.
This flag may be set to true on restart regardless of container state, but I think that is a separate issue we should fix, such that only CLOSED containers can have isEmpty = true.
There is extra complexity in trying to save these stale replicas. For example, currently the patch does not correctly handle the case where we see CLOSED replicas that are empty and non-CLOSED replicas which are not empty. The empty CLOSED replica confirms the non-empty non-closed one is stale and the container can be deleted. Whether or not the patch does this depends on the order of the container reports received and how those interleave with replication manager runs.

I'm not very clear on whether block deletion can directly cause a container state change. As far as I know, a container should transition to DELETING (correctly) only when it has a positive number of replicas and all its replicas known to SCM are closed and empty:

  private boolean isContainerEmptyAndClosed(final ContainerInfo container,
      final Set<ContainerReplica> replicas) {
    return container.getState() == HddsProtos.LifeCycleState.CLOSED &&
        !replicas.isEmpty() &&
        replicas.stream().allMatch(
            r -> r.getState() == ContainerReplicaProto.State.CLOSED &&
                r.isEmpty());
  }

That means if any non-closed replicas exist at all, the container is incorrectly in the deleting state, UNLESS all the closed replicas were empty.

Now consider our situation where a container moved to the deleting state incorrectly in a previous version because of keyCount being 0 when the container wasn't actually empty. Suppose it has a quasi-closed replica that was offline, and all the closed replicas have keyCount = 0 (but not actually empty).

On upgrading to the latest version with our fix, let's say the closed replicas are somehow lost and the quasi-closed replica comes back. At this point, the container is incorrectly in the deleting state because its closed replicas, that are now lost, weren't actually empty. So we should transition it back to CLOSED when the quasi-closed replica is reported. If we don't, we risk losing that replica as well.

For example, currently the patch does not correctly handle the case where we see CLOSED replicas that are empty and non-CLOSED replicas which are not empty. The empty CLOSED replica confirms the non-empty non-closed one is stale and the container can be deleted.

If the container has a non-closed non-empty replica, it shouldn't have moved to the DELETING state in the first place according to the code I quoted above. If the non-closed replica was somehow offline and that's why the check in the code passed, and we now transition the container back to CLOSED from DELETING on the basis of that non-closed replica, then we have an orphan container. I think that's an acceptable tradeoff.

@siddhantsangwan
Contributor Author

@errose28 thanks for the comprehensive review. I'll correct the bug you pointed out while we continue our discussion.

@errose28 errose28 left a comment

Thanks for the updates @siddhantsangwan. I think there are some minor improvements we can make in the tests, but we can do that in a follow-up to not block this important change.

After some discussion we decided to only roll back the state if a non-empty closed replica is reported, which was the original plan of this PR. Containers affected by HDDS-8129 must have had 3 closed replicas at some point, and if none of those replicas are present due to the bug, at that point we consider the data affected.

These are the main reasons we will only consider closed replicas in this change:

If a replica was offline and not closed, the state is not an indication of how much data is there.

If we decide to save quasi-closed replicas, we must also decide to save closing and open replicas. A quasi-closed replica could have less data than an open replica depending on when the replica went offline. This broadens the scope of the change which introduces other problems listed below.

Whether or not a non-closed replica would get saved depends on the order of heartbeats received.

If the non-closed replica re-appeared before an empty closed replica, it would be saved because the state would be rolled back from deleting to closed. If the non-closed replica appeared after the SCM state had already been transitioned to deleted because the closed replicas were cleared out, the non-closed replica would be force deleted. This means saving non-closed replicas would not be guaranteed.

Incorrectly saving a non-closed replica will create an unhealthy container reported in SCM, not just orphan data.

If we are just saving closed replicas, the worst case is that we incorrectly roll back the SCM state to closed due to a stale replica that missed block deletes and is now back online. This is some orphan data but it is not reported as an error from ozone admin container report and similar tools.

If we try to save non-closed replicas, we can end up moving the SCM state back to closed, but only having 1 non-closed replica, which we may not be able to quasi-close and finally close. This would get reported as an unhealthy container in SCM and there would be no way to move it to a good state.
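In code terms, the resulting report-handler check is roughly the following (a sketch based on the snippet discussed earlier, with QUASI_CLOSED dropped; replica, cid, and containerStateManager are illustrative names, and only non-empty CLOSED replicas for containers SCM has in DELETING trigger the rollback):

boolean replicaIsClosed = replica.getState() == State.CLOSED;
boolean replicaNotEmpty = replica.hasIsEmpty() && !replica.getIsEmpty();
if (replicaIsClosed && replicaNotEmpty) {
  // roll the SCM-side container state back from DELETING to CLOSED
  containerStateManager.transitionDeletingToClosedState(cid);
}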

} finally {
if (clusterPath != null) {
System.out.println("Deleting path " + clusterPath);
boolean deleted = FileUtil.fullyDelete(clusterPath.toFile());

Why is this necessary? Usually the mini cluster should clean itself up on close.

/**
* Tests for container report handling.
*/
public class TestContainerReportHandling {

Is there a reason to test HA and non-HA separately? We have lower level tests that check state updates work independent of HA or not. I feel like we can just test in an HA configuration for this higher level change and save on the extra cluster creation and execution.

@errose28 errose28 merged commit 69ba680 into apache:master Jul 26, 2024
@siddhantsangwan
Contributor Author

Thanks for the review @errose28 @nandakumar131. I created https://issues.apache.org/jira/browse/HDDS-11247 to investigate some issues with the integration tests and to handle your related comments.

xichen01 pushed a commit to xichen01/ozone that referenced this pull request Oct 16, 2024
… DELETING state incorrectly (apache#6967)

(cherry picked from commit 69ba680)
ptlrs pushed a commit to ptlrs/ozone that referenced this pull request Mar 8, 2025
vtutrinov pushed a commit to vtutrinov/ozone that referenced this pull request Jul 15, 2025