
Conversation

@siddhantsangwan
Contributor

What changes were proposed in this pull request?

Please check this document for the proposed solutions: https://docs.google.com/document/d/1RjXsy88RgzweAi51v8sVBinbGfsmojvJqfUQCeqtrJc/edit?usp=sharing.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11136

How was this patch tested?

Added unit tests.

@errose28 errose28 left a comment

Thanks for working on this @siddhantsangwan. Let's make the doc a PDF and link it to the Jira so it is always visible for everyone. Also, let's add some description of corner cases and how they are handled somewhere in the PR or doc. Here are a few that would be good to cover:

  • Old replica that missed block deletes comes back and begins reporting after SCM already cleared out the blocks deleted from the current replicas.
  • How a container that was incorrectly in the DELETING state and moved back to CLOSED will eventually get cleared from the system.

Comment on lines 135 to 136
* Bypasses the container state machine to change a container's state from DELETING to CLOSED. This API was
* specially introduced to fix a bug (HDDS-11136), and should NOT be used in any other context.

Maybe we can make this description a little more general? We can reference the initial bug, but it's also a decent defensive mechanism against accidental deletion if we end up in this situation another way.
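For example, the wording could be generalized along these lines (just a sketch; exact phrasing to be decided by the author):

 * Bypasses the container state machine to move a container from DELETING back to CLOSED. This is a defensive
 * recovery path for containers that were incorrectly marked for deletion (HDDS-11136 is one example) and
 * should not be used outside of such recovery scenarios.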

Comment on lines 380 to 381
throw new InvalidContainerStateException("Container " + oldInfo + "must be in DELETING state to " +
"transition to CLOSED.");

I think the ContainerInfo object is a bit more verbose than we need here. Also I think we can touch up the message to indicate we are moving it "back" to closed. Otherwise it looks like a container isn't supposed to move to closed from any state other than deleting.

Suggested change
throw new InvalidContainerStateException("Container " + oldInfo + "must be in DELETING state to " +
"transition to CLOSED.");
throw new InvalidContainerStateException("Cannot transition container " + id + " from " + oldState + " back to CLOSED. The container must be in the DELETING state.");

HddsProtos.ContainerInfoProto container = builder.build();
HddsProtos.ContainerID cid = HddsProtos.ContainerID.newBuilder().setId(container.getContainerID()).build();
containerStateManager.addContainer(container);
assertThrows(IOException.class, () -> containerStateManager.transitionDeletingToClosedState(cid));

We can test for the more specific InvalidContainerStateException here.
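For example, a sketch of the same assertion with the narrower expected type (assuming InvalidContainerStateException is the concrete exception thrown by the transition, as in the SCM-side code quoted above):

assertThrows(InvalidContainerStateException.class, () -> containerStateManager.transitionDeletingToClosedState(cid));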

}

@Test
void testTransitionDeletingToClosedState() throws IOException, InvalidStateTransitionException {

nit. Can we parameterize this by container type instead of duplicating each line for each container?
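Something along these lines, for example (a sketch using JUnit 5, assuming "container type" here means the replication type; createDeletingContainer is a hypothetical helper and the final assertion depends on the state manager's lookup API):

@ParameterizedTest
@EnumSource(value = HddsProtos.ReplicationType.class, names = {"RATIS", "EC"})
void testTransitionDeletingToClosedState(HddsProtos.ReplicationType type) throws IOException, InvalidStateTransitionException {
  // hypothetical helper that builds a DELETING container proto with the given replication type
  HddsProtos.ContainerInfoProto container = createDeletingContainer(type);
  HddsProtos.ContainerID cid = HddsProtos.ContainerID.newBuilder().setId(container.getContainerID()).build();
  containerStateManager.addContainer(container);
  containerStateManager.transitionDeletingToClosedState(cid);
  // assert the container is now CLOSED via the state manager's container lookup
}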

@siddhantsangwan
Contributor Author

How a container that was incorrectly in the DELETING state and moved back to CLOSED will eventually get cleared from the system.

@errose28 can you elaborate on this please?

@siddhantsangwan
Contributor Author

@errose28 and @nandakumar131 thanks for reviewing.

The latest commit addresses review comments. I've added two corner cases to the linked design doc and it should be visible to everyone. I'll attach it to the jira once it's finalised. @errose28 I'll also think about the corner case you mentioned about containers getting cleared once I understand it better.

@siddhantsangwan
Contributor Author

The latest commit also adds logic for transitioning a container from DELETING to CLOSED if a non-empty quasi-closed replica is reported. This covers the case when only quasi-closed replicas remain and the others have been lost - the first corner case mentioned in the design document.

Note that we don't have a check comparing the sequence ID of the container as known to SCM with the sequence ID of the quasi-closed replica. I don't think we should have that check because:

  1. If quasi-closed replicas are the only ones left, we need to replicate them even if the sequence ID isn't the latest.
  2. Non-open containers only decrease in size, so if the quasi-closed replica is non-empty, we should close the container.

@siddhantsangwan
Contributor Author

Added an integration test.

@siddhantsangwan
Contributor Author

Regarding the test failures for the two integration tests I added in TestContainerReportHandling: each test passes if I comment the other one out, so the tests themselves are fine. I think the MiniOzoneCluster is keeping some sort of state which is causing one test to fail when both are run. Looking into it.
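One common way to rule out state leaking between tests is to tear the cluster down after each test (a sketch, assuming a JUnit 5 per-test cluster field; whether this is the actual cause here is still being investigated):

@AfterEach
void shutdown() {
  if (cluster != null) {
    // MiniOzoneCluster exposes shutdown() to stop and clean up the cluster
    cluster.shutdown();
  }
}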

@errose28 errose28 left a comment

Thanks for the updates @siddhantsangwan. I'm not sure about this new case of using quasi-closed containers to transition out of deleting state as well.

In order for a container to have moved to the DELETING state in the first place, it must have first been in the CLOSED state. This means the majority of its replicas were also CLOSED. If we do not have any more CLOSED replicas then either of these two cases happened:

  • If all CLOSED replicas were lost, then that is a data loss scenario caused by too many faults in the system.
  • If all of the CLOSED replicas were deleted incorrectly, I would say they have already been impacted by HDDS-8129, and we should not try to save them from remaining partial/lagging replicas in states OPEN, CLOSING, QUASI-CLOSED.

Even though this patch is only trying to handle QUASI-CLOSED replicas right now, I'm grouping QUASI-CLOSED, OPEN, and CLOSING replicas together because:

  • If the SCM state is DELETING then we know these replicas' states are lagging behind CLOSED replicas that may or may not still exist.
  • They do not process block deletes. Therefore the block deleting service will leave their isEmpty flag as false.
    • This flag may be set to true on restart regardless of container state, but I think that is a separate issue we should fix, such that only CLOSED containers can have isEmpty = true.

There is extra complexity in trying to save these stale replicas. For example, currently the patch does not correctly handle the case where we see CLOSED replicas that are empty and non-CLOSED replicas which are not empty. The empty CLOSED replica confirms the non-empty non-closed one is stale and the container can be deleted. Whether or not the patch does this depends on the order of the container reports received and how those interleave with replication manager runs. IMO if the container reached the CLOSED state and then lost all its CLOSED replicas then it is beyond saving. HDDS-8129 was a data loss bug and this is a best-effort recovery of lingering effects. In some cases the damage may have already been done.

Comment on lines 355 to 356
if (replica.getState() == State.CLOSED || replica.getState() == State.QUASI_CLOSED && replica.hasIsEmpty() &&
!replica.getIsEmpty()) {

Suggested change
if (replica.getState() == State.CLOSED || replica.getState() == State.QUASI_CLOSED && replica.hasIsEmpty() &&
!replica.getIsEmpty()) {
if ((replica.getState() == State.CLOSED || replica.getState() == State.QUASI_CLOSED) && replica.hasIsEmpty() &&
    !replica.getIsEmpty()) {

Operator precedence would have caused a bug here. Adding parentheses fixes it, but I would usually split something like this out for readability:

boolean replicaStateAllowed = replica.getState() == State.CLOSED || replica.getState() == State.QUASI_CLOSED;
boolean replicaNotEmpty = replica.hasIsEmpty() && !replica.getIsEmpty();
if (replicaStateAllowed && replicaNotEmpty) {

Looks like we need tests verifying that empty replicas in valid states do not trigger the transition.

@siddhantsangwan
Contributor Author

Oh right, thanks for catching this!

@siddhantsangwan
Contributor Author

siddhantsangwan commented Jul 25, 2024

My intention in including quasi-closed replicas when deciding the container's state is basically to try to prevent complete data loss, even if there has been some data loss already. This is the direction we had decided on when refactoring the replication manager last year.

In order for a container to have moved to the DELETING state in the first place, it must have first been in the CLOSED state.

Agreed.

If all CLOSED replicas were all lost, then that is a data loss scenario caused by too many faults in the system.

There's data loss, right, but shouldn't we try and salvage any data that is remaining in the quasi-closed replicas?

and we should not try to save them from remaining partial/lagging replicas in states OPEN, CLOSING, QUASI-CLOSED.

Not sure I understand why we shouldn't save lagging quasi-closed replicas here - are you expecting the data race in HDDS-8129 to have completely corrupted these replicas?

Even though this patch is only trying to handle QUASI-CLOSED replicas right now, I'm grouping QUASI-CLOSED, OPEN, and CLOSING replicas together because:

If the SCM state is DELETING then we know these replicas' states are lagging behind CLOSED replicas that may or may not still exist.
They do not process block deletes. Therefore the block deleting service will leave their isEmpty flag as false.
This flag may be set to true on restart regardless of container state, but I think that is a separate issue we should fix, such that only CLOSED containers can have isEmpty = true.
There is extra complexity in trying to save these stale replicas. For example, currently the patch does not correctly handle the case where we see CLOSED replicas that are empty and non-CLOSED replicas which are not empty. The empty CLOSED replica confirms the non-empty non-closed one is stale and the container can be deleted. Whether or not the patch does this depends on the order of the container reports received and how those interleave with replication manager runs.

I'm not very clear on whether block deletion can directly cause a container state change. As far as I know, a container should transition to DELETING (correctly) only when it has a positive number of replicas and all its replicas known to SCM are closed and empty:

  private boolean isContainerEmptyAndClosed(final ContainerInfo container,
      final Set<ContainerReplica> replicas) {
    return container.getState() == HddsProtos.LifeCycleState.CLOSED &&
        !replicas.isEmpty() &&
        replicas.stream().allMatch(
            r -> r.getState() == ContainerReplicaProto.State.CLOSED &&
                r.isEmpty());
  }

That means if any non-closed replicas exist at all, the container is incorrectly in the deleting state, UNLESS all the closed replicas were empty.

Now consider our situation where a container moved to the deleting state incorrectly in a previous version because of keyCount being 0 when the container wasn't actually empty. Suppose it has a quasi-closed replica that was offline, and all the closed replicas have keyCount = 0 (but not actually empty).

On upgrading to the latest version with our fix, let's say the closed replicas are somehow lost and the quasi-closed replica comes back. At this point, the container is incorrectly in the deleting state because its closed replicas, that are now lost, weren't actually empty. So we should transition it back to CLOSED when the quasi-closed replica is reported. If we don't, we risk losing that replica as well.

For example, currently the patch does not correctly handle the case where we see CLOSED replicas that are empty and non-CLOSED replicas which are not empty. The empty CLOSED replica confirms the non-empty non-closed one is stale and the container can be deleted.

If the container has a non-closed non-empty replica, it shouldn't have moved to the DELETING state in the first place according to the code I quoted above. If the non-closed replica was somehow offline and that's why the check in the code passed, and we now transition the container back to CLOSED from DELETING on the basis of that non-closed replica, then we have an orphan container. I think that's an acceptable tradeoff.

@siddhantsangwan
Contributor Author

@errose28 thanks for the comprehensive review. I'll correct the bug you pointed out while we continue our discussion.

@errose28 errose28 left a comment

Thanks for the updates @siddhantsangwan. I think there are some minor improvements we can make in the tests, but we can do that in a follow-up to not block this important change.

After some discussion we decided to only roll back the state if a non-empty closed replica is reported, which was the original plan of this PR. Containers affected by HDDS-8129 must have had 3 closed replicas at some point, and if none of those replicas are present due to the bug, at that point we consider the data affected.

These are the main reasons we will only consider closed replicas in this change:

If a replica was offline and not closed, the state is not an indication of how much data is there.

If we decide to save quasi-closed replicas, we must also decide to save closing and open replicas. A quasi-closed replica could have less data than an open replica depending on when the replica went offline. This broadens the scope of the change which introduces other problems listed below.

Whether or not a non-closed replica would get saved depends on the order of heartbeats received.

If the non-closed replica re-appeared before an empty closed replica, it would be saved because the state would be rolled back from deleting to closed. If the non-closed replica appeared after the SCM state had already been transitioned to deleted because the closed replicas were cleared out, the non-closed replica would be force deleted. This means saving non-closed replicas would not be guaranteed.

Incorrectly saving a non-closed replica will create an unhealthy container reported in SCM, not just orphan data.

If we are just saving closed replicas, the worst case is that we incorrectly roll back the SCM state to closed due to a stale replica that missed block deletes and is now back online. This is some orphan data but it is not reported as an error from ozone admin container report and similar tools.

If we try to save non-closed replicas, we can end up moving the SCM state back to closed, but only having 1 non-closed replica, which we may not be able to quasi-close and finally close. This would get reported as an unhealthy container in SCM and there would be no way to move it to a good state.
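In code terms, the resulting report-handler check is roughly the following (a sketch based on the snippet discussed earlier, with QUASI_CLOSED dropped; replica, cid, and containerStateManager are illustrative names, and only non-empty CLOSED replicas for containers SCM has in DELETING trigger the rollback):

boolean replicaIsClosed = replica.getState() == State.CLOSED;
boolean replicaNotEmpty = replica.hasIsEmpty() && !replica.getIsEmpty();
if (replicaIsClosed && replicaNotEmpty) {
  // roll the SCM-side container state back from DELETING to CLOSED
  containerStateManager.transitionDeletingToClosedState(cid);
}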

} finally {
if (clusterPath != null) {
System.out.println("Deleting path " + clusterPath);
boolean deleted = FileUtil.fullyDelete(clusterPath.toFile());

Why is this necessary? Usually the mini cluster should clean itself up on close.

/**
* Tests for container report handling.
*/
public class TestContainerReportHandling {

Is there a reason to test HA and non-HA separately? We have lower level tests that check state updates work independent of HA or not. I feel like we can just test in an HA configuration for this higher level change and save on the extra cluster creation and execution.

@errose28 errose28 merged commit 69ba680 into apache:master Jul 26, 2024
@siddhantsangwan
Contributor Author

Thanks for the review @errose28 @nandakumar131. I created https://issues.apache.org/jira/browse/HDDS-11247 to investigate some issues with the integration tests and to handle your related comments.

xichen01 pushed a commit to xichen01/ozone that referenced this pull request Oct 16, 2024
… DELETING state incorrectly (apache#6967)

(cherry picked from commit 69ba680)
ptlrs pushed a commit to ptlrs/ozone that referenced this pull request Mar 8, 2025
vtutrinov pushed a commit to vtutrinov/ozone that referenced this pull request Jul 15, 2025