HDDS-11136. Some containers affected by HDDS-8129 may still be in the DELETING state incorrectly #6967
Conversation
errose28
left a comment
Thanks for working on this @siddhantsangwan. Let's make the doc a PDF and link it to the Jira so it is always visible to everyone. Also let's add some description of the corner cases and how they are handled somewhere in the PR or doc. Here are a few that would be good to cover:
- Old replica that missed block deletes comes back and begins reporting after SCM already cleared out the blocks deleted from the current replicas.
- How a container that was incorrectly in the DELETING state and moved back to CLOSED will eventually get cleared from the system.
* Bypasses the container state machine to change a container's state from DELETING to CLOSED. This API was
* specially introduced to fix a bug (HDDS-11136), and should NOT be used in any other context.
Maybe we can make this description a little more general? We can reference the initial bug, but it's also a decent defensive mechanism against accidental deletion if we end up in this situation another way.
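One possible, more general wording, sketched here rather than taken from the final patch:

/**
 * Bypasses the container state machine to move a container from DELETING back to CLOSED.
 * Originally introduced to recover containers affected by HDDS-11136, this also acts as a
 * defensive mechanism against accidental deletion if a container ends up in the DELETING
 * state incorrectly through some other path. It should not be used outside of that
 * recovery context.
 */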
throw new InvalidContainerStateException("Container " + oldInfo + "must be in DELETING state to " +
    "transition to CLOSED.");
I think the ContainerInfo object is a bit more verbose than we need here. Also I think we can touch up the message to indicate we are moving it "back" to closed. Otherwise it looks like a container isn't supposed to move to closed from any state other than deleting.
- throw new InvalidContainerStateException("Container " + oldInfo + "must be in DELETING state to " +
-     "transition to CLOSED.");
+ throw new InvalidContainerStateException("Cannot transition container " + id + " from " + oldState + " back to CLOSED. The container must be in the DELETING state.");
HddsProtos.ContainerInfoProto container = builder.build();
HddsProtos.ContainerID cid = HddsProtos.ContainerID.newBuilder().setId(container.getContainerID()).build();
containerStateManager.addContainer(container);
assertThrows(IOException.class, () -> containerStateManager.transitionDeletingToClosedState(cid));
We can test for the more specific InvalidContainerStateException here.
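For example, assuming InvalidContainerStateException is visible to the test and is the exception thrown by transitionDeletingToClosedState, the assertion could simply become:

assertThrows(InvalidContainerStateException.class, () -> containerStateManager.transitionDeletingToClosedState(cid));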
}

@Test
void testTransitionDeletingToClosedState() throws IOException, InvalidStateTransitionException {
nit. Can we parameterize this by container type instead of duplicating each line for each container?
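A minimal sketch of that parameterization with JUnit 5; the RATIS/EC values and the helpers deletingContainer(...) and currentState(...) are assumptions standing in for the per-type setup the duplicated test bodies currently do:

@ParameterizedTest
@EnumSource(value = HddsProtos.ReplicationType.class, names = {"RATIS", "EC"})
void testTransitionDeletingToClosedState(HddsProtos.ReplicationType type) throws IOException {
  // Hypothetical helper: registers a container of the given replication type in the
  // DELETING state and returns its protobuf ID.
  HddsProtos.ContainerID cid = deletingContainer(type);
  containerStateManager.transitionDeletingToClosedState(cid);
  // Hypothetical helper: looks up the container's current lifecycle state in SCM.
  assertEquals(HddsProtos.LifeCycleState.CLOSED, currentState(cid));
}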
@errose28 can you elaborate on this please?
@errose28 and @nandakumar131 thanks for reviewing. The latest commit addresses review comments. I've added two corner cases to the linked design doc and it should be visible to everyone. I'll attach it to the jira once it's finalised. @errose28 I'll also think about the corner case you mentioned about containers getting cleared once I understand it better.
The latest commit also adds logic for transitioning a container from deleting to closed if a non-empty quasi-closed replica is reported. This covers the case when only quasi-closed replicas are remaining and others have been lost - the first corner case mentioned in the design document. Note that we don't have a check for comparing the sequence ID of the container as known to SCM with the sequence ID of the quasi-closed replica. I don't think we should have that check because:
Added an integration test.

Regarding test failures for the two integration tests I added in
errose28
left a comment
Thanks for the updates @siddhantsangwan. I'm not sure about this new case of using quasi-closed containers to transition out of deleting state as well.
In order for a container to have moved to the DELETING state in the first place, it must have first been in the CLOSED state. This means the majority of its replicas were also CLOSED. If we do not have any more CLOSED replicas, then one of these two cases happened:
- If all CLOSED replicas were lost, then that is a data loss scenario caused by too many faults in the system.
- If all of the CLOSED replicas were deleted incorrectly, I would say they have already been impacted by HDDS-8129, and we should not try to save them from remaining partial/lagging replicas in the OPEN, CLOSING, or QUASI-CLOSED states.
Even though this patch is only trying to handle QUASI-CLOSED replicas right now, I'm grouping QUASI-CLOSED, OPEN, and CLOSING replicas together because:
- If the SCM state is DELETING then we know these replica states are lagging behind CLOSED replicas that may or may not still exist.
- They do not process block deletes. Therefore the block deleting service will leave their isEmpty flag set to false.
  - This flag may be set to true on restart regardless of container state, but I think that is a separate issue we should fix, such that only CLOSED containers can have isEmpty = true.
There is extra complexity in trying to save these stale replicas. For example, currently the patch does not correctly handle the case where we see CLOSED replicas that are empty and non-CLOSED replicas which are not empty. The empty CLOSED replica confirms the non-empty non-closed one is stale and the container can be deleted. Whether or not the patch does this depends on the order of the container reports received and how those interleave with replication manager runs. IMO, if the container reached the CLOSED state and then lost all its CLOSED replicas, then it is beyond saving. HDDS-8129 was a data loss bug and this is a best-effort recovery of lingering effects. In some cases the damage may have already been done.
if (replica.getState() == State.CLOSED || replica.getState() == State.QUASI_CLOSED && replica.hasIsEmpty() &&
    !replica.getIsEmpty()) {
- if (replica.getState() == State.CLOSED || replica.getState() == State.QUASI_CLOSED && replica.hasIsEmpty() &&
-     !replica.getIsEmpty()) {
+ if ((replica.getState() == State.CLOSED || replica.getState() == State.QUASI_CLOSED) && replica.hasIsEmpty() &&
+     !replica.getIsEmpty()) {
Operator precedence would have caused a bug here. Adding parentheses fixes it, but I would usually split something like this out for readability:
boolean replicaStateAllowed = replica.getState() == State.CLOSED || replica.getState() == State.QUASI_CLOSED;
boolean replicaNotEmpty = replica.hasIsEmpty() && !replica.getIsEmpty();
if (replicaStateAllowed && replicaNotEmpty) {

Looks like we need tests that empty replicas in valid states do not trigger the transition.
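A minimal, self-contained sketch of such a test, exercising the predicate in isolation rather than the full report handler; the enum and method below are local stand-ins, not the real Ozone types:

import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;
import org.junit.jupiter.api.Test;

class RollbackPredicateSketchTest {

  // Mirror of the replica states that matter for this check.
  enum State { OPEN, CLOSING, QUASI_CLOSED, CLOSED }

  // Stand-in for the handler's condition, written with the split-out booleans suggested above.
  private boolean triggersRollback(State state, boolean hasIsEmpty, boolean isEmpty) {
    boolean replicaStateAllowed = state == State.CLOSED || state == State.QUASI_CLOSED;
    boolean replicaNotEmpty = hasIsEmpty && !isEmpty;
    return replicaStateAllowed && replicaNotEmpty;
  }

  @Test
  void emptyReplicasInValidStatesDoNotTriggerRollback() {
    assertFalse(triggersRollback(State.CLOSED, true, true));        // empty CLOSED replica
    assertFalse(triggersRollback(State.QUASI_CLOSED, true, true));  // empty QUASI_CLOSED replica
    assertFalse(triggersRollback(State.CLOSED, false, false));      // replica that does not report the flag
    assertTrue(triggersRollback(State.CLOSED, true, false));        // non-empty CLOSED replica still triggers
  }
}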
Oh right, thanks for catching this!
My intention for including quasi-closed replicas when deciding on the container's state is basically to try to prevent complete data loss, even if there has been some data loss already. This is the direction that we had decided on when refactoring the replication manager last year.
Agreed.
There's data loss, right, but shouldn't we try and salvage any data that is remaining in the quasi-closed replicas?
Not sure I understand why we shouldn't save lagging quasi-closed replicas here - are you expecting the data race in HDDS-8129 to have completely corrupted these replicas?
I'm not very clear on whether block deletion can directly cause a container state change. As far as I know, a container should transition to DELETING (correctly) only when it has a positive number of replicas and all its replicas known to SCM are closed and empty (a small sketch of this condition follows below). That means if any non-closed replicas exist at all, the container is incorrectly in the deleting state, UNLESS all the closed replicas were empty. Now consider our situation where a container moved to the deleting state incorrectly in a previous version because of keyCount being 0 when the container wasn't actually empty. Suppose it has a quasi-closed replica that was offline, and all the closed replicas have keyCount = 0 (but are not actually empty). On upgrading to the latest version with our fix, let's say the closed replicas are somehow lost and the quasi-closed replica comes back. At this point, the container is incorrectly in the deleting state because its closed replicas, which are now lost, weren't actually empty. So we should transition it back to CLOSED when the quasi-closed replica is reported. If we don't, we risk losing that replica as well.

If the container has a non-closed non-empty replica, it shouldn't have moved to the DELETING state in the first place according to the code I quoted above. If the non-closed replica was somehow offline and that's why the check in the code passed, and we now transition the container back to CLOSED from DELETING on the basis of that non-closed replica, then we have an orphan container. I think that's an acceptable tradeoff.
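A minimal sketch of that condition, using the same getters as the replica snippet quoted earlier in this review; the helper name and the replica type are assumptions, not the actual SCM code:

// A container is eligible to move to DELETING only if it has at least one replica and
// every replica known to SCM is CLOSED and empty.
static boolean eligibleForDeleting(List<ContainerReplicaProto> replicas) {
  return !replicas.isEmpty()
      && replicas.stream().allMatch(r -> r.getState() == State.CLOSED && r.getIsEmpty());
}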
@errose28 thanks for the comprehensive review. I'll correct the bug you pointed out while we continue our discussion.
errose28
left a comment
Thanks for the updates @siddhantsangwan. I think there are some minor improvements we can do in the tests, but we can do that in a follow-up so as not to block this important change.
After some discussion we decided to only roll back the state if a non-empty closed replica is reported, which was the original plan of this PR. Containers affected by HDDS-8129 must have had 3 closed replicas at some point, and if none of those replicas are present due to the bug, at that point we consider the data affected. A sketch of the resulting check is included after the list of reasons below.
These are the main reasons we will only consider closed replicas in this change:
- If a replica was offline and not closed, the state is not an indication of how much data is there.
- If we decide to save quasi-closed replicas, we must also decide to save closing and open replicas. A quasi-closed replica could have less data than an open replica depending on when the replica went offline. This broadens the scope of the change, which introduces other problems listed below.
- Whether or not a non-closed replica would get saved depends on the order of heartbeats received.
  - If the non-closed replica re-appeared before an empty closed replica, it would be saved because the state would be rolled back from deleting to closed. If the non-closed replica appeared after the SCM state had already been transitioned to deleted because the closed replicas were cleared out, the non-closed replica would be force deleted. This means saving non-closed replicas would not be guaranteed.
- Incorrectly saving a non-closed replica will create an unhealthy container reported in SCM, not just orphan data.
  - If we are just saving closed replicas, the worst case is that we incorrectly roll back the SCM state to closed due to a stale replica that missed block deletes and is now back online. This is some orphan data, but it is not reported as an error from ozone admin container report and similar tools.
  - If we try to save non-closed replicas, we can end up moving the SCM state back to closed but only having 1 non-closed replica, which we may not be able to quasi-close and finally close. This would get reported as an unhealthy container in SCM and there would be no way to move it to a good state.
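A minimal sketch of the closed-only check that was agreed on, using the same getters as the earlier snippet; containerInfo, containerStateManager, and cid are stand-ins for whatever the report handler already has in scope:

// Roll the container back from DELETING to CLOSED only when a replica that is both
// CLOSED and explicitly non-empty is reported; quasi-closed and other lagging replicas
// no longer trigger the rollback.
boolean closedAndNotEmpty = replica.getState() == State.CLOSED
    && replica.hasIsEmpty() && !replica.getIsEmpty();
if (containerInfo.getState() == HddsProtos.LifeCycleState.DELETING && closedAndNotEmpty) {
  containerStateManager.transitionDeletingToClosedState(cid);
}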
} finally {
  if (clusterPath != null) {
    System.out.println("Deleting path " + clusterPath);
    boolean deleted = FileUtil.fullyDelete(clusterPath.toFile());
Why is this necessary? Usually the mini cluster should clean itself up on close.
/**
 * Tests for container report handling.
 */
public class TestContainerReportHandling {
Is there a reason to test HA and non-HA separately? We have lower-level tests that check state updates work independently of HA. I feel like we can just test in an HA configuration for this higher level change and save on the extra cluster creation and execution.
Thanks for the review @errose28 and @nandakumar131. I created https://issues.apache.org/jira/browse/HDDS-11247 to investigate some issues with the integration tests and to handle your related comments.
What changes were proposed in this pull request?
Please check this document for the proposed solutions: https://docs.google.com/document/d/1RjXsy88RgzweAi51v8sVBinbGfsmojvJqfUQCeqtrJc/edit?usp=sharing.
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-11136
How was this patch tested?
Added unit tests.