HDDS-4511: Avoiding StaleNodeHandler to take effect in TestDeleteWithSlowFollower. #1625

GlenGeng-awx · 2020-11-25T12:43:47Z

What changes were proposed in this pull request?

This change is inspired by avoiding StaleNodeHandler to take effect in TestDeleteWithSlowFollower.

Consider the follow logic in RM,

      /*
       * We don't take any action if the container is in OPEN state and
       * the container is healthy. If the container is not healthy, i.e.
       * the replicas are not in OPEN state, send CLOSE_CONTAINER command.
       */
      if (state == LifeCycleState.OPEN) {
        if (!isContainerHealthy(container, replicas)) {
          eventPublisher.fireEvent(SCMEvents.CLOSE_CONTAINER, id);
        }
        return;
      }

  private boolean isContainerHealthy(final ContainerInfo container,
                                     final Set<ContainerReplica> replicas) {
    return !isContainerUnderReplicated(container, replicas) &&
        !isContainerOverReplicated(container, replicas) &&
        replicas.stream().allMatch(
            r -> compareState(container.getState(), r.getState()));
  }

  private boolean isContainerUnderReplicated(final ContainerInfo container,
      final Set<ContainerReplica> replicas) {
    if (container.getState() != LifeCycleState.CLOSED &&
        container.getState() != LifeCycleState.QUASI_CLOSED) {
      return false;
    }
    boolean misReplicated = !getPlacementStatus(
        replicas, container.getReplicationFactor().getNumber())
        .isPolicySatisfied();
    return container.getReplicationFactor().getNumber() >
        getReplicaCount(container.containerID(), replicas) || misReplicated;
  }

Consider this scenario: client triggers SCM to allocate one container by allocating blocks, then it crashes, never writes chunks to DN to trigger the creation of the container, thus no replica report for this container.

Previously, ReplicationManager will close such containers, since it is under replicated. This is a reasonable and legacy handling which is used to prevents StaleNodeHandler to take effect in TestDeleteWithSlowFollower.

This following logic is added by Sammi in her PR HDDS-4023. Delete closed container after all blocks have been deleted

    if (container.getState() != LifeCycleState.CLOSED &&
        container.getState() != LifeCycleState.QUASI_CLOSED) {
      return false;
    }

After talked with @ChenSammi , by design, it just needs to explicitly avoid replicating container in DELETING or DELETED state.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4511

How was this patch tested?

CI

linyiqun · 2020-11-26T03:28:55Z

@GlenGeng , not fully get the reason for doing this. If we enable replication for open state containers, how we keep data consistent when container be in replication and meanwhile there is block written in this container.

bshashikant · 2020-11-26T04:59:26Z

@GlenGeng , is the description of the pull request correct?

GlenGeng-awx · 2020-11-26T05:22:03Z

@linyiqun @bshashikant Thanks for the review!

Shashi, I found this issue when trying to debug the test case TestDeleteWithSlowFollower for SCM HA, which is written by you. You can find more details in the jira's description.

The key point is, although this case can pass in master, after HDDS-4023. Delete closed container after all blocks have been deleted., there will be log from stale node hander, which is by design not correct.

This case relies on RM to make the container to CLOSING state, so that SCM can move container state to CLOSED, and not hold the delete blocks command.

There are two solutions to avoid the stale node issue, one is shorten the frequency of RM in the case TestDeleteWithSlowFollower, the other is the above changes.

Actually, this fix would work in the version of Oct 20, which is the latest merge from master to HDDS-2823, But seems some change after Oct 20 affect the case TestDeleteWithSlowFollower again.

GlenGeng-awx · 2020-11-26T05:30:36Z

The description may not be accurate enough, I would update them later. Thanks for pointing out !

GlenGeng-awx · 2020-11-26T06:07:49Z

@linyiqun Hey yiqun, never mind, RM will only send out replicate container command for closed container.

linyiqun · 2020-11-26T16:06:42Z

@linyiqun Hey yiqun, never mind, RM will only send out replicate container command for closed container.

Okay, get it.

GlenGeng-awx · 2020-11-27T05:54:04Z

@ChenSammi Please take a look at this PR. Thanks !

ChenSammi · 2020-11-27T07:47:36Z

The patch LGTM, +1.

Thanks @GlenGeng for investigating and fix the issue, and @linyiqun @bshashikant for the review.

…der OPEN container

…HDDS-4367.

* HDDS-3698-upgrade: HDDS-4429. Create unit test for SimpleContainerDownloader. (apache#1551) HDDS-4461. Reuse compiled binaries in acceptance test (apache#1588) HDDS-4511: Avoiding StaleNodeHandler to take effect in TestDeleteWithSlowFollower. (apache#1625) HDDS-4510. SCM can avoid creating RetriableDatanodeEventWatcher for deletion command ACK (apache#1626) HDDS-3363. Intermittent failure in testContainerImportExport (apache#1618) HDDS-4370. Datanode deletion service can avoid storing deleted blocks. (apache#1620) HDDS-4512. Remove unused netty3 transitive dependency (apache#1627) HDDS-4481. With HA OM can send deletion blocks to SCM multiple times. (apache#1608) HDDS-4487. SCM can avoid using RETRIABLE_DATANODE_COMMAND for datanode deletion commands. (apache#1621) HDDS-4471. GrpcOutputStream length can overflow (apache#1617) HDDS-4308. Fix issue with quota update (apache#1489) HDDS-4392. [DOC] Add Recon architecture to docs (apache#1602) HDDS-4501. Reload OM State fail should terminate OM for any exceptions. (apache#1622) HDDS-4492. CLI flag --quota should default to 'spaceQuota' to preserve backward compatibility. (apache#1609) HDDS-3689. Add various profiles to MiniOzoneChaosCluster to run different modes. (apache#1420) HDDS-4497. Recon File Size Count task throws SQL Exception. (apache#1612)

* master: (40 commits) HDDS-4473. Reduce number of sortDatanodes RPC calls (apache#1610) HDDS-4485. [DOC] add the authentication rules of the Ozone Ranger. (apache#1603) HDDS-4528. Upgrade slf4j to 1.7.30 (apache#1639) HDDS-4424. Update README with information how to report security issues (apache#1548) HDDS-4484. Use RaftServerImpl isLeader instead of periodic leader update logic in OM and isLeaderReady for read/write requests (apache#1638) HDDS-4429. Create unit test for SimpleContainerDownloader. (apache#1551) HDDS-4461. Reuse compiled binaries in acceptance test (apache#1588) HDDS-4511: Avoiding StaleNodeHandler to take effect in TestDeleteWithSlowFollower. (apache#1625) HDDS-4510. SCM can avoid creating RetriableDatanodeEventWatcher for deletion command ACK (apache#1626) HDDS-3363. Intermittent failure in testContainerImportExport (apache#1618) HDDS-4370. Datanode deletion service can avoid storing deleted blocks. (apache#1620) HDDS-4512. Remove unused netty3 transitive dependency (apache#1627) HDDS-4481. With HA OM can send deletion blocks to SCM multiple times. (apache#1608) HDDS-4487. SCM can avoid using RETRIABLE_DATANODE_COMMAND for datanode deletion commands. (apache#1621) HDDS-4471. GrpcOutputStream length can overflow (apache#1617) HDDS-4308. Fix issue with quota update (apache#1489) HDDS-4392. [DOC] Add Recon architecture to docs (apache#1602) HDDS-4501. Reload OM State fail should terminate OM for any exceptions. (apache#1622) HDDS-4492. CLI flag --quota should default to 'spaceQuota' to preserve backward compatibility. (apache#1609) HDDS-3689. Add various profiles to MiniOzoneChaosCluster to run different modes. (apache#1420) ...

GlenGeng-awx closed this Nov 25, 2020

GlenGeng-awx reopened this Nov 25, 2020

GlenGeng-awx changed the title ~~HDDS-4511: ReplicationManager#isContainerUnderReplicated should consider OPEN container~~ [DRAFT]HDDS-4511: ReplicationManager#isContainerUnderReplicated should consider OPEN container Nov 26, 2020

GlenGeng-awx changed the title ~~[DRAFT]HDDS-4511: ReplicationManager#isContainerUnderReplicated should consider OPEN container~~ HDDS-4511: Avoid StaleNodeHandler to take effect in TestDeleteWithSlowFollower. Nov 26, 2020

GlenGeng-awx changed the title ~~HDDS-4511: Avoid StaleNodeHandler to take effect in TestDeleteWithSlowFollower.~~ HDDS-4511: ReplicationManager#isContainerUnderReplicated should consider OPEN container Nov 26, 2020

GlenGeng-awx changed the title ~~HDDS-4511: ReplicationManager#isContainerUnderReplicated should consider OPEN container~~ HDDS-4511: Avoiding StaleNodeHandler to take effect in TestDeleteWithSlowFollower. Nov 26, 2020

GlenGeng-awx closed this Nov 26, 2020

GlenGeng-awx reopened this Nov 26, 2020

GlenGeng-awx closed this Nov 27, 2020

GlenGeng-awx reopened this Nov 27, 2020

GlenGeng-awx closed this Nov 27, 2020

GlenGeng-awx reopened this Nov 27, 2020

Glen Geng added 2 commits November 27, 2020 15:52

HDDS-4511: ReplicationManager#isContainerUnderReplicated should consi…

8ea061f

…der OPEN container

HDDS-4511: fix case TestDeleteWithSlowFollower, which is affected by …

ff1a047

…HDDS-4367.

GlenGeng-awx force-pushed the HDDS-4511 branch from d0dd364 to ff1a047 Compare November 27, 2020 07:53

ChenSammi merged commit 130ba4d into apache:master Nov 27, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

HDDS-4511: Avoiding StaleNodeHandler to take effect in TestDeleteWithSlowFollower. #1625

HDDS-4511: Avoiding StaleNodeHandler to take effect in TestDeleteWithSlowFollower. #1625

Uh oh!

GlenGeng-awx commented Nov 25, 2020 •

edited

Loading

Uh oh!

linyiqun commented Nov 26, 2020

Uh oh!

bshashikant commented Nov 26, 2020

Uh oh!

GlenGeng-awx commented Nov 26, 2020 •

edited

Loading

Uh oh!

GlenGeng-awx commented Nov 26, 2020

Uh oh!

GlenGeng-awx commented Nov 26, 2020

Uh oh!

linyiqun commented Nov 26, 2020 •

edited

Loading

Uh oh!

GlenGeng-awx commented Nov 27, 2020

Uh oh!

ChenSammi commented Nov 27, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

HDDS-4511: Avoiding StaleNodeHandler to take effect in TestDeleteWithSlowFollower. #1625

HDDS-4511: Avoiding StaleNodeHandler to take effect in TestDeleteWithSlowFollower. #1625

Uh oh!

Conversation

GlenGeng-awx commented Nov 25, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

What is the link to the Apache JIRA

How was this patch tested?

Uh oh!

linyiqun commented Nov 26, 2020

Uh oh!

bshashikant commented Nov 26, 2020

Uh oh!

GlenGeng-awx commented Nov 26, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

GlenGeng-awx commented Nov 26, 2020

Uh oh!

GlenGeng-awx commented Nov 26, 2020

Uh oh!

linyiqun commented Nov 26, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

GlenGeng-awx commented Nov 27, 2020

Uh oh!

ChenSammi commented Nov 27, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

GlenGeng-awx commented Nov 25, 2020 •

edited

Loading

GlenGeng-awx commented Nov 26, 2020 •

edited

Loading

linyiqun commented Nov 26, 2020 •

edited

Loading