
Conversation

@sodonnel
Contributor

What changes were proposed in this pull request?

During testing we came across an issue with Incremental Container Report (ICR) and Full Container Report (FCR) handling.

The following log shows the issue:

2021-05-18 13:14:15,394 DEBUG org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Processing replica of container #1 from datanode 945aa180-5cff-4298-a8ad-8197542e4562{ip: 172.27.108.136, host: quasar-nqdywv-7.quasar-nqdywv.root.hwx.site, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}


2021-05-18 13:14:15,394 DEBUG org.apache.hadoop.hdds.scm.container.IncrementalContainerReportHandler: Processing replica of container #1001 from datanode 945aa180-5cff-4298-a8ad-8197542e4562{ip: 172.27.108.136, host: quasar-nqdywv-7.quasar-nqdywv.root.hwx.site, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}


2021-05-18 13:14:15,394 DEBUG org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Processing replica of container #2 from datanode 945aa180-5cff-4298-a8ad-8197542e4562{ip: 172.27.108.136, host: quasar-nqdywv-7.quasar-nqdywv.root.hwx.site, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
2021-05-18 13:14:15,394 DEBUG org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Processing replica of container #3 from datanode 945aa180-5cff-4298-a8ad-8197542e4562{ip: 172.27.108.136, host: quasar-nqdywv-7.quasar-nqdywv.root.hwx.site, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
2021-05-18 13:14:15,394 DEBUG org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Processing replica of container #4 from datanode 945aa180-5cff-4298-a8ad-8197542e4562{ip: 172.27.108.136, host: quasar-nqdywv-7.quasar-nqdywv.root.hwx.site, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
...

In the above log, SCM is processing both an ICR and an FCR for the same datanode at the same time. The FCR does not contain container #1001.

The FCR starts first, and it takes a snapshot of the containers on the node via NodeManager.

Then it starts processing the containers one by one.

The ICR then starts, and it adds #1001 to the ContainerManager and to the NodeManager.

When the FCR completes, it replaces the list of containers in NodeManager with those in the FCR.

At this point, container #1001 is in the ContainerManager, but it is not listed against the node in NodeManager.

This would get fixed by the next FCR, but in this scenario the node then goes dead. The dead node handler runs and uses the list of containers in NodeManager to remove all replicas for the node. As #1001 is not listed against the node, its replica is not removed. This means the container will never be seen as under-replicated, as 3 copies will appear to exist forever in the ContainerManager.
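To make the window concrete, here is a minimal, self-contained sketch of the pattern (a toy model with invented names, not the real SCM code): the ICR handler adds one container to the node's set, while the FCR handler replaces the whole set with whatever the full report listed, so an ICR landing between the FCR snapshot and the final replace is silently lost.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Toy stand-in for NodeManager's node-to-container mapping; names are invented. */
public class ReportRaceSketch {

  private final Map<String, Set<Long>> node2Containers = new ConcurrentHashMap<>();

  /** ICR handler: record a single newly reported container against the node. */
  void onIncrementalReport(String datanode, long containerId) {
    node2Containers.computeIfAbsent(datanode, k -> new HashSet<>()).add(containerId);
  }

  /** FCR handler: replace the node's container set with what the full report lists.
   *  If an ICR added a container after the full report was built, that entry is lost. */
  void onFullReport(String datanode, Set<Long> containersInFullReport) {
    node2Containers.put(datanode, new HashSet<>(containersInFullReport));
  }
}
```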

This is quite tricky to fully fix. There are two underlying issues:

  1. Parallel processing of an ICR and an FCR can lead to data inconsistency between the ContainerManager and NodeManager. This is what caused the bug above.

  2. An FCR wiping out the reference to a container that was recently reported in an ICR but is not included in the FCR.

The second issue is less serious, since the next FCR will fix the problem; FCRs are produced approximately every 60 seconds by default.

We can fix problem 1 quite easily by synchronising on the datanode when processing FCRs and ICRs, which ensures the data inconsistency cannot happen.
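As a hedged illustration (continuing the toy model above, not the actual handler code from this PR), the two handlers can be made to serialise on a per-datanode lock:

```java
  // Added to the ReportRaceSketch class above: one lock object per datanode so that
  // full and incremental report processing for the same node cannot interleave.
  private final Map<String, Object> nodeLocks = new ConcurrentHashMap<>();

  void onIncrementalReportLocked(String datanode, long containerId) {
    synchronized (nodeLocks.computeIfAbsent(datanode, k -> new Object())) {
      onIncrementalReport(datanode, containerId);
    }
  }

  void onFullReportLocked(String datanode, Set<Long> containersInFullReport) {
    synchronized (nodeLocks.computeIfAbsent(datanode, k -> new Object())) {
      onFullReport(datanode, containersInFullReport);
    }
  }
```

The important property is that the FCR's read-then-replace and the ICR's add run under the same lock for a given datanode, so the interleaving shown in the log above cannot occur.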

This PR is for issue 1; we should probably create a follow-up Jira for issue 2.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-5249

How was this patch tested?

Added a new test to reproduce the race condition and verified it passes after the code change.

@GlenGeng-awx
Contributor

cc @GlenGeng

@GlenGeng-awx
Contributor

+1. Thanks for fixing this subtle issue.

BTW, about issue 2:

An FCR wiping out the reference to a container that was recently reported in an ICR but is not included in the FCR.

Right now the order of FCR and ICR is not guaranteed. We could add a monotonically increasing sequence number on the datanode side so that SCM can detect a stale FCR and drop it.
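A rough sketch of that idea in the same toy model as the PR description (the report sequence number and the names here are hypothetical, not the real heartbeat protocol): SCM remembers the highest report number it has applied per datanode and drops any full report that is older.

```java
  // Hypothetical: highest report sequence number applied so far, per datanode.
  private final Map<String, Long> lastReportSeq = new ConcurrentHashMap<>();

  /** Returns true if the full report should be applied, false if it is stale. */
  boolean shouldApplyFullReport(String datanode, long reportSeq) {
    long latest = lastReportSeq.getOrDefault(datanode, Long.MIN_VALUE);
    if (reportSeq < latest) {
      return false;   // a newer report was already processed; drop this stale FCR
    }
    lastReportSeq.put(datanode, reportSeq);
    return true;
  }
```

For this check to be meaningful it would have to run under the same per-datanode synchronisation as the report processing itself.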

@sodonnel
Contributor Author

@GlenGeng Thanks for the review. I will go ahead and commit this one, and raise a new Jira for issue 2.

I think you are correct: we may need a monotonically increasing sequence number in the datanode, plus maybe some locking in the DN to guarantee the contents of the FCR.

The hard part is figuring out what to do with that information on the SCM side. I wonder if we need to attach the "report sequence number" to every replica in SCM. Then we could say (a rough sketch follows this list):

  • The FCR has sequence 100, and it does not include container 1001.
  • Container 1001 is in SCM, but it was added by ICR sequence 101.
  • Right now, the FCR processing would remove container 1001, but with the sequence numbers we could ensure it does not remove anything added by a report with a newer sequence number.
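For illustration only, a toy version of that per-replica bookkeeping (the field and method names are invented and this is not how SCM actually stores replicas): each replica remembers the sequence number of the report that added it, and the FCR removal pass skips replicas added by a newer report.

```java
  /** Toy replica record that remembers which report sequence number added it. */
  static final class ReplicaEntry {
    final long containerId;
    final long addedBySeq;

    ReplicaEntry(long containerId, long addedBySeq) {
      this.containerId = containerId;
      this.addedBySeq = addedBySeq;
    }
  }

  // Replicas known per datanode, keyed by container ID (population of this map omitted).
  private final Map<String, Map<Long, ReplicaEntry>> nodeReplicas = new ConcurrentHashMap<>();

  /** FCR removal pass: only drop replicas that were added by a report older than this FCR. */
  void removeMissingReplicas(String datanode, long fcrSeq, Set<Long> containersInFullReport) {
    Map<Long, ReplicaEntry> replicas = nodeReplicas.get(datanode);
    if (replicas == null) {
      return;
    }
    replicas.values().removeIf(entry ->
        !containersInFullReport.contains(entry.containerId) && entry.addedBySeq < fcrSeq);
  }
```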

However, this approach adds a new field to every replica, which would have a memory overhead, so we need to try to think of a better way.

@sodonnel sodonnel merged commit ab8f07d into apache:master May 24, 2021
@sodonnel
Contributor Author

HDDS-5267 has been raised to track and fix the second issue mentioned here.

errose28 added a commit to errose28/ozone that referenced this pull request Jun 1, 2021
…ing-upgrade-master-merge

* upstream/master: (76 commits)
  HDDS-5280. Make XceiverClientManager creation when necessary in ContainerOperationClient (apache#2289)
  HDDS-5272. Make ozonefs.robot execution repeatable (apache#2280)
  HDDS-5123. Use the pre-created apache/ozone-testkrb5 image during secure acceptance tests (apache#2165)
  HDDS-4993. Add guardrail for reserved buffer size when DN reads a chunk (apache#2058)
  HDDS-4936. Change ozone groupId from org.apache.hadoop to org.apache.ozone (apache#2018)
  HDDS-4043. allow deletion from Trash directory without -skipTrash option (apache#2110)
  HDDS-4927. Determine over and under utilized datanodes in Container Balancer. (apache#2230)
  HDDS-5273. Handle unsecure cluster convert to secure cluster for SCM. (apache#2281)
  HDDS-5158. Add documentation for SCM HA Security. (apache#2205)
  HDDS-5275. Datanode Report Publisher publishes one extra report after DN shutdown (apache#2283)
  HDDS-5241. SCM UI should have leader/follower and Primordial SCM information (apache#2260)
  HDDS-5219. Limit number of bad volumes by dfs.datanode.failed.volumes.tolerated. (apache#2243)
  HDDS-5252. PipelinePlacementPolicy filter out datanodes with not enough space. (apache#2271)
  HDDS-5191. Increase default pvc storage size (apache#2219)
  HDDS-5073. Use ReplicationConfig on client side  (apache#2136)
  HDDS-5250. Build integration tests with Maven cache (apache#2269)
  HDDS-5236. Require block token for more operations (apache#2254)
  HDDS-5266 Misspelt words in S3MultipartUploadCommitPartRequest.java line 202 (apache#2279)
  HDDS-5249. Race Condition between Full and Incremental Container Reports (apache#2268)
  HDDS-5142. Make generic streaming client/service for container re-replication, data read, scm/om snapshot download (apache#2256)
  ...

Conflicts:
	hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/protocol/StorageContainerLocationProtocol.java
	hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/scm/protocolPB/StorageContainerLocationProtocolClientSideTranslatorPB.java
	hadoop-hdds/interface-admin/src/main/proto/ScmAdminProtocol.proto
	hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java
	hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/MockNodeManager.java
	hadoop-ozone/dist/src/main/compose/testlib.sh
	hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/TestStorageContainerManager.java
	hadoop-ozone/interface-client/src/main/proto/OmClientProtocol.proto
	hadoop-ozone/ozone-manager/pom.xml
	hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java
	hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ratis/utils/OzoneManagerRatisUtils.java
	hadoop-ozone/s3gateway/pom.xml
bharatviswa504 pushed a commit to bharatviswa504/hadoop-ozone that referenced this pull request Jul 25, 2021
…rts (apache#2268)

(cherry picked from commit ab8f07d)
Change-Id: Ie9e4e45abe7a872782ec0966c84ed4dfa0793962