
Conversation

@errose28
Contributor

@errose28 errose28 commented Jan 30, 2024

What changes were proposed in this pull request?

Design proposal for container reconciliation. Please comment inline on the markdown document to ask questions and post feedback. Switch to Rich Diff mode for smoother reading.

What is the link to the Apache JIRA

HDDS-10239

How was this patch tested?

N/A

@errose28
Contributor Author

cc @kerneltime and @sodonnel who also worked on this design and document.

Contributor

@szetszwo szetszwo left a comment

@errose28 , thanks for working on this! The design looks good. Please see the inline comments.

Comment on lines 108 to 109
deleted: false
healthy: true

Contributor

We may combine deleted and healthy into an enum. BTW, let's call it CHECKSUM_MATCHED instead of healthy.

enum State { 
  CHECKSUM_MATCHED,
  CHECKSUM_MISMATCHED,
  DELETED
}

Contributor

@kerneltime kerneltime Feb 9, 2024

Here is what the actual proto change will look like

 diff --git a/hadoop-hdds/interface-client/src/main/proto/DatanodeClientProtocol.proto b/hadoop-hdds/interface-client/src/main/proto/DatanodeClientProtocol.proto
 index 718e2a108c..ed254478e9 100644
 --- a/hadoop-hdds/interface-client/src/main/proto/DatanodeClientProtocol.proto
 +++ b/hadoop-hdds/interface-client/src/main/proto/DatanodeClientProtocol.proto
 @@ -382,6 +382,7 @@ message ChunkInfo {
    repeated KeyValue metadata = 4;
    required ChecksumData checksumData =5;
    optional bytes stripeChecksum = 6;
 +  optional bool healthy = 7;
  }
  
  message ChunkInfoList {
 @@ -525,3 +526,28 @@ service IntraDatanodeProtocolService {
    rpc download (CopyContainerRequestProto) returns (stream CopyContainerResponseProto);
    rpc upload (stream SendContainerRequest) returns (SendContainerResponse);
  }
 +
 +message MTBlockInfo { // Merkle Tree Block Info
 +  optional BlockData blockData = 1; // The chunks in this should be sorted by order of chunks written.
 +  optional ChecksumData checksumData = 2; // Checksum of the checksum of the chunks.
 +  optional bool deleted = 3; // True if the block is deleted.
 +  optional int64 length = 4; // Length of the block
 +  optional int64 chunkCount = 5; // Number of chunks in the block.
 +}
 +
 +message MTContainerInfo {
 +  // Merkle Tree Container Info
 +  enum FailureCause {
 +    NO_HEALTHY_CHECKSUM_FOUND_WITH_PEERS = 1; // No healthy checksum found with peers.
 +    NO_PEERS_FOUND = 2; // No peers found.
 +  }
 +  optional ContainerDataProto containerData = 1;
 +  repeated MTBlockInfo mtBlockInfo = 2; // Sorted list of blocks.
 +  optional ChecksumData checksumData = 3; // Checksum of the checksum of the blocks
 +  optional bool deleted = 4; // True if the container is deleted.
 +  optional int64 blockCount = 5; // Number of blocks in the container.
 +  optional FailureCause failureCause = 6; // Failure cause if the container is unhealthy. Only used for reporting.
 +  optional int64 reconciliationCount = 7; // Number of times reconciliation was attempted.
 +}

Contributor

Please take a look at the updated proto definition.


- The `healthy` field in the block checksum being true indicates that all chunks within the block are marked as `healthy = true` (see the sketch after this list).

- We may be able to re-use the `ChunkInfo` and `ChecksumData` messages in `DatanodeClientProtocol.proto` for chunk information and checksums. Other structures in the merkle tree storage will require new objects to be defined.
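
To make the first bullet concrete, here is a tiny sketch of the aggregation rule, assuming the `healthy` field from the proposed `ChunkInfo` change above and its generated proto getter (an assumption, not existing code):

```java
// A block's checksum entry is healthy only if every chunk in the block is
// healthy. ChunkInfo is the message extended above with the proposed
// "healthy" field; the generated getter is an assumption.
boolean isBlockHealthy(java.util.List<ChunkInfo> chunks) {
  return chunks.stream().allMatch(ChunkInfo::getHealthy);
}
```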

Contributor

Let's not reuse them, since we should minimize the size. Also, we should use proto3 for the new protos; DatanodeClientProtocol.proto is proto2.

Contributor

I considered that; the problem is that we cannot reduce the size without introducing a whole bunch of new APIs or paying for extra round trips. Today we have the infrastructure to create chunk readers and start reading chunks. If we do not pass in the existing structures in the MT, we would either need new APIs to read chunks or have additional round trips. We will experiment with this and see if we can optionally leave some of the fields unset or cheaply introduce new on-the-wire structures. One option is to complete https://issues.apache.org/jira/browse/HDDS-10338 as part of this work, which would make the proto definitions a lot leaner.

algorithm: CRC32
checksum: 12345
length: 5
deleted: false

Contributor

How do we verify that a block was not deleted by mistake?

Contributor

We currently assume that there is no block deletion by mistake, since the block deletion path starts with OM receiving the delete request. Thus, a block is marked as deleted only when OM has deleted the object and the blocks it references. Protection against accidental deletes should be done via snapshots.

Contributor

That might not be a valid assumption since, when there is corruption, it is invalid to assume "something will never be corrupted". A possible cause of block deletion by mistake is a (future) software bug.

We could have another checksum for the deleted blocks. Below is a quick idea:

  1. the deleted-block-checksum sum initially is 0;
  2. When a block b is deleted, update sum with xor, i.e. sum ^= hash(b).

SCM could store a deleted-block-checksum for each container. Then, it is easy to tell if the deleted blocks are correct in a particular container.
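
A minimal, self-contained sketch of this idea; the class and method names are made up for illustration, and the hash is just a stand-in 64-bit mixer:

```java
// Sketch of the xor-based deleted-block checksum described above.
// Because xor is order-independent and self-inverse, two replicas that deleted
// the same set of blocks end up with the same sum, regardless of delete order.
public class DeletedBlockChecksum {
  private long sum = 0; // starts at 0 per container

  /** Fold a newly deleted block into the running checksum. */
  public void onBlockDeleted(long blockLocalId) {
    sum ^= hash(blockLocalId);
  }

  public long getSum() {
    return sum;
  }

  public boolean matches(long otherReplicaSum) {
    return sum == otherReplicaSum;
  }

  // Stand-in 64-bit mixer (SplitMix64-style finalizer); any stable hash works.
  private static long hash(long id) {
    long h = id;
    h = (h ^ (h >>> 33)) * 0xff51afd7ed558ccdL;
    h = (h ^ (h >>> 33)) * 0xc4ceb9fe1a85ec53L;
    return h ^ (h >>> 33);
  }
}
```

With this, SCM would only need to store one extra 64-bit value per container, comparable in overhead to a deleted-block count.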

Contributor

A couple of issues:

  1. The flag deleted in code should be true only if SCM issued the deletion.
  2. SCM tracking a lineage of hash changes has other memory bloat challenges for SCM, which will create friction for scaling.
  3. If SCM has a software bug that erroneously sends deletes, or a Datanode erroneously marks blocks as deleted, protection via the MT might not be enough.
  4. The only way to make sure during reconciliation that a block is known to be deleted at the SCM level (not just by a peer) would introduce a lot of bloat to SCM.

There are invariants in this design that depend on correctness of the implementation:

  1. Two chunks written by Ratis to two Datanodes cannot simultaneously:
    1. Agree with the checksum provided by the client
    2. Disagree with the peer Datanode on the checksum
  2. SCM decides when a block is deleted, and it informs the Datanodes (followers) of the decision to delete a block. This is the only path for deleting a block on a Datanode and for marking it as deleted on the Datanode.
    1. If SCM sends a delete command to one Datanode (follower), it is ok for the other followers to learn about it via peer-to-peer communication.

Contributor

  1. SCM tracking a lineage of hash changes has other memory bloat challenges for SCM, which will create friction for scaling.

This is just an xor checksum over the block ID hash (not the block data hash) for deleted blocks. The overhead should be similar to an integer deleted-block count.

  1. ... SCM has a software bug ...

A software bug could also be in datanodes: one datanode has mistakenly marked a block as deleted. SCM reconciles the container with the other datanodes. During reconciliation, the other datanodes will also mark that block as deleted.

Contributor

Not sure I follow the recommendation for how to track deletions beyond adding a flag to indicate that a block is deleted, or how the guarantees would be more robust.

Contributor

@szetszwo szetszwo Feb 13, 2024

We have two checksums, mtChecksum and deletedBlockChecksum. When SCM issues a reconciliation, it sends both checksums to the datanodes:

  • When the mtChecksum matches, it means that all the block data (if it exists) is correct.
    • An mtChecksum mismatch means that some data is corrupted.
  • When the deletedBlockChecksum matches, it means that the deleted block list is good.
    • A deletedBlockChecksum mismatch means that the deleted block list is incorrect.

Without deletedBlockChecksum, a matched mtChecksum would have to stand for both "all the block data is correct" and "the deleted block list is correct", even though it does not actually verify the deleted block list.
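
To make the split concrete, here is a hedged sketch of how SCM could act on the two checksums when processing a replica report. The types, fields, and the scheduleReconciliation helper are assumptions for illustration, not existing Ozone APIs:

```java
// Hypothetical sketch of SCM-side handling when a replica report carries both
// checksums. None of these types or helpers are committed Ozone APIs; they
// only illustrate what the split makes detectable.
void checkReplica(ContainerReplicaReport report, ContainerInfo scmRecord) {
  boolean dataMatches =
      report.getMtChecksum() == scmRecord.getMtChecksum();
  boolean deletesMatch =
      report.getDeletedBlockChecksum() == scmRecord.getDeletedBlockChecksum();

  if (dataMatches && deletesMatch) {
    return; // block data and deleted-block list both agree with SCM's record
  }
  if (!dataMatches) {
    // Some block data on this replica is missing or corrupted.
    scheduleReconciliation(report.getContainerID());
  } else {
    // Data matches but the deleted-block list diverged, e.g. a block was
    // deleted by mistake; without the second checksum this case is invisible.
    scheduleReconciliation(report.getContainerID());
  }
}
```

The second branch is the case that cannot be detected from the data checksum alone.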

Contributor

Why split the logic? If a container checksum does not match:

  1. Check if the containers have the same blocks.
    1. If the blocks differ, fix the blocks.
  2. Check if any blocks have been deleted on the peer.

The additional checksum tracking the sum of deleted blocks does not lead to any simplification of the runtime processing, and it adds a fork in the logic at the root vs. down the tree on a per-block basis.

Contributor

Why split the logic? If a Container checksum does not match

It is for the other case: if we don't split it, the deleted block list could differ even when the container checksums match.

Contributor

It could hit the problem described in #6121 (comment)

A software bug could also be in datanodes: one datanode has mistakenly marked a block as deleted. SCM reconciles the container with the other datanodes. During reconciliation, the other datanodes will also mark that block as deleted.


#### On Container Close

- Container checksum is calculated synchronously from existing checksums. This calculation must finish before the close is acked back to SCM.

Contributor

Since block IDs are increasing, why not compute the partial container checksum while the container is open?

Contributor

We can; it would be best to do this once we can measure the performance hit of doing it as part of the write path or for an open container.

Contributor Author

This will be tricky to implement, because multiple blocks can be written to a container in parallel and the checksum will require the same order across all nodes. Block IDs are handed out to clients in an increasing order, but there is no guarantee that the client(s) write them to the container in that order.
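
One way to sidestep the write-ordering issue is to aggregate only at close time, iterating the per-block checksums in block-ID order so every replica derives the same value no matter how clients interleaved their writes. A self-contained sketch under those assumptions (the real design may hash the block merkle entries rather than raw longs):

```java
import java.nio.ByteBuffer;
import java.util.SortedMap;
import java.util.zip.CRC32;

final class ContainerChecksumSketch {
  /**
   * Computes a container-level checksum from per-block checksums, iterated in
   * ascending block ID order (guaranteed by SortedMap), so the result is
   * independent of the order in which clients wrote the blocks.
   */
  static long computeContainerChecksum(SortedMap<Long, Long> blockChecksumsById) {
    CRC32 crc = new CRC32();
    ByteBuffer buf = ByteBuffer.allocate(Long.BYTES);
    for (long blockChecksum : blockChecksumsById.values()) {
      buf.clear();
      buf.putLong(blockChecksum);
      crc.update(buf.array(), 0, Long.BYTES);
    }
    return crc.getValue();
  }
}
```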

Contributor

@prashantpogde prashantpogde left a comment

Similar to closed containers, do we have any goals or limitations for open containers?

@kerneltime
Contributor

kerneltime commented Feb 9, 2024

Similar to closed containers, do we have any goals or limitations for open containers?

The overall goal is to simplify the post-open state handling. Once containers are either open or post-open, we can revisit how to avoid closing containers and instead only treat them as full. That should lead to another order of simplification, but it is out of scope for this design doc.

5. Datanode 2 deletes block 1 and adds a tombstone entry for it since the cluster is running software v2 which has container reconciliation.
6. Since container checksums disregard deleted blocks, container 1 will be seen as matching from SCM's point of view. However, periodic reconciliation requests for closed containers will still eventually ask these two replicas to reconcile.
7. Datanode 1 learns that block 1 was deleted from datanode 2, so it moves the block metadata to the deleted table in RocksDB
- A delete transaction entry with an ID key would need to be created to do this. Currently these are only received from SCM and not created by the datanode.

Contributor Author

Right now the delete IDs are created from SCM's sequence ID generator. There was previously an assumption that the delete transaction IDs would be processed by the datanode in ascending order, but I think we removed that check.

The pro of re-using the delete table instead of deleting the block directly is that the block deleting service code path is reused. The con of not deleting the block directly when reconciling is that the unique delete transaction ID can now come from two different sources.

- If we stop here, then block 1 will remain orphaned on datanode 1.
5. Datanode 2 deletes block 1 and adds a tombstone entry for it since the cluster is running software v2 which has container reconciliation.
6. Since container checksums disregard deleted blocks, container 1 will be seen as matching from SCM's point of view. However, periodic reconciliation requests for closed containers will still eventually ask these two replicas to reconcile.
7. Datanode 1 learns that block 1 was deleted from datanode 2, so it moves the block metadata to the deleted table in RocksDB

Contributor

How does Datanode 1 learn this? Is this a separate API where Datanode 2 tells Datanode 1 that the block is a deleted block?

Contributor Author

There is a bool deleted flag in the merkle tree proto that will be updated for each block delete as a tombstone entry. When datanodes use the proposed API to pull the Merkle tree, they can see that their peer deleted the block so it is not missing.
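
For illustration, a sketch of how the pulling datanode might use that flag when it walks the peer's tree; MTBlockInfo refers to the proto proposed earlier in this thread, and the helper methods are assumptions, not existing Ozone code:

```java
// Decide what a locally absent block means based on the peer's merkle tree
// entry. hasBlockLocally, recordTombstoneLocally and pullBlockFromPeer are
// hypothetical helpers used only to illustrate the flow.
void reconcileBlock(long blockLocalId, MTBlockInfo peerBlock) {
  if (hasBlockLocally(blockLocalId)) {
    return; // nothing to do for this block in this step
  }
  if (peerBlock.getDeleted()) {
    // The peer left a tombstone: the block was deleted, not lost, so record
    // the deletion locally instead of re-replicating the data.
    recordTombstoneLocally(blockLocalId);
  } else {
    // The block exists on the peer but not here: repair by pulling its chunks.
    pullBlockFromPeer(blockLocalId);
  }
}
```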

Contributor

Understood. Datanode 1 needs to wait for SCM to send the "reconcileContainer" API call to it. Once this call is received, Datanode 1 will retrieve the Merkle tree from Datanode 2. This allows Datanode 1 to know that block 1 has been deleted, right?
However, this process may be time-consuming, as SCM might not send "reconcileContainer" solely because of an orphan block.

Contributor

@xichen01 this is a good point. I think, similar to a node-level scan, we will need an SCM-level scan where every container in the system is reconciled once in x number of days even if the checksums match. The other option, which @szetszwo recommended, is to have a separate delete hash in the Merkle tree to detect missing deletions.

Contributor Author

Yes, we should not depend on reconciliation purely for deleting blocks. The main block delete flow will still be active. There are corner cases where the current delete flow can leave orphan blocks, so it would be nice for the reconciliation process to handle this, although it is not a primary goal.

### SCM sets up the reconciliation process as follows:

1. SCM triggers a reconciliation on `DN 1` for container 12 with replicas on `DN 2` and `DN 3`.
1. `SCM -> reconcileContainer(Container #12, DN2, DN3) -> DN 1`

Contributor

Do we need to consider the issue of repeated sends, especially after an HA switchover or SCM restart?

Contributor Author

Yes. We can handle this similarly to how retries or duplicate container replication commands are currently handled on datanodes. I can add more details on this to the doc.

Contributor

Duplicate commands can be deduplicated on the Datanode side. For ordering containers, SCM will need some prioritization of containers as they are reported via heartbeats.

1. Datanode 1 in v1 deletes block 1 in container 1. This does not leave a tombstone entry because v1 does not have container reconciliation.
2. Cluster is upgraded to v2.
3. Reconciliation is triggered between datanodes 1 and 2 for container 1. Datanode 2 has not yet deleted block 1.
4. Datanode 1 will add block 1 to its container when reconciling with datanode 2.

Contributor

If the delTX of Datanode 2 is greater than that of Datanode 1, Datanode 1 can confirm from Datanode 2's tombstone that there is no need to restore Block 1.
So maybe this can be handled by waiting for all replicas' delTX to become greater than a certain value (perhaps the largest delTX among all replicas at the beginning of reconciliation).

@kerneltime
Contributor

Thanks @xichen01 for the review.

4. If container not found, return error.
5. The merkle tree returned by this call will represent the status of the data on disk (post scanner scrub).

## Sample scenarios

Contributor

This might have been discussed somewhere above, but I have the following questions:

  1. What do we do when only one container replica is available, in the quasi-closed state?
  2. When only 2 quasi-closed replicas are available?
  3. When 2 quasi-closed replicas are available and, post reconciliation, an old closed replica comes back? (We discussed this in an offline chat, but it would be good to capture those scenarios.)
    For cases like #2, how do we avoid falsely treating the container as good? (If we reconcile from 2 quasi-closed states, it may not be guaranteed that we have all the chunks, right?) Could we please discuss and clarify those scenarios? Thanks.

Contributor

We will dive deeper into the state changes for containers post reconciliation in the SCM doc around reconciliation.
You raise a valid point: it is beneficial to remember that a container was QUASI_CLOSED if it cannot be compared to a CLOSED container.


### Reconcile loop once the merkle trees are obtained from all/most replicas:

1. `DN 1` checks if any blocks are missing. For each missing Block:

Contributor

@ChenSammi ChenSammi Mar 4, 2024

What are the criteria to determine whether a block is missing or is merely an extra, unneeded block? In other words, what is the rule for deciding which replica is the baseline replica?

If any new blocks are added locally, the container metadata in RocksDB needs to be updated accordingly.

Contributor

  1. The reconciliation process does not take into account whether a block is referenced by OM. If the block was written to a container, it will be replicated across replicas as a safety measure. Recon can identify unwanted blocks at a later point.
  2. Yes, when adding a block, its metadata will also need to be updated (see the sketch below).
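
A rough sketch of that behavior, using assumed type and helper names (not the final implementation), to show where the metadata update fits into the reconcile loop quoted above:

```java
// Illustrative only: walk each peer's merkle tree and repair blocks that are
// present on the peer but absent locally, regardless of OM references.
// ContainerMerkleTree, localIdOf, copyBlockFromPeer and updateContainerMetadata
// are assumed names, not existing Ozone classes or methods.
void reconcile(ContainerMerkleTree localTree, java.util.List<ContainerMerkleTree> peerTrees) {
  for (ContainerMerkleTree peerTree : peerTrees) {
    for (MTBlockInfo peerBlock : peerTree.getBlocks()) {
      long blockId = localIdOf(peerBlock); // assumed helper extracting the block's local ID
      if (!localTree.hasBlock(blockId) && !peerBlock.getDeleted()) {
        // Copy the block even if OM no longer references it; Recon can clean
        // up unwanted blocks later.
        copyBlockFromPeer(peerTree.getReplicaDatanode(), blockId);
        // Keep the container's RocksDB metadata (block count, bytes used) and
        // the local merkle tree in sync with the newly added block.
        updateContainerMetadata(blockId, peerBlock.getLength());
      }
    }
  }
}
```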

@ChenSammi
Contributor

ChenSammi commented Mar 4, 2024

Thanks @errose28 , @kerneltime and @sodonnel for working on this. Besides the benefits mentioned in the document, I believe there is another big benefit: we can significantly reduce the IO resources used in container replica recovery (container replication) by going from copying the whole container replica to copying just a few affected blocks. It will help improve the cluster-wide overall stability and performance.

@kerneltime kerneltime changed the base branch from master to HDDS-10239-container-reconciliation May 23, 2024 16:40
@kerneltime kerneltime merged commit a4bb94e into apache:HDDS-10239-container-reconciliation May 23, 2024
errose28 added a commit to errose28/ozone that referenced this pull request Jun 6, 2024
* HDDS-10239-container-reconciliation:
  HDDS-10372. SCM and Datanode communication for reconciliation (apache#6506)
  HDDS-10239. Storage Container Reconciliation. (apache#6121)

Labels: container-reconciliation, design, documentation
