
Conversation

@errose28
Contributor

@errose28 errose28 commented Jan 30, 2024

What changes were proposed in this pull request?

Design proposal for container reconciliation. Please comment inline on the markdown document to ask questions and post feedback. Switch to Rich Diff mode for smoother reading.

What is the link to the Apache JIRA

HDDS-10239

How was this patch tested?

N/A

@errose28
Contributor Author

cc @kerneltime and @sodonnel who also worked on this design and document.

Contributor

@szetszwo szetszwo left a comment

@errose28 , thanks for working on this! The design looks good. Please see the inline comments.

Comment on lines 108 to 109
deleted: false
healthy: true

Contributor

We may combine deleted and healthy into an enum. BTW, let's call it CHECKSUM_MATCHED instead of healthy.

enum State { 
  CHECKSUM_MATCHED,
  CHECKSUM_MISMATCHED,
  DELETED
}

Contributor

@kerneltime kerneltime Feb 9, 2024

Here is what the actual proto change will look like

 diff --git a/hadoop-hdds/interface-client/src/main/proto/DatanodeClientProtocol.proto b/hadoop-hdds/interface-client/src/main/proto/DatanodeClientProtocol.proto
 index 718e2a108c..ed254478e9 100644
 --- a/hadoop-hdds/interface-client/src/main/proto/DatanodeClientProtocol.proto
 +++ b/hadoop-hdds/interface-client/src/main/proto/DatanodeClientProtocol.proto
 @@ -382,6 +382,7 @@ message ChunkInfo {
    repeated KeyValue metadata = 4;
    required ChecksumData checksumData =5;
    optional bytes stripeChecksum = 6;
 +  optional bool healthy = 7;
  }
  
  message ChunkInfoList {
 @@ -525,3 +526,28 @@ service IntraDatanodeProtocolService {
    rpc download (CopyContainerRequestProto) returns (stream CopyContainerResponseProto);
    rpc upload (stream SendContainerRequest) returns (SendContainerResponse);
  }
 +
 +message MTBlockInfo { // Merkle Tree Block Info
 +  optional BlockData blockData = 1; // The chunks in this should be sorted by order of chunks written.
 +  optional ChecksumData checksumData = 2; // Checksum of the checksum of the chunks.
 +  optional bool deleted = 3; // True if the block is deleted.
 +  optional int64 length = 4; // Length of the block
 +  optional int64 chunkCount = 5; // Number of chunks in the block.
 +}
 +
 +message MTContainerInfo {
 +  // Merkle Tree Container Info
 +  enum FailureCause {
 +    NO_HEALTHY_CHECKSUM_FOUND_WITH_PEERS = 1; // No healthy checksum found with peers.
 +    NO_PEERS_FOUND = 2; // No peers found.
 +  }
 +  optional ContainerDataProto containerData = 1;
 +  repeated MTBlockInfo mtBlockInfo = 2; // Sorted list of blocks.
 +  optional ChecksumData checksumData = 3; // Checksum of the checksum of the blocks
 +  optional bool deleted = 4; // True if the container is deleted.
 +  optional int64 blockCount = 5; // Number of blocks in the container.
 +  optional FailureCause failureCause = 6; // Failure cause if the container is unhealthy. Only used for reporting.
 +  optional int64 reconciliationCount = 7; // Number of times reconciliation was attempted.
 +}

Contributor

Please take a look at the updated proto definition.


- The `healthy` field in the block checksum being true indicates that all chunks within the block are marked as `healthy = true` (see the sketch after this list).

- We may be able to re-use the `ChunkInfo` and `ChecksumData` messages in `DatanodeClientProtocol.proto` for chunk information and checksums. Other structures in the merkle tree storage will require new objects to be defined.
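
To make the first bullet concrete, here is a tiny sketch of the aggregation rule, assuming the `healthy` field from the proposed `ChunkInfo` change above and its generated proto getter (an assumption, not existing code):

```java
// A block's checksum entry is healthy only if every chunk in the block is
// healthy. ChunkInfo is the message extended above with the proposed
// "healthy" field; the generated getter is an assumption.
boolean isBlockHealthy(java.util.List<ChunkInfo> chunks) {
  return chunks.stream().allMatch(ChunkInfo::getHealthy);
}
```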

Contributor

Let's not reuse them, since we should minimize the size. Also, we should use proto3 for the new protos; DatanodeClientProtocol.proto is proto2.

Contributor

I considered that; the problem is that we cannot reduce the size without introducing a whole bunch of new APIs or paying for extra round trips. Today we have the infrastructure to create chunk readers and start reading chunks. If we do not pass in the existing structures in the MT, we would either need new APIs to read chunks or have additional round trips. We will experiment with this and see if we can optionally leave some of the fields unset or cheaply introduce new on-the-wire structures. One option is to complete https://issues.apache.org/jira/browse/HDDS-10338 as part of this work, which would make the proto definitions a lot leaner.

algorithm: CRC32
checksum: 12345
length: 5
deleted: false

Contributor

How do we verify that a block was not deleted by mistake?

Contributor

We currently assume that there is no block deletion by mistake, since the block deletion path starts with OM receiving the delete request. Thus, a block is marked as deleted only when OM has deleted the object and the blocks it references. Protection against accidental deletes should be done via snapshots.

Contributor

That might not be a valid assumption since, when there is corruption, it is invalid to assume "something will never be corrupted". A possible cause of block deletion by mistake is a (future) software bug.

We could have another checksum for the deleted blocks. Below is a quick idea:

  1. the deleted-block-checksum sum initially is 0;
  2. When a block b is deleted, update sum with xor, i.e. sum ^= hash(b).

SCM could store a deleted-block-checksum for each container. Then, it is easy to tell if the deleted blocks are correct in a particular container.
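
A minimal, self-contained sketch of this idea; the class and method names are made up for illustration, and the hash is just a stand-in 64-bit mixer:

```java
// Sketch of the xor-based deleted-block checksum described above.
// Because xor is order-independent and self-inverse, two replicas that deleted
// the same set of blocks end up with the same sum, regardless of delete order.
public class DeletedBlockChecksum {
  private long sum = 0; // starts at 0 per container

  /** Fold a newly deleted block into the running checksum. */
  public void onBlockDeleted(long blockLocalId) {
    sum ^= hash(blockLocalId);
  }

  public long getSum() {
    return sum;
  }

  public boolean matches(long otherReplicaSum) {
    return sum == otherReplicaSum;
  }

  // Stand-in 64-bit mixer (SplitMix64-style finalizer); any stable hash works.
  private static long hash(long id) {
    long h = id;
    h = (h ^ (h >>> 33)) * 0xff51afd7ed558ccdL;
    h = (h ^ (h >>> 33)) * 0xc4ceb9fe1a85ec53L;
    return h ^ (h >>> 33);
  }
}
```

With this, SCM would only need to store one extra 64-bit value per container, comparable in overhead to a deleted-block count.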

Contributor

A couple of issues:

  1. The flag deleted in code should be true only if SCM issued the deletion.
  2. SCM tracking a lineage of hash changes has other memory bloat challenges for SCM, which will create friction for scaling.
  3. If SCM has a software bug that erroneously sends deletes, or a Datanode erroneously marks blocks as deleted, protection via the MT might not be enough.
  4. The only way to make sure during reconciliation that a block is known to be deleted at the SCM level (not just by a peer) would introduce a lot of bloat to SCM.

There are invariants in this design that depend on correctness of the implementation:

  1. Two chunks written by Ratis to two Datanodes cannot simultaneously:
    1. Agree with the checksum provided by the client
    2. Disagree with the peer Datanode on the checksum
  2. SCM decides when a block is deleted, and it informs the Datanodes (followers) of the decision to delete a block. This is the only path for deleting a block on a Datanode and for marking it as deleted on the Datanode.
    1. If SCM sends a delete command to one Datanode (follower), it is ok for the other followers to learn about it via peer-to-peer communication.

Contributor

  1. SCM tracking a lineage of hash changes has other memory bloat challenges for SCM, which will create friction for scaling.

This is just an xor checksum over the block ID hash (not the block data hash) for deleted blocks. The overhead should be similar to an integer deleted-block count.

  1. ... SCM has a software bug ...

A software bug could also be in datanodes: one datanode has mistakenly marked a block as deleted. SCM reconciles the container with the other datanodes. During reconciliation, the other datanodes will also mark that block as deleted.

Contributor

Not sure I follow the recommendation for how to track deletions beyond adding a flag to indicate that a block is deleted, or how the guarantees would be more robust.

Contributor

@szetszwo szetszwo Feb 13, 2024

We have two checksums, mtChecksum and deletedBlockChecksum. When SCM issues a reconciliation, it sends both checksums to the datanodes:

  • When the mtChecksum matches, it means that all the block data (if it exists) is correct.
    • An mtChecksum mismatch means that some data is corrupted.
  • When the deletedBlockChecksum matches, it means that the deleted block list is good.
    • A deletedBlockChecksum mismatch means that the deleted block list is incorrect.

Without deletedBlockChecksum, a matched mtChecksum would have to stand for both "all the block data is correct" and "the deleted block list is correct", even though it does not actually verify the deleted block list.
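
To make the split concrete, here is a hedged sketch of how SCM could act on the two checksums when processing a replica report. The types, fields, and the scheduleReconciliation helper are assumptions for illustration, not existing Ozone APIs:

```java
// Hypothetical sketch of SCM-side handling when a replica report carries both
// checksums. None of these types or helpers are committed Ozone APIs; they
// only illustrate what the split makes detectable.
void checkReplica(ContainerReplicaReport report, ContainerInfo scmRecord) {
  boolean dataMatches =
      report.getMtChecksum() == scmRecord.getMtChecksum();
  boolean deletesMatch =
      report.getDeletedBlockChecksum() == scmRecord.getDeletedBlockChecksum();

  if (dataMatches && deletesMatch) {
    return; // block data and deleted-block list both agree with SCM's record
  }
  if (!dataMatches) {
    // Some block data on this replica is missing or corrupted.
    scheduleReconciliation(report.getContainerID());
  } else {
    // Data matches but the deleted-block list diverged, e.g. a block was
    // deleted by mistake; without the second checksum this case is invisible.
    scheduleReconciliation(report.getContainerID());
  }
}
```

The second branch is the case that cannot be detected from the data checksum alone.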

Contributor

Why split the logic? If a container checksum does not match:

  1. Check if the containers have the same blocks.
    1. If the blocks differ, fix the blocks.
  2. Check if any blocks have been deleted on the peer.

The additional checksum tracking the sum of deleted blocks does not lead to any simplification of the runtime processing, and it adds a fork in the logic at the root vs. down the tree on a per-block basis.

Contributor

Why split the logic? If a Container checksum does not match

It is for the other case: if we don't split it, the deleted block list could differ even when the container checksums match.

Contributor

It could hit the problem described in #6121 (comment)

A software bug could also be in datanodes: one datanode has mistakenly marked a block as deleted. SCM reconciles the container with the other datanodes. During reconciliation, the other datanodes will also mark that block as deleted.


#### On Container Close

- Container checksum is calculated synchronously from existing checksums. This calculation must finish before the close is acked back to SCM.

Contributor

Since block IDs are increasing, why not compute the partial container checksum while the container is open?

Contributor

We can; it would be best to do this once we can measure the performance hit of doing it as part of the write path or for an open container.

Contributor Author

This will be tricky to implement, because multiple blocks can be written to a container in parallel and the checksum will require the same order across all nodes. Block IDs are handed out to clients in an increasing order, but there is no guarantee that the client(s) write them to the container in that order.
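
One way to sidestep the write-ordering issue is to aggregate only at close time, iterating the per-block checksums in block-ID order so every replica derives the same value no matter how clients interleaved their writes. A self-contained sketch under those assumptions (the real design may hash the block merkle entries rather than raw longs):

```java
import java.nio.ByteBuffer;
import java.util.SortedMap;
import java.util.zip.CRC32;

final class ContainerChecksumSketch {
  /**
   * Computes a container-level checksum from per-block checksums, iterated in
   * ascending block ID order (guaranteed by SortedMap), so the result is
   * independent of the order in which clients wrote the blocks.
   */
  static long computeContainerChecksum(SortedMap<Long, Long> blockChecksumsById) {
    CRC32 crc = new CRC32();
    ByteBuffer buf = ByteBuffer.allocate(Long.BYTES);
    for (long blockChecksum : blockChecksumsById.values()) {
      buf.clear();
      buf.putLong(blockChecksum);
      crc.update(buf.array(), 0, Long.BYTES);
    }
    return crc.getValue();
  }
}
```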

Contributor

@prashantpogde prashantpogde left a comment

Similar to closed containers, do we have any goals or limitations for open containers?

@kerneltime
Contributor

kerneltime commented Feb 9, 2024

Similar to closed containers, do we have any goals or limitations for open containers?

The overall goal is to simplify the post-open state handling. Once containers are either open or post-open, we can revisit how to avoid closing containers and instead only treat them as full. That should lead to another order of simplification, but it is out of scope for this design doc.

5. Datanode 2 deletes block 1 and adds a tombstone entry for it since the cluster is running software v2 which has container reconciliation.
6. Since container checksums disregard deleted blocks, container 1 will be seen as matching from SCM's point of view. However, periodic reconciliation requests for closed containers will still eventually ask these two replicas to reconcile.
7. Datanode 1 learns that block 1 was deleted from datanode 2, so it moves the block metadata to the deleted table in RocksDB
- A delete transaction entry with an ID key would need to be created to do this. Currently these are only received from SCM and not created by the datanode.

Contributor Author

Right now the delete IDs are created from SCM's sequence ID generator. There was previously an assumption that the delete transaction IDs would be processed by the datanode in ascending order, but I think we removed that check.

The pro of re-using the delete table instead of deleting the block directly is that the block deleting service code path is reused. The con of not deleting the block directly when reconciling is that the unique delete transaction ID can now come from two different sources.

- If we stop here, then block 1 will remain orphaned on datanode 1.
5. Datanode 2 deletes block 1 and adds a tombstone entry for it since the cluster is running software v2 which has container reconciliation.
6. Since container checksums disregard deleted blocks, container 1 will be seen as matching from SCM's point of view. However, periodic reconciliation requests for closed containers will still eventually ask these two replicas to reconcile.
7. Datanode 1 learns that block 1 was deleted from datanode 2, so it moves the block metadata to the deleted table in RocksDB

Contributor

How does Datanode 1 learn this? Is this a separate API where Datanode 2 tells Datanode 1 that the block is a deleted block?

Contributor Author

There is a bool deleted flag in the merkle tree proto that will be updated for each block delete as a tombstone entry. When datanodes use the proposed API to pull the Merkle tree, they can see that their peer deleted the block so it is not missing.
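
For illustration, a sketch of how the pulling datanode might use that flag when it walks the peer's tree; MTBlockInfo refers to the proto proposed earlier in this thread, and the helper methods are assumptions, not existing Ozone code:

```java
// Decide what a locally absent block means based on the peer's merkle tree
// entry. hasBlockLocally, recordTombstoneLocally and pullBlockFromPeer are
// hypothetical helpers used only to illustrate the flow.
void reconcileBlock(long blockLocalId, MTBlockInfo peerBlock) {
  if (hasBlockLocally(blockLocalId)) {
    return; // nothing to do for this block in this step
  }
  if (peerBlock.getDeleted()) {
    // The peer left a tombstone: the block was deleted, not lost, so record
    // the deletion locally instead of re-replicating the data.
    recordTombstoneLocally(blockLocalId);
  } else {
    // The block exists on the peer but not here: repair by pulling its chunks.
    pullBlockFromPeer(blockLocalId);
  }
}
```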

Contributor

Understood. Datanode 1 needs to wait for SCM to send the "reconcileContainer" API call to it. Once this call is received, Datanode 1 will retrieve the Merkle tree from Datanode 2. This allows Datanode 1 to know that block 1 has been deleted, right?
However, this process may be time-consuming, as SCM might not send "reconcileContainer" solely because of an orphan block.

Contributor

@xichen01 this is a good point. I think, similar to a node-level scan, we will need an SCM-level scan where every container in the system is reconciled once in x number of days even if the checksums match. The other option, which @szetszwo recommended, is to have a separate delete hash in the Merkle tree to detect missing deletions.

Contributor Author

Yes, we should not depend on reconciliation purely for deleting blocks. The main block delete flow will still be active. There are corner cases where the current delete flow can leave orphan blocks, so it would be nice for the reconciliation process to handle this, although it is not a primary goal.

### SCM sets up the reconciliation process as follows:

1. SCM triggers a reconciliation on `DN 1` for container 12 with replicas on `DN 2` and `DN 3`.
1. `SCM -> reconcileContainer(Container #12, DN2, DN3) -> DN 1`

Contributor

Do we need to consider the issue of repeated sends, especially after an HA switchover or SCM restart?

Contributor Author

Yes. We can handle this similarly to how retries or duplicate container replication commands are currently handled on datanodes. I can add more details on this to the doc.

Contributor

Duplicate commands can be deduplicated on the Datanode side. For ordering containers, SCM will need some prioritization of containers as they are reported via heartbeats.

1. Datanode 1 in v1 deletes block 1 in container 1. This does not leave a tombstone entry because v1 does not have container reconciliation.
2. Cluster is upgraded to v2.
3. Reconciliation is triggered between datanodes 1 and 2 for container 1. Datanode 2 has not yet deleted block 1.
4. Datanode 1 will add block 1 to its container when reconciling with datanode 2.

Contributor

If the delTX of Datanode 2 is greater than that of Datanode 1, Datanode 1 can confirm from Datanode 2's tombstone that there is no need to restore Block 1.
So maybe this can be handled by waiting for all replicas' delTX to become greater than a certain value (perhaps the largest delTX among all replicas at the beginning of reconciliation).

@kerneltime
Contributor

Thanks @xichen01 for the review.

4. If container not found, return error.
5. The merkle tree returned by this call will represent the status of the data on disk (post scanner scrub).

## Sample scenarios

Contributor

This might have been discussed somewhere above, but I have the following questions:

  1. What do we do when only one container replica is available, in the quasi-closed state?
  2. When only 2 quasi-closed replicas are available?
  3. When 2 quasi-closed replicas are available and, post reconciliation, an old closed replica comes back? (We discussed this in an offline chat, but it would be good to capture those scenarios.)
    For cases like #2, how do we avoid falsely treating the container as good? (If we reconcile from 2 quasi-closed states, it may not be guaranteed that we have all the chunks, right?) Could we please discuss and clarify those scenarios? Thanks.

Contributor

We will dive deeper into the state changes for containers post reconciliation in the SCM doc around reconciliation.
You raise a valid point: it is beneficial to remember that a container was QUASI_CLOSED if it cannot be compared to a CLOSED container.


### Reconcile loop once the merkle trees are obtained from all/most replicas:

1. `DN 1` checks if any blocks are missing. For each missing Block:

Contributor

@ChenSammi ChenSammi Mar 4, 2024

What are the criteria to determine whether a block is missing or is merely an extra, unneeded block? In other words, what is the rule for deciding which replica is the baseline replica?

If any new blocks are added locally, the container metadata in RocksDB needs to be updated accordingly.

Contributor

  1. The reconciliation process does not take into account whether a block is referenced by OM. If the block was written to a container, it will be replicated across replicas as a safety measure. Recon can identify unwanted blocks at a later point.
  2. Yes, when adding a block, its metadata will also need to be updated (see the sketch below).
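
A rough sketch of that behavior, using assumed type and helper names (not the final implementation), to show where the metadata update fits into the reconcile loop quoted above:

```java
// Illustrative only: walk each peer's merkle tree and repair blocks that are
// present on the peer but absent locally, regardless of OM references.
// ContainerMerkleTree, localIdOf, copyBlockFromPeer and updateContainerMetadata
// are assumed names, not existing Ozone classes or methods.
void reconcile(ContainerMerkleTree localTree, java.util.List<ContainerMerkleTree> peerTrees) {
  for (ContainerMerkleTree peerTree : peerTrees) {
    for (MTBlockInfo peerBlock : peerTree.getBlocks()) {
      long blockId = localIdOf(peerBlock); // assumed helper extracting the block's local ID
      if (!localTree.hasBlock(blockId) && !peerBlock.getDeleted()) {
        // Copy the block even if OM no longer references it; Recon can clean
        // up unwanted blocks later.
        copyBlockFromPeer(peerTree.getReplicaDatanode(), blockId);
        // Keep the container's RocksDB metadata (block count, bytes used) and
        // the local merkle tree in sync with the newly added block.
        updateContainerMetadata(blockId, peerBlock.getLength());
      }
    }
  }
}
```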

@ChenSammi
Contributor

ChenSammi commented Mar 4, 2024

Thanks @errose28 , @kerneltime and @sodonnel for working on this. Besides the benefits mentioned in the document, I believe there is another big benefit: we can significantly reduce the IO resources used in container replica recovery (container replication) by going from copying the whole container replica to copying just a few affected blocks. It will help improve the cluster-wide overall stability and performance.

@kerneltime kerneltime changed the base branch from master to HDDS-10239-container-reconciliation May 23, 2024 16:40
@kerneltime kerneltime merged commit a4bb94e into apache:HDDS-10239-container-reconciliation May 23, 2024
errose28 added a commit to errose28/ozone that referenced this pull request Jun 6, 2024
* HDDS-10239-container-reconciliation:
  HDDS-10372. SCM and Datanode communication for reconciliation (apache#6506)
  HDDS-10239. Storage Container Reconciliation. (apache#6121)

Labels: container-reconciliation, design, documentation
