
@errose28 errose28 commented Nov 26, 2024

What changes were proposed in this pull request?

Allow the container scanner to update the container Merkle tree with the hashes it currently computes from the data on disk. If the tree does not already exist, it is created. Even if the container is unhealthy, the Merkle tree should still be updated to reflect the container's current state. The client-supplied hashes stored in RocksDB at write time are not modified.

Specific changes:

  • KeyValueContainerCheck has been modified to support building the merkle tree from the data the scanner sees.
  • Reworked the container checksum update methods in KeyValueHandler.
    • In particular, a public updateContainerChecksum method was added for scanners to call, so that ICRs are sent for checksum updates even if the container's state does not change (see the sketch after this list).
    • This also allows the KeyValueHandler to manage the in-memory data checksum update in the ContainerData object, so we don't need to do it from within ContainerChecksumTreeManager anymore.
  • A new ContainerScanHelper class was created to consolidate duplicate code among the scanners.
  • ContainerMerkleTree can now be built to contain blocks without any chunks, which is necessary to represent states like an empty block file.
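
To make the scanner-to-handler flow above concrete, here is a minimal sketch of how a scan result might be turned into a tree and handed to the handler. The class names come from the description above, but the signatures and the isChunkHealthy helper are assumptions for illustration, not the merged API.

// Hedged sketch only: signatures and the isChunkHealthy helper are
// assumptions based on the description above, not the exact merged API.
private void buildAndStoreTree(KeyValueContainer container,
    List<BlockData> blocksSeenOnDisk, KeyValueHandler handler)
    throws IOException {
  ContainerMerkleTreeWriter tree = new ContainerMerkleTreeWriter();
  for (BlockData block : blocksSeenOnDisk) {
    // A block with no chunks (e.g. an empty block file) is still recorded,
    // which is why the tree now supports chunk-less blocks.
    tree.addBlock(block.getLocalID());
    for (ContainerProtos.ChunkInfo chunk : block.getChunks()) {
      tree.addChunks(block.getLocalID(), isChunkHealthy(chunk), chunk);
    }
  }
  // The handler persists the tree, refreshes the cached data checksum on
  // ContainerData, and sends an ICR even if the container state is unchanged.
  handler.updateContainerChecksum(container, tree);
}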

What is the link to the Apache JIRA?

HDDS-10374

How was this patch tested?

This patch contains unit, integration, and acceptance test additions and modifications. Major ones to note:

  • Unit tests for all container scanners were updated to verify that a failure to update the checksum does not prevent moving a container to unhealthy (see the sketch at the end of this section).
  • The new KeyValueHandler#updateContainerChecksum method has a unit test added to check both ICR and disk updates happen correctly.
  • The TestContainerReconciliationWithMockDatanodes suite has been enabled now that HDDS-12980 (Add unit test framework for reconciliation, #8402) is in place.
  • Container scanner integration tests have been updated to check container checksums after corruption is injected.
  • Existing tests have been migrated to use the on-demand scanner to generate merkle trees instead of manually building them. This removes numerous TODOs from the code.

More robust testing of fault combinations will be added later using the framework being developed under HDDS-11942.
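
As a rough illustration of the scanner test change above, a test along these lines could assert that a checksum-update failure does not block the unhealthy transition; the mocked method names and the scanner/container fixtures are assumptions, not the exact tests added in this PR.

// Hedged sketch of the test idea using JUnit 5 + Mockito. The mocked method
// names and the scanner/container fixtures are assumptions for illustration.
@Test
void checksumUpdateFailureDoesNotBlockUnhealthyTransition() throws Exception {
  KeyValueHandler handler = mock(KeyValueHandler.class);
  doThrow(new IOException("simulated checksum write failure"))
      .when(handler).updateContainerChecksum(any(), any());

  // Scan a container that has corruption injected into its data.
  scanner.scanContainer(corruptedContainer);

  // Even though the checksum update failed, the container must still be
  // marked unhealthy and reported to SCM via an ICR.
  verify(handler).markContainerUnhealthy(eq(corruptedContainer), any());
}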

errose28 marked this pull request as ready for review January 9, 2025 17:45
} finally {
container.writeUnlock();
}
createContainerMerkleTree(container);
Member commented:

If a container moves from OPEN to UNHEALTHY state, we should try to build a Merkle tree with whatever data we have at the moment, before the scanner builds the actual Merkle tree. If for some reason (either a metadata or data error) we are unable to build it, then we can log the exception and move on.

Contributor Author (errose28) replied:

Yes, when we have an unhealthy container we should build the tree from the actual data, but this method only builds the tree from the metadata. I'm renaming it in the next commit to make that clearer. We should not depend on just the metadata if we suspect corruption. The way this should work is that open -> unhealthy triggers an on-demand scan of the container, but that code path is not in place right now. We may actually want to add that to our branch.
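
A minimal sketch of the open -> unhealthy hook being discussed here, assuming the on-demand scanner exposes a scanContainer entry point; this is illustrative, not the code path that exists today.

// Hedged sketch: once the container has been marked unhealthy and the write
// lock released, ask the on-demand scanner to rebuild the tree from the data
// on disk rather than from metadata alone. The scanner entry point shown is
// an assumption for illustration.
if (container.getContainerState() == ContainerDataProto.State.UNHEALTHY) {
  OnDemandContainerDataScanner.scanContainer(container);
}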

Member replied:

Yes, I think we should add the on-demand scan here to generate the Merkle tree.

aswinshakil (Member) left a comment:

Overall the change looks good to me. I have left a minor follow-up on my previous comment.

.setContainerMerkleTree(treeProto);
write(data, checksumInfoBuilder.build());
LOG.debug("Data merkle tree for container {} updated", containerID);
// If write succeeds, update the checksum in memory. Otherwise 0 will be used to indicate the metadata failure.
Contributor commented:

Should there be a way to tell if a checksum failed vs. scanner has not yet run? Should failure to generate checksum == -1?

Contributor replied:

As discussed, the in-memory hash should be what is written to disk, since it is an in-memory cache. If updating the Merkle tree is failing, we can have a metric + log message and/or a status report to SCM, or a new Admin API on the Datanode to query the Merkle tree (I think there are going to be multiple debug scenarios where we would like to ask a Datanode what it knows, independent of SCM reports).
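
A minimal sketch of the write-then-cache ordering being described, building on the excerpt above; the setDataChecksum call and proto field names are assumptions.

// Hedged sketch of the write-then-cache ordering discussed above. Field and
// method names outside the quoted excerpt are assumptions for illustration.
ContainerProtos.ContainerChecksumInfo checksumInfo = checksumInfoBuilder
    .setContainerMerkleTree(treeProto)
    .build();
try {
  // Persist the tree to the container's checksum file first.
  write(data, checksumInfo);
  // Only refresh the in-memory cache after a successful write, so the cached
  // value of 0 keeps indicating that no tree has been persisted yet.
  data.setDataChecksum(treeProto.getDataChecksum());
  LOG.debug("Data merkle tree for container {} updated", containerID);
} catch (IOException e) {
  // Leave the cached checksum untouched; surface the failure through a log,
  // metric, or status report as suggested above.
  LOG.error("Failed to write merkle tree for container {}", containerID, e);
}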

…anner-builds-mt

* HDDS-10239-container-reconciliation: (646 commits)
  HDDS-11763. Implement container repair logic within datanodes. (apache#7474)
  HDDS-11887. Recon - Identify container replicas difference based on content checksums (apache#7942)
  HDDS-12397. Persist putBlock for closed container. (apache#7943)
  HDDS-12361. Mark testGetSnapshotDiffReportJob as flaky
  HDDS-12375. Random object created and used only once (apache#7933)
  HDDS-12335. Fix ozone admin namespace summary to give complete output (apache#7908)
  HDDS-12364. Require Override annotation for overridden methods (apache#7923)
  HDDS-11530. Support listMultipartUploads max uploads and markers (apache#7817)
  HDDS-11867. Remove code paths for non-Ratis SCM. (apache#7911)
  HDDS-12363. Add import options to IntelliJ IDEA style settings (apache#7921)
  HDDS-12215. Mark testContainerStateMachineRestartWithDNChangePipeline as flaky
  HDDS-12331. BlockOutputStream.failedServers is not thread-safe (apache#7885)
  HDDS-12188. Move server-only upgrade classes from hdds-common to hdds-server-framework (apache#7903)
  HDDS-12362. Remove temporary checkstyle suppression file (apache#7920)
  HDDS-12343. Fix spotbugs warnings in Recon (apache#7902)
  HDDS-12286. Fix license headers and imports for ozone-tools (apache#7919)
  HDDS-12284. Fix license headers and imports for ozone-s3-secret-store (apache#7917)
  HDDS-12275. Fix license headers and imports for ozone-integration-test (apache#7904)
  HDDS-12164. Rename and deprecate DFSConfigKeysLegacy config keys (apache#7803)
  HDDS-12283. Fix license headers and imports for ozone-recon-codegen (apache#7916)
  ...

Conflicts:
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/checksum/ContainerChecksumTreeManager.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/checksum/ContainerMerkleTreeWriter.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueContainerCheck.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueHandler.java
hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/checksum/ContainerMerkleTreeTestUtils.java
hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/checksum/TestContainerMerkleTreeWriter.java
hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/keyvalue/TestKeyValueContainerCheck.java
hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/ozoneimpl/TestBackgroundContainerDataScanner.java
hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/ozoneimpl/TestOnDemandContainerDataScanner.java
hadoop-ozone/dist/src/main/smoketest/admincli/container.robot
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/dn/scanner/TestBackgroundContainerDataScannerIntegration.java
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/dn/scanner/TestBackgroundContainerMetadataScannerIntegration.java
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/dn/scanner/TestOnDemandContainerDataScannerIntegration.java
errose28 marked this pull request as draft April 11, 2025 17:15
hemantk-12 (Contributor) left a comment:

Thanks @errose28 for the patch.

StringUtils.bytes2Hex(expected.asReadOnlyByteBuffer()),
StringUtils.bytes2Hex(actual.asReadOnlyByteBuffer()),
block.getBlockID());
chunkHealthy = false;
Contributor commented:

We can break the loop here because the chunk is the smallest verification unit.
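
For illustration, the suggested early exit would look roughly like this; the loop structure and the checksum helpers outside the excerpt above are assumptions.

// Hedged sketch of the suggested early exit; the loop structure and the
// checksum helpers around the quoted excerpt are assumptions.
for (ContainerProtos.ChunkInfo chunk : block.getChunks()) {
  ByteString expected = chunkChecksumFromMetadata(chunk);
  ByteString actual = computeChecksumFromDisk(chunk);
  if (!expected.equals(actual)) {
    LOG.debug("Checksum mismatch in block {}: expected {}, actual {}",
        block.getBlockID(),
        StringUtils.bytes2Hex(expected.asReadOnlyByteBuffer()),
        StringUtils.bytes2Hex(actual.asReadOnlyByteBuffer()));
    chunkHealthy = false;
    break; // a chunk is the smallest verification unit, so stop at the first mismatch
  }
}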

MISSING_CONTAINER_FILE,
MISSING_CHUNKS_DIR,
MISSING_CHUNK_FILE,
MISSING_DATA_FILE,
Contributor commented:

qq: Is this for the case where the *.block file is missing?

Contributor Author (errose28) replied:

Yes. I changed the name so it is agnostic of container layout.

errose28 added 6 commits May 14, 2025 18:58
…anner-builds-mt

* HDDS-10239-container-reconciliation: (433 commits)
  Revert "HDDS-11232. Spare InfoBucket RPC call for the FileSystem#getFileStatus calls for the general case. (apache#6988)" (apache#8358)
  HDDS-12993. Use DatanodeID in ReconNodeManager. (apache#8456)
  HDDS-13024. HTTP connect to 0.0.0.0 failed (apache#8439)
  HDDS-12914. Bump awssdk to 2.31.40, test ResumableFileDownload (apache#8455)
  HDDS-4677. Document Ozone Ports and Connection End Points (apache#8226)
  HDDS-13044. Remove DatanodeDetails#getUuid usages from hdds-common/client/container-service (apache#8462)
  HDDS-12568. Implement MiniOzoneCluster.Service for Recon (apache#8452)
  HDDS-13046. Add .vscode to .gitignore (apache#8461)
  HDDS-13022. Split up exclusive size tracking for key and directory cleanup in SnapshotInfo (apache#8433)
  HDDS-12966. DBDefinitionFactory should not throw InvalidArnException (apache#8454)
  HDDS-12948. [Snapshot] Increase `ozone.om.fs.snapshot.max.limit` default value to 10k (apache#8376)
  HDDS-11904. Fix HealthyPipelineSafeModeRule logic. (apache#8386)
  HDDS-12989. Throw CodecException for the Codec byte[] methods (apache#8444)
  HDDS-11298. Support owner field in the S3 listBuckets request (apache#8315)
  HDDS-12810. Check and reserve space atomically in VolumeChoosingPolicy (apache#8360)
  HDDS-13016. Add a getAllNodeCount() method to NodeManager. (apache#8445)
  HDDS-12299. Merge OzoneAclConfig into OmConfig (apache#8383)
  HDDS-12501. Remove OMKeyInfo from SST files in backup directory. (apache#8214)
  HDDS-13021. Make callId unique for each request in AbstractKeyDeletingService (apache#8432)
  HDDS-13000. [Snapshot] listSnapshotDiffJobs throws NPE. (apache#8418)
  ...
aswinshakil (Member) left a comment:

One thing that struck me, though it's not related to this PR and could be a separate JIRA: we shouldn't allow reconciliation of a container that is already reconciling. That can lead to other bugs where we update the metadata or the data in parallel.

errose28 commented May 19, 2025

> We shouldn't allow reconciliation of a container that is already reconciling. This can lead to some other bugs where we update the metadata or the data in parallel.

Most of this should be handled for us by the ReplicationSupervisor and tested generally here. However, this depends on how we define the uniqueness of the reconciliation commands, and here I think we do have a problem as you mentioned: we can get multiple simultaneous reconciliation tasks for the same container if the peer list is different. We should fix this in a follow-up so that ReconcileContainerCommand equality is based only on the container ID, and add a test for this that is unique to reconciliation commands.

Update: Filed HDDS-13086 for this.
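
As a rough illustration of the follow-up filed in HDDS-13086, basing command equality on the container ID alone might look like this simplified stand-in (not the real ReconcileContainerCommand class):

// Hedged sketch of the follow-up idea: command identity is the container ID
// alone, so the ReplicationSupervisor can deduplicate reconciliation tasks
// even when their peer lists differ. Simplified stand-in, not the real class.
final class ReconcileContainerCommandKey {
  private final long containerID;

  ReconcileContainerCommandKey(long containerID) {
    this.containerID = containerID;
  }

  @Override
  public boolean equals(Object other) {
    if (this == other) {
      return true;
    }
    if (!(other instanceof ReconcileContainerCommandKey)) {
      return false;
    }
    return containerID == ((ReconcileContainerCommandKey) other).containerID;
  }

  @Override
  public int hashCode() {
    return Long.hashCode(containerID);
  }
}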

errose28 (Contributor, author) commented:

Thanks for the review @aswinshakil. All comments should be addressed. I still need to stabilize this test on our branch before trying to get green CI with this change.

aswinshakil (Member) commented:

The changes for the scan gap look good. I see there are some checkstyle-related CI failures.

aswinshakil (Member) left a comment:

TestContainerCommandReconciliation#testContainerChecksumChunkCorruption seems to be failing; I tried to look into it. There seems to be an issue with the scan container logic. Can you take a look at it?


errose28 commented Jun 3, 2025

@aswinshakil I added the test fixes you suggested for the flaky tests on the branch to this PR:

  • TestFilePerBlockStrategy running out of space
  • TestContainerCommandReconciliation#testDataChecksumReportedAtSCM timing out when SCM was restarted

I also fixed TestContainerCommandReconciliation#testContainerChecksumChunkCorruption, which started failing in this PR. The test was introducing corruption in a non-standard way, by changing the replica's stored client checksums rather than its data, which is something we can't recover from. I changed it to corrupt the actual data instead.
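
A minimal sketch of the data-level corruption described here, assuming the test can resolve the replica's block file path; the helper is hypothetical, not the code merged in this PR.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hedged sketch of corrupting the replica's actual data (rather than its
// stored client checksums) so reconciliation still has intact data elsewhere
// to repair from. The helper and how the block file is located are assumptions.
final class CorruptionTestUtil {
  private CorruptionTestUtil() { }

  static void corruptBlockFile(Path blockFile) throws IOException {
    byte[] data = Files.readAllBytes(blockFile);
    // Flip a handful of bytes in the middle of the file so the on-disk data
    // no longer matches the client checksums stored in RocksDB.
    int start = data.length / 2;
    for (int i = start; i < Math.min(start + 8, data.length); i++) {
      data[i] = (byte) ~data[i];
    }
    Files.write(blockFile, data);
  }
}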

aswinshakil marked this pull request as ready for review June 4, 2025 10:50
aswinshakil (Member) left a comment:

Thanks for working on this @errose28. LGTM+1, we can get this merged.

errose28 merged commit 9a445ed into apache:HDDS-10239-container-reconciliation Jun 4, 2025
53 checks passed

errose28 commented Jun 4, 2025

Thanks for the reviews @aswinshakil @kerneltime
