-
Notifications
You must be signed in to change notification settings - Fork 587
HDDS-8782. Improve Volume Scanner Health checks. #4867
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Change was lost in the cherry-pick process
Subdirs will now be created on upgraded clusters as well.
* tmp-dir-refactor: Checkstyle Add startup/shutdown tests to TestHddsVolume
* tmp-dir-refactor: Fix assignment to tmp dir name fields
* master: (73 commits) HDDS-8587. Test that CertificateClient can store multiple rootCA certificates (apache#4852) HDDS-8801. ReplicationManager: Add metric to count how often replication is throttled (apache#4864) HDDS-8477. Unit test for Snapdiff using tombstone entries (apache#4678) HDDS-7507. [Snapshot] Implement List Snapshot API Pagination (apache#4065) (apache#4861) HDDS-8373. Document that setquota doesn't accept decimals (apache#4856) HDDS-8779. Recon - Expose flag for enable/disable of heatmap. (apache#4845) HDDS-8677. Ozone admin OM CLI command for block tokens (apache#4760) HDDS-8164. Authorize secret key APIs (apache#4597) HDDS-7945. Integrate secret keys to SCM snapshot (apache#4549) HDDS-8003. E2E integration test cases for block tokens (apache#4547) HDDS-7831. Use symmetric secret key to sign and verify token (apache#4417) HDDS-7830. SCM API for OM and Datanode to get secret keys (apache#4345) HDDS-7734. Implement symmetric SecretKeys lifescycle management in SCM (apache#4194) HDDS-8679. Add dedicated, configurable thread pool for OM gRPC server (apache#4771) HDDS-8790. Split EC acceptance tests (apache#4855) HDDS-8714. TestScmHAFinalization: mark testFinalizationWithRestart as flaky, enable other test cases HDDS-8787. Reduce ozone sh calls in robot tests (apache#4854) HDDS-8774. Log allocation stack trace for leaked CodecBuffer (apache#4840) HDDS-8729. Add metric for count of blocks pending deletion on datanode (apache#4800) HDDS-8780. Leak of ManagedChannel in HASecurityUtils (apache#4850) ...
* tmp-dir-refactor: (73 commits) HDDS-8587. Test that CertificateClient can store multiple rootCA certificates (apache#4852) HDDS-8801. ReplicationManager: Add metric to count how often replication is throttled (apache#4864) HDDS-8477. Unit test for Snapdiff using tombstone entries (apache#4678) HDDS-7507. [Snapshot] Implement List Snapshot API Pagination (apache#4065) (apache#4861) HDDS-8373. Document that setquota doesn't accept decimals (apache#4856) HDDS-8779. Recon - Expose flag for enable/disable of heatmap. (apache#4845) HDDS-8677. Ozone admin OM CLI command for block tokens (apache#4760) HDDS-8164. Authorize secret key APIs (apache#4597) HDDS-7945. Integrate secret keys to SCM snapshot (apache#4549) HDDS-8003. E2E integration test cases for block tokens (apache#4547) HDDS-7831. Use symmetric secret key to sign and verify token (apache#4417) HDDS-7830. SCM API for OM and Datanode to get secret keys (apache#4345) HDDS-7734. Implement symmetric SecretKeys lifescycle management in SCM (apache#4194) HDDS-8679. Add dedicated, configurable thread pool for OM gRPC server (apache#4771) HDDS-8790. Split EC acceptance tests (apache#4855) HDDS-8714. TestScmHAFinalization: mark testFinalizationWithRestart as flaky, enable other test cases HDDS-8787. Reduce ozone sh calls in robot tests (apache#4854) HDDS-8774. Log allocation stack trace for leaked CodecBuffer (apache#4840) HDDS-8729. Add metric for count of blocks pending deletion on datanode (apache#4800) HDDS-8780. Leak of ManagedChannel in HASecurityUtils (apache#4850) ...
* tmp-dir-refactor: Checkstyle
Required for container scanner integration.
sumitagrawl
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@errose28 Plz check cleanup of temp files created. others seems ok.
...iner-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/DiskCheckUtil.java
Show resolved
Hide resolved
* master: (96 commits) HDDS-8586 Recon. - API for Count of deletePending keys and amount of data mapped to such keys. (apache#4923) HDDS-8908. Intermittent failure in TestBlockDeletion#testBlockDeletion (apache#4958) HDDS-8910. Replace LockManager with striped lock in ContainerStateManager (apache#4962) HDDS-8917. Move protobuf conversion out of the lock in PipelineStateManagerImpl (apache#4965) HDDS-8825. Use apache/hadoop 3.3.5 docker image (apache#4963) HDDS-8906. Avoid stream when getting in-service healthy nodes (apache#4960) HDDS-8907. Store volume count when storage report is updated (apache#4957) HDDS-8905. PipelineManager metrics should not be synchronized (apache#4959) HDDS-8553. Improve scanner integration tests. (apache#4936) HDDS-8854. Avoid unnecessary DatanodeDetails creation for NodeStateManager lookup (apache#4925) HDDS-8315. [Snapshot] Added unit tests for SnapshotDiffManager (apache#4716) HDDS-7968. [Snapshot] Improve KeyDeletingService to reclaim eligible key blocks in snapshot's deletedTable (apache#4935) HDDS-8838. Update default datanode check empty containter on disk to false (apache#4937) HDDS-8763. Support RocksDB iterator with ByteBuffer. (apache#4942) HDDS-8543. FSO directory should reflect bucket/cluster default replication (apache#4947) HDDS-8898. Replication limit should not be less than reconstruction weight (apache#4954) HDDS-8739. Snapdiff should return complete absolute path in Diff Entry (apache#4823) HDDS-8908. Mark TestBlockDeletion#testBlockDeletion as flaky HDDS-8534. Support asynchronous service logging (apache#4663) HDDS-8879. Cleanup SecurityConfig and related class initialization (apache#4921) ...
Some tests failing
* tmp-dir-refactor: (99 commits) HDDS-8586 Recon. - API for Count of deletePending keys and amount of data mapped to such keys. (apache#4923) Fix SCM HA finalization compat test HDDS-8908. Intermittent failure in TestBlockDeletion#testBlockDeletion (apache#4958) HDDS-8910. Replace LockManager with striped lock in ContainerStateManager (apache#4962) HDDS-8917. Move protobuf conversion out of the lock in PipelineStateManagerImpl (apache#4965) HDDS-8825. Use apache/hadoop 3.3.5 docker image (apache#4963) HDDS-8906. Avoid stream when getting in-service healthy nodes (apache#4960) HDDS-8907. Store volume count when storage report is updated (apache#4957) HDDS-8905. PipelineManager metrics should not be synchronized (apache#4959) HDDS-8553. Improve scanner integration tests. (apache#4936) HDDS-8854. Avoid unnecessary DatanodeDetails creation for NodeStateManager lookup (apache#4925) HDDS-8315. [Snapshot] Added unit tests for SnapshotDiffManager (apache#4716) HDDS-7968. [Snapshot] Improve KeyDeletingService to reclaim eligible key blocks in snapshot's deletedTable (apache#4935) HDDS-8838. Update default datanode check empty containter on disk to false (apache#4937) HDDS-8763. Support RocksDB iterator with ByteBuffer. (apache#4942) HDDS-8543. FSO directory should reflect bucket/cluster default replication (apache#4947) HDDS-8898. Replication limit should not be less than reconstruction weight (apache#4954) HDDS-8739. Snapdiff should return complete absolute path in Diff Entry (apache#4823) HDDS-8908. Mark TestBlockDeletion#testBlockDeletion as flaky HDDS-8534. Support asynchronous service logging (apache#4663) ...
* master: HDDS-8914. Datanode may fail to start due to duplicate VolumeInfoMetrics (apache#4966) HDDS-8921. Add support for EC in Freon SCM block generator (apache#4982) HDDS-8927. Metadata scanner should not scan unhealthy containers. (apache#4976) HDDS-8929. Avoid list allocation for pipeline search (apache#4980) HDDS-8778. Support recursive volume delete using Ozone sh command. (apache#4842) HDDS-8885. Quota repair count enable quota feature for old bucket/volume. (apache#4941) HDDS-8771. Refactor volume level tmp directory for generic usage. (apache#4838) HDDS-8922. Random EC read pipeline ID causes XceiverClient cache churn (apache#4971)
sumitagrawl
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@errose28 Please check few comments given...
...c/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java
Outdated
Show resolved
Hide resolved
...tainer-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/HddsVolume.java
Show resolved
Hide resolved
...ner-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java
Outdated
Show resolved
Hide resolved
sumitagrawl
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@errose28 LGTM +1
* master: HDDS-8555. [Snapshot] When snapshot feature is disabled, block OM startup if there are still snapshots in the system (apache#4994) HDDS-8782. Improve Volume Scanner Health checks. (apache#4867) HDDS-8447. Datanodes should not process container deletes for failed volumes. (apache#4901) HDDS-5869. Added support for stream on S3Gateway write path (apache#4970) HDDS-8859. [Snapshot] Return failure message to client for a failed snapshot diff jobs (apache#4993) HDDS-8939. [Snapshot] isBlockLocationSame check should be skipped if object is not OmKeyInfo. (apache#4991) HDDS-8923. Expose XceiverClient cache stats as metrics (apache#4979) HDDS-8913. ContainerManagerImpl: reduce processing while locked (apache#4967) HDDS-8935. [Snapshot] Fallback to full diff if getDetlaFiles from compaction DAG fails (apache#4986) HDDS-8911. Update Hadoop to 3.3.6 (apache#4985) HDDS-8931. Allow EC PipelineChoosingPolicy to be defined separately from Ratis (apache#4983) HDDS-8895. Support dynamic change of ozone.readonly.administrators in SCM (apache#4977) HDDS-6814. Make OM service ID optional for `ozone s3` commands if only one is defined in config (apache#4953) HDDS-8925. BaseFreonGenerator may not complete if last attempts fail (apache#4975) HDDS-7100. Container scanner incorrectly marks containers unhealthy when DN is shutdown (apache#4951) HDDS-8919. Allow EC pipelines to be created and then added to PipelineManager in two steps (apache#4968) HDDS-8901. Enable mTLS for InterSCMGrpcProtocol. (apache#4964) Conflicts: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/interfaces/Container.java hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueContainer.java hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueContainerCheck.java hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueHandler.java hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/common/ContainerTestUtils.java
What changes were proposed in this pull request?
The volume scanner/disk checker currently only checks filesystem permissions and directory existence. It should also do write, sync, and read back from a file as well to touch the actual hardware and not just information in the OS cache.
This PR switches from using the
DiskCheckerclass from Hadoop to a new Ozone specificDiskCheckUtilthat we have more control over. The Hadoop implementation was removed for the following reasons:The following criteria are used to determine if a volume has failed. If anyone has suggestions for better heuristics please let me know.
What is the link to the Apache JIRA
HDDS-8782
How was this patch tested?