HDDS-13092. Container scanner should trigger volume scan when marking a container unhealthy #8603

Tejaskriya · 2025-06-11T06:30:57Z

What changes were proposed in this pull request?

If any of the container scanners (background or on-demand, data or metadata) find an unhealthy container, they should trigger an on-demand volume scan to check if the underlying volume has a larger issue beyond that container.

This PR triggers a volume check whenever a scanner marks a container unhealthy.
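The behaviour described above can be illustrated with a minimal, self-contained sketch. It is not the code from this patch: `UnhealthyContainerHandlerSketch`, `ContainerInfo`, and the `Consumer<String>` callback are hypothetical stand-ins for the scanner helper, the container data, and the volume-scan trigger. The null check on the volume is also an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical stand-in for the scanner-side logic; not the actual Ozone class. */
final class UnhealthyContainerHandlerSketch {

  /** Minimal container view: an ID plus the volume it is stored on (may be null). */
  static final class ContainerInfo {
    final long id;
    final String volume;

    ContainerInfo(long id, String volume) {
      this.id = id;
      this.volume = volume;
    }
  }

  // Callback standing in for "request a scan of this volume".
  private final Consumer<String> volumeScanTrigger;
  private final List<Long> unhealthyIds = new ArrayList<>();

  UnhealthyContainerHandlerSketch(Consumer<String> volumeScanTrigger) {
    this.volumeScanTrigger = volumeScanTrigger;
  }

  /** Records an unhealthy container and asks for a scan of its volume. */
  void handleScanResult(ContainerInfo container, boolean healthy) {
    if (healthy) {
      return; // Healthy containers never trigger a volume scan.
    }
    unhealthyIds.add(container.id);
    // Check for a missing volume before logging or triggering anything.
    if (container.volume != null) {
      volumeScanTrigger.accept(container.volume);
    }
  }

  List<Long> unhealthyIds() {
    return Collections.unmodifiableList(unhealthyIds);
  }
}
```

In this sketch every unhealthy result requests a scan, so a caller would still need some form of de-duplication or throttling to avoid repeated scans of the same volume.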

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-13092

How was this patch tested?

Added unit test coverage

Tejaskriya · 2025-06-11T07:15:25Z

@ptlrs Could you please review this PR?

adoroszlai · 2025-06-11T08:32:41Z

nit: any scan "triggered" by some event is by definition an "on-demand" scan (vs. background scan, which runs on schedule). So can we omit "on-demand" from the title?

Tejaskriya · 2025-06-11T08:37:39Z

@adoroszlai makes sense, I'll change the title. Thanks!

aryangupta1998

Thanks for the patch @Tejaskriya.
IIUC, containers of a volume are scanned in a single iteration. If a volume has multiple unhealthy containers, we may end up triggering a volume scan multiple times for the same volume using:
StorageVolumeUtil.onFailure(containerData.getVolume());
To avoid redundant volume scans, we can track which volumes have already been scanned in the current iteration. One way is to pass a Set to scanContainer():

public void scanContainer(Container<?> c, Set<Path> volumesAlreadyChecked)

Then, within scanContainer(), only trigger the volume scan if it hasn't already been triggered:

Path volumePath = containerData.getVolume().getStorageDir().getPath();
if (volumesAlreadyChecked.add(volumePath)) {
    LOG.info("Triggering a volume scan for volume [{}] as unhealthy container [{}] was on it.",
        volumePath, containerId);
    StorageVolumeUtil.onFailure(containerData.getVolume());
}
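As a runnable illustration of the idea above, the following sketch shows how `Set.add` returning `false` for an already-present path suppresses repeated triggers within one iteration. `VolumeScanDedupSketch` and its counter are hypothetical. The counter stands in for the call to `StorageVolumeUtil.onFailure(...)`, which is not invoked here.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of per-iteration de-duplication; not the Ozone implementation. */
final class VolumeScanDedupSketch {

  // Counts how many volume scans would have been requested.
  private int triggeredScans = 0;

  /** Requests a volume scan only the first time a volume is seen in this iteration. */
  void onUnhealthyContainer(Path volumePath, Set<Path> volumesAlreadyChecked) {
    // Set.add returns true only if the path was not already in the set.
    if (volumesAlreadyChecked.add(volumePath)) {
      triggeredScans++; // Stand-in for StorageVolumeUtil.onFailure(volume).
    }
  }

  int triggeredScans() {
    return triggeredScans;
  }

  public static void main(String[] args) {
    VolumeScanDedupSketch sketch = new VolumeScanDedupSketch();
    Set<Path> checked = new HashSet<>(); // One set per scan iteration.
    sketch.onUnhealthyContainer(Paths.get("/data/disk1"), checked);
    sketch.onUnhealthyContainer(Paths.get("/data/disk1"), checked);
    sketch.onUnhealthyContainer(Paths.get("/data/disk2"), checked);
    System.out.println(sketch.triggeredScans());
  }
}
```

The set must be created once per scan iteration and discarded afterwards. Otherwise a volume would never be rescanned in later iterations.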

Tejaskriya · 2025-06-18T05:31:19Z

@aryangupta1998 the scheduling logic already throttles these requests. In org.apache.hadoop.ozone.container.common.volume.ThrottledAsyncChecker#schedule, if multiple scan requests for the same volume arrive within a configured minimum gap, the repeated requests are skipped.
This duration seems to default to 10m:
@Config(key = "disk.check.min.gap", defaultValue = "10m",
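
A minimal sketch of this kind of minimum-gap throttling is shown below. It is a simplification for illustration and does not reproduce `ThrottledAsyncChecker`: the class `MinGapThrottleSketch`, its `schedule` signature, and the choice to measure the gap from the last accepted request are all assumptions.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical min-gap throttle; a simplification, not ThrottledAsyncChecker itself. */
final class MinGapThrottleSketch {

  private final long minGapMillis;
  // Time of the last accepted request per volume.
  private final Map<String, Long> lastAccepted = new HashMap<>();

  MinGapThrottleSketch(long minGapMillis) {
    this.minGapMillis = minGapMillis;
  }

  /**
   * Returns true if the scan request is accepted, or false if it is skipped
   * because the previous accepted request for the same volume is too recent.
   */
  synchronized boolean schedule(String volume, long nowMillis) {
    Long last = lastAccepted.get(volume);
    if (last != null && nowMillis - last < minGapMillis) {
      return false; // Still inside the minimum gap: skip this request.
    }
    lastAccepted.put(volume, nowMillis);
    return true;
  }
}
```

With a gap of 10 minutes (600000 ms), a second request for the same volume one second after the first is skipped, while a request for a different volume, or one arriving after the gap has elapsed, is accepted.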

Tejaskriya · 2025-07-08T05:45:35Z

@ptlrs requesting a review for this PR

ptlrs

Thanks @Tejaskriya for the PR, it mostly looks good.

...r-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/ContainerScanHelper.java

Tejaskriya · 2025-07-10T09:58:23Z

Thanks for the review and approval @ptlrs. I have made the changes.
@aryangupta1998 could you please review and approve the PR if it seems good?

aryangupta1998

LGTM!

...ain/java/org/apache/hadoop/ozone/container/ozoneimpl/BackgroundContainerMetadataScanner.java

...r-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/ContainerScanHelper.java

...est/java/org/apache/hadoop/ozone/container/ozoneimpl/TestBackgroundContainerDataScanner.java

Tejaskriya · 2025-07-15T05:31:18Z

@ptlrs @errose28 @aryangupta1998 could you please review the patch again?

aryangupta1998

Thanks for updating the patch @Tejaskriya, LGTM!

errose28

Just one minor comment here and in this thread then I think this is good to go.

...r-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/ContainerScanHelper.java

...est/java/org/apache/hadoop/ozone/container/ozoneimpl/TestBackgroundContainerDataScanner.java

.../src/test/java/org/apache/hadoop/ozone/container/ozoneimpl/TestOnDemandContainerScanner.java

ptlrs

Thanks for the updates @Tejaskriya.

errose28

Thanks for working on this @Tejaskriya

adoroszlai · 2025-07-22T04:12:46Z

Please feel free to remove co-author information when the only contribution is merging master into the PR branch.

* master: (730 commits)
  HDDS-13083. Handle cases where block deletion generates tree file before scanner (apache#8565)
  HDDS-12982. Reduce log level for snapshot validation failure (apache#8851)
  HDDS-13396. Documentation: Improve the top-level overview page for new users. (apache#8753)
  HDDS-13176. containerIds table value format change to proto from string (apache#8589)
  HDDS-13449. Incorrect Interrupt Handling for DirectoryDeletingService and KeyDeletingService (apache#8817)
  HDDS-2453. Add Freon tests for S3 MPU Keys (apache#8803)
  HDDS-13237. Container data checksum should contain block IDs. (apache#8773)
  HDDS-13489. Fix SCMBlockdeleting unnecessary iteration in corner case. (apache#8847)
  HDDS-13464. Make ozone.snapshot.filtering.service.interval reconfigurable (apache#8825)
  HDDS-13473. Amend validation for OZONE_OM_SNAPSHOT_DB_MAX_OPEN_FILES (apache#8829)
  HDDS-13435. Add an OzoneManagerAuthorizer interface (apache#8840)
  HDDS-8565. Recon memory leak in NSSummary (apache#8823).
  HDDS-12852. Implement a sliding window counter utility (apache#8498)
  HDDS-12000. Add unit test for RatisContainerSafeModeRule and ECContainerSafeModeRule (apache#8801)
  HDDS-13092. Container scanner should trigger volume scan when marking a container unhealthy (apache#8603)
  HDDS-13070. OM Follower changes to create and place sst files from hardlink file. (apache#8761)
  HDDS-13482. Mark testWriteStateMachineDataIdempotencyWithClosedContainer as flaky
  HDDS-13481. Fix success latency metric in SCM panels of deletion grafana dashboard (apache#8835)
  HDDS-13468. Update default value of ozone.scm.ha.dbtransactionbuffer.flush.interval. (apache#8834)
  HDDS-13410. Control block deletion for each DN from SCM. (apache#8767)
  ...

hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/container/ContainerReplicaInfo.java
hadoop-ozone/cli-admin/src/main/java/org/apache/hadoop/hdds/scm/cli/container/ReconcileSubcommand.java
hadoop-ozone/cli-admin/src/test/java/org/apache/hadoop/hdds/scm/cli/container/TestReconcileSubcommand.java

HDDS-13092. Container scanner should trigger on-demand volume scan when marking a container unhealthy

6c53f89

Tejaskriya added the scanners label (Changes related to datanode container and volume scanners) Jun 11, 2025

Tejaskriya marked this pull request as ready for review June 11, 2025 07:14

Tejaskriya changed the title ~~HDDS-13092. Container scanner should trigger on-demand volume scan when marking a container unhealthy~~ HDDS-13092. Container scanner should trigger volume scan when marking a container unhealthy Jun 11, 2025

aryangupta1998 reviewed Jun 12, 2025

Tejaskriya requested a review from aryangupta1998 June 18, 2025 06:33

Merge remote-tracking branch 'origin/master' into HDDS-13092

e15b545

adoroszlai marked this pull request as draft June 20, 2025 10:17

adoroszlai marked this pull request as ready for review June 20, 2025 11:50

Merge remote-tracking branch 'origin/master' into HDDS-13092

72ddfeb

Tejaskriya marked this pull request as draft July 8, 2025 05:11

fix implementation after merging master

d75ab18

Tejaskriya marked this pull request as ready for review July 8, 2025 05:45

ptlrs approved these changes Jul 10, 2025

...r-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/ContainerScanHelper.java (outdated, resolved)

Move null check to before log

5e01b91

aryangupta1998 approved these changes Jul 10, 2025

errose28 reviewed Jul 10, 2025

Tejaskriya added 7 commits July 14, 2025 11:22

rewrite test, fix logs and other comments

15bdc21

Merge remote-tracking branch 'origin' into HDDS-13092

cf48ca9

Merge remote-tracking branch 'origin' into HDDS-13092

64e6d60

fix merge issues

ae88309

fix tests

af87c8e

fix tests

60ed84d

remove final modifier changes in test

a249fa9

remove final modifier changes in test

2debf20

Tejaskriya requested review from aryangupta1998, errose28 and ptlrs July 15, 2025 05:30

aryangupta1998 approved these changes Jul 17, 2025

errose28 reviewed Jul 17, 2025

...r-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/ContainerScanHelper.java (outdated, resolved)

...est/java/org/apache/hadoop/ozone/container/ozoneimpl/TestBackgroundContainerDataScanner.java (resolved)

revert test changes, move trigger of vol scan

ad2e379

errose28 reviewed Jul 18, 2025

.../src/test/java/org/apache/hadoop/ozone/container/ozoneimpl/TestOnDemandContainerScanner.java (resolved)

ptlrs approved these changes Jul 19, 2025

errose28 approved these changes Jul 21, 2025

errose28 merged commit 498a9c1 into apache:master Jul 21, 2025
81 of 82 checks passed

jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Jul 31, 2025

HDDS-13092. Container scanner should trigger volume scan when marking a container unhealthy (apache#8603)

Co-authored-by: Doroszlai, Attila <[email protected]>

e4c4085

5 participants