-
Notifications
You must be signed in to change notification settings - Fork 588
HDDS-13092. Container scanner should trigger volume scan when marking a container unhealthy #8603
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…en marking a container unhealthy
|
@ptlrs Could you please review this PR? |
|
nit: any scan "triggered" by some event is by definition an "on-demand" scan (vs. background scan, which runs on schedule). So can we omit "on-demand" from the title? |
|
@adoroszlai makes sense, I'll change the title. Thanks! |
aryangupta1998
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the patch @Tejaskriya.
IIUC, containers of a volume are scanned in a single iteration. If a volume has multiple unhealthy containers, we may end up triggering a volume scan multiple times for the same volume using:
StorageVolumeUtil.onFailure(containerData.getVolume());
To avoid redundant volume scans, we can track which volumes have already been scanned in the current iteration. One way is to pass a Set to scanContainer():
public void scanContainer(Container<?> c, Set<Path> volumesAlreadyChecked)
Then, within scanContainer(), only trigger the volume scan if it hasn't already been triggered:
Path volumePath = containerData.getVolume().getStorageDir().getPath();
if (volumesAlreadyChecked.add(volumePath)) {
LOG.info("Triggering a volume scan for volume [{}] as unhealthy container [{}] was on it.",
volumePath, containerId);
StorageVolumeUtil.onFailure(containerData.getVolume());
}
|
@aryangupta1998 in the scheduling logic, we have throttling. In |
|
@ptlrs requesting a review for this PR |
ptlrs
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Tejaskriya for the PR, it mostly looks good.
...r-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/ContainerScanHelper.java
Outdated
Show resolved
Hide resolved
|
Thanks for the review and approval @ptlrs . I have made the changes. |
aryangupta1998
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
...ain/java/org/apache/hadoop/ozone/container/ozoneimpl/BackgroundContainerMetadataScanner.java
Outdated
Show resolved
Hide resolved
...r-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/ContainerScanHelper.java
Outdated
Show resolved
Hide resolved
...r-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/ContainerScanHelper.java
Outdated
Show resolved
Hide resolved
...r-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/ContainerScanHelper.java
Outdated
Show resolved
Hide resolved
...est/java/org/apache/hadoop/ozone/container/ozoneimpl/TestBackgroundContainerDataScanner.java
Show resolved
Hide resolved
|
@ptlrs @errose28 @aryangupta1998 could you please review the patch again? |
aryangupta1998
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating the patch @Tejaskriya, LGTM!
errose28
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one minor comment here and in this thread then I think this is good to go.
...r-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/ContainerScanHelper.java
Outdated
Show resolved
Hide resolved
...est/java/org/apache/hadoop/ozone/container/ozoneimpl/TestBackgroundContainerDataScanner.java
Show resolved
Hide resolved
.../src/test/java/org/apache/hadoop/ozone/container/ozoneimpl/TestOnDemandContainerScanner.java
Show resolved
Hide resolved
ptlrs
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates @Tejaskriya.
errose28
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this @Tejaskriya
|
Please feel free remove co-author information when the only contribution is merging |
* master: (730 commits) HDDS-13083. Handle cases where block deletion generates tree file before scanner (apache#8565) HDDS-12982. Reduce log level for snapshot validation failure (apache#8851) HDDS-13396. Documentation: Improve the top-level overview page for new users. (apache#8753) HDDS-13176. containerIds table value format change to proto from string (apache#8589) HDDS-13449. Incorrect Interrupt Handling for DirectoryDeletingService and KeyDeletingService (apache#8817) HDDS-2453. Add Freon tests for S3 MPU Keys (apache#8803) HDDS-13237. Container data checksum should contain block IDs. (apache#8773) HDDS-13489. Fix SCMBlockdeleting unnecessary iteration in corner case. (apache#8847) HDDS-13464. Make ozone.snapshot.filtering.service.interval reconfigurable (apache#8825) HDDS-13473. Amend validation for OZONE_OM_SNAPSHOT_DB_MAX_OPEN_FILES (apache#8829) HDDS-13435. Add an OzoneManagerAuthorizer interface (apache#8840) HDDS-8565. Recon memory leak in NSSummary (apache#8823). HDDS-12852. Implement a sliding window counter utility (apache#8498) HDDS-12000. Add unit test for RatisContainerSafeModeRule and ECContainerSafeModeRule (apache#8801) HDDS-13092. Container scanner should trigger volume scan when marking a container unhealthy (apache#8603) HDDS-13070. OM Follower changes to create and place sst files from hardlink file. (apache#8761) HDDS-13482. Mark testWriteStateMachineDataIdempotencyWithClosedContainer as flaky HDDS-13481. Fix success latency metric in SCM panels of deletion grafana dashboard (apache#8835) HDDS-13468. Update default value of ozone.scm.ha.dbtransactionbuffer.flush.interval. (apache#8834) HDDS-13410. Control block deletion for each DN from SCM. (apache#8767) ... hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/container/ContainerReplicaInfo.java hadoop-ozone/cli-admin/src/main/java/org/apache/hadoop/hdds/scm/cli/container/ReconcileSubcommand.java hadoop-ozone/cli-admin/src/test/java/org/apache/hadoop/hdds/scm/cli/container/TestReconcileSubcommand.java
… a container unhealthy (apache#8603) Co-authored-by: Doroszlai, Attila <[email protected]>
What changes were proposed in this pull request?
If any of the container scanners (background or on-demand, data or metadata) find an unhealthy container, they should trigger an on-demand volume scan to check if the underlying volume has a larger issue beyond that container.
This PR triggers a volume check for when a container is marked unhealthy by the scanners
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-13092
How was this patch tested?
Added unit test coverage