Skip to content

Conversation

@ptlrs
Copy link
Contributor

@ptlrs ptlrs commented May 21, 2025

What changes were proposed in this pull request?

Please describe your PR in detail:

  • This PR implements a generic timer-based sliding window class that can be used for checks with intermittent failures.
  • The window stores only the failed results to limit the in-memory data structures.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-12852

How was this patch tested?

Added new unit tests
CI: https://github.com/ptlrs/ozone/actions/runs/16235740233

@errose28 errose28 added the scanners Changes related to datanode container and volume scanners label May 22, 2025
@jojochuang jojochuang requested a review from errose28 June 2, 2025 16:58
Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ptlrs for the patch.

I suggest removing the concept of failure from SlidingWindow, and let it simply track a limited number of event timestamps.

Rename members accordingly:

  • failureTolerance -> maxSize
  • failureTimestamps -> timestamps
  • isFailed() -> isEmpty() (inverting meaning)
  • removeExpiredFailures() -> removeExpired()
  • getFailureCount() -> size() (taken from follow-up PR)

Code tracking volume failures using SlidingWindow should add() only for failed check results, instead of letting the SlidingWindow decide based on an argument.

Copy link
Contributor

@Tejaskriya Tejaskriya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @ptlrs, Please find my comment below

@ptlrs ptlrs requested review from Tejaskriya and adoroszlai July 13, 2025 03:24
@ptlrs
Copy link
Contributor Author

ptlrs commented Jul 13, 2025

Hi @adoroszlai @Tejaskriya, the comments have been addressed. Could you please review this PR again.

Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ptlrs for updating the patch. Implementation looks good, I think the test can be improved a bit.

@ptlrs
Copy link
Contributor Author

ptlrs commented Jul 13, 2025

Thanks for the review @adoroszlai. I have updated the tests.

@ptlrs ptlrs requested a review from adoroszlai July 13, 2025 08:16
@adoroszlai adoroszlai marked this pull request as ready for review July 13, 2025 09:31
Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ptlrs for updating the patch.

@adoroszlai adoroszlai changed the title HDDS-12852. Implement a sliding window based failure counter utility HDDS-12852. Implement a sliding window counter utility Jul 13, 2025
Copy link
Contributor

@Tejaskriya Tejaskriya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch @ptlrs , LGTM

Copy link
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @ptlrs. I have some comments in addition to what has already been left.

@ptlrs ptlrs requested a review from errose28 July 18, 2025 05:30
@ptlrs
Copy link
Contributor Author

ptlrs commented Jul 18, 2025

@errose28 the PR has been updated with the changes. The isFull method is now called isExceeded. We also now allow a window of size 0.

Copy link
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only one comment on functionality left then I think this is ready to merge.

@ptlrs ptlrs requested a review from errose28 July 18, 2025 21:11
Copy link
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates @ptlrs. Next we can start using this to simplify the volume scanners' failure criteria. Should be good to merge once CI finishes.

@ptlrs
Copy link
Contributor Author

ptlrs commented Jul 21, 2025

Thanks for the extensive reviews @adoroszlai @errose28 @Tejaskriya.

@Tejaskriya Tejaskriya merged commit e50365a into apache:master Jul 22, 2025
42 checks passed
@Tejaskriya
Copy link
Contributor

Thanks for the patch @ptlrs and the reviews @adoroszlai @errose28

errose28 added a commit to errose28/ozone that referenced this pull request Jul 30, 2025
* master: (730 commits)
  HDDS-13083. Handle cases where block deletion generates tree file before scanner (apache#8565)
  HDDS-12982. Reduce log level for snapshot validation failure (apache#8851)
  HDDS-13396. Documentation: Improve the top-level overview page for new users. (apache#8753)
  HDDS-13176. containerIds table value format change to proto from string (apache#8589)
  HDDS-13449. Incorrect Interrupt Handling for DirectoryDeletingService and KeyDeletingService (apache#8817)
  HDDS-2453. Add Freon tests for S3 MPU Keys (apache#8803)
  HDDS-13237. Container data checksum should contain block IDs. (apache#8773)
  HDDS-13489. Fix SCMBlockdeleting unnecessary iteration in corner case. (apache#8847)
  HDDS-13464. Make ozone.snapshot.filtering.service.interval reconfigurable (apache#8825)
  HDDS-13473. Amend validation for OZONE_OM_SNAPSHOT_DB_MAX_OPEN_FILES (apache#8829)
  HDDS-13435. Add an OzoneManagerAuthorizer interface (apache#8840)
  HDDS-8565. Recon memory leak in NSSummary (apache#8823).
  HDDS-12852. Implement a sliding window counter utility (apache#8498)
  HDDS-12000. Add unit test for RatisContainerSafeModeRule and ECContainerSafeModeRule (apache#8801)
  HDDS-13092. Container scanner should trigger volume scan when marking a container unhealthy (apache#8603)
  HDDS-13070. OM Follower changes to create and place sst files from hardlink file. (apache#8761)
  HDDS-13482. Mark testWriteStateMachineDataIdempotencyWithClosedContainer as flaky
  HDDS-13481. Fix success latency metric in SCM panels of deletion grafana dashboard (apache#8835)
  HDDS-13468. Update default value of ozone.scm.ha.dbtransactionbuffer.flush.interval. (apache#8834)
  HDDS-13410. Control block deletion for each DN from SCM. (apache#8767)
  ...

hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/container/ContainerReplicaInfo.java
hadoop-ozone/cli-admin/src/main/java/org/apache/hadoop/hdds/scm/cli/container/ReconcileSubcommand.java
hadoop-ozone/cli-admin/src/test/java/org/apache/hadoop/hdds/scm/cli/container/TestReconcileSubcommand.java
jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Jul 31, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

scanners Changes related to datanode container and volume scanners

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants