HDDS-11514. Set optimal default values for delete configurations based on live cluster testing. #8766
Conversation
@aryangupta1998 Thanks for the patch.
Could you share some of these results? What are the tradeoffs that need to be considered when deciding what is optimal?
@ivandika3, sharing some results from my testing. These three properties are interdependent:
Test setup: HA cluster with 10 Datanodes
Based on my observations:
Considering the default Ratis log appender size of 32 MB, I tuned the configs as follows:
Results:
OM's 50k-key batch consumed ~10 MB, and SCM's batch used ~5 MB, both safely within Ratis limits. I also tested with:
This worked fine too, but I avoided setting such high defaults. Additionally, I've introduced new metrics, added them to Grafana, and created a lightweight dashboard to track deletion progress.
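For readers skimming the thread, a back-of-the-envelope check of those numbers might look like the sketch below. It is purely illustrative arithmetic based on the figures quoted above (50k keys per OM batch, ~10 MB observed, 32 MB default Ratis log appender buffer); the class and variable names are hypothetical, not Ozone code.

```java
// Illustrative sanity check: does a deletion batch fit comfortably
// within the default Ratis log appender buffer? All figures come from
// the test results quoted in this thread; names here are hypothetical.
public class DeleteBatchHeadroom {
  // Default Ratis log appender buffer size referenced above (32 MB).
  private static final long APPENDER_LIMIT_BYTES = 32L * 1024 * 1024;

  public static void main(String[] args) {
    long keysPerBatch = 50_000;                  // OM keys per batch
    long observedBatchBytes = 10L * 1024 * 1024; // ~10 MB observed

    long bytesPerKey = observedBatchBytes / keysPerBatch;    // ~209 bytes/key
    double usedFraction = (double) observedBatchBytes / APPENDER_LIMIT_BYTES;

    System.out.printf("~%d bytes/key; batch uses %.0f%% of the %d MB appender buffer%n",
        bytesPerKey, usedFraction * 100, APPENDER_LIMIT_BYTES / (1024 * 1024));
  }
}
```

At roughly 31% utilization, a 50k-key batch leaves substantial headroom under the 32 MB limit, which is consistent with the comment above about deliberately not setting even higher defaults.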
ivandika3 left a comment
@aryangupta1998 Thanks for sharing the results. The configs look reasonable. LGTM +1.
ashishkumar50 left a comment
Thanks @aryangupta1998 for the improvement, LGTM.
Thanks @aryangupta1998 for the patch and @ashishkumar50 for the review.
@aryangupta1998 did you also check the directory deleting service for both deep and wide filesystem trees?
* master: (90 commits)
  HDDS-13308. OM should expose Ratis config for increasing pending write limits (apache#8668)
  HDDS-8903. Add validation for ozone.om.snapshot.db.max.open.files. (apache#8787)
  HDDS-13429. Custom metadata headers with uppercase characters are not supported (apache#8805)
  HDDS-13448. DeleteBlocksCommandHandler thread stop for normal exception (apache#8816)
  HDDS-13346. Intermittent failure in TestCloseContainer#testContainerChecksumForClosedContainer (apache#8771)
  HDDS-13125. Add metrics for monitoring the SST file pruning threads. (apache#8764)
  HDDS-13367. [Docs] User doc for container balancer. (apache#8726)
  HDDS-13200. OM RocksDB Grafana Dashbroad shows no data on all panels (apache#8577)
  HDDS-13428. Recon - Retrigger of build whole NSSummary tree task submission inconsistency. (apache#8793)
  HDDS-13378. [Docs] Add a Production page under Getting Started (apache#8734)
  HDDS-13403. [Docs] Make feature proposal process more visible. (apache#8758)
  HDDS-11797. Remove cyclic dependency between SCMSafeModeManager and SafeModeRules (apache#8782)
  HDDS-13213. KeyDeletingService should limit task size by both key count and serialized size. (apache#8757)
  HDDS-13387. OMSnapshotCreateRequest logs invalid warning about DefaultReplicationConfig (apache#8760)
  HDDS-13405. ozone admin container create runs forever without kinit (apache#8765)
  HDDS-11514. Set optimal default values for delete configurations based on live cluster testing. (apache#8766)
  HDDS-13376. Add server-side limit note to ozone sh snapshot diff --page-size option (apache#8791)
  HDDS-11679. Support multiple S3Gs in MiniOzoneCluster (apache#8733)
  HDDS-13424. Use lsof instead of fuser to find if file is used in AbstractTestChunkManager (apache#8790)
  HDDS-13427. Bump awssdk to 2.31.78 (apache#8792)
  ...
HDDS-11514. Set optimal default values for delete configurations based on live cluster testing. (apache#8766)
HDDS-11514. Set optimal default values for delete configurations based on live cluster testing. (apache#8766)
(cherry picked from commit 3171688)
Conflicts:
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java
Change-Id: I5e5becdd8f53d25d7325fd6b1081bf2fe81e6b84
What changes were proposed in this pull request?
Background deletion configurations were initially chosen based on educated guesses and had not been validated against real-world workloads. I have now tested large-scale deletions involving millions of files and directories on a live cluster. Based on the results, I am updating the relevant property values to better reflect practical performance and scalability.
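As a hedged illustration of the tuning surface (the exact properties and new defaults changed by this PR are not listed in this excerpt), the sketch below shows how deletion-related settings can be overridden through OzoneConfiguration. The property keys are ones I believe exist in Ozone, but the values are placeholders rather than the defaults introduced here; consult ozone-default.xml for the authoritative names and values.

```java
import org.apache.hadoop.hdds.conf.OzoneConfiguration;

// Sketch only: placeholder values, not the defaults set by this PR.
public class DeleteConfigOverrideExample {
  public static void main(String[] args) {
    OzoneConfiguration conf = new OzoneConfiguration();

    // Cap on keys the OM KeyDeletingService processes per task run
    // (assumed key name; verify against ozone-default.xml).
    conf.setInt("ozone.key.deleting.limit.per.task", 50_000);

    // Wake-up interval for the datanode block deleting service
    // (assumed key name; verify against ozone-default.xml).
    conf.set("ozone.block.deleting.service.interval", "60s");

    System.out.println(conf.get("ozone.key.deleting.limit.per.task"));
  }
}
```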
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-11514
How was this patch tested?
Tested manually.