HDDS-13378. [Docs] Add a Production page under Getting Started #8734
This commit introduces a new documentation page outlining the requirements and best practices for deploying Apache Ozone in a production environment. The guide covers:
- System, storage, network, and security requirements.
- Recommended configurations for the Linux kernel, local file system, and Ozone itself.

This addresses HDDS-13378.

Change-Id: I907eff74c755b9232400900bce6b8878d41402fe
Change-Id: I1e883c996a3aeb4fa34812f35ba80974b6b939c3
### Storage Requirements

* **Metadata Storage**: Use SAS SSD or NVMe SSD for metadata (RocksDB and Ratis) to ensure optimal performance.
OM and SCM Storage?
Shall we mention s3g and Recon?
We can also mention that, for the Datanode, only the Ratis directory should be placed on the SSD volume, for write performance reasons. The container DBs can stay on the data volumes (e.g. disk), since they are colocated with the data.
Updated to the best of my knowledge.
OM, SCM, and Datanode require SSDs for metadata. Recon has an internal DB component and would require NVMe in a production cluster. S3 Gateway and HttpFS do not have a storage/DB/metadata component, so they wouldn't need SSDs?
> S3 gateway and httpfs does not have a storage/db/metadata component. They wouldn't need SSDs?
Yeah. Those should be fine without one.
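For readers skimming the thread, the storage guidance agreed on above can be summarized as a small lookup. This is only a sketch of the discussion, not an Ozone API: the role keys, the `METADATA_STORAGE` table, and the helper name are hypothetical names chosen for illustration.

```python
from typing import Optional

# Summary of the review thread above (illustrative only, not an Ozone API).
# The role names used as keys are hypothetical labels for this sketch.
METADATA_STORAGE: dict[str, Optional[str]] = {
    "om": "SSD",        # RocksDB and Ratis metadata
    "scm": "SSD",       # RocksDB and Ratis metadata
    "datanode": "SSD",  # Ratis directory only; container DBs can stay on data disks
    "recon": "NVMe",    # internal DB component
    "s3g": None,        # no storage/DB/metadata component
    "httpfs": None,     # no storage/DB/metadata component
}


def needs_fast_metadata_storage(role: str) -> bool:
    """Return True if the thread recommends SSD/NVMe for this role's metadata."""
    if role not in METADATA_STORAGE:
        raise KeyError(f"unknown role: {role}")
    return METADATA_STORAGE[role] is not None


if __name__ == "__main__":
    for role, media in METADATA_STORAGE.items():
        print(f"{role}: {media or 'no dedicated metadata storage'}")
```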
Change-Id: Iac5df570d0f0b963ca4be548c63f2650871e98d3
What about memory requirements? Cloudera's page suggests a 31GB heap per role and a host memory size of 256GB or more.
@jojochuang I think we can follow Cloudera's current memory recommendation. Our Datanode memory is set larger (64GB) because of some earlier heap issues and because our Datanode machines are co-located with other components (e.g. YARN / Presto). Our OM memory is also set larger (generally 64GB, but up to 96GB); the 96GB setting was due to an internal S3A client bug that triggered many concurrent list operations in OM. We might need to take non-heap memory into account, so total memory can be 2x the heap, plus some buffer for OS memory.
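The rule of thumb in this comment (total memory of roughly 2x the heap per role, plus some buffer for the OS) can be turned into quick arithmetic. The sketch below is only illustrative: the function names are hypothetical, and the 16GB OS buffer is an assumed placeholder, since the thread does not give a number for it.

```python
def estimate_role_memory_gb(heap_gb: float, non_heap_factor: float = 2.0) -> float:
    """Estimate the total memory of one role as heap * factor (2x per the comment above)."""
    return heap_gb * non_heap_factor


def estimate_host_memory_gb(
    role_heaps_gb: dict[str, float],
    non_heap_factor: float = 2.0,
    os_buffer_gb: float = 16.0,  # assumption: the thread only says "some buffer"
) -> float:
    """Sum the per-role estimates for co-located roles and add an OS buffer."""
    roles_total = sum(
        estimate_role_memory_gb(heap, non_heap_factor)
        for heap in role_heaps_gb.values()
    )
    return roles_total + os_buffer_gb


if __name__ == "__main__":
    # Two co-located roles, each with the 31GB heap mentioned in the thread.
    print(estimate_host_memory_gb({"om": 31, "datanode": 31}))
```

With two co-located roles at 31GB heap each, this gives 31 × 2 + 31 × 2 + 16 = 140GB, which fits within the 256GB host size mentioned earlier.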
ivandika3
left a comment
Thanks for the patch, left some questions and suggestions.
### Ozone Configuration

* **Monitoring**: Install Prometheus and Grafana for monitoring the Ozone cluster.
We can also add some ELK (FileBeat) or other log ingestion framework for the audit logs.
Care to suggest a line or two? I don't have experience with them, though we also use Apache Ranger for audits.
**Audit Log Collection**: Configure audit-log4j2.properties and use a log collector (e.g. Filebeat or the OTel Collector) to ingest and transform the audit logs into a log platform such as Elasticsearch.
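As a rough illustration of the "transform" step, the sketch below splits one pipe-delimited audit line into a structured record that a log platform could index. It rests on assumptions: the line layout (timestamp, level, logger, then `key=value` fields separated by `|`) and the sample line are hypothetical and should be checked against the pattern actually configured in your audit log4j2 properties. In practice a collector such as Filebeat would do this with its own processors rather than custom Python.

```python
def parse_audit_line(line: str) -> dict[str, str]:
    """Split a pipe-delimited audit line into a flat record.

    Assumed layout (verify against your configured log4j2 pattern):
    timestamp | level | logger | key=value | key=value | ...
    """
    parts = [part.strip() for part in line.split("|")]
    parts = [part for part in parts if part]  # drop empty trailing segments
    if len(parts) < 3:
        raise ValueError(f"unexpected audit line: {line!r}")

    record = {"timestamp": parts[0], "level": parts[1], "logger": parts[2]}
    for field in parts[3:]:
        key, sep, value = field.partition("=")
        if sep:  # keep only key=value fields
            record[key.strip()] = value.strip()
    return record


if __name__ == "__main__":
    # Hypothetical sample line, made up for this sketch.
    sample = (
        "2025-01-01 12:00:00,000 | INFO  | OMAudit | user=alice | "
        "ip=10.0.0.1 | op=CREATE_VOLUME {volume=vol1} | ret=SUCCESS |"
    )
    print(parse_audit_line(sample))
```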
Thanks @ivandika3. I'm staying away from copying directly from the Cloudera documentation. However, the recommendations here are based on our experience working with customers. (With my Cloudera hat on, I'd prefer to write Cloudera docs based on Apache docs, rather than the other way around :) )

On memory and heap: we don't have recommended settings for S3 Gateway and HttpFS Server, though. Our internal test clusters are generally configured with a 31GB heap for both roles.
Change-Id: Icd6c032dfb311ccf117b529e0c8625704d2167c4
Updated the doc to include a general recommendation. There are a lot of variables here, and this is meant to be a high-level guideline. We may want to refine it in the future based on size (e.g. 1 billion keys, 100 PB of data) or use case (AI, data processing, or archive).
ivandika3
left a comment
Thanks for the info and update. LGTM +1.
Merged. Thanks @ivandika3 and @ChenSammi.
* master: (90 commits)
  HDDS-13308. OM should expose Ratis config for increasing pending write limits (apache#8668)
  HDDS-8903. Add validation for ozone.om.snapshot.db.max.open.files. (apache#8787)
  HDDS-13429. Custom metadata headers with uppercase characters are not supported (apache#8805)
  HDDS-13448. DeleteBlocksCommandHandler thread stop for normal exception (apache#8816)
  HDDS-13346. Intermittent failure in TestCloseContainer#testContainerChecksumForClosedContainer (apache#8771)
  HDDS-13125. Add metrics for monitoring the SST file pruning threads. (apache#8764)
  HDDS-13367. [Docs] User doc for container balancer. (apache#8726)
  HDDS-13200. OM RocksDB Grafana Dashbroad shows no data on all panels (apache#8577)
  HDDS-13428. Recon - Retrigger of build whole NSSummary tree task submission inconsistency. (apache#8793)
  HDDS-13378. [Docs] Add a Production page under Getting Started (apache#8734)
  HDDS-13403. [Docs] Make feature proposal process more visible. (apache#8758)
  HDDS-11797. Remove cyclic dependency between SCMSafeModeManager and SafeModeRules (apache#8782)
  HDDS-13213. KeyDeletingService should limit task size by both key count and serialized size. (apache#8757)
  HDDS-13387. OMSnapshotCreateRequest logs invalid warning about DefaultReplicationConfig (apache#8760)
  HDDS-13405. ozone admin container create runs forever without kinit (apache#8765)
  HDDS-11514. Set optimal default values for delete configurations based on live cluster testing. (apache#8766)
  HDDS-13376. Add server-side limit note to ozone sh snapshot diff --page-size option (apache#8791)
  HDDS-11679. Support multiple S3Gs in MiniOzoneCluster (apache#8733)
  HDDS-13424. Use lsof instead of fuser to find if file is used in AbstractTestChunkManager (apache#8790)
  HDDS-13427. Bump awssdk to 2.31.78 (apache#8792)
  ...
…e#8734) Generated-by: Google Gemini Cli with Gemini 2.5 Pro.
What changes were proposed in this pull request?
HDDS-13378. [Docs] Add a Production page under Getting Started
Please describe your PR in detail:
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-13378
How was this patch tested?
User doc.