HDDS-13378. [Docs] Add a Production page under Getting Started #8734

jojochuang · 2025-07-03T18:46:06Z

What changes were proposed in this pull request?

HDDS-13378. [Docs] Add a Production page under Getting Started

Please describe your PR in detail:

Add a page to list the requirements and best practices of a successful production deployment.
Generated-by: Google Gemini Cli with Gemini 2.5 Pro. Prompt:

Read https://issues.apache.org/jira/browse/HDDS-13378 and implement it.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13378

How was this patch tested?

User doc.

This commit introduces a new documentation page outlining the requirements and best practices for deploying Apache Ozone in a production environment. The guide covers:

- System, storage, network, and security requirements.
- Recommended configurations for the Linux kernel, local file system, and Ozone itself.

This addresses HDDS-13378.

Change-Id: I907eff74c755b9232400900bce6b8878d41402fe

Change-Id: I1e883c996a3aeb4fa34812f35ba80974b6b939c3

ChenSammi · 2025-07-08T08:24:57Z

hadoop-hdds/docs/content/start/ProductionDeployment.md

+
+### Storage Requirements
+
+*   **Metadata Storage**: Use SAS SSD or NVMe SSD for metadata (RocksDB and Ratis) to ensure optimal performance.


OM and SCM Storage?

Shall we mention s3g and Recon?

We can also mention that for the Datanode, only the Ratis directory should be put on the SSD volume, for write performance reasons. The container DB can be put on the data volumes (e.g. disk), since it is colocated with the data.

updated to the best of my knowledge.

OM, SCM, and Datanode would require SSDs for metadata. Recon has an internal DB component and would require NVMe for a production cluster. S3 Gateway and HttpFS do not have a storage/DB/metadata component. They wouldn't need SSDs?

@jojochuang

> S3 Gateway and HttpFS do not have a storage/DB/metadata component. They wouldn't need SSDs?

Yeah. Those should be fine without one.
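The outcome of this thread can be summarized as a small lookup table. This is only a sketch of the guidance discussed above: the role keys and the helper name are illustrative and are not Ozone configuration identifiers.

```python
from typing import Optional

# Metadata storage guidance per Ozone role, as discussed in this review thread.
# Keys and values are illustrative labels, not real Ozone config names.
METADATA_STORAGE: dict[str, Optional[str]] = {
    "om": "ssd",        # RocksDB and Ratis metadata
    "scm": "ssd",       # RocksDB and Ratis metadata
    "datanode": "ssd",  # Ratis directory only; container DB can stay on data disks
    "recon": "nvme",    # internal DB component
    "s3g": None,        # no storage/DB/metadata component
    "httpfs": None,     # no storage/DB/metadata component
}


def needs_fast_metadata_storage(role: str) -> bool:
    """Return True if the role keeps metadata that should live on SSD/NVMe."""
    return METADATA_STORAGE[role.lower()] is not None


if __name__ == "__main__":
    for role, tier in METADATA_STORAGE.items():
        print(f"{role}: {tier or 'no dedicated metadata storage'}")
```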

Change-Id: Iac5df570d0f0b963ca4be548c63f2650871e98d3

jojochuang · 2025-07-10T17:45:09Z

What about memory requirements? Cloudera's page suggests a 31GB heap per role and a host memory size of 256GB or more.
@ivandika3 what's your experience? 256GB isn't too extreme for high-end servers, but I've been wondering if it's required.

ivandika3 · 2025-07-12T13:49:23Z

@jojochuang I think we can follow Cloudera's current memory recommendation. Our Datanode memory is set larger (64GB) because of some earlier heap issues and because our Datanode machines are co-located with other components (e.g. YARN / Presto). Our OM memory is also set larger (generally 64GB, but up to 96GB); the 96GB was due to an internal S3A client bug that triggered a lot of concurrent list operations in OM.

We might need to take into account the non-heap memory (so total memory can be 2x the heap, with some buffer for the OS memory).
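As a rough illustration of that rule of thumb, the arithmetic might look like the sketch below. The function names, the role names in the example, and the 8GB OS buffer are assumptions made for illustration; only the "total is about 2x heap" factor and the 31GB heap figure come from this discussion.

```python
def estimate_process_memory_gb(heap_gb: float, non_heap_factor: float = 2.0) -> float:
    """Approximate one JVM's footprint as heap plus non-heap (factor * heap)."""
    return heap_gb * non_heap_factor


def estimate_host_memory_gb(
    role_heaps_gb: dict[str, float],
    os_buffer_gb: float = 8.0,  # assumed buffer for the OS; tune per host
    non_heap_factor: float = 2.0,
) -> float:
    """Sum the estimated footprint of all co-located roles and add an OS buffer."""
    roles_total = sum(
        estimate_process_memory_gb(heap, non_heap_factor)
        for heap in role_heaps_gb.values()
    )
    return roles_total + os_buffer_gb


if __name__ == "__main__":
    # Hypothetical host running OM and SCM, each with a 31GB heap.
    print(estimate_host_memory_gb({"om": 31, "scm": 31}))
```

With two 31GB-heap roles on one host, this gives 2 x 62 + 8 = 132GB, which is well under the 256GB host size mentioned above. Page cache and other co-located services are not modeled here.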

ivandika3

Thanks for the patch, left some questions and suggestions.

hadoop-hdds/docs/content/start/ProductionDeployment.md

ivandika3 · 2025-07-12T13:54:38Z

hadoop-hdds/docs/content/start/ProductionDeployment.md

+
+### Ozone Configuration
+
+*   **Monitoring**: Install Prometheus and Grafana for monitoring the Ozone cluster.


We can also add some ELK (FileBeat) or other log ingestion framework for the audit logs.

Care to suggest a line or two? I don't have experience with them. Though we also use Apache Ranger for audits.

*   **Audit Log Collection**: Configure `audit-log4j2.properties` and use a log collector (e.g. FileBeat or OTel Collector) to ingest and transform the audit logs into a log platform such as Elasticsearch.

hadoop-hdds/docs/content/start/ProductionDeployment.md

jojochuang · 2025-07-12T17:13:09Z

Thanks @ivandika3. I'm staying away from copying directly from Cloudera documentation. However, the recommendations here are based on our experience working with customers. (With my Cloudera hat on, I'd prefer to write Cloudera docs based on Apache docs, rather than the other way around :) )

jojochuang · 2025-07-12T17:40:39Z

On Memory and heap:
Cloudera's recommendation is listed in: https://docs.cloudera.com/cdp-private-cloud-base/7.3.1/cdp-private-cloud-base-installation/topics/cdpdc-ozone.html

However, we don't have the recommended settings for S3 Gateway and Httpfs Server! Our internal test clusters are generally configured with 31GB heap for both roles though.

Change-Id: Icd6c032dfb311ccf117b529e0c8625704d2167c4

jojochuang · 2025-07-12T17:53:08Z

Updated the doc to include a general recommendation. There are a lot of variables here, and this is meant to be a high-level guideline. We may want to refine it in the future based on cluster size (e.g. 1 billion keys, 100 PB of data) or use case (AI, data processing, or archive).

ivandika3

Thanks for the info and update. LGTM +1.

jojochuang · 2025-07-14T14:43:42Z

Merged. Thanks @ivandika3 and @ChenSammi

* master: (90 commits)
  HDDS-13308. OM should expose Ratis config for increasing pending write limits (apache#8668)
  HDDS-8903. Add validation for ozone.om.snapshot.db.max.open.files. (apache#8787)
  HDDS-13429. Custom metadata headers with uppercase characters are not supported (apache#8805)
  HDDS-13448. DeleteBlocksCommandHandler thread stop for normal exception (apache#8816)
  HDDS-13346. Intermittent failure in TestCloseContainer#testContainerChecksumForClosedContainer (apache#8771)
  HDDS-13125. Add metrics for monitoring the SST file pruning threads. (apache#8764)
  HDDS-13367. [Docs] User doc for container balancer. (apache#8726)
  HDDS-13200. OM RocksDB Grafana Dashbroad shows no data on all panels (apache#8577)
  HDDS-13428. Recon - Retrigger of build whole NSSummary tree task submission inconsistency. (apache#8793)
  HDDS-13378. [Docs] Add a Production page under Getting Started (apache#8734)
  HDDS-13403. [Docs] Make feature proposal process more visible. (apache#8758)
  HDDS-11797. Remove cyclic dependency between SCMSafeModeManager and SafeModeRules (apache#8782)
  HDDS-13213. KeyDeletingService should limit task size by both key count and serialized size. (apache#8757)
  HDDS-13387. OMSnapshotCreateRequest logs invalid warning about DefaultReplicationConfig (apache#8760)
  HDDS-13405. ozone admin container create runs forever without kinit (apache#8765)
  HDDS-11514. Set optimal default values for delete configurations based on live cluster testing. (apache#8766)
  HDDS-13376. Add server-side limit note to ozone sh snapshot diff --page-size option (apache#8791)
  HDDS-11679. Support multiple S3Gs in MiniOzoneCluster (apache#8733)
  HDDS-13424. Use lsof instead of fuser to find if file is used in AbstractTestChunkManager (apache#8790)
  HDDS-13427. Bump awssdk to 2.31.78 (apache#8792)
  ...

jojochuang added 2 commits July 3, 2025 11:37

Update Hugo header.

1c41951

Change-Id: I1e883c996a3aeb4fa34812f35ba80974b6b939c3

jojochuang requested review from ChenSammi, kerneltime and nandakumar131 July 3, 2025 18:46

jojochuang added the documentation (Improvements or additions to documentation) and AI-gen labels Jul 3, 2025

jojochuang requested a review from smengcl July 3, 2025 19:56

ChenSammi reviewed Jul 8, 2025

docs: Update production deployment guide based on feedback

ae74cd7

Change-Id: Iac5df570d0f0b963ca4be548c63f2650871e98d3

jojochuang requested review from ChenSammi and ivandika3 July 11, 2025 23:08

ivandika3 reviewed Jul 12, 2025

docs: HDDS-13378. Add production deployment recommendations

ed1285e

Change-Id: Icd6c032dfb311ccf117b529e0c8625704d2167c4

jojochuang requested a review from ivandika3 July 13, 2025 17:56

ivandika3 approved these changes Jul 14, 2025

jojochuang merged commit d15e8a6 into apache:master Jul 14, 2025
14 checks passed

jojochuang added a commit to jojochuang/ozone that referenced this pull request Jul 31, 2025

HDDS-13378. [Docs] Add a Production page under Getting Started (apach…

14886ac

…e#8734) Generated-by: Google Gemini Cli with Gemini 2.5 Pro.

