@jojochuang
Contributor

What changes were proposed in this pull request?

HDDS-13112. [Docs] Add a section "Automatic Snapshot Installation for Stale Ozone Managers" to OM HA doc

Please describe your PR in detail:

  • Generated-by: ChatGPT, prompt:
https://ozone.apache.org/docs/edge/feature/om-ha.html

 

The user doc states that OM bootstrap happens when a new OM is added. However, Ratis also triggers notifyInstallSnapshotFromLeader() when a follower OM falls behind the leader OM. We should update the doc to cover this condition too.

https://github.com/apache/ratis/blob/c1da37cb455bbf94da267b3f2b9bf3884017e1ca/ratis-server-api/src/main/java/org/apache/ratis/server/leader/LogAppender.java#L109

/**
 * Should this {@link LogAppender} send a snapshot to the follower?
 *
 * @return the snapshot if it should install a snapshot; otherwise, return null.
 */
default SnapshotInfo shouldInstallSnapshot() {
  // we should install snapshot if the follower needs to catch up and:
  // 1. there is no local log entry but there is snapshot
  // 2. or the follower's next index is smaller than the log start index
  // 3. or the follower is bootstrapping and has not installed any snapshot yet 
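
The three conditions in the comment above can be read as a single boolean check. The following is a minimal sketch of that reading, not the Ratis implementation: the class name, method signature, and parameter names are all hypothetical, and the real method returns a SnapshotInfo (or null) rather than a boolean.

```java
// Hypothetical, simplified model of the decision described in the Ratis comment.
// None of these names come from Ratis; they exist only for illustration.
final class InstallSnapshotSketch {

  static boolean shouldInstallSnapshot(
      boolean followerNeedsCatchUp,       // follower is behind the leader
      boolean leaderHasLogEntries,        // leader still holds at least one log entry
      boolean leaderHasSnapshot,          // leader has taken a snapshot
      long followerNextIndex,             // next log index the follower expects
      long leaderLogStartIndex,           // oldest index still present in the leader's log
      boolean followerBootstrapping,      // follower is a newly added peer
      boolean followerInstalledSnapshot)  // follower has already installed a snapshot
  {
    if (!followerNeedsCatchUp) {
      return false;
    }
    // 1. there is no local log entry but there is a snapshot
    boolean noLogButSnapshot = !leaderHasLogEntries && leaderHasSnapshot;
    // 2. the follower's next index is smaller than the log start index
    boolean entriesAlreadyPurged = followerNextIndex < leaderLogStartIndex;
    // 3. the follower is bootstrapping and has not installed any snapshot yet
    boolean freshBootstrap = followerBootstrapping && !followerInstalledSnapshot;
    return noLogButSnapshot || entriesAlreadyPurged || freshBootstrap;
  }

  public static void main(String[] args) {
    // Follower expects index 50, but the leader's log now starts at 100.
    System.out.println("stale follower:  "
        + shouldInstallSnapshot(true, true, true, 50L, 100L, false, true));
    // Follower expects index 150, which the leader can still serve from its log.
    System.out.println("slightly behind: "
        + shouldInstallSnapshot(true, true, true, 150L, 100L, false, true));
    // Newly added OM that has never installed a snapshot.
    System.out.println("new OM:          "
        + shouldInstallSnapshot(true, true, true, 0L, 0L, true, false));
  }
}
```

The second condition is the one this PR is about: a follower that was not newly added can still need a snapshot once the leader has purged the log entries it is waiting for.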

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13112

How was this patch tested?

User doc.

@jojochuang jojochuang added the documentation Improvements or additions to documentation label Jun 10, 2025
Contributor

@adoroszlai adoroszlai left a comment

Thanks @jojochuang for the patch. I find the content generated by ChatGPT too verbose. It keeps repeating the same information from prior paragraphs.

Contributor Author

@jojochuang jojochuang left a comment

Thanks @adoroszlai, I accepted the suggestions.

Contributor

@ivandika3 ivandika3 left a comment

@jojochuang Thanks for the documentation.

I think it is worth adding how to handle an issue that we encountered before: because of a huge OM DB, while the follower was installing the snapshot from the leader (i.e. downloading the OM DB), the leader's Raft log for that snapshot index had already been purged. So after the OM follower finished downloading the OM DB, the leader could not send the purged logs and asked the follower to re-download the OM DB.

Currently, this can be handled by setting the configurations introduced in HDDS-8131. So either:

  1. Set ozone.om.ratis.log.purge.preservation.log.num to a high enough value (e.g. 1000000) so that the OM leader will not purge the last N logs.
  2. Set ozone.om.ratis.log.purge.upto.snapshot.index to false, which prevents the OM leader's logs from being purged until all the followers have caught up.

You can refer to the ticket for a full explanation and tradeoffs.
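
To make the trade-off between the two settings concrete, here is a small model of how far a leader might purge its log after taking a snapshot. This is a sketch under assumptions, not Ozone or Ratis code: the class, method, and parameter names are hypothetical, the indexes are made up, the -1 "purge nothing" value is a convention chosen for this example, and the arithmetic only mirrors the behavior described above (keep the last N entries; do not purge past a follower that has not caught up).

```java
// Hypothetical model of the two purge settings discussed above.
// It illustrates the described behavior only; it is not the actual purge logic.
final class LogPurgeSketch {

  /** Convention for this sketch: -1 means "purge nothing". */
  static final long NOTHING = -1L;

  /**
   * @param snapshotIndex          index covered by the leader's latest snapshot
   * @param lastLogIndex           newest index in the leader's log
   * @param purgeUptoSnapshotIndex models ozone.om.ratis.log.purge.upto.snapshot.index
   * @param preservationLogNum     models ozone.om.ratis.log.purge.preservation.log.num
   * @param followerIndexes        how far each follower has caught up
   * @return highest index the leader may purge, or NOTHING
   */
  static long purgeUpTo(long snapshotIndex, long lastLogIndex,
      boolean purgeUptoSnapshotIndex, long preservationLogNum,
      long[] followerIndexes) {
    long candidate = snapshotIndex;
    if (!purgeUptoSnapshotIndex) {
      // Never purge entries that some follower has not caught up to yet.
      for (long index : followerIndexes) {
        candidate = Math.min(candidate, index);
      }
    }
    if (preservationLogNum > 0) {
      // Always keep the newest preservationLogNum entries.
      candidate = Math.min(candidate, lastLogIndex - preservationLogNum);
    }
    return Math.max(candidate, NOTHING);
  }

  public static void main(String[] args) {
    long[] followers = {300L, 1100L}; // one follower is far behind
    System.out.println(purgeUpTo(1000L, 1200L, true, 0L, followers));
    System.out.println(purgeUpTo(1000L, 1200L, false, 0L, followers));
    System.out.println(purgeUpTo(1000L, 1200L, true, 500L, followers));
    System.out.println(purgeUpTo(1000L, 1200L, true, 1_000_000L, followers));
  }
}
```

In this model, both options leave more log entries on the leader, which is the cost paid for letting a slow follower finish its download and then catch up from the log instead of downloading the OM DB again.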

Contributor

@szetszwo szetszwo left a comment

@jojochuang , thanks for working on this! Please see the comments inlined.


1. Leader determines that the follower is too far behind.
2. Leader notifies the follower to catch up via snapshot.
3. The follower downloads and installs the latest snapshot from the leader.
Contributor

... the latest snapshot from the leader.

Side note: we should support installing a snapshot from another follower in order to reduce the load on the leader.

Contributor Author

good idea!

@jojochuang
Contributor Author

@ivandika3 thanks for the tips on handling a huge OM DB. I think it's useful information.
I'd like to think it's an advanced topic and we should create an advanced/troubleshooting section for this kind of info.

Thoughts?

@ivandika3
Contributor

I'd like to think it's an advanced topic and we should create an advanced/troubleshooting section for this kind of info.

Yes, this can be updated in a separate ticket / as part of the website v2.

@ivandika3
Contributor

I think we can also add a one-liner noting that this snapshot (Raft snapshot) and the Ozone snapshot are related but different.

Again, this can be addressed in a separate ticket if needed.

@jojochuang
Contributor Author

https://issues.apache.org/jira/browse/HDDS-13268 for creating a troubleshooting user doc.

jojochuang and others added 10 commits June 13, 2025 20:44
…ehind too much.

Change-Id: I913149039d2cea2a50c855f1dbe59f57c66193f9
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Change-Id: I7f5de2e0d99a5b59778f3d0e04cb84d4f390805d
Change-Id: I562b0b936121c7a912a87ee5d8775ac2a7ad4b9d
Change-Id: I2bca7665ff35e44a912ef57b0c80c0ee1d13db5b
Contributor

@szetszwo szetszwo left a comment

Thanks for the update!

+1 The change looks good.

@jojochuang jojochuang merged commit 7e77058 into apache:master Jun 14, 2025
14 checks passed
@jojochuang
Contributor Author

Thanks @ivandika3 @szetszwo @adoroszlai @sadanand48, merged.

@adoroszlai
Contributor

@jojochuang So in the end we removed all changes made by Gemini, but it was still credited in the commit. However, @sadanand48, who contributed a completely new paragraph, was not. I don't think that's fair.

sadanand48 pushed a commit to sadanand48/hadoop-ozone that referenced this pull request Jun 16, 2025
…ehind too much. (apache#8600)

Generated-by: gemini-code-assist
Generated-by: ChatGPT
aswinshakil added a commit to aswinshakil/ozone that referenced this pull request Jun 20, 2025
…239-container-reconciliation

Commits: 62

da53b5b HDDS-13299. Fix failures related to delete (apache#8665)
8c1b439 HDDS-13296. Integration check always passes due to missing output (apache#8662)
7329859 HDDS-13023. Container checksum is missing after container import (apache#8459)
a0af93e HDDS-13292. Change `<? extends KeyValue>` to `<KeyValue>` in test (apache#8657)
f3050cf HDDS-13276. Use KEY_ONLY/VALUE_ONLY iterator in SCM/Datanode. (apache#8638)
e9c0a45 HDDS-13262. Simplify key name validation (apache#8619)
f713e57 HDDS-12482. Avoid using CommonConfigurationKeys (apache#8647)
b574709 HDDS-12924. datanode used space calculation optimization (apache#8365)
de683aa HDDS-13263. Refactor DB Checkpoint Utilities. (apache#8620)
97262aa HDDS-13256. Updated OM Snapshot Grafana Dashboard to reflect metric updates from HDDS-13181. (apache#8639)
9d2b415 HDDS-13234. Expired secret key can abort leader OM startup. (apache#8601)
d9049a2 HDDS-13220. Change Recon 'Negative usedBytes' message loglevel to DEBUG (apache#8648)
6df3077 HDDS-9223. Use protobuf for SnapshotDiffJobCodec (apache#8503)
a7fc290 HDDS-13236. Change Table methods not to throw IOException. (apache#8645)
9958f5b HDDS-13287. Upgrade commons-beanutils to 1.11.0 due to CVE-2025-48734 (apache#8646)
48aefea HDDS-13277. [Docs] Native C/C++ Ozone clients (apache#8630)
052d912 HDDS-13037. Let container create command support STANDALONE , RATIS and EC containers (apache#8559)
90ed60b HDDS-13279. Skip verifying Apache Ranger binaries in CI (apache#8633)
9bc53b2 HDDS-11513. All deletion configurations should be configurable without restart (apache#8003)
ac511ac HDDS-13259. Deletion Progress - Grafana Dashboard (apache#8617)
3370f42 HDDS-13246. Change `<? extend KeyValue>` to `<KeyValue>` in hadoop-hdds (apache#8631)
7af8c44 HDDS-11454. Ranger integration for Docker Compose environment (apache#8575)
5a3e4e7 HDDS-13273. Bump awssdk to 2.31.63 (apache#8626)
77138b8 HDDS-13254. Change table iterator to optionally read key or value. (apache#8621)
ce288b6 HDDS-13265. Simplify the page Access Ozone using HTTPFS REST API (apache#8629)
36fe888 HDDS-13275. Improve CheckNative implementation (apache#8628)
d38484e HDDS-13274. Bump sqlite-jdbc to 3.50.1.0 (apache#8627)
3f3ec43 HDDS-13266. `ozone debug checknative` to show OpenSSL lib (apache#8623)
8983a63 HDDS-13272. Bump junit to 5.13.1 (apache#8625)
a927113 HDDS-13271. [Docs] Minor text updates, reference links. (apache#8624)
7e77058 HDDS-13112. [Docs] OM Bootstrap can also happen when follower falls behind too much. (apache#8600)
fd13300 HDDS-10775. Support bucket ownership verification (apache#8558)
3ecf345 HDDS-13207. [Docs] Third party systems compatible with Ozone S3. (apache#8584)
ad5a507 HDDS-13035. SnapshotDeletingService should hold write locks while purging deleted snapshots (apache#8554)
38a9186 HDDS-12637. Increase max buffer size for tar entry read/write (apache#8618)
f31c264 HDDS-13045. Implement Immediate Triggering of Heartbeat when Volume Full (apache#8590)
0701d6a HDDS-13248. Remove `ozone debug replicas verify` option --output-dir (apache#8612)
ca1afe8 HDDS-13257. Remove separate split for shell integration tests (apache#8616)
5d6fe94 HDDS-13216. Standardize Container[Replica]NotFoundException messages (apache#8599)
1e47217 HDDS-13168. Fix error response format in CheckUploadContentTypeFilter (apache#8614)
6d4d423 HDDS-13181. Added metrics for internal Snapshot Operations. (apache#8606)
4a461b2 HDDS-10490. Intermittent NPE in TestSnapshotDiffManager#testLoadJobsOnStartUp (apache#8596)
bf29f7f HDDS-13235. The equals/hashCode methods in anonymous KeyValue classes may not work. (apache#8607)
6ff3ad6 HDDS-12873. Improve ContainerData statistics synchronization. (apache#8305)
09d3b27 HDDS-13244. TestSnapshotDeletingServiceIntegrationTest should close snapshots after deleting them (apache#8611)
931bc2d HDDS-13243. copy-rename-maven-plugin version is missing (apache#8605)
3b5985c HDDS-13244. Disable TestSnapshotDeletingServiceIntegrationTest
6bf009c HDDS-12927. metrics and log to indicate datanode crossing disk limits (apache#8573)
752da2b HDDS-12760. Intermittent Timeout in testImportedContainerIsClosed (apache#8349)
8c32363 HDDS-13050. Update StartFromDockerHub.md. (apache#8586)
ba1887c HDDS-13241. Fix some potential resource leaks (apache#8602)
bbaf71e HDDS-13130. Rename all instances of Disk Usage to Namespace usage (apache#8571)
0628386 HDDS-13142. Correct SCMPerformanceMetrics for delete operation. (apache#8592)
516bc96 HDDS-13148. [Docs] Update Transparent Data Encryption doc. (apache#8530)
5787135 HDDS-13229. [Doc] Fix incorrect CLI argument order in OM upgrade docs (apache#8598)
ba95074 HDDS-13107. Support limiting output of `ozone admin datanode list` (apache#8595)
e7f5544 HDDS-13171. Replace pipelineID if nodes are changed (apache#8562)
3c9d4d8 HDDS-13103. Correct transaction metrics in SCMBlockDeletingService. (apache#8516)
f62eb8a HDDS-13160. Remove SnapshotDirectoryCleaningService and refactor AbstractDeletingService (apache#8547)
b46e6b2 HDDS-13150. Fixed SnapshotLimitCheck when failures occur. (apache#8532)
203c1d3 HDDS-13206. Update documentation for Apache Ranger (apache#8583)
2072ef0 HDDS-13214. populate-cache fails due to unused dependency (apache#8594)

Conflicts:
	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerData.java
	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueContainer.java
	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/helpers/KeyValueContainerUtil.java
	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/statemachine/background/BlockDeletingTask.java