From fcf961265a7bab0d0914eae036e21c11704d4a9e Mon Sep 17 00:00:00 2001
From: Wei-Chiu Chuang
Date: Sun, 8 Jun 2025 14:37:19 -0700
Subject: [PATCH 01/10] HDDS-13112. [Docs] OM Bootstrap can also happen when follower falls behind too much.

Change-Id: I913149039d2cea2a50c855f1dbe59f57c66193f9
---
 hadoop-hdds/docs/content/feature/OM-HA.md | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/hadoop-hdds/docs/content/feature/OM-HA.md b/hadoop-hdds/docs/content/feature/OM-HA.md
index 1a0a46481d6f..78f69fff9027 100644
--- a/hadoop-hdds/docs/content/feature/OM-HA.md
+++ b/hadoop-hdds/docs/content/feature/OM-HA.md
@@ -125,7 +125,31 @@ ozone om [global options (optional)] --bootstrap --force

 Note that using the _force_ option during bootstrap could crash the OM process if it does not have updated configurations.

+## Automatic Snapshot Installation for Stale Ozone Managers
+
+In an Ozone Manager (OM) High Availability (HA) cluster, all OM nodes maintain a consistent metadata state using the Ratis consensus protocol. Sometimes, an OM follower node may be offline or fall so far behind the leader OM’s log that it cannot catch up by replaying individual log entries.
+
+The OM HA implementation includes an **automatic snapshot installation and recovery process** for such cases:
+
+- **Snapshot Installation Trigger:**
+ When a follower OM falls significantly behind and is unable to catch up with the leader OM through standard log replication, the leader OM will notify the follower to install a snapshot. This is handled internally by the OM state machine.
+
+- **How it works:**
+ - The follower OM receives a snapshot installation notification from the leader via the consensus protocol.
+ - The follower OM then downloads and installs the latest consistent checkpoint (snapshot) from the leader OM.
+ - After installing the snapshot, the follower OM resumes normal operation and log replication from the new state.
+
+- **Relevant Implementation:**
+ This logic is implemented in the `OzoneManagerStateMachine.notifyInstallSnapshotFromLeader()` method. The install is triggered automatically by the consensus layer (Ratis) when it detects that a follower cannot catch up by log replay alone.
+
+- **What this means for administrators:**
+ - In most scenarios, stale OMs will recover automatically after coming back online, even if they have missed a large number of operations.
+ - Manual intervention (such as running `ozone om --bootstrap`) is only required when adding a new OM node to the cluster or when explicitly requested by support instructions.
+
+
 ## References

 * Check [this page]({{< ref "design/omha.md" >}}) for the links to the original design docs
 * Ozone distribution contains an example OM HA configuration, under the `compose/ozone-om-ha` directory which can be tested with the help of [docker-compose]({{< ref "start/RunningViaDocker.md" >}}).
+ * [OzoneManagerStateMachine.notifyInstallSnapshotFromLeader source code](https://github.com/apache/ozone/blob/master/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ratis/OzoneManagerStateMachine.java)
+* [Apache Ratis State Machine API documentation](https://github.com/apache/ratis/blob/master/ratis-server-api/src/main/java/org/apache/ratis/statemachine/StateMachine.java)

From 84e36a57c4019264899368005d9056e809572cb1 Mon Sep 17 00:00:00 2001
From: Wei-Chiu Chuang
Date: Tue, 10 Jun 2025 10:41:03 -0700
Subject: [PATCH 02/10] Update hadoop-hdds/docs/content/feature/OM-HA.md

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---
 hadoop-hdds/docs/content/feature/OM-HA.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hadoop-hdds/docs/content/feature/OM-HA.md b/hadoop-hdds/docs/content/feature/OM-HA.md
index 78f69fff9027..15ef6948c85a 100644
--- a/hadoop-hdds/docs/content/feature/OM-HA.md
+++ b/hadoop-hdds/docs/content/feature/OM-HA.md
@@ -132,7 +132,7 @@ In an Ozone Manager (OM) High Availability (HA) cluster, all OM nodes maintain a
 The OM HA implementation includes an **automatic snapshot installation and recovery process** for such cases:

 - **Snapshot Installation Trigger:**
- When a follower OM falls significantly behind and is unable to catch up with the leader OM through standard log replication, the leader OM will notify the follower to install a snapshot. This is handled internally by the OM state machine.
+When a follower OM falls significantly behind and is unable to catch up with the leader OM through standard log replication, the Ratis consensus layer on the leader OM may determine that a snapshot installation is necessary. The leader then notifies the follower, and the snapshot installation on the follower is handled by its `OzoneManagerStateMachine`.

 - **How it works:**
  - The follower OM receives a snapshot installation notification from the leader via the consensus protocol.

From bab6733d33d64da84adf8d68569e9c18baf0133b Mon Sep 17 00:00:00 2001
From: Wei-Chiu Chuang
Date: Tue, 10 Jun 2025 10:41:40 -0700
Subject: [PATCH 03/10] Update hadoop-hdds/docs/content/feature/OM-HA.md

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---
 hadoop-hdds/docs/content/feature/OM-HA.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hadoop-hdds/docs/content/feature/OM-HA.md b/hadoop-hdds/docs/content/feature/OM-HA.md
index 15ef6948c85a..ec7131e2b04a 100644
--- a/hadoop-hdds/docs/content/feature/OM-HA.md
+++ b/hadoop-hdds/docs/content/feature/OM-HA.md
@@ -151,5 +151,5 @@ When a follower OM falls significantly behind and is unable to catch up with the

 * Check [this page]({{< ref "design/omha.md" >}}) for the links to the original design docs
 * Ozone distribution contains an example OM HA configuration, under the `compose/ozone-om-ha` directory which can be tested with the help of [docker-compose]({{< ref "start/RunningViaDocker.md" >}}).
- * [OzoneManagerStateMachine.notifyInstallSnapshotFromLeader source code](https://github.com/apache/ozone/blob/master/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ratis/OzoneManagerStateMachine.java)
+* [OzoneManagerStateMachine.notifyInstallSnapshotFromLeader source code](https://github.com/apache/ozone/blob/master/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ratis/OzoneManagerStateMachine.java#L530)
 * [Apache Ratis State Machine API documentation](https://github.com/apache/ratis/blob/master/ratis-server-api/src/main/java/org/apache/ratis/statemachine/StateMachine.java)

From 6276529a38a585639b06c4eb2fc6fc500294eab2 Mon Sep 17 00:00:00 2001
From: Wei-Chiu Chuang
Date: Tue, 10 Jun 2025 10:44:23 -0700
Subject: [PATCH 04/10] Update hadoop-hdds/docs/content/feature/OM-HA.md

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---
 hadoop-hdds/docs/content/feature/OM-HA.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hadoop-hdds/docs/content/feature/OM-HA.md b/hadoop-hdds/docs/content/feature/OM-HA.md
index ec7131e2b04a..c83d2cb75836 100644
--- a/hadoop-hdds/docs/content/feature/OM-HA.md
+++ b/hadoop-hdds/docs/content/feature/OM-HA.md
@@ -143,7 +143,7 @@ When a follower OM falls significantly behind and is unable to catch up with the
  This logic is implemented in the `OzoneManagerStateMachine.notifyInstallSnapshotFromLeader()` method. The install is triggered automatically by the consensus layer (Ratis) when it detects that a follower cannot catch up by log replay alone.

 - **What this means for administrators:**
- - In most scenarios, stale OMs will recover automatically after coming back online, even if they have missed a large number of operations.
+ - In most scenarios, stale OMs—whether they were temporarily offline or simply fell too far behind the leader while remaining online—will recover automatically, even if they have missed a large number of operations.
  - Manual intervention (such as running `ozone om --bootstrap`) is only required when adding a new OM node to the cluster or when explicitly requested by support instructions.

From 4d2a49e8a998393ebde41107daf9d99caca53b22 Mon Sep 17 00:00:00 2001
From: Wei-Chiu Chuang
Date: Thu, 12 Jun 2025 16:34:21 -0700
Subject: [PATCH 05/10] Update hadoop-hdds/docs/content/feature/OM-HA.md

Co-authored-by: Doroszlai, Attila <6454655+adoroszlai@users.noreply.github.com>
---
 hadoop-hdds/docs/content/feature/OM-HA.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hadoop-hdds/docs/content/feature/OM-HA.md b/hadoop-hdds/docs/content/feature/OM-HA.md
index c83d2cb75836..4614c52ae16b 100644
--- a/hadoop-hdds/docs/content/feature/OM-HA.md
+++ b/hadoop-hdds/docs/content/feature/OM-HA.md
@@ -152,4 +152,4 @@ When a follower OM falls significantly behind and is unable to catch up with the
 * Check [this page]({{< ref "design/omha.md" >}}) for the links to the original design docs
 * Ozone distribution contains an example OM HA configuration, under the `compose/ozone-om-ha` directory which can be tested with the help of [docker-compose]({{< ref "start/RunningViaDocker.md" >}}).
 * [OzoneManagerStateMachine.notifyInstallSnapshotFromLeader source code](https://github.com/apache/ozone/blob/master/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ratis/OzoneManagerStateMachine.java#L530)
-* [Apache Ratis State Machine API documentation](https://github.com/apache/ratis/blob/master/ratis-server-api/src/main/java/org/apache/ratis/statemachine/StateMachine.java)
+* [Apache Ratis State Machine API documentation](https://github.com/apache/ratis/blob/3612bcaf7d3e48a658935fc8b250e5d3b35df174/ratis-server-api/src/main/java/org/apache/ratis/statemachine/StateMachine.java)

From 4b036acdcbe7a455b7482fcb9756a17c72e77fce Mon Sep 17 00:00:00 2001
From: Wei-Chiu Chuang
Date: Thu, 12 Jun 2025 16:35:17 -0700
Subject: [PATCH 06/10] Update hadoop-hdds/docs/content/feature/OM-HA.md

Co-authored-by: Doroszlai, Attila <6454655+adoroszlai@users.noreply.github.com>
---
 hadoop-hdds/docs/content/feature/OM-HA.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/hadoop-hdds/docs/content/feature/OM-HA.md b/hadoop-hdds/docs/content/feature/OM-HA.md
index 4614c52ae16b..54df73c098b1 100644
--- a/hadoop-hdds/docs/content/feature/OM-HA.md
+++ b/hadoop-hdds/docs/content/feature/OM-HA.md
@@ -151,5 +151,4 @@ When a follower OM falls significantly behind and is unable to catch up with the

 * Check [this page]({{< ref "design/omha.md" >}}) for the links to the original design docs
 * Ozone distribution contains an example OM HA configuration, under the `compose/ozone-om-ha` directory which can be tested with the help of [docker-compose]({{< ref "start/RunningViaDocker.md" >}}).
-* [OzoneManagerStateMachine.notifyInstallSnapshotFromLeader source code](https://github.com/apache/ozone/blob/master/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ratis/OzoneManagerStateMachine.java#L530)
 * [Apache Ratis State Machine API documentation](https://github.com/apache/ratis/blob/3612bcaf7d3e48a658935fc8b250e5d3b35df174/ratis-server-api/src/main/java/org/apache/ratis/statemachine/StateMachine.java)

From 4b09119d624b65e655396ca209244a29391429c1 Mon Sep 17 00:00:00 2001
From: Wei-Chiu Chuang
Date: Thu, 12 Jun 2025 16:36:12 -0700
Subject: [PATCH 07/10] Update hadoop-hdds/docs/content/feature/OM-HA.md

Co-authored-by: Doroszlai, Attila <6454655+adoroszlai@users.noreply.github.com>
---
 hadoop-hdds/docs/content/feature/OM-HA.md | 23 ++++++++---------------
 1 file changed, 8 insertions(+), 15 deletions(-)

diff --git a/hadoop-hdds/docs/content/feature/OM-HA.md b/hadoop-hdds/docs/content/feature/OM-HA.md
index 54df73c098b1..877e90606fb2 100644
--- a/hadoop-hdds/docs/content/feature/OM-HA.md
+++ b/hadoop-hdds/docs/content/feature/OM-HA.md
@@ -127,25 +127,18 @@ Note that using the _force_ option during bootstrap could crash the OM process i

 ## Automatic Snapshot Installation for Stale Ozone Managers

-In an Ozone Manager (OM) High Availability (HA) cluster, all OM nodes maintain a consistent metadata state using the Ratis consensus protocol. Sometimes, an OM follower node may be offline or fall so far behind the leader OM’s log that it cannot catch up by replaying individual log entries.
+Sometimes an OM follower node may be offline or fall so far behind the leader OM's log that it cannot catch up by replaying individual log entries. The OM HA implementation includes an automatic snapshot installation and recovery process for such cases.
-The OM HA implementation includes an **automatic snapshot installation and recovery process** for such cases:
+How it works:

-- **Snapshot Installation Trigger:**
-When a follower OM falls significantly behind and is unable to catch up with the leader OM through standard log replication, the Ratis consensus layer on the leader OM may determine that a snapshot installation is necessary. The leader then notifies the follower, and the snapshot installation on the follower is handled by its `OzoneManagerStateMachine`.
+1. Leader determines that the follower is too far behind.
+2. Leader notifies the follower to catch up via snapshot.
+3. The follower downloads and installs the latest snapshot from the leader.
+4. After installing the snapshot, the follower OM resumes normal operation and log replication from the new state.

-- **How it works:**
- - The follower OM receives a snapshot installation notification from the leader via the consensus protocol.
- - The follower OM then downloads and installs the latest consistent checkpoint (snapshot) from the leader OM.
- - After installing the snapshot, the follower OM resumes normal operation and log replication from the new state.
-
-- **Relevant Implementation:**
- This logic is implemented in the `OzoneManagerStateMachine.notifyInstallSnapshotFromLeader()` method. The install is triggered automatically by the consensus layer (Ratis) when it detects that a follower cannot catch up by log replay alone.
-
-- **What this means for administrators:**
- - In most scenarios, stale OMs—whether they were temporarily offline or simply fell too far behind the leader while remaining online—will recover automatically, even if they have missed a large number of operations.
- - Manual intervention (such as running `ozone om --bootstrap`) is only required when adding a new OM node to the cluster or when explicitly requested by support instructions.
+This logic is implemented in the [`OzoneManagerStateMachine.notifyInstallSnapshotFromLeader()` method](https://github.com/apache/ozone/blob/931bc2d8a9e8e8595bb49034c03c14e2b15be865/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ratis/OzoneManagerStateMachine.java#L521-L541).

+In most scenarios, stale OMs will recover automatically, even if they have missed a large number of operations. Manual intervention (such as running `ozone om --bootstrap`) is only required when adding a new OM node to the cluster or when explicitly requested by support instructions.

 ## References

From a929d531371ae3928a6e85feef72c54ef9ad1476 Mon Sep 17 00:00:00 2001
From: Wei-Chiu Chuang
Date: Fri, 13 Jun 2025 15:39:17 -0700
Subject: [PATCH 08/10] Incorporate review comments from Nicholas, Sadandand and Ivan.

Change-Id: I7f5de2e0d99a5b59778f3d0e04cb84d4f390805d
---
 hadoop-hdds/docs/content/feature/OM-HA.md | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/hadoop-hdds/docs/content/feature/OM-HA.md b/hadoop-hdds/docs/content/feature/OM-HA.md
index 877e90606fb2..bc82eec1030d 100644
--- a/hadoop-hdds/docs/content/feature/OM-HA.md
+++ b/hadoop-hdds/docs/content/feature/OM-HA.md
@@ -127,21 +127,34 @@ Note that using the _force_ option during bootstrap could crash the OM process i

 ## Automatic Snapshot Installation for Stale Ozone Managers

-Sometimes an OM follower node may be offline or fall so far behind the leader OM's log that it cannot catch up by replaying individual log entries. The OM HA implementation includes an automatic snapshot installation and recovery process for such cases.
+Sometimes an OM follower node may be offline or fall far behind the OM leader's raft log.
+Then, it cannot easily catch up by replaying individual log entries.
+The OM HA implementation includes an automatic snapshot installation
+and recovery process for such cases.

 How it works:

 1. Leader determines that the follower is too far behind.
-2. Leader notifies the follower to catch up via snapshot.
+2. Leader notifies the follower to install a snapshot.
 3. The follower downloads and installs the latest snapshot from the leader.
 4. After installing the snapshot, the follower OM resumes normal operation and log replication from the new state.

-This logic is implemented in the [`OzoneManagerStateMachine.notifyInstallSnapshotFromLeader()` method](https://github.com/apache/ozone/blob/931bc2d8a9e8e8595bb49034c03c14e2b15be865/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ratis/OzoneManagerStateMachine.java#L521-L541).
+This logic is implemented in the `OzoneManagerStateMachine.notifyInstallSnapshotFromLeader()`;
+see the [code](https://github.com/apache/ozone/blob/ozone-2.0.0/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ratis/OzoneManagerStateMachine.java#L520-L531)
+in Release 2.0.0.

-In most scenarios, stale OMs will recover automatically, even if they have missed a large number of operations. Manual intervention (such as running `ozone om --bootstrap`) is only required when adding a new OM node to the cluster or when explicitly requested by support instructions.
+In most scenarios, stale OMs will recover automatically, even if they have missed a large number of operations.
+Manual intervention (such as running `ozone om --bootstrap`) is only required when adding a new OM node to the cluster
+or when explicitly requested by support instructions.
+
+**Important Note on Ozone Manager (OM) Disk Space for Snapshots**
+
+When an Ozone Manager (OM) acts as a follower in an HA setup, it downloads snapshot tarballs from the leader to its
+local metadata directory. Therefore, always ensure your OM disks have at least 2x the current OM database size to
+accommodate the existing data and incoming snapshots, preventing disk space issues and maintaining cluster stability.
 ## References

 * Check [this page]({{< ref "design/omha.md" >}}) for the links to the original design docs
 * Ozone distribution contains an example OM HA configuration, under the `compose/ozone-om-ha` directory which can be tested with the help of [docker-compose]({{< ref "start/RunningViaDocker.md" >}}).
-* [Apache Ratis State Machine API documentation](https://github.com/apache/ratis/blob/3612bcaf7d3e48a658935fc8b250e5d3b35df174/ratis-server-api/src/main/java/org/apache/ratis/statemachine/StateMachine.java)
+* [Apache Ratis State Machine API documentation](https://github.com/apache/ratis/blob/ratis-3.1.3/ratis-server-api/src/main/java/org/apache/ratis/statemachine/StateMachine.java)

From a50028c540a812b133494a17638f84fa63803ed5 Mon Sep 17 00:00:00 2001
From: Wei-Chiu Chuang
Date: Fri, 13 Jun 2025 15:55:08 -0700
Subject: [PATCH 09/10] Update per Ivan's suggestion.

Change-Id: I562b0b936121c7a912a87ee5d8775ac2a7ad4b9d
---
 hadoop-hdds/docs/content/feature/OM-HA.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/hadoop-hdds/docs/content/feature/OM-HA.md b/hadoop-hdds/docs/content/feature/OM-HA.md
index bc82eec1030d..4eb7d8293a7f 100644
--- a/hadoop-hdds/docs/content/feature/OM-HA.md
+++ b/hadoop-hdds/docs/content/feature/OM-HA.md
@@ -144,8 +144,7 @@ see the [code](https://github.com/apache/ozone/blob/ozone-2.0.0/hadoop-ozone/ozo
 in Release 2.0.0.

 In most scenarios, stale OMs will recover automatically, even if they have missed a large number of operations.
-Manual intervention (such as running `ozone om --bootstrap`) is only required when adding a new OM node to the cluster
-or when explicitly requested by support instructions.
+Manual intervention (such as running `ozone om --bootstrap`) is only required when adding a new OM node to the cluster.
 **Important Note on Ozone Manager (OM) Disk Space for Snapshots**

From a0290f389890dd4e80e20b1e1854e0f33f20edd5 Mon Sep 17 00:00:00 2001
From: Wei-Chiu Chuang
Date: Fri, 13 Jun 2025 20:31:48 -0700
Subject: [PATCH 10/10] Explain Raft Snapshot vs Ozone Snapshot

Change-Id: I2bca7665ff35e44a912ef57b0c80c0ee1d13db5b
---
 hadoop-hdds/docs/content/feature/OM-HA.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hadoop-hdds/docs/content/feature/OM-HA.md b/hadoop-hdds/docs/content/feature/OM-HA.md
index 4eb7d8293a7f..7eb83c5e5302 100644
--- a/hadoop-hdds/docs/content/feature/OM-HA.md
+++ b/hadoop-hdds/docs/content/feature/OM-HA.md
@@ -143,6 +143,8 @@ This logic is implemented in the `OzoneManagerStateMachine.notifyInstallSnapshot
 see the [code](https://github.com/apache/ozone/blob/ozone-2.0.0/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ratis/OzoneManagerStateMachine.java#L520-L531)
 in Release 2.0.0.

+Note that this `Raft Snapshot`, used for OM HA state synchronization, is distinct from `Ozone Snapshot`, which is used for data backup and recovery purposes.
+
 In most scenarios, stale OMs will recover automatically, even if they have missed a large number of operations.
 Manual intervention (such as running `ozone om --bootstrap`) is only required when adding a new OM node to the cluster.
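
The four "How it works" steps documented in this patch series can be sketched as a toy model. This is only an illustration in Python, not Ratis or Ozone code: `ToyLeader`, `ToyFollower`, and `catch_up` are hypothetical names, and the sketch assumes the leader has already purged every log entry at or below its snapshot index, so a follower that is behind that index cannot be served by log replay.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLeader:
    """Hypothetical leader: a snapshot plus the log entries written after it."""
    snapshot_index: int      # last log index covered by the snapshot
    snapshot_state: dict     # key/value state as of snapshot_index
    # index -> (key, value); assumed to hold only indices > snapshot_index
    log: dict = field(default_factory=dict)


@dataclass
class ToyFollower:
    """Hypothetical follower: applied state and the last index it applied."""
    last_applied: int = 0
    state: dict = field(default_factory=dict)


def catch_up(leader: ToyLeader, follower: ToyFollower) -> str:
    """Bring the follower up to date and report which path was used."""
    path = "log-replay"
    # Steps 1-2: the entries the follower needs next were purged, so the
    # leader decides the follower is too far behind and requests a snapshot.
    if follower.last_applied < leader.snapshot_index:
        # Step 3: the follower replaces its state with the leader's snapshot.
        follower.state = dict(leader.snapshot_state)
        follower.last_applied = leader.snapshot_index
        path = "snapshot-install"
    # Step 4: normal log replication resumes from the new state.
    for index in sorted(leader.log):
        if index > follower.last_applied:
            key, value = leader.log[index]
            follower.state[key] = value
            follower.last_applied = index
    return path


if __name__ == "__main__":
    leader = ToyLeader(100, {"vol1": "v100"}, {101: ("vol2", "a")})
    stale = ToyFollower(last_applied=40, state={"vol1": "v40"})
    print(catch_up(leader, stale), stale.last_applied)
```

A follower that is only a few entries behind takes the `log-replay` path and never copies the snapshot, which matches the patch's statement that snapshot installation is reserved for followers that cannot catch up entry by entry.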