HDDS-8387. Improved Storage Volume Handling in Datanodes #8405
---
title: Improved Storage Volume Handling for Ozone Datanodes
summary: Proposal to add a degraded storage volume health state in datanodes.
date: 2025-05-06
jira: HDDS-8387
status: draft
author: Ethan Rose, Rishabh Patel
---
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
# Improved Storage Volume Handling for Ozone Datanodes

## Background
Currently Ozone uses two health states for storage volumes: **healthy** and **failed**. A volume scanner runs on each datanode to determine whether a volume should be moved from a **healthy** to a **failed** state. Once a volume is failed, all container replicas on that volume are removed from tracking by the datanode and considered lost. Volumes cannot return to a healthy state after failure without a datanode restart.
This model only works for hard failures, but in practice most volume failures are soft failures. Disk issues manifest in a variety of ways, and minor problems usually appear before a drive fails completely. The current approach to volume scanning and health classification does not account for this. If a volume is starting to exhibit signs of failure, the datanode has only two options:

- Fail the volume
  - In many cases the volume may still be mostly or partially readable. Containers on this volume that were still readable would be removed by the system and have their redundancy reduced unnecessarily. This is not a safe operation.
- Keep the volume healthy
  - Containers on this volume will not have extra copies made until the container scanner finds corruption and marks them unhealthy, after which redundancy has already been lost.

For the common case of soft volume failures, neither of these is a good option. This document outlines a proposal to classify and handle soft volume failures in datanodes.
> **Contributor:** Can you explain a little bit more about what kind of failure is categorized as a hard failure, and what kind of failure will be treated as a soft failure? Some examples would be helpful for understanding the goal of this proposal.
>
> **Author:** Specifications are provided later in the doc. Do you still have questions after finishing the document?
>
> **Contributor:** I found these two. Did I miss anything else? What will be the recommended (default) threshold values for the degraded and failed states, and what will be the default sliding window duration mentioned? An explanation of why we chose these default values would also be helpful.
>
> **Author:** Yes, there is a mapping of which checks correspond to which sliding window defined in the Sliding Window section, and when the threshold of the window is crossed, the state is changed. Defining the specific thresholds for the windows is going to take some thought, so for now I've left that detail to one of the tasks in the Task Breakdown section. If we are able to decide on this earlier, we can specify the initial recommendation in the doc as well.
>
> **Contributor:** So here is a situation: I hit a bad sector, and an IO error is reported, which triggers an on-demand scan; the value of X is incremented. Now, in the current behavior, RM replicates the good replicas from other sources immediately, so full durability is restored by the system.
>
> **Contributor:** When the first on-demand container scan is triggered, we could speed up the degraded/failed state detection of a disk by throttling up the background volume scanner. This would reduce the time required to satisfy the sliding window criteria at the expense of operational reads and increased IO. The durability of data is a priority. One of the points discussed was that the replication manager changes required for acting upon a degraded volume would align with the changes required for a volume-decommissioning feature. As a result, this proposal suggests taking on the replication manager changes as the next step. An alternative would be to first have a simplified detection of the degraded state and improve the existing replication manager's actions to consider the new degraded volume state when replicating. Improving the detection of the degraded state and decommissioning of volumes could be done at a later stage. What do you think @errose28?
>
> **Author:** This would still happen in the proposed model. There are no proposed changes to the replication manager or container states in this document. I think there is some confusion between the on-demand container scanner and the on-demand volume scanner here as well. The on-demand container scanner will be triggered when a bad sector is read within a container, and if that read fails it will mark the container unhealthy, triggering the normal replication process. There is no sliding window for the on-demand container scanner. What is proposed in this doc is that if the on-demand container scanner marks a container unhealthy, it should also trigger an on-demand volume scan. Each on-demand volume scan requested adds a count towards the degraded state sliding window of that volume. If there is only one copy of a container then it is already under-replicated and RM will copy from this volume as long as it is not failed. This doc does not propose any changes here.
## Proposal

This document proposes adding a new volume state called **degraded**, which will correspond to partially failed volumes. Handling degraded volumes can be broken into two problems:

- **Identification**: Detecting degraded volumes and alerting via metrics and reports to SCM and Recon
- **Remediation**: Proactively making copies of data on degraded volumes and preventing new writes before the volume completely fails

This document is primarily focused on identification, and proposes handling remediation with a volume decommissioning feature that can be implemented independently of volume health state.
### Identification of Degraded Volumes

Ozone has access to the following checks from the volume scanner to determine volume health. Most of these checks are already present.

#### Directory Check

This check verifies that a directory exists at the specified location for the volume, and that the datanode has read, write, and execute permissions on the directory.
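As a rough illustration, the directory check amounts to something like the following sketch using standard `java.nio.file` calls. The class and method names are invented for this example and are not the actual Ozone implementation.

```java
import java.nio.file.Files;
import java.nio.file.Path;

final class DirectoryCheckSketch {
  /**
   * Illustrative version of the directory check: the volume root must exist
   * as a directory and be readable, writable, and executable by the datanode.
   */
  static boolean directoryCheckPasses(Path volumeRoot) {
    return Files.isDirectory(volumeRoot)
        && Files.isReadable(volumeRoot)
        && Files.isWritable(volumeRoot)
        && Files.isExecutable(volumeRoot);
  }
}
```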
#### Database Check

This check only applies to container data volumes (called `HddsVolumes` in the code). It checks that a new read handle can be acquired for the RocksDB instance on that volume, in addition to the write handle the process is currently holding. It does not use any RocksDB APIs that do individual SST file checksum validation, such as paranoid checks. Corruption within individual SST files will only affect the keys in those files, and RocksDB verifies checksums for individual keys on each read. This makes SST file checksum errors isolated to a per-container level; they will be detected by the container scanner and cause the affected container to be marked unhealthy.
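For reference, acquiring a second, read-only handle to an already-open RocksDB instance looks roughly like this with the RocksDB Java API. This is a sketch only; the class name, method name, and path parameter are placeholders, and the real check lives in the datanode's volume scanner.

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

final class DatabaseCheckSketch {
  static {
    RocksDB.loadLibrary();
  }

  /**
   * Try to open a read-only handle to the volume's container DB while the
   * process keeps its existing write handle open. Failure to open counts as
   * a failed check. No per-SST checksum validation (paranoid checks) is
   * requested here.
   */
  static boolean databaseCheckPasses(String containerDbPath) {
    try (Options options = new Options();
         RocksDB readHandle = RocksDB.openReadOnly(options, containerDbPath)) {
      return true;
    } catch (RocksDBException e) {
      return false;
    }
  }
}
```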
#### File Check

This check runs the following steps:

1. Generates a fixed amount of data and keeps it in memory
2. Writes the data to a file on the disk
3. Syncs the file to the disk to touch the hardware
4. Reads the file back to ensure the contents match what was in memory
5. Deletes the file

Of these, the file sync is the most important check, because it ensures that the disk is still reachable. This detects a dangerous condition where the disk is no longer present, but data remains readable and even writeable (if sync is not used) due to in-memory caching by the OS and file system. The cached data may cease to be reachable at any time, and should not be counted as valid replicas of the data.
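A minimal sketch of the write-sync-read-delete cycle described above is shown below. The class name, buffer size, file naming, and error handling are illustrative assumptions, not the actual scanner code.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

final class FileCheckSketch {
  /**
   * Write random data to a scratch file, force it to the device, read it
   * back, compare, and delete the file. The force (sync) step is the part
   * that actually touches the hardware.
   */
  static boolean fileCheckPasses(Path volumeTmpDir) {
    byte[] expected = new byte[64 * 1024];
    ThreadLocalRandom.current().nextBytes(expected);
    Path scratch = volumeTmpDir.resolve("volume-check-" + System.nanoTime());
    try {
      try (FileChannel ch = FileChannel.open(scratch,
          StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
        ch.write(ByteBuffer.wrap(expected));
        ch.force(true);                 // sync data and metadata to disk
      }
      byte[] actual = Files.readAllBytes(scratch);
      return Arrays.equals(expected, actual);
    } catch (IOException e) {
      return false;                     // any IO error fails the check
    } finally {
      try {
        Files.deleteIfExists(scratch);
      } catch (IOException ignored) {
        // deletion failure is ignored in this sketch
      }
    }
  }
}
```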
#### IO Error Count

This would be a new check added as part of this feature. Currently, each time datanode IO encounters an error, we request an on-demand volume scan. This should include every time the container scanner marks a container unhealthy. We can keep a counter of how many IO errors have been reported on a volume over a given time frame, regardless of whether the corresponding volume scan passed or failed. This accounts for cases that may show up on the main IO path but would otherwise not be detected by the volume scanner. For example, numerous old sectors holding existing container data may be unreadable. The volume scanner's **File Check** only uses new disk sectors, so it will still pass with these errors present, while the container scanner may be hitting many bad sectors across containers, which this check accounts for.
#### Sliding Window

Most checks will encounter intermittent issues, even on overall healthy drives, so we should not downgrade a volume's health state after just one error. The current volume scanner uses a counter-based sliding window for intermittent failures, meaning the volume will be failed if `x` out of the last `y` checks failed, regardless of when they occurred. This approach works for background volume scans, because `y` is the number of times the check ran and `x` is the number of times it failed. It does not work if we want to apply a sliding window to on-demand checks like the IO error count, which do not care whether the corresponding volume scan passed or failed.

To handle this, we can switch to time-based sliding windows to determine when a threshold of tolerable errors is crossed. For example, if a check has failed `x` times in the last `y` minutes, we should consider the volume degraded.

We can use one time-based sliding window to track errors that would cause a volume to be degraded, and a second one for errors that would cause a volume to be failed. When a check fails, it adds the result to whichever sliding window it corresponds to. We can create the following assignments of checks:
- **Directory Check**: No sliding window required. If the volume is not present based on filesystem metadata, it should be failed immediately.

> **Contributor:** A metadata volume failure is similar to a DN failure, as it will contain the Ratis dir. Not sure about other common metadata. I think we may need to define which checks are applicable.
>
> **Author:** The code already has a tolerance set for how many of each volume type must be available for a datanode to continue running, and for the metadata directory it is set to 1. So if the metadata volume is marked as failed, the datanode stops running. Do we need any more specific checks for this volume type?

- **Database Check**: On failure, add an entry to the **failed health sliding window**

> **Contributor:** Similarly, there is another volume type, the DB volume: its container db may be on a separate volume. This is an optional configuration. I have not checked the details, but it may have some impact on defining the rules.
>
> **Author:** Yes, that feature is not widely used because it creates a single point of failure for all data on the datanode. We could add a database check for this volume type as well for completeness.

- **File Check**:
  - If the sync portion of the check fails, add an entry to the **failed health sliding window**
  - If any other part of this check fails, add an entry to the **degraded health sliding window**

> **Contributor:** Need to define what we are adding as part of this.
>
> **Author:** This is noted in the section above already, explaining that the current implementation is based on counters, not time, and why we should switch to a time-based implementation.
>
> The counter would increment for each request for an on-demand volume scan. I can clarify this in the "IO Error Count" section above, but it is defined below as well. Locations for on-demand volume scanning are already present in the current code, although we may want to review where they have been placed.
>
> I'm still working on the proposed config changes and plan to update the doc later today with that information, but with a shorter minimum volume scan gap (like 1 minute vs the current 15 minutes), the requests for on-demand volume scans should handle this. For example, say the DB has become completely inaccessible and our default configs are:
>
> All IO, including the container scanner which needs to read DB metadata, would flag this and trigger an on-demand volume scan which would execute the DB check. Since all ops are failing, this means a volume with a completely inaccessible DB would still be failed in 3 minutes.

- **IO Error Count**: When an on-demand volume scan is requested, add an entry to the **degraded health sliding window**
> **Contributor:** This is how the container scanner informs the volume scanner of a problem, correct?
>
> **Author:** We currently have code that triggers on-demand volume scans in
>
> There is still a time-based component in this suggestion: datanode uptime. A very long running datanode will eventually hit X even on a healthy volume. Creating a fixed time in the sliding window normalizes for this.
>
> This is a tricky problem, and I'm not sure I have a good heuristic right now. But we should note it is not unique to this proposal. Even the current volume scanner uses a counter-based sliding window where 2/3 of the last checks must have passed to fail a volume. The only other option is to fail a volume on a single IO error, which would be too aggressive. Even a healthy disk is going to have some IO bumps occasionally.
>
> **Contributor:** @errose28 In the current design, the definition and tracking of I/O errors are critical for evaluating disks in borderline states. These disks often exhibit intermittent failures, functioning normally during certain time windows and frequently failing during others. Should we consider further optimizing the configuration of the sliding window mechanism to avoid repeatedly triggering the degraded state due to error fluctuations that have not yet escalated, thereby preventing unnecessary data replication or alerts?
>
> **Contributor:** In addition, I have another concern: certain administrative operations can themselves contribute to performance degradation in datanodes. Tasks such as disk scanning and data recovery introduce additional I/O overhead, especially when the disk is already under stress. What is the current time interval configured for the sliding window? If the interval is too short, it may lead to frequent state changes due to temporary fluctuations. If it's too long, it might delay fault detection and cause us to miss the optimal window for intervention. Would it be possible to introduce a pre-warning mechanism that can proactively detect potential disk degradation based on performance trends, before actual failure thresholds are reached? For example, if a disk's read/write latency or throughput is significantly worse than other disks on the same node, could the system flag it as "performance abnormal" or "under observation" and trigger an alert? This would allow administrators to review and decide whether to manually degrade or replace the disk. Such proactive handling may be more effective than waiting for hard errors to trigger degradation, especially in environments where soft failures and high node load are common.
>
> **Author:** These are very good practical points, thanks for adding them. Let me try to address each one.
>
> 1. The sliding window of the degraded state is intended to deal with this exact situation: intermittent errors that are not enough to escalate to full volume failure. Degraded is the only state that can move back to healthy, so there would be no fluctuation of volumes from failed to healthy triggering re-replication, only possible fluctuation from degraded to healthy. In this case it just provides more monitoring options. The current system provides no optics into intermittent volume errors, so it is as if all of these types of alerts are ignored. If the concern is spurious alerts, then alerting can be ignored for degraded volume metrics, which puts it on par with the current system. The sliding windows can also be tuned to adjust how sensitive the disk is to health state changes.
> 2. This is a good point. Right now such situations may cause the volume to be marked as degraded for alerting purposes, but should not fail the volume. Container import/export and the container scanner can have their bandwidth throttled with configs if those operations themselves are burdening the node to the point where it is unhealthy.
> 3. Yes, I will add a proposal for specific values in the document, although it will be tricky to pick a "best" value. I'm still working on this area and will update the doc soon.
> 4. This would be a good detection mechanism, but I'm not sure it needs to be handled within Ozone. Ozone can and should report issues it sees while operating, but IO wait can be detected by other systems like smartctl, iostat, and the Prometheus node exporter. We don't need to re-invent the wheel within Ozone when we have these dedicated tools available.
>
> **Contributor:** Thank you very much for the detailed response; it gave me a much clearer understanding of the design logic behind this feature. Overall, it seems that your considerations are already quite thorough, and I'm looking forward to seeing it implemented. I'd also like to add a few thoughts from my side:
>
> 1. Regarding point 1 on optimizing the sliding window mechanism: I fully agree with your explanation, and it's clear that the current design addresses intermittent errors. However, I do have a follow-up question: how exactly does a volume transition from degraded to failed? Is there a clearly defined threshold or set of criteria for this transition?
> 2. Regarding point 2: your explanation has largely addressed my concerns. However, I'm wondering if we could take it a step further by supporting dynamic configuration of bandwidth limits for these operations. In real-world scenarios, we've observed cases where disk scanning introduced I/O pressure that affected normal read/write performance. Allowing bandwidth limits to be adjusted at runtime based on node load could help better balance stability and performance.
> 3. Regarding point 3: I fully understand that it's difficult to define a value, as disk usage patterns and environments can vary significantly across deployments. That said, I believe it would be helpful to include a clear explanation in the documentation. Speaking from personal experience, when I come across a critical configuration parameter, I really appreciate seeing a detailed description, for example how increasing or decreasing the value would affect system behavior. This kind of guidance makes it much easier for users to understand the design rationale and make informed tuning decisions.
> 4. Regarding point 4: I think you raised a great point, and I generally agree with your approach. However, I'd like to offer an additional perspective. While there are indeed many external tools available for monitoring I/O performance, relying entirely on them can lead to a fragmented view of system health. Monitoring data becomes scattered across multiple sources, and I personally believe it would be more effective if Ozone could provide some built-in, conclusive metrics to help assess disk health directly, rather than requiring SREs to piece together information from various systems to make a judgment. I've experienced this challenge firsthand. When users report performance issues in Ozone, especially in scenarios where performance is critical, I often find myself digging through different metrics and dashboards to locate the root cause. This process is time-consuming and mentally taxing. If Ozone could consolidate key signals and present them in a unified way, it would significantly improve troubleshooting efficiency and reliability. Take I/O performance as an example: we can retrieve read/write latency or throughput data simply by reading certain system files. This doesn't require much effort or any complex tooling. In fact, I've already made some progress on this in #7273, where I exposed some of these metrics directly through Ozone's built-in metrics system. This kind of integration is much more intuitive, centralized, and operationally helpful.
>
> **Contributor:** Overall, this upgrade proposal is already quite comprehensive, and I'm happy to give it a +1.
Once a sliding window crosses its threshold of errors over a period of time, the volume is moved to that health state.
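A minimal sketch of such a time-based sliding window is shown below. The class name, the `maxFailures` and `window` parameters, and the method names are assumptions for illustration; the actual thresholds are deliberately left to the Task Breakdown.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical time-based sliding window: the threshold is crossed when at
 * least maxFailures failures have been recorded within the last window of
 * wall-clock time, regardless of how many checks ran in that period.
 */
public class TimeBasedSlidingWindow {
  private final int maxFailures;
  private final Duration window;
  private final Deque<Instant> failures = new ArrayDeque<>();

  public TimeBasedSlidingWindow(int maxFailures, Duration window) {
    this.maxFailures = maxFailures;
    this.window = window;
  }

  /** Record one failed check (or one on-demand scan request for IO errors). */
  public synchronized void recordFailure() {
    failures.addLast(Instant.now());
    expireOldEntries();
  }

  /** True if the failure threshold is currently crossed. */
  public synchronized boolean isThresholdCrossed() {
    expireOldEntries();
    return failures.size() >= maxFailures;
  }

  private void expireOldEntries() {
    Instant cutoff = Instant.now().minus(window);
    while (!failures.isEmpty() && failures.peekFirst().isBefore(cutoff)) {
      failures.removeFirst();
    }
  }
}
```

With this shape, the volume scanner could keep two instances per volume, for example a more lenient one for the degraded window and a stricter one for the failed window; the specific values here are placeholders, not a recommendation.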
#### Summary of Volume Health States

Above we have defined each type of check the volume scanner can run and mapped each one to a sliding window that corresponds to a volume health state. We can use this information to finish defining the requirements for each volume health state. This document does not propose adding persistence for volume health states, so all volumes will return to healthy on restart until checks move them to a different state.
##### Healthy

Healthy is the default state for a volume. Volumes remain in this state until health tests fail and move them to a lower health state.

- **Enter**:
  - **On Startup**: If permissions are valid and a RocksDB write handle is successfully acquired.
  - **While Running**: N/A. Volumes cannot transition into the healthy state after startup.
- **Exit**: When tests for one of the other states fail.
- **Actions**: None
##### Degraded

Degraded volumes are still reachable, but report numerous IO errors when interacting with them. For identification purposes, we can escalate this to SCM, Recon, and metrics, which will allow admins to decide whether to leave the volume in place or remove it. Identification of degraded volumes will be based on errors reported by ongoing datanode IO. This includes the container scanner, to ensure that degraded volumes can still be detected even on idle nodes. The container scanner will continue to determine whether individual containers are healthy, and the volume scanner will still run to determine if the volume needs to be moved to **failed**. **Degraded** volumes may move back to **healthy** if IO errors are no longer being reported. The history of a volume moving from healthy to degraded and back can be tracked in a metrics database like Prometheus.

- **Enter**:
  - **On Startup**: N/A. Degraded state is not persisted to disk.
  - **While Running**:
    - Failure threshold of the **degraded volume sliding window** is crossed.
- **Exit**:
  - **On Startup**:
    - The volume will return to **healthy** until re-evaluation is complete.
  - **While Running**:
    - When tests for the **failed** state indicate the volume should be failed.
    - When the failure threshold of the **degraded volume sliding window** is no longer crossed, the volume will be returned to **healthy**.
- **Actions**:
  - Datanode publishes metrics indicating the degraded volume
  - Volume state in the storage report to SCM and Recon is updated
  - Container scanner continues to identify containers that are individually unhealthy
  - Volume scanner continues to run all checks to determine whether the volume should be failed

> **Contributor:** Can we define the time window of the check, if it is different from the existing checker? And do all types of failure follow the same time window or different ones?
>
> **Author:** The time window would be configurable separately for the degraded and failed sliding windows. This gives us a way to tweak our sensitivity to disk issues. The defaults could use the same value, though. I'm not sure which component you are referring to as the "existing checker". If it's the background volume scanner, the interval it runs at would still be configured separately from the sliding window interval. There is somewhat of a dependency here which is called out at the end of the doc.
>
> I don't yet have a proposal for specific values; it is something we would need to discuss and can add in this document, since there are a lot of questions in this area.
##### Failed

Failed volumes are completely inaccessible. We are confident that there is no way to retrieve any container data from this volume, and all replicas on the volume should be reported as missing.

- **Enter**:
  - **On Startup**:
    - A RocksDB write handle cannot be acquired. This check is a superset of checking that the volume directory exists.
  - **While Running**:
    - Failure threshold of the **failed volume sliding window** is crossed.
- **Exit**:
  - **On Startup**:
    - Entry conditions will be re-evaluated and the volume may become **healthy** or be **failed** again.
  - **While Running**:
    - N/A. **Failed** is a terminal state while the node is running.
- **Actions**:
  - Container scanner stops running
  - Volume scanner stops running
  - All containers on the volume are removed from datanode memory
  - All containers on the volume are reported missing to SCM
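One possible way to express these three states and their legal transitions is sketched below. The enum and method names are invented for illustration and do not reflect how the existing Ozone code models volume state.

```java
/** Hypothetical model of the proposed volume health states. */
enum VolumeHealth {
  HEALTHY, DEGRADED, FAILED;

  /** FAILED is terminal while the node is running; DEGRADED may recover. */
  boolean canTransitionTo(VolumeHealth next) {
    switch (this) {
      case HEALTHY:  return next == DEGRADED || next == FAILED;
      case DEGRADED: return next == HEALTHY || next == FAILED;
      case FAILED:   return false;
      default:       return false;
    }
  }
}
```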
#### Volume Health and Disk Usage

Ideally, volumes that are completely full should not be considered degraded or failed. Like the rest of the datanode, the scanners will depend on the [VolumeUsage](https://github.com/apache/ozone/blob/2a1a6bf124007821eb68779663bbaad371ea668f/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/VolumeUsage.java#L97) API returning reasonably accurate values to assess whether there is enough space to perform checks. **File Check**, for example, must be skipped if there is not enough space to write the file. Additionally, volume checks will depend on the space reservation feature to ensure that the RocksDB instance has enough room to open files in write mode and run compactions.

In the future, we may want to handle completely full volumes by opening the RocksDB instance in read-only mode and only allowing read operations on the volume. While this may be a nice feature to have, it is more complicated to implement due to its effects on the delete path. This could have a ripple effect, because SCM depends on replicas to ack deletes before it can proceed with operations like deleting empty containers. For simplicity, the current proposal considers failure to obtain a DB write handle on startup a full volume failure and leaves read-only volumes as a separate feature.
#### Identifying Volume Health with Metrics

To identify degraded and failed volumes through metrics that can be plugged into an alerting system, we can expose counters for each volume health state per datanode: `NumHealthyVolumes`, `NumDegradedVolumes`, and `NumFailedVolumes`. When the counter for degraded or failed volumes goes above 0, it can be used to trigger an alert about an unhealthy volume on a datanode. This saves us from having to expose health state (which does not easily map to a numeric counter or gauge) as a per-volume metric in [VolumeInfoMetrics](https://github.com/apache/ozone/blob/536701649e1a3d5fa94e95888a566e213febb6ff/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/VolumeInfoMetrics.java#L64). Once the alert is raised, admins can use the command line to determine which volumes are concerning.
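A hedged sketch of what these per-datanode gauges could look like using the Hadoop metrics2 annotations that Ozone's existing datanode metrics build on. The class name, helper methods, and context value are assumptions; only the gauge names come from this proposal.

```java
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.MutableGaugeInt;

/** Sketch: per-datanode gauges for each volume health state. */
@Metrics(about = "Datanode volume health", context = "dfs")
public class VolumeHealthMetricsSketch {
  // Gauges are instantiated when this source is registered with the MetricsSystem.
  @Metric("Number of healthy volumes") private MutableGaugeInt numHealthyVolumes;
  @Metric("Number of degraded volumes") private MutableGaugeInt numDegradedVolumes;
  @Metric("Number of failed volumes") private MutableGaugeInt numFailedVolumes;

  void onVolumeDegraded() {
    numHealthyVolumes.decr();
    numDegradedVolumes.incr();
  }

  void onVolumeFailed(boolean wasDegraded) {
    if (wasDegraded) {
      numDegradedVolumes.decr();
    } else {
      numHealthyVolumes.decr();
    }
    numFailedVolumes.incr();
  }
}
```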
#### Identifying Volume Health with the Command Line

The `ozone admin datanode` command currently lacks any specific information about volumes. It is not clear where to put this information, since the command's layout is atypical of other Ozone commands. Most commands follow the pattern of a `list` subcommand which gives a summary of each item, followed by an `info` command to get more detailed information. See `ozone admin container {info,list}` and `ozone sh {volume,bucket,key} {info,list}`. The datanode CLI instead provides two relevant subcommands:

- `ozone admin datanode list`, which provides very verbose information about each datanode, some of which could be saved for an `info` command
- `ozone admin datanode usageinfo`, which provides only information about total node capacity, in addition to verbose information about the original node when `--json` is used.
  - This command adds further confusion because it seems to be intended to provide info on one node, but then provides options to obtain a list of most and least used nodes, changing it to a list-type command.

This current CLI layout splits a dedicated `info` command between two commands that are each `list/info` hybrids, which makes it difficult to add new datanode information like we want for volumes. To improve the CLI in a compatible way, we can take the following steps:

1. Add sorting flags to `ozone admin datanode list` that allow sorting by node capacity.
2. Add counters for `totalVolumes` and `healthyVolumes` to each entry in `ozone admin datanode list`.
   - This allows filtering for datanodes that have `healthyVolumes < totalVolumes` from a single place using `jq`.
   - We could optionally add a flag to the `list` command to only show nodes with unhealthy volumes.
3. Add an `ozone admin datanode info` command that gives all information about a datanode in a verbose JSON format.
   - This includes its total usage info, per-volume usage info, and volume health.
4. Deprecate `ozone admin datanode usageinfo`.
   - All its functionality should now be covered by `ozone admin datanode {list,info}`.
### Remediating Degraded Volumes

Once degraded volumes can be reported to SCM, it is possible to take action for the affected data on that volume. When a degraded volume is encountered, we should:

- Stop writing new data onto the volume
- Make additional copies of the data on that volume, ideally using other nodes as sources if possible
- Notify the admin when all data from the volume has been fully replicated so it is safe to remove

> **Contributor:** What's the extra benefit of keeping a volume as degraded instead of failed, if both states' post-processing is replicating all data on the volume?
>
> **Author:** This is specified in the beginning of the doc. It is the same difference as between decommissioning a node and just shutting it down.
>
> **Contributor:** Are we going to explicitly exclude reading data from these degraded-state volumes?
>
> **Author:** No, the point of the degraded state is to avoid availability or durability issues with healthy data. I've updated the doc to specify that reading from a degraded volume is supported.
>
> **Contributor:** Another possibly simpler option, but not without its own challenges, would be to use a variation of the disk balancer to copy "as much as possible" off the old disk onto other disks on the same node, then fail the disk and let the hopefully few containers with issues be replicated via the normal path. This does get into just how bad the disk is: if most reads are failing or are very slow, this is probably unworkable, but then the disk is probably also affecting normal cluster read ops.
>
> **Contributor:** Yes, this shows another way to handle a degraded volume quickly.
>
> **Author:** Yes, this is getting into the decommissioning implementation, which is not really in scope for this document. This section is just to demonstrate how degraded volumes could be handled by a generic volume decommissioning feature with only a loose coupling between the two features' implementations.

This summary is very similar to the process of datanode decommissioning, just at the disk level instead of the node level. Having decommissioning for each failure domain in the system (full node or disk) is generally useful, so we can implement this as a volume decommissioning feature that is completely separate from disk health. Designing this feature comes with its own challenges and is not in the scope of this document.

> **Contributor:** Maybe it's more like a DeadNodeHandler, say FailedVolumeHandler, which removes all replicas from SCM's in-memory container map for this volume and then lets the replication manager do the rest of the replication work.
>
> **Author:** Right now SCM just learns of volume failures through an FCR sent when the volume fails, which shows the containers missing. If we do end up creating a volume-to-container map in SCM, we can re-evaluate whether we want to continue using this approach on volume failures. Volume decommissioning would need some kind of handler, probably similar to the
If automated decommissioning of degraded volumes is desired, the two features could be joined with a config key that enables SCM to automatically decommission degraded volumes.
## Task Breakdown

This is an approximate task breakdown of the work required to implement this feature. Tasks are listed in order and grouped by focus area.

### Improve Volume Scanner

- Create a generic timer-based sliding window class that can be used for checks with intermittent failures.
- Create a **degraded** volume health state which can be used by the volume scanner.
- Add volume health state to `StorageReportProto` so it is received at SCM.
  - Currently we only have a boolean that says whether the volume is healthy or not.
- Determine default configurations for sliding windows and scan intervals.
  - Sliding window timeouts and scan intervals need to be set so that background scanners alone have a chance to cross the check threshold even if no foreground load is triggering on-demand scans.
  - The minimum volume scan gap is currently 15 minutes. This value can likely be reduced since volume scans are cheap, which will give us a more accurate view of the volume's health.
- Migrate existing volume scanner checks to put entries in either the **failed** or **degraded** sliding windows and move health state accordingly.
- Add an IO error count to the on-demand volume scanner, which counts towards the **degraded** sliding window.
- Trigger an on-demand volume scan when the container scanner finds an unhealthy container.
### Improve CLI

- Improve `ozone admin datanode volume list/info` for identifying volumes based on health state.

> **Contributor:** I think we need to pass a datanode ID to list volumes specific to a datanode. This is more similar to
>
> **Author:** This content is stale; I forgot to update it when I updated the CLI section. Will fix this. The proposal is for volumes to be identified through

- (Optional) Add a container-to-volume mapping in SCM to see which containers are affected by volume health.
  - This would not work for **failed** volumes, because datanodes would no longer report the replicas or the volume.
### Improve Metrics

- Add metrics for volume scanners; they currently don't have any.
  - This includes metrics for the current counts in each health state sliding window.
- Add metrics for volume health state (including the new degraded state), which can be used for alerting.
### Improve Recon Volume Overview

- Support showing information about individual volumes within a node from Recon, based on the existing storage reports received.
  - This would include capacity and health information.