
Conversation

@ChenSammi
Contributor

@ChenSammi commented Nov 28, 2024

What changes were proposed in this pull request?

Currently "hdds.datanode.failed.data.volumes.tolerated", "hdds.datanode.failed.metadata.volumes.tolerated" and "hdds.datanode.failed.db.volumes.tolerated" all have default "-1" value, means unlimited, as long as there is a good volume left.

HDFS has a corresponding property, "dfs.datanode.failed.volumes.tolerated", whose default is 0, meaning any volume failure causes the datanode to shut down.

It would be better to have a more conservative default than the current unlimited value.
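
For illustration, here is a minimal sketch of how a "failed volumes tolerated" value of this kind is typically interpreted. This is not the actual Ozone code; the class and method names are made up for this example.

```java
/** Illustrative sketch only; not code from Ozone. */
public final class VolumeToleranceExample {

  /**
   * Decide whether the datanode should shut down, given the number of failed
   * and healthy volumes and the configured "failed volumes tolerated" value.
   */
  static boolean shouldShutdown(int failedVolumes, int goodVolumes, int tolerated) {
    if (tolerated < 0) {
      // -1: unlimited failures tolerated, as long as one good volume remains
      return goodVolumes == 0;
    }
    // 0: any volume failure triggers shutdown; N: tolerate up to N failed volumes
    return failedVolumes > tolerated;
  }

  public static void main(String[] args) {
    // Current default (-1): 19 failed disks out of 20 still leave the DN running.
    System.out.println(shouldShutdown(19, 1, -1)); // false
    // Proposed default (0): a single failed disk shuts the DN down.
    System.out.println(shouldShutdown(1, 19, 0));  // true
  }
}
```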

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11770

How was this patch tested?

Existing unit tests.

@adoroszlai
Contributor

What about @errose28's comment on the task?

We should not make the default behavior that a datanode shuts down on any volume failure, that defeats the purpose of having multiple volumes per node.

@ChenSammi
Contributor Author

ChenSammi commented Nov 28, 2024

What about @errose28's comment on the task?

We should not make the default behavior that a datanode shuts down on any volume failure, that defeats the purpose of having multiple volumes per node.

We change the default value to 0; users can still modify it to another value, for example "1", based on their own needs. The current unlimited default of -1 is really not appropriate. Imagine a datanode with 20 disks: with -1, even if 19 disks fail the datanode keeps running, and the administrator will not know unless they happen to be paying attention. Given the recent data loss issue caused by a volume failure, I think we should be conservative about these configurations regarding volume failures. We should fail fast, alert the cluster administrators, preserve the failure context, find the root cause, and fix the problem as early as possible.

@sodonnel
Contributor

Shutting down a 20-disk DN due to one disk failure is not a good default in my view. The disruption caused by the immediate loss of a large node is significant and will result in a lot of needless replication.

Let's say the disk has a legitimate hardware failure and needs to be replaced. Many of these hosts have hot-swappable drives, so the node could probably run on for days or weeks until the admins decide to remedy all the failed drives in the cluster. Disk failures are expected to happen somewhat regularly, so we should handle them much more gracefully.

I'd probably suggest a default along the lines of "shut down the DN if 50%-75% of the configured drives fail", but there is also an argument for keeping it running with only a single disk left. The DNs should be monitored for disk failures, and failures should be investigated; it should not require a DN shutdown to make that happen.
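
A rough, hypothetical sketch of that idea (a fraction-based threshold is not an existing Ozone option; the names and the 50% figure below are only illustrative):

```java
/** Hypothetical sketch of a fraction-based shutdown threshold; not an Ozone feature. */
public final class FractionToleranceExample {

  /** Shut down only once more than the given fraction of configured volumes has failed. */
  static boolean shouldShutdown(int failedVolumes, int configuredVolumes, double maxFailedFraction) {
    if (configuredVolumes <= 0) {
      return false;
    }
    return (double) failedVolumes / configuredVolumes > maxFailedFraction;
  }

  public static void main(String[] args) {
    // 1 failed disk out of 20 is below a 50% threshold, so the DN keeps running.
    System.out.println(shouldShutdown(1, 20, 0.5));  // false
    // 11 failed disks out of 20 exceed 50%, so the DN shuts down.
    System.out.println(shouldShutdown(11, 20, 0.5)); // true
  }
}
```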

@adoroszlai
Contributor

See also #7505.

@errose28
Contributor

errose28 commented Dec 2, 2024

We change the default value to 0; users can still modify it to another value, for example "1", based on their own needs.

We should strive for intuitive configurations out of the box to help usability as much as possible. IMO -1 is expected behavior and will not lead to a surprise node shutdown.

with -1, even if 19 disks fail the datanode keeps running, and the administrator will not know unless they happen to be paying attention.

The DNs should be monitored for disk failures, and failures should be investigated; it should not require a DN shutdown to make that happen.

This is the real problem: we do not have a good alerting system for disk failures. A new Recon page (HDDS-11840), dashboards, and changes like #7266 can all remediate this problem without introducing potentially surprising config changes.

@slfan1989
Contributor

Everyone's points are reasonable, and I understand their perspectives. However, I agree more with @errose28's viewpoint. I believe keeping this value at -1 is the better choice, as users should not face the risk of unexpected DN shutdowns.

Our largest cluster has 1,500 machines, and every day some machines experience disk failures. If a DN shuts down outright, my first reaction as an administrator is panic. It takes time to locate the relevant logs, and DN logs are often quite large.

I think we should provide a more detailed description for this configuration: if set to -1, the DN will never shut down due to volume failures as long as at least one good volume remains; if set to a specific number, it indicates how many disk failures can be tolerated before the DN shuts down. Ultimately, users should be allowed to choose based on their specific environment.

Additionally, the situation mentioned by @sodonnel is also reasonable. We currently have a special type of machine with 60 data disks. In the event of a disk failure, we opt for the hot repair method, which means replacing the disk without shutting down. This is because shutting down would trigger a large amount of container replication, potentially involving tens of thousands of containers. Currently, we only perform a shutdown for repairs in the case of CPU failure, memory failure, or system disk failure.

Our current strategy is to configure the system to tolerate a single disk failure and perform daily routine inspections. Once a machine with a disk failure is identified, we quickly carry out repairs. Therefore, I personally believe that we only need to improve the comments for this configuration, clearly describing the potential risks involved.

The above is my understanding as a user, and I would also like to hear more thoughts from other members of the community.

@errose28
Contributor

errose28 commented Dec 4, 2024

This is because shutting down would trigger a large amount of container replication, potentially involving tens of thousands of containers.

Just as a side note in case you haven't tried it yet: Ozone does have a maintenance mode for datanodes, similar to HDFS, which maintains availability but not redundancy. When a node is put into maintenance mode, replication is only triggered if data would become unreadable without that node, for example singly-replicated data. In this state you can shut down a node for a period of time with little or no replication happening, if you are OK with temporarily having one data copy offline.

@ChenSammi
Contributor Author

Thank you all for your feedback. It looks like every approach has its pros and cons, and @slfan1989 provided very valuable information from real cluster maintenance experience. So let's keep the current default value for now. @adoroszlai @sodonnel @errose28 @slfan1989

@ChenSammi closed this Dec 6, 2024