
Conversation

@ChenSammi
Contributor

@ChenSammi commented Nov 28, 2024

What changes were proposed in this pull request?

Currently "hdds.datanode.failed.data.volumes.tolerated", "hdds.datanode.failed.metadata.volumes.tolerated" and "hdds.datanode.failed.db.volumes.tolerated" all have default "-1" value, means unlimited, as long as there is a good volume left.

HDFS has a corresponding property, "dfs.datanode.failed.volumes.tolerated", whose default is 0, meaning any volume failure causes the datanode to shut down.

It would be better to have a more conservative default than the current unlimited value.
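
For illustration, here is a minimal sketch of how a "failed volumes tolerated" value of this kind is typically interpreted. This is not the actual Ozone code; the class and method names are made up for this example.

```java
/** Illustrative sketch only; not code from Ozone. */
public final class VolumeToleranceExample {

  /**
   * Decide whether the datanode should shut down, given the number of failed
   * and healthy volumes and the configured "failed volumes tolerated" value.
   */
  static boolean shouldShutdown(int failedVolumes, int goodVolumes, int tolerated) {
    if (tolerated < 0) {
      // -1: unlimited failures tolerated, as long as one good volume remains
      return goodVolumes == 0;
    }
    // 0: any volume failure triggers shutdown; N: tolerate up to N failed volumes
    return failedVolumes > tolerated;
  }

  public static void main(String[] args) {
    // Current default (-1): 19 failed disks out of 20 still leave the DN running.
    System.out.println(shouldShutdown(19, 1, -1)); // false
    // Proposed default (0): a single failed disk shuts the DN down.
    System.out.println(shouldShutdown(1, 19, 0));  // true
  }
}
```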

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11770

How was this patch tested?

Existing unit tests.

@adoroszlai
Contributor

What about @errose28's comment on the task?

We should not make the default behavior that a datanode shuts down on any volume failure, that defeats the purpose of having multiple volumes per node.

@ChenSammi
Contributor Author

ChenSammi commented Nov 28, 2024

What about @errose28's comment on the task?

We should not make the default behavior that a datanode shuts down on any volume failure, that defeats the purpose of having multiple volumes per node.

We change the default value to 0; users can still modify it to another value, for example "1", based on their own needs. The current unlimited default of -1 is really not appropriate. Imagine a datanode with 20 disks: with -1, even if 19 disks fail the datanode keeps running, and the administrator will not know unless they happen to be paying attention. Given the recent data loss issue caused by a volume failure, I think we should be conservative about these configurations regarding volume failures. We should fail fast, alert the cluster administrators, preserve the failure context, find the root cause, and fix the problem as early as possible.

@sodonnel
Contributor

Shutting down a 20-disk DN due to one disk failure is not a good default in my view. The disruption caused by the immediate loss of a large node is significant and will result in a lot of needless replication.

Let's say the disk has a legitimate hardware failure and needs to be replaced. Many of these hosts have hot-swappable drives, so the node could probably run on for days or weeks until the admins decide to remedy all the failed drives in the cluster. Disk failures are expected to happen somewhat regularly, so we should handle them much more gracefully.

I'd probably suggest a default along the lines of "shut down the DN if 50%-75% of the configured drives fail", but there is also an argument for keeping it running with only a single disk left. The DNs should be monitored for disk failures, and failures should be investigated; it should not require a DN shutdown to make that happen.
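
A rough, hypothetical sketch of that idea (a fraction-based threshold is not an existing Ozone option; the names and the 50% figure below are only illustrative):

```java
/** Hypothetical sketch of a fraction-based shutdown threshold; not an Ozone feature. */
public final class FractionToleranceExample {

  /** Shut down only once more than the given fraction of configured volumes has failed. */
  static boolean shouldShutdown(int failedVolumes, int configuredVolumes, double maxFailedFraction) {
    if (configuredVolumes <= 0) {
      return false;
    }
    return (double) failedVolumes / configuredVolumes > maxFailedFraction;
  }

  public static void main(String[] args) {
    // 1 failed disk out of 20 is below a 50% threshold, so the DN keeps running.
    System.out.println(shouldShutdown(1, 20, 0.5));  // false
    // 11 failed disks out of 20 exceed 50%, so the DN shuts down.
    System.out.println(shouldShutdown(11, 20, 0.5)); // true
  }
}
```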

@adoroszlai
Contributor

See also #7505.

@errose28
Contributor

errose28 commented Dec 2, 2024

We change the default value to 0; users can still modify it to another value, for example "1", based on their own needs.

We should strive for intuitive configurations out of the box to help usability as much as possible. IMO -1 is expected behavior and will not lead to a surprise node shutdown.

with -1, even if 19 disks fail the datanode keeps running, and the administrator will not know unless they happen to be paying attention.

The DNs should be monitored for disk failures, and failures should be investigated; it should not require a DN shutdown to make that happen.

This is the real problem: we do not have a good alerting system for disk failures. A new Recon page (HDDS-11840), dashboards, and changes like #7266 can all remediate this problem without introducing potentially surprising config changes.

@slfan1989
Contributor

Everyone's points are reasonable, and I understand their perspectives. However, I agree more with @errose28's viewpoint. I believe keeping this value at -1 is the better choice, as users should not face the risk of unexpected DN shutdowns.

Our largest cluster has 1,500 machines, and every day some machines experience disk failures. If a DN shuts down outright, my first reaction as an administrator is panic. It takes time to locate the relevant logs, and DN logs are often quite large.

I think we should provide a more detailed description for this configuration: if set to -1, the DN will never shut down due to volume failures as long as at least one good volume remains; if set to a specific number, it indicates how many disk failures can be tolerated before the DN shuts down. Ultimately, users should be allowed to choose based on their specific environment.

Additionally, the situation mentioned by @sodonnel is also reasonable. We currently have a special type of machine with 60 data disks. In the event of a disk failure, we opt for the hot repair method, which means replacing the disk without shutting down. This is because shutting down would trigger a large amount of container replication, potentially involving tens of thousands of containers. Currently, we only perform a shutdown for repairs in the case of CPU failure, memory failure, or system disk failure.

Our current strategy is to configure the system to tolerate a single disk failure and perform daily routine inspections. Once a machine with a disk failure is identified, we quickly carry out repairs. Therefore, I personally believe that we only need to improve the comments for this configuration, clearly describing the potential risks involved.

The above is my understanding as a user, and I would also like to hear more thoughts from other members of the community.

@errose28
Contributor

errose28 commented Dec 4, 2024

This is because shutting down would trigger a large amount of container replication, potentially involving tens of thousands of containers.

Just as a side note in case you haven't tried it yet: Ozone does have a maintenance mode for datanodes, similar to HDFS, which maintains availability but not redundancy. When a node is put into maintenance mode, replication is only triggered if data would become unreadable without that node, for example singly-replicated data. In this state you can shut down a node for a period of time with little or no replication happening, if you are OK with temporarily having one data copy offline.

@ChenSammi
Contributor Author

Thank you all for your feedback. It looks like every approach has its pros and cons, and @slfan1989 provided very valuable information from real cluster maintenance experience. So let's keep the current default value for now. @adoroszlai @sodonnel @errose28 @slfan1989

@ChenSammi closed this Dec 6, 2024