HDDS-11770. Change default failed volume tolerated to 0 #7499
Conversation
We change the default value to 0; users can still modify it to some other value, for example "1", based on their own needs. The current -1 (unlimited) default value is really not appropriate: imagine a datanode with 20 disks. With -1, even if 19 disks fail the datanode is still running, and the administrator will not know unless they happen to look. Given the recent data loss issue caused by volume failure, I think we should be conservative about these configurations regarding volume failures. We should fail fast and alert the cluster administrators, keep the failure context, find the root cause, and fix the problem as early as possible.
Shutting down a 20-disk DN due to one disk failure is not a good default for me. The disruption caused by the immediate loss of a large node is significant and will result in a lot of needless replication. Let's say the disk is a legitimate hardware failure and needs to be replaced. Many of these hosts have hot-swappable drives, so the node could probably run on for days or weeks until the admins decide to remedy all the failed drives in the cluster. Disk failures are expected to happen somewhat regularly, so we should handle them much more gracefully. I'd probably suggest a default along the lines of "if 50-75% of the configured drives fail, shut down the DN", but there is an argument for keeping it running with only a single disk left too. The DNs should be monitored for disk failures, and failures should be investigated; it should not take a DN shutdown to make that happen.
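As a rough illustration of the percentage-based idea (a hypothetical sketch only, not existing Ozone code; the class, field, and threshold names below are made up for discussion):

```java
/**
 * Hypothetical sketch: shut the datanode down only once failed volumes
 * reach a configurable fraction of the configured volumes, instead of a
 * fixed absolute count. Not actual Ozone code.
 */
public final class VolumeFailureThreshold {

  private final int configuredVolumes;
  // Fraction of configured volumes that may fail before shutdown,
  // e.g. 0.5 means tolerate failures of up to 50% of the volumes.
  private final double failedFractionTolerated;

  public VolumeFailureThreshold(int configuredVolumes, double failedFractionTolerated) {
    this.configuredVolumes = configuredVolumes;
    this.failedFractionTolerated = failedFractionTolerated;
  }

  /** Returns true if the datanode should shut down for the given failure count. */
  public boolean shouldShutdown(int failedVolumes) {
    if (configuredVolumes == 0) {
      return true; // nothing configured means nothing usable
    }
    double failedFraction = (double) failedVolumes / configuredVolumes;
    return failedFraction > failedFractionTolerated;
  }

  public static void main(String[] args) {
    VolumeFailureThreshold t = new VolumeFailureThreshold(20, 0.5);
    System.out.println(t.shouldShutdown(1));  // false: 1 of 20 failed (5%)
    System.out.println(t.shouldShutdown(11)); // true: 11 of 20 failed (55%)
  }
}
```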
See also #7505.
We should strive for intuitive configurations out of the box to help usability as much as possible. IMO this is the real problem: we do not have a good alerting system for disk failures. A new Recon page (HDDS-11840), dashboards, and changes like #7266 can all remediate this problem without introducing potentially surprising config changes.
Everyone's points are reasonable, and I understand their perspectives. However, I agree more with @errose28's viewpoint. I believe setting this value to -1 is the better choice, as users should not face the risk of DN crashes. Our largest cluster has 1,500 machines, and every day some machines experience disk failures. If a DN crashes directly, my first reaction as an administrator is panic: it takes time to locate the logs, and the DN logs are often quite large. I think we should provide a more detailed description for this configuration: if set to -1, the DN will never shut down due to disk failures; if set to a specific number, it indicates the number of disk failures that can be tolerated before the DN shuts down. Ultimately, users should be allowed to choose based on their specific environment.

Additionally, the situation mentioned by @sodonnel is also reasonable. We currently have a special type of machine with 60 data disks. In the event of a disk failure, we opt for hot repair, which means replacing the disk without shutting down, because shutting down would trigger a large amount of container replication, potentially involving tens of thousands of containers. Currently, we only shut down for repairs in the case of CPU failure, memory failure, or system disk failure. Our current strategy is to configure the system to tolerate a single disk failure and perform daily routine inspections; once a machine with a disk failure is identified, we quickly carry out repairs. Therefore, I personally believe we only need to improve the description for this configuration, clearly documenting the potential risks involved. The above is my understanding as a user, and I would also like to hear more thoughts from other members of the community.
Just as a side note in case you haven't tried this yet: Ozone does have a maintenance mode for datanodes, similar to HDFS, which maintains availability but not redundancy. When a node is put into maintenance mode, replication is only triggered if data would become unreadable without that node, for example singly-replicated data. In this state you can shut down a node for a period of time with little or no replication happening, if you are OK with temporarily having one data copy offline.
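For reference, a maintenance-mode workflow might look roughly like the following. The subcommands and flags here are an assumption based on the Ozone admin CLI and may differ between versions, so please verify with `ozone admin datanode --help` for your release:

```bash
# Put the datanode into maintenance before pulling the failed disk.
# --end <hours> is assumed to end maintenance automatically after that time.
ozone admin datanode maintenance --end 4 dn-host.example.com

# ... replace the failed drive, restart the datanode if needed ...

# Return the node to normal service.
ozone admin datanode recommission dn-host.example.com
```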
Thank you all for your feedback. It looks like every approach has its pros and cons, and @slfan1989 gave very valuable information from real cluster maintenance experience. So let's keep the current default value for now. @adoroszlai @sodonnel @errose28 @slfan1989
What changes were proposed in this pull request?
Currently "hdds.datanode.failed.data.volumes.tolerated", "hdds.datanode.failed.metadata.volumes.tolerated" and "hdds.datanode.failed.db.volumes.tolerated" all default to "-1", which means an unlimited number of volume failures is tolerated as long as there is at least one good volume left.
There is a corresponding property in HDFS, "dfs.datanode.failed.volumes.tolerated", whose default is 0, meaning any volume failure will cause the datanode to shut down.
It's better to have a more conservative default value than the current unlimited one.
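For illustration, a cluster that wants to tolerate exactly one failed volume of each type could override these properties in ozone-site.xml (the value 1 here is only an example, not a recommendation):

```xml
<property>
  <name>hdds.datanode.failed.data.volumes.tolerated</name>
  <value>1</value>
</property>
<property>
  <name>hdds.datanode.failed.metadata.volumes.tolerated</name>
  <value>1</value>
</property>
<property>
  <name>hdds.datanode.failed.db.volumes.tolerated</name>
  <value>1</value>
</property>
```

With these settings the datanode keeps running after a single volume failure in each category and shuts down on the second.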
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-11770
How was this patch tested?
Existing unit tests.