[DOCS] Document cluster behavior when a file system crashes but node remains operational #25591
Comments
Related Discussion: #18279
Pinging @elastic/es-core-infra
[docs issue triage]
Pinging @elastic/es-distributed (Team:Distributed)
Thanks for the ping @stefnestor. @jaymode is now part of another team, but I've added some labels to include the Distributed team. Thanks to the work in #45286, we hopefully have a simpler story here. As this info is largely targeted at users doing recovery planning, we may want to add a page to Designing for Resilience. I don't personally have the bandwidth to pick this up in the near term, but I can bring this to our next Docs sync to see if anyone else is available.
I wonder if it's worth making a distinction between "node failed" and "filesystem failed but node still running" any more. #45286 means that a node with a broken filesystem will remove itself from the cluster, just like any other failure mode.
Node leaves the cluster as soon as the fs is no longer writable.
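To make the "no longer writable" condition concrete, here is a minimal sketch of a writability probe for a data path. This is an illustration only, not Elasticsearch's actual implementation: the function names `path_is_writable` and `node_should_leave`, the probe file prefix, and the rule "any unwritable path means unhealthy" are all assumptions made for this example.

```python
import os
import tempfile


def path_is_writable(data_path: str) -> bool:
    """Return True if a small probe file can be created, written, and fsynced
    under data_path. Hypothetical helper for illustration only."""
    try:
        fd, tmp_name = tempfile.mkstemp(prefix=".probe-", dir=data_path)
    except OSError:
        # Directory missing, read-only, or otherwise unusable.
        return False
    try:
        os.write(fd, b"probe")
        os.fsync(fd)  # force the write through to the device
        return True
    except OSError:
        return False
    finally:
        # Best-effort cleanup; failures here do not change the result.
        try:
            os.close(fd)
        except OSError:
            pass
        try:
            os.unlink(tmp_name)
        except OSError:
            pass


def node_should_leave(data_paths) -> bool:
    """Assumed policy: the node is unhealthy if any data path fails the probe."""
    return any(not path_is_writable(p) for p in data_paths)
```

A monitoring script could run `node_should_leave(["/path/to/data"])` periodically (the path is a placeholder) and alert when it returns True, which is useful when logs live on a different drive than the data directories.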
Describe the feature: Document cluster behavior when a file system crashes but node remains operational
Elasticsearch version: Generic
Plugins installed: [] n/a
JVM version (java -version): n/a
OS version (uname -a if on a Unix-like system): generic
Description of the problem including expected versus actual behavior:
Elasticsearch documentation currently describes behavior when a node in a cluster fails. The documentation does not describe behavior when a node's file system fails but the node itself remains operational. Such failure conditions can and will happen, especially for customers using third-party high-performance disk systems (SSD, RAID, etc.) that are loosely coupled with the OS. Additionally, it is common for customers to mount their data directories on high-performance disk systems while keeping their log data on the system drive.
General issues that need to be addressed:
Relevant Discussions
"Expected behavior" during disk crashes has changed significantly between Elasticsearch versions, and several significant open issues speak to this question, including:
#18417
#18467
#19789
Cluster response specifically to failed disk conditions should be documented for user system design and recovery planning.