[DOCS] Document cluster behavior when a file system crashes but node remains operational #25591

MorrieAtElastic · 2017-07-07T08:54:27Z

Describe the feature: Document cluster behavior when a file system crashes but node remains operational

Elasticsearch version: Generic

Plugins installed: [] n/a

JVM version (java -version): n/a

OS version (uname -a if on a Unix-like system): generic

Description of the problem including expected versus actual behavior:

Elasticsearch documentation currently describes cluster behavior when a node fails. It does not describe behavior when a node's file system fails but the node itself remains operational. Such failures can and will happen, especially for customers using third-party high-performance disk systems (SSD, RAID, etc.) that are loosely coupled with the OS. Additionally, customers commonly mount their data directories on high-performance disk systems while keeping their log data on the system drive.

General issues that need to be addressed:

cluster actions when primary shards are lost due to disk failure (according to my testing, replica shards are promoted on other nodes)
cluster actions when replica shards are lost due to disk failure (new replica shards are created on surviving nodes)
parameters affecting shard management when a disk failure occurs
cluster response when disk failure is resolved and the disk system is brought back online (according to my testing, nothing happens until the entire cluster is restarted)
response of the node and the cluster to queries and CRUD requests addressed to the node with the failed file system

Relevant Discussions

"Expected behavior" during disk crashes has changed significantly between Elasticsearch versions, and several significant open issues address this question, including:

#18417
#18467
#19789

Cluster response specifically to failed disk conditions should be documented for user system design and recovery planning.

PhaedrusTheGreek · 2017-07-11T13:16:50Z

Related Discussion: #18279

elasticmachine · 2018-04-24T09:36:12Z

Pinging @elastic/es-core-infra

pugnascotia · 2019-09-12T15:52:23Z

See also Improve handling of readonly filesystems (#45286).

jrodewig · 2019-11-01T19:21:26Z

[docs issue triage]

stefnestor · 2021-11-24T17:23:05Z

@jrodewig @jaymode, I think this fell off the radar. Can you review? 🙏🏼

elasticmachine · 2021-11-24T18:12:19Z

Pinging @elastic/es-distributed (Team:Distributed)

jrodewig · 2021-11-24T18:15:30Z

Thanks for the ping @stefnestor. @jaymode is now part of another team, but I've added some labels to include the Distributed team.

Thanks to the work in #45286, we hopefully have a simpler story here. As this info is largely targeted at users doing recovery planning, we may want to add a page to Designing for Resilience.

I don't personally have the bandwidth to pick this up in the near term, but I can bring this to our next Docs sync to see if anyone else is available.

DaveCTurner · 2021-11-24T19:34:08Z

I wonder if it's worth making a distinction between "node failed" and "filesystem failed but node still running" any more. #45286 means that a node with a broken filesystem will remove itself from the cluster, just like any other failure mode.

idegtiarenko · 2022-07-28T13:14:29Z

Node leaves the cluster as soon as the fs is no longer writable.
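
For readers doing recovery planning, the behavior described above can be pictured as a periodic write probe on each data path. The following is only a minimal sketch under stated assumptions, not Elasticsearch's actual implementation: the function names `is_path_writable` and `should_leave_cluster` are hypothetical, and the probe (create a temporary file, fsync it, delete it) is one plausible way to detect a file system that is no longer writable.

```python
import os
import tempfile


def is_path_writable(data_path):
    """Return True if a small probe file can be created, synced, and removed.

    Hypothetical helper for illustration; not an Elasticsearch API.
    """
    try:
        # Fails with OSError if the directory is missing or read-only.
        fd, tmp_name = tempfile.mkstemp(prefix=".fs_probe_", dir=data_path)
    except OSError:
        return False

    ok = True
    try:
        os.write(fd, b"probe")
        os.fsync(fd)  # force the write through to the underlying device
    except OSError:
        ok = False
    finally:
        try:
            os.close(fd)
            os.remove(tmp_name)
        except OSError:
            ok = False
    return ok


def should_leave_cluster(data_paths):
    """Assumed policy: one unwritable data path marks the whole node unhealthy."""
    return any(not is_path_writable(path) for path in data_paths)
```

In this sketch, a node would call `should_leave_cluster` on its configured data paths at some interval and step out of the cluster when it returns `True`, at which point the usual node-failure handling (replica promotion and reallocation) would apply.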

jimczi added the >docs General docs changes label Jul 7, 2017

colings86 added the :Core/Infra/Core Core issues without another label label Apr 24, 2018

pugnascotia mentioned this issue Sep 12, 2019

Improve handling of readonly filesystems #45286

Closed

3 tasks

rjernst added Team:Core/Infra Meta label for core/infra team Team:Docs Meta label for docs team labels May 4, 2020

rjernst added the needs:triage Requires assignment of a team area label label Dec 3, 2020

jaymode removed the needs:triage Requires assignment of a team area label label Dec 14, 2020

elasticmachine added the Team:Distributed Meta label for distributed team (obsolete) label Nov 24, 2021

debadair removed the Team:Docs Meta label for docs team label Apr 27, 2022

debadair changed the title ~~Elasticsearch: Document Cluster Behavior When A File System Crashes But Node Remains Operational~~ [DOCS] Document cluster behavior when a file system crashes but node remains operational Apr 27, 2022

idegtiarenko closed this as completed Jul 28, 2022
