Hot swappable path.data disks #18279
Related to #18217 |
While I think there may be improvements that can be made when a disk dies, if you want hot swapping etc. I think you need a proper RAID system or LVM |
I think we need to add some resiliency here:
I will take care of this |
yeah I am torn on the hot-swapping. I think we can potentially take things out of the loop internally, but if you are plugging in a new disk and we should auto-detect that a data path is good again, I think you should restart the node instead? |
Definitely we don't want to introduce any resiliency issues. Some manual intervention makes sense, but restarting a node can sometimes take a long time. Should there be something like delayed allocation on marking a path.data as failed? There is also the case of something like NFS, where a network problem might make the drive appear to come and go. |
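For context on the "delayed allocation" idea above: Elasticsearch already exposes per-index delayed allocation for the node-left case, though nothing equivalent exists per data path. A minimal sketch of the existing setting (the `5m` value is illustrative):

```sh
# Existing per-index setting: delay reallocation after a node drops out of the cluster
curl -XPUT 'localhost:9200/_all/_settings' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "5m"
  }
}'
```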
I think if you lose a disk you need to restart the node. I can totally improve along the lines of failing shards quicker, but we shouldn't try to be fancy here. I think we should take the node out of the cluster somehow, but that's something that needs more thought. |
With multiple disks in path.data, restarting a node is much easier than re-building a logical volume, and much less data is lost, so either way we are ahead. |
In general this makes sense, but it would be nice if you could apply something like a transient setting to tell that node that a disk has died and to temporarily stop trying to perform I/O on it. That would still require manual intervention, but it would allow applying a temporary hotfix if a node restart is not immediately feasible. |
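A rough sketch of what such a hotfix could look like. The transient `_cluster/settings` mechanism is real, but the setting name below is hypothetical and does not exist; it is only here to illustrate the suggestion:

```sh
# Hypothetical: "node.store.excluded_data_paths" is an invented name — no such setting
# exists; only the transient cluster-settings mechanism itself is real.
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "node.store.excluded_data_paths": "/mnt/ssd3"
  }
}'
```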
Had this issue come up again last night. Our logging nodes have 4 SSDs; we've passed an array to the path.data setting. Over the weekend, one of the file systems on one of the disks on one of the ES servers became corrupt. Over the next 12 hours, ES spewed 500GB of errors like the following into the logs, filling up the root partition and eventually alerting us (because we alert on disk usage, but we didn't at the time have alerts on ES log file size / growth)
There are 12 data nodes in this cluster with 4 SSDs each, 3 dedicated masters, and we run a replication factor of 2 using hourly indices with 2 primary shards. During the time that this happened, Elasticsearch continued to place primary shards on the failed disk. As a result, we lost half of the log data for 9 out of the 12 hours that this disk was unreachable (because 9 out of 12 times it attempted to place at least one of each hour's primary shards on the unreachable disk; the writes to the primary failed, and the primary was never moved elsewhere).

I suspect, although I did not dig into it or write a test case to prove it, that the process whereby Elasticsearch determines which nodes are eligible to receive a write, and which disk to write to once it gets there, might also bias further writes towards the drive that failed. In our case, we had 9 data nodes that were eligible to accept writes, each having 4 eligible disks that had not exceeded any watermarks and were not otherwise unwritable. Over 12 hours, 9 of the 24 primary shards created were allocated to the node with the disk failure, and it routed them to the unreachable disk. As a result of being unwritable for several hours, that disk was also less full than the other disks in the cluster. Again, I don't know that a disk failure like the one we had biases shard placement in favor of writing to the unreachable disk. But we did see an abnormally high number of shards placed on one machine, and on one disk on that machine... abnormal enough to make me wonder whether that was just a coincidence.

All of which is to say: I think this issue is extremely important. I also think @s1monw is right to suggest that ensuring a file path is writable before placing a shard (especially a primary shard) will go a long way towards adding resiliency. |
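For readers unfamiliar with this kind of setup, a minimal sketch of the multi-disk configuration being described, with illustrative mount points:

```yaml
# elasticsearch.yml — one entry per SSD; Elasticsearch spreads shards across these paths
path.data:
  - /mnt/ssd1
  - /mnt/ssd2
  - /mnt/ssd3
  - /mnt/ssd4
```

Note that shard-level APIs such as `_cat/shards` report only the node holding each copy, not which of these paths it landed on, which is part of why a single-disk failure like the one above is hard to see from the cluster's point of view.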
#18279 (comment) describes two things that need to happen to resolve this issue. The first has been done in #16745. The second (failing the shard) is very easy. I opened #29008 to highlight it as an "adopt me" and a low-hanging fruit. Closing this one as superseded by these two issues. |
@bleskes would you consider reopening this ticket as a high hanging fruit, as per #29008 (comment)? Or, if you feel it should remain closed, can you share a bit more of your thinking about why? I don't feel like #16745 and #18279 (comment) are talking about the same thing. |
@EvanV I agree it's not the same thing. As the discussion above indicates, we feel adding hot swappability at the path level would come at too high a price. Elasticsearch currently works at the level of a node: shard copies are spread across nodes, and if a shard fails the master will try to assign it to another node. We can do better there and start tracking failures per node so we can stop allocating to it (we don't do that now), but adding another conceptual layer isn't worth it. LVM or RAID are much more mature solutions to achieve that part. That said, there were a few things we could do that came out of the discussion. One is done and the other is tracked by another issue, which is why I closed this one. |
Thank you for explaining. I see what you're saying. I feel like this ticket shouldn't be called "Hot swappable data paths" and should instead be a bug report along the lines of "ES shouldn't allocate shards to dead disks." I think the latter is still true, albeit far more complicated, to your point. I also feel like the docs recommending multiple data paths should be caveated that RAID0 might be a better option, depending on your needs (I'm happy to submit an update to the docs along these lines, if you'd be open to accepting it). You're definitely right that ES shouldn't be responsible for replacing RAID or LVM. Focusing on the issues you did makes sense as a better solution than currently exists.

Not to beat a dead horse, but I do feel that ES should be capable of not trying to allocate shards to dead disks. That is how I viewed this original issue, and it sounds like we both agree that #29008 doesn't quite cover that. Would you be open to adding an issue along the lines of "ES shouldn't allocate shards to dead disks" and/or renaming this one and orienting its scope around that, not hot swappable disks? |
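As a rough sketch of the RAID0 alternative being suggested for the docs, assuming Linux md and illustrative device names (note this wipes those devices):

```sh
# Stripe four SSDs into a single block device, then point a single path.data entry at it
mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
mkfs.ext4 /dev/md0
mount /dev/md0 /var/lib/elasticsearch

# elasticsearch.yml would then contain a single entry:
#   path.data: /var/lib/elasticsearch
```

The trade-off matches the discussion above: with RAID0 a single disk failure takes down the whole volume and hence the node, but the cluster then sees an unambiguous node failure rather than a half-alive node still accepting shards.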
Pinging @elastic/es-core-infra |
Yes please, though I tried to find what you meant and couldn't.
I think this one #18417 covers it? If you agree, feel free to comment there. |
I may be recalling incorrectly, or it may have been a blog post. In any event, I'll poke around and add a note to the docs on "things to watch out for" vis a vis multiple data paths. #18417 does cover my concern yes. Thanks for taking the time to explain your reasoning on this one. I wasn't following you at first, but it's very clear now what you're thinking and how you're breaking down the work on this task. Much appreciated. |
It seems that when making use of `path.data` over multiple physical disks, when a disk is removed the system should recover automatically. Currently, searches and/or indexing requests over missing shards throw exceptions, and no allocation/recovery occurs. The only way to bring the data back online is to restart the node, or to reinsert the original disk with the existing data. It would be great if Elasticsearch could:

Steps to Test / Reproduce:

- Configure `path.data` over 2 disks, and start 2 Elasticsearch nodes locally
- Exceptions start to show in logs, but `_cat/shards` shows everything is OK
- `_refresh`: no change
- Logs show an exception
- `_cat/shards` still shows all shards STARTED: no change (see the sketch of these checks below)
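A minimal sketch of the checks described in the steps above, assuming a local node on the default port:

```sh
# Shard table: in the scenario above this still reports STARTED even after a disk is pulled
curl 'localhost:9200/_cat/shards?v'

# Force a refresh: this logs an exception on the affected node but otherwise nothing changes
curl -XPOST 'localhost:9200/_refresh'
```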