cmd/utils: Handle graceful shutdown on low disk space #21884
vyrwu wants to merge 2 commits into ethereum:master from vyrwu:handle-full-disk
Currently, the program will panic when the node runs out of memory, which sometimes corrupts the databases. This commit introduces a goroutine that checks for available disk space every 5 seconds:
- If less than 500 MB is available, it prints a warning.
- If less than 100 MB is available, it writes SIGTERM to the channel that is used to handle graceful termination of a node.
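For readers following along, the polling approach described above might look roughly like the sketch below. It is not the code from this PR: the names `classify`, `freeDiskMB`, and `monitor` are hypothetical, the standard library stands in for geth's logger, and `syscall.Statfs` limits it to Unix-like systems. The 500 MB and 100 MB thresholds are the ones from the description and are still under discussion.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"time"
)

// Thresholds from the PR description (hypothetical constant names).
const (
	warnMB     = 500
	criticalMB = 100
)

type action int

const (
	actNone action = iota
	actWarn
	actShutdown
)

// classify maps the free space in MB to the action the monitor should take.
func classify(availMB uint64) action {
	switch {
	case availMB < criticalMB:
		return actShutdown
	case availMB < warnMB:
		return actWarn
	default:
		return actNone
	}
}

// freeDiskMB returns the space available to unprivileged users on the
// filesystem containing path. Unix only: it relies on syscall.Statfs.
func freeDiskMB(path string) (uint64, error) {
	var stat syscall.Statfs_t
	if err := syscall.Statfs(path, &stat); err != nil {
		return 0, err
	}
	return uint64(stat.Bavail) * uint64(stat.Bsize) / (1024 * 1024), nil
}

// monitor polls the free space every interval. It sends SIGTERM on sigc once
// the space drops below the critical threshold, and stops when done is closed.
func monitor(path string, interval time.Duration, sigc chan<- os.Signal, done <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			avail, err := freeDiskMB(path)
			if err != nil {
				fmt.Fprintf(os.Stderr, "disk space check failed: %v\n", err)
				continue
			}
			switch classify(avail) {
			case actShutdown:
				fmt.Fprintf(os.Stderr, "only %d MB of disk space left, shutting down\n", avail)
				sigc <- syscall.SIGTERM
				return
			case actWarn:
				fmt.Fprintf(os.Stderr, "low disk space: %d MB left\n", avail)
			}
		}
	}
}

func main() {
	sigc := make(chan os.Signal, 1) // buffered so the monitor never blocks on send
	done := make(chan struct{})
	// Short interval for the demo; the PR polls every 5 seconds.
	go monitor(".", 10*time.Millisecond, sigc, done)
	select {
	case sig := <-sigc:
		fmt.Println("graceful shutdown requested by", sig)
	case <-time.After(100 * time.Millisecond):
		fmt.Println("disk space OK, stopping demo")
	}
	close(done)
}
```

Keeping the threshold decision in a pure function like `classify` makes it easy to test without touching the filesystem, and to change the limits once the right values are known.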
@vyrwu thanks for picking up this task!
You're currently checking
Otherwise I think it looks good, and clever idea to use the sigterm channel to simulate a regular exit!
We'll have to check whether 100 MB is sufficient. Whenever geth exits, it has quite a lot of data held in memory which must be persisted, so 100 MB might be on the low side. I would assume the level for low-disk-exit should be on the same order of magnitude as the cache limit.
Thanks for the fast reply. I will work on it next weekend. I'll try testing geth a bit too, and see whether I can simulate OOM somehow to verify that the fix does what it is supposed to.
	return nil
}

func ensureSufficientMemory(sigc chan os.Signal) {
It's a bit confusing to have this method named ensureSufficientMemory, as it's disk-space, not RAM that's being checked.
	log.Info("Available disk space is less than 100 MB. Gracefully shutting down to prevent database corruption.")
	sigc <- syscall.SIGTERM
} else if avMemMB < 500 {
	log.Warnf("Node is running low on memory. It will terminate if memory runs below 100MB. Remaining: %v MB.", avMemMB)
Same here, low on disk space ... if disk space runs below...
go func() {
	var stat syscall.Statfs_t
	wd, err := os.Getwd(); err != nil {
		Fatalf("Error reading available memory of Node: %v", err)
Please avoid Fatalf -- that one causes an immediate os.Exit, which means it will almost certainly cause data loss and/or database corruption. We only use it in cases where things are already irrevocably broken beyond repair.
Just use log.Warn (and perhaps send a SIGTERM?)
As for testing, it might be good to make it trigger after ~1h, and right before it triggers, print out the measured disk size. And once geth has exited, you can dump out the same measure again. Doing that a few times should collect some stats on how much disk is used by the shutdown process.
This is still untested. I was reading a bit about Geth today and digging more into the codebase. I still need to find the right cache limit for the disk size thresholds; there are a few defined in code, but I'm sure I can find the right one after understanding it a little better. 👍 My intuition is that it's 1000 MB.
Also need to look into these CI errors:
About the
Also particular, The sys package has So I think what's needed is,
And once we have that, create architecture-specific files, with a method e.g.
Superseded by #22103