cmd/utils: Handle graceful shutdown on low disk space by vyrwu · Pull Request #21884 · ethereum/go-ethereum

vyrwu · 2020-11-21T17:11:30Z

Currently, the program will panic when the node runs out of memory, sometimes leading to corrupting the databases.
This commit introduces a goroutine that checks for available disk space every 5 seconds:

If less than 500 MB, prints warning
If less than 100 MB, writes SIGTERM to channel that is used to handle graceful termination of a node

Currently, the program will panic when the node runs out of memory, sometimes leading to corrupting the databases. This commit introduces a goroutine that checks for available disk space every 5 seconds: - If less than 500 MB, prints warning - If less than 100 MB, writes SIGTERM to channel that is used to handle graceful termination of a node

vyrwu · 2020-11-21T17:22:07Z

Addresses #20965

@holiman One requirement was unclear to me: "check the disk where the datadir resides". I can get this path from Config, but why do I need it? Or did I miss-understood the requirements?

holiman · 2020-11-23T13:01:14Z

@vyrwu thanks for picking up this task!

One requirement was unclear to me: "check the disk where the datadir resides". I can get this path from Config, but why do I need it?

You're currently checking os.Getwd(). However, in many deployments, people don't have the data directory in the same location as where they're executing from. E.g. if you run dockerized geth, then you have probably mounted your data directory from the host. And the data directory is the concern here, not the working directory.

Otherwise I think it looks good, and clever idea to use the sigterm channel to simulate a regular exit!

We'll have to check whether 100Mb is sufficient. Whenever geth exits, it has quite a lot of data held in memory which must be persisted, so 100Mb might be on the low side.

I would assume the level for low-disk-exit should be on the same order of magnitude as the cache limit.

vyrwu · 2020-11-23T13:17:02Z

Thanks for fast reply. I will work on it next weekend. I'll try testing geth a bit too, and see maybe I can simulate OOM somehow to verify that the fix does what it supposed to.

holiman · 2020-11-23T13:38:02Z

 	return nil
 }
+
+func ensureSufficientMemory(sigc chan os.Signal) {


It's a bit confusing to have this method named ensureSufficientMemory, as it's disk-space, not RAM that's being checked.

Addressed in c355ae4

holiman · 2020-11-23T13:38:26Z

+					log.Info("Available disk space is less than 100 MB. Gracefully shutting down to prevent database corruption.")
+					sigc <- syscall.SIGTERM
+				} else if avMemMB < 500 {
+					log.Warnf("Node is running low on memory. It will terminate if memory runs below 100MB. Remaining: %v MB.", avMemMB)


Same here, low on disk space ... if disk space runs below...

Addressed in c355ae4

holiman · 2020-11-23T13:40:55Z

+			go func() {
+				var stat syscall.Statfs_t
+				wd, err := os.Getwd(); err != nil {
+					Fatalf("Error reading available memory of Node: %v", err)


Please avoid Fatalf -- that one causes an immediate os.Exit, which means it will almost certainly cause data loss and/or database corruption. We only use it in cases where things are already irrevocably broken beyond repair.
Just use log.Warn (and perhaps send a SIGTERM?)

Addressed in c355ae4

holiman · 2020-11-23T13:43:09Z

As for testing, it might be good to make it trigger after ~1h, and right before it triggers, print out the measured disk size. And once geth has exited, you can dump out the same measure again. Doing that a few times should collect some stats on how much disk is used by the shutdown process.

vyrwu · 2020-11-28T14:43:55Z

This is still untested, I was reading a bit about Geth today and digging more into the codebase. I still need to find the right cache limit for the disk size thresholds, there are a few defined in code but I'm sure I can find the right one after understanding it a little better. 👍

My intuition is that it's 1000 MBs.

vyrwu · 2020-11-28T14:50:59Z

Also need to look into these CI errors:

# github.com/ethereum/go-ethereum/cmd/utils
cmd\utils\cmd.go:322:12: undefined: syscall.Statfs_t
cmd\utils\cmd.go:323:13: undefined: syscall.Statfs
util.go:47: exit status 2
exit status 1

holiman · 2020-12-05T09:38:37Z

About the syscall package (https://golang.org/pkg/syscall/)

Deprecated: this package is locked down. Callers should use the corresponding package in the golang.org/x/sys repository instead.

Also particular, Statfs seems to be a unix thing, (but seems to be deprecated in favour of vstatfs ) .

The sys package has statfs in the unix package.

So I think what's needed is,

Find a corresponding windows method of figuring out the disk use,
Verify that we also have something for darwin (maybe the unix version works there)

And once we have that, create architecture-specific files, with a method e.g. availableSpace(path string), defined in <xx>.go, where there is one <xxx>_windows.go and one <xxx>_linux.go with appropriate compiler directives.

holiman · 2021-01-20T18:36:16Z

Superseded by #22103

vyrwu mentioned this pull request Nov 21, 2020

Graceful shutdown if disk is full #20965

Closed

vyrwu changed the title ~~Handle graceful shutdown on low disk space~~ cmd/utils: Handle graceful shutdown on low disk space Nov 21, 2020

holiman reviewed Nov 23, 2020

View reviewed changes

Address review comments

c355ae4

ligi mentioned this pull request Jan 3, 2021

cmd/geth: Graceful shutdown if disk is full #22103

Merged

holiman closed this Jan 20, 2021

Conversation

vyrwu commented Nov 21, 2020

Uh oh!

vyrwu commented Nov 21, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

holiman commented Nov 23, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vyrwu commented Nov 23, 2020

Uh oh!

holiman Nov 23, 2020

Choose a reason for hiding this comment

Uh oh!

vyrwu Nov 28, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

holiman Nov 23, 2020

Choose a reason for hiding this comment

Uh oh!

vyrwu Nov 28, 2020

Choose a reason for hiding this comment

Uh oh!

holiman Nov 23, 2020

Choose a reason for hiding this comment

Uh oh!

vyrwu Nov 28, 2020

Choose a reason for hiding this comment

Uh oh!

holiman commented Nov 23, 2020

Uh oh!

vyrwu commented Nov 28, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vyrwu commented Nov 28, 2020

Uh oh!

holiman commented Dec 5, 2020

Uh oh!

holiman commented Jan 20, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

vyrwu commented Nov 21, 2020 •

edited

Loading

holiman commented Nov 23, 2020 •

edited

Loading

vyrwu Nov 28, 2020 •

edited

Loading

vyrwu commented Nov 28, 2020 •

edited

Loading