cmd/utils: Handle graceful shutdown on low disk space #21884
Currently, the program will panic when the node runs out of memory, sometimes leading to corrupting the databases.
This commit introduces a goroutine that checks for available disk space every 5 seconds:
- If less than 500 MB, prints a warning
- If less than 100 MB, writes SIGTERM to the channel that is used to handle graceful termination of a node
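A minimal sketch of the polling loop described above, assuming a Unix-like system where syscall.Statfs is available. The names classify, freeDiskMB, and monitorDisk are hypothetical and are not taken from the PR; only the 500 MB / 100 MB thresholds, the 5-second interval, and the idea of writing SIGTERM to the shutdown channel come from the description.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"time"
)

// Thresholds from the PR description, in megabytes.
const (
	warnMB = 500
	exitMB = 100
)

type action int

const (
	actionNone action = iota
	actionWarn
	actionShutdown
)

// classify maps the available disk space to the action the monitor takes.
func classify(availMB uint64) action {
	switch {
	case availMB < exitMB:
		return actionShutdown
	case availMB < warnMB:
		return actionWarn
	default:
		return actionNone
	}
}

// freeDiskMB reports the space available to unprivileged users on the
// filesystem containing path. Unix-only: it relies on syscall.Statfs.
func freeDiskMB(path string) (uint64, error) {
	var stat syscall.Statfs_t
	if err := syscall.Statfs(path, &stat); err != nil {
		return 0, err
	}
	return uint64(stat.Bavail) * uint64(stat.Bsize) / (1024 * 1024), nil
}

// monitorDisk polls the free space and requests a graceful shutdown by
// sending SIGTERM on sigc once the space drops below exitMB.
func monitorDisk(path string, interval time.Duration, sigc chan<- os.Signal) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		availMB, err := freeDiskMB(path)
		if err != nil {
			fmt.Fprintln(os.Stderr, "WARN: disk space check failed:", err)
			continue
		}
		switch classify(availMB) {
		case actionShutdown:
			fmt.Fprintf(os.Stderr, "WARN: only %d MB left, shutting down\n", availMB)
			sigc <- syscall.SIGTERM
			return
		case actionWarn:
			fmt.Fprintf(os.Stderr, "WARN: low disk space, %d MB left\n", availMB)
		}
	}
}

func main() {
	sigc := make(chan os.Signal, 1)
	// The PR polls every 5 seconds; a short interval keeps this demo quick.
	go monitorDisk(".", 20*time.Millisecond, sigc)
	select {
	case sig := <-sigc:
		fmt.Println("received", sig, "and would now shut down gracefully")
	case <-time.After(100 * time.Millisecond):
		fmt.Println("no shutdown requested during the demo window")
	}
}
```

Keeping the threshold decision in a pure function separate from the Statfs call makes the policy testable without filling a disk.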
@vyrwu thanks for picking up this task!
You're currently checking … Otherwise, I think it looks good, and it's a clever idea to use the SIGTERM channel to simulate a regular exit! We'll have to check whether 100 MB is sufficient. Whenever geth exits, it has quite a lot of data held in memory which must be persisted, so 100 MB might be on the low side. I would assume the level for low-disk-exit should be on the same order of magnitude as the cache limit.
Thanks for the fast reply. I will work on it next weekend. I'll try testing geth a bit too, and see whether I can simulate OOM somehow to verify that the fix does what it's supposed to.
cmd/utils/cmd.go
Outdated
@@ -312,3 +313,28 @@ func ExportPreimages(db ethdb.Database, fn string) error {
log.Info("Exported preimages", "file", fn)
return nil
}

func ensureSufficientMemory(sigc chan os.Signal) {
It's a bit confusing to have this method named ensureSufficientMemory, as it's disk space, not RAM, that's being checked.
Addressed in c355ae4
cmd/utils/cmd.go
Outdated
log.Info("Available disk space is less than 100 MB. Gracefully shutting down to prevent database corruption.")
sigc <- syscall.SIGTERM
} else if avMemMB < 500 {
log.Warnf("Node is running low on memory. It will terminate if memory runs below 100MB. Remaining: %v MB.", avMemMB)
Same here, low on disk space ... if disk space runs below...
Addressed in c355ae4
cmd/utils/cmd.go
Outdated
go func() {
var stat syscall.Statfs_t
wd, err := os.Getwd(); err != nil {
Fatalf("Error reading available memory of Node: %v", err)
Please avoid Fatalf: it causes an immediate os.Exit, which means it will almost certainly cause data loss and/or database corruption. We only use it in cases where things are already irrevocably broken beyond repair.
Just use log.Warn (and perhaps send a SIGTERM?)
Addressed in c355ae4
As for testing, it might be good to make it trigger after ~1h, and right before it triggers, print out the measured disk size. And once geth has exited, you can dump out the same measure again. Doing that a few times should collect some stats on how much disk is used by the shutdown process.
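The measurement suggested here could be sketched roughly as follows, assuming a Unix-like system: sample the free space just before the shutdown is triggered, sample it again after the process has exited, and keep the largest difference seen over several runs. The function names are hypothetical, and the sample values in main are made-up placeholders, not measurements.

```go
package main

import (
	"fmt"
	"syscall"
)

// freeBytes reports the bytes available on the filesystem containing path.
// Unix-only: it relies on syscall.Statfs.
func freeBytes(path string) (uint64, error) {
	var stat syscall.Statfs_t
	if err := syscall.Statfs(path, &stat); err != nil {
		return 0, err
	}
	return uint64(stat.Bavail) * uint64(stat.Bsize), nil
}

// shutdownCost returns how many bytes were consumed between the two samples.
// It returns 0 if free space grew, e.g. because something else freed data.
func shutdownCost(before, after uint64) uint64 {
	if after >= before {
		return 0
	}
	return before - after
}

// maxCost returns the largest observed cost, a conservative basis for
// choosing the low-disk exit threshold.
func maxCost(costs []uint64) uint64 {
	var m uint64
	for _, c := range costs {
		if c > m {
			m = c
		}
	}
	return m
}

func main() {
	if free, err := freeBytes("."); err == nil {
		fmt.Println("free bytes right now:", free)
	}
	// Placeholder (before, after) pairs standing in for real runs.
	runs := [][2]uint64{{900, 650}, {800, 420}, {700, 710}}
	var costs []uint64
	for _, r := range runs {
		costs = append(costs, shutdownCost(r[0], r[1]))
	}
	fmt.Println("largest shutdown cost in placeholder units:", maxCost(costs))
}
```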
This is still untested. I was reading a bit about Geth today and digging more into the codebase. I still need to find the right cache limit for the disk size thresholds; there are a few defined in code, but I'm sure I can find the right one after understanding it a little better. 👍 My intuition is that it's 1000 MB.
Also need to look into these CI errors:
About the
Also particular, The sys package has So I think what's needed is,
And once we have that, create architecture-specific files, with a method e.g.
Superseded by #22103