Toggle primary read-only when disk capacity hits threshold #59
internal/flycheck/pg.go
```go
		return "", fmt.Errorf("failed to turn primary readonly: %s", err)
	}

	return "", fmt.Errorf("%0.1f%% - extend your volume to re-enable writes", usedPercentage)
```
Not exactly sure why, but I'm getting `MISSING` here:
```
[✗] disk-capacity: 93.2%!e(MISSING)xtend your volume to re-enable writes (2.92ms)
```
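The format string in the diff is valid on its own: `%%` renders a literal `%`. One plausible explanation for the output above, not confirmed in this thread, is that the finished message is later passed through a second Printf-style call as its format string. In that case `% - e` is parsed as an `%e` verb with the space and `-` flags and no operand, which Go's `fmt` renders as `%!e(MISSING)`, swallowing the `e` of "extend". The sketch below reproduces that exact string; `formatAgain` is a hypothetical stand-in for whatever layer reinterprets the message.

```go
package main

import "fmt"

// capacityMessage builds the check output the same way the reviewed diff does.
func capacityMessage(usedPercentage float64) string {
	return fmt.Sprintf("%0.1f%% - extend your volume to re-enable writes", usedPercentage)
}

// formatAgain is hypothetical: it stands in for any later layer (a logger or
// check runner, for example) that treats its first argument as a format string.
func formatAgain(format string, args []any) string {
	return fmt.Sprintf(format, args...)
}

func main() {
	msg := capacityMessage(93.2)
	fmt.Println(msg)

	// Reinterpreting the finished message as a format string turns "% - e"
	// into an %e verb with no operand.
	fmt.Println(formatAgain(msg, nil))

	// Passing the message as an operand instead keeps the literal '%'.
	fmt.Println(formatAgain("%s", []any{msg}))
}
```

If this is the cause, the fix belongs at the second call site: pass the message as data (`"%s", msg`) or build the error with `errors.New`, rather than changing the original format string.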
This has been fixed with:
```go
// Primary will be made read-only when disk capacity reaches this percentage.
const diskCapacityPercentageThreshold = 90.0
```
I've been meaning to make health-check settings configurable, but that can be out of scope for this PR.
```go
		return connectionCount(ctx, localConn)
	})

	if member.Role == flypg.PrimaryRoleName && member.Active {
```
Do we want to expose this health check on replicas, without setting them to read-only?
I feel like it might be useful info
There's a VM check that reports general disk capacity, which should cover that. It does make me think, though, that we may need a new name for this check.
This PR ensures the primary is made read-only when disk capacity exceeds 90%.
I've introduced a new PG health check called `disk-capacity`. It overlaps quite a bit with the existing VM check that reports disk usage, but given its implications and its message, I feel it should be a separate check.
How it works
On each health-check interval, we calculate disk usage and check whether it has exceeded the predefined threshold of 90% (this should be made configurable in the future).
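The per-interval decision can be sketched as below. This is a minimal sketch, not the PR's code: it assumes usage is computed as (total - available) / total from byte counts obtained elsewhere (for example from a statfs call on the data volume), and `usedPercentage` and `overThreshold` are hypothetical helpers. Only `diskCapacityPercentageThreshold` comes from the diff.

```go
package main

import "fmt"

// diskCapacityPercentageThreshold mirrors the constant from the PR diff.
const diskCapacityPercentageThreshold = 90.0

// usedPercentage returns how much of the volume is in use, as a percentage.
// Assumption: usage = (total - available) / total; the PR does not show
// the exact formula or where the byte counts come from.
func usedPercentage(totalBytes, availableBytes uint64) float64 {
	if totalBytes == 0 || availableBytes >= totalBytes {
		return 0
	}
	used := totalBytes - availableBytes
	return float64(used) * 100 / float64(totalBytes)
}

// overThreshold reports whether the primary should be made read-only.
// It uses >= to follow the constant's comment ("reaches this percentage").
func overThreshold(pct float64) bool {
	return pct >= diskCapacityPercentageThreshold
}

func main() {
	pct := usedPercentage(10_000, 680)
	fmt.Printf("%0.1f%% used, read-only: %v\n", pct, overThreshold(pct))
}
```

The description says "exceeded" while the constant's comment says "reaches"; the sketch picks `>=`, and switching to `>` only changes behavior at exactly 90%.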
When usage exceeds the threshold, we create a `readonly.lock` file on the filesystem, then connect to each user-defined database (anything other than `postgres` or `repmgr`) and make it read-only. When disk usage falls back below the threshold, either through file cleanup or volume extension, we turn all of those databases back to read/write and, on success, remove the `readonly.lock` file.
The `readonly.lock` file serves two purposes:
- If the `readonly.lock` file is present, we know the database(s) have already been made read-only, so there's no need to open any more connections.
- It lets us coordinate intent across the codebase.
Limitations
Read-only mode is currently enabled at runtime and is cleared on restart. As a result, there is a small window on boot where the primary accepts writes before it is switched back to read-only. If this becomes a problem, we can look into addressing it.
#56