Skip to content

Conversation

davissp14
Copy link
Contributor

@davissp14 davissp14 commented Feb 2, 2023

This PR ensures the primary is made read-only when disk capacity exceeds 90%.

I've introduced a new PG health check called disk-capacity. This health check has quite a bit of overlap with the existing VM check communicating disk usage, but given the implications and message I feel it should be separated.

How it works

On an interval defined by the health check, we will calculate disk usage and evaluate whether capacity has exceeded our pre-defined threshold of 90% ( should be made configurable in the future ).

When capacity exceeds the pre-defined threshold, we will set a readonly.lock file on the filesystem and establish a connection to each user-defined table ( anything that's not Postgres or Repmgr) and make it readonly. When disk usage falls below the defined threshold, either through file cleanup or volume extension, we will work to turn all tables back to read/write and then clear the readonly.lock file on success.

The readonly.lock file is used for two things:

  1. If the readonly.lock file is present, we know the database(s) have already been made readonly so there's no need to needlessly establish any more connections.

  2. It allows us to coordinate intentions across the codebase.

Limitations

Read-only mode is currently enabled at runtime and will be cleared on restart. That being said, there will be a small window on boot where the primary will accept writes before it is reconfigured back to read-only. If this becomes a problem, we can look into addressing this.

#56

@davissp14 davissp14 requested a review from DAlperin February 2, 2023 03:11
return "", fmt.Errorf("failed to turn primary readonly: %s", err)
}

return "", fmt.Errorf("%0.1f%% - extend your volume to re-enable writes", usedPercentage)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly sure why, but i'm getting MISSING here.

[✗] disk-capacity: 93.2%!e(MISSING)xtend your volume to re-enable writes (2.92ms)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been fixed with:

Comment on lines +13 to +15
// Primary will be made read-only when disk capacity reaches this percentage.
const diskCapacityPercentageThreshold = 90.0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been meaning to make health-check stuff configurable but that can be out of scope of this PR

return connectionCount(ctx, localConn)
})

if member.Role == flypg.PrimaryRoleName && member.Active {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to expose this healthcheck on replicas without setting them to readonly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like it might be useful info

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a VM check that communicates general capacity that should cover that. It makes me think though that maybe we need a new name for the check.

@davissp14 davissp14 merged commit ce0f3c8 into master Feb 2, 2023
@davissp14 davissp14 deleted the disk-capacity-check branch February 25, 2023 01:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants