Conversation
changelog: Internal, Platform, Improve health checking

Adds a check that attempts writing to the Redis session store and then reading the key back. A default TTL of 1 second is used. The check will fail if:

* Redis is not writeable
* The value cannot be read back from Redis
* It takes more than 1 second to read the value back from Redis
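As a rough sketch, the write-then-read check described above could look something like the following. The method names, the `HEALTH_TTL_SECONDS` constant, and the `FakeRedis` stand-in are illustrative assumptions, not the PR's actual code:

```ruby
require 'securerandom'

HEALTH_TTL_SECONDS = 1 # default TTL from the changelog above

# Write a throwaway key with a short TTL, then read it back.
# Returns true only if the round trip succeeds.
def health_write_and_read(client)
  key   = "health-check-#{SecureRandom.uuid}"
  value = SecureRandom.uuid
  client.setex(key, HEALTH_TTL_SECONDS, value)
  client.get(key) == value
end

# In-memory stand-in for a Redis client so this sketch runs without a server.
class FakeRedis
  def initialize
    @store = {}
  end

  def setex(key, _ttl, value)
    @store[key] = value
  end

  def get(key)
    @store[key]
  end
end

puts health_write_and_read(FakeRedis.new) # prints "true"
```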
force-pushed from 47c788a to 9f822dc
```ruby
# @api private
def health_write_and_read
  REDIS_POOL.with do |client|
```
Not a blocker, but would we want to check all the Redis pools?
We could at least add a check for the rate-limiting Redis. Attempts storage is not always enabled, so it seems like a conditional health check would be needed for that.
Should we combine it all here in one spot and call this RedisHealthChecker?
I think that makes sense, yeah.
I guess they could be separate too, I don't have strong feelings, whatever is easier.
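If the checks were combined, a `RedisHealthChecker` along the lines discussed might look roughly like this. The pool names, the `Struct` stand-in for `HealthCheckSummary`, and the per-pool check body are all illustrative assumptions:

```ruby
# Minimal stand-in for the app's HealthCheckSummary, for illustration only.
HealthCheckSummary = Struct.new(:healthy, :result, keyword_init: true)

# Hypothetical combined checker covering several Redis pools.
class RedisHealthChecker
  # pools: a Hash of name => object responding to #with (a connection pool)
  def initialize(pools)
    @pools = pools
  end

  def check
    results = @pools.transform_values { |pool| write_and_read(pool) }
    HealthCheckSummary.new(healthy: results.values.all?, result: results)
  end

  private

  # Placeholder per-pool check; the real body would do the setex/get round trip.
  def write_and_read(pool)
    pool.with { |client| client.ping == 'PONG' }
  rescue StandardError
    false
  end
end

# Tiny fake pool/client so the sketch is runnable without Redis.
FakeClient = Struct.new(:ok) do
  def ping
    ok ? 'PONG' : raise('down')
  end
end
FakePool = Struct.new(:client) do
  def with
    yield(client)
  end
end

checker = RedisHealthChecker.new(
  sessions: FakePool.new(FakeClient.new(true)),
  rate_limit: FakePool.new(FakeClient.new(true))
)
puts checker.check.healthy # prints "true"
```

One pool failing marks the whole summary unhealthy, while the per-pool results remain visible in `result`.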
```ruby
# @return [HealthCheckSummary]
def check
  HealthCheckSummary.new(healthy: true, result: health_write_and_read)
```
Similar to the Outbound health check, would we want to cache this for a period of time so that we mitigate this as a potential vector for Denial of Service? I suppose we'd want to be careful to not use Redis as the cache as well.
We are multiproc now and may be multithread soon, so getting a per-instance global might be tricky. (For me, a Ruby n00b.)
That said we are talking hundreds of calls a minute for normal health checking. Some sort of caching is a good idea.
ooo when did we upgrade to multithreaded?
Haven't yet @zachmargolis... I updated my note.
> That said we are talking hundreds of calls a minute for normal health checking. Some sort of caching is a good idea.
Maybe a whole separate convo, but if this health check is happening so frequently, should we break redis health into a separate endpoint? Like separate high frequency "can you serve traffic" health check endpoints from medium frequency "are all connections working 100% normal"?
Because if redis is down, typically it's down for all instances, so doing one quick check across all instances every so often (1/minute) is fine vs 1/instance/minute or whatever?
I am thinking of some nasty failure modes, like Redis being unavailable from one of the AZs: we want all the instances in that AZ to go bye-bye. A Redis call is cheap and made for nearly every page served, which makes me worry less about how frequently this will run.
We spoke a bit and I think we should defer caching unless it becomes a problem.
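For reference, the per-process, time-based caching that was deferred here could be sketched as follows. The TTL, the wrapper class, and its interface are illustrative assumptions only:

```ruby
# Sketch of a per-process cache around a health check result.
# Caching was deferred in this PR; this only illustrates the idea.
class CachedCheck
  def initialize(ttl_seconds:, &check)
    @ttl       = ttl_seconds
    @check     = check
    @mutex     = Mutex.new # guards state if the app goes multithreaded
    @cached_at = nil
  end

  def call
    @mutex.synchronize do
      now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      if @cached_at.nil? || now - @cached_at >= @ttl
        @result    = @check.call
        @cached_at = now
      end
      @result
    end
  end
end

calls  = 0
cached = CachedCheck.new(ttl_seconds: 60) do
  calls += 1
  true # pretend the real Redis round trip happened here
end

3.times { cached.call }
puts calls # prints "1": the underlying check ran only once
```

Note this caches per process, so under multiproc each worker still runs its own check once per TTL window.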
mitchellhenke left a comment:
This is a minor suggestion that moves it in the direction of potentially supporting multiple Redis checks here.
Co-authored-by: Mitchell Henke <mitchell.henke@gsa.gov>
Checking in on the status of old pull requests. Is this still being worked on, or can we close it?

Closing due to inactivity. This can be restored / reopened in the future if needed.
🎫 Ticket
https://github.com/18F/identity-devops/issues/5629
🛠 Summary of changes
Add a Redis session check to the existing `/api/health` endpoint to ensure a working connection with Redis when assessing the health of a server.

📜 Testing Plan
TBD
👀 Screenshots
TBD