Skip to content

WIP - Add RedisSessionHealthChecker#7547

Closed
pauldoomgov wants to merge 3 commits intomainfrom
pauldoom/sick-cache
Closed

WIP - Add RedisSessionHealthChecker#7547
pauldoomgov wants to merge 3 commits intomainfrom
pauldoom/sick-cache

Conversation

@pauldoomgov
Copy link
Contributor

🎫 Ticket

https://github.com/18F/identity-devops/issues/5629

🛠 Summary of changes

Add a Redis session check to the existing /api/health endpoint to ensure a working connection with Redis when assessing the health of a server.

📜 Testing Plan

TBD

  • Step 1
  • Step 2
  • Step 3

👀 Screenshots

TBD

If relevant, include a screenshot or screen capture of the changes.

Before:
After:

changelog: Internal, Platform, Improve health checking

Adds a check to attempt writing to the Redis session store then reading
the key back.  A default TTL of 1 second is used.  The check will fail if:

* Redis is not writeable
* The value can not be read back from Redis
* It takes more than 1 second to read back from Redis

# @api private
def health_write_and_read
REDIS_POOL.with do |client|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a blocker, but would we want to check all the Redis pools?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could at least add a check for the rate limiting Redis. Attempts storage is not always enabled so seems like a conditional health check would be needed for that.

Should we combine it all here in one spot and call this RedisHealthChecker?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that makes sense, yeah.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess they could be separate too, I don't have strong feelings, whatever is easier.


# @return [HealthCheckSummary]
def check
HealthCheckSummary.new(healthy: true, result: health_write_and_read)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the Outbound health check, would we want to cache this for a period of time so that we mitigate this as a potential vector for Denial of Service? I suppose we'd want to be careful to not use Redis as the cache as well.

Copy link
Contributor Author

@pauldoomgov pauldoomgov Dec 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are multiproc now and may be multithread soon so getting a per-instance global might be tricky. (*For me, a Ruby n00b.)

That said we are talking hundreds of calls a minute for normal health checking. Some sort of caching is a good idea.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ooo when did we upgrade to multithreaded?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't yet @zachmargolis... I updated my note.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said we are talking hundreds of calls a minute for normal health checking. Some sort of caching is a good idea.

Maybe a whole separate convo, but if this health check is happening so frequently, should we break redis health into a separate endpoint? Like separate high frequency "can you serve traffic" health check endpoints from medium frequency "are all connections working 100% normal"?

Because if redis is down, typically it's down for all instances, so doing one quick check across all instances every so often (1/minute) is fine vs 1/instance/minute or whatever?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking of some nasty failure modes like if Redis is unavailable from one of the AZs - We want all the instances in that AZ to go bye-bye. A Redis call is cheap and made for nearly every page served, which makes me worry about the frequency this will run less.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We spoke a bit and I think we should defer caching unless it becomes a problem.

Copy link
Contributor

@mitchellhenke mitchellhenke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is minor suggestion that moves it in the direction of potentially supporting multiple Redis checks here.

pauldoomgov and others added 2 commits December 30, 2022 11:47
Co-authored-by: Mitchell Henke <mitchell.henke@gsa.gov>
Co-authored-by: Mitchell Henke <mitchell.henke@gsa.gov>
@aduth
Copy link
Contributor

aduth commented Feb 24, 2023

Checking in on the status of old pull requests. Is this still being worked on, or can we close it?

@aduth
Copy link
Contributor

aduth commented Mar 2, 2023

Closing due to inactivity. This can be restored / reopened in the future if needed.

@aduth aduth closed this Mar 2, 2023
@aduth aduth deleted the pauldoom/sick-cache branch March 2, 2023 16:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants