fix: respect Redis cluster slots when inserting multiple items#8185
Conversation
✅ Docs preview ready. The preview is ready to be viewed: https://www.apollographql.com/docs/deploy-preview/3af6e296d2c94ce7fb5b9f00 (file changes: 0 new, 1 changed, 0 removed; build ID: 3af6e296d2c94ce7fb5b9f00)
goto-bus-stop
left a comment
LGTM based on visual inspection only
abernix
left a comment
I'd just like to know what the testing strategy is for this.
@abernix it's all been manual thus far, as when I opened this PR I didn't have the bandwidth to get redis cluster set up with our GitHub testing infrastructure. But I'm working on that now for a different project and will add tests here as well!

Thanks, @carodewig! Just wanted to make sure it didn't slip in with the existing approval w/o whatever we really wanted there. Appreciate the reply! 🙇
This reverts commit 19c6bf1.
```rust
for value in values {
    let value: RedisValue<usize> = value.ok_or("missing value")?;
    assert_eq!(value.0, expected_value);
}
```
just to double-check, because I tripped over it while trying to understand the test: we're setting the same value for all keys and then getting all keys (first as a series of GETs and then as one big MGET) to check that their values match, which is just the same int?
Yep! The idea with the 'same value' bit is that if this test somehow runs twice against the same Redis cluster, we're still reading the value written by this test. Probably unnecessary, but it didn't add much complexity, so I thought it was worth it as a backup.
```diff
 router
     .assert_metrics_contains(
-        r#"apollo_router_cache_redis_commands_executed_total{kind="entity",otel_scope_name="apollo/router"} 17"#,
+        r#"apollo_router_cache_redis_commands_executed_total{kind="entity",otel_scope_name="apollo/router"} 16"#,
```
how'd we lose a command invocation?
I spent a while on this and couldn't figure it out 💀
I think the 17 was derived by running the test and seeing what value it spat out, as I'm not sure how you'd get 17 from 7 queries either.
I ended up deciding to set it aside since the other tests show the insert is working, but I definitely would love others' hypotheses on this.
```rust
Ok(server) => {
    tracing::debug!("Redis client ({server:?}) unresponsive");
    u64_counter_with_unit!(
        "apollo.router.cache.redis.unresponsive",
```
is it worth distinguishing between the server and the client? maybe as a tag or something, not sure. Mostly, it'd be nice to know whether the client is struggling (e.g., too many buffered commands while the server is still chomping away as expected) or the server is struggling (the client is happy but the server has ground to a halt for some reason)
I think you'd be able to get the client via your metrics ingest engine; for example, IIRC Prometheus adds the 'target' to metrics it scrapes.
Or do you mean having a tag for the specific client within the router, if you've got multiple clients in the pool?
more the ingestion bit; just some way to distinguish between server and client unresponsiveness. If there's already some way to figure that out with the defaults, then that'd be great.
Sadly I don't think there's a good way to make that distinction; the unresponsive event is published by fred once it has gone a configurable amount of time without hearing from the server, per this config.
But the other metrics Bryn added around command queue length etc. might be a good way to diagnose this live: if you see a bunch of unresponsive events while the command queue length is high, that would be a good indicator of a troubled Redis server.
```rust
for (key, value) in data {
    let key = self.make_key(key.clone());
    let _ = pipeline
        .set::<(), _, _>(key, value.clone(), expiration.clone(), None, false)
```
if I understand this right, previously when we had no TTL we'd use MSET to blast away a chonky set of writes, and now we just use sequential SETs. Do we understand the performance difference between the two? I'm assuming so, but figured I'd ask just in case (sounds like the MSET was failing when multiple hash slots were targeted, but I'm wondering about the case where it wasn't silently failing)
not a blocker, I don't think, because a working MSET for a good number of cases is better than a performant-but-broken MSET for a smaller number of cases
That's correct!
I haven't profiled MSET vs. sequential SET behavior; I suspect which one is better depends on environment, data size, etc. (i.e., if the payload is large enough, it's better to send multiple SETs).
But most users will not have been hitting the MSET path previously: `insert_multiple` is only used in the entity caching plugin, and the vast majority of users will have a TTL set for that.
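For intuition on why the cross-slot MSET fails in the first place: Redis Cluster maps every key to one of 16384 hash slots via CRC16 (honoring `{hash tag}` segments), and a multi-key command is only accepted when all of its keys land in the same slot. Below is a minimal, self-contained sketch of that slot mapping plus a hypothetical `group_by_slot` helper showing how a batch could in principle be split into per-slot MSETs. This is illustrative only; it is not the router's or fred's actual routing code, and the function names are made up for this sketch.

```rust
use std::collections::HashMap;

/// CRC16-XMODEM (polynomial 0x1021, init 0), the checksum Redis Cluster
/// uses to map keys to hash slots.
fn crc16(data: &[u8]) -> u16 {
    let mut crc: u16 = 0;
    for &byte in data {
        crc ^= (byte as u16) << 8;
        for _ in 0..8 {
            crc = if crc & 0x8000 != 0 {
                (crc << 1) ^ 0x1021
            } else {
                crc << 1
            };
        }
    }
    crc
}

/// Map a key to one of the 16384 cluster hash slots, honoring hash tags:
/// only the first non-empty `{...}` segment is hashed, so `{user}:a` and
/// `{user}:b` are guaranteed to share a slot.
fn hash_slot(key: &[u8]) -> u16 {
    let effective = match key.iter().position(|&b| b == b'{') {
        Some(open) => match key[open + 1..].iter().position(|&b| b == b'}') {
            Some(len) if len > 0 => &key[open + 1..open + 1 + len],
            _ => key,
        },
        None => key,
    };
    crc16(effective) & 16383
}

/// Hypothetical helper: bucket keys by slot so each bucket could be sent
/// as one MSET without crossing slot boundaries.
fn group_by_slot<'a>(keys: &[&'a str]) -> HashMap<u16, Vec<&'a str>> {
    let mut groups: HashMap<u16, Vec<&'a str>> = HashMap::new();
    for &key in keys {
        groups.entry(hash_slot(key.as_bytes())).or_default().push(key);
    }
    groups
}

fn main() {
    // Keys sharing a hash tag always share a slot...
    assert_eq!(hash_slot(b"{user}:name"), hash_slot(b"{user}:email"));
    // ...while unrelated keys usually do not, which is what broke MSET.
    let groups = group_by_slot(&["{user}:name", "{user}:email", "foo", "bar"]);
    println!("{} distinct slot group(s) for 4 keys", groups.len());
}
```

Whether per-slot MSET batches would actually beat one pipeline of SETs presumably hinges on the same payload-size tradeoffs mentioned above.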
```yaml
    ports:
      - 8126:8126

# redis cluster
```
this is the only part of the PR that I'm sort of hesitant about. I don't think it's something to solve here (maybe I can take what's here and iterate on it), but it'd be super nice to not run both standalone Redis and clustered Redis (my office is in the attic and gets too warm already without Docker running more stuff)
I agree! Personally I think it'd be better to only use clustered Redis, but I didn't want to make that change everywhere as part of this PR in case others disagree.
whew, spent some time on this and everything is messy apart from just running both!
I think it might be good to do something like this in the future, where we test against both clustered and non-clustered Redis, i.e., parameterize the tests and use rstest to run against both (note `#[rstest]` has to come before `#[tokio::test]` for the two attributes to compose):

```rust
#[rstest::rstest]
#[tokio::test]
async fn multiple_documents(
    #[values(true, false)] clustered: bool,
) -> Result<(), BoxError> {
    let config = redis_config(clustered);
    todo!()
}
```
The existing insert code will silently fail when we try to insert multiple values which correspond to different Redis cluster hash slots. This PR corrects that behavior, raises errors when inserts fail, adds new metrics to track Redis client health, and adds a test against redis-cluster.
New metrics:

- `apollo.router.cache.redis.unresponsive`: counter for 'unresponsive' events raised by the Redis library
  - `kind`: Redis cache purpose (APQ, query planner, entity)
  - `server`: Redis server that became unresponsive
- `apollo.router.cache.redis.reconnection`: counter for 'reconnect' events raised by the Redis library
  - `kind`: Redis cache purpose (APQ, query planner, entity)
  - `server`: Redis server that required client reconnection

Checklist
Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.
Exceptions
Note any exceptions here
Notes
Footnotes
It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this. ↩
Configuration is an important part of many changes. Where applicable please try to document configuration examples. ↩
A lot of (if not most) features benefit from built-in observability and debug-level logs. Please read this guidance on metrics best-practices. ↩

Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions. ↩