fix: respect Redis cluster slots when inserting multiple items #8185

Merged: carodewig merged 26 commits into dev from caroline/redis-fixes on Sep 19, 2025

Conversation

@carodewig (Contributor) commented Sep 2, 2025

The existing insert code silently fails when we try to insert multiple values that map to different Redis Cluster hash slots. This PR corrects that behavior, raises errors when inserts fail, adds new metrics to track Redis client health, and adds a test against redis-cluster.
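For background (this computation happens inside the Redis client, not in this PR's diff): Redis Cluster assigns every key to one of 16384 hash slots using CRC16-CCITT (XMODEM), and multi-key commands like MSET are rejected unless all keys hash to the same slot, which is why a single MSET spanning slots can't work. A minimal sketch of the slot computation:

```rust
// CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0x0000.
fn crc16_xmodem(data: &[u8]) -> u16 {
    let mut crc: u16 = 0;
    for &byte in data {
        crc ^= (byte as u16) << 8;
        for _ in 0..8 {
            crc = if crc & 0x8000 != 0 {
                (crc << 1) ^ 0x1021
            } else {
                crc << 1
            };
        }
    }
    crc
}

/// Slot for a key, honoring "{hashtag}" semantics: if the key has a
/// non-empty substring between the first '{' and the next '}', only that
/// substring is hashed. Callers can use this to force related keys
/// (e.g. "{user1}:a" and "{user1}:b") into the same slot.
fn hash_slot(key: &[u8]) -> u16 {
    let effective = match key.iter().position(|&b| b == b'{') {
        Some(open) => match key[open + 1..].iter().position(|&b| b == b'}') {
            Some(len) if len > 0 => &key[open + 1..open + 1 + len],
            _ => key,
        },
        None => key,
    };
    crc16_xmodem(effective) % 16384
}
```

For example, `hash_slot(b"foo")` is 12182 (matching `CLUSTER KEYSLOT foo`), while `bar` lands in a different slot, so `MSET foo 1 bar 2` fails on a cluster unless the keys share a hashtag.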

New metrics:

  • apollo.router.cache.redis.unresponsive: counter for 'unresponsive' events raised by the Redis library
    • kind: Redis cache purpose (APQ, query planner, entity)
    • server: Redis server that became unresponsive
  • apollo.router.cache.redis.reconnection: counter for 'reconnect' events raised by the Redis library
    • kind: Redis cache purpose (APQ, query planner, entity)
    • server: Redis server that required client reconnection
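For illustration only (the server values here are hypothetical), these counters would surface in Prometheus exposition format following the same naming convention as the existing Redis cache metrics (dots become underscores, counters get a `_total` suffix):

```text
apollo_router_cache_redis_unresponsive_total{kind="entity",otel_scope_name="apollo/router",server="redis-node-1:6379"} 3
apollo_router_cache_redis_reconnection_total{kind="entity",otel_scope_name="apollo/router",server="redis-node-1:6379"} 1
```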

Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • PR description explains the motivation for the change and relevant context for reviewing
  • PR description links appropriate GitHub/Jira tickets (creating when necessary)
  • Changeset is included for user-facing changes
  • Changes are compatible [1]
  • Documentation [2] completed
  • Performance impact assessed and acceptable
  • Metrics and logs are added [3] and documented
  • Tests added and passing [4]
    • Unit tests
    • Integration tests
    • Manual tests, as necessary

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. A lot of (if not most) features benefit from built-in observability and debug-level logs. Please read this guidance on metrics best-practices.

  4. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

apollo-librarian bot commented Sep 2, 2025

✅ Docs preview ready

The preview is ready to be viewed. View the preview

File Changes

0 new, 1 changed, 0 removed
* graphos/routing/(latest)/observability/telemetry/instrumentation/standard-instruments.mdx

Build ID: 3af6e296d2c94ce7fb5b9f00
Build Logs: View logs

URL: https://www.apollographql.com/docs/deploy-preview/3af6e296d2c94ce7fb5b9f00


@carodewig carodewig marked this pull request as ready for review September 2, 2025 19:31
@carodewig carodewig requested a review from a team September 2, 2025 19:31
@carodewig carodewig requested a review from a team as a code owner September 2, 2025 19:31
@goto-bus-stop (Member) left a comment:

LGTM based on visual inspection only

@abernix (Member) left a comment:

I'd just like to know what the testing strategy is for this.

@carodewig (Contributor, Author) replied:

> I'd just like to know what the testing strategy is for this.

@abernix it's all been manual thus far, as when I opened this PR I didn't have the bandwidth to get Redis Cluster set up with our GitHub testing infrastructure. But I'm working on that now for a different project and will add tests here as well!

@carodewig carodewig marked this pull request as draft September 11, 2025 15:09
@carodewig carodewig changed the title fix: respect Redis cluster slots when inserting multiple items feat: add more metrics for Redis client health, add test for redis-cluster behavior Sep 11, 2025
@carodewig carodewig marked this pull request as ready for review September 11, 2025 18:19
@abernix (Member) commented Sep 11, 2025

Thanks, @carodewig! Just wanted to make sure it didn't slip in with the existing approval w/o whatever we really wanted there. Appreciate the reply! 🙇

@carodewig carodewig changed the title feat: add more metrics for Redis client health, add test for redis-cluster behavior fix: respect Redis cluster slots when inserting multiple items Sep 12, 2025
@carodewig carodewig requested a review from abernix September 12, 2025 14:14
Comment on lines +940 to +943
for value in values {
let value: RedisValue<usize> = value.ok_or("missing value")?;
assert_eq!(value.0, expected_value);
}
A reviewer (Contributor) commented:
Just to double-check, because I tripped over it when trying to understand the test: we're setting the same value for all keys and then getting all keys (first as a series of GETs and then as one big MGET) to check their values, which is just the same int?

@carodewig (Contributor, Author) replied:
Yep! The idea with the 'same value' bit is if this test is somehow running twice against the same redis cluster, we're actually getting the value for this test. Probably unnecessary, but didn't add much complexity so I thought it was worth it as a backup.

router
    .assert_metrics_contains(
-       r#"apollo_router_cache_redis_commands_executed_total{kind="entity",otel_scope_name="apollo/router"} 17"#,
+       r#"apollo_router_cache_redis_commands_executed_total{kind="entity",otel_scope_name="apollo/router"} 16"#,
A reviewer (Contributor) commented:
how'd we lose a command invocation?

@carodewig (Contributor, Author) replied:
I spent a while on this and couldn't figure it out 💀

I think the 17 was deduced by running the test and seeing what value it spit out, as I'm not sure how you'd get 17 from 7 queries either?

I ended up deciding to set it aside since the other tests show the insert is working, but I definitely would love others' hypotheses on this.

Ok(server) => {
tracing::debug!("Redis client ({server:?}) unresponsive");
u64_counter_with_unit!(
"apollo.router.cache.redis.unresponsive",
A reviewer (Contributor) commented:
Is it worth distinguishing between the server and the client? Maybe as a tag or something, not sure; mostly it'd be nice to know whether the client is struggling (e.g., too many buffered commands while the server is still chomping away as expected) or the server is struggling (the client is happy but the server has ground to a halt for some reason).

@carodewig (Contributor, Author) replied:
I think you'd be able to get the client via your metrics ingest engine - for example, IIRC prometheus adds the 'target' to metrics it scrapes.

Or do you mean having a tag for the specific client within the router, if you've got multiple clients in the pool?

The reviewer (Contributor) replied:
more the ingestion bit; just some way to distinguish between server and client unresponsiveness, which if there's already some way to figure that out with defaults, then that'd be great

@carodewig (Contributor, Author) replied:
Sadly I don't think there's a good way to make that distinction; the unresponsive event is published by fred when it's gone a certain amount of time since hearing from the server per this config.

But the other metrics Bryn added around command queue length etc might be a good way to diagnose this live -- if you see a bunch of unresponsive events while the command queue length is high, that would be a good indicator of a troublesome Redis server.

for (key, value) in data {
let key = self.make_key(key.clone());
let _ = pipeline
.set::<(), _, _>(key, value.clone(), expiration.clone(), None, false)
A reviewer (Contributor) commented:
If I understand this right, previously when we had no TTL we'd use MSET to blast away a chonky set of writes; now we just use sequential SETs. Do we understand the performance differences between the two? I'm assuming so, but figured I'd ask just in case (sounds like the MSET was failing when multiple hash slots were targeted, but I'm wondering about the case where it wasn't silently failing).

Not a blocker, I don't think, because a working MSET for a good number of cases is better than a performant-but-broken MSET for a smaller number of cases.

@carodewig (Contributor, Author) replied:
That's correct!

I haven't profiled the MSET vs sequential SET behavior; I suspect which one is better would depend on environment, data size, etc. (i.e., if the payload is large enough, it's better to send multiple SETs).

But most users will not have been hitting the MSET path previously - insert_multiple is only used in the entity caching plugin and the vast majority of users will have a TTL set for that.
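One middle-ground approach (a sketch, not what this PR implements; `group_by_slot` and the `slot_of` closure are hypothetical stand-ins for the real CRC16-based slot computation) would be to batch multi-key writes per hash slot, so each batch can still go out as a single multi-key command or pipeline to the node that owns the slot:

```rust
use std::collections::HashMap;

/// Group (key, value) pairs by cluster hash slot so each group can be
/// written with one multi-key command (or one pipeline) to the node that
/// owns that slot. `slot_of` stands in for the real CRC16 slot function.
fn group_by_slot<K: Clone, V: Clone>(
    pairs: &[(K, V)],
    slot_of: impl Fn(&K) -> u16,
) -> HashMap<u16, Vec<(K, V)>> {
    let mut groups: HashMap<u16, Vec<(K, V)>> = HashMap::new();
    for (k, v) in pairs {
        // All entries in one bucket share a slot, so a per-bucket MSET
        // would not hit the cross-slot error.
        groups.entry(slot_of(k)).or_default().push((k.clone(), v.clone()));
    }
    groups
}
```

Whether this would actually beat plain pipelined SETs is exactly the profiling question raised above, so it is only a sketch.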

ports:
- 8126:8126

# redis cluster
A reviewer (Contributor) commented:
this is the only part of the pr that I'm sort of hesitant about; I don't think it's something to solve here (maybe I can take what's here and iterate on it), but it'd be super nice to not run both standalone redis and clustered redis (my office is in the attic and gets too warm already without having docker run more stuff)

@carodewig (Contributor, Author) replied:
I agree! Personally I think it'd be better to only use clustered Redis, but I didn't want to make that change everywhere as part of this PR in case others disagree.

The reviewer (Contributor) replied:
whew, spent some time on this and everything is messy apart from just running both!

@carodewig (Contributor, Author) replied:
I think it might be good to do something like this in the future, where we test against both clustered and non-clustered Redis, i.e., parameterize the tests and use rstest to run against both:

#[tokio::test]
#[rstest::rstest]
async fn multiple_documents(
    #[values(true, false)] clustered: bool,
) -> Result<(), BoxError> {
    let config = redis_config(clustered);
    todo!()
}

@carodewig carodewig dismissed abernix’s stale review September 19, 2025 13:27

Unit test added

@carodewig carodewig merged commit a98a362 into dev Sep 19, 2025
15 checks passed
@carodewig carodewig deleted the caroline/redis-fixes branch September 19, 2025 14:16
@abernix abernix mentioned this pull request Oct 27, 2025