fix: respect Redis cluster slots when inserting multiple items #8185

Merged: carodewig merged 26 commits into dev from caroline/redis-fixes on Sep 19, 2025

Conversation

@carodewig (Contributor) commented Sep 2, 2025

The existing insert code silently fails when we try to insert multiple values that map to different Redis Cluster hash slots. This PR corrects that behavior, raises errors when inserts fail, adds new metrics to track Redis client health, and adds a test against redis-cluster.
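For background (this computation happens inside the Redis client, not in this PR's diff): Redis Cluster assigns every key to one of 16384 hash slots using CRC16-CCITT (XMODEM), and multi-key commands like MSET are rejected unless all keys hash to the same slot, which is why a single MSET spanning slots can't work. A minimal sketch of the slot computation:

```rust
// CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0x0000.
fn crc16_xmodem(data: &[u8]) -> u16 {
    let mut crc: u16 = 0;
    for &byte in data {
        crc ^= (byte as u16) << 8;
        for _ in 0..8 {
            crc = if crc & 0x8000 != 0 {
                (crc << 1) ^ 0x1021
            } else {
                crc << 1
            };
        }
    }
    crc
}

/// Slot for a key, honoring "{hashtag}" semantics: if the key has a
/// non-empty substring between the first '{' and the next '}', only that
/// substring is hashed. Callers can use this to force related keys
/// (e.g. "{user1}:a" and "{user1}:b") into the same slot.
fn hash_slot(key: &[u8]) -> u16 {
    let effective = match key.iter().position(|&b| b == b'{') {
        Some(open) => match key[open + 1..].iter().position(|&b| b == b'}') {
            Some(len) if len > 0 => &key[open + 1..open + 1 + len],
            _ => key,
        },
        None => key,
    };
    crc16_xmodem(effective) % 16384
}
```

For example, `hash_slot(b"foo")` is 12182 (matching `CLUSTER KEYSLOT foo`), while `bar` lands in a different slot, so `MSET foo 1 bar 2` fails on a cluster unless the keys share a hashtag.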

New metrics:

  • apollo.router.cache.redis.unresponsive: counter for 'unresponsive' events raised by the Redis library
    • kind: Redis cache purpose (APQ, query planner, entity)
    • server: Redis server that became unresponsive
  • apollo.router.cache.redis.reconnection: counter for 'reconnect' events raised by the Redis library
    • kind: Redis cache purpose (APQ, query planner, entity)
    • server: Redis server that required client reconnection
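For illustration only (the server values here are hypothetical), these counters would surface in Prometheus exposition format following the same naming convention as the existing Redis cache metrics (dots become underscores, counters get a `_total` suffix):

```text
apollo_router_cache_redis_unresponsive_total{kind="entity",otel_scope_name="apollo/router",server="redis-node-1:6379"} 3
apollo_router_cache_redis_reconnection_total{kind="entity",otel_scope_name="apollo/router",server="redis-node-1:6379"} 1
```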

Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • PR description explains the motivation for the change and relevant context for reviewing
  • PR description links appropriate GitHub/Jira tickets (creating when necessary)
  • Changeset is included for user-facing changes
  • Changes are compatible [1]
  • Documentation [2] completed
  • Performance impact assessed and acceptable
  • Metrics and logs are added [3] and documented
  • Tests added and passing [4]
    • Unit tests
    • Integration tests
    • Manual tests, as necessary

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. A lot of (if not most) features benefit from built-in observability and debug-level logs. Please read this guidance on metrics best-practices.

  4. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

apollo-librarian bot commented Sep 2, 2025

✅ Docs preview ready

The preview is ready to be viewed. View the preview

File Changes

0 new, 1 changed, 0 removed
* graphos/routing/(latest)/observability/telemetry/instrumentation/standard-instruments.mdx

Build ID: 3af6e296d2c94ce7fb5b9f00
Build Logs: View logs

URL: https://www.apollographql.com/docs/deploy-preview/3af6e296d2c94ce7fb5b9f00


@carodewig carodewig marked this pull request as ready for review September 2, 2025 19:31
@carodewig carodewig requested a review from a team September 2, 2025 19:31
@carodewig carodewig requested a review from a team as a code owner September 2, 2025 19:31
@goto-bus-stop (Member) left a comment:

LGTM based on visual inspection only

@abernix (Member) left a comment:

I'd just like to know what the testing strategy is for this.

@carodewig (Contributor, Author) replied:

> I'd just like to know what the testing strategy is for this.

@abernix it's all been manual thus far, as when I opened this PR I didn't have the bandwidth to get Redis Cluster set up with our GitHub testing infrastructure. But I'm working on that now for a different project and will add tests here as well!

@carodewig carodewig marked this pull request as draft September 11, 2025 15:09
@carodewig carodewig changed the title fix: respect Redis cluster slots when inserting multiple items feat: add more metrics for Redis client health, add test for redis-cluster behavior Sep 11, 2025
@carodewig carodewig marked this pull request as ready for review September 11, 2025 18:19
@abernix (Member) commented Sep 11, 2025

Thanks, @carodewig! Just wanted to make sure it didn't slip in with the existing approval w/o whatever we really wanted there. Appreciate the reply! 🙇

@carodewig carodewig changed the title feat: add more metrics for Redis client health, add test for redis-cluster behavior fix: respect Redis cluster slots when inserting multiple items Sep 12, 2025
@carodewig carodewig requested a review from abernix September 12, 2025 14:14
Comment on lines +940 to +943
for value in values {
let value: RedisValue<usize> = value.ok_or("missing value")?;
assert_eq!(value.0, expected_value);
}
A reviewer (Contributor) commented:
Just to double-check, because I tripped over it when trying to understand the test: we're setting the same value for all keys and then getting all keys (first as a series of GETs and then as one big MGET) to check their values, which is just the same int?

@carodewig (Contributor, Author) replied:
Yep! The idea with the 'same value' bit is if this test is somehow running twice against the same redis cluster, we're actually getting the value for this test. Probably unnecessary, but didn't add much complexity so I thought it was worth it as a backup.

router
    .assert_metrics_contains(
-       r#"apollo_router_cache_redis_commands_executed_total{kind="entity",otel_scope_name="apollo/router"} 17"#,
+       r#"apollo_router_cache_redis_commands_executed_total{kind="entity",otel_scope_name="apollo/router"} 16"#,
A reviewer (Contributor) commented:
how'd we lose a command invocation?

@carodewig (Contributor, Author) replied:
I spent a while on this and couldn't figure it out 💀

I think the 17 was deduced by running the test and seeing what value it spit out, as I'm not sure how you'd get 17 from 7 queries either?

I ended up deciding to set it aside since the other tests show the insert is working, but I definitely would love others' hypotheses on this.

Ok(server) => {
tracing::debug!("Redis client ({server:?}) unresponsive");
u64_counter_with_unit!(
"apollo.router.cache.redis.unresponsive",
A reviewer (Contributor) commented:
Is it worth distinguishing between the server and the client? Maybe as a tag or something, not sure; mostly it'd be nice to know whether the client is struggling (e.g., too many buffered commands while the server is still chomping away as expected) or the server is struggling (the client is happy but the server has ground to a halt for some reason).

@carodewig (Contributor, Author) replied:
I think you'd be able to get the client via your metrics ingest engine - for example, IIRC prometheus adds the 'target' to metrics it scrapes.

Or do you mean having a tag for the specific client within the router, if you've got multiple clients in the pool?

The reviewer (Contributor) replied:
more the ingestion bit; just some way to distinguish between server and client unresponsiveness, which if there's already some way to figure that out with defaults, then that'd be great

@carodewig (Contributor, Author) replied:
Sadly I don't think there's a good way to make that distinction; the unresponsive event is published by fred when it's gone a certain amount of time since hearing from the server per this config.

But the other metrics Bryn added around command queue length etc might be a good way to diagnose this live -- if you see a bunch of unresponsive events while the command queue length is high, that would be a good indicator of a troublesome Redis server.

for (key, value) in data {
let key = self.make_key(key.clone());
let _ = pipeline
.set::<(), _, _>(key, value.clone(), expiration.clone(), None, false)
A reviewer (Contributor) commented:
If I understand this right, previously when we had no TTL we'd use MSET to blast away a chonky set of writes; now we just use sequential SETs. Do we understand the performance differences between the two? I'm assuming so, but figured I'd ask just in case (sounds like the MSET was failing when multiple hash slots were targeted, but I'm wondering about the case where it wasn't silently failing).

Not a blocker, I don't think, because a working MSET for a good number of cases is better than a performant-but-broken MSET for a smaller number of cases.

@carodewig (Contributor, Author) replied:
That's correct!

I haven't profiled the MSET vs sequential SET behavior; I suspect which one is better would depend on environment, data size, etc. (i.e., if the payload is large enough, it's better to send multiple SETs).

But most users will not have been hitting the MSET path previously - insert_multiple is only used in the entity caching plugin and the vast majority of users will have a TTL set for that.
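One middle-ground approach (a sketch, not what this PR implements; `group_by_slot` and the `slot_of` closure are hypothetical stand-ins for the real CRC16-based slot computation) would be to batch multi-key writes per hash slot, so each batch can still go out as a single multi-key command or pipeline to the node that owns the slot:

```rust
use std::collections::HashMap;

/// Group (key, value) pairs by cluster hash slot so each group can be
/// written with one multi-key command (or one pipeline) to the node that
/// owns that slot. `slot_of` stands in for the real CRC16 slot function.
fn group_by_slot<K: Clone, V: Clone>(
    pairs: &[(K, V)],
    slot_of: impl Fn(&K) -> u16,
) -> HashMap<u16, Vec<(K, V)>> {
    let mut groups: HashMap<u16, Vec<(K, V)>> = HashMap::new();
    for (k, v) in pairs {
        // All entries in one bucket share a slot, so a per-bucket MSET
        // would not hit the cross-slot error.
        groups.entry(slot_of(k)).or_default().push((k.clone(), v.clone()));
    }
    groups
}
```

Whether this would actually beat plain pipelined SETs is exactly the profiling question raised above, so it is only a sketch.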

ports:
- 8126:8126

# redis cluster
A reviewer (Contributor) commented:
this is the only part of the pr that I'm sort of hesitant about; I don't think it's something to solve here (maybe I can take what's here and iterate on it), but it'd be super nice to not run both standalone redis and clustered redis (my office is in the attic and gets too warm already without having docker run more stuff)

@carodewig (Contributor, Author) replied:
I agree! Personally I think it'd be better to only use clustered Redis, but I didn't want to make that change everywhere as part of this PR in case others disagree.

The reviewer (Contributor) replied:
whew, spent some time on this and everything is messy apart from just running both!

@carodewig (Contributor, Author) replied:
I think it might be good to do something like this in the future, where we test against both clustered and non-clustered Redis, i.e., parameterize the tests and use rstest to run against both:

#[tokio::test]
#[rstest::rstest]
async fn multiple_documents(
    #[values(true, false)] clustered: bool,
) -> Result<(), BoxError> {
    let config = redis_config(clustered);
    todo!()
}

@carodewig carodewig dismissed abernix’s stale review September 19, 2025 13:27

Unit test added

@carodewig carodewig merged commit a98a362 into dev Sep 19, 2025
15 checks passed
@carodewig carodewig deleted the caroline/redis-fixes branch September 19, 2025 14:16
@abernix abernix mentioned this pull request Oct 27, 2025