Provide a backpressure enabled pipeline by garypen · Pull Request #6486 · apollographql/router

garypen · 2024-12-18T17:14:33Z

Most of the functionality is now in place for effective traffic-shaping at the router and subgraph levels.

Please read the review guidance before starting your review.

dev-docs/BACKPRESSURE_REVIEW_NOTES.md

There's a long way to go, but some interesting aspects of the required changes are implemented here. I noted that a significant number of errors, glitches, and test failures are resolved by making these changes, mainly in places where we were calling a service without readying it first, or cloning an inner service without doing the memory "replace" dance. Because we are currently using a "trick" to make the router service cloneable, a few tests don't work properly. I think for the "full" work, we'll need to make the router service properly cloneable (without requiring a mutex). This will require some fairly substantial re-working of a wide variety of services and layers. On the plus side, once we've done that work, we'll be able to retire a bunch of code that we will no longer require. I'll pick this up in the New Year...
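As a rough illustration of the two problems mentioned above (calling an un-readied service, and cloning without the "replace" dance), here is a minimal sketch. It assumes a simplified, synchronous stand-in for `tower::Service`; the `Service` trait, `Inner`, and `Wrapper` below are hypothetical and are not the router's types.

```rust
use std::mem;

/// Simplified, synchronous stand-in for `tower::Service` (an assumption for
/// illustration): `poll_ready` reserves capacity and `call` consumes it.
trait Service {
    fn poll_ready(&mut self) -> bool;
    fn call(&mut self, req: &str) -> Result<String, String>;
}

/// Hypothetical inner service that rejects a call unless it was readied first.
#[derive(Default)]
struct Inner {
    ready: bool,
}

// A clone never inherits readiness: a cloned service must be readied again.
impl Clone for Inner {
    fn clone(&self) -> Self {
        Inner { ready: false }
    }
}

impl Service for Inner {
    fn poll_ready(&mut self) -> bool {
        self.ready = true;
        true
    }

    fn call(&mut self, req: &str) -> Result<String, String> {
        // Consume the readiness; a second call needs another poll_ready.
        if !mem::take(&mut self.ready) {
            return Err("called before ready".to_string());
        }
        Ok(format!("handled {req}"))
    }
}

/// Hypothetical wrapper that delegates readiness to its inner service.
#[derive(Clone)]
struct Wrapper {
    inner: Inner,
}

impl Wrapper {
    fn poll_ready(&mut self) -> bool {
        self.inner.poll_ready()
    }

    /// Buggy: the fresh clone was never readied, so the call fails.
    fn call_with_clone(&mut self, req: &str) -> Result<String, String> {
        let mut inner = self.inner.clone();
        inner.call(req)
    }

    /// The "replace" dance: leave the un-readied clone behind in `self` and
    /// take ownership of the instance that was actually readied. In real
    /// Tower code the owned service is then moved into the returned future.
    fn call_with_replace(&mut self, req: &str) -> Result<String, String> {
        let clone = self.inner.clone();
        let mut ready_inner = mem::replace(&mut self.inner, clone);
        ready_inner.call(req)
    }
}
```

With this model, `call_with_clone` returns an error even after `poll_ready` succeeded, while `call_with_replace` uses the readied instance and succeeds.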

svc-apollo-docs · 2024-12-18T17:14:36Z

✅ Docs preview ready

The preview is ready to be viewed.

File Changes

0 new, 1 changed, 0 removed

* graphos/reference/migration/from-router-v1.mdx

Build ID: c58fa85d7d545d7bab96c675

URL: https://www.apollographql.com/docs/deploy-preview/c58fa85d7d545d7bab96c675

router-perf · 2024-12-18T17:15:03Z

IMPORTANT: This change modifies the supergraph method invocation test to be a router_service invocation test. Amongst other important details (for example, we are now really testing the service through the full pipeline), note that we can't use `oneshot()` and just re-create the service every time we want to call it. If we do, the rate-limiting state is lost, so we must re-use our service to preserve that state.
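To make the point about lost state concrete, here is a minimal sketch under stated assumptions: `RateLimited` is a hypothetical counter-based limiter (a real limiter would also refill over an interval), and neither function is taken from the router's test suite.

```rust
/// Hypothetical rate limiter: allows `limit` calls, then rejects.
/// A real limiter would refill over a time window; that is omitted here.
struct RateLimited {
    limit: u32,
    used: u32,
}

impl RateLimited {
    fn new(limit: u32) -> Self {
        RateLimited { limit, used: 0 }
    }

    fn call(&mut self, _req: &str) -> Result<(), &'static str> {
        if self.used >= self.limit {
            return Err("rate limited");
        }
        self.used += 1;
        Ok(())
    }
}

/// Anti-pattern: build a brand-new service for every request, so each
/// request sees a fresh counter and nothing is ever rejected.
fn rejected_when_recreated(requests: u32, limit: u32) -> u32 {
    (0..requests)
        .filter(|_| RateLimited::new(limit).call("query").is_err())
        .count() as u32
}

/// Re-use one service so its counter survives across requests.
fn rejected_when_reused(requests: u32, limit: u32) -> u32 {
    let mut svc = RateLimited::new(limit);
    (0..requests)
        .filter(|_| svc.call("query").is_err())
        .count() as u32
}
```

With five requests and a limit of two, the re-created service rejects nothing, while the re-used service rejects the last three.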

Since we now fail traffic shaping tests at the router_service, they are not counted as graphql_errors (which are only processed, as they should be, at the supergraph_service). IMO, these should never have been counted as GraphQL errors anyway, since they clearly aren't GraphQL errors but traffic shaping (rate limit, timeout, etc.) errors. We'll still report them to the user as a 504 or a 503 or whatever, but they won't count towards the graphql_error metric. I've also updated a snapshot to reflect the error message we now provide.
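A minimal sketch of that separation, assuming hypothetical types: `ShapingError`, `Failure`, and `Metrics` are illustrative names only, and the exact status mapping (504 for timeouts, 503 for rate limits) is an assumption based on the codes mentioned above rather than a copy of the router's behaviour.

```rust
/// Hypothetical traffic shaping error kinds (not the router's own types).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ShapingError {
    Timeout,
    RateLimited,
}

/// Hypothetical failure classification for a request.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Failure {
    Shaping(ShapingError),
    GraphQl,
}

/// Stand-in for the graphql_error metric.
#[derive(Debug, Default)]
struct Metrics {
    graphql_errors: u64,
}

/// Assumed mapping: traffic shaping failures surface as HTTP status codes.
fn shaping_status(err: ShapingError) -> u16 {
    match err {
        ShapingError::Timeout => 504,
        ShapingError::RateLimited => 503,
    }
}

/// Only genuine GraphQL errors increment the counter; traffic shaping
/// failures are reported to the client but never counted here.
fn record(metrics: &mut Metrics, failure: Failure) {
    if failure == Failure::GraphQl {
        metrics.graphql_errors += 1;
    }
}
```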

apollo-router/src/services/router/service.rs

The various tests in the limits layer should be readying their service before they call it. The deduplication function (dedup) accepted a readied service and then used `ready_oneshot()`. Since it can only accept a readied service, it now uses a simple `call()` invocation instead.

.changesets/feat_garypen_next_backpressure.md

apollo-router/src/axum_factory/tests.rs

apollo-router/src/layers/async_checkpoint.rs

apollo-router/src/plugins/authorization/mod.rs

apollo-router/src/plugins/limits/layer.rs

apollo-router/src/plugins/traffic_shaping/deduplication.rs

apollo-router/src/plugins/traffic_shaping/mod.rs

apollo-router/src/query_planner/tests.rs

apollo-router/src/services/hickory_dns_connector.rs

apollo-router/src/services/http/service.rs

apollo-router/src/services/layers/allow_only_http_post_mutations.rs

apollo-router/src/services/layers/apq.rs

apollo-router/src/services/layers/content_negotiation.rs

apollo-router/src/services/router/service.rs

apollo-router/tests/integration/traffic_shaping.rs

…vice_closed` test

This appears to test an intended-enough behaviour.

Originally introduced: #918
Comment outdated as of: #1440

This is a behaviour change compared to #1440, but the intended behaviour from #918. What's a bit worrying is that this depends on the internal composition of our pipeline, but is externally observable?

apollo-router/src/plugins/telemetry/config_new/spans.rs

The deleted line at line:443 caused the following test to fail:

`plugins::telemetry::config_new::spans::test::test_router_request_custom_attribute_on_graphql_error`

I've restored that line and the test is now passing.

garypen self-assigned this Dec 18, 2024

Obey the linter...

6d80932

Gary Pennington and others added 17 commits December 18, 2024 17:15

Merge branch 'next' into garypen/next-backpressure

7491e1e

Merge branch 'next' into garypen/next-backpressure

718a870

Merge branch 'next' into garypen/next-backpressure

4c93a96

Fix the rhai integration test

716f7e5

That test is just old on dev/next, so fixing it makes sense Also: add a note to coprocessor test changes to remind me that I need to understand what is happening there before this branch can merge.

fix lint complaint

8b7653b

re-order use statements to keep cargo fmt happy

Add the new rhai testing config file

746d2c0

temporarily comment out one test

330f969

To see how far off passing we are in CI

still experimenting to see how far away this approach is

1b724d5

Move limits to traffic shaping

f932806

Rename http_server to router

3b8ed61

Fix formatting errors reported by lint

81faed0

Rename some stuff to minimise change from 1.x

3155440

Also:
- update snapshot so it knows about concurrency
- replace a bad supergraph test with a broken router test
- re-order traffic shaping so that timeouts > concurrency > rate-limit

Try to restore the existing behaviour for reporting errors

9bd0386

The code added yesterday was creating errors as data and using numeric codes (rather than the magic strings in 1.x). Re-instated the 1.x behaviour for reporting errors and also fixed the new timeout test.

Fix lint complaints

7432306

Remove 1/2 implemented little loadshedder

d05619f

Not required for GA. Implement later as a separate project.

bnjjj reviewed Jan 13, 2025

apollo-router/src/services/router/service.rs (two review threads, outdated and resolved)

Gary Pennington added 7 commits January 13, 2025 17:59

Merge branch 'next' into garypen/next-backpressure

69e36f1

POC: Make supergraph creator clone a BoxCloneService

b85ca4f

This breaks all the http_post_mutation tests because of changes in expectation.

Fix AsyncCheckpoint and update tests for correct behaviour

d5718f2

The AsyncCheckpoint layer was using `oneshot` and wasn't calling the prepared service. I've fixed that. This affects some of the tests, so I've fixed them as well.

Fix the xtask lint complaints

5c72744

POC: Make supergraph creator clone a BoxCloneService (#6540)

a8a8950

Modify subgraph rate-limiting test to pass for now

21dee20

Make it pass until subgraph rate limiting is changed. We'll need to update the test again at that point.

Merge branch 'next' into garypen/next-backpressure

00c1689

garypen requested a review from a team as a code owner January 21, 2025 12:55

garypen changed the title ~~First draft of a backpressure enabled pipeline~~ Provide a backpressure enabled pipeline Jan 22, 2025

Gary Pennington added 5 commits January 23, 2025 08:47

Found another inner service not following tower advice

0bddcde

A bit fiddly to find and fix, so I only just spotted it.

Make subgraph_name mandatory on Request and Response

2b3bed2

I've left the `fake` "builders" taking an Option and using String::default() if None is supplied. This seems like a nice compromise.
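A minimal sketch of that compromise, assuming a hypothetical `Request` type and constructor names (the router's real builders are generated and are not shown in this PR text): the regular constructor requires the name, while the fake one accepts an `Option` and falls back to `String::default()`.

```rust
/// Hypothetical request type; only the field relevant here is modelled.
#[derive(Debug, PartialEq)]
struct Request {
    subgraph_name: String,
}

impl Request {
    /// Regular constructor: the subgraph name is mandatory.
    fn new(subgraph_name: String) -> Self {
        Request { subgraph_name }
    }

    /// Test-only constructor: a missing name becomes `String::default()`,
    /// i.e. the empty string.
    fn fake_new(subgraph_name: Option<String>) -> Self {
        Request {
            subgraph_name: subgraph_name.unwrap_or_default(),
        }
    }
}
```

Tests that don't care about the subgraph can call `Request::fake_new(None)`, while production code paths must always supply a name.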

Merge branch 'dev' into garypen/next-backpressure

f6525a8

xtask lint

f08f21d

Remove the comment because the name is no longer Option

339c502

BrynCooke reviewed Jan 24, 2025

BrynCooke requested changes Jan 24, 2025

Gary Pennington and others added 12 commits January 24, 2025 17:20

Code review comments.

f3944d1

Spotted this dbg! in code review and should remove it

3b88d45

Fix mistakes made during code review changes.

017b0f1

You can't Arc Batching in RouterService. It contains request-specific state about the progress of a Batch and is broken if shared across multiple services. Also, I didn't correctly poll the entity cache inner service. This is now fixed.

Replace our use of Mutex with a Buffer

9b75423

It's simpler to understand and fits in well with the Tower method. It may improve performance as well. Also: add a big comment to DEFAULT_BUFFER_SIZE to try to explain what it represents.
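The general idea behind swapping a mutex for a buffer can be sketched as follows. This is a std-only model under stated assumptions, not the router's code: a worker thread stands in for the task that Tower's `Buffer` spawns, `Counter` is a hypothetical non-cloneable service, and `BUFFER_SIZE` is an illustrative value (the real `DEFAULT_BUFFER_SIZE` is not shown here).

```rust
use std::sync::mpsc::{self, SyncSender};
use std::thread;

/// Illustrative bound on queued requests (hypothetical value).
const BUFFER_SIZE: usize = 4;

/// Hypothetical service with mutable state that is not cloneable.
struct Counter {
    handled: u64,
}

impl Counter {
    fn call(&mut self, req: String) -> String {
        self.handled += 1;
        format!("#{} {}", self.handled, req)
    }
}

/// A request plus the channel on which the worker sends the response back.
type Message = (String, SyncSender<String>);

/// Cheaply cloneable handle: every clone feeds the same worker, so the
/// service state is shared without any handle locking it.
#[derive(Clone)]
struct BufferHandle {
    tx: SyncSender<Message>,
}

impl BufferHandle {
    fn new(mut svc: Counter) -> Self {
        // The bounded channel provides backpressure: `send` blocks once
        // BUFFER_SIZE requests are queued.
        let (tx, rx) = mpsc::sync_channel::<Message>(BUFFER_SIZE);
        // The worker owns the service exclusively and exits when every
        // handle has been dropped.
        thread::spawn(move || {
            for (req, reply) in rx {
                let _ = reply.send(svc.call(req));
            }
        });
        BufferHandle { tx }
    }

    /// Returns `None` if the worker is no longer running.
    fn call(&self, req: &str) -> Option<String> {
        let (reply_tx, reply_rx) = mpsc::sync_channel(1);
        self.tx.send((req.to_string(), reply_tx)).ok()?;
        reply_rx.recv().ok()
    }
}
```

Cloned handles share a single service instance, so state such as the request counter is preserved across clones without wrapping the service in a mutex.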

Add to the migration guide and the router documentation.

ddad68d

Put some basic details into the migration guide, to be enhanced as required before release. Also: update the traffic shaping docs for concurrency.

Merge branch 'dev' into garypen/next-backpressure

c629fca

Remember the name of the concurrency limit

b2c236f

Fix the body limit layer.

f5d3bdb

This now just uses http extensions to store the control. Test added for service reuse, and limits dynamic update fixed.

Add a link to the appropriate PR for the migration changes

95e22d8

Make sure the migration docs reference the backpressure PR.

Fixup limits plugin tests

49fd09b

Fixup telemetry tests

8ea3123

Move the setting of whether there was a GraphQL error into the router and supergraph builders. This will enable users to use on_graphql_error for all cases where we return an error.

BrynCooke approved these changes Jan 27, 2025

Gary Pennington added 2 commits January 27, 2025 15:08

Merge branch 'dev' into garypen/next-backpressure

bde4fb4

garypen merged commit 8f1459b into dev Jan 27, 2025
15 checks passed

garypen deleted the garypen/next-backpressure branch January 27, 2025 15:45

garypen mentioned this pull request Feb 11, 2025

Enough functionality to implement adaptive load shedding #6148

Closed

6 tasks

carodewig mentioned this pull request Dec 16, 2025

fix: raise 429 rather than 503 when enforcing rate-limit #8765

Merged

10 tasks

theJC mentioned this pull request Feb 23, 2026

feat(traffic_shaping): add diagnostic counters for subgraph timeout and load shed errors #8905

Open

10 tasks


6 participants
