Skip to content

Provide a backpressure enabled pipeline#6486

Merged
garypen merged 69 commits intodevfrom
garypen/next-backpressure
Jan 27, 2025
Merged

Provide a backpressure enabled pipeline#6486
garypen merged 69 commits intodevfrom
garypen/next-backpressure

Conversation

@garypen
Copy link
Contributor

@garypen garypen commented Dec 18, 2024

Most of the functionality is now in place for effective traffic-shaping at the router and subgraph levels.

Please read the review guidance before starting your review.

dev-docs/BACKPRESSURE_REVIEW_NOTES.md

There's a long way to go, but some interesting aspects of the required
changes are implemented here.

I noted that a significant number of errors and glitches and tests are
resolved by making these changes. Mainly places where we were calling a
service without readying it first or places where we were cloning an
inner service without doing the memory "replace" dance.

Because we are using a "trick" to make the router service cloneable
right now, there are a few tests which don't work properly. I think for
the "full" work, we'll need to make the router service properly
cloneable (without requiring a mutex).

This will require some fairly substantial re-working of a wide variety
of services and layers. On the plus side, once we've done that work
we'll be able to retire a bunch of code that we've written that we will
no longer require.

I'll pick this up in the New Year...
@garypen garypen self-assigned this Dec 18, 2024
@svc-apollo-docs
Copy link
Collaborator

svc-apollo-docs commented Dec 18, 2024

✅ Docs preview ready

The preview is ready to be viewed. View the preview

File Changes

0 new, 1 changed, 0 removed
* graphos/reference/migration/from-router-v1.mdx

Build ID: c58fa85d7d545d7bab96c675

URL: https://www.apollographql.com/docs/deploy-preview/c58fa85d7d545d7bab96c675

@router-perf
Copy link

router-perf bot commented Dec 18, 2024

CI performance tests

  • connectors-const - Connectors stress test that runs with a constant number of users
  • const - Basic stress test that runs with a constant number of users
  • demand-control-instrumented - A copy of the step test, but with demand control monitoring and metrics enabled
  • demand-control-uninstrumented - A copy of the step test, but with demand control monitoring enabled
  • enhanced-signature - Enhanced signature enabled
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
  • events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
  • extended-reference-mode - Extended reference mode enabled
  • large-request - Stress test with a 1 MB request payload
  • no-tracing - Basic stress test, no tracing
  • reload - Reload test over a long period of time at a constant rate of users
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • step-local-metrics - Field stats that are generated from the router rather than FTV1
  • step-with-prometheus - A copy of the step test with the Prometheus metrics exporter enabled
  • step - Basic stress test that steps up the number of users over time
  • xlarge-request - Stress test with 10 MB request payload
  • xxlarge-request - Stress test with 100 MB request payload

Gary Pennington and others added 17 commits December 18, 2024 17:15
That test is just old on dev/next, so fixing it makes sense

Also: add a note to coprocessor test changes to remind me that I need to
understand what is happening there before this branch can merge.
re-order use statements to keep cargo fmt happy
To see how far off passing we are in CI
IMPORTANT:

This change modifies the supergraph method invocation test to be a
router_service service invocation test.

Amongst other important details (such as we are now really testing the
service through the full pipeline), it's important to note that we can't
use `oneshot()` and just re-create the service every time we want to
call it. If we do, then the rate limiting details are lost. So, we must
re-use our service to make sure that state isn't lost.
Also:
 - update snapshot so it knows about concurrency
 - replace a bad supergraph test with a broken router test
 - re-order traffic shaping so that timeouts > concurrency > rate-limit
The code added yesterday was creating errors as data and using numeric
codes (rather than the magic strings in 1.x). Re-instated the 1.x
behaviour for reporting errors and also fixed the new timeout test.
Since we now fail traffic shaping tests at the router_service, they
are not counted as graphql_errors (which are only processed, as they
should be at the supergraph_service).

IMO, these should never have been counted as graphql errors anyway,
since they clearly aren't graphql errors but are traffic shaping (rate
limit, timeout, etc...) errors. We'll still report them to the user as a
504 or a 503 or whatever, but they won't count towards the graphql_error
metric.

I've also updated a snapshot to reflect the error message we now
provide.
Not required for GA. Implement later as a separate project.
Gary Pennington added 7 commits January 13, 2025 17:59
This breaks all the http_post_mutation tests because of changes in
expectation.
The AsyncCheckpoint layer was using `oneshot` and wasn't calling the
prepared service. I've fixed that.

This affects some of the tests, so I've fixed them as well.
Make it pass until subgraph rate limiting is changed. We'll need to
update the test agains at that point.
@garypen garypen requested a review from a team as a code owner January 21, 2025 12:55
The various tests in the limits layer should be readying their service
before they call it.

The deduplication function (dedup) accepts a readied service and then
used ready_oneshot(). Since it can only accept a readied service, modify
this to be a simple call() invocation.
@garypen garypen changed the title First draft of a backpressure enabled pipeline Provide a backpressure enabled pipeline Jan 22, 2025
Gary Pennington added 5 commits January 23, 2025 08:47
A bit fiddly to find and fix, so I only just spotted it.
I've left the `fake` "builders" taking an Option and using
String::default() if None is supplied. This seems like a nice
compromise.
Gary Pennington and others added 12 commits January 24, 2025 17:20
You can't Arc Batching in RouterService. It contains request specific
state about the progress of a Batch and is broken if shared across
multiple services.

I didn't correctly poll the entity cache inner service. This is now
fixed.
It's simpler to understand and fits in well with the Tower method.

It may improve performance as well.

also: Add a big comment to DEFAULT_BUFFER_SIZE to try and explain what
it represents.
Put some basic details into the migration guide to be enhanced as
required before release.

also: Update the traffic shaping docs for concurrency.
This now just uses http extensions to store the control.
Test added for service reuse, and limits dynamic update fixed.
Make sure the migration docs references the backpressure PR.
…vice_closed` test

This appears to test an intended-enough behaviour.

Originally introduced: #918
Comment outdated as of: #1440

This is a behaviour change compared to #1440, but the intended behaviour
from #918.

What's a bit worrying is that this depends on the internal composition of
our pipeline, but is externally observable?
Move setting of if there was a graphql error into the router and supergraph builders.
This will enable users to use on_graphql_error for all cases where we return an error.
Gary Pennington added 2 commits January 27, 2025 15:08
apollo-router/src/plugins/telemetry/config_new/spans.rs

The deleted line at line:443 caused the following test to fail.

plugins::telemetry::config_new::spans::test::test_router_request_custom_attribute_on_graphql_error

I've restored that line and the test is now passing.
@garypen garypen merged commit 8f1459b into dev Jan 27, 2025
15 checks passed
@garypen garypen deleted the garypen/next-backpressure branch January 27, 2025 15:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants