(refactor) UpDownCounter RAII guards#8379

Merged
BrynCooke merged 20 commits into dev from bryn/up-down-counters
Oct 14, 2025

@BrynCooke BrynCooke commented Oct 6, 2025

This PR changes the nature of UpDownCounters within the router and updates a few metrics.

UpDownCounters are mostly good for tracking resources. They transmit an aggregate value rather than a delta, so each updown instrument sends the latest value.

Previously we were manually incrementing and decrementing updown counters, however there is nuance to this:

  • An increment that is paired with a decrement must be done via an RAII guard, as with Rust async there is no guarantee that the code path containing the decrement will ever run.
  • A paired decrement must be applied to the SAME instrument that the increment was called on. If a hot reload happens, a new instrument may be created, and a decrement that lands on the new instrument immediately produces a negative value. Note that instruments are normally contained within an Arc to allow callsite invalidation, but from within a guard the inner instrument must be cloned.
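
To illustrate the first point, here is a minimal sketch (not router code) of how a manually paired increment and decrement can leak when a future is dropped at an await point. The names `tracked_work`, `YieldOnce`, `NoopWake` and `cancel_after_first_poll` are hypothetical, and a plain `AtomicI64` stands in for a real instrument.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing; enough to poll a future by hand.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Future that returns Pending on the first poll and Ready on the second.
struct YieldOnce(bool);
impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// Manually paired increment/decrement around an await point.
/// The AtomicI64 is a stand-in for an up-down counter (assumption).
async fn tracked_work(active: Arc<AtomicI64>) {
    active.fetch_add(1, Ordering::SeqCst);
    YieldOnce(false).await;
    // Never reached if the future is dropped while suspended above.
    active.fetch_sub(1, Ordering::SeqCst);
}

/// Polls the future once, then drops it, as a cancelled task would be.
/// Returns the counter value left behind.
fn cancel_after_first_poll() -> i64 {
    let active = Arc::new(AtomicI64::new(0));
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);

    let mut fut = Box::pin(tracked_work(Arc::clone(&active)));
    assert!(fut.as_mut().poll(&mut cx).is_pending());
    drop(fut); // cancellation: the manual decrement never runs

    active.load(Ordering::SeqCst)
}
```

In this sketch the counter is left at 1 after the future is dropped, which is the kind of drift a drop-based guard avoids.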

As we don't actually use UpDownCounters for anything else (and there seems to be little reason to), the macro has been modified to return an RAII guard that, when dropped, automatically decrements by the value that was incremented. The guard also stores the original instrument that was used to increment, so the decrement is guaranteed to land on the same instrument.
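
A rough sketch of the guard idea, under stated assumptions: `FakeUpDownCounter` is a hypothetical stand-in for a real OpenTelemetry instrument, and `UpDownGuard` and `simulate_hot_reload` are illustrative names rather than the types the router macro actually generates.

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for an up-down counter instrument.
#[derive(Default)]
struct FakeUpDownCounter {
    value: AtomicI64,
}

impl FakeUpDownCounter {
    fn add(&self, delta: i64) {
        self.value.fetch_add(delta, Ordering::SeqCst);
    }

    fn current(&self) -> i64 {
        self.value.load(Ordering::SeqCst)
    }
}

/// RAII guard: remembers which instrument it incremented and by how much.
struct UpDownGuard {
    instrument: Arc<FakeUpDownCounter>,
    amount: i64,
}

impl UpDownGuard {
    /// Increments the instrument and keeps its own handle to it.
    fn increment(instrument: &Arc<FakeUpDownCounter>, amount: i64) -> Self {
        instrument.add(amount);
        Self {
            instrument: Arc::clone(instrument),
            amount,
        }
    }
}

impl Drop for UpDownGuard {
    fn drop(&mut self) {
        // Always decrements the instrument captured at increment time.
        self.instrument.add(-self.amount);
    }
}

/// Simulates a hot reload happening while a guard is alive.
/// Returns (old instrument value, new instrument value) after the guard drops.
fn simulate_hot_reload() -> (i64, i64) {
    let mut active = Arc::new(FakeUpDownCounter::default());
    let old = Arc::clone(&active);

    let guard = UpDownGuard::increment(&active, 1);
    // Hot reload: call sites now see a fresh instrument.
    active = Arc::new(FakeUpDownCounter::default());
    drop(guard);

    (old.current(), active.current())
}
```

Because the guard holds its own handle, the old instrument returns to zero and the new one never goes negative in this simulation.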

Notes

  • Each commit can be reviewed in isolation.
  • There are various other gauges that can be removed and converted to UpDownCounters to simplify the codebase but these can be done separately.
  • I've had to make the lifecycle tests more lenient, as old Prometheus series that have been zeroed out are now present in the results. See below for reasoning.

Issues

It has been noticed that the Prometheus registry will NEVER drop a series. This means that over time cardinality will build up and consume memory. I've put in a hacky workaround in e1430e1 that causes Prometheus to be reinitialized on 1 in 10 reloads. A small number of metrics may be lost as a result, but that is better than accumulating series forever.

Prometheus is deprecated in the latest version of OTel Rust, and this isn't a problem for OTLP. In the next major version of the router we should consider removing Prometheus support.


Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • PR description explains the motivation for the change and relevant context for reviewing
  • PR description links appropriate GitHub/Jira tickets (creating when necessary)
  • Changeset is included for user-facing changes
  • Changes are compatible [1]
  • Documentation [2] completed
  • Performance impact assessed and acceptable
  • Metrics and logs are added [3] and documented
  • Tests added and passing [4]
    • Unit tests
    • Integration tests
    • Manual tests, as necessary

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. A lot of (if not most) features benefit from built-in observability and debug-level logs. Please read this guidance on metrics best-practices.

  4. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.


apollo-librarian bot commented Oct 6, 2025

✅ Docs preview ready

The preview is ready to be viewed. View the preview

File Changes

2 new, 4 changed, 0 removed
+ graphos/routing/(latest)/self-hosted/managed-hosting/railway.mdx
+ graphos/routing/(latest)/self-hosted/managed-hosting/render.mdx
* graphos/routing/(latest)/customization/rhai/index.mdx
* graphos/routing/(latest)/observability/router-telemetry-otel/enabling-telemetry/selectors.mdx
* graphos/routing/(latest)/observability/router-telemetry-otel/enabling-telemetry/spans.mdx
* graphos/routing/(latest)/_sidebar.yaml

Build ID: b00c2b4a87dac6ca3f7372a3
Build Logs: View logs

URL: https://www.apollographql.com/docs/deploy-preview/b00c2b4a87dac6ca3f7372a3

@BrynCooke BrynCooke changed the title Bryn/up down counters (refactor) UpDownCounter RAII guards Oct 6, 2025
@BrynCooke BrynCooke force-pushed the bryn/up-down-counters branch from e2d282d to 606b2e2 Compare October 7, 2025 09:36
@BrynCooke BrynCooke marked this pull request as ready for review October 7, 2025 09:36
@BrynCooke BrynCooke requested a review from a team October 7, 2025 09:36
@BrynCooke BrynCooke requested a review from a team as a code owner October 7, 2025 09:36
@BrynCooke BrynCooke force-pushed the bryn/up-down-counters branch from 606b2e2 to 0f3090f Compare October 7, 2025 09:48
@@ -832,26 +840,15 @@ where
operation_name: Option<String>,
) -> broadcast::Receiver<()> {
let (closing_signal_tx, closing_signal_rx) = broadcast::channel(1);

BrynCooke (Contributor, Author) commented:

@goto-bus-stop I've not changed the logic here but I am uncertain as to why creating a topic should overwrite the subscription. Should it not keep the subscription if it already exists?

@BrynCooke BrynCooke marked this pull request as draft October 7, 2025 11:34
@BrynCooke BrynCooke removed the request for review from goto-bus-stop October 7, 2025 11:34
BrynCooke (Contributor, Author) commented:

Moving back to draft as there is some sort of issue with the pipelines metric.

@BrynCooke BrynCooke force-pushed the bryn/up-down-counters branch from e9fd533 to c528de6 Compare October 7, 2025 12:40
bryn added 4 commits October 7, 2025 14:12
… original instrument

This refactor makes updown counters finally useful: in combination with the temporality fix to make them always aggregate, and the fix to only reload metrics if needed, they will now increment and decrement reliably.
Moved tests for otlp into directory
@BrynCooke BrynCooke force-pushed the bryn/up-down-counters branch 2 times, most recently from f1ddee4 to 168c1c7 Compare October 7, 2025 15:48
@BrynCooke BrynCooke force-pushed the bryn/up-down-counters branch from 168c1c7 to 11a3cd9 Compare October 7, 2025 15:50
@BrynCooke BrynCooke force-pushed the bryn/up-down-counters branch from 11a3cd9 to d61025d Compare October 7, 2025 16:32
@BrynCooke BrynCooke marked this pull request as ready for review October 7, 2025 16:52
BrynCooke (Contributor, Author) commented:

OK, tests passing, I think we're good to review.

@BrynCooke BrynCooke force-pushed the bryn/up-down-counters branch from 3132d40 to e1430e1 Compare October 8, 2025 08:53
@BrynCooke BrynCooke force-pushed the bryn/up-down-counters branch from 913afed to e1430e1 Compare October 8, 2025 13:49
goto-bus-stop (Member) left a comment:

Looks great overall, some questions/nits

let instrument = create_instrument_fn(meter);
let attrs : &[opentelemetry::KeyValue] = &$attrs;
instrument.$mutation($value, attrs);
$guard::new(std::sync::Arc::new(instrument.clone()), $value, attrs)

goto-bus-stop (Member) commented:

Can this Arc::new() ever happen with the NoopGuard? That would suck a bit (it wouldn't make or break performance, I imagine, but it's still an unnecessary heap allocation).

BrynCooke (Contributor, Author) commented:

c1b06cb I've added an explanation for this. The code is actually not generally run under #[cfg(not(test))]; it's there to prevent our unit tests from interfering with each other. We do need a test for callsite caching though, so the other arm is run for a specific test, which is why the entire block is not under cfg(test).

/// with labels that are unique (like launch_id) can accumulate as zeroed
/// out series that will never be incremented again.
///
/// The true solution for this is to drop Prometheus support as this has been dropped in upstream OTEL.

goto-bus-stop (Member) commented:

Users would probably not be happy about that, maybe we could use a third party prometheus exporter (eg https://lib.rs/crates/opentelemetry-prometheus-text-exporter) when the time comes, but 🤷🏻‍♀️ just because it isn't a first party piece of the SDK doesn't mean it can't be done ;)

BrynCooke (Contributor, Author) commented:

Life finds a way....

fn setup_public_metrics(&mut self) -> Result<(), BoxError> {
if self.is_metrics_config_changed::<metrics::prometheus::Config>()
|| self.is_metrics_config_changed::<otlp::Config>()
|| self.prometheus_random_change()

goto-bus-stop (Member) commented:

I guess I'd like it on the record that I don't love this solution but I don't have a better idea, so let's just do ittt

BrynCooke (Contributor, Author) commented:

I also do not like this. There's no other way though.....

BrynCooke (Contributor, Author) commented:

I changed this to at least be deterministic. It reloads every 10 times rather than randomly.
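
For illustration only, a deterministic "every Nth reload" check could look roughly like the sketch below. `ReloadCounter` and `should_reinitialize` are hypothetical names, not the router's actual implementation, and the interval of 10 is taken from the comment above.

```rust
/// Hypothetical counter that asks for a registry rebuild on every Nth reload.
struct ReloadCounter {
    reloads: u64,
    every: u64,
}

impl ReloadCounter {
    fn new(every: u64) -> Self {
        assert!(every > 0, "interval must be non-zero");
        Self { reloads: 0, every }
    }

    /// Records one reload and returns true when the registry should be rebuilt.
    fn should_reinitialize(&mut self) -> bool {
        self.reloads += 1;
        self.reloads % self.every == 0
    }
}
```

With `ReloadCounter::new(10)`, the 10th, 20th and 30th calls return true and all others return false, so the behaviour is repeatable in tests in a way a random check is not.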

@BrynCooke BrynCooke enabled auto-merge (squash) October 13, 2025 12:43
@BrynCooke BrynCooke disabled auto-merge October 13, 2025 13:25
@BrynCooke BrynCooke merged commit 4c04c55 into dev Oct 14, 2025
15 checks passed
@BrynCooke BrynCooke deleted the bryn/up-down-counters branch October 14, 2025 06:43
@abernix abernix mentioned this pull request Oct 27, 2025