Skip to content

Migrate value.* metrics to gauges#6476

Merged
goto-bus-stop merged 4 commits intodevfrom
renee/ROUTER-297-values
Dec 20, 2024
Merged

Migrate value.* metrics to gauges#6476
goto-bus-stop merged 4 commits intodevfrom
renee/ROUTER-297-values

Conversation

@goto-bus-stop
Copy link
Member

@goto-bus-stop goto-bus-stop commented Dec 17, 2024

Migrates tracing::info!(value.) metrics to otel instruments.

Notes in comments

@goto-bus-stop goto-bus-stop requested a review from a team December 17, 2024 14:08
@goto-bus-stop goto-bus-stop requested a review from a team as a code owner December 17, 2024 14:08
@svc-apollo-docs

This comment was marked as off-topic.

Comment on lines +570 to +647
// Tie the lifetime of the session count instrument to the lifetime of the router
// by moving it into a no-op layer.
let instrument = session_count_instrument();
router = router.layer(layer_fn(move |service| {
let _ = &instrument;
service
}));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The session_count_active instrument is "stored" in an axum::Router instance by moving it into a no-op layer. Another option would be to just have one global instrument for the entire lifetime of the program, but I'm not sure if that would work everywhere we need it to (eg. does it do the right thing across reloads, tests, etc?)

@router-perf
Copy link

router-perf bot commented Dec 17, 2024

CI performance tests

  • connectors-const - Connectors stress test that runs with a constant number of users
  • const - Basic stress test that runs with a constant number of users
  • demand-control-instrumented - A copy of the step test, but with demand control monitoring and metrics enabled
  • demand-control-uninstrumented - A copy of the step test, but with demand control monitoring enabled
  • enhanced-signature - Enhanced signature enabled
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
  • events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
  • extended-reference-mode - Extended reference mode enabled
  • large-request - Stress test with a 1 MB request payload
  • no-tracing - Basic stress test, no tracing
  • reload - Reload test over a long period of time at a constant rate of users
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • step-local-metrics - Field stats that are generated from the router rather than FTV1
  • step-with-prometheus - A copy of the step test with the Prometheus metrics exporter enabled
  • step - Basic stress test that steps up the number of users over time
  • xlarge-request - Stress test with 10 MB request payload
  • xxlarge-request - Stress test with 100 MB request payload

Comment on lines +276 to +264
#[derive(Debug)]
struct SpanLruSizeInstrument {
value: Arc<AtomicU64>,
_gauge: ObservableGauge<u64>,
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in the doc comment, this is a hack that kind of mimicks what tracing::info!(value.) was doing (storing the value whenever it is updated, and reading the stored value when requested).

I'm not sure about this metric, should we keep it at all in 2.0?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep it. It's really helpful to debug and could potentially find the root cause of a leak or so.

@goto-bus-stop goto-bus-stop changed the title Port value.* metrics to gauges Migrate value.* metrics to gauges Dec 19, 2024
@goto-bus-stop goto-bus-stop force-pushed the renee/ROUTER-297-values branch from 27a4a08 to 1b4616b Compare December 19, 2024 13:58
@goto-bus-stop goto-bus-stop requested a review from a team as a code owner December 19, 2024 13:58
@goto-bus-stop goto-bus-stop changed the base branch from next to dev December 19, 2024 13:58
Comment on lines +276 to +264
#[derive(Debug)]
struct SpanLruSizeInstrument {
value: Arc<AtomicU64>,
_gauge: ObservableGauge<u64>,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep it. It's really helpful to debug and could potentially find the root cause of a leak or so.

@github-actions

This comment has been minimized.

@goto-bus-stop goto-bus-stop enabled auto-merge (squash) December 20, 2024 11:16
@goto-bus-stop goto-bus-stop merged commit b362206 into dev Dec 20, 2024
@goto-bus-stop goto-bus-stop deleted the renee/ROUTER-297-values branch December 20, 2024 11:31
goto-bus-stop added a commit that referenced this pull request Dec 20, 2024
Somehow missed this in #6476

<!-- [ROUTER-911] -->
@abernix abernix mentioned this pull request Feb 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants