Migrate apollo_router_session_count_total metric to an OTel gauge #6495

Merged
goto-bus-stop merged 1 commit into dev from renee/ROUTER-911-session-count-total
Jan 7, 2025
@goto-bus-stop
Member

@goto-bus-stop goto-bus-stop commented Dec 20, 2024

Somehow missed this in #6476

No changeset because it's in #6476 : )

@goto-bus-stop goto-bus-stop requested a review from a team December 20, 2024 14:54
@goto-bus-stop goto-bus-stop requested a review from a team as a code owner December 20, 2024 14:54
@svc-apollo-docs
Collaborator

svc-apollo-docs commented Dec 20, 2024

✅ Docs Preview Ready

No new or changed pages found.

@github-actions
Contributor

@goto-bus-stop, please consider creating a changeset entry in /.changesets/. These instructions describe the process and tooling.

@router-perf
router-perf bot commented Dec 20, 2024

CI performance tests

  • connectors-const - Connectors stress test that runs with a constant number of users
  • const - Basic stress test that runs with a constant number of users
  • demand-control-instrumented - A copy of the step test, but with demand control monitoring and metrics enabled
  • demand-control-uninstrumented - A copy of the step test, but with demand control monitoring enabled
  • enhanced-signature - Enhanced signature enabled
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
  • events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
  • extended-reference-mode - Extended reference mode enabled
  • large-request - Stress test with a 1 MB request payload
  • no-tracing - Basic stress test, no tracing
  • reload - Reload test over a long period of time at a constant rate of users
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • step-local-metrics - Field stats that are generated from the router rather than FTV1
  • step-with-prometheus - A copy of the step test with the Prometheus metrics exporter enabled
  • step - Basic stress test that steps up the number of users over time
  • xlarge-request - Stress test with 10 MB request payload
  • xxlarge-request - Stress test with 100 MB request payload

@goto-bus-stop goto-bus-stop merged commit 797a3be into dev Jan 7, 2025
@goto-bus-stop goto-bus-stop deleted the renee/ROUTER-911-session-count-total branch January 7, 2025 14:00
@goto-bus-stop
Member Author

Oh, there's a mistake in this PR. The guard for the total session count doesn't live long enough, so the value is decremented too early. It's being fixed in #6527.

listener = &address
);
}
let _guard = main_graphql_port.then(TotalSessionCountGuard::start);
Member Author

Note that this line contains a mistake. The guard must be moved into the task::spawn call below, but it isn't, so this PR breaks the total session count metric: the guard is dropped, and the count decremented, before the next incoming request is handled.

#6527 will fix this problem.
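
To make the lifetime problem concrete, here is a minimal sketch using only the standard library. It is not the router's code: `SessionGuard` and `sessions_seen_while_task_runs` are hypothetical names, and `std::thread::spawn` stands in for `task::spawn`. The guard increments a counter when started and decrements it when dropped, so whether it is moved into the spawned task decides what the counter reads while the session is still being served.

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;

/// Hypothetical RAII guard: increments on start, decrements on drop.
struct SessionGuard {
    counter: Arc<AtomicI64>,
}

impl SessionGuard {
    fn start(counter: Arc<AtomicI64>) -> Self {
        counter.fetch_add(1, Ordering::SeqCst);
        SessionGuard { counter }
    }
}

impl Drop for SessionGuard {
    fn drop(&mut self) {
        self.counter.fetch_sub(1, Ordering::SeqCst);
    }
}

/// Returns the counter value observed while the session task is still running.
fn sessions_seen_while_task_runs(move_guard_into_task: bool) -> i64 {
    let counter = Arc::new(AtomicI64::new(0));
    let (started_tx, started_rx) = mpsc::channel::<()>();
    let (release_tx, release_rx) = mpsc::channel::<()>();

    let handle = {
        let guard = SessionGuard::start(Arc::clone(&counter));
        if move_guard_into_task {
            thread::spawn(move || {
                let _guard = guard; // the guard now lives as long as the task
                started_tx.send(()).unwrap();
                release_rx.recv().unwrap();
            })
        } else {
            thread::spawn(move || {
                started_tx.send(()).unwrap();
                release_rx.recv().unwrap();
            })
        }
        // If the guard was not moved into the task, it is dropped here,
        // while the task is still running.
    };

    started_rx.recv().unwrap(); // the task is running and blocked
    let observed = counter.load(Ordering::SeqCst);
    release_tx.send(()).unwrap(); // let the task finish
    handle.join().unwrap();
    observed
}

fn main() {
    // prints "guard left in accept loop: 0"
    println!("guard left in accept loop: {}", sessions_seen_while_task_runs(false));
    // prints "guard moved into task: 1"
    println!("guard moved into task: {}", sessions_seen_while_task_runs(true));
}
```

In the first case the count has already dropped back to 0 while the session is live, which is the symptom described above.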

Member Author

That PR will be redone in a different way. #6539 fixes just this problem.

goto-bus-stop added a commit that referenced this pull request Jan 13, 2025
In #6495 I migrated this metric to an otel ObservableGauge. However the
lifetime of the count guard was too short, so the value of the metric
would almost always just be 0; and the lifetime of the instrument was
too short, so even the wrong value of 0 would not actually be reported.

Now the lifetime of the instrument is captured by all requests, so the
instrument is kept even post-reload if a request is still being
completed from the old schema.

Also, a test verifies that the count increments and decrements as
expected.
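
The idea of requests capturing the instrument's lifetime can be sketched with plain reference counting. This is an illustration under assumptions, not the actual fix: `Instrument` is a hypothetical stand-in for a registered gauge that stops reporting when dropped, and `RequestGuard` is a hypothetical per-request guard that holds an `Arc` to it.

```rust
use std::sync::atomic::{AtomicBool, AtomicI64, Ordering::SeqCst};
use std::sync::Arc;

/// Hypothetical stand-in for a registered gauge: reporting stops when it is dropped.
struct Instrument {
    count: AtomicI64,
    unregistered: Arc<AtomicBool>,
}

impl Drop for Instrument {
    fn drop(&mut self) {
        self.unregistered.store(true, SeqCst);
    }
}

/// Hypothetical per-request guard: counts the request and keeps the instrument alive.
struct RequestGuard {
    instrument: Arc<Instrument>,
}

impl RequestGuard {
    fn start(instrument: &Arc<Instrument>) -> Self {
        instrument.count.fetch_add(1, SeqCst);
        RequestGuard { instrument: Arc::clone(instrument) }
    }
}

impl Drop for RequestGuard {
    fn drop(&mut self) {
        self.instrument.count.fetch_sub(1, SeqCst);
    }
}

/// Simulates a reload while one request is in flight.
/// Returns (unregistered during the request, unregistered after it finished).
fn reload_with_request_in_flight() -> (bool, bool) {
    let unregistered = Arc::new(AtomicBool::new(false));
    let pipeline_handle = Arc::new(Instrument {
        count: AtomicI64::new(0),
        unregistered: Arc::clone(&unregistered),
    });
    let request = RequestGuard::start(&pipeline_handle);
    drop(pipeline_handle); // reload: the old pipeline releases its handle
    let during = unregistered.load(SeqCst);
    drop(request); // last holder gone, so the instrument is dropped
    let after = unregistered.load(SeqCst);
    (during, after)
}

/// Returns the count before, during, and after a single request.
fn count_round_trip() -> (i64, i64, i64) {
    let instrument = Arc::new(Instrument {
        count: AtomicI64::new(0),
        unregistered: Arc::new(AtomicBool::new(false)),
    });
    let before = instrument.count.load(SeqCst);
    let request = RequestGuard::start(&instrument);
    let during = instrument.count.load(SeqCst);
    drop(request);
    let after = instrument.count.load(SeqCst);
    (before, during, after)
}

fn main() {
    println!("{:?}", reload_with_request_in_flight()); // prints "(false, true)"
    println!("{:?}", count_round_trip()); // prints "(0, 1, 0)"
}
```

Because every in-flight request holds its own reference, the instrument outlives the reload and is released only when the last request from the old schema completes.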

3 participants