Skip to content

fix(router, only 1.x): fix the apollo_router_session_count_total metric#6539

Merged
goto-bus-stop merged 2 commits intodevfrom
renee/fix-6495
Jan 16, 2025
Merged

fix(router, only 1.x): fix the apollo_router_session_count_total metric#6539
goto-bus-stop merged 2 commits intodevfrom
renee/fix-6495

Conversation

@goto-bus-stop
Copy link
Member

@goto-bus-stop goto-bus-stop commented Jan 13, 2025

In #6495 I migrated this metric to an otel ObservableGauge. However the lifetime of the count guard was too short, so the value of the metric would almost always just be 0; and the lifetime of the instrument was too short, so even the wrong value of 0 would not actually be reported. This change was not part of a release so there is no user impact as long as we land this :)

Now the lifetime of the instrument is captured by all requests, so the instrument is kept even post-reload if a request is still being completed from the old schema.

Also, a test verifies that the count increments and decrements as expected.

This does not need to be ported to Router 2.0.


Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • Changes are compatible1
  • Documentation2 completed
  • Performance impact assessed and acceptable
  • Tests added and passing3
    • Unit Tests
    • Integration Tests
    • Manual Tests

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

In #6495 I migrated this metric to an otel ObservableGauge. However the
lifetime of the count guard was too short, so the value of the metric
would almost always just be 0; and the lifetime of the instrument was
too short, so even the wrong value of 0 would not actually be reported.

Now the lifetime of the instrument is captured by all requests, so the
instrument is kept even post-reload if a request is still being
completed from the old schema.

Also, a test verifies that the count increments and decrements as
expected.
@goto-bus-stop goto-bus-stop requested a review from a team January 13, 2025 15:19
@goto-bus-stop goto-bus-stop requested a review from a team as a code owner January 13, 2025 15:19
@svc-apollo-docs
Copy link
Collaborator

svc-apollo-docs commented Jan 13, 2025

⚠️ Docs preview not attached to branch

The preview was not built because the PR's base branch dev is not in the list of sources.

An Apollo team member can comment one of the following commands to dictate which branch to attach the preview to:

  • !docs set-base-branch next
  • !docs set-base-branch main

Build ID: 9e0cc6433f744dc0a5139477

@github-actions

This comment has been minimized.

@goto-bus-stop goto-bus-stop changed the title Fix the apollo_router_session_count_total metric fix(router): fix the apollo_router_session_count_total metric Jan 13, 2025
Copy link
Contributor

@tninesling tninesling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Does this need a changeset, or are you reusing one that's already on dev?

@goto-bus-stop
Copy link
Member Author

goto-bus-stop commented Jan 13, 2025

Since it's just a fix to a bug that has never reached a release, I think it doesn't justify a changeset!

I might add the PR number to an existing changeset.

@goto-bus-stop goto-bus-stop requested a review from a team as a code owner January 16, 2025 08:40
@goto-bus-stop goto-bus-stop enabled auto-merge (squash) January 16, 2025 08:40
@goto-bus-stop goto-bus-stop enabled auto-merge (squash) January 16, 2025 08:40
@goto-bus-stop goto-bus-stop changed the title fix(router): fix the apollo_router_session_count_total metric fix(router, only 1.x): fix the apollo_router_session_count_total metric Jan 16, 2025
@goto-bus-stop goto-bus-stop merged commit f52bab8 into dev Jan 16, 2025
@goto-bus-stop goto-bus-stop deleted the renee/fix-6495 branch January 16, 2025 08:58
SimonSapin added a commit that referenced this pull request Jan 17, 2025
…otal` metric (#6539)"

This reverts commit f52bab8
for the purpose of merging into Router 2.0, as it is only intended for 1.x.
@router-perf
Copy link

router-perf bot commented Jan 17, 2025

CI performance tests

  • connectors-const - Connectors stress test that runs with a constant number of users
  • const - Basic stress test that runs with a constant number of users
  • demand-control-instrumented - A copy of the step test, but with demand control monitoring and metrics enabled
  • demand-control-uninstrumented - A copy of the step test, but with demand control monitoring enabled
  • enhanced-signature - Enhanced signature enabled
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
  • events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
  • extended-reference-mode - Extended reference mode enabled
  • large-request - Stress test with a 1 MB request payload
  • no-tracing - Basic stress test, no tracing
  • reload - Reload test over a long period of time at a constant rate of users
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • step-local-metrics - Field stats that are generated from the router rather than FTV1
  • step-with-prometheus - A copy of the step test with the Prometheus metrics exporter enabled
  • step - Basic stress test that steps up the number of users over time
  • xlarge-request - Stress test with 10 MB request payload
  • xxlarge-request - Stress test with 100 MB request payload

@abernix abernix mentioned this pull request Feb 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants