feat: memory tracking metrics #8717

Merged: rohan-b99 merged 25 commits into dev from memory-tracking-metrics on Jan 14, 2026

Conversation

@rohan-b99 (Contributor) commented Dec 4, 2025


Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • PR description explains the motivation for the change and relevant context for reviewing
  • PR description links appropriate GitHub/Jira tickets (creating when necessary)
  • Changeset is included for user-facing changes
  • Changes are compatible [1]
  • Documentation [2] completed
  • Performance impact assessed and acceptable
  • Metrics and logs are added [3] and documented
  • Tests added and passing [4]
    • Unit tests
    • Integration tests
    • Manual tests, as necessary

The majority of this work is from #8525. This PR adds some extra tests and a few more places where memory tracking is applied.

Performance testing with vegeta shows the changes have negligible impact:

% cat dev/perf.59242.vegeta | vegeta report                                  
Requests      [total, rate, throughput]         2500, 500.20, 496.03
Duration      [total, attack, wait]             5.04s, 4.998s, 41.965ms
Latencies     [min, mean, 50, 90, 95, 99, max]  3.481ms, 12.243ms, 6.396ms, 19.94ms, 29.15ms, 135.015ms, 311.06ms
Bytes In      [total, mean]                     6248410, 2499.36
Bytes Out     [total, mean]                     14312750, 5725.10
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:2500

% cat memory-tracking-metrics/perf.64046.vegeta | vegeta report 
Requests      [total, rate, throughput]         2500, 500.14, 497.60
Duration      [total, attack, wait]             5.024s, 4.999s, 25.518ms
Latencies     [min, mean, 50, 90, 95, 99, max]  3.368ms, 11.976ms, 5.898ms, 20.586ms, 29.021ms, 122.014ms, 314.001ms
Bytes In      [total, mean]                     10612414, 4244.97
Bytes Out     [total, mean]                     14312750, 5725.10
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:2500  
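
As a rough check on "negligible", the relative change per latency figure can be computed from the two reports above. This is only a sketch: the numbers are copied from the reports, the function and constant names are illustrative, and each report is a single short run, so the results are indicative at best.

```rust
/// Relative change from `baseline` to `candidate`, in percent.
/// Negative values mean the candidate is lower (faster, for latencies).
fn percent_change(baseline: f64, candidate: f64) -> f64 {
    (candidate - baseline) / baseline * 100.0
}

/// Latency figures in milliseconds, copied from the two vegeta reports.
/// Tuple layout: (label, dev, memory-tracking-metrics).
const LATENCIES_MS: [(&str, f64, f64); 4] = [
    ("mean", 12.243, 11.976),
    ("p50", 6.396, 5.898),
    ("p90", 19.94, 20.586),
    ("p99", 135.015, 122.014),
];

/// Formats one line per figure with a signed percentage, two decimals.
fn summarize() -> Vec<String> {
    LATENCIES_MS
        .iter()
        .map(|(label, dev, branch)| format!("{label}: {:+.2}%", percent_change(*dev, *branch)))
        .collect()
}
```

Printing each line of `summarize()` gives one signed percentage per figure; the changes are mixed in sign (p90 is slightly higher on the branch, the others slightly lower).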

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. A lot of (if not most) features benefit from built-in observability and debug-level logs. Please read this guidance on metrics best-practices.

  4. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

bryn and others added 9 commits December 4, 2025 11:16
This adds allocation tracking utilities so that we can get a handle on how much a single request is allocating.
This adds allocation tracking utilities so that we can get a handle on how much a single request is allocating.
Add allocation metrics for the query planner and requests
@rohan-b99 rohan-b99 requested a review from a team December 4, 2025 17:01
@apollo-librarian (bot) commented Dec 4, 2025

✅ Docs preview ready

The preview is ready to be viewed.

File Changes

0 new, 1 changed, 0 removed
* graphos/routing/(latest)/observability/router-telemetry-otel/enabling-telemetry/standard-instruments.mdx

Build ID: eeafb8119b1eea6f90cf00bd
Build Logs: View logs

URL: https://www.apollographql.com/docs/deploy-preview/eeafb8119b1eea6f90cf00bd

@rohan-b99 rohan-b99 requested a review from a team as a code owner December 4, 2025 17:36
@rohan-b99 rohan-b99 changed the title Memory tracking metrics feat: memory tracking metrics Dec 5, 2025
@aaronArinder (Contributor) left a comment

super excited for this

Comment on lines +188 to +190
// Verify metrics were recorded
// Note: We can't easily assert on histogram values, but the test verifies
// the layer compiles and runs without errors

Contributor:

is it hard to assert because the values might be wildly different or for some other reason? it'd be great to have proof that the metrics emitted are what folks think they'll be (ie, scoped to the request, to query planning, etc)

Contributor (Author):

Thanks for catching this - I've added some assert_histogram_sum! calls here instead

Comment on lines +173 to +196
// Thread-local to track the current task's allocation stats.
//
// ## Why Cell<Option<NonNull<T>>> instead of Cell<Option<Arc<T>>> or Mutex<Option<Arc<T>>>?
//
// We use a NonNull pointer instead of Arc because:
//
// 1. **Cell requires Copy**: Cell::get() requires T: Copy, but Arc<T> is not Copy
// because it has a Drop implementation for reference counting.
//
// 2. **TLS destructors conflict with global allocators**: If we stored Option<Arc<T>>
// in the thread-local, its Drop implementation would run when the thread exits.
// This Drop could call the allocator (to deallocate the Arc), causing a fatal
// reentrancy error: "the global allocator may not use TLS with destructors".
//
// 3. **Cell is faster than Mutex**: Cell has zero overhead (just a memory read/write),
// while Mutex requires atomic operations and potential thread parking. Since we
// access this on every allocation, performance is critical.
//
// ## Safety invariants:
//
// - The NonNull pointer is only valid while a MemoryTrackedFuture holding the corresponding
// Arc is on the call stack (either in poll() or with_memory_tracking()).
// - We manually manage Arc reference counts when propagating across tasks.
// - The pointer always points to valid AllocationStats when Some.

Contributor:

🙏 so, so nice
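
For readers without the diff open, here is a minimal sketch of the pattern that comment describes: a destructor-free thread-local holding a raw pointer to the current stats. `AllocationStats`, `CURRENT_TASK_STATS`, and `with_memory_tracking` are names taken from the comment, but their bodies here are assumptions; `ResetGuard` and `record_allocation` are hypothetical helpers. Unlike the PR, which manages `Arc` reference counts manually when propagating across tasks, this sketch only borrows the `Arc` for the duration of a closure.

```rust
use std::cell::Cell;
use std::ptr::NonNull;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the PR's `AllocationStats`.
#[derive(Default)]
pub struct AllocationStats {
    pub bytes_allocated: AtomicUsize,
}

thread_local! {
    // `const` init plus a `Copy` payload with no `Drop`: no TLS destructor is needed.
    static CURRENT_TASK_STATS: Cell<Option<NonNull<AllocationStats>>> =
        const { Cell::new(None) };
}

/// Restores the previous pointer when dropped, so scopes can nest
/// and a panic inside the closure still clears the slot.
struct ResetGuard(Option<NonNull<AllocationStats>>);

impl Drop for ResetGuard {
    fn drop(&mut self) {
        CURRENT_TASK_STATS.with(|slot| slot.set(self.0));
    }
}

/// Runs `f` with `stats` installed as this thread's tracking context.
pub fn with_memory_tracking<R>(stats: &Arc<AllocationStats>, f: impl FnOnce() -> R) -> R {
    let ptr: NonNull<AllocationStats> = NonNull::from(&**stats);
    let previous = CURRENT_TASK_STATS.with(|slot| slot.replace(Some(ptr)));
    let _guard = ResetGuard(previous);
    f()
}

/// Adds `bytes` to the current context, if any. An allocator hook would call this.
pub fn record_allocation(bytes: usize) {
    if let Some(ptr) = CURRENT_TASK_STATS.with(|slot| slot.get()) {
        // SAFETY: the pointer is only set inside `with_memory_tracking`, where the
        // borrowed `Arc` outlives the closure, and the guard clears it on exit.
        unsafe { ptr.as_ref() }
            .bytes_allocated
            .fetch_add(bytes, Ordering::Relaxed);
    }
}
```

Calling `record_allocation` outside `with_memory_tracking` is a no-op, which mirrors the invariant that the pointer is only valid while a tracking scope is on the call stack.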

/// If a parent context exists, creates a child context that tracks to the parent.
/// If no parent exists, creates a new root context with the given name.
/// This is useful for tracking allocations in synchronous code or threads.
#[allow(dead_code)]

Contributor:

still dead? assuming so; also, same question but for the other allow(dead_code)s

Contributor (Author):

I've removed a couple of unused functions but had to add some more feature gates so cargo xtask lint would pass
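
The doc comment quoted in this thread describes a child-or-root rule for tracking contexts. A minimal sketch of one reading of it follows, assuming that "tracks to the parent" means a child's allocations are also counted against its ancestors. All field and method names are hypothetical, not the PR's API.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical allocation context; names are illustrative only.
pub struct AllocationStats {
    name: &'static str,
    parent: Option<Arc<AllocationStats>>,
    bytes_allocated: AtomicUsize,
}

impl AllocationStats {
    /// Creates a root context with the given name.
    pub fn new_root(name: &'static str) -> Arc<Self> {
        Arc::new(Self { name, parent: None, bytes_allocated: AtomicUsize::new(0) })
    }

    /// Creates a child that inherits the parent's name and reports to it.
    pub fn new_child(parent: &Arc<Self>) -> Arc<Self> {
        Arc::new(Self {
            name: parent.name,
            parent: Some(Arc::clone(parent)),
            bytes_allocated: AtomicUsize::new(0),
        })
    }

    /// Child of `parent` if one exists, otherwise a new root named `name`.
    pub fn child_or_root(parent: Option<&Arc<Self>>, name: &'static str) -> Arc<Self> {
        match parent {
            Some(p) => Self::new_child(p),
            None => Self::new_root(name),
        }
    }

    /// Records `bytes` on this context and on every ancestor.
    pub fn track_allocation(&self, bytes: usize) {
        let mut node = Some(self);
        while let Some(stats) = node {
            stats.bytes_allocated.fetch_add(bytes, Ordering::Relaxed);
            node = stats.parent.as_deref();
        }
    }

    /// Total bytes recorded on this context, including its descendants.
    pub fn bytes_allocated(&self) -> usize {
        self.bytes_allocated.load(Ordering::Relaxed)
    }

    /// Name of the root context this one belongs to.
    pub fn name(&self) -> &'static str {
        self.name
    }
}
```

With this shape, a root context sees the sum of its own and its children's allocations, while each child sees only its own.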

// on top. The tracking uses thread-locals with raw pointers to avoid TLS destructor
// issues (see CURRENT_TASK_STATS documentation above).
#[cfg(all(feature = "global-allocator", not(feature = "dhat-heap"), unix))]
unsafe impl GlobalAlloc for CustomAllocator {

Contributor:

never a scarier line existed
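
For context on what such an impl involves, below is a minimal sketch of a counting allocator built on the standard `GlobalAlloc` trait. It is not the PR's `CustomAllocator`: `CountingAllocator`, `BYTES_ALLOCATED`, and `bytes_allocated_on_this_thread` are hypothetical names, the sketch delegates to `System` (which allocator the PR wraps is not shown in this thread), and it omits the feature gates above.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::cell::Cell;

thread_local! {
    // A plain `Cell<usize>` with `const` init has no destructor, which keeps it
    // usable from inside the global allocator.
    static BYTES_ALLOCATED: Cell<usize> = const { Cell::new(0) };
}

/// Hypothetical counting allocator that delegates to the system allocator.
pub struct CountingAllocator;

unsafe impl GlobalAlloc for CountingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // SAFETY: the caller upholds `GlobalAlloc::alloc`'s contract for `layout`.
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            // `try_with` never panics, so the allocator cannot unwind here.
            let _ = BYTES_ALLOCATED.try_with(|b| b.set(b.get().wrapping_add(layout.size())));
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        // SAFETY: `ptr` was returned by `System.alloc` with this same `layout`.
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAllocator = CountingAllocator;

/// Bytes allocated on the calling thread since it started.
pub fn bytes_allocated_on_this_thread() -> usize {
    BYTES_ALLOCATED.with(|b| b.get())
}
```

The default `realloc` and `alloc_zeroed` implementations call `alloc`, so they are counted too. The important constraint is that the counting path must not itself allocate or register a TLS destructor.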

@@ -0,0 +1,5 @@
### Implement memory tracking metrics for requests ([PR #8717](https://github.com/apollographql/router/pull/8717))

Adds the `apollo.router.request.memory` and `apollo.router.query_planner.memory` metrics which track allocations/deallocations during the request lifecycle.

Contributor:

Could this include some more context about how this is done? If I were reading this without context, I wouldn't be aware of the fact that this required a custom allocator.

@rohan-b99 rohan-b99 merged commit 2dd451c into dev Jan 14, 2026
15 checks passed
@rohan-b99 rohan-b99 deleted the memory-tracking-metrics branch January 14, 2026 13:45
@abernix abernix mentioned this pull request Jan 27, 2026