Conversation
This adds allocation tracking utilities so that we can get a handle on how much a single request is allocating.
This adds allocation tracking utilities so that we can get a handle on how much a single request is allocating.
Add allocation metrics for the query planner and requests
✅ Docs preview readyThe preview is ready to be viewed. View the preview File Changes 0 new, 1 changed, 0 removedBuild ID: eeafb8119b1eea6f90cf00bd URL: https://www.apollographql.com/docs/deploy-preview/eeafb8119b1eea6f90cf00bd |
This comment has been minimized.
This comment has been minimized.
aaronArinder
left a comment
There was a problem hiding this comment.
super excited for this
| // Verify metrics were recorded | ||
| // Note: We can't easily assert on histogram values, but the test verifies | ||
| // the layer compiles and runs without errors |
There was a problem hiding this comment.
is it hard to assert because the values might be wildly different or for some other reason? it'd be great to have proof that the metrics emitted are what folks think they'll be (ie, scoped to the request, to query planning, etc)
There was a problem hiding this comment.
Thanks for catching this - I've added some assert_histogram_sum! calls here instead
| // Thread-local to track the current task's allocation stats. | ||
| // | ||
| // ## Why Cell<Option<NonNull<T>>> instead of Cell<Option<Arc<T>>> or Mutex<Option<Arc<T>>>? | ||
| // | ||
| // We use a NonNull pointer instead of Arc because: | ||
| // | ||
| // 1. **Cell requires Copy**: Cell::get() requires T: Copy, but Arc<T> is not Copy | ||
| // because it has a Drop implementation for reference counting. | ||
| // | ||
| // 2. **TLS destructors conflict with global allocators**: If we stored Option<Arc<T>> | ||
| // in the thread-local, its Drop implementation would run when the thread exits. | ||
| // This Drop could call the allocator (to deallocate the Arc), causing a fatal | ||
| // reentrancy error: "the global allocator may not use TLS with destructors". | ||
| // | ||
| // 3. **Cell is faster than Mutex**: Cell has zero overhead (just a memory read/write), | ||
| // while Mutex requires atomic operations and potential thread parking. Since we | ||
| // access this on every allocation, performance is critical. | ||
| // | ||
| // ## Safety invariants: | ||
| // | ||
| // - The NonNull pointer is only valid while a MemoryTrackedFuture holding the corresponding | ||
| // Arc is on the call stack (either in poll() or with_memory_tracking()). | ||
| // - We manually manage Arc reference counts when propagating across tasks. | ||
| // - The pointer always points to valid AllocationStats when Some. |
apollo-router/src/allocator.rs
Outdated
| /// If a parent context exists, creates a child context that tracks to the parent. | ||
| /// If no parent exists, creates a new root context with the given name. | ||
| /// This is useful for tracking allocations in synchronous code or threads. | ||
| #[allow(dead_code)] |
There was a problem hiding this comment.
still dead? assuming so; also, same question but for the other allow(dead_code)s
There was a problem hiding this comment.
I've removed a couple of unused functions but had to had some more feature gates so cargo xtask lint would pass
| // on top. The tracking uses thread-locals with raw pointers to avoid TLS destructor | ||
| // issues (see CURRENT_TASK_STATS documentation above). | ||
| #[cfg(all(feature = "global-allocator", not(feature = "dhat-heap"), unix))] | ||
| unsafe impl GlobalAlloc for CustomAllocator { |
There was a problem hiding this comment.
never a scarier line existed
| @@ -0,0 +1,5 @@ | |||
| ### Implement memory tracking metrics for requests ([PR #8717](https://github.com/apollographql/router/pull/8717)) | |||
|
|
|||
| Adds the `apollo.router.request.memory` and `apollo.router.query_planner.memory` metrics which track allocations/deallocations during the request lifecycle. | |||
There was a problem hiding this comment.
Could this include some more context about how this is done? If I were reading this without context, I wouldn't be aware of the fact that this required a custom allocator.
Checklist
Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.
The majority of this work is from #8525, this PR includes some extra tests and places where memory tracking has been added.
Performance testing with vegeta shows the changes have negligible impact:
Exceptions
Note any exceptions here
Notes
Footnotes
It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this. ↩
Configuration is an important part of many changes. Where applicable please try to document configuration examples. ↩
A lot of (if not most) features benefit from built-in observability and
debug-level logs. Please read this guidance on metrics best-practices. ↩Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions. ↩