Add metrics for cache entry memory size#5770
Conversation
These are: * currently unpopulated. * needs unit tests. * needs to be propagated when a new in memory cache is spawned they are correct.
This comment has been minimized.
This comment has been minimized.
|
CI performance tests
|
| } | ||
|
|
||
| fn serialize_none(self) -> Result<Self::Ok, Self::Error> { | ||
| Ok(self) |
There was a problem hiding this comment.
Just to double check: None would still use the size of Some in memory, right?
We estimate it as 0 because of serve limitations?
There was a problem hiding this comment.
Yeah, I guess in theory we can do something better. But None values won't have strings etc in them so I don't think we should worry about this and maybe we can follow up later to improve things.
apollo-router/src/cache/storage.rs
Outdated
| kind = %self.caller, | ||
| storage = &tracing::field::display(CacheStorageName::Memory), | ||
| ); | ||
| if let Some((_, v)) = in_memory.push(key.clone(), value.clone()) { |
There was a problem hiding this comment.
the in memory cache lock is held until the end of the function, it should only be held when inserting the old entry.
Maybe fixed with something like:
let old_entry = {
self.inner.lock().await.push(key.clone(), value.clone())
};
if let Some((_, v)) = old_entry {|
|
||
| As the size is only an estimation, users should check for correlation with pod memory usage to determine if cache needs to be updated. | ||
|
|
||
| Usage scenario: |
There was a problem hiding this comment.
@shorgi Maybe this info should also be in the docs under cache troubleshooting?
There was a problem hiding this comment.
This is great content for docs. Since it's related to pod memory pressure, moving it to the k8s page.
|
Going to wait for a docs review before merging. |
shorgi
left a comment
There was a problem hiding this comment.
Moved the usage scenario from the changeset into docs
|
|
||
| As the size is only an estimation, users should check for correlation with pod memory usage to determine if cache needs to be updated. | ||
|
|
||
| Usage scenario: |
There was a problem hiding this comment.
This is great content for docs. Since it's related to pod memory pressure, moving it to the k8s page.
Co-authored-by: Edward Huang <edward.huang@apollographql.com>
Co-authored-by: Edward Huang <edward.huang@apollographql.com>
Co-authored-by: Edward Huang <edward.huang@apollographql.com>
Co-authored-by: Edward Huang <edward.huang@apollographql.com>
Co-authored-by: Edward Huang <edward.huang@apollographql.com>
Query planner cache entries may use an significant amount of memory in the Router.
To help users understand and monitor this the Router now exposes a new metric
apollo.router.cache.storage.estimated_size.This metric give an estimated size in bytes for the cache entry and has the following attributes:
kind:query planner.storage:memory.As the size is only an estimation, users should check for correlation with pod memory usage to determine if cache needs to be updated.
Usage scenario:
apollo.router.cache.storage.estimated_size.apollo_router_cache_size.apollo_router_cache_hits-apollo_router_cache_misses.apollo.router.cache.storage.estimated_sizeto see if it grows over time and correlates with pod memory usage.Remediation:
Technical info
The estimate is only implemented for query plans and uses serde to create the estimate without actually writing to a string. It is very difficult to provide accurate memory usage, so it's only useful to see a trend.
Providing an estimated size for a cache entry is optional, and only implemented for query planner at this time.
Where no estimated size is set the gauge will not be emitted, e.g. APQ or entity caching.
Testing
For manual testing we will deploy internally and observe that the metric tops out when the query cache is full.
ROUTER-669
Checklist
Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.
Exceptions
Note any exceptions here
Notes
Footnotes
It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this. ↩
Configuration is an important part of many changes. Where applicable please try to document configuration examples. ↩
Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions. ↩