Conversation
I just had this idea:
That way we would not need to slow down query processing/ingest and we can still guarantee that the cache has no memory leaks.
mdata/cache/accnt/stats.go (outdated revision):
cacheSizeUsed        = stats.NewGauge64("cache.size.used")
cacheSizeMax         = stats.NewGauge64("cache.size.max")
AccntEventSubmission = stats.NewLatencyHistogram15s32("cache.accounting.submission")
why does this metric need to be exported? It is only used in this package.
fixed
we have discussed this a few times on slack. my stance is still the same:
@Dieterbe can you elaborate on what you think the issues with evictions, deletions and cache clearing are?
last time I checked they all triggered many single events into the buffered channel(s), making it more likely that a queue blocks. processing them in batches reduces stress on the queue and generally uses far fewer resources (locks etc), similar to the work already done with AddRange and Chunks in #943
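For illustration, a minimal sketch of the batching idea, assuming hypothetical event and accounting types (EvictTarget, DelMetricBatch and Accnt are made up here, not metrictank's actual code): one batched channel send replaces many single sends.

```go
package main

import "fmt"

// EvictTarget identifies a single cached chunk (illustrative only).
type EvictTarget struct {
	Metric string
	Ts     uint32
}

// DelMetricBatch is one accounting event covering many evictions at once.
type DelMetricBatch struct {
	Targets []EvictTarget
}

// Accnt holds the buffered accounting queue.
type Accnt struct {
	eventQ chan interface{}
}

// DelMetrics submits one batched event instead of len(targets) single events:
// a single (potentially blocking) channel send and far less lock/scheduling
// overhead on the consumer side.
func (a *Accnt) DelMetrics(targets []EvictTarget) {
	if len(targets) == 0 {
		return
	}
	a.eventQ <- DelMetricBatch{Targets: targets}
}

func main() {
	a := &Accnt{eventQ: make(chan interface{}, 100)}
	a.DelMetrics([]EvictTarget{
		{Metric: "some.metric", Ts: 1500000000},
		{Metric: "some.metric", Ts: 1500000600},
	})
	fmt.Println("events queued:", len(a.eventQ)) // 1, not 2
}
```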
right, for the evict case, I think we need to use a limit range for the max cache size. When the cache reaches the upper limit we evict until it is down to the lower limit. This lets us evict in batches and prevents the situation where, once the cache is full, every add results in an evict.
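A toy sketch of that upper/lower limit ("watermark") approach, with made-up names and sizes, just to show the batching effect: eviction only starts above the upper limit and then runs as one batch down to the lower limit.

```go
package main

import "fmt"

const (
	upperLimit = 1000 // eviction kicks in once usage crosses this
	lowerLimit = 800  // and keeps going until usage is back down here
)

// evictBatch drops chunks (oldest first) until usage is at or below
// lowerLimit, returning how many chunks were removed in this single batch.
func evictBatch(chunkSizes []int, used *int) int {
	n := 0
	for n < len(chunkSizes) && *used > lowerLimit {
		*used -= chunkSizes[n]
		n++
	}
	return n
}

func main() {
	used := 0
	var chunkSizes []int
	for i := 0; i < 120; i++ {
		// every add grows the cache by one 10-byte chunk
		used += 10
		chunkSizes = append(chunkSizes, 10)
		if used > upperLimit {
			// one batched eviction instead of one eviction per add
			n := evictBatch(chunkSizes, &used)
			chunkSizes = chunkSizes[n:]
			fmt.Printf("evicted %d chunks in one batch, used=%d\n", n, used)
		}
	}
}
```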
In cases where MT runs on machines with a very high number of cores it might also be worth considering that the cache accounting is always single-threaded. I don't think we've seen such a scenario before, and I don't think we will on our infra, but in edge cases it could become a problem that the accounting doesn't scale beyond a single core.
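To make that concern concrete, a rough sketch (structure guessed from the discussion, not metrictank's actual code) of why a single consumer goroutine caps accounting throughput at roughly one core:

```go
package main

import "fmt"

// flatAccnt is a stand-in for the accounting component: all state is owned
// by a single run() goroutine that drains the event queue sequentially.
type flatAccnt struct {
	eventQ chan string
	done   chan struct{}
	total  int // accounting state, only ever touched by run()
}

func (a *flatAccnt) run() {
	for range a.eventQ {
		a.total++ // every event from every producer is handled here, in order
	}
	close(a.done)
}

func main() {
	a := &flatAccnt{eventQ: make(chan string, 10), done: make(chan struct{})}
	go a.run()
	// many ingest/query goroutines could submit concurrently, but processing
	// is still capped at what this one consumer goroutine can do
	for i := 0; i < 5; i++ {
		a.eventQ <- fmt.Sprintf("evt-%d", i)
	}
	close(a.eventQ)
	<-a.done
	fmt.Println("events accounted:", a.total)
}
```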
@replay can you add to this PR:
@Dieterbe done
this visualization needs work. also, looks like https://github.com/grafana/metrictank/blob/master/docs/operations.md#useful-metrics-to-monitoralert-on needs an update to monitor the queue size
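For reference, one way queue-size monitoring could look conceptually (hypothetical helper, not the actual metrictank stats wiring): sample the fill level of the accounting channel on a ticker and expose it as a gauge that operators can alert on.

```go
package main

import (
	"fmt"
	"time"
)

// reportQueueSize periodically samples how full the accounting channel is.
// In metrictank this would feed a stats gauge; here we just print the values.
func reportQueueSize(eventQ chan interface{}, interval time.Duration, stop chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			fmt.Printf("queue used=%d max=%d\n", len(eventQ), cap(eventQ))
		case <-stop:
			return
		}
	}
}

func main() {
	eventQ := make(chan interface{}, 100)
	eventQ <- struct{}{} // pretend one event is waiting in the queue
	stop := make(chan struct{})
	go reportQueueSize(eventQ, 10*time.Millisecond, stop)
	time.Sleep(35 * time.Millisecond)
	close(stop)
}
```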
(branch force-pushed from 2c4af24 to 10eef29, then from 10eef29 to 9f34629)
@replay done. what do you think?
This is to prevent a memory leak due to cached data that we didn't account for.
There is a risk that this will slow down query processing and/or ingest speed. That's why there is a metric measuring the time it takes to submit accounting events into the queue; it probably makes sense to alert on that metric in production environments, because if that channel fills up and blocks, there will be user-impacting consequences.
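A minimal sketch of what that metric measures (simplified types, not the actual stats.NewLatencyHistogram15s32 wiring): time the send into the buffered accounting channel, so a full, blocking queue shows up as a latency spike worth alerting on.

```go
package main

import (
	"fmt"
	"time"
)

type event struct{ payload string }

type accnt struct {
	eventQ chan event // buffered accounting queue
}

// submit records how long the channel send took. If the consumer falls
// behind and the buffer fills up, this duration grows and directly reflects
// the slowdown imposed on query processing / ingest.
func (a *accnt) submit(e event) time.Duration {
	start := time.Now()
	a.eventQ <- e // blocks while the queue is full
	return time.Since(start)
}

func main() {
	a := &accnt{eventQ: make(chan event, 1)}
	a.eventQ <- event{"filler"} // fill the buffer so the next send blocks

	go func() {
		time.Sleep(50 * time.Millisecond)
		<-a.eventQ // consumer catches up after a delay
	}()

	d := a.submit(event{"hit some.metric"})
	fmt.Println("submission took", d) // ~50ms: the kind of value to alert on
}
```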