This repository was archived by the owner on Aug 23, 2023. It is now read-only.

Block to submit accounting events #1010

Merged
replay merged 6 commits into master from block_for_accounting_events on Aug 30, 2018

Conversation

replay
Contributor

@replay replay commented Aug 21, 2018

This is to prevent a memory leak due to cached data that we didn't account for.

There is a risk that this will slow down query processing and/or ingest speed. That's why there is a metric to measure the time it takes to submit accounting events into the queue; it probably makes sense to alert on that metric in production environments, because if that channel fills up and blocks, there will be user-impacting consequences.
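A minimal sketch of what that measurement amounts to (illustrative names, not the actual metrictank accounting types): the send into the accounting queue blocks when the queue is full, and the blocked time is what the submission-latency metric would capture.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative stand-ins for the real accounting event and queue.
type Event struct{}

type Accnt struct {
	eventQ chan Event
}

// submit blocks until the event fits into the queue and returns how long
// the send took; that duration is what the new latency metric would record
// and what you would alert on in production.
func (a *Accnt) submit(e Event) time.Duration {
	pre := time.Now()
	a.eventQ <- e
	return time.Since(pre)
}

func main() {
	a := &Accnt{eventQ: make(chan Event, 1)}
	// slow consumer simulating a busy accounting goroutine
	go func() {
		for range a.eventQ {
			time.Sleep(10 * time.Millisecond)
		}
	}()
	for i := 0; i < 3; i++ {
		fmt.Println("submission took", a.submit(Event{}))
	}
}
```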

@replay replay requested a review from Dieterbe August 21, 2018 20:54
@replay
Contributor Author

replay commented Aug 21, 2018

I just had this idea:
Instead of just blocking on pushing into this channel, we could try the push in a non-blocking way, and if it does not succeed we cancel the event and return an indication of the failure to the call site.

  • If the event was an add or delete, we simply don't perform that action on the cache. Since it is only a cache, we can choose not to perform the operation when conditions don't allow it.
  • If the event was a hit then ignoring it is not a big problem.
  • If the event was stop or reset then the call site can retry a few times.

That way we would not need to slow down query processing/ingest, and we can still guarantee that the cache has no memory leaks.
We should probably keep track of how many events get cancelled, so we can get alerted if there is an issue. But in general I think this would make the cache more self-healing, because I'm worried that blocking there could cause serious cascading issues.
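A rough sketch of that non-blocking idea, using hypothetical names rather than the real cache accounting API: the event is offered to the queue without blocking, and the call site learns whether it succeeded so it can skip, ignore or retry, and count the cancellations.

```go
package main

import "fmt"

// Hypothetical event kinds, loosely mirroring the cases listed above.
type EventType int

const (
	EvtAdd EventType = iota
	EvtDelete
	EvtHit
	EvtReset
)

type Event struct {
	Type EventType
}

type Accnt struct {
	eventQ chan Event
}

// TrySubmit offers the event to the queue without blocking.
// It returns false when the queue is full, so the caller can react:
// skip the cache operation, ignore the hit, or retry stop/reset events.
func (a *Accnt) TrySubmit(e Event) bool {
	select {
	case a.eventQ <- e:
		return true
	default:
		return false
	}
}

func main() {
	a := &Accnt{eventQ: make(chan Event, 2)}
	dropped := 0
	for i := 0; i < 5; i++ {
		if !a.TrySubmit(Event{Type: EvtHit}) {
			dropped++ // would be exposed as a "cancelled events" counter for alerting
		}
	}
	fmt.Println("dropped events:", dropped)
}
```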

@replay replay requested a review from woodsaj August 21, 2018 21:04
cacheSizeUsed        = stats.NewGauge64("cache.size.used")
cacheSizeMax         = stats.NewGauge64("cache.size.max")
AccntEventSubmission = stats.NewLatencyHistogram15s32("cache.accounting.submission")
Member


why does this metric need to be exported? It is only used in this package.

Contributor Author


fixed

@Dieterbe
Contributor

Dieterbe commented Aug 22, 2018

we have discussed this a few times on slack. my stance is still the same:

  1. if there is a performance problem, we solve it by making whatever is slow faster. it's too soon to think about dropping events and how to handle that. I have a hard time believing that we can't build a chunk cache that is fast enough to work accurately and also doesn't slow requests down significantly. I can't imagine why a cache would have to be that slow. If needed we can always use more cpu, as the cache is very parallelizable, but i don't see anything that is inherently high-latency. the latency measurement, and more benchmarks if needed, will be our guide.
  2. As explained in Failed to submit event to accounting, channel was blocked #942, the key to low latency cache operations is batching (sketched below); the queue handles the rest. chunk cache perf fixes: AddRange + batched accounting #943 already tackled much of this. basically, check all places where we call into the cache and verify that the number of events going into the queue is O(api calls), or O(num-series) if needed, but not O(chunks). last time I checked it looked like the remaining work is around evictions, deletions and cache clearing. this is the low hanging fruit that should be tackled before we merge this. after that we can still optimize further, if needed.
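A minimal sketch of that batching idea (illustrative names, not the actual metrictank accounting API): a whole range of chunks is accounted for with a single queue entry, so the number of queue operations stays O(api calls) rather than O(chunks).

```go
package main

import "fmt"

// ChunkRef describes one cached chunk for accounting purposes.
type ChunkRef struct {
	MetricKey string
	Ts        uint32
	Size      uint64
}

// AddRangeEvent accounts for many chunks with one queue entry.
type AddRangeEvent struct {
	Chunks []ChunkRef
}

type Accnt struct {
	eventQ chan interface{}
}

// AddRange submits a single event covering the whole batch of chunks,
// instead of one event per chunk.
func (a *Accnt) AddRange(chunks []ChunkRef) {
	a.eventQ <- AddRangeEvent{Chunks: chunks}
}

func main() {
	a := &Accnt{eventQ: make(chan interface{}, 100)}
	batch := []ChunkRef{
		{MetricKey: "some.metric", Ts: 0, Size: 2048},
		{MetricKey: "some.metric", Ts: 600, Size: 2048},
	}
	a.AddRange(batch) // one queue operation for the whole read
	fmt.Println("events in queue:", len(a.eventQ))
}
```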

@woodsaj
Member

woodsaj commented Aug 22, 2018

last time I checked it looked like the remaining work is around evictions, deletions and cache clearing. this is the low hanging fruit that should be tackled before we merge this

@Dieterbe can you elaborate on what you think the issues with evictions, deletions and cache clearing are?

@Dieterbe
Contributor

last time I checked they all triggered many single events into the buffered channel(s), making it more likely that a queue blocks. processing them in batches reduces stress on the queue and generally uses far fewer resources (locks etc), similar to the AddRange and batched-accounting work already done in #943

@woodsaj
Member

woodsaj commented Aug 22, 2018

right, for the evict case I think we need to use a limit range for the max cache size: when the cache reaches the upper limit, we evict until the cache is down to the lower limit. This allows us to evict in batches and prevents the situation where, once the cache is full, every add results in an evict.
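A hedged sketch of that eviction scheme (the structure is illustrative, not the actual chunk cache): eviction only kicks in when usage crosses the upper limit and then runs in one batch until usage is back below the lower limit, so a full cache does not pay an eviction per add.

```go
package main

import "fmt"

type Cache struct {
	used     uint64
	maxSize  uint64   // upper limit: crossing it triggers eviction
	lowRatio float64  // evict down to this fraction of maxSize, e.g. 0.8
	lru      []uint64 // chunk sizes, oldest first (stand-in for a real LRU)
}

func (c *Cache) add(size uint64) {
	c.used += size
	c.lru = append(c.lru, size)
	if c.used <= c.maxSize {
		return
	}
	// evict in one batch until we are back below the lower limit
	target := uint64(float64(c.maxSize) * c.lowRatio)
	for c.used > target && len(c.lru) > 0 {
		c.used -= c.lru[0]
		c.lru = c.lru[1:]
	}
}

func main() {
	c := &Cache{maxSize: 1000, lowRatio: 0.8}
	for i := 0; i < 20; i++ {
		c.add(100)
	}
	fmt.Println("used after 20 adds:", c.used) // bounded, evicted in batches
}
```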

@replay
Contributor Author

replay commented Aug 22, 2018

In cases where MT runs on machines with a very high number of cores, it might also be worth considering that the cache accounting is always single-threaded. I don't think we've seen such a scenario before, and I don't think we will on our infra, but in edge cases it could become a problem that the accounting doesn't scale beyond a single core.

@Dieterbe
Contributor

@replay can you add to this PR:

  1. tracking of the size of the queue (there's a stats type that tracks min and max over time, well suited for queues); see the sampling sketch after this list
  2. a change to dashboard.json to visualize both of the new metrics
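A small sketch of how the queue utilisation could be sampled (plain Go with hypothetical names; the real change would feed metrictank's stats package, e.g. the min/max-tracking type mentioned above, rather than printing): "used" is the current channel length and "max" is its capacity, which are also the two values the dashboard panel would plot against each other.

```go
package main

import (
	"fmt"
	"time"
)

type Event struct{}

// reportQueueSize periodically samples the accounting queue.
// In the real code these two values would feed gauge/range metrics
// ("queue used" vs "queue max") instead of being printed.
func reportQueueSize(eventQ chan Event, interval time.Duration, stop chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			fmt.Printf("accounting queue: used=%d max=%d\n", len(eventQ), cap(eventQ))
		case <-stop:
			return
		}
	}
}

func main() {
	eventQ := make(chan Event, 100)
	stop := make(chan struct{})
	go reportQueueSize(eventQ, 10*time.Millisecond, stop)

	for i := 0; i < 42; i++ {
		eventQ <- Event{}
	}
	time.Sleep(30 * time.Millisecond)
	close(stop)
}
```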

@replay
Contributor Author

replay commented Aug 28, 2018

@Dieterbe done

@Dieterbe
Contributor

this visualization needs work.
i suggest you plot queue used vs max queue size, which is another metric you can add based on the actual queue size. see the "chunk cache size" chart on the left.
also you need to fix the y-axis units.
also i suggest you plot the latency in a different style than the queue utilisation: for that metric i would suggest setting linewidth=0 but with some fill under the line. that's a style we use for many other panels also.

also, looks like https://github.com/grafana/metrictank/blob/master/docs/operations.md#useful-metrics-to-monitoralert-on needs an update to monitor the queue size

@replay replay force-pushed the block_for_accounting_events branch from 2c4af24 to 10eef29 on August 30, 2018 14:13
@Dieterbe Dieterbe force-pushed the block_for_accounting_events branch from 10eef29 to 9f34629 on August 30, 2018 17:05
@Dieterbe
Contributor

@replay done. what do you think?

@replay replay merged commit 5e667b3 into master Aug 30, 2018
@Dieterbe Dieterbe deleted the block_for_accounting_events branch September 18, 2018 08:58