This repository was archived by the owner on Aug 23, 2023. It is now read-only.

Block to submit accounting events #1010

Merged
replay merged 6 commits into master from block_for_accounting_events on Aug 30, 2018

Conversation

replay
Contributor

@replay replay commented Aug 21, 2018

This is to prevent a memory leak due to cached data that we didn't account for.

There is a risk that this will slow down query processing and/or ingest speed. That's why there is a metric to measure the time it takes to submit accounting events into the queue; it probably makes sense to alert on that metric in production environments, because if that channel fills up and blocks, there will be user-impacting consequences.
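A minimal sketch of what that measurement amounts to (illustrative names, not the actual metrictank accounting types): the send into the accounting queue blocks when the queue is full, and the blocked time is what the submission-latency metric would capture.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative stand-ins for the real accounting event and queue.
type Event struct{}

type Accnt struct {
	eventQ chan Event
}

// submit blocks until the event fits into the queue and returns how long
// the send took; that duration is what the new latency metric would record
// and what you would alert on in production.
func (a *Accnt) submit(e Event) time.Duration {
	pre := time.Now()
	a.eventQ <- e
	return time.Since(pre)
}

func main() {
	a := &Accnt{eventQ: make(chan Event, 1)}
	// slow consumer simulating a busy accounting goroutine
	go func() {
		for range a.eventQ {
			time.Sleep(10 * time.Millisecond)
		}
	}()
	for i := 0; i < 3; i++ {
		fmt.Println("submission took", a.submit(Event{}))
	}
}
```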

@replay replay requested a review from Dieterbe August 21, 2018 20:54
@replay
Contributor Author

replay commented Aug 21, 2018

I just had this idea:
Instead of just blocking on pushing into this channel, we could try the push in a non-blocking way, and if it does not succeed we cancel the event and return an indication of the failure to the call site.

  • If the event was an add or delete, we simply don't perform that action on the cache. Since it is only a cache, we can choose not to perform the operation when conditions don't allow it.
  • If the event was a hit then ignoring it is not a big problem.
  • If the event was stop or reset then the call site can retry a few times.

That way we would not need to slow down query processing/ingest, and we can still guarantee that the cache has no memory leaks.
We should probably keep track of how many events get cancelled, so we can get alerted if there is an issue. But in general I think this would make the cache more self-healing, because I'm worried that blocking there could cause serious cascading issues.
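A rough sketch of that non-blocking idea, using hypothetical names rather than the real cache accounting API: the event is offered to the queue without blocking, and the call site learns whether it succeeded so it can skip, ignore or retry, and count the cancellations.

```go
package main

import "fmt"

// Hypothetical event kinds, loosely mirroring the cases listed above.
type EventType int

const (
	EvtAdd EventType = iota
	EvtDelete
	EvtHit
	EvtReset
)

type Event struct {
	Type EventType
}

type Accnt struct {
	eventQ chan Event
}

// TrySubmit offers the event to the queue without blocking.
// It returns false when the queue is full, so the caller can react:
// skip the cache operation, ignore the hit, or retry stop/reset events.
func (a *Accnt) TrySubmit(e Event) bool {
	select {
	case a.eventQ <- e:
		return true
	default:
		return false
	}
}

func main() {
	a := &Accnt{eventQ: make(chan Event, 2)}
	dropped := 0
	for i := 0; i < 5; i++ {
		if !a.TrySubmit(Event{Type: EvtHit}) {
			dropped++ // would be exposed as a "cancelled events" counter for alerting
		}
	}
	fmt.Println("dropped events:", dropped)
}
```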

@replay replay requested a review from woodsaj August 21, 2018 21:04
cacheSizeUsed        = stats.NewGauge64("cache.size.used")
cacheSizeMax         = stats.NewGauge64("cache.size.max")
AccntEventSubmission = stats.NewLatencyHistogram15s32("cache.accounting.submission")
Member


why does this metric need to be exported? It is only used in this package.

Contributor Author


fixed

@Dieterbe
Contributor

Dieterbe commented Aug 22, 2018

we have discussed this a few times on slack. my stance is still the same:

  1. if there is a performance problem, we solve it by making whatever is slow faster. it's too soon to think about dropping events and how to handle that. I have a hard time believing that we can't build a chunk cache that is fast enough to work accurately and also doesn't slow requests down significantly. I can't imagine why a cache would have to be that slow. If needed we can always use more cpu, as the cache is very parallelizable, but i don't see anything that is inherently high-latency. the latency measurement, and more benchmarks if needed, will be our guide.
  2. As explained in Failed to submit event to accounting, channel was blocked #942, the key to low latency cache operations is batching (sketched below); the queue handles the rest. chunk cache perf fixes: AddRange + batched accounting #943 already tackled much of this. basically, check all places where we call into the cache and verify that the number of events going into the queue is O(api calls), or O(num-series) if needed, but not O(chunks). last time I checked it looked like the remaining work is around evictions, deletions and cache clearing. this is the low hanging fruit that should be tackled before we merge this. after that we can still optimize further, if needed.
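A minimal sketch of that batching idea (illustrative names, not the actual metrictank accounting API): a whole range of chunks is accounted for with a single queue entry, so the number of queue operations stays O(api calls) rather than O(chunks).

```go
package main

import "fmt"

// ChunkRef describes one cached chunk for accounting purposes.
type ChunkRef struct {
	MetricKey string
	Ts        uint32
	Size      uint64
}

// AddRangeEvent accounts for many chunks with one queue entry.
type AddRangeEvent struct {
	Chunks []ChunkRef
}

type Accnt struct {
	eventQ chan interface{}
}

// AddRange submits a single event covering the whole batch of chunks,
// instead of one event per chunk.
func (a *Accnt) AddRange(chunks []ChunkRef) {
	a.eventQ <- AddRangeEvent{Chunks: chunks}
}

func main() {
	a := &Accnt{eventQ: make(chan interface{}, 100)}
	batch := []ChunkRef{
		{MetricKey: "some.metric", Ts: 0, Size: 2048},
		{MetricKey: "some.metric", Ts: 600, Size: 2048},
	}
	a.AddRange(batch) // one queue operation for the whole read
	fmt.Println("events in queue:", len(a.eventQ))
}
```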

@woodsaj
Member

woodsaj commented Aug 22, 2018

last time I checked it looked like the remaining work is around evictions, deletions and cache clearing. this is the low hanging fruit that should be tackled before we merge this

@Dieterbe can you elaborate on what you think the issues with evictions, deletions and cache clearing are?

@Dieterbe
Contributor

last time I checked they all triggered many single events into the buffered channel(s), making it more likely that a queue blocks. processing them in batches reduces stress on the queue and generally uses far fewer resources (locks etc), similar to the AddRange and batched-accounting work already done in #943

@woodsaj
Member

woodsaj commented Aug 22, 2018

right, for the evict case I think we need to use a limit range for the max cache size: when the cache reaches the upper limit, we evict until the cache is down to the lower limit. This allows us to evict in batches and prevents the situation where, once the cache is full, every add results in an evict.
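A hedged sketch of that eviction scheme (the structure is illustrative, not the actual chunk cache): eviction only kicks in when usage crosses the upper limit and then runs in one batch until usage is back below the lower limit, so a full cache does not pay an eviction per add.

```go
package main

import "fmt"

type Cache struct {
	used     uint64
	maxSize  uint64   // upper limit: crossing it triggers eviction
	lowRatio float64  // evict down to this fraction of maxSize, e.g. 0.8
	lru      []uint64 // chunk sizes, oldest first (stand-in for a real LRU)
}

func (c *Cache) add(size uint64) {
	c.used += size
	c.lru = append(c.lru, size)
	if c.used <= c.maxSize {
		return
	}
	// evict in one batch until we are back below the lower limit
	target := uint64(float64(c.maxSize) * c.lowRatio)
	for c.used > target && len(c.lru) > 0 {
		c.used -= c.lru[0]
		c.lru = c.lru[1:]
	}
}

func main() {
	c := &Cache{maxSize: 1000, lowRatio: 0.8}
	for i := 0; i < 20; i++ {
		c.add(100)
	}
	fmt.Println("used after 20 adds:", c.used) // bounded, evicted in batches
}
```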

@replay
Contributor Author

replay commented Aug 22, 2018

In cases where MT runs on machines with a very high number of cores, it might also be worth considering that the cache accounting is always single-threaded. I don't think we've seen such a scenario before, and I don't think we will on our infra, but in edge cases it could become a problem that the accounting doesn't scale beyond a single core.

@Dieterbe
Contributor

@replay can you add to this PR:

  1. tracking of the size of the queue (there's a stats type that tracks min and max over time, well suited for queues); see the sampling sketch after this list
  2. a change to dashboard.json to visualize both of the new metrics
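A small sketch of how the queue utilisation could be sampled (plain Go with hypothetical names; the real change would feed metrictank's stats package, e.g. the min/max-tracking type mentioned above, rather than printing): "used" is the current channel length and "max" is its capacity, which are also the two values the dashboard panel would plot against each other.

```go
package main

import (
	"fmt"
	"time"
)

type Event struct{}

// reportQueueSize periodically samples the accounting queue.
// In the real code these two values would feed gauge/range metrics
// ("queue used" vs "queue max") instead of being printed.
func reportQueueSize(eventQ chan Event, interval time.Duration, stop chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			fmt.Printf("accounting queue: used=%d max=%d\n", len(eventQ), cap(eventQ))
		case <-stop:
			return
		}
	}
}

func main() {
	eventQ := make(chan Event, 100)
	stop := make(chan struct{})
	go reportQueueSize(eventQ, 10*time.Millisecond, stop)

	for i := 0; i < 42; i++ {
		eventQ <- Event{}
	}
	time.Sleep(30 * time.Millisecond)
	close(stop)
}
```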

@replay
Contributor Author

replay commented Aug 28, 2018

@Dieterbe done

@Dieterbe
Contributor

this visualization needs work.
i suggest you plot queue used vs max queue size, which is another metric you can add based on the actual queue size. see the "chunk cache size" chart on the left.
also you need to fix the y-axis units.
also i suggest you plot the latency in a different style than the queue utilisation: for that metric i would suggest setting linewidth=0 but with some fill under the line. that's a style we use for many other panels also.

also, looks like https://github.com/grafana/metrictank/blob/master/docs/operations.md#useful-metrics-to-monitoralert-on needs an update to monitor the queue size

@replay replay force-pushed the block_for_accounting_events branch from 2c4af24 to 10eef29 on August 30, 2018 14:13
@Dieterbe Dieterbe force-pushed the block_for_accounting_events branch from 10eef29 to 9f34629 on August 30, 2018 17:05
@Dieterbe
Contributor

@replay done. what do you think?

@replay replay merged commit 5e667b3 into master Aug 30, 2018
@Dieterbe Dieterbe deleted the block_for_accounting_events branch September 18, 2018 08:58