Make CortexIngesterReachingSeriesLimit warning less sensitive #362
Thanks Beorn for working on this. I think that what we want from this warning alert is to find out which ingesters are constantly above the threshold even after stale series are flushed (which occurs every 2h, when the TSDB head is compacted). Since a TSDB head compaction flushes series with a timestamp between [-3h, -1h], the worst-case scenario is that it takes up to 3h to flush stale series. So I would rather keep the 70% threshold but with a `for` duration of 3h.
What do you think?
As it turns out, during normal shuffle-sharding operation, the 70% mark is often exceeded, but not by much. Rather than increasing the threshold to 75%, this commit increases the `for` duration to 3h, following the thought that we want this alert to fire if ingesters are constantly above the threshold even after stale series are flushed (which occurs every 2h, when the TSDB head is compacted). We flush series with a timestamp between [-3h, -1h] after the last compaction, so the worst case scenario is that it takes 3h to flush a stale series. Signed-off-by: beorn7 <[email protected]>
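The 3h worst case can be checked with a small model. This is only a sketch under simplifying assumptions that are not taken from the Cortex code: compactions are assumed to run at exact 2h marks, and a series is assumed to be flushed by the first compaction at which its last sample is at least 1h old. The function name and constants are hypothetical.

```python
# Simplified model of the flush timing described above (assumptions, not Cortex code).
COMPACTION_INTERVAL_MIN = 120   # head compaction every 2h, as stated in the discussion
MIN_AGE_AT_COMPACTION_MIN = 60  # newest edge of the flushed [-3h, -1h] window


def minutes_until_flushed(last_sample_min: int) -> int:
    """Minutes between a series' last sample and the compaction that flushes it.

    Compactions are assumed to happen at t = 0, 120, 240, ... minutes.
    """
    earliest = last_sample_min + MIN_AGE_AT_COMPACTION_MIN
    # Round up to the next compaction boundary using integer ceiling division.
    flush_time = -(-earliest // COMPACTION_INTERVAL_MIN) * COMPACTION_INTERVAL_MIN
    return flush_time - last_sample_min


if __name__ == "__main__":
    waits = [minutes_until_flushed(t) for t in range(0, 480)]
    print("best case (min):", min(waits), "worst case (min):", max(waits))
```

At one-minute resolution, the wait in this model ranges from 60 to 179 minutes, so a stale series lingers for just under 3h in the worst case, which is why a `for` duration of 3h covers it.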
        
          
CHANGELOG.md (outdated)
          
        
* [ENHANCEMENT] cortex-mixin: Added `alert_excluded_routes` config to exclude specific routes from alerts. #338
* [ENHANCEMENT] Added `CortexMemcachedRequestErrors` alert. #346
* [ENHANCEMENT] Ruler dashboard: added "Per route p99 latency" panel in the "Configuration API" row. #353
* [ENHANCEMENT] Tweaked threshould and `for` duration for `CortexIngesterReachingSeriesLimit` warning alert. #362
[nit]
Suggested change:
- * [ENHANCEMENT] Tweaked threshould and `for` duration for `CortexIngesterReachingSeriesLimit` warning alert. #362
+ * [ENHANCEMENT] Tweaked `for` duration for `CortexIngesterReachingSeriesLimit` warning alert. #362
Thanks for your thoughts. Let's make it so! I have updated the CHANGELOG.md and the commit description accordingly.
Note: I cannot merge this because I'm not authorized. (I guess @pracucci you are. ;-)
…rting Make CortexIngesterReachingSeriesLimit warning less sensitive
What this PR does:
As it turns out, during normal shuffle-sharding operation, the 70%
mark is often exceeded, but not by much. Therefore, this change sets
the new warning mark at 75%. It also increases the
`for` duration to `15m`, as the expected reaction time for warning alerts is usually in the
order of hours, so we can as well wait a bit longer to see if the
problem is transient.
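To illustrate why a longer `for` duration filters out transient excursions, here is a rough sketch of how a `for` clause gates firing. It assumes the alert condition is "series-to-limit ratio above the threshold" and models time as a sequence of evenly spaced rule evaluations. The function name, the sample ratios, and the strict `>` comparison are hypothetical and not taken from the mixin.

```python
from typing import Optional, Sequence


def first_firing_index(ratios: Sequence[float], threshold: float, for_evals: int) -> Optional[int]:
    """Index of the evaluation at which the alert would fire, or None.

    The alert is pending from the first evaluation where the condition holds,
    and fires once the condition has held for `for_evals` evaluation intervals
    without interruption. Any evaluation below the threshold resets the state.
    """
    pending_since = None
    for i, ratio in enumerate(ratios):
        if ratio > threshold:
            if pending_since is None:
                pending_since = i
            if i - pending_since >= for_evals:
                return i
        else:
            pending_since = None
    return None


if __name__ == "__main__":
    # Hypothetical ratios: slightly above 70%, one dip, then above again.
    ratios = [0.72, 0.72, 0.72, 0.72, 0.72, 0.60, 0.72, 0.72, 0.72]
    print(first_firing_index(ratios, threshold=0.70, for_evals=3))
    print(first_firing_index(ratios, threshold=0.70, for_evals=6))
```

With these made-up values, a short `for` window lets the alert fire during the first run above the threshold, while a longer window never fires because the dip resets the pending state.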
Which issue(s) this PR fixes:
n/a
Checklist
`CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`