aggregated chunks too loosely checked for GC #1168
I think the quick fix here is to simply remove the metric-max-stale check. We can then simplify the logic at https://github.com/grafana/metrictank/blob/master/mdata/aggmetric.go#L607-L624 so that all chunks, raw and aggregated alike, are subject to chunk-max-stale.
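A minimal sketch of that direction, using toy types and names rather than the actual metrictank code:

```go
package main

import "fmt"

// Toy model of the entities involved; fields and names are illustrative,
// not the real metrictank types.
type Chunk struct {
	LastWrite uint32 // unix timestamp of the last datapoint written
	Closed    bool
}

type AggMetric struct {
	Current     *Chunk
	Aggregators []*AggMetric // rollup series, each with their own chunks
}

// gc closes (and, in the real code, would queue for persistence) any chunk,
// raw or rollup, that has been stale for chunkMaxStale seconds or more.
func (m *AggMetric) gc(now, chunkMaxStale uint32) {
	if m.Current != nil && !m.Current.Closed && m.Current.LastWrite <= now-chunkMaxStale {
		m.Current.Closed = true
		fmt.Println("persisting stale chunk, lastWrite:", m.Current.LastWrite)
	}
	// apply the same rule to the rollups instead of waiting for metric-max-stale
	for _, agg := range m.Aggregators {
		agg.gc(now, chunkMaxStale)
	}
}

func main() {
	now := uint32(10 * 3600)
	m := &AggMetric{
		Current:     &Chunk{LastWrite: now - 2*3600}, // raw chunk, 2h stale
		Aggregators: []*AggMetric{{Current: &Chunk{LastWrite: now - 2*3600}}},
	}
	m.gc(now, 3600) // chunk-max-stale = 1h: raw and rollup both get persisted
}
```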
A few comments on this suggestion:

One problem I see is that if a metric is "not hot" in the cache, its chunk won't be cached; and even if it was hot, it may be evicted at any time (e.g. under memory pressure, if the cache can only retain the hottest series). Thus, I think we should keep metric-max-stale.

This way we always call gcAggregators when we should, but we delay the removal of the series until the writer has had a chance to persist. To your point, we can reduce the metric-max-stale setting to 3 hours or so, so that even in light of different GC schedules there is sufficient time to persist chunks to the store.
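A sketch of that flow, reusing the toy types from the sketch above (hypothetical names, not the real metrictank API): chunk staleness drives persistence for all chunks, while metric-max-stale only gates dropping the series from memory.

```go
// gcSeries persists stale chunks (raw and rollup alike) on every GC run,
// but only reports the series as removable once it has been quiet for
// metricMaxStale seconds -- giving the writer time to persist first.
func gcSeries(m *AggMetric, now, chunkMaxStale, metricMaxStale uint32) (removable bool) {
	if m.Current != nil && !m.Current.Closed && m.Current.LastWrite <= now-chunkMaxStale {
		m.Current.Closed = true // close + queue for persistence
	}
	for _, agg := range m.Aggregators {
		gcSeries(agg, now, chunkMaxStale, metricMaxStale) // always GC aggregators
	}
	// e.g. with metricMaxStale reduced to 3h, as suggested above
	return m.Current != nil && m.Current.LastWrite <= now-metricMaxStale
}
```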
awoods, 2:37 PM:
> 6h is needlessly long, we don't need to occupy RAM like that. see #1168
@woodsaj and I have been looking into a case of a missing rollup chunk for a customer after a write node restart, and this is what we found:
For series that stop receiving data:
- raw chunks are closed and persisted at a GC run once they have been stale for chunk-max-stale (1h) or more.
- we only invoke GC logic for aggregated chunks when metric-max-stale (6h) is reached. Due to the GC frequency (1h), we could go up to 7 hours after the last datapoint is received before the aggregated chunks are persisted.

Now, the problem is that aggregated chunks tend to be too long for this mechanism to support.
E.g. our best practices currently call for aggregated chunks of up to 6h, and a kafka retention of 7.2 hours.
If the sender stops sending before transitioning into the new chunkspan, then the chunk has between 0 and 6h worth of data. In the worst case, it has just under 6h worth and needs to wait an additional 7h before GC is invoked to get the data saved. Add another hour or so to cover the time it takes for the chunk to make it through the queue into the store, and for operators to intervene in case the writer crashes and has trouble restarting (after which the new writer will need to seek back in kafka to the beginning of the chunk's data).
Clearly, our 7.2h retention doesn't account for this.
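To make the worst case concrete, a back-of-the-envelope calculation with the numbers from above (the breakdown is illustrative):

```go
package main

import "fmt"

func main() {
	// Worst case for a 6h aggregated chunkspan under the current GC logic,
	// all durations in hours.
	chunkData := 6.0      // sender stops just before the chunk would roll over
	metricMaxStale := 6.0 // aggregated chunks only GC'd after this much staleness
	gcInterval := 1.0     // GC runs hourly, so up to 1h extra before it fires
	safety := 1.0         // queue/store latency + operator reaction time

	needed := chunkData + metricMaxStale + gcInterval + safety
	fmt.Printf("kafka retention needed: %.1fh, have: 7.2h\n", needed) // 14.0h
}
```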
Possible solutions:
a) increase retention to longest-chunk + gc-interval + metric-max-stale + safety window = 6 + 1 + 6 + 1 = 14h
b) fix the code so that all chunks are subject to chunk-max-stale, not just the raw chunks.
Option b seems like the best and most correct fix.