
aggregated chunks too loosely checked for GC #1168

Closed
Dieterbe opened this issue Dec 7, 2018 · 3 comments

Comments

@Dieterbe
Contributor

Dieterbe commented Dec 7, 2018

@woodsaj and I have been looking into a case of a missing rollup chunk for a customer after a write node restart, and this is what we found:

For series that stop receiving data:

  • raw chunks are closed and persisted at a GC run when:

    • they haven't been written to for chunk-max-stale (1h) or more,
    • and we should have started a new chunk 15 min ago or more.
  • we only invoke GC logic for aggregated chunks once metric-max-stale (6h) is reached. Due to the GC frequency (1h), it can take up to 7 hours after the last datapoint is received before the aggregated chunks are persisted. (A sketch of both checks follows below.)
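
To make the two checks concrete, here's a minimal, self-contained sketch of the conditions described above, using hypothetical names and an example raw chunkspan (this is not metrictank's actual GC code):

	package main

	import (
		"fmt"
		"time"
	)

	// Illustrative constants using the values discussed above; the names are
	// hypothetical, not metrictank's actual identifiers.
	const (
		chunkMaxStale  uint32 = 3600     // chunk-max-stale: 1h
		metricMaxStale uint32 = 6 * 3600 // metric-max-stale: 6h
		rawChunkSpan   uint32 = 2 * 3600 // example raw chunkspan
		startGrace     uint32 = 15 * 60  // "should have started a new chunk 15min ago"
	)

	// rawChunkCollectable: a raw chunk is closed and persisted at a GC run once no
	// point has arrived for chunk-max-stale AND its chunkspan ended >= 15min ago.
	func rawChunkCollectable(now, lastWrite, t0 uint32) bool {
		return lastWrite < now-chunkMaxStale && t0+rawChunkSpan+startGrace < now
	}

	// aggChunksCollectable: aggregated chunks are currently only GC'd once the
	// whole series has been stale for metric-max-stale.
	func aggChunksCollectable(now, lastWrite uint32) bool {
		return lastWrite < now-metricMaxStale
	}

	func main() {
		now := uint32(time.Now().Unix())
		lastWrite := now - 2*3600 // last datapoint arrived 2h ago
		t0 := now - 4*3600        // the raw chunk started 4h ago

		fmt.Println(rawChunkCollectable(now, lastWrite, t0)) // true: raw chunk gets persisted
		fmt.Println(aggChunksCollectable(now, lastWrite))    // false: agg chunks wait up to 6h + gc interval
	}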

Now, the problem is that aggregated chunks tend to be too long for this mechanism to support.
E.g. our current best practices use aggregated chunks of up to 6h, and a kafka retention of 7.2 hours.
If the sender stops sending before transitioning into the new chunkspan, the chunk holds between 0 and 6h worth of data. In the worst case, it holds just under 6h worth and then needs to wait an additional ~7h before GC kicks in to get the data saved. Add another hour or so to cover the time it takes for the chunk to make it through the queue into the store, and for operators to intervene in case the writer crashes and has trouble restarting (after which the new writer will need to seek back in kafka to the beginning of the chunk's data).
Clearly, our 7h retention doesn't account for this.
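
Spelled out as a quick sketch (the 1h queue/operator margin is this issue's rough estimate, not a measured value):

	package main

	import (
		"fmt"
		"time"
	)

	func main() {
		// Rough worst-case age of the oldest unpersisted datapoint, using the
		// best-practice values above.
		aggChunkSpan := 6 * time.Hour   // chunk can hold just under 6h of data
		metricMaxStale := 6 * time.Hour // agg chunks only GC'd after this much staleness
		gcInterval := 1 * time.Hour     // GC only runs every hour
		queueAndOps := 1 * time.Hour    // write queue + operator intervention margin

		worstCase := aggChunkSpan + metricMaxStale + gcInterval + queueAndOps
		retention := time.Duration(7.2 * float64(time.Hour))

		fmt.Printf("worst case %v vs kafka retention %v\n", worstCase, retention) // 14h0m0s vs 7h12m0s
	}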

Solutions?:
a) increase retention to longest-chunk + gc-interval + metric-max-stale + safety window = 6 + 1 + 6 + 1 = 14h
b) fix the code so that all chunks are subject to chunk-max-stale, not just the raw chunks.

b seems like the best and most correct option.

@woodsaj
Member

woodsaj commented Dec 10, 2018

I think the quick fix here is to simply remove the metric-max-stale setting.
The original intent of this setting was to keep hold of closed chunks so they would continue to be used for reads. However, the chunkCache was added after this original code, so we should just move the closed chunks there (if the series is actively being queried).

We can then simplify the logic to:

  • if the current raw chunk is "collectable" (no data received for chunk-max-stale and its chunkspan ended more than 15 minutes ago): close, push to cache and persist the chunk.

    • If the aggregated chunks are also "collectable" (if the chunkspans of raw and aggregated end at the same time): close, push to cache and persist them. Then immediately remove the AggMetric from memory.
  • if the current raw chunk is already persisted: check if the aggregated chunks are collectable. If so, close, push to cache and persist them. Then immediately remove the AggMetric from memory.

https://github.com/grafana/metrictank/blob/master/mdata/aggmetric.go#L607-L624

To

	if !currentChunk.Series.Finished {
		// chunk hasn't been written to in a while, and is not yet closed. Let's close it and persist it if
		// we are a primary
		log.Debugf("AM: Found stale Chunk, adding end-of-stream bytes. key: %v T0: %d", a.Key, currentChunk.Series.T0)
		currentChunk.Finish()
		a.pushToCache(currentChunk)
		if cluster.Manager.IsPrimary() {
			log.Debugf("AM: persist(): node is primary, saving chunk. %v T0: %d", a.Key, currentChunk.Series.T0)
			// persist the chunk. If the writeQueue is full, then this will block.
			a.persist(a.CurrentChunkPos)
		}
	}
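	// always run GC on the aggregated chunks, whether or not the raw chunk was already finished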
	return a.gcAggregators(now, chunkMinTs)

@Dieterbe
Contributor Author

Dieterbe commented Dec 11, 2018

a few comments on this suggestion:

  1. The current code actually has 2 bugs: the problem mentioned in the OP (in the currentChunk.Series.Finished case we only GC agg chunks if a.lastWrite < metricMinTs), but also, in the other case (!currentChunk.Series.Finished), we don't trigger GC on agg chunks at all and instead wait until the next GC run. We should always run a.gcAggregators(), which your solution does, so that's good.

  2. your suggestion also implies:

- # max age for a chunk before to be considered stale and to be persisted to Cassandra
+ # max age for a chunk before being considered stale, persisted and the series removed from the tank
chunk-max-stale = 1h
- # max age for a metric before to be considered stale and to be purged from in-memory ring buffer.
- metric-max-stale = 6h

One problem I see is that if a metric is "not hot" in the cache, the chunk won't be cached; and even if it was hot, it may be evicted at any time (e.g. under memory pressure, if the cache can only retain the hottest series).
The result is that when readers and writers run GC around the same time, the readers remove AggMetrics from the tank while the writer starts saving them, which may take some time (typically between 0 and 40 minutes if a majority of series become stale at the same time). If a query then comes in for the data, it may not be available in the tank, the cache, nor the persisted store.
This can also happen if a read node GC's before a write node.

Thus, I think we should keep metric-max-stale.
How about we replace https://github.com/grafana/metrictank/blob/master/mdata/aggmetric.go#L607-L625
with

	if !currentChunk.Series.Finished {
		// chunk hasn't been written to in a while, and is not yet closed. Let's close it and persist it if
		// we are a primary
		log.Debugf("AM: Found stale Chunk, adding end-of-stream bytes. key: %v T0: %d", a.Key, currentChunk.Series.T0)
		currentChunk.Finish()
		a.pushToCache(currentChunk)
		if cluster.Manager.IsPrimary() {
			log.Debugf("AM: persist(): node is primary, saving chunk. %v T0: %d", a.Key, currentChunk.Series.T0)
			// persist the chunk. If the writeQueue is full, then this will block.
			a.persist(a.CurrentChunkPos)
		}
	}
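	// always GC the aggregated chunks, but only report the AggMetric as removable
	// once the series has also been stale for metric-max-stale, so primaries get
	// a chance to persist before the series is dropped from the tank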
	return a.gcAggregators(now, chunkMinTs) && a.lastWrite < metricMinTs

This way we always call gcAggregators when we should, but we delay the removal of the series until the writer has had a chance to persist. To your point, we can reduce the metric-max-stale setting to 3 hours or so, so that even with differing GC schedules there's sufficient time to persist chunks to the store.
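
Concretely, that could look something like this in the config (the 3h value being the ballpark suggested above, not a settled number):

- # max age for a metric before to be considered stale and to be purged from in-memory ring buffer.
- metric-max-stale = 6h
+ # max age for a metric before being considered stale and purged from the in-memory ring buffer.
+ metric-max-stale = 3h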

  3. @shanson7 should be interested in this as well; see "Make sure rollup data is persisted before GC" #840, which was a fix for rollup chunks not getting persisted upon GC at all (yikes). Here we're essentially improving upon that solution.

@Dieterbe
Copy link
Contributor Author

awoods 2:37 PM
looks good to me.

Dieterbe added a commit that referenced this issue Dec 11, 2018
6h is needlessly long, we don't need to occupy RAM
like that. see #1168