
aggregated chunks too loosely checked for GC #1168

Closed
Dieterbe opened this issue Dec 7, 2018 · 3 comments

Comments

@Dieterbe
Contributor

Dieterbe commented Dec 7, 2018

@woodsaj and I have been looking into a case of a missing rollup chunk for a customer after a write node restart, and this is what we found:

For series that stop receiving data:

  • raw chunks are closed and persisted at a GC run when:

    • they haven't been written to for chunk-max-stale (1h) or more,
    • and we should have started a new chunk 15 min ago or more.
  • we only invoke GC logic for aggregated chunks once metric-max-stale (6h) is reached. Due to the GC frequency (1h), it can take up to 7 hours after the last datapoint is received before the aggregated chunks are persisted. (A sketch of both checks follows below.)
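
To make the two checks concrete, here's a minimal, self-contained sketch of the conditions described above, using hypothetical names and an example raw chunkspan (this is not metrictank's actual GC code):

	package main

	import (
		"fmt"
		"time"
	)

	// Illustrative constants using the values discussed above; the names are
	// hypothetical, not metrictank's actual identifiers.
	const (
		chunkMaxStale  uint32 = 3600     // chunk-max-stale: 1h
		metricMaxStale uint32 = 6 * 3600 // metric-max-stale: 6h
		rawChunkSpan   uint32 = 2 * 3600 // example raw chunkspan
		startGrace     uint32 = 15 * 60  // "should have started a new chunk 15min ago"
	)

	// rawChunkCollectable: a raw chunk is closed and persisted at a GC run once no
	// point has arrived for chunk-max-stale AND its chunkspan ended >= 15min ago.
	func rawChunkCollectable(now, lastWrite, t0 uint32) bool {
		return lastWrite < now-chunkMaxStale && t0+rawChunkSpan+startGrace < now
	}

	// aggChunksCollectable: aggregated chunks are currently only GC'd once the
	// whole series has been stale for metric-max-stale.
	func aggChunksCollectable(now, lastWrite uint32) bool {
		return lastWrite < now-metricMaxStale
	}

	func main() {
		now := uint32(time.Now().Unix())
		lastWrite := now - 2*3600 // last datapoint arrived 2h ago
		t0 := now - 4*3600        // the raw chunk started 4h ago

		fmt.Println(rawChunkCollectable(now, lastWrite, t0)) // true: raw chunk gets persisted
		fmt.Println(aggChunksCollectable(now, lastWrite))    // false: agg chunks wait up to 6h + gc interval
	}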

Now, the problem is that aggregated chunks tend to be too long for this mechanism to support.
E.g. our current best practices use aggregated chunks of up to 6h, and a kafka retention of 7.2 hours.
If the sender stops sending before transitioning into the new chunkspan, the chunk holds between 0 and 6h worth of data. In the worst case, it holds just under 6h worth and then needs to wait an additional ~7h before GC kicks in to get the data saved. Add another hour or so to cover the time it takes for the chunk to make it through the queue into the store, and for operators to intervene in case the writer crashes and has trouble restarting (after which the new writer will need to seek back in kafka to the beginning of the chunk's data).
Clearly, our 7h retention doesn't account for this.
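
Spelled out as a quick sketch (the 1h queue/operator margin is this issue's rough estimate, not a measured value):

	package main

	import (
		"fmt"
		"time"
	)

	func main() {
		// Rough worst-case age of the oldest unpersisted datapoint, using the
		// best-practice values above.
		aggChunkSpan := 6 * time.Hour   // chunk can hold just under 6h of data
		metricMaxStale := 6 * time.Hour // agg chunks only GC'd after this much staleness
		gcInterval := 1 * time.Hour     // GC only runs every hour
		queueAndOps := 1 * time.Hour    // write queue + operator intervention margin

		worstCase := aggChunkSpan + metricMaxStale + gcInterval + queueAndOps
		retention := time.Duration(7.2 * float64(time.Hour))

		fmt.Printf("worst case %v vs kafka retention %v\n", worstCase, retention) // 14h0m0s vs 7h12m0s
	}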

Solutions?:
a) increase retention to longest-chunk + gc-interval + metric-max-stale + safety window = 6 + 1 + 6 + 1 = 14h
b) fix the code so that all chunks are subject to chunk-max-stale, not just the raw chunks.

b seems like the best and most correct option.

@woodsaj
Member

woodsaj commented Dec 10, 2018

I think the quick fix here is to simply remove the metric-max-stale setting.
The original intent of this setting was to keep hold of closed chunks so they would continue to be used for reads. However, the chunkCache was added after this original code, so we should just move the closed chunks there (if the series is actively being queried).

We can then simplify the logic to:

  • if the current raw chunk is "collectable" (no data received for chunk-max-stale and its chunkspan ended more than 15 minutes ago): close, push to cache and persist the chunk.

    • If the aggregated chunks are also "collectable" (if the chunkspans of raw and aggregated end at the same time): close, push to cache and persist them. Then immediately remove the AggMetric from memory.
  • if the current raw chunk is already persisted: check if the aggregated chunks are collectable. If so, close, push to cache and persist them. Then immediately remove the AggMetric from memory.

https://github.com/grafana/metrictank/blob/master/mdata/aggmetric.go#L607-L624

To

	if !currentChunk.Series.Finished {
		// chunk hasn't been written to in a while, and is not yet closed. Let's close it and persist it if
		// we are a primary
		log.Debugf("AM: Found stale Chunk, adding end-of-stream bytes. key: %v T0: %d", a.Key, currentChunk.Series.T0)
		currentChunk.Finish()
		a.pushToCache(currentChunk)
		if cluster.Manager.IsPrimary() {
			log.Debugf("AM: persist(): node is primary, saving chunk. %v T0: %d", a.Key, currentChunk.Series.T0)
			// persist the chunk. If the writeQueue is full, then this will block.
			a.persist(a.CurrentChunkPos)
		}
	}
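	// always run GC on the aggregated chunks, whether or not the raw chunk was already finished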
	return a.gcAggregators(now, chunkMinTs)

@Dieterbe
Contributor Author

Dieterbe commented Dec 11, 2018

a few comments on this suggestion:

  1. The current code actually has 2 bugs: the problem mentioned in the OP (in the currentChunk.Series.Finished case we only GC agg chunks if a.lastWrite < metricMinTs), but also, in the other case (!currentChunk.Series.Finished), we don't trigger GC on agg chunks at all and instead wait until the next GC run. We should always run a.gcAggregators(), which your solution does, so that's good.

  2. your suggestion also implies:

- # max age for a chunk before to be considered stale and to be persisted to Cassandra
+ # max age for a chunk before being considered stale, persisted and the series removed from the tank
chunk-max-stale = 1h
- # max age for a metric before to be considered stale and to be purged from in-memory ring buffer.
- metric-max-stale = 6h

One problem I see is that if a metric is "not hot" in the cache, the chunk won't be cached; and even if it was hot, it may be evicted at any time (e.g. under memory pressure, if the cache can only retain the hottest series).
The result is that when readers and writers run GC around the same time, the readers remove AggMetrics from the tank while the writer starts saving them, which may take some time (typically between 0 and 40 minutes if a majority of series become stale at the same time). If a query then comes in for the data, it may not be available in the tank, the cache, nor the persisted store.
This can also happen if a read node GC's before a write node.

Thus, I think we should keep metric-max-stale.
How about we replace https://github.com/grafana/metrictank/blob/master/mdata/aggmetric.go#L607-L625
with

	if !currentChunk.Series.Finished {
		// chunk hasn't been written to in a while, and is not yet closed. Let's close it and persist it if
		// we are a primary
		log.Debugf("AM: Found stale Chunk, adding end-of-stream bytes. key: %v T0: %d", a.Key, currentChunk.Series.T0)
		currentChunk.Finish()
		a.pushToCache(currentChunk)
		if cluster.Manager.IsPrimary() {
			log.Debugf("AM: persist(): node is primary, saving chunk. %v T0: %d", a.Key, currentChunk.Series.T0)
			// persist the chunk. If the writeQueue is full, then this will block.
			a.persist(a.CurrentChunkPos)
		}
	}
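	// always GC the aggregated chunks, but only report the AggMetric as removable
	// once the series has also been stale for metric-max-stale, so primaries get
	// a chance to persist before the series is dropped from the tank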
	return a.gcAggregators(now, chunkMinTs) && a.lastWrite < metricMinTs

This way we always call gcAggregators when we should, but we delay the removal of the series until the writer has had a chance to persist. To your point, we can reduce the metric-max-stale setting to 3 hours or so, so that even with differing GC schedules there's sufficient time to persist chunks to the store.
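
Concretely, that could look something like this in the config (the 3h value being the ballpark suggested above, not a settled number):

- # max age for a metric before to be considered stale and to be purged from in-memory ring buffer.
- metric-max-stale = 6h
+ # max age for a metric before being considered stale and purged from the in-memory ring buffer.
+ metric-max-stale = 3h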

  3. @shanson7 should be interested in this as well; see "Make sure rollup data is persisted before GC" #840, which was a fix for rollup chunks not getting persisted upon GC at all (yikes). Here we're essentially improving upon that solution.

@Dieterbe
Copy link
Contributor Author

awoods 2:37 PM
looks good to me.

Dieterbe added a commit that referenced this issue Dec 11, 2018
6h is needlessly long, we don't need to occupy RAM
like that. see #1168