
When ingesting data from the future we use a TTL that's higher than the defined TTL #1560

Closed
replay opened this issue Dec 5, 2019 · 11 comments · Fixed by #1572

Comments

@replay
Contributor

replay commented Dec 5, 2019

When ingesting data from the future, the effective TTL with which the data gets inserted into Cassandra is ttl + (timestamp - now). This is due to the changes in #1448.
This means that for data from the future, expiration based on the TTLs defined in storage-schemas.conf is broken.
For example, if a datapoint has a timestamp 5 days in the future and gets inserted into a table with a 30d TTL, it effectively gets a TTL of 35d. This delays the removal of the corresponding SSTable by an additional 5 days.
We should limit the calculated relativeTTL (https://github.com/grafana/metrictank/blob/master/store/cassandra/cassandra.go#L368), which was introduced in #1448, to the defined table TTL (if relativeTTL > ttl { relativeTTL = ttl }).
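For illustration, a minimal sketch of the proposed cap (function and parameter names are hypothetical, not the actual code in cassandra.go):

```go
// cappedRelativeTTL illustrates the proposal: #1448 computes the TTL relative
// to the datapoint's timestamp, so future timestamps (ts > now) produce a
// value larger than ttl. The fix caps it at the table's configured TTL.
func cappedRelativeTTL(ttl, ts, now int64) int64 {
	relativeTTL := ttl - (now - ts)
	if relativeTTL > ttl {
		relativeTTL = ttl // never exceed the TTL defined in storage-schemas.conf
	}
	if relativeTTL < 0 {
		relativeTTL = 0 // datapoint is already older than the retention window
	}
	return relativeTTL
}
```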

Thx for reporting this issue @shanson7

@replay replay added the bug label Dec 5, 2019
@woodsaj
Member

woodsaj commented Dec 5, 2019

I am not sure that we should make this change.
The TTL is deliberately relative to the timestamp of the datapoint to support being able to select the correct rollup to use at query time.

E.g. let's say we have a retention policy of 10s:7d,1m:30d and a user requests data for from=now-6d&to=now. When we process this query we know that we can use the 10s datapoints because we have data for them as far back as now-7d.

Even if the data for this series was ingested with a timestamp that was 2days into the future, the query engine still expects the 10s resolution data to be available up to now-7d.
If we limited the TTL at ingestion time, these datapoints would become unavailable 2 days earlier than expected, so a query for from=now-6d&to=now would have the first day's worth of datapoints missing.
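A simplified sketch of that query-time reasoning (illustrative only, not the actual metrictank query engine):

```go
// retention describes one rollup band: its resolution and how far back
// (relative to now) the query engine expects data to exist.
type retention struct {
	interval int64 // seconds, e.g. 10
	ttl      int64 // seconds, e.g. 7 days
}

// pickRetention returns the highest-resolution band that is expected to still
// have data at the start of the requested range. With 10s:7d,1m:30d and
// from=now-6d this picks the 10s band; if the 10s data expired early because
// its TTL was capped at ingestion, the start of the range would come back empty.
func pickRetention(retentions []retention, from, now int64) retention {
	for _, r := range retentions { // assumed ordered finest-first
		if from >= now-r.ttl {
			return r
		}
	}
	return retentions[len(retentions)-1] // coarsest band as fallback
}
```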

@shanson7
Collaborator

shanson7 commented Dec 5, 2019

The issue is with Cassandra and TWCS (which is the recommended strategy for this type of data).

Any datapoints written at the same time to the same table get compacted together. The SSTables can only get cleaned up when all data is tombstoned (either by delete or TTL). Writing future data with longer TTLs will prevent deletion from disk (possibly for far longer than the intended TTL).

As a concrete example, if writing 10TB / day of data with an intended TTL of 7d, one would expect ~70TB of data (times replication factor) after 7 days and then a sawtooth pattern as the SSTables get cleaned up.

If even a tiny fraction of data is written with, for example, a timestamp 1 year in the future, then the TTL would mean these SSTables linger for a year and use ~52 times the storage.

Because metrictank sets the window length for TWCS and the TTL range per table, I think it needs to play by those rules.
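To make the compaction concern concrete, here is a toy illustration (not actual Cassandra internals) of why a single far-future cell keeps a whole SSTable on disk:

```go
// A cell expires at writeTime + ttl. With TWCS an SSTable groups cells by
// write time, and the whole file can only be dropped once every cell in it
// has expired, so the cell with the largest TTL dictates when a full day's
// worth of same-window data can actually be deleted from disk.
func sstableDroppableAt(writeTimes, ttls []int64) int64 {
	var droppableAt int64
	for i := range writeTimes {
		if exp := writeTimes[i] + ttls[i]; exp > droppableAt {
			droppableAt = exp
		}
	}
	return droppableAt
}
```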

@woodsaj
Member

woodsaj commented Dec 5, 2019

Changing the TTL to be anything other than relative to the datapoint timestamps is not an option due to the requirement of the query engine to know which rollups have data for requested query time ranges.

To protect against the issue of datapoints with future timestamps breaking TWCS, metrictank should have a config option to simply reject datapoints that have a timestamp greater than now+limit.

TWCS only works when datapoints are written with timestamps that are close to NOW(). If users need to support workloads that do something different, then a different compaction strategy needs to be used.
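A minimal sketch of such a rejection check (names, types and wiring are hypothetical):

```go
import (
	"fmt"
	"time"
)

// rejectFutureData drops datapoints whose timestamp is more than futureLimit
// ahead of the local clock, as a guard for TWCS.
func rejectFutureData(ts time.Time, futureLimit time.Duration) error {
	if delta := time.Until(ts); delta > futureLimit {
		return fmt.Errorf("datapoint is %s in the future, limit is %s", delta, futureLimit)
	}
	return nil
}
```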

@shanson7
Collaborator

shanson7 commented Dec 5, 2019

Changing the TTL to be anything other than relative to the datapoint timestamps is not an option due to the requirement of the query engine to know which rollups have data for requested query time ranges.

Except this is how it worked for a very long time. This change actively endangered a large production setup, and possibly others.

metrictank should have a config option to simply reject datapoints that have a timestamp greater than now+limit.

Is the limit based on the retention of the destination or some statically configured number? If a table is a 10 year table, it would be less sensitive to data 30 days in the future than a 1 day retention table.

TWCS only works when datapoints are written with timestamps that are close to NOW(). If users need to support workloads that do something different, then a different compaction strategy needs to be used.

Metrictank creates the table based on a schema skeleton. Metrictank is what passes the window size, with the assumption that the compaction strategy is TWCS. Basically, MT doesn't really support other compaction strategies for Cassandra.

At the end of the day, all I want is protection from this issue. I'm ok with an option of rejecting data that would be too far in the future for the given retention policy.

@replay
Contributor Author

replay commented Dec 6, 2019

One problem with a future-tolerance-limit which is defined per storage schema is that it will effectively be applied per table. If two schemas have an aggregation with the same TTL, then this means that the deletion of the SSTables may be delayed by the larger future-tolerance-limit of the two schemas. This is confusing to users, because the ability to define the future tolerance per schema looks as if it were freely configurable, while actually it isn't.
As a simplified alternative, how about we add a global configuration parameter that defines the future tolerance limit as a fraction of the max TTL for each storage schema?
A schema with the max ttl of 10y would then get a higher future tolerance limit than one with a max TTL of 1y.
This doesn't really solve the issue of schemas affecting each other, but at least it would make it possible for schemas with a very high max TTL to also get a higher future tolerance limit. The user then just needs to be careful about the side effect of a storage schema with such a high max TTL: the expiration of SSTables used by other schemas with a shared/similar TTL in their aggregations would also be delayed by up to the future tolerance limit of that storage schema with the very high max TTL. This would definitely need to be documented well.
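A rough sketch of that ratio idea, assuming a percentage-style parameter (hypothetical names; the actual implementation may differ):

```go
// futureTolerance derives the per-schema tolerance for future timestamps as a
// percentage of the schema's max TTL, so a 10y schema tolerates more future
// skew than a 1y schema.
func futureTolerance(maxTTL, ratioPercent int64) int64 {
	return maxTTL * ratioPercent / 100
}

// acceptTimestamp applies the tolerance: ts may be at most the computed
// tolerance (in seconds) ahead of now.
func acceptTimestamp(ts, now, maxTTL, ratioPercent int64) bool {
	return ts-now <= futureTolerance(maxTTL, ratioPercent)
}
```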

@Dieterbe
Contributor

Dieterbe commented Dec 13, 2019

Rejecting future data makes me sad. I would rather not suddenly cut out a very valid use case (forecasting) to address a concern about suboptimal compaction.

What if series that send future data go into separate tables?
That way we could keep the extended TTL to make the query engine happy, and we'd have efficient TWCS behavior as well.

This raises two questions:

how does the query engine know, for any given series, which table to query?

There are 2 solutions:

  1. automatically assign a table (based on the positive delta in timestamp) and save it in the index (or rather save the rounded delta in the index). This could follow a similar scheme to what we already have, making assignments in buckets of powers of 2 (possibly with a parameter to control the granularity); see the sketch at the end of this comment. It seems unlikely, though possible, for the delta to change over time. So either we track how it changes over time (this sounds very similar to our index v2 efforts around tracking metric presence over time rather than a single lastUpdate field. When we transition lastUpdate to become periods of presence, we may add positive TTL to the mix. This is not something I want to build right now, but maybe later), or once a "future table" is chosen, we stick to it. This is only a problem when:
    A) forecasts are a significant portion of the data stored in metrictank AND
    B) the positive deltas for any given series change over time.
    If B but not A, there would be a lot of mixed-delta data in the future tables, but it would be separate from the normal data, so compaction would be inefficient (as Sean described) but only on a small amount of data, so acceptable. If A but not B, we have as efficient compaction behavior as possible (there may be some mixed-TTL data, but that's why one could tune the granularity)

  2. make table assignments explicit via storage-schemas.conf or a similar file. Basically saying that metrics matching certain expressions should be assumed to have a certain positive TTL offset, and thus be routed into (a) separate table(s). This does raise some questions around what happens if these rules are changed; MT needs to know where to look. But this is just like the problem for the actual storage-schemas that we also need (and plan) to solve in the near future.

how well does Cassandra handle the increase in number of tables?

I imagine this is probably not a big deal.
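A rough sketch of the power-of-2 bucketing from option 1 (purely illustrative; no such code exists in metrictank today):

```go
import "math/bits"

// futureTableSuffix rounds the positive timestamp delta (in seconds) up to
// the next power of two, so series whose forecasts reach a similar distance
// into the future land in the same "future table" bucket.
func futureTableSuffix(deltaSeconds uint64) uint64 {
	if deltaSeconds == 0 {
		return 0 // not future data: normal table
	}
	// round up to the next power of 2, e.g. 5000s -> 8192s
	return 1 << bits.Len64(deltaSeconds-1)
}
```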

@replay
Contributor Author

replay commented Dec 13, 2019

I would rather not suddenly cut out a very valid use case (forecasting) to address a concern about suboptimal compaction.

That's why it's configurable to some degree, to not completely deny that use case.

The idea of using a distinct table for future data, accepting that compaction on that one will be inefficient, is interesting. It appears to get very complicated though, especially since we'll probably have to be able to deal with both cases A) & B) which you described.

The 2nd suggestion sounds good to me, to just allow the user to assign specific patterns to separate tables via storage-schemas.conf. Then it would also make sense to specify the future tolerance limit in storage-schemas.conf, because then it would truly be freely configurable and it couldn't have hidden negative side effects on TWCS efficiency.
I also think that this seems to be the cleanest solution, because with the alternative I described in my previous comment there is still a possibility that different schemas negatively affect each other's efficiency, and for a user who hasn't read up on TWCS or on how Metrictank groups data into tables it will be really hard to understand why. Allowing the user to configure tables per storage schema would solve this issue completely.

@shanson7
Collaborator

I would rather not suddenly cut out a very valid use case (forecasting) to address a concern about suboptimal compaction.

I agree, but this was really only supported recently in the first place (#1448), and it caused non-deterministic TTL behavior. A solution that still partially supports it while allowing controllable retention is at least a good mid-step.

As an admin of a large mixed-use cluster, it's very difficult to control who is going to insert forecasted data into the system, and when.

@Dieterbe
Contributor

I do want to address the compaction problems, of course! All I am saying is, I don't want it to be at the cost of forecasting. Perhaps we can merge something like @replay 's proposed solution now as a stop-gap. But:

  1. long term I want to allow people to make long term forecasts in a way that doesn't create compaction problems (e.g. via my proposed alternative tables).
  2. we need to be careful not to suddenly stop handling forecasting workloads from our cloud customers.

For 2 it may be enough to just have a mode that doesn't enforce the rule yet, but increments a metric whenever we would have dropped the data had the mode been active. Based on that we can see what the impact would be before we enforce it (@replay's idea). I think I like it.
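A minimal sketch of that dry-run mode (flag and counter are hypothetical):

```go
import "sync/atomic"

// checkFutureTolerance either rejects a datapoint that is too far in the
// future or, in dry-run mode, only counts how many datapoints *would* have
// been rejected, so the impact can be observed before enforcement is enabled.
// It returns true if the datapoint should be kept.
func checkFutureTolerance(ts, now, tolerance int64, enforce bool, wouldReject *uint64) bool {
	if ts-now <= tolerance {
		return true // within tolerance: always accept
	}
	if !enforce {
		atomic.AddUint64(wouldReject, 1) // hypothetical "would have been discarded" counter
		return true                      // dry-run: still accept the datapoint
	}
	return false // enforcement on: drop it
}
```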

@woodsaj
Member

woodsaj commented Dec 16, 2019

I like the idea of users being able to specify a table name for specific retention policies. However, there is no direct relationship between storage-schemas and the backend store where data is persisted, i.e. the "table name" that data is stored in is specific only to the Cassandra store plugin. The Bigtable store plugin does have the concept of a table_name, but it means something very different.

So we would need to handle this in a more generic way, e.g. with a 'storeConfig' field in the storage-schemas config that can be passed to the backend store plugins directly.

e.g. something like:

[forecast]
pattern = forecast.*
retentions = 1s:35d:10min:7
reorderBuffer = 20
reorderBufferAllowUpdate = true
storeConfig = {cassandra: { table_prefix: "metric_forecast_"}}

There is a lot of complexity involved in implementing this though.

IMHO forecasted data is such a niche use case that we really don't need to invest any additional effort. With the proposed future-tolerance-ratio setting, you can already support forecasted data no matter how far into the future the datapoints are. You just need to set the retention of the data to be forecast_window / (future-tolerance-ratio / 100).

E.g., if you want to allow datapoints that have timestamps 14 days into the future, and the future-tolerance-ratio is set to 10, then the retention policy for the forecasted data needs to be:

[forecast14d]
pattern = forecast.*
retentions = 10s:140d:10min:7

@Dieterbe
Contributor

Dieterbe commented Dec 17, 2019

IMHO forecasted data is such a niche use case that we really don't need to invest any additional effort.

The numbers on our cloud platform confirm this. I have looked at the timestamp deltas for all of GrafanaCloud Graphite, and most instances have deltas of up to 30/60s, a couple around 8/10 minutes, and some large customers in us-east send 1h~3h ahead, to varying degrees of consistency. So possibly bugs or poorly configured clocks on their side.
Given these instances have >=7d raw retention, this will be easy to support with reasonable ratio parameters.
