-
Notifications
You must be signed in to change notification settings - Fork 21
proposal: TSDB Support for Cumulative CT (and Delta ST on the way) #60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
4f9eb07 to
03c37e3
Compare
033d077 to
704dee5
Compare
|
FYI: We met for 1h with the delta WG (@ArthurSens @carrieedwards @fionaliao @ywwg) for an initial discussion around this proposal decisions. Thanks for this productive time! Here are some notes:
Also updated proposal today with some learnings. Finally proposed a single feature flag for this work ( Still lots of TODOs and anyone is welcome to help! |
Signed-off-by: bwplotka <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this proposal! I added a bunch of comments, some of which are answered by the paragraph right after the comment 😅 . I think my main concern is nailing down the Goals section. This is not at all to question whether we should do the work, just that I think our statement of intent needs to be unequivocable.
|
|
||
| See the details, motivations and discussions about the delta temporality in [PROM-48](https://github.com/prometheus/proposals/pull/48). | ||
|
|
||
| The core TL;DR relevant for this proposal is that the delta temporality counter sample can be conceptually seen as a "mini-cumulative counter". Essentially delta is a single-sample (value) cumulative counter for a period between (sometimes inclusive sometimes exclusive depending on a system) start(ST)/create(CT) timestamp and a (end)timestamp (inclusive). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the first time Start Timestamp is mentioned, and for people reading linearly it will kind of pop up out of nowhere. Let's move the next section, or at least parts of it, up higher and get the terminology questions out of the way as soon as possible. And then later on we can clarify the subtleties.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the way to do that is to acknowledge Otel explicitly, and briefly summarize how they define a delta sample
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's mentioned "a delta start timestamp (ST)." in TL;DR, but you are right we could mention it clearly somewhere. Especially once we decide if we go with ST or CT fully, we could just use one and only have a quick mention about other. 👍🏽
|
|
||
| The core TL;DR relevant for this proposal is that the delta temporality counter sample can be conceptually seen as a "mini-cumulative counter". Essentially delta is a single-sample (value) cumulative counter for a period between (sometimes inclusive sometimes exclusive depending on a system) start(ST)/create(CT) timestamp and a (end)timestamp (inclusive). | ||
|
|
||
| In other words, instant query for `increase(<cumulative counter>[5m])` produces a single delta sample for a `[t-5m, t]` period (V: `increase(<counter>[5m])`, CT/ST: `now()-5m`, T: `now()`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if it's just me, but I am confused by the idea of producing a delta sample. To me, delta samples are things that are written to storage and not produced by promQL statements. For me, PromQL queries only ever produce instant/range scalar/vector results. And an increase() function would operate on one or more delta samples in the desired range, producing an "instant scalar" result, not a "delta sample."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be confusing, but I wonder if there is some value in mentioning this.
For instance what type samples from the recording rules using increase would have? How it could be used further? Putting type for this can be useful, no ?
| * Average: in the order of ~weeks/months for stable workloads, ~days/weeks for more dynamic environments (Kubernetes). | ||
| * Best case: it never changes (infinite count) e.g days_since_X_total. | ||
| * Worse case: it changes for every sample. | ||
| * For the delta we expect CT to change for every sample. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the definitions I can find in the otel docs say that start time is "strongly encouraged" but not required. (ref: https://github.com/open-telemetry/opentelemetry-proto/blob/main/opentelemetry/proto/metrics/v1/metrics.proto#L163-L186)
So I think we need to loosen this statement just a bit to consider the cases where CT does not change for every sample. perhaps: "For the delta we expect CT to change for every sample but can make a best-effort attempt to adapt to situations where the start time is missing."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only reason why start timestamp wasn't made a hard requirement was for prometheus compatibility. All SDKs are required to set the start timestamp. Deltas always set the start timestamp as well to my knowledge.
Expectation that delta start time changes each sample is documented here: https://github.com/open-telemetry/opentelemetry-specification/blob/1e04e1be0e17cae6d01c862049bbeb298e0ffa06/specification/metrics/data-model.md?plain=1#L421
When the aggregation temporality is "delta", we expect to have
no overlap in time windows for metric streams.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's going to be hard to leave with unknown CT for delta, so perhaps, given it's a new thing, we could start with restriction to always have it? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fionaliao Haven't we heard of cases where a delta doesn't have a CT? My understanding is that situation exists in the wild whether we want it to or not. (Is it too late to drop deltas with no CT because that would break people?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even without CTs, deltas could be used and produce useful results with the sum_over_time() function. In my opinion, dropping such values might be a step towards "perfect is the enemy of good" territory.
Maybe a compromise could be to return a warning/info from a rate-like function about missing CTs and potentially inaccurate results? I guess you might want to have this for both cumulative counters and deltas (though for deltas it's critical).
|
|
||
| TODO: Just a draft, to be discussed. | ||
| TODO: There are questions around: | ||
| * Should we do inclusive vs exclusive intervals? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do worry about exclusive/inclusive. Will this need to be a flag / config option? Google has a strict opinion but it sounds like other systems do not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't worry about Google, and definitely let's find the best balanced choice for now, without config flags - if anything it's baked in data, so I any read config would be hard to use (too deep user knowledge needed about their metrics, not very common)
What makes the most sense for Prometheus?
| TODO: Just a draft, to be discussed. | ||
| TODO: There are questions around: | ||
| * Should we do inclusive vs exclusive intervals? | ||
| * Given optionality of this feature, can we even reject sample on TSDB Append if CT is invalid? (MUST or SHOULD on interface?) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it might depend on how it's invalid? Would it be possible to accept invalid samples and solve the problem at query time? I have a hunch that people would prefer to have dirty data than have their data rejected. (I am unsure what the general Prometheus philosophy is on this question). As an example, there used to be a hard rule against OOO, but now there are a number of adaptations to support it.
|
|
||
| CT/ST notion was popularized by OpenTelemetry and early experience exposed a big challenge: CT/ST data is extremely unclean given early adoption, mixed instrumentation support and multiple (all imperfect) ways of ["auto-generation"](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/metricstarttimeprocessor) (`subtract_initial_point` might the most universally "correct", but it added only recently). This means that handling (or reducing ) CT errors is an important detail for consumers of this data. | ||
|
|
||
| TODO: Just a draft, to be discussed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we also have to consider changes to the exposition format? I am thinking of the extra cost of doubling the size of the exposition just to include a second timestamp, vs adding a new term to the exposition line in a way that old systems won't be able to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean? CT is already part of OpenMetrics and PrometheusProto 🤔
| * Changing names often creates confusion | ||
| * Having a slight changed name already makes it clear that we talk about Prometheus semantics of the same thing | ||
|
|
||
| **Rejected** due to not enough arguments for renaming (feel free to challenge this!). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, I thought we had more interest in renaming than this. Maybe an informal poll on slack is sufficient to get an accurate temperature-read?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! Thanks for mentioning. From notes and memory I think it was your vote to rename vs others to not rename or no opinion.
This is a starting point for discussion, we can argument otherwise. Do you have more arguments towards renaming that would help motivate us?
When writing this I think I got sold on not renaming. Both are technically correct and given Prometheus being cumulative native it only makes sense to leave like this. This does not mean we shouldn't go against my vote if we have consensus here to rename. How to find this consensus? (: What would convince you to keep CT naming? (:
Do you mind helping with the vote if you are passionate to challenge this? What are we missing here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I recall, during the discussion someone made the point that there are important distinctions between the two types of points:
- created timestamps communicate a specific idea: the time that a metric came into existence
- start timestamps are more general: the time since the last point (if any).
Sometimes it's useful to know about created timestamps specifically even if the metric also has start timestamps. One thing that would help convince me not to change the name is if this distinction is unnecessary.
I also think it's better to allow language to evolve with a product and I think having accurate names for fields is important for legibility. I always assume there will be more developers in the future who have to learn a system than there are existing developers, and one thing that is tiresome is to have to explain why a certain attribute has a name that does not correctly describe its purpose. "This is the created timestamp, which means the last time a point was received, even though that's not the same time as when it was created." I do not look forward to explaining this to all future learners of Prometheus.
The cumulative weight of legacy conventions and workarounds becomes heavy over time and could contribute to an impression that Prometheus is obsolete or not modern. If we are already taking the time to update and modernize the protocol, it's the perfect time to incur the change cost to make sure that all of the names still make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks
created timestamps communicate a specific idea: the time that a metric came into existence
start timestamps are more general: the time since the last point (if any).
It's one perspective, but there are more e.g. created could also mean when delta sample was created vs when something that counts was created.
I agree you have to be a bit more creative to make create work for this.
The cumulative weight of legacy conventions and workarounds becomes heavy over time and could contribute to an impression that Prometheus is obsolete or not modern. If we are already taking the time to update and modernize the protocol, it's the perfect time to incur the change cost to make sure that all of the names still make sense.
Cool, would we do ST for OpenMetrics 2.0 too, following this pattern?
Let's do some survey with a clear Pros & Cons / context we could figure out. ST as "starting counting" could make sense for cumulative first systems too. I'm not too strongly opposed to changing IF we can consistency change other places to use ST too (PRW we can change, OM 2.0 could change, Proto could change tbh too (it's just field naming)).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Starting to like ST rename idea if we can rename all CT uses in Prometheus, so:
- OpenMetrics 2.0 (I think we can)
- PRW 2 (last moment to do this; we already planned to break and move CT to per sample; renaming in proto is non breaking (only coding API is breaking and it's fine)).
- Prometheus Proto (renaming in proto is non breaking (only coding API is breaking and it's fine)).
- SDKs -> this is tricky as it's stable API
One could say it's fine to keep CT for exposition formats as we don't plan to add deltas there. For pull delta does not make sense really (?). Although we push exposition format in pushgateway, so delta would be useful there 🙃 .. so consistency would be nice.
Feature flag is ok to keep old name (zero timestamp) - as we plan to deprecate that.
So.. it could be possible 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I support renaming to start time to make it make sense for deltas. I would definitely prefer not to have both, as I think it would cause confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * This unlocks the most amount of benefits (e.g. also delta) for the same amount of work, it makes code simpler. | ||
| * We don't know if we need special cumulative best case optimization (yet); also it would be also for some "best" cases. Once we know we can always add those optimizations. | ||
|
|
||
| 2. Similarly, we propose to not have special CT storage cases per metric types. TSDB storage is not metric type aware, plus some systems allow optional CTs on gauges (e.g. OpenTelemetry). We propose keeping that storage flexibility. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the start time on gauges is very confusing, but preserving it is probably the right thing to do.
|
This makes a lot of sense to me. I think performance/benchmarks are probably the biggest potential blocker. |
| Cons: | ||
| * Technically, one could argue that both CT and ST naming are correct in this context. Prometheus is cumulative-first system, which means | ||
| using CT would make sense. | ||
| * We already have CT in scrape protocols and ~Remote Write 2.0; do we need to switch them too? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * We already have CT in scrape protocols and ~Remote Write 2.0; do we need to switch them too? | |
| * We already have CT in scrape protocols (we go deeper into this in [OM 2.0](https://github.com/prometheus/docs/pull/2636)) and ~Remote Write 2.0; do we need to switch them too? |
|
Back from some PTO/leave, will try to address comments and finalize interface and TSDB piece soon |
Co-authored-by: Owen Williams <[email protected]> Signed-off-by: Bartlomiej Plotka <[email protected]>
Co-authored-by: Owen Williams <[email protected]> Signed-off-by: Bartlomiej Plotka <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm adding just a few stylish/correction comments. I'm midway through the document and still haven't reviewed the proposed interfaces.
Regarding the CT vs ST discussion, I see the point that we'll always have more new users than old ones, but I feel like the CT terminology is so ingrained in the ecosystem that even if we change it now, people will continue to call it CT. Of course this is based on "voices in my head" and there's no real confirmation that this is gonna happen in the future 😅
|
|
||
| > TL;DR: We propose to extend Prometheus TSDB storage sample definition to include an extra int64 that will represent the cumulative created timestamp (CT) and, for the future delta temporality ([PROM-48](https://github.com/prometheus/proposals/pull/48)), a delta start timestamp (ST). | ||
| > We propose introducing persisting CT logic behind a single flag `ct-storage`. | ||
| > Once implemented, we propose to deprecate the `created-timestamps-zero-injection` experimental feature. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| > Once implemented, we propose to deprecate the `created-timestamps-zero-injection` experimental feature. | |
| > Once implemented, we propose to remove the `created-timestamps-zero-injection` experimental feature. |
Do we need to go through a deprecation notice for feature flags? I have the impression that the point of feature flags is that they are safe to remove 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
experimental features esp have no guarantee of stability, so that seems correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, it would be nice to give some notice, even for experiment flag, but if you feel remove is more clear wording, I don't mind 👍🏽
Co-authored-by: Arthur Silva Sens <[email protected]> Signed-off-by: Bartlomiej Plotka <[email protected]>
Co-authored-by: Arthur Silva Sens <[email protected]> Signed-off-by: Bartlomiej Plotka <[email protected]>
Co-authored-by: Arthur Silva Sens <[email protected]> Signed-off-by: Bartlomiej Plotka <[email protected]>
Co-authored-by: Arthur Silva Sens <[email protected]> Signed-off-by: Bartlomiej Plotka <[email protected]>
Co-authored-by: Owen Williams <[email protected]> Signed-off-by: Bartlomiej Plotka <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I addressed first pass, thanks for reviews!
As discussed in various places (e.g. prometheus/prometheus#17036 (comment) and delta WG) we decided to create a formal proposal on how CT native Prometheus storage could look like and how to make it useful (unblock) delta temporality.