Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add native histogram specification #2539

Merged
merged 7 commits into from
Dec 12, 2024
Merged

Add native histogram specification #2539

merged 7 commits into from
Dec 12, 2024

Conversation

beorn7
Copy link
Member

@beorn7 beorn7 commented Nov 4, 2024

Here is finally the opus magnum that was in the making for far too long. Partially addressing prometheus/prometheus#11561 .

It is partially more documentation than specification, but I assume the specification section is still the least bad place for it.

This is intentionally not yet complete. TODOs mark the missing parts, most of which cannot be resolved at the moment because implementation work is still missing and/or decisions still have to be made.

The intention here is to finally get something out here, even if it is pretty much still a "living document". It will probably take us another year or so to resolve all TODOs.

Dear reviewers, primary goal of the review is to detect what I got wrong here and what I have forgotten. Language improvements are of course welcome, but of secondary importance. And as said above, the TODOs should mostly stay for this first version. Of course, if you have an easy way of resolving one, be my guest.

Tagging potential reviewers here (simply focus on your areas of expertise):
@bwplotka @juliusv @krajorama @zenador @fatsheep9146 @NeerajGartia21 @Maniktherana @fpetkovski @vesari @carrieedwards @fionaliao @codesome @ArthurSens @csmarchbanks @fstab

It is partially more documentation than specification, but I assume
the specification section is still the least bad place for it.

This is intentionally not yet complete. TODOs mark the missing parts,
most of which cannot be resolved at the moment because implementation
work is still missing and/or decisions still have to be made.

The intention here is to finally get something out here, even if it is
pretty much still a "living document". It will probably take us
another year or so to resolve all TODOs.

Signed-off-by: beorn7 <[email protected]>
@beorn7
Copy link
Member Author

beorn7 commented Nov 4, 2024

https://deploy-preview-2539--prometheus-docs.netlify.app/docs/specs/native_histograms/ is the preview, very convenient to see the rendered layout.

@krajorama
Copy link
Member

cc @jhesketh @charleskorn have also dealt with native histograms due to their query engine work.

@krajorama krajorama self-assigned this Nov 6, 2024
Copy link
Member

@krajorama krajorama left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of comments. Haven't gone through yet even once, on line 738 Instrumentation libraries.

content/docs/specs/native_histograms.md Outdated Show resolved Hide resolved
content/docs/specs/native_histograms.md Show resolved Hide resolved
content/docs/specs/native_histograms.md Outdated Show resolved Hide resolved
content/docs/specs/native_histograms.md Outdated Show resolved Hide resolved
content/docs/specs/native_histograms.md Outdated Show resolved Hide resolved
content/docs/specs/native_histograms.md Show resolved Hide resolved
content/docs/specs/native_histograms.md Show resolved Hide resolved
Copy link
Member

@krajorama krajorama left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done with first read through

content/docs/specs/native_histograms.md Outdated Show resolved Hide resolved
content/docs/specs/native_histograms.md Show resolved Hide resolved
content/docs/specs/native_histograms.md Show resolved Hide resolved
content/docs/specs/native_histograms.md Show resolved Hide resolved
content/docs/specs/native_histograms.md Show resolved Hide resolved
content/docs/specs/native_histograms.md Outdated Show resolved Hide resolved
content/docs/specs/native_histograms.md Show resolved Hide resolved
content/docs/specs/native_histograms.md Show resolved Hide resolved
content/docs/specs/native_histograms.md Show resolved Hide resolved
content/docs/specs/native_histograms.md Show resolved Hide resolved
content/docs/specs/native_histograms.md Show resolved Hide resolved
content/docs/specs/native_histograms.md Outdated Show resolved Hide resolved
content/docs/specs/native_histograms.md Outdated Show resolved Hide resolved
content/docs/specs/native_histograms.md Outdated Show resolved Hide resolved
content/docs/specs/native_histograms.md Outdated Show resolved Hide resolved
content/docs/specs/native_histograms.md Outdated Show resolved Hide resolved
content/docs/specs/native_histograms.md Outdated Show resolved Hide resolved
content/docs/specs/native_histograms.md Outdated Show resolved Hide resolved
content/docs/specs/native_histograms.md Outdated Show resolved Hide resolved
content/docs/specs/native_histograms.md Outdated Show resolved Hide resolved
content/docs/specs/native_histograms.md Outdated Show resolved Hide resolved
content/docs/specs/native_histograms.md Outdated Show resolved Hide resolved
content/docs/specs/native_histograms.md Outdated Show resolved Hide resolved
content/docs/specs/native_histograms.md Outdated Show resolved Hide resolved
To be squashed before merging.

Signed-off-by: beorn7 <[email protected]>
@beorn7
Copy link
Member Author

beorn7 commented Nov 13, 2024

Thanks everyone for the review so far.

In the commit I have just pushed I have addressed all the feedback and resolved the conversation where I assumed that my response is uncontroversial. The conversations still open might require more discussion.

I have not yet reworded the counter reset handling according to our recent discussions. I will do that tomorrow with a refreshed brain.

This now takes into account that we (currently) have to return
"unknown counter reset" for every 1st histogram from a counter
histogram chunk and links the tracking issue to eventually change
that.

In fact, the text is now slightly shorter and simpler.

Signed-off-by: beorn7 <[email protected]>
@beorn7
Copy link
Member Author

beorn7 commented Nov 14, 2024

OK, counter reset handling should be up to date now.

@beorn7
Copy link
Member Author

beorn7 commented Dec 4, 2024

Note: Apparently, I forgot to send all the 52 pending replies I wrote several weeks ago.
They are submitted now, so now it makes way more sense where I resolved a conversation and where not.

Squash this before merging.

Signed-off-by: beorn7 <[email protected]>
@beorn7
Copy link
Member Author

beorn7 commented Dec 5, 2024

OK, I think we are done. @krajorama please let me know if I have missed anything.

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Quite an effort and level of the details. Thanks for this work!

Added a few comments, I wonder if we could prepare the reader for nhcb dimension in the glossary and flavors 🤔 Also should we version this spec (like we do with PRW)

I also have a general questions to counter reset hints. We could simplify by using the CT feature on native histograms instead of reset hints. Those feature do the same, just CT allows a bit more information to be passed (when exactly reset happened). I understand the reset hints are implemented and "baking" for stability, so perhaps it should be improved in some 2.0 🤔

content/docs/specs/native_histograms.md Show resolved Hide resolved
content/docs/specs/native_histograms.md Show resolved Hide resolved
content/docs/specs/native_histograms.md Show resolved Hide resolved
content/docs/specs/native_histograms.md Show resolved Hide resolved
content/docs/specs/native_histograms.md Show resolved Hide resolved
defines the way bucket boundaries are calculated. The currently valid values
are -53 and the range between and including -4 and +8. More schemas may be
added in the future. -53 is a schema for so-called _custom bucket boundaries_
or short _custom buckets_, while the other schema numbers represent the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add those names to glossary 🤗

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can consider extending the glossary to all the new terms introduced by this document. However, there would be a lot. If this was a printed document, it might still be helpful for cross reference, but since this is electronic, the reader will simply search for any term they don't remember. (The few terms in the glossary so far are those that might easily be misunderstood or confused.)

content/docs/specs/native_histograms.md Show resolved Hide resolved
content/docs/specs/native_histograms.md Show resolved Hide resolved
approach, which might be improved in the future to find a better trade-off
between chunk size and compression ratio.

### Counter reset considerations
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite complex. What do you think about switching to Created Timestamp feature for it? Would it simplify it? We will have it for all counters soon.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CT is in itself a new and experimental feature. I wouldn't conflate them.

Even if CT were already established, I don't think it can be relied upon for NH ingestion. We need to detect counter reset as part of the encoding anyway. CT is more an additional source of information: We can mark (and store) a counter reset even if it is not detectable by conventional means, see below.

The counter reset information of the synthetic zero sample is always set to
`CounterReset`. (TODO: Currently, Prometheus probably sets it to
`UnknownCounterReset` for the first sample of a series, which is not wrong, but
I think setting it to `CounterReset` makes more sense.)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I think we want to switch to native CT handling, not synthetic zeros. So switching counter reset to CT would be one way of doing this, going forward 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

Copy link
Member

@krajorama krajorama left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as starting point, awesome work .

@beorn7
Copy link
Member Author

beorn7 commented Dec 10, 2024

Thanks @krajorama & @bwplotka

@bwplotka : I'll address the individual comments soon. Just a few general things in advance:

I wonder if we could prepare the reader for nhcb dimension in the glossary and flavors

I tried to include NHCB wherever it was already possible. NHCB is a schema, though, and thus an already defined dimension of the data model. I wouldn't put it under "flavor" (which is anyway a bit of a makeshift solution to have a word for the counter/gauge and integer/float dichotomies).

WRT glossary: My intention was to nail down the words that the reader might get wrong (because of historical baggage ("sparse histogram") or confusion with previously existing concepts (classic vs. native histograms)). Adding all the terms newly introduced by this document would be quite repetitive, as one of the main purposes of this document is to introduce and explain those terms.

Also should we version this spec (like we do with PRW)

Maybe. But not while it is not really done. It is kind-of 0.x right now. But even once it is finalized, I'm not sure if versioning makes sense because this is more a concept document than strictly a spec. Let's discuss once we have declared native histograms a stable feature.

I also have a general questions to counter reset hints. We could simplify by using the CT feature on native histograms instead of reset hints. Those feature do the same, just CT allows a bit more information to be passed (when exactly reset happened). I understand the reset hints are implemented and "baking" for stability, so perhaps it should be improved in some 2.0

Counter resets are more deeply baked into NH. For example, it is impossible to have a counter reset in the middle of a chunk, based on the way the encoding works. Thus, we cannot rely on an external feature, that might or might not be present, to mark counter resets. Or even if we did rely on it, we would still store the counter reset implicitly. In any case, we have to see how CT shakes out in practice, how clock skew is handled and all of that.

Signed-off-by: beorn7 <[email protected]>
Signed-off-by: beorn7 <[email protected]>
@beorn7
Copy link
Member Author

beorn7 commented Dec 10, 2024

One more thought about the terms I put into the glossary: I think I understand my reluctance to put all the new terms into it better now. What I want is that the reader can just read through the whole document in order, including the glossary. This means it should only contain terms the reader is already somewhat familiar with. A "conventional" glossary is more something you would put at the end of the document where the reader can refer to if they don't remember a certain term in the middle of the text. But the glossary I'm using here is more along the lines of "let's get some common misunderstandings out of the way first".

- The next negative bucket (index _i_+1 relative to the bucket from the
previous item) has an upper exclusive limit of `MinFloat64` and an lower
inclusive limit of `-Inf`. (It could be called a _negative overflow bucket_.)
- Buckets beyond the `+Inf` and `-Inf` buckets described above MUST NOT be used.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the positive overflow bucket and negative overflow bucket might be more accurate than +Inf and -Inf buckets?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those terms are only introduced in parentheses with "they could be called". I'd rather refer to the limits as that's firmly established before. It's also very accurate because the only observed value that will go into the +Inf bucket is, well, +Inf, and likewise for -Inf.

@bwplotka
Copy link
Member

Counter resets are more deeply baked into NH. For example, it is impossible to have a counter reset in the middle of a chunk, based on the way the encoding works. Thus, we cannot rely on an external feature, that might or might not be present, to mark counter resets. Or even if we did rely on it, we would still store the counter reset implicitly. In any case, we have to see how CT shakes out in practice, how clock skew is handled and all of that.

We have exactly similar problem with CT to solve, so I think it makes sense to think how in future CT could replace reset hints. The only reason "why not" is that reset hints obviously take less bytes.

However, happy to stop pressing this topic into the stabilization and docs of 1.0 NH here, too early agree 👍🏽

@beorn7
Copy link
Member Author

beorn7 commented Dec 12, 2024

@bwplotka is there anything left to address right now? Can I merge this?

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bwplotka is there anything left to address right now? Can I merge this?

All good, thanks for an amazing and deep work here! 💪🏽

@beorn7 beorn7 merged commit bf0e251 into main Dec 12, 2024
5 checks passed
@beorn7 beorn7 deleted the beorn7/histogram branch December 12, 2024 12:59
@beorn7
Copy link
Member Author

beorn7 commented Dec 12, 2024

There we go. And this is just the beginning. 43 TODOs to address… 😅

from zero, a relatively large threshold of the zero bucket helps to avoid
many high resolution buckets for a range that is not of interest.

The threshold of the zero bucket SHOULD coincide with a boundary of a regular
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do this should be "SHOULD NOT" instead of "SHOULD"? I take "coincide with" as "same or overlapped".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be SHOULD.

What I want to express is that the zero bucket threshold should be the the same as a boundary of one of the regular buckets. My understanding is that "coincide" is precisely that. But happy to be corrected by a native speaker.

Copy link
Contributor

@fatsheep9146 fatsheep9146 Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying, I understand you now. It should be SHOULD.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants