Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding CloudEventSource metrics in Prometheus & OpenTelemetry #5259

Merged
merged 12 commits into from
Dec 22, 2023

Conversation

SpiritZhou
Copy link
Contributor

@SpiritZhou SpiritZhou commented Dec 6, 2023

Adding CloudEventSource metrics in Prometheus & OpenTelemetry:
keda_event_emitted_error_totals - Provides an indication of all the errors related to pushing events, per event sink
keda_event_emitted_totals - Provides an indication of all the events that have been emitted, per event sink
keda_event_sinks_totals - Provides an indication of all the event sinks created, per event sink type, per type (namespace/cluster)
keda_event_queue_status - Provides an indication of how many events are droped or still queue

Checklist

Fixes #3531

Signed-off-by: SpiritZhou <[email protected]>
Signed-off-by: SpiritZhou <[email protected]>
Signed-off-by: SpiritZhou <[email protected]>
Signed-off-by: SpiritZhou <[email protected]>
Signed-off-by: SpiritZhou <[email protected]>
Signed-off-by: SpiritZhou <[email protected]>
Signed-off-by: SpiritZhou <[email protected]>
@SpiritZhou SpiritZhou requested a review from a team as a code owner December 6, 2023 05:21
Copy link

github-actions bot commented Dec 6, 2023

Thank you for your contribution! 🙏 We will review your PR as soon as possible.

While you are waiting, make sure to:

Learn more about:

Copy link
Member

@JorTurFer JorTurFer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wonko, you have been working on metric naming refactor to follow best practices, could you take a look to these new metrics please?

@wonko
Copy link
Contributor

wonko commented Dec 6, 2023

@wonko, you have been working on metric naming refactor to follow best practices, could you take a look to these new metrics please?

On naming things:

Prometheus:

  • Indicate what is being measured (like the queue_depth)
  • Add the unit (messages) to your metrics
  • add total where we have an ever increasing counter
keda_cloudeventsource_errored_messages_total
keda_cloudeventsource_sent_messages_total
keda_cloudeventsource_queued_messages

Otel:

keda.cloudeventsource.messages.sent.count
keda.cloudeventsource.messages.errors.count
keda.cloudeventsource.messages.queued

On the help texts:

The texts in the comments are actually better than whatever is put in the help/descriptions.

Compare

https://github.com/kedacore/keda/pull/5259/files#diff-1b98f9b287bb8783cd25e2877e1897ecfca1e2a73c6852339679014d211c51ecR149

to

https://github.com/kedacore/keda/pull/5259/files#diff-9926d12b8e4d597d32b2cfbfec76cf18e448b776b73da5d014d57238c8089532R160

as an example. I'd suggest to try and rephrase the descriptions to be more comprehensive of what is being measured, and what the labels indicate.

Also, more general, I would suggest to wrap keda_cloudeventsource_emitted_messages_errors_total and keda_cloudeventsource_emitted_messages_total into one metric, as one is a subset of the other. You can use a label to differentiate between successful and errored messages. A "state" label would solve this. A such, a user can graph either the successful, the errored, or all messages (and start calculating percentage of success or error).

I have no insight in the inner workings, probably the queue-depth might also be a state on the "messages". If the metric actually involves a single message being either queue, sent or failed, it makes sense to wrap all this in a single metric.

I don't know how static/variable the name of the queue and the sink are. If they are highly distinctive (many different names over time), they might give some cardinality explosion on the metrics - hard to judge myself however)

I would drop the fake metric which exposes the name of the sink and the "active/non-active" state. This has no value as a metric, if you want to track the enabling/disabling of these sinks, this should be done through logging. If you want to graph the existance of a queue, you might want to get the label names of any queue which has emitted a message in the last X time.

Btw, this is all very much my personal preferences; @JorTurFer might want to indicate what is nice to have, or what is required in your PR.

@JorTurFer
Copy link
Member

Really interesting feedback @wonko
Thanks! 🙇
WDYT @kedacore/keda-core-contributors ?

@tomkerkhove
Copy link
Member

I am OK with the suggestions, but we should document the requirements as part of the refactoring so that contributors know what to expect. @wonko Are you opening in opening a proposal PR?

@wonko
Copy link
Contributor

wonko commented Dec 7, 2023

I am OK with the suggestions, but we should document the requirements as part of the refactoring so that contributors know what to expect. @wonko Are you opening in opening a proposal PR?

is there already a document which has (partial) info on these topics?

@zroubalik
Copy link
Member

I am OK with the suggestions, but we should document the requirements as part of the refactoring so that contributors know what to expect. @wonko Are you opening in opening a proposal PR?

is there already a document which has (partial) info on these topics?

I would a section somewhere around here https://github.com/kedacore/keda/blob/main/CONTRIBUTING.md#contributing-webhooks

Thanks a lot, that's really valuable.

@wonko
Copy link
Contributor

wonko commented Dec 8, 2023

I am OK with the suggestions, but we should document the requirements as part of the refactoring so that contributors know what to expect. @wonko Are you opening in opening a proposal PR?

is there already a document which has (partial) info on these topics?

I would a section somewhere around here https://github.com/kedacore/keda/blob/main/CONTRIBUTING.md#contributing-webhooks

Thanks a lot, that's really valuable.

See #5270

@tomkerkhove
Copy link
Member

tomkerkhove commented Dec 11, 2023

/run-e2e
Update: You can check the progress here

Signed-off-by: SpiritZhou <[email protected]>
@tomkerkhove
Copy link
Member

tomkerkhove commented Dec 11, 2023

/run-e2e
Update: You can check the progress here

@SpiritZhou
Copy link
Contributor Author

I saw the e2e failure is about azure-pipeline-test. Do you have any idea about it? @JorTurFer

@zroubalik
Copy link
Member

zroubalik commented Dec 21, 2023

/run-e2e sequential
Update: You can check the progress here

@zroubalik
Copy link
Member

@SpiritZhou do we have doc PR for this?

Signed-off-by: SpiritZhou <[email protected]>
@SpiritZhou
Copy link
Contributor Author

SpiritZhou commented Dec 22, 2023

@SpiritZhou do we have doc PR for this?

Docs updated.

@zroubalik
Copy link
Member

zroubalik commented Dec 22, 2023

/run-e2e sequential
Update: You can check the progress here

@zroubalik zroubalik merged commit 531dd6d into kedacore:main Dec 22, 2023
19 checks passed
sguruvar pushed a commit to sguruvar/keda that referenced this pull request Dec 24, 2023
toniiiik pushed a commit to toniiiik/keda that referenced this pull request Jan 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Provide CloudEvents system metrics in Prometheus & OpenTelemetry
5 participants