#3666 - Queue Monitoring - Enable Prometheus Metrics #4012

Merged: 14 commits, Dec 2, 2024

@andrewsignori-aot andrewsignori-aot commented Nov 28, 2024

  • Enabled the /metrics endpoint on queue-consumers to expose Prometheus metrics, following the BC Gov docs.
  • Added prom-client, which the BC Gov docs mentioned above also use and which seems to be widely adopted.
  • Created a Gauge metric that captures the most recent summary of every active queue, adding the counts of active, completed, failed, delayed, and 'waiting' jobs to the metrics. These are the same statuses shown in the Bull Board. The gauge metrics rely on Redis queries and always get the most up-to-date values from Redis.
    image
  • Created a Counter metric that captures all the local events triggered for a queue. The counting happens only in memory, and the values are collected every time the metrics endpoint is invoked. This metric is not needed for queue monitoring but seems a useful addition that can support future analysis.
  • Enabled the default Node.js metrics, also following Prometheus recommendations.
  • Both metrics are incremented or set using the labels queueName, queueEvent, and queueType to allow querying in Sysdig.
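
To illustrate the difference between the two metric types above, here is a minimal dependency-free sketch. It is not prom-client's API and not the code in this PR; the LabelledMetric class, the counter name queue_event_total, and the label values are hypothetical. Only the gauge name queue_job_counts_current_total and the three label names come from this PR. The output roughly follows the Prometheus text exposition format.

```typescript
// Labels named in the PR description; the values used below are hypothetical.
interface QueueLabels {
  queueName: string;
  queueEvent: string;
  queueType: string;
}

// Builds the label part of a sample line. Assumes values need no escaping.
function formatLabels(labels: QueueLabels): string {
  return (
    `queueName="${labels.queueName}",` +
    `queueEvent="${labels.queueEvent}",` +
    `queueType="${labels.queueType}"`
  );
}

// Hypothetical stand-in for a labelled metric, for illustration only.
class LabelledMetric {
  private readonly samples: Record<string, number> = {};

  constructor(
    private readonly name: string,
    private readonly type: "gauge" | "counter",
  ) {}

  // Gauge semantics: the sample is replaced by the most recent value.
  set(labels: QueueLabels, value: number): void {
    if (this.type !== "gauge") {
      throw new Error("set() is only valid for gauges.");
    }
    this.samples[formatLabels(labels)] = value;
  }

  // Counter semantics: the sample only ever grows.
  inc(labels: QueueLabels, amount = 1): void {
    if (this.type !== "counter") {
      throw new Error("inc() is only valid for counters.");
    }
    const key = formatLabels(labels);
    this.samples[key] = (this.samples[key] || 0) + amount;
  }

  // Renders a TYPE line followed by one line per labelled sample.
  render(): string {
    const lines = [`# TYPE ${this.name} ${this.type}`];
    for (const key of Object.keys(this.samples)) {
      lines.push(`${this.name}{${key}} ${this.samples[key]}`);
    }
    return lines.join("\n");
  }
}

const failed: QueueLabels = {
  queueName: "sample-queue",
  queueEvent: "failed",
  queueType: "consumer",
};

// Gauge: the second reading overwrites the first.
const jobCounts = new LabelledMetric("queue_job_counts_current_total", "gauge");
jobCounts.set(failed, 3);
jobCounts.set(failed, 1);

// Counter (hypothetical name): each local event adds one.
const queueEvents = new LabelledMetric("queue_event_total", "counter");
queueEvents.inc(failed);
queueEvents.inc(failed);

console.log(jobCounts.render());
console.log(queueEvents.render());
```

The gauge ends at 1 because it mirrors the latest state read from the source, while the counter ends at 2 because it accumulates events seen by the process.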

Sysdig POCs

The Sysdig configurations were created only to support validating the code in this PR. They are not final and should not be considered part of the PR evaluation.

Alerts generated for a queue with a failed job.

max(queue_job_counts_current_total{queueEvent="failed",kube_namespace_name="0c27fb-dev"}) by (queueName) > 0

image

image

The same alerts are configured to be sent through an email channel.

image
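
As a rough illustration of what the alert expression above computes, the sketch below applies the same steps to an in-memory list of samples: keep the "failed" series for one namespace, take the maximum per queueName, and keep only the queues whose maximum is above zero. The Sample type, the queuesWithFailedJobs helper, and the sample data are hypothetical. The assumption that more than one series can exist per queue (for example, one per pod) is also mine and is not confirmed by this PR.

```typescript
// Hypothetical in-memory representation of one scraped sample.
interface Sample {
  labels: Record<string, string>;
  value: number;
}

// Mirrors: max(metric{queueEvent="failed",kube_namespace_name=...}) by (queueName) > 0
function queuesWithFailedJobs(
  samples: Sample[],
  namespace: string,
): Record<string, number> {
  // Step 1 and 2: filter by label matchers, then aggregate with max per queue.
  const maxByQueue: Record<string, number> = {};
  for (const sample of samples) {
    const { queueName, queueEvent, kube_namespace_name } = sample.labels;
    if (queueEvent !== "failed" || kube_namespace_name !== namespace) {
      continue;
    }
    const current = maxByQueue[queueName];
    maxByQueue[queueName] =
      current === undefined ? sample.value : Math.max(current, sample.value);
  }
  // Step 3: the "> 0" comparison drops queues without failed jobs.
  const firing: Record<string, number> = {};
  for (const queueName of Object.keys(maxByQueue)) {
    if (maxByQueue[queueName] > 0) {
      firing[queueName] = maxByQueue[queueName];
    }
  }
  return firing;
}

// Hypothetical data: queue-a is reported twice, as if by two pods.
const samples: Sample[] = [
  { labels: { queueName: "queue-a", queueEvent: "failed", kube_namespace_name: "0c27fb-dev" }, value: 2 },
  { labels: { queueName: "queue-a", queueEvent: "failed", kube_namespace_name: "0c27fb-dev" }, value: 0 },
  { labels: { queueName: "queue-b", queueEvent: "failed", kube_namespace_name: "0c27fb-dev" }, value: 0 },
  { labels: { queueName: "queue-c", queueEvent: "failed", kube_namespace_name: "other-namespace" }, value: 5 },
  { labels: { queueName: "queue-a", queueEvent: "completed", kube_namespace_name: "0c27fb-dev" }, value: 10 },
];

const firing = queuesWithFailedJobs(samples, "0c27fb-dev");
console.log(firing);
```

Grouping by queueName yields at most one alert per queue, no matter how many series report it.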

Sample Dashboard

The Queues Overview dashboard shows some sample data. It is not final and should not be considered part of the PR evaluation.

image

Note: the Sysdig users and roles were updated as part of this PR, and the change is already deployed to both tools environments. If time allows, further effort can go into improving the current process, but anything beyond the user list update is outside the scope of this PR.

@andrewsignori-aot andrewsignori-aot changed the title from "#3666 - POC queue monitoring" to "#3666 - Queue Monitoring - Enable Prometheus Metrics" Nov 29, 2024
@andrewsignori-aot andrewsignori-aot marked this pull request as ready for review November 29, 2024 16:30
@dheepak-aot dheepak-aot self-requested a review November 29, 2024 23:28

sonarqubecloud bot commented Dec 2, 2024

github-actions bot commented Dec 2, 2024

Backend Unit Tests Coverage Report

Totals Coverage
Statements: 21.99% ( 3742 / 17014 )
Methods: 10.08% ( 214 / 2123 )
Lines: 25.33% ( 3250 / 12830 )
Branches: 13.49% ( 278 / 2061 )

github-actions bot commented Dec 2, 2024

E2E Workflow Workers Coverage Report

Totals Coverage
Statements: 65.43% ( 583 / 891 )
Methods: 59.26% ( 64 / 108 )
Lines: 68.54% ( 464 / 677 )
Branches: 51.89% ( 55 / 106 )

github-actions bot commented Dec 2, 2024

E2E Queue Consumers Coverage Report

Totals Coverage
Statements: 83.73% ( 1292 / 1543 )
Methods: 81.65% ( 129 / 158 )
Lines: 85.21% ( 1095 / 1285 )
Branches: 68% ( 68 / 100 )

github-actions bot commented Dec 2, 2024

E2E SIMS API Coverage Report

Totals Coverage
Statements: 67.09% ( 5837 / 8700 )
Methods: 64.86% ( 720 / 1110 )
Lines: 71.08% ( 4588 / 6455 )
Branches: 46.61% ( 529 / 1135 )

    if (!this.monitoredQueueProviders) {
      const queues = await this.queueService.queueConfigurationModel();
      this.monitoredQueueProviders = queues
        .filter((queueModel) => queueModel.isActive)
Collaborator

Are we not ignoring the FT queues here?

Collaborator Author

The full-time toggle is not checked here. Queues that are not executing will not produce any events.
The events will be associated with those queues but never triggered, and I do not see any harm in that. Please let me know if this needs further discussion.

Collaborator

Not required and I am good based on what we discussed.
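
For readers following this thread, here is a simplified sketch of the selection being discussed: the monitored queues are resolved once, and only isActive is consulted. The QueueModel shape, the MonitoredQueues class, the synchronous loader, and the queue names are hypothetical. The real code is asynchronous and reads from the queue service.

```typescript
// Hypothetical, reduced shape of a queue configuration entry.
interface QueueModel {
  queueName: string;
  isActive: boolean;
}

class MonitoredQueues {
  private monitoredQueueNames?: string[];

  // The loader is synchronous here only to keep the sketch short.
  constructor(private readonly loadQueues: () => QueueModel[]) {}

  // Resolves the monitored queues on first use and reuses the result after.
  get(): string[] {
    if (!this.monitoredQueueNames) {
      this.monitoredQueueNames = this.loadQueues()
        // Only isActive is checked; no other toggle takes part in the selection.
        .filter((queueModel) => queueModel.isActive)
        .map((queueModel) => queueModel.queueName);
    }
    return this.monitoredQueueNames;
  }
}

let loadCount = 0;
const monitored = new MonitoredQueues(() => {
  loadCount++;
  return [
    { queueName: "full-time-queue", isActive: true },
    { queueName: "part-time-queue", isActive: true },
    { queueName: "retired-queue", isActive: false },
  ];
});

const first = monitored.get();
const second = monitored.get();
console.log(first, loadCount);
```

An active queue that never runs is still selected in this sketch. It simply never produces an event, which matches the reasoning given above.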

   */
  setGlobalMetricsConfigurations(): void {
    register.setDefaultLabels({ app: DEFAULT_METRICS_APP_LABEL });
    collectDefaultMetrics({ labels: { app: DEFAULT_METRICS_APP_LABEL } });
Collaborator

@dheepak-aot dheepak-aot Dec 2, 2024

For collectDefaultMetrics, is it not necessary to set the register? Is that because we are using the default registry? Let me know if I am missing something.

collectDefaultMetrics({
  labels: { app: DEFAULT_METRICS_APP_LABEL },
  register,
});

Collaborator Author

Yes, there is no need to set the register unless using a custom one.

image
https://github.com/siimon/prom-client?tab=readme-ov-file#default-metrics

Collaborator

@dheepak-aot dheepak-aot left a comment

Great work and amazing solution to share metrics to Sysdig.

Thanks for doing the change. 👍

Collaborator

@sh16011993 sh16011993 left a comment

Great Work @andrewsignori-aot 👍 Thank you for the PR walkthrough.

@andrewsignori-aot andrewsignori-aot added this pull request to the merge queue Dec 2, 2024