Queue Monitoring #3666

ninosamson · 2024-08-28T20:12:49Z

User Story
As a product team, we need to monitor the queues and associated SFTP folders to diagnose errors in processing the files, and confirm archiving behaviour for failed files.

Acceptance Criteria

Review the monitoring of the processing queues (and associated SFTP folders) for errors and failures.
Investigate and implement basic alerting as available through the Bull framework.
If not available through Bull, create a new ticket for the implementation of alerting.
Nice to have:
- Create some POC towards the acceptable plan.
- Include the team in the discussions and get the team's understanding.
- Check the possibility of using global event handlers to send notifications, for instance, create an email.
- Check if sysdig can get some status from the queues.
- Check queues failing silently and the effort to force them to fail in the dashboard. If possible have them fixed.

Note for IMB team:

Following this ticket, the SIMS IMB team will develop a recommended process for investigation, tracking of incidents/tickets, and retries to address failed jobs/files.
We also need to confirm and implement the desired behaviour for failed files (archive or move to another folder).
- Example: Currently PT and FT feedback integration files are not being archived if they are erroring, even though they are non-sequential.

@michesmith to create epic to track this and subsequent tickets.

- Enabled the `/metrics` endpoint on `queue-consumers` to expose Prometheus metrics following the [BC Gov docs](https://developer.gov.bc.ca/docs/default/component/platform-developer-docs/docs/app-monitoring/user-defined-monitoring/).
- Added [prom-client](https://github.com/siimon/prom-client), also following the BC Gov docs mentioned above; it also seems to be the most widely adopted client.
- Created a [Gauge metric](https://prometheus.io/docs/concepts/metric_types/#gauge) to capture the most recent summary of every active queue, adding to the metrics a summary of `active`, `completed`, `failed`, `delayed`, and `waiting` jobs. This represents the same status observed in the Bull Board. The gauge metrics rely on Redis queries and will always get the most up-to-date values from Redis.

![image](https://github.com/user-attachments/assets/f5bd582a-4eba-41c5-ba03-77adb17b5584)

- Created a [Counter metric](https://prometheus.io/docs/concepts/metric_types/#counter) to capture all the local events triggered for a queue. This happens only in-memory and is captured every time the `metrics` endpoint is invoked. This metric is not needed to achieve queue monitoring but seems a great addition and useful data to support future analysis.
- Enabled the default [nodejs metrics](https://github.com/siimon/prom-client?tab=readme-ov-file#default-metrics), also following the [Prometheus recommendations](https://prometheus.io/docs/instrumenting/writing_clientlibs/#standard-and-runtime-collectors).
- Both metrics are incremented/set using the labels `queueName`, `queueEvent`, and `queueType` to allow querying in Sysdig.

## Sysdig POCs

The Sysdig configurations were created to support the validation of the code in this PR and should not be considered final or part of the PR evaluation.

### Alerts generated for a queue with a failed job.

```
max(queue_job_counts_current_total{queueEvent="failed",kube_namespace_name="0c27fb-dev"}) by (queueName) > 0
```

![image](https://github.com/user-attachments/assets/4cd5fc0e-7001-4b76-a01b-5e50761252a7)

![image](https://github.com/user-attachments/assets/237f5d24-8da4-4a8c-9fcf-25784c554229)

### Same alerts are configured to be sent using an email channel.

![image](https://github.com/user-attachments/assets/811c5cdb-17f0-4cde-af43-dccc662c5d3e)

## Sample Dashboard

The [Queues Overview](https://app.sysdigcloud.com/#/dashboards/419533?last=3600&scope=kubernetes.namespace.name%20%3D%20%3F%220c27fb-dev%22) dashboard has some examples of data but should not be considered final or part of the PR evaluation.

![image](https://github.com/user-attachments/assets/6a68d4f3-2890-4d28-b8e6-b718085955db)

_Note:_ the Sysdig users and roles were updated and added to this PR, and the update is already deployed to both tools environments. If time allows, further effort can be made to enhance the current process, but any action beyond the user list update is not part of this PR.
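To make the shape of the gauge data concrete, the sketch below renders per-queue job counts as Prometheus text exposition lines using the metric name and labels mentioned above. It is only an illustration under assumptions: it does not use prom-client, and `QueueSnapshot`, `renderGaugeLines`, the sample queue name, and the `queueType` value are hypothetical names that are not taken from the PR.

```ts
// Job states summarized by the gauge, as listed in the PR description.
const QUEUE_EVENTS = ["active", "completed", "failed", "delayed", "waiting"] as const;
type QueueEvent = (typeof QUEUE_EVENTS)[number];

// Hypothetical snapshot of one queue; in the PR the counts come from Redis.
interface QueueSnapshot {
  queueName: string;
  queueType: string; // assumed free-form value, e.g. "consumer" or "scheduler"
  counts: Record<QueueEvent, number>;
}

// Metric name taken from the alert query in the PR description.
const METRIC_NAME = "queue_job_counts_current_total";

/**
 * Renders one gauge sample per queue and job state in the Prometheus
 * text exposition format. Label values are assumed to need no escaping.
 */
function renderGaugeLines(snapshots: QueueSnapshot[]): string[] {
  const lines = [`# TYPE ${METRIC_NAME} gauge`];
  for (const snapshot of snapshots) {
    for (const queueEvent of QUEUE_EVENTS) {
      const labels =
        `queueName="${snapshot.queueName}",` +
        `queueEvent="${queueEvent}",` +
        `queueType="${snapshot.queueType}"`;
      lines.push(`${METRIC_NAME}{${labels}} ${snapshot.counts[queueEvent]}`);
    }
  }
  return lines;
}

// Usage: a made-up queue with one failed job.
const sampleLines = renderGaugeLines([
  {
    queueName: "sample-queue",
    queueType: "consumer",
    counts: { active: 0, completed: 5, failed: 1, delayed: 0, waiting: 2 },
  },
]);
console.log(sampleLines.join("\n"));
```

A sample with `queueEvent="failed"` and a value greater than zero is what the alert query above would match for that `queueName`.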

This PR is a proposal to unify the queue behavior. It contains changes for the start and cancel assessment regular queues, and the idea is to expand it to all schedulers. The image below shows the points to be shared and enforced for all the queues, using the CAS scheduler as an example.

1. Return type.
2. Provide a `processSummary`.
3. try/catch and default start message.
4. Error check on `processSummary` to force the jobs to fail.
5. Error handling on `catch`.
6. Final log activities during `finally`.

![image](https://github.com/user-attachments/assets/4b31491b-f859-4ef6-a126-67ad60bf7bb0)

_Note:_ The above will also ensure the E2E tests follow a more consistent approach.

Sample log for a queue using the new `BaseQueue<T>` class.

![image](https://github.com/user-attachments/assets/f42193dc-5e57-4f8a-94a8-c4a78df39084)

### Side effects

The idea is to have the `BaseScheduler` extend the `BaseQueue`, which will force a change in every single scheduler (which is the intent). To avoid the need to change every single scheduler and adapt every E2E test at once, the methods required by `BaseQueue` can be implemented temporarily as below. The idea is to start the scheduler refactor within this ticket.

```ts
/**
 * To be implemented.
 */
processQueue(_job: Job<void>): Promise<string | string[]> {
  throw new Error("Method not implemented.");
}

/**
 * To be implemented.
 */
async process(
  _job: Job<void>,
  _processSummary: ProcessSummary,
): Promise<string | string[]> {
  throw new Error("Method not implemented.");
}
```

### Error Fix

Fixed an issue with the logger service where a `null` message was logged right after the regular message.
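The six points listed for the unified queue behavior can be sketched as a small template-method class. This is a minimal, self-contained illustration and not the PR's actual `BaseQueue<T>`: the `Job` and `ProcessSummary` types below are simplified stand-ins, and the message texts, the `log` method, and `SilentlyFailingQueue` are assumptions made for the example.

```ts
// Simplified stand-in for Bull's Job type (assumption for this sketch).
interface Job<T> {
  id: string;
  data: T;
}

// Simplified stand-in for the project's ProcessSummary (assumption).
class ProcessSummary {
  private readonly entries: { level: "info" | "error"; message: string }[] = [];
  info(message: string): void {
    this.entries.push({ level: "info", message });
  }
  error(message: string): void {
    this.entries.push({ level: "error", message });
  }
  hasErrors(): boolean {
    return this.entries.some((entry) => entry.level === "error");
  }
  toLines(): string[] {
    return this.entries.map((entry) => `${entry.level}: ${entry.message}`);
  }
}

abstract class BaseQueue<T> {
  // (1) Every queue shares the same return type.
  async processQueue(job: Job<T>): Promise<string | string[]> {
    const processSummary = new ProcessSummary(); // (2)
    try {
      processSummary.info(`Processing queue job ${job.id}.`); // (3)
      const result = await this.process(job, processSummary);
      if (processSummary.hasErrors()) {
        // (4) Errors recorded in the summary force the job to fail.
        throw new Error("Errors were reported while processing the queue.");
      }
      return result;
    } catch (error) {
      // (5) Record the error and rethrow so the job is marked as failed.
      const message = error instanceof Error ? error.message : String(error);
      processSummary.error(`Unexpected error: ${message}`);
      throw error;
    } finally {
      this.log(processSummary.toLines()); // (6)
    }
  }

  /** Queue-specific work, implemented by each derived class. */
  protected abstract process(
    job: Job<T>,
    processSummary: ProcessSummary,
  ): Promise<string | string[]>;

  /** Hypothetical logging hook; the real class would use its logger service. */
  protected log(lines: string[]): void {
    lines.forEach((line) => console.log(line));
  }
}

// Usage: a queue that records an error without throwing still fails.
class SilentlyFailingQueue extends BaseQueue<void> {
  protected async process(
    _job: Job<void>,
    processSummary: ProcessSummary,
  ): Promise<string | string[]> {
    processSummary.error("File could not be processed.");
    return "Completed.";
  }
}
```

The check in step 4 is what would keep a job from failing silently: a derived class only has to record the error in the summary, and the base class turns it into a failed job.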

As per the conversation on the [Teams chat](https://teams.microsoft.com/l/message/19:[email protected]/1733175923577?tenantId=6fdb5200-3d0d-4a8a-b036-d3685e359adc&groupId=454b1d3b-af0f-4b44-b891-f4320c85d290&parentMessageId=1733175923577&teamName=External%3A%20AEST-SIMS&channelName=DEVS&createdTime=1733175923577), adjusted the queues to use the out-of-the-box cleanup configuration.

- Configuration reference: https://github.com/OptimalBits/bull/blob/develop/REFERENCE.md#keepjobs-options
- For simplicity, the DB settings are not changed in this PR; the existing setting is only adapted to allow the removal of the cleanup methods. Previously, only schedulers executed the cleanup. With this change, all the queues will have their completed tasks removed as per the existing DB setting.
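As a rough illustration of the cleanup configuration referenced above, the sketch below maps a stored cleanup period to a `removeOnComplete` option in the `KeepJobs` shape (`age` in seconds) described in the Bull reference. The `KeepJobs` interface is mirrored locally so the snippet stands alone, and `buildCleanupOptions` and its milliseconds-based parameter are hypothetical; how the existing DB setting is actually stored is not stated in this PR.

```ts
// Local mirror of Bull's KeepJobs options, per the linked reference:
// age is the maximum age in seconds, count is the maximum number of jobs kept.
interface KeepJobs {
  age?: number;
  count?: number;
}

/**
 * Builds the job options that let Bull remove completed jobs on its own.
 * The parameter is a hypothetical DB setting expressed in milliseconds.
 */
function buildCleanupOptions(cleanUpPeriodMs: number): { removeOnComplete: KeepJobs } {
  return {
    removeOnComplete: { age: Math.floor(cleanUpPeriodMs / 1000) },
  };
}

// Usage: keep completed jobs for one hour.
const cleanupOptions = buildCleanupOptions(60 * 60 * 1000);
```

Options like these could be passed as default job options when a queue is registered, which would remove the need for a separate cleanup method in each scheduler.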

- Extended the `BaseScheduler` from the `BaseQueue` to enforce all the same basic features, like `processSummary` and error handling, as agreed during PR #4020.
- Started the refactor of some schedulers to be used as a baseline for the upcoming refactors.
  - cas-supplier-integration.scheduler.ts
  - cra-process-integration.scheduler.ts
  - cra-response-integration.scheduler.ts
- Added the below code to every scheduler to be refactored.

_Note:_ **this is also responsible for the Sonarcloud duplication issue.** **There are 23 files that have exactly the same change.**

```ts
/**
 * To be removed once the method {@link process} is implemented.
 * This method "hides" the {@link Process} decorator from the base class.
 */
async processQueue(): Promise<string | string[]> {
  throw new Error("Method not implemented.");
}

/**
 * When implemented in a derived class, process the queue job.
 * To be implemented.
 */
protected async process(): Promise<string | string[]> {
  throw new Error("Method not implemented.");
}
```

This ticket is to add missing team members to Sysdig, as agreed during the development of ticket #3666.

_Please note the ticket is not associated with the PR because it is closed, and it does not seem worth reopening it or creating a new ticket only for this task._

Until now, updating and applying the changes was a manual process that required editing the file by hand before applying it to each license plate. The intention of this PR is to make it simple.

- Converted the Sysdig team to a template to allow adding a parameter.
- Created a make command to execute the template in the current two license plates.

Technical documentation about Sysdig team setup: https://developer.gov.bc.ca/docs/default/component/platform-developer-docs/docs/app-monitoring/sysdig-monitor-setup-team/?utm_source=digital&utm_medium=web&utm_campaign=sysdig-monitor

ninosamson self-assigned this Aug 28, 2024

ninosamson added the Business Items under Business Consideration label Aug 28, 2024

ninosamson changed the title ~~SFTP File Handling~~ SFTP File Handling Process Recommendations Aug 29, 2024

ninosamson assigned JasonCTang Aug 29, 2024

ninosamson changed the title ~~SFTP File Handling Process Recommendations~~ SFTP File Monitoring Aug 29, 2024

ninosamson changed the title ~~SFTP File Monitoring~~ Queue/SFTP Monitoring Aug 29, 2024

ninosamson added Queue Consumers Integration labels Aug 29, 2024

ninosamson changed the title ~~Queue/SFTP Monitoring~~ Queue Monitoring Sep 3, 2024

ninosamson added Dev & Architecture Development and Architecture and removed Business Items under Business Consideration labels Sep 3, 2024

andrewsignori-aot removed the Dev & Architecture Development and Architecture label Sep 16, 2024

ninosamson assigned andrewsignori-aot and unassigned JasonCTang Nov 20, 2024

andrewsignori-aot added a commit that referenced this issue Dec 10, 2024

Merge branch 'main' into feature/#3666-add-base-queue-to-schedulers

eaec81f

andrewsignori-aot added a commit that referenced this issue Dec 11, 2024

Merge branch 'main' into feature/#3666-add-base-queue-to-schedulers

bd1dfaf

andrewsignori-aot mentioned this issue Dec 11, 2024

Queue Monitoring - Schedulers Refactor #4076

Closed

3 tasks

ninosamson closed this as completed Dec 18, 2024

AnnaPBashkatova added this to the 2.1 Post Part-Time Primera Importante Mucho milestone Jan 7, 2025

andrewsignori-aot mentioned this issue Jan 14, 2025

#3666 - Queue Monitoring - Sysdig Team Update #4233

Merged
