Queue Monitoring #3666
Comments
ninosamson changed the title from "SFTP File Handling" to "SFTP File Handling Process Recommendations" on Aug 29, 2024
ninosamson changed the title from "SFTP File Handling Process Recommendations" to "SFTP File Monitoring" on Aug 29, 2024
ninosamson added the "Dev & Architecture" (Development and Architecture) label and removed the "Business" (Items under Business Consideration) label on Sep 3, 2024
github-merge-queue bot pushed a commit that referenced this issue on Dec 2, 2024
- Enabled the `/metrics` endpoint on `queue-consumers` to expose Prometheus metrics following the [BC Gov docs](https://developer.gov.bc.ca/docs/default/component/platform-developer-docs/docs/app-monitoring/user-defined-monitoring/).
- Added [prom-client](https://github.com/siimon/prom-client), also following the BC Gov docs mentioned above; it is the client that seems largely adopted.
- Created a [Gauge metric](https://prometheus.io/docs/concepts/metric_types/#gauge) to capture the most recent summary of every active queue, which adds to the metrics the counts of `active`, `completed`, `failed`, `delayed`, and `waiting` jobs. This represents the same status observed in the Bull Board. The gauge metrics rely on Redis queries and will always get the most up-to-date values from Redis.

  ![image](https://github.com/user-attachments/assets/f5bd582a-4eba-41c5-ba03-77adb17b5584)
- Created a [Counter metric](https://prometheus.io/docs/concepts/metric_types/#counter) to capture all the local events triggered for a queue. This happens only in-memory and is captured every time the `metrics` endpoint is invoked. This metric is not needed to achieve queue monitoring but seems a great addition and useful data to support future analysis.
- Enabled the default [nodejs metrics](https://github.com/siimon/prom-client?tab=readme-ov-file#default-metrics), also following the [Prometheus recommendations](https://prometheus.io/docs/instrumenting/writing_clientlibs/#standard-and-runtime-collectors).
- Both metrics are incremented/set using the labels `queueName`, `queueEvent`, and `queueType` to allow querying in Sysdig.

## Sysdig POCs

The Sysdig configurations were created to support the validation of the code in this PR; they should not be considered final or part of the PR evaluation.

### Alerts generated for a queue with a failed job.

```
max(queue_job_counts_current_total{queueEvent="failed",kube_namespace_name="0c27fb-dev"}) by (queueName) > 0
```

![image](https://github.com/user-attachments/assets/4cd5fc0e-7001-4b76-a01b-5e50761252a7)
![image](https://github.com/user-attachments/assets/237f5d24-8da4-4a8c-9fcf-25784c554229)

### Same alerts are configured to be sent using an email channel.

![image](https://github.com/user-attachments/assets/811c5cdb-17f0-4cde-af43-dccc662c5d3e)

## Sample Dashboard

The [Queues Overview](https://app.sysdigcloud.com/#/dashboards/419533?last=3600&scope=kubernetes.namespace.name%20%3D%20%3F%220c27fb-dev%22) dashboard has some examples of data but should not be considered final or part of the PR evaluation.

![image](https://github.com/user-attachments/assets/6a68d4f3-2890-4d28-b8e6-b718085955db)

_Note:_ the Sysdig users and roles were updated and added to this PR, and the update is already deployed to both tools environments. If time allows, further effort can be made to enhance the current process, but any action beyond the user list update is not part of this PR.
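For illustration, the gauge described in this PR ends up as one sample per queue and job state in the text returned by `/metrics`, which is what the alert query filters on. The sketch below hand-renders that text in plain TypeScript instead of calling prom-client, so it is not the PR's implementation. The metric name and the label names come from the PR text and the alert query; the helper names (`JobCounts`, `renderQueueGauge`), the queue name, and the `queueType` value are assumptions made for the example.

```typescript
// Job states reported by the gauge, as listed in the PR description.
const STATES = ["active", "completed", "failed", "delayed", "waiting"] as const;

// Hypothetical shape of the per-queue summary read from Redis.
type JobCounts = Record<(typeof STATES)[number], number>;

// Metric name taken from the alert query in the PR.
const METRIC_NAME = "queue_job_counts_current_total";

// Escapes a label value as required by the Prometheus text exposition format.
function escapeLabelValue(value: string): string {
  return value
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n");
}

// Renders one gauge sample per job state for a single queue.
function renderQueueGauge(
  queueName: string,
  queueType: string,
  counts: JobCounts,
): string {
  const lines: string[] = [
    `# HELP ${METRIC_NAME} Most recent job counts per queue and state.`,
    `# TYPE ${METRIC_NAME} gauge`,
  ];
  for (const queueEvent of STATES) {
    const labels = [
      `queueName="${escapeLabelValue(queueName)}"`,
      `queueEvent="${queueEvent}"`,
      `queueType="${escapeLabelValue(queueType)}"`,
    ].join(",");
    lines.push(`${METRIC_NAME}{${labels}} ${counts[queueEvent]}`);
  }
  return lines.join("\n") + "\n";
}

// Usage with made-up values: "sample-queue" and "consumer" are placeholders.
console.log(
  renderQueueGauge("sample-queue", "consumer", {
    active: 1,
    completed: 20,
    failed: 2,
    delayed: 0,
    waiting: 5,
  }),
);
```

With samples in this shape, a query such as the alert above selects the `queueEvent="failed"` series and groups by `queueName`, so any queue with at least one failed job triggers the alert.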
github-merge-queue bot pushed a commit that referenced this issue on Dec 3, 2024
This PR is a proposal to unify the queue behavior. It contains changes for the start and cancel assessment regular queues, and the idea is to expand it to all schedulers.

The image below shows the points to be shared and enforced for all the queues, using the CAS scheduler as an example.

1. Return type.
2. Provide a `processSummary`.
3. try/catch and default start message.
4. Error check on `processSummary` to force the jobs to fail.
5. Error handling on `catch`.
6. Final log activities during `finally`.

![image](https://github.com/user-attachments/assets/4b31491b-f859-4ef6-a126-67ad60bf7bb0)

_Note:_ The above will also ensure the E2E tests follow a closer approach.

Sample log for a queue using the new `BaseQueue<T>` class.

![image](https://github.com/user-attachments/assets/f42193dc-5e57-4f8a-94a8-c4a78df39084)

### Side effects

The idea is to have the BaseScheduler extend the BaseQueue, which will force a change in every single scheduler (which is the intent). To avoid having to change every single scheduler and adapt every E2E test at once, the methods required by BaseQueue can be implemented temporarily as below. The idea is to start the scheduler refactor during this ticket.

```ts
/**
 * To be implemented.
 */
processQueue(_job: Job<void>): Promise<string | string[]> {
  throw new Error("Method not implemented.");
}

/**
 * To be implemented.
 */
async process(
  _job: Job<void>,
  _processSummary: ProcessSummary,
): Promise<string | string[]> {
  throw new Error("Method not implemented.");
}
```

### Error Fix

Fixed an issue with the logger service where a `null` message was logged right after the regular message.
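The six shared points listed in this PR can be sketched as a template method. This is a minimal, self-contained approximation and not the project's actual `BaseQueue<T>`: the `Job` interface and the `ProcessSummary` class below are simplified stand-ins, and `SampleQueue`, `log`, `hasErrors`, and `messages` are hypothetical names introduced for the example.

```typescript
// Simplified stand-in (assumption) for the Bull `Job` type.
interface Job<T> {
  id: string;
  data: T;
}

// Simplified stand-in (assumption) for the project's `ProcessSummary`.
class ProcessSummary {
  private readonly entries: { level: "info" | "error"; message: string }[] = [];
  info(message: string): void {
    this.entries.push({ level: "info", message });
  }
  error(message: string): void {
    this.entries.push({ level: "error", message });
  }
  hasErrors(): boolean {
    return this.entries.some((entry) => entry.level === "error");
  }
  messages(): string[] {
    return this.entries.map((entry) => `${entry.level}: ${entry.message}`);
  }
}

abstract class BaseQueue<T> {
  // (1) Shared return type for every queue.
  async processQueue(job: Job<T>): Promise<string | string[]> {
    // (2) Every execution gets its own process summary.
    const processSummary = new ProcessSummary();
    try {
      // (3) Default start message inside the shared try/catch.
      processSummary.info(`Processing queue job ${job.id}.`);
      const result = await this.process(job, processSummary);
      // (4) Any error recorded in the summary forces the job to fail.
      if (processSummary.hasErrors()) {
        throw new Error("One or more errors were reported during the process.");
      }
      return result;
    } catch (error) {
      // (5) Shared error handling: rethrow so the job is marked as failed.
      const message = error instanceof Error ? error.message : String(error);
      throw new Error(`Queue job ${job.id} failed: ${message}`);
    } finally {
      // (6) Final log activities, executed on success and on failure.
      this.log(processSummary.messages());
    }
  }

  // Queue-specific work implemented by each derived class.
  protected abstract process(
    job: Job<T>,
    processSummary: ProcessSummary,
  ): Promise<string | string[]>;

  // Hypothetical logging hook; a real implementation would use the logger service.
  protected log(lines: string[]): void {
    lines.forEach((line) => console.log(line));
  }
}

// Hypothetical derived queue used only to exercise the base class.
class SampleQueue extends BaseQueue<{ fail: boolean }> {
  protected async process(
    job: Job<{ fail: boolean }>,
    processSummary: ProcessSummary,
  ): Promise<string | string[]> {
    if (job.data.fail) {
      processSummary.error("Something went wrong.");
    }
    return "Sample job processed.";
  }
}

new SampleQueue()
  .processQueue({ id: "1", data: { fail: false } })
  .then((result) => console.log(result));
```

Keeping the try/catch/finally in the base class means a derived queue only records messages in the summary and returns its result; whether the job fails is decided in one place.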
github-merge-queue bot pushed a commit that referenced this issue on Dec 4, 2024
As per the conversation on the [Teams chat](https://teams.microsoft.com/l/message/19:[email protected]/1733175923577?tenantId=6fdb5200-3d0d-4a8a-b036-d3685e359adc&groupId=454b1d3b-af0f-4b44-b891-f4320c85d290&parentMessageId=1733175923577&teamName=External%3A%20AEST-SIMS&channelName=DEVS&createdTime=1733175923577), adjusted the queues to use the out-of-the-box cleanup configuration.

- Configuration reference: https://github.com/OptimalBits/bull/blob/develop/REFERENCE.md#keepjobs-options
- For simplicity, this PR does not change the DB settings; it only adapts the existing one to allow the removal of the cleanup methods.

Previously, only schedulers executed the cleanup. With this change, all the queues will have their completed tasks removed as per the existing DB setting.
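As a rough sketch of what the out-of-the-box cleanup looks like, Bull's `removeOnComplete` job option accepts a `KeepJobs` object with an `age` (in seconds) and/or a `count`, per the reference linked above. The example below mirrors that option shape locally so it runs without Bull installed. The `QueueCleanupSetting` type, its `cleanUpPeriodMs` field, and the milliseconds unit are assumptions; the actual DB setting may have a different name and unit.

```typescript
// Local mirror of Bull's documented KeepJobs options:
// `age` is the maximum age in seconds and `count` the maximum number of jobs kept.
interface KeepJobs {
  age?: number;
  count?: number;
}

// Hypothetical DB setting (assumption): how long completed jobs are kept, in milliseconds.
interface QueueCleanupSetting {
  cleanUpPeriodMs: number;
}

// Converts the hypothetical DB setting into the option Bull applies on job completion.
function buildCleanupOptions(setting: QueueCleanupSetting): {
  removeOnComplete: KeepJobs;
} {
  return {
    // Bull expects seconds, so the milliseconds value is converted and rounded up.
    removeOnComplete: { age: Math.ceil(setting.cleanUpPeriodMs / 1000) },
  };
}

// Usage with a made-up value of one hour.
console.log(buildCleanupOptions({ cleanUpPeriodMs: 60 * 60 * 1000 }));
```

Options in this shape can be supplied through the queue's `defaultJobOptions` or per job when calling `queue.add`, which is what lets Bull remove completed jobs without custom cleanup methods.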
andrewsignori-aot added a commit that referenced this issue on Dec 10, 2024
andrewsignori-aot added a commit that referenced this issue on Dec 11, 2024
3 tasks
andrewsignori-aot added a commit that referenced this issue on Dec 12, 2024
- Extended the `BaseScheduler` from the `BaseQueue` to enforce the same basic features, like `processSummary` and error handling, as agreed during PR #4020.
- Started the refactor of some schedulers to be used as a baseline for the upcoming refactors.
  - cas-supplier-integration.scheduler.ts
  - cra-process-integration.scheduler.ts
  - cra-response-integration.scheduler.ts
- Added the code below to every scheduler still to be refactored.

_Note:_ **this is also responsible for the Sonarcloud duplication issue.** **There are 23 files that have exactly the same change.**

```ts
/**
 * To be removed once the method {@link process} is implemented.
 * This method "hides" the {@link Process} decorator from the base class.
 */
async processQueue(): Promise<string | string[]> {
  throw new Error("Method not implemented.");
}

/**
 * When implemented in a derived class, process the queue job.
 * To be implemented.
 */
protected async process(): Promise<string | string[]> {
  throw new Error("Method not implemented.");
}
```
github-merge-queue bot pushed a commit that referenced this issue on Jan 14, 2025
This ticket adds the missing team members to Sysdig, as agreed during the development of ticket #3666.

_Please note the ticket is not associated with the PR because it is closed and does not seem worth opening or creating a ticket only for this task._

Until now, updating and applying the changes was a manual process that required editing the file by hand before applying it to each license plate. The intention of this PR is to make it simple.

- Converted the Sysdig team to a template to allow adding a parameter.
- Created a make command to execute the template in the current two license plates.

Technical documentation about Sysdig team setup: https://developer.gov.bc.ca/docs/default/component/platform-developer-docs/docs/app-monitoring/sysdig-monitor-setup-team/?utm_source=digital&utm_medium=web&utm_campaign=sysdig-monitor
User Story
As a product team, we need to monitor the queues and associated SFTP folders to diagnose errors in processing the files, and confirm archiving behaviour for failed files.
Acceptance Criteria
Note for IMB team:
@michesmith to create epic to track this and subsequent tickets.