A Spring Boot app that listens on an AWS queue and sends case notes to probation.
To build: `./gradlew build`
- `/health/ping`: will respond `{"status":"UP"}` to all requests. This should be used by dependent systems to check connectivity to this service, rather than calling the `/health` endpoint.
- `/health`: provides information about the application health and its dependencies. This should only be used by this service's health monitoring (e.g. pager duty) and not by other systems that wish to find out the state of the service.
- `/info`: provides information about the version of the deployed application.
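As a quick sanity check you can hit the ping endpoint with curl. This is a minimal sketch; the port below matches the local healthcheck port mentioned in the localstack section and may differ in your setup:

```bash
# ping a locally running instance (adjust the port to your local configuration)
curl http://localhost:8082/health/ping
# expected response: {"status":"UP"}
```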
Case notes to probation is best tested by the DPS front end. To manually smoke test / regression test:
- Navigate to DPS and search for a prisoner
- Add an OMIC case note to the prisoner
- Add a keyworker case note to the prisoner
- Wait 5 minutes or so and then check App Insights to see that the case note has been sent to probation:

```
requests
| where cloud_RoleName == "community-api"
| where name == "PUT /secure/nomisCaseNotes/{nomisId}/{caseNotesId}"
```
For offenders that don't yet exist in Delius this call will return a 404, which is then ignored.
Localstack has been introduced for some integration tests and it is also possible to run the application against localstack.
- In the root of the localstack project, run `sudo rm -rf /tmp/localstack && docker-compose down && docker-compose up` to clear down and then bring up localstack
- Start the Spring Boot app with profile='localstack'
- You can now use the aws CLI to send messages to the queue - see the sketch after this list
- The queue's health status should appear at the local healthcheck: http://localhost:8082/health
- Note that you will also need local copies of the OAuth server, Case notes API and Delius API running to do anything useful
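A minimal sketch of running against localstack and pushing a test message onto the queue. The `bootRun` task, the SQS endpoint port and the queue URL are assumptions - check the docker-compose file and the `localstack` profile configuration for the real values:

```bash
# start the app with the localstack profile (one common way to do this)
SPRING_PROFILES_ACTIVE=localstack ./gradlew bootRun

# send a test message to the queue via the aws CLI
# the endpoint port and queue URL below are assumptions - check your localstack setup
AWS_ACCESS_KEY_ID=foo AWS_SECRET_ACCESS_KEY=bar aws \
  --endpoint-url=http://localhost:4566 --region eu-west-2 \
  sqs send-message \
  --queue-url http://localhost:4566/000000000000/case_notes_queue \
  --message-body '{"example": "payload - the real format depends on the listener"}'
```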
With localstack now up and running (see previous section), run `./gradlew test`
When we fail to process a case note due to an unexpected error, an exception will be thrown and the case note will be moved to the DLQ.
If the failure was due to a recoverable error - e.g. network issues - then the DLQ message can and should be retried.
However, if the error is not recoverable - e.g. some new error scenario we weren't expecting - then we need to investigate the error and either:
- fix the bug that is causing the error OR
- handle and log the error so that the exception is no longer thrown and the message does not end up on the DLQ
- Import the swagger collection into Postman - link to API docs at the top of this README.
- Obtain an access token with the `ROLE_CASE_NOTE_QUEUE_ADMIN` role - #dps_tech_team will be able to help with that.
- Call the `/queue-admin/retry-all-dlq` endpoint to transfer all DLQ entries back onto the main queue - this should get rid of any messages with recoverable errors (see the curl sketch below).
- Check that the messages have gone from the DLQ by going to https://case-notes-to-probation.prison.service.justice.gov.uk/health
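A hedged sketch of that call with curl. The hostname matches the health URL above, but the HTTP method and exact auth requirements are assumptions - confirm them against the swagger docs:

```bash
# TOKEN must carry the ROLE_CASE_NOTE_QUEUE_ADMIN role
# the HTTP method (PUT) is an assumption - check the API docs
curl -X PUT \
  -H "Authorization: Bearer $TOKEN" \
  https://case-notes-to-probation.prison.service.justice.gov.uk/queue-admin/retry-all-dlq
```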
For messages that don't then disappear from the DLQ:
- cd into the `scripts` directory and run the `copy-dlq.sh` script, which copies the contents of the DLQ locally and summarises it in `summary.csv` (see the sketch after this list)
- run an AppInsights Logs query looking for exceptions shortly after the timestamp found in the csv
- if there was an error calling a DPS service, check the logs for that service and possibly check the data in DPS
- if there was an error calling a Delius service, check the Delius AWS logs and possibly check the data in Delius
- identify mitigation for the error - fix bug or ignore error
- once this code change is in production, transfer the DLQ messages onto the main queue again and all should now be handled without exceptions
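For reference, a rough sketch of the first step above - whether `copy-dlq.sh` needs any arguments or AWS credentials set up first is not covered here, so check the script itself before running it:

```bash
cd scripts
./copy-dlq.sh   # copies the DLQ contents locally and summarises it in summary.csv
```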
We've had issues in the past where a pod stopped reading from the queue but nobody noticed. Eventually all 4 pods stopped reading and we stopped sending case notes to probation. Nobody noticed for a couple of weeks.
To warn us if this happens again we've created an alert in Application Insights that fires if any of the pods stop producing telemetry events. The alert is called `Case Notes to Probation - office hours inactivity alert`. Note that the alert only fires during office hours as low volumes outside office hours trigger false positives.
If the alert fires, click on the View Search link, which should run the alert's query in Application Insights. Run `kubectl -n case-notes-to-probation-prod get pods` and compare the running pods with the pods from the query results. Restart the pod that doesn't appear in the query results with `kubectl -n case-notes-to-probation-prod delete pod <insert pod name here>`.