# User Request Error Tracing in Production Environments
One of the benefits of SWODLR's asynchronous, microservice-based design is also one of the features that can make execution errors harder to trace in production environments. To help with this, SWODLR provides a few ways for operators to trace user requests in the system back to the logs associated with those requests.
To start, a user should provide an operator with the product ID associated with their specific request. This ID is a UUID that can be found in either SWODLR-UI or the SWODLR API. With this ID known, operators can investigate CloudWatch log groups further to find where errors have occurred.

Example product ID: `2b8dc090-a4d2-4e33-9f1d-4cdbe713b878`
SWODLR's user request flow consists of several stages, detailed in the table below. Next to each stage is a description and the log groups that should be searched first when tracing a user request.
Stage | Explanation | Log Groups
---|---|---
`NEW` | A user's request has just been received and is now being queued for execution | `service-swodlr-uat-raster-create-bootstrap`
`GENERATING` | A user's request is now executing on the SDS and a product is being generated | `service-swodlr-uat-raster-create-submit_evaluate`<br>`service-swodlr-uat-raster-create-submit_raster`<br>`service-swodlr-uat-raster-create-wait_for_complete`
`READY` | A user's request has finished generating and is now awaiting publishing | `service-swodlr-uat-raster-create-wait_for_complete`<br>`service-swodlr-uat-raster-publish_data`
`ERROR` | An error occurred during product generation; the user should be given some basic information about this error. If the error was caused by the SDS, no further tracing can be performed on the SWODLR end of the system. | `service-swodlr-uat-raster-create-submit_evaluate`<br>`service-swodlr-uat-raster-create-submit_raster`<br>`service-swodlr-uat-raster-create-wait_for_complete`
`AVAILABLE` | The product is now available for downloading | —
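
For operators who script their traces, the stage-to-log-group mapping in the table can be captured directly. Below is a minimal sketch in Python; the log group names are copied from the table above, but the mapping itself is just an illustration, not part of SWODLR:

```python
# Candidate log groups to search first, keyed by product status.
# Names are the UAT-environment log groups from the table above.
LOG_GROUPS_BY_STAGE = {
    "NEW": ["service-swodlr-uat-raster-create-bootstrap"],
    "GENERATING": [
        "service-swodlr-uat-raster-create-submit_evaluate",
        "service-swodlr-uat-raster-create-submit_raster",
        "service-swodlr-uat-raster-create-wait_for_complete",
    ],
    "READY": [
        "service-swodlr-uat-raster-create-wait_for_complete",
        "service-swodlr-uat-raster-publish_data",
    ],
    "ERROR": [
        "service-swodlr-uat-raster-create-submit_evaluate",
        "service-swodlr-uat-raster-create-submit_raster",
        "service-swodlr-uat-raster-create-wait_for_complete",
    ],
    "AVAILABLE": [],  # product published; no further SWODLR log groups to search
}
```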
Below is an example of a SWODLR log entry. The log message is prefixed by the log level, timestamp, lambda execution ID, and user request information such as `product_id` and `job_id`. A log message tied to a user's request will always contain a `product_id` field, while more general execution logs may not.
```
[ERROR] 2024-02-26T20:54:01.836Z 4ea0379e-e4ad-40a9-a955-06eae359c42f [product_id: 2b8dc090-a4d2-4e33-9f1d-4cdbe713b878, job_id: 55edeef4-62b0-482c-9c6b-56e237295fc3] Failed to get job info
Traceback (most recent call last):
  File "/var/task/podaac/swodlr_raster_create/wait_for_complete.py", line 34, in handle_jobs
    job_info = utils.mozart_client.get_job_by_id(job_id).get_info()
  File "/var/task/otello/mozart.py", line 154, in get_job_by_id
    raise Exception(req.text)
Exception: {
  "success": false,
  "message": "job info not found: 55edeef4-62b0-482c-9c6b-56e237295fc3",
  "result": null
}
```
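
Because this prefix format is consistent, the request identifiers can be pulled out of a log line with a script. Below is a minimal sketch in Python; the regex is inferred from the example entry above and is an assumption, not a documented format guarantee:

```python
import re

# Pattern inferred from the example log entry above:
# [LEVEL] timestamp execution-id [product_id: ..., job_id: ...] message
# Logs without a product_id/job_id prefix will simply not match.
LOG_PATTERN = re.compile(
    r"\[(?P<level>\w+)\]\s+"
    r"(?P<timestamp>\S+)\s+"
    r"(?P<execution_id>\S+)\s+"
    r"\[product_id: (?P<product_id>[0-9a-f\-]+), job_id: (?P<job_id>[0-9a-f\-]+)\]\s+"
    r"(?P<message>.*)"
)

line = (
    "[ERROR] 2024-02-26T20:54:01.836Z 4ea0379e-e4ad-40a9-a955-06eae359c42f "
    "[product_id: 2b8dc090-a4d2-4e33-9f1d-4cdbe713b878, "
    "job_id: 55edeef4-62b0-482c-9c6b-56e237295fc3] Failed to get job info"
)

match = LOG_PATTERN.match(line)
if match:
    print(match.group("product_id"))  # 2b8dc090-a4d2-4e33-9f1d-4cdbe713b878
    print(match.group("job_id"))      # 55edeef4-62b0-482c-9c6b-56e237295fc3
```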
The SWODLR API provides the timestamp at which every status within the system was updated. Combining this data with AWS's log group filtering allows operators to search through logs efficiently.
For example, let's say there's a user whose product has generated an error within the system, and the ID for that product is `2b8dc090-a4d2-4e33-9f1d-4cdbe713b878`. We can query the SWODLR API for the current and past statuses of that product:
```graphql
{
  l2RasterProduct(id: "2b8dc090-a4d2-4e33-9f1d-4cdbe713b878") {
    status(limit: 10) {
      id
      timestamp
      state
      reason
    }
  }
}
```
```json
{
  "data": {
    "l2RasterProduct": {
      "status": [
        {
          "id": "ecbada18-3fd1-407f-9e95-faf2d726527f",
          "timestamp": "2024-02-26T20:59:08.195",
          "state": "GENERATING",
          "reason": null
        },
        {
          "id": "f7d2f162-7cf5-4de3-932b-767866c7e476",
          "timestamp": "2024-02-26T20:53:57.807182",
          "state": "NEW",
          "reason": null
        }
      ]
    }
  }
}
```
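
The same query can also be issued programmatically. Below is a minimal sketch using Python's `requests`; the endpoint URL is a placeholder and authentication is omitted, as both are deployment-specific assumptions:

```python
import requests

# Placeholder endpoint -- substitute the real SWODLR API host. Any required
# authentication (e.g. Earthdata Login tokens) is omitted from this sketch.
SWODLR_GRAPHQL_URL = "https://<swodlr-api-host>/graphql"

QUERY = """
{
  l2RasterProduct(id: "2b8dc090-a4d2-4e33-9f1d-4cdbe713b878") {
    status(limit: 10) {
      id
      timestamp
      state
      reason
    }
  }
}
"""

response = requests.post(SWODLR_GRAPHQL_URL, json={"query": QUERY}, timeout=30)
response.raise_for_status()

# Print the status history, newest first, as returned by the API
for status in response.json()["data"]["l2RasterProduct"]["status"]:
    print(status["timestamp"], status["state"], status["reason"])
```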
We can see that the product was left in the `GENERATING` stage when it was last seen by SWODLR. Taking the product ID, the timestamp of the last event, and the log groups from the table above, we can search through CloudWatch for signs of where the pipeline has gone awry. We'll start by looking at the `submit_evaluate` log group.
First, set the search window to the approximate time at which the status you're investigating occurred.

Within the search, enter: `%product_id: [PUT_PRODUCT_ID_HERE]%`. This filter searches for occurrences of the product ID, which is appended to every log message associated with a product generation. For this example, we'll use `%product_id: 2b8dc090-a4d2-4e33-9f1d-4cdbe713b878%`.
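
The same filter can be applied programmatically with boto3's `filter_log_events`, using the status timestamps from the API to bound the time window. Below is a sketch assuming the log groups live under the standard `/aws/lambda/` prefix, which may differ in your deployment:

```python
import boto3
from datetime import datetime, timezone

logs = boto3.client("logs")

product_id = "2b8dc090-a4d2-4e33-9f1d-4cdbe713b878"

# Time window bracketing the last status timestamp from the API response above
start = datetime(2024, 2, 26, 20, 50, tzinfo=timezone.utc)
end = datetime(2024, 2, 26, 21, 10, tzinfo=timezone.utc)

# CloudWatch Logs treats %...% filter patterns as regular expressions;
# this is the same filter used in the console example above.
response = logs.filter_log_events(
    logGroupName="/aws/lambda/service-swodlr-uat-raster-create-submit_evaluate",
    filterPattern=f"%product_id: {product_id}%",
    startTime=int(start.timestamp() * 1000),
    endTime=int(end.timestamp() * 1000),
)

for event in response["events"]:
    print(event["message"])
```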

This should now provide you with logs from which the behavior of the system can be investigated further. Happy error tracing!