User Request Error Tracing in Production Environments


Introduction

One of the benefits of SWODLR's asynchronous, microservice-based design is also one of the features that can make it harder to track down execution errors in production environments. To help with this, SWODLR offers operators a few different ways to trace a user request back to the logs associated with that request.

Prerequisites

To start, a user should provide an operator with the product ID associated with their specific request. This ID is a UUID that can be found either in SWODLR-UI or through the SWODLR API. With this ID known, operators can investigate the CloudWatch log groups to find where errors have occurred.

Example Product ID: 2b8dc090-a4d2-4e33-9f1d-4cdbe713b878

Execution Flow

SWODLR's user request flow consists of several stages, detailed in the table below. Next to each stage is a description and the log groups that should be searched first when tracing a user request.

| Stage | Explanation | Log Groups |
| --- | --- | --- |
| NEW | A user's request has just been received and is now being queued for execution | service-swodlr-uat-raster-create-bootstrap |
| GENERATING | A user's request is now executing on the SDS and a product is being generated | service-swodlr-uat-raster-create-submit_evaluate, service-swodlr-uat-raster-create-submit_raster, service-swodlr-uat-raster-create-wait_for_complete |
| READY | A user's request has finished generating and is now awaiting publishing | service-swodlr-uat-raster-create-wait_for_complete, service-swodlr-uat-raster-publish_data |
| ERROR | An error occurred during product generation; the user should be given some basic information about this error. If the error was caused by the SDS, no further tracing can be performed on the SWODLR end of the system | service-swodlr-uat-raster-create-submit_evaluate, service-swodlr-uat-raster-create-submit_raster, service-swodlr-uat-raster-create-wait_for_complete |
| AVAILABLE | The product is now available for downloading | |
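
For operators who script their log searches, the table can also be restated as a simple Python mapping. This is only a convenience sketch: the group names shown are for the UAT venue and will differ in other environments.

# Sketch: states mapped to the log groups worth searching first
# (taken from the table above; UAT venue names, adjust per environment).
STATE_LOG_GROUPS = {
    'NEW': [
        'service-swodlr-uat-raster-create-bootstrap',
    ],
    'GENERATING': [
        'service-swodlr-uat-raster-create-submit_evaluate',
        'service-swodlr-uat-raster-create-submit_raster',
        'service-swodlr-uat-raster-create-wait_for_complete',
    ],
    'READY': [
        'service-swodlr-uat-raster-create-wait_for_complete',
        'service-swodlr-uat-raster-publish_data',
    ],
    'ERROR': [
        'service-swodlr-uat-raster-create-submit_evaluate',
        'service-swodlr-uat-raster-create-submit_raster',
        'service-swodlr-uat-raster-create-wait_for_complete',
    ],
    'AVAILABLE': [],
}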

Example Log Entry

This is an example of a SWODLR log entry. The log message is prefixed by log level, timestamp, Lambda execution ID, and user request information such as product_id and job_id. A log message tied to a user's request will always contain a product_id field, while more general execution logs may not.

[ERROR]	2024-02-26T20:54:01.836Z	4ea0379e-e4ad-40a9-a955-06eae359c42f	[product_id: 2b8dc090-a4d2-4e33-9f1d-4cdbe713b878, job_id: 55edeef4-62b0-482c-9c6b-56e237295fc3] Failed to get job info
Traceback (most recent call last):
  File "/var/task/podaac/swodlr_raster_create/wait_for_complete.py", line 34, in handle_jobs
    job_info = utils.mozart_client.get_job_by_id(job_id).get_info()
  File "/var/task/otello/mozart.py", line 154, in get_job_by_id
    raise Exception(req.text)
Exception: {
    "success": false,
    "message": "job info not found: 55edeef4-62b0-482c-9c6b-56e237295fc3",
    "result": null
}
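
For context, the bracketed request prefix could be attached to every record with a logging.LoggerAdapter, as in the sketch below. This is illustrative only and may not match SWODLR's actual logging setup; the log level, timestamp, and execution ID fields are added by the Lambda runtime's logging configuration.

# Illustrative sketch: attach a [product_id: ..., job_id: ...] prefix to
# every log record for a request so it can be traced back later.
import logging

logger = logging.getLogger('swodlr_raster_create')

class RequestAdapter(logging.LoggerAdapter):
    """Prepend request context to each log message."""
    def process(self, msg, kwargs):
        prefix = '[product_id: {product_id}, job_id: {job_id}]'.format(**self.extra)
        return f'{prefix} {msg}', kwargs

# One adapter per in-flight request (IDs here are the example values from above)
request_log = RequestAdapter(logger, {
    'product_id': '2b8dc090-a4d2-4e33-9f1d-4cdbe713b878',
    'job_id': '55edeef4-62b0-482c-9c6b-56e237295fc3',
})
request_log.error('Failed to get job info')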

Efficiently Tracing Errors

The SWODLR API records a timestamp for every status update in the system. Combining this data with AWS's Log Group filtering allows operators to search through logs efficiently.

For example, let's say a user's product has generated an error within the system. The ID for that product is 2b8dc090-a4d2-4e33-9f1d-4cdbe713b878. We can query the SWODLR API for the current and past statuses of that product.

Request

{
  l2RasterProduct(id: "2b8dc090-a4d2-4e33-9f1d-4cdbe713b878") {
    status(limit: 10) {
      id
      timestamp
      state
      reason
    }
  }
}

Response

{
	"data": {
		"l2RasterProduct": {
			"status": [
				{
					"id": "ecbada18-3fd1-407f-9e95-faf2d726527f",
					"timestamp": "2024-02-26T20:59:08.195",
					"state": "GENERATING",
					"reason": null
				},
				{
					"id": "f7d2f162-7cf5-4de3-932b-767866c7e476",
					"timestamp": "2024-02-26T20:53:57.807182",
					"state": "NEW",
					"reason": null
				}
			]
		}
	}
}
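
The same query can be issued programmatically. The sketch below uses the requests library; the endpoint URL and bearer token are placeholders and must be replaced with the SWODLR API URL and credentials for your environment.

# Sketch: post the status query above to the SWODLR GraphQL API.
import requests

SWODLR_GRAPHQL_URL = 'https://<swodlr-api-host>/api/graphql'  # placeholder
TOKEN = '<bearer token>'                                       # placeholder

query = '''
{
  l2RasterProduct(id: "2b8dc090-a4d2-4e33-9f1d-4cdbe713b878") {
    status(limit: 10) {
      id
      timestamp
      state
      reason
    }
  }
}
'''

response = requests.post(
    SWODLR_GRAPHQL_URL,
    json={'query': query},
    headers={'Authorization': f'Bearer {TOKEN}'},
    timeout=30,
)
response.raise_for_status()
for status in response.json()['data']['l2RasterProduct']['status']:
    print(status['timestamp'], status['state'], status['reason'])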

We can see that the product has been stuck in the GENERATING stage since it was last seen by SWODLR. Taking the product ID, the timestamp of the last event, and the log groups from the table above, we can search through CloudWatch for signs of where the pipeline went awry. We'll start by looking at the submit_evaluate log group.

First, set the search to the approximate time at which the status you're investigating occurred.

[Screenshot: CloudWatch log group search with the time range set]

Within the search box, enter: %product_id: [PUT_PRODUCT_ID_HERE]%. This filter searches for occurrences of the product ID, which is included in every log message associated with a product generation. For this example, we'll use %product_id: 2b8dc090-a4d2-4e33-9f1d-4cdbe713b878%.

[Screenshot: filtered log events matching the product_id]
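
The same search can be run with boto3, assuming AWS credentials for the account that owns the log groups. Note that filter_log_events uses CloudWatch Logs filter pattern syntax, so the sketch below uses a quoted literal term rather than the console's %regex% form; the one-hour window is an arbitrary choice.

# Sketch: search a log group around the last status timestamp for the product ID.
from datetime import datetime, timedelta, timezone
import boto3

logs = boto3.client('logs')

# Last status timestamp reported by the SWODLR API (GENERATING, above)
last_status = datetime.fromisoformat('2024-02-26T20:59:08.195').replace(tzinfo=timezone.utc)
window = timedelta(hours=1)

response = logs.filter_log_events(
    logGroupName='service-swodlr-uat-raster-create-submit_evaluate',
    startTime=int((last_status - window).timestamp() * 1000),
    endTime=int((last_status + window).timestamp() * 1000),
    filterPattern='"product_id: 2b8dc090-a4d2-4e33-9f1d-4cdbe713b878"',
)

for event in response['events']:
    print(event['message'])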

This should now provide you with logs from which the behavior of the system can be investigated further. Happy error tracing!